Java regular expression performance
I'm trying to parse links with regular expressions in Java, but I think it's getting too slow. For example, extracting all links from:
> http://news.google.com.ar/nwshp?hl=es&tab=wn
... took 34642 milliseconds (34 seconds!!!)
This is the regular expression:
private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";
And the mode flags:
private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;
The code looks like this:
private void processURL(URL url) {
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();
        while (line != null) {
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (Exception e) {
    }
}
Can you give me a hint?
Additional data: 1 Mbit connection, Core 2 Duo, 1 GB RAM, single thread.
Solution
Tip: do not use regular expressions for link extraction or other HTML "parsing" tasks!
There are 6 (six) .* groups in your regular expression. Matching it requires a lot of backtracking; in the worst case, the running time can approach O(n^6), where n is the number of input characters. You can alleviate this by replacing greedy matching with lazy (reluctant) matching, but it is almost impossible to avoid the pathological cases, for example when the input is malformed and the regular expression fails to match.
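To illustrate the difference, here is a minimal sketch (deliberately simpler than your regex, with hypothetical input) contrasting a greedy quantifier with its lazy counterpart on the same HTML snippet:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Returns the first capture of group 1, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"first\">x</a> <a href=\"second\">y</a>";

        // Greedy: .* first runs to the end of the input, then backtracks
        // character by character to the LAST closing quote, so the capture
        // spans both links.
        System.out.println(firstGroup("href=\"(.*)\"", html));
        // prints: first">x</a> <a href="second

        // Lazy: .*? expands one character at a time and stops at the FIRST
        // closing quote, with far less backtracking.
        System.out.println(firstGroup("href=\"(.*?)\"", html));
        // prints: first
    }
}
```

With six such quantifiers interacting, the backtracking combinations multiply, which is where the pathological running times come from.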
A much better solution is to use an existing strict or lenient HTML parser. Even writing an ad-hoc parser by hand is better than using gnarly regexes.
This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
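As a sketch of the parser-based approach, here is link extraction using the JDK's own lenient HTML parser (javax.swing.text.html), which needs no external dependency; TagSoup and HtmlCleaner have their own, different APIs, and the input here is a made-up example:

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Collects the href attribute of every <a> tag in the given HTML.
    public static List<String> extractLinks(String html) throws Exception {
        List<String> links = new ArrayList<>();
        new ParserDelegator().parse(new StringReader(html),
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag tag,
                                           MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            links.add(href.toString());
                        }
                    }
                }
            }, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        // A lenient parser copes with mixed quoting styles and broken markup
        // that would trip up a hand-written regex.
        String html = "<html><body><a href=\"http://a.example\">A</a>"
                    + "<a href='http://b.example'>B</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

This does a single linear pass over the input, so it avoids the backtracking blow-up entirely.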
