Java regular expression performance
I'm trying to parse links with regular expressions in Java, but I think it's getting too slow. For example, extracting all links from:
> http://news.google.com.ar/nwshp?hl=es&tab=wn
... took 34642 milliseconds (34 seconds!).
This is the regular expression:
private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";
The match flags:
private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL |Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;
The code might look like this:
private void processURL(URL url) {
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        URLConnection connection = url.openConnection();
        BufferedReader bf = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));
        // Accumulate the page in a StringBuilder; repeated String
        // concatenation in a loop is itself O(n^2).
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = bf.readLine()) != null) {
            html.append(line).append('\n');
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (IOException e) {
        e.printStackTrace(); // don't swallow exceptions silently
    }
}
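If you do stay with a regex, bounding every quantifier with a negated character class avoids most of the backtracking. Below is a minimal self-contained sketch (class and method names are my own, and the simplified pattern is an illustration, not a drop-in replacement for the original):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRegexDemo {
    // Negated classes ([^>]*?, [^"'>\s]+) keep each quantifier inside
    // a single tag/attribute, so the engine scans the input roughly once.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']?([^\"'>\\s]+)",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://a.example/\">a</a>"
                    + "<A HREF='http://b.example/x'>b</A></p>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Note that the `Pattern` is compiled once as a constant rather than on every call, which also matters if `processURL` runs in a loop.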
Can you give me a hint?
Additional data: 1 Mbit connection, Core 2 Duo, 1 GB RAM, single thread.
Solution
Tip: do not use regular expressions for link extraction or other HTML "parsing" tasks!
There are 6 (six) repeated `.*?` groups in your regular expression. Executing it requires a lot of backtracking; in the worst case, the running time can approach O(n^6), where n is the number of input characters. You can alleviate this by replacing greedy matching with reluctant (lazy) matching, but it is almost impossible to avoid pathological cases entirely, for example when the input is malformed and the regular expression fails to match.
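To see the difference between the two quantifier types on a small input (a minimal illustration, not the original pattern):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {

    // Returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"x\"><a href=\"y\">";
        // Greedy .* consumes as much as possible, then backtracks:
        System.out.println(firstMatch("<a.*>", html));  // spans both tags
        // Reluctant .*? consumes as little as possible:
        System.out.println(firstMatch("<a.*?>", html)); // stops at the first '>'
    }
}
```

Reluctant quantifiers reduce how far each group overreaches, but on a non-matching input the engine must still try every combination of group boundaries before it can report failure, which is where the exponential behavior comes from.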
A much better solution is to use an existing strict or lenient HTML parser. Even writing an ad-hoc parser by hand is better than wrestling with a gnarly regex.
This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
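TagSoup and HtmlCleaner need extra jars; for a dependency-free sketch, the JDK itself ships a lenient HTML parser (`javax.swing.text.html.parser.ParserDelegator`), which handles malformed markup without any of the backtracking risk (a minimal illustration; the class and method names here are my own):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class JdkLinkExtractor {

    static List<String> extractLinks(String html) throws Exception {
        final List<String> links = new ArrayList<>();
        // The callback is invoked once per tag as the parser streams
        // through the document in a single pass.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href='http://a.example/'>a</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A real parser also gets you the cases a regex handles badly for free: attributes in any order, unquoted values, comments, and `<a>` tags split across lines.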