Java regular expressions run slowly

2019-06-14 • Java

I tried to use the darling fireball regular expression for matching URLs in Java, and I found a URL that led to the evaluation forever I modified the original regular expression to use Java syntax

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.","www1.","www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space,non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern,Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

If I try to run the following operation, it will always exist I narrowed it down to the match of balanced parentheses (I think) If you change the text in parentheses, it works normally, but about 15 characters, it starts to slow down

final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

Is there any way to improve this regular expression so that lines will not exist forever? I have about 100 different URLs in the JUnit test class, and I need to continue working

Solution

Here's the problem:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space,non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens,up to 2 levels
")+"

You are a nested quantifier here This is a violation of any backtracking algorithm – for example, considering the regular expression / ^ (a) $/ matching string

aaaaaaaaaab

As a first attempt, internal quantifiers will match all Then the regular expression fails, so it exits one Then the external quantity phrase tries to match again, swallows the last one, and the regular expression fails again We basically get exponential behavior because quantifiers try various ways of splitting without actually making any progress

The solution is to occupy the quantifier (we indicate by positioning to the end of the quantifier) – we set the internal quantifier so that once matched, they won't let it go – they will persist until the match fails or the earlier quantifier retreats, and they must start again from other places in the string If we use / ^ (a) $/ as our regular expression, we will immediately fail on the mismatched string above, rather than trying to match it exponentially

Try to occupy these internal quantifiers and see if it helps

The content of this article comes from the network collection of netizens. It is used as a learning reference. The copyright belongs to the original author.

THE END

Java

二维码

Java – how to calculate the intersection between two or more hashsets?

< <上一篇

Automatic boxing should be avoided in Java

下一篇>>

搜索内容

Java regular expressions run slowly

Solution

热门文章