Java regular expressions run slowly
I tried to use the darling fireball regular expression for matching URLs in Java, and I found a URL that led to the evaluation forever I modified the original regular expression to use Java syntax
private final static String pattern = "\\b" + "(" + // Capture 1: entire matched URL "(?:" + "[a-z][\\w-]+:" + // URL protocol and colon "(?:" + "/{1,3}" + // 1-3 slashes "|" + // or "[a-z0-9%]" + // Single letter or digit or '%' // (Trying not to match e.g. "URI::Escape") ")" + "|" + // or "www\\d{0,3}[.]" + // "www.","www1.","www2." … "www999." "|" + // or "[a-z0-9.\\-]+[.][a-z]{2,4}/" + // looks like domain name followed by a slash ")" + "(?:" + // One or more: "[^\\s()<>]+" + // Run of non-space,non-()<> "|" + // or "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens,up to 2 levels ")+" + "(?:" + // End with: "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens,up to 2 levels "|" + // or "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" + // not a space or one of these punct chars (updated to add a 'dash' ")" + ")"; // @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern,Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
If I try to run the following operation, it will always exist I narrowed it down to the match of balanced parentheses (I think) If you change the text in parentheses, it works normally, but about 15 characters, it starts to slow down
final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)"); boolean found = matcher.find();
Is there any way to improve this regular expression so that lines will not exist forever? I have about 100 different URLs in the JUnit test class, and I need to continue working
Solution
Here's the problem:
"(?:" + // One or more: "[^\\s()<>]+" + // Run of non-space,non-()<> "|" + // or "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens,up to 2 levels ")+"
You are a nested quantifier here This is a violation of any backtracking algorithm – for example, considering the regular expression / ^ (a) $/ matching string
aaaaaaaaaab
As a first attempt, internal quantifiers will match all Then the regular expression fails, so it exits one Then the external quantity phrase tries to match again, swallows the last one, and the regular expression fails again We basically get exponential behavior because quantifiers try various ways of splitting without actually making any progress
The solution is to occupy the quantifier (we indicate by positioning to the end of the quantifier) – we set the internal quantifier so that once matched, they won't let it go – they will persist until the match fails or the earlier quantifier retreats, and they must start again from other places in the string If we use / ^ (a) $/ as our regular expression, we will immediately fail on the mismatched string above, rather than trying to match it exponentially
Try to occupy these internal quantifiers and see if it helps