Java – how to use regular expressions to check whether HTML documents contain non-empty script tags
I am trying to use regular expressions to check whether an HTML document contains script tags that are not empty. The regular expression should match any script tag that contains anything other than spaces or newlines.
I tried:
<script\b[^>]*>[^.+$]</script>
However, this regular expression only matches a script tag that contains a space.
Solution
Don't parse HTML with regexes! Seriously, in general, this is almost impossible. Why are you using regular expressions here? It makes more sense to use an HTML parser, although I can't give you any specific advice because I don't know what language you're using. For example, if you are using the JavaScript DOM, you need something like the following:
var scripts = document.getElementsByTagName('script');
var numScripts = scripts.length;
var textScripts = [];
// Collect every script element whose content is more than whitespace
for (var i = 0; i < numScripts; ++i) {
    if (scripts[i].text.trim() !== '') {
        textScripts.push(scripts[i]);
    }
}
This inspects the parsed structure of the HTML to determine the properties of each script tag, rather than searching through messy text.
Edit 1: Apparently, you are using Java. Unfortunately, I know nothing about HTML parsing in Java, so I can't give you any specific advice. However, look into it, because it's the way to go.
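For completeness, here is why the pattern in the question behaves the way it does: [^.+$] is a negated character class, so it matches exactly one character that is not a dot, a plus sign, or a dollar sign. A script tag whose whole body is a single space therefore matches, and almost nothing else does. If a regular expression really is unavoidable, the following is a rough Java sketch of a pattern closer to what was asked. The class name ScriptTagCheck and the method hasNonEmptyScript are made up for illustration, and the pattern assumes reasonably well-formed markup (for example, no ">" inside attribute values), so the caveat about parsing HTML with regexes still applies.

```java
import java.util.regex.Pattern;

public class ScriptTagCheck {
    // Hypothetical helper, not part of the original post.
    private static final Pattern NON_EMPTY_SCRIPT = Pattern.compile(
        "<script\\b[^>]*>"     // opening tag with optional attributes
        + "\\s*+"              // possessively skip leading whitespace (no backtracking)
        + "(?!</script)"       // reject if the closing tag comes next: body is empty or whitespace-only
        + "[\\s\\S]*?"         // lazily consume the body, newlines included
        + "</script\\s*>",     // closing tag
        Pattern.CASE_INSENSITIVE);

    // Returns true if the text contains at least one script element
    // whose body holds something other than whitespace.
    public static boolean hasNonEmptyScript(String html) {
        return NON_EMPTY_SCRIPT.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasNonEmptyScript("<script> \n </script>"));         // false
        System.out.println(hasNonEmptyScript("<script src=\"a.js\"></script>")); // false
        System.out.println(hasNonEmptyScript("<script>alert(1)</script>"));      // true
    }
}
```

The possessive \s*+ is what keeps whitespace-only bodies from matching: once the whitespace is consumed, the engine cannot give it back to get past the negative lookahead. This is only a fallback; a real HTML parser remains the more robust choice.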