Java – a regular expression used to remove HTML tags from a string
I am looking for a regular expression to remove all HTML tags from a string in JSP.
Example 1
sampleString = "test string <i>in italics</i> continues";
Example 2
sampleString = "test string <i>in italics";
Example 3
sampleString = "test string <i";
The HTML tag may be complete (example 1), partial with no closing tag (example 2), or not even a well-formed start tag because its closing angle bracket is missing (example 3).
Thank you in advance
Solution
Case 3 cannot be handled reliably by either regular expressions or parsers. It may just as well be legitimate content, so forget about it.
As for cases 1 and 2, just use an HTML parser. My favorite is Jsoup.
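To illustrate why a lone "<" is ambiguous, here is a minimal sketch using a naive pattern. The class name NaiveTagStripper and the pattern <[^>]*> are assumptions made for this illustration only, not a recommended solution. It happens to handle examples 1 and 2, leaves example 3 untouched, and silently eats legitimate text that merely contains angle brackets.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class NaiveTagStripper {
    // Matches '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String strip(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Example 1: prints "test string in italics continues"
        System.out.println(strip("test string <i>in italics</i> continues"));
        // Example 2: prints "test string in italics"
        System.out.println(strip("test string <i>in italics"));
        // Example 3: no '>' follows, so nothing matches; prints "test string <i"
        System.out.println(strip("test string <i"));
        // Legitimate text: everything between '<' and '>' is lost; prints "1  2"
        System.out.println(strip("1 < 2 and 3 > 2"));
    }
}
```

The last call shows the core problem: without knowing the author's intent, code cannot tell a broken tag from an ordinary less-than sign.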
String text = Jsoup.parse(html).text();
That's all. By the way, Jsoup is also an HTML cleaner, if that's what you are actually after.
Because you are using JSP, you can also use JSTL <c:out> or fn:escapeXml() to prevent user-controlled HTML input from being inlined in the HTML (which may open up XSS vulnerabilities).
<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />
HTML tags are then not interpreted, but simply displayed as plain text
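For intuition about what this kind of escaping does, here is a rough plain-Java sketch. The class XmlEscaper and its escapeXml method are hypothetical names invented for this example, not the JSTL implementation; the sketch assumes that replacing the five XML special characters is enough, and returning an empty string for null is simply a choice made here.

```java
// Hypothetical helper that approximates XML escaping; not the JSTL source.
class XmlEscaper {
    static String escapeXml(String input) {
        if (input == null) {
            return ""; // choice made for this sketch
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: test string &lt;i&gt;in italics&lt;/i&gt; continues
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
    }
}
```

Because every "<" becomes "&lt;", the browser renders the markup as literal text, which also covers the partial tags in examples 2 and 3 without having to guess what they mean.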