Java – a regular expression used to remove HTML tags from a string

I am looking for a regular expression that removes all HTML tags from a string in a JSP.

Example 1

sampleString = "test string <i>in italics</i> continues";

Example 2

sampleString = "test string <i>in italics";

Example 3

sampleString = "test string <i";

The HTML tag may be complete, partial (no closing tag), or malformed from the start (the third example is missing the closing angle bracket of the opening tag).

Thank you in advance

Solution

Case 3 cannot be handled reliably by either a regular expression or a parser, because the text may just as well be legitimate content. So forget about it.
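To illustrate why case 3 is a dead end, here is a minimal sketch using only java.util.regex. The class name, method names and both patterns are made up for this illustration and are not a recommended solution. The first pattern only removes complete tags, so it leaves the dangling "<i" alone. The second also removes an unterminated "<..." at the end of the string, and in doing so it eats perfectly legitimate text.

```java
import java.util.regex.Pattern;

public class TagStripDemo {
    // Naive pattern: "<", then anything up to and including the next ">".
    private static final Pattern COMPLETE_TAG = Pattern.compile("<[^>]*>");

    // Aggressive pattern: additionally matches an unterminated "<..." that runs to the end of the input.
    private static final Pattern TAG_OR_DANGLING = Pattern.compile("<[^>]*(>|$)");

    static String stripComplete(String s) {
        return COMPLETE_TAG.matcher(s).replaceAll("");
    }

    static String stripAggressive(String s) {
        return TAG_OR_DANGLING.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Cases 1 and 2: complete tags are removed.
        System.out.println(stripComplete("test string <i>in italics</i> continues"));
        // prints: test string in italics continues
        System.out.println(stripComplete("test string <i>in italics"));
        // prints: test string in italics

        // Case 3: the naive pattern cannot see the unterminated tag.
        System.out.println(stripComplete("test string <i"));
        // prints: test string <i

        // The aggressive pattern handles case 3 but destroys legitimate content
        // that merely contains a "<" character (everything after "price " is lost).
        System.out.println(stripAggressive("test string <i"));
        System.out.println(stripAggressive("price < 10 dollars"));
    }
}
```

No pattern can tell "<i" the broken tag apart from "<" the less-than sign, which is why the parser route below only targets cases 1 and 2.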

For cases 1 and 2 specifically, just use an HTML parser. My favorite is Jsoup.

String text = Jsoup.parse(html).text();

That's all. By the way, Jsoup is also an HTML cleaner, if that is what you are really after.

Because you are using JSP, you can also use JSTL <c:out> or fn:escapeXml() to avoid inlining user-controlled HTML input in the page (which may open an XSS vulnerability).

<c:out value="${bean.property}" />
<input type="text" name="foo" value="${fn:escapeXml(param.foo)}" />

HTML tags are then not interpreted, but simply displayed as plain text
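To make concrete what "displayed as plain text" means, here is a rough Java sketch of the kind of replacement that escaping performs on the five XML special characters. The class and method names are hypothetical, and this is an approximation for illustration rather than the JSTL implementation itself. Returning an empty string for null is a choice made for this sketch.

```java
public class EscapeXmlSketch {
    // Replaces the five XML special characters with entity references,
    // so a browser renders them literally instead of interpreting them as markup.
    static String escapeXml(String input) {
        if (input == null) {
            return ""; // assumption of this sketch: treat null as empty
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("test string <i>in italics</i> continues"));
        // prints: test string &lt;i&gt;in italics&lt;/i&gt; continues
    }
}
```

Note that nothing is removed here: the tags survive as visible text, which is usually what you want when the goal is preventing XSS rather than cleaning up the content.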
