Java – when do I need to escape an HTML string?
In my legacy project, I can see that escapeHtml is called on strings before they are sent to the browser:
StringEscapeUtils.escapeHtml(stringBody);
I learned the purpose of escapeHtml from the API documentation, which gives this example:
For example: "bread" & "butter" becomes: &quot;bread&quot; &amp; &quot;butter&quot;.
My understanding is that when we send a string after escaping it, it is the browser's responsibility to convert it back to the original characters. Is that right?
But I don't understand why and when this is needed. What would happen if we sent the string body to the browser without escaping the HTML, and what would it cost us?
Solution
I can think of several possibilities to explain why sometimes strings are not escaped:
> Perhaps the original programmer was convinced that the string could not contain special characters at that point. (In my opinion this would be a bad programming habit; the cost of escaping the string to guard against future changes is very low.)
> The string has already been escaped at that point in the code. You never want to escape a string twice; the user will end up seeing the escape sequence instead of the expected text.
> The string is the actual HTML itself. You don't want to escape HTML; you want the browser to process it!
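The double-escaping problem from the second point can be shown with a small sketch. The `escapeHtml` method here is a hypothetical hand-written stand-in that handles only four characters, not the library's `StringEscapeUtils.escapeHtml`, and the class name is made up for illustration.

```java
final class DoubleEscapeSketch {

    // Hypothetical stand-in for a library escaper; it covers only
    // the characters &, <, > and " to keep the example self-contained.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "bread & butter";
        String once = escapeHtml(raw);    // what should be sent to the browser
        String twice = escapeHtml(once);  // escaped a second time by mistake

        System.out.println(once);   // bread &amp; butter
        System.out.println(twice);  // bread &amp;amp; butter
    }
}
```

The browser unescapes only once, so the first string is displayed as `bread & butter`, while the second is displayed as the literal text `bread &amp; butter`.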
Edit: the reason for escaping is that special characters such as & and < can end up causing the browser to display something other than what you intended. A bare & is technically an error in HTML. Most browsers try to deal with such errors intelligently and will display them correctly in most cases. (This will almost certainly happen with your sample text if, for example, the string is text inside a <div>.) However, because it is invalid markup, some browsers will not handle it well; assistive technologies (e.g., text-to-speech) may fail; and there may be other problems.
Although browsers try their best to recover from bad markup, there are still several cases where they fail. If your sample string is an attribute value, you absolutely need to escape the quotation marks. No browser can handle the following correctly:
<img alt=""bread" & "butter"" ... >
The general rule is that any character that is not markup but might be confused with markup needs to be escaped.
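As a rough illustration of the attribute case, the sketch below escapes the sample string before placing it inside quotes. Both `escapeHtml` and `altAttribute` are hypothetical helpers written for this example; in the legacy project the escaping would presumably be done by `StringEscapeUtils.escapeHtml` instead.

```java
final class AttributeEscapeSketch {

    // Hypothetical minimal escaper (assumption: only &, <, > and " matter here).
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Builds a double-quoted alt attribute whose value cannot end the
    // attribute early, because every quote inside it has been escaped.
    static String altAttribute(String altText) {
        return "alt=\"" + escapeHtml(altText) + "\"";
    }

    public static void main(String[] args) {
        String text = "\"bread\" & \"butter\"";
        // prints: alt="&quot;bread&quot; &amp; &quot;butter&quot;"
        System.out.println(altAttribute(text));
    }
}
```

Without the escaping step, the first quotation mark in the text would close the attribute and the rest would be parsed as stray markup.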
Note that there are several contexts in which text can appear in an HTML document, and each has its own escaping requirements. In an attribute value, you need to escape quotation marks and the & symbol (but not <). You must also escape characters that cannot be represented in the document's character set (unlikely if you are using UTF-8, but that is not always the case). Within a text node, only & and < need to be escaped. In an href value, characters that need escaping in a URL must be escaped (sometimes double-escaped, so that they are still escaped after the browser unescapes them once). Inside a CDATA block, nothing should normally be escaped (at the HTML level).
Finally, apart from the danger of double-escaping, the cost of escaping all text is minimal: a small amount of extra processing and a few extra bytes on the network.
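The href case, where two layers of escaping apply, might be sketched as follows. The `/search` path, the `q` and `lang` parameters, and both helper methods are invented for illustration. `URLEncoder` performs form-style encoding (a space becomes `+`), which is an assumption that suits query-string values but not every part of a URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class HrefEscapeSketch {

    // Hypothetical minimal HTML escaper for attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")   // must run first so later entities stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Layer 1: URL-encode the parameter value.
    // Layer 2: HTML-escape the finished URL for use inside href="...".
    static String searchHref(String query) {
        try {
            String url = "/search?q=" + URLEncoder.encode(query, "UTF-8") + "&lang=en";
            return escapeHtml(url);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints: /search?q=bread+%26+butter&amp;lang=en
        System.out.println(searchHref("bread & butter"));
    }
}
```

The & inside the value is protected by URL encoding (`%26`), while the & that separates the parameters is protected by HTML escaping (`&amp;`). After the browser unescapes the attribute once, the URL it requests is `/search?q=bread+%26+butter&lang=en`.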