Java – how to convert HTML to text and keep line breaks
How can I convert HTML to plain text while keeping the line breaks produced by elements such as BR, P, and DIV? The solution may use NekoHTML or any sufficiently good HTML parser.
Example: Hello<br/>World should be converted to:
Hello\nWorld
Solution
This is my function for producing the text (including line breaks) with jsoup by recursively walking the node tree:
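import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static String htmlToText(InputStream html) throws IOException {
    // A null charset lets jsoup detect the encoding; the base URI is left empty.
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuilder buildStringFromNode(Node node) {
    StringBuilder buffer = new StringBuilder();
    // Text nodes contribute their trimmed text.
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    // Recurse into the children in document order.
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    // After the content of a <p> or <br>, emit a line break.
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}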
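Because the function appends "\n" after every p and br element, its result can end with a trailing line break, and adjacent paragraphs and breaks can produce runs of empty lines. The sketch below is one possible way to tidy such output afterwards. It is not part of the original answer: the class name TextCleanup and the method normalizeLineBreaks are hypothetical, and the rule of allowing at most one empty line in a row is an assumption. It uses only the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper (not from the original answer).
final class TextCleanup {
    // Spaces or tabs directly before a line break.
    private static final Pattern TRAILING_SPACES = Pattern.compile("[ \\t]+\\n");
    // Three or more consecutive line breaks, i.e. two or more empty lines.
    private static final Pattern EXTRA_BLANK_LINES = Pattern.compile("\\n{3,}");

    private TextCleanup() {
    }

    static String normalizeLineBreaks(String text) {
        // Use "\n" as the only line separator.
        String result = text.replace("\r\n", "\n").replace('\r', '\n');
        // Drop whitespace left at the end of lines.
        result = TRAILING_SPACES.matcher(result).replaceAll("\n");
        // Assumption: keep at most one empty line in a row.
        result = EXTRA_BLANK_LINES.matcher(result).replaceAll("\n\n");
        // Remove leading and trailing whitespace, including line breaks.
        return result.trim();
    }
}
```

It could be applied to the return value of htmlToText, for example TextCleanup.normalizeLineBreaks(htmlToText(stream)), where stream is whatever InputStream holds the HTML.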
