Java – how to convert HTML to text and keep line breaks

How can I convert HTML to text while keeping the line breaks produced by elements such as BR, P, and DIV? NekoHTML or any other good enough HTML parser may be used.

Example: Hello<br/>World should become:

Hello
World

Solution

This is my function that outputs the text (including line breaks) by iterating over the nodes with jsoup:

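import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Parses the stream (a null charset lets jsoup detect the encoding) and returns the body text.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();

    return buildStringFromNode(body).toString();
}

// Recursively collects the text of a node and its children, depth first.
private static StringBuilder buildStringFromNode(Node node) {
    StringBuilder buffer = new StringBuilder();

    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }

    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }

    // Emit a line break after the element's content; add other tags such as "div" here if needed.
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }

    return buffer;
}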
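
The question also mentions DIV, which the function above does not treat as a line break. The following is a minimal sketch of one way to cover it, assuming jsoup is on the classpath. It swaps the hand-written recursion for jsoup's NodeVisitor traversal. The class name HtmlTextSketch and the LINE_BREAK_TAGS set are illustrative choices, not part of the original answer, and the String parameter is used only to keep the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class HtmlTextSketch {
    // Assumption: these are the tags that should end a line; adjust as needed.
    private static final Set<String> LINE_BREAK_TAGS =
            new HashSet<>(Arrays.asList("p", "br", "div"));

    public static String htmlToText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached: collect its text.
                if (node instanceof TextNode) {
                    out.append(((TextNode) node).text().trim());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Called after all children were visited: close the line.
                if (node instanceof Element
                        && LINE_BREAK_TAGS.contains(((Element) node).tagName())) {
                    out.append('\n');
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(htmlToText("Hello<br/>World"));
    }
}
```

As in the original function, each text node is trimmed, so spaces between inline elements are lost, and nested block elements each add their own line break.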