Java – prevent Jsoup.parse from removing closing tags

2020-08-24 • Java

I'm using Jsoup.parse to parse a piece of HTML.

Everything else works well, but I need to pass this HTML to a PDF converter afterwards.

For some reason, Jsoup.parse removes the closing tag, and the PDF parser throws an exception about the missing closing img tag:

Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException;
lineNumber: 115; columnNumber: 4; The element type "img" must be terminated by the matching end-tag "</img>".

How can I prevent Jsoup.parse from removing the closing img tag?

For example, this line:

<img src="C:\path\to\image\image.png"></img>

is turned into:

<img src="C:\path\to\image\image.png">

The same happens with the self-closing form:

<img src="C:\path\to\image\image.png"/>

This is the code:

private void createPdf(File file, String content) throws IOException, DocumentException {
    // try-with-resources closes the stream even if rendering throws
    try (OutputStream os = new FileOutputStream(file)) {
        content = tidyUpHTML(content);
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(content);
        renderer.layout();
        renderer.createPDF(os);
    }
}

This is the tidyUpHTML method invoked in the above method:

private String tidyUpHTML(String html) {
    org.jsoup.nodes.Document doc = Jsoup.parse(html);
    doc.select("a").unwrap();
    String fixedTags = doc.toString().replace("<br>","<br />");
    fixedTags = fixedTags.replace("<hr>","<hr />");
    fixedTags = fixedTags.replaceAll("&nbsp;","&#160;");
    return fixedTags;
}

Solution

Your PDF converter expects XHTML, which requires the img tag to be closed. Set Jsoup to output XHTML (XML syntax):

org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Emit XML syntax so that empty elements such as img are written as <img />
doc.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.select("a").unwrap();
String fixedTags = doc.html();
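
To see why the converter rejects HTML-style output, the sketch below feeds the three forms of the tag from the question to the JDK's built-in XML parser. It is only an illustration: the class name WellFormedCheck and the helper isWellFormed are made up for this example, the image path is shortened, and the parser your PDF converter actually uses may be configured differently.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, not part of Jsoup or the PDF converter.
final class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so parse errors are not printed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed, e.g. an unclosed img element.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML syntax: img has no end tag, so an XML parser rejects it.
        System.out.println(isWellFormed("<p><img src=\"image.png\"></p>"));       // false
        // XML syntax, self-closed: accepted.
        System.out.println(isWellFormed("<p><img src=\"image.png\" /></p>"));     // true
        // XML syntax, explicit end tag: accepted.
        System.out.println(isWellFormed("<p><img src=\"image.png\"></img></p>")); // true
    }
}
```

Only the first form fails, which matches the SAXParseException in the question. Switching Jsoup's output syntax to XML makes it emit one of the two accepted forms.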

See also: Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

