Java – prevent Jsoup.parse from removing the closing tag
I'm using Jsoup.parse to parse a piece of HTML.
Everything else works great, but I need to pass this HTML to a PDF converter afterwards.
For some reason, Jsoup.parse removes the closing tag, and the PDF parser throws an exception about the missing closing img tag:
Can't load the XML resource (using TRaX transformer). org.xml.sax.SAXParseException; lineNumber: 115; columnNumber: 4; The element type "img" must be terminated by the matching end-tag "</img>"
How can I prevent Jsoup.parse from removing the closing img tag?
For example, this line:
<img src="C:\path\to\image\image.png"></img>
is turned into:
<img src="C:\path\to\image\image.png">
The same happens with the self-closing form:
<img src="C:\path\to\image\image.png"/>
This is the code:
private void createPdf(File file, String content) throws IOException, DocumentException {
    OutputStream os = new FileOutputStream(file);
    content = tidyUpHTML(content);
    ITextRenderer renderer = new ITextRenderer();
    renderer.setDocumentFromString(content);
    renderer.layout();
    renderer.createPDF(os);
    os.close();
}
This is the tidyUpHTML method invoked in the above method:
private String tidyUpHTML(String html) {
    org.jsoup.nodes.Document doc = Jsoup.parse(html);
    doc.select("a").unwrap();
    String fixedTags = doc.toString().replace("<br>", "<br />");
    fixedTags = fixedTags.replace("<hr>", "<hr />");
    fixedTags = fixedTags.replaceAll("&nbsp;", "&#160;");
    return fixedTags;
}
Solution
Your PDF converter expects XHTML (that is why it requires the img tag to be closed). Set jsoup to output XHTML (XML syntax):
org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Emit XML syntax so that elements such as img are written with a closing slash
doc.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.select("a").unwrap();
String fixedTags = doc.html();
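To see why the converter rejects the HTML-style output, it can help to run the same markup through a plain XML parser before handing it to the PDF renderer. The following is only a minimal sketch using the JDK's built-in parser; the class name WellFormedCheck and the helper isWellFormed are made up for illustration and are not part of jsoup or the PDF library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Hypothetical helper: returns true if the markup parses as well-formed XML.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints to stderr;
            // DefaultHandler still rethrows fatal errors.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, e.g. an img element without an end tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style void element: not well-formed XML
        System.out.println(isWellFormed("<p><img src=\"image.png\"></p>"));       // false
        // Self-closed element: well-formed
        System.out.println(isWellFormed("<p><img src=\"image.png\" /></p>"));     // true
        // Explicit end tag: well-formed
        System.out.println(isWellFormed("<p><img src=\"image.png\"></img></p>")); // true
    }
}
```

If such a check returns false for the string coming out of tidyUpHTML, the output is still HTML syntax rather than XML syntax.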
See also: Is it possible to convert HTML into XHTML with Jsoup 1.8.1?