Java – how to make jtidy format HTML documents well?
I'm using JTidy r938, and I'm using this code to try to clean up a page:
    final Tidy tidy = new Tidy();
    tidy.setQuiet(false);
    tidy.setShowWarnings(true);
    tidy.setShowErrors(0);
    tidy.setMakeClean(true);
    Document document = tidy.parseDOM(conn.getInputStream(), null);
But when I parse this URL, http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventcategory=93922&keywords=&page=1, things are not cleaned up. For example, a meta tag on the page like
<Meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
is left as
<Meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
instead of gaining a closing "</meta>" tag or being written as "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />". I am writing the org.w3c.dom.Document that JTidy produces out as a String to confirm this.
What can I do to make JTidy really clean up the page, that is, make it well-formed? I realize that there are other tools, but this question is specifically about using JTidy.
Solution
If you need XML output, you have to set several flags on Tidy:
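    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.UnsupportedEncodingException;
    import org.w3c.tidy.Tidy;

    private String cleanData(String data) throws UnsupportedEncodingException {
        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        tidy.setWraplen(Integer.MAX_VALUE); // effectively disables line wrapping
        tidy.setPrintBodyOnly(true);        // emit only the contents of <body>
        tidy.setXmlOut(true);               // write well-formed XML
        tidy.setSmartIndent(true);

        ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        // parseDOM writes the tidied markup to outputStream; the returned DOM is ignored here
        tidy.parseDOM(inputStream, outputStream);
        return outputStream.toString("UTF-8");
    }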
Or, if you just want XHTML output:
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
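The question checks the result by turning the org.w3c.dom.Document into a String, so the serialization step can affect what you see as well. Below is a minimal sketch of that step using only the JDK's javax.xml.transform classes, with the output method forced to "xml". The class DomToXml and its methods toXml and parse are hypothetical names, not part of JTidy, and parse is only a stand-in that builds a Document with the JDK parser instead of tidy.parseDOM. Whether JTidy's own DOM implementation works with the JDK transformer is an assumption you would need to verify. As far as I know, the XSLT output rules let the serializer pick HTML output when the root element is html and no method is set, which would write the meta tag without a closing form, so the method is set explicitly here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JTidy: serializes a DOM tree as XML text.
class DomToXml {

    // Serializes the document with the "xml" output method set explicitly,
    // instead of relying on the serializer's default choice.
    static String toXml(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Stand-in for tidy.parseDOM(...): builds a Document with the JDK parser,
    // so the input here must already be well-formed XML.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"></meta>"
                + "</head><body></body></html>");
        // Prints the document as XML text; the empty meta element is self-closed.
        System.out.println(toXml(doc));
    }
}
```

With JTidy you would pass the Document returned by tidy.parseDOM(...) to toXml instead of the one built by parse. Writing to an OutputStream, as in the cleanData method above, avoids this extra step altogether.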