Java: how to get JTidy to produce well-formed HTML documents?

I'm using JTidy r938, with this code to try to clean up a page:

final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(),null);

But when I parse this URL, http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventcategory=93922&keywords=&page=1, things are not cleaned up. For example, a meta tag on the page such as

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

stays as

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

instead of getting a closing "</meta>" tag or being emitted as "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />". I output the org.w3c.dom.Document that JTidy generates as a string to confirm this.

What can I do to make JTidy really clean up the page, that is, make it well-formed? I realize that there are other tools, but this question is specifically about using JTidy.

Solution

If you need XML output, you have to set several flags on Tidy:

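import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);  // effectively disable line wrapping
    tidy.setPrintBodyOnly(true);         // emit only the contents of <body>
    tidy.setXmlOut(true);                // write the result as well-formed XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM also writes the cleaned markup to outputStream
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}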

Or, if you just want XHTML output:

Tidy tidy = new Tidy();
tidy.setXHTML(true);
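
The question mentions dumping the org.w3c.dom.Document to a string to check the result. The following is a minimal sketch of one way to do that with the JDK's own Transformer API; it is not taken from the original answer. The class name DomToString and the helper toXmlString are hypothetical, and the document in main is a hand-built stand-in for whatever tidy.parseDOM(...) returns. One thing worth checking in such a dump: when no output method is set and the root element is html, an XSLT serializer defaults to the HTML method, which writes empty elements like meta without a closing slash, so the sketch sets the method to "xml" explicitly.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    // Serializes any org.w3c.dom.Document (for example, one returned by
    // Tidy.parseDOM) to a String using the JDK's identity transformer.
    public static String toXmlString(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML serialization so empty elements are written as <meta .../>
        // rather than HTML-style <meta ...>.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document built with the JDK; in the question's setup this
        // would be the Document returned by tidy.parseDOM(...).
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element meta = doc.createElement("meta");
        meta.setAttribute("http-equiv", "Content-Type");
        meta.setAttribute("content", "text/html; charset=UTF-8");
        head.appendChild(meta);
        html.appendChild(head);
        doc.appendChild(html);
        System.out.println(toXmlString(doc));
    }
}
```

If the string produced this way still shows an unclosed meta tag, the DOM itself is probably fine and the serialization step is the place to look, before changing any Tidy flags.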