Is there any Java HTML parser where the generated nodes retain the index of the original text?
I want to query HTML documents as if they were XML (for example, with XPath), so I need to pass the HTML through some form of HTML cleaner.
However, I also want to modify the original source string according to the query results.
Is there a Java HTML parser that keeps the index into the original source string, so I can find a node and modify the correct part of the original string?
Cheers!
Solution
It sounds like Jericho is almost exactly what you want. It is a powerful HTML parser designed for non-intrusive modification of source documents.
Although it does not implement the DOM, SAX, or StAX interfaces, its custom API is similar to those standards. You should be able to adapt your approach to it fairly easily, or write adapters between whatever you are using and Jericho. For example, you can use Jaxen to perform XPath queries on Jericho documents; see this blog entry for an example.
Jericho records begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can use this information to edit the document yourself. Where Jericho really shines, though, is the OutputDocument class: it lets you specify a replacement directly by calling the appropriate method with the Jericho element that matched your query, instead of explicitly calling getBegin() and getEnd() on it and passing the offsets to some replacement method of your own.
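If you do end up working with the raw begin/end offsets yourself, the main pitfall is that each replacement shifts every offset after it. The following is a minimal plain-Java sketch of one way to handle that, not Jericho code: OffsetEditor is a hypothetical helper invented for this illustration, and in practice the begin and end values would come from the parser (for example, from an element's getBegin() and getEnd()) rather than from indexOf. It collects replacements against the original string and applies them from last to first, so earlier offsets stay valid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: queues replacements keyed by offsets into the original source. */
class OffsetEditor {

    /** A half-open range [begin, end) in the original text plus its replacement. */
    private static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    private final String source;
    private final List<Edit> edits = new ArrayList<>();

    OffsetEditor(String source) {
        this.source = source;
    }

    /** Registers a replacement for source[begin, end). Overlapping ranges are not detected. */
    OffsetEditor replace(int begin, int end, String replacement) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IndexOutOfBoundsException("bad range: " + begin + ".." + end);
        }
        edits.add(new Edit(begin, end, replacement));
        return this;
    }

    /** Applies the queued edits from last to first so that earlier offsets remain valid. */
    String apply() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin).reversed());
        StringBuilder out = new StringBuilder(source);
        for (Edit e : sorted) {
            out.replace(e.begin, e.end, e.replacement);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // Stand-in for offsets that a position-aware parser would report.
        int begin = html.indexOf("<b>");
        int end = html.indexOf("</b>") + "</b>".length();
        String result = new OffsetEditor(html).replace(begin, end, "<i>world</i>").apply();
        System.out.println(result); // prints <p>Hello <i>world</i></p>
    }
}
```

Everything outside the replaced ranges is copied through untouched, which is the point of keeping the original indices. As described above, OutputDocument takes care of this kind of bookkeeping for you, so a helper like this only makes sense if you want to manage the offsets by hand.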