Is there any Java HTML parser where the generated nodes retain the index of the original text?

I want to query HTML documents as if they were XML (for example, using XPath), so I need to pass the HTML through some form of HTML cleaner.

However, I also want to modify the original source string according to the query results.

Is there a Java HTML parser that keeps the index of the original source string, so I can find a node and modify the correct part of the original string?

Cheers!

Solution

It sounds like Jericho is almost exactly what you want. It is a powerful HTML parser designed for non-intrusive modification of source documents.

Although it does not work with the DOM, SAX or StAX interfaces, it has custom APIs that are similar to these standards. You should be able to adapt your approach to them easily, or write an adapter between whatever you are using and Jericho. For example, you can use Jaxen to perform XPath queries on Jericho documents; see this blog entry for an example.

Jericho has begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can use this information to edit the document yourself. Where Jericho really shines, though, is the OutputDocument class. It lets you specify replacements directly by passing the Jericho elements that match your query to the appropriate method, instead of explicitly calling getBegin() and getEnd() on them and passing the offsets to some replace method.
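If you do go the manual route, the idea of editing by begin/end offsets can be sketched roughly as follows. This is only a minimal illustration using the Java standard library: the names SpanEdits, Edit and apply are hypothetical and are not part of Jericho, and the offsets are assumed to be the kind of values getBegin() and getEnd() would give you for a matched element. It is not meant to show how OutputDocument works internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not a Jericho class): applies offset-based
// replacements to the original source string in a single pass.
final class SpanEdits {

    /** One replacement of the half-open range [begin, end) in the original source. */
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            if (begin < 0 || end < begin) {
                throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
            }
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    /**
     * Returns a copy of source with every edit applied. All offsets refer to
     * the ORIGINAL string, so earlier edits never shift later ones.
     * Overlapping or out-of-range spans are rejected.
     */
    static String apply(String source, List<Edit> edits) {
        // Sort a copy by start offset so the caller's list is left untouched.
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // next position of the original text not yet copied
        for (Edit e : sorted) {
            if (e.begin < cursor || e.end > source.length()) {
                throw new IllegalArgumentException(
                        "overlapping or out-of-range span at " + e.begin);
            }
            out.append(source, cursor, e.begin); // untouched text before the span
            out.append(e.replacement);           // replacement for the span itself
            cursor = e.end;
        }
        out.append(source, cursor, source.length()); // untouched tail
        return out.toString();
    }
}
```

You would collect one Edit per matched node and call SpanEdits.apply(originalHtml, edits) once at the end. Because every offset is interpreted against the original string, the order in which your query returns nodes does not matter, and everything outside the edited spans is copied through byte for byte.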
