Java – how to programmatically inspect HTML documents
I have a database containing small HTML documents, and I need to programmatically insert several of them into a PDF document with iText, or into a Word document with Aspose.Words. I need to preserve any formatting in the HTML document (within reason: honoring <b> tags is a must, and handling CSS like <span style="blah"> would be nice to have).
Both iText and Aspose work (roughly) like this:
Document document = new Document( Size.A4, Aspect.PORTRAIT );
document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );
So (I think) I need some kind of HTML parser that I can inspect to find the strings and styles to insert into my document.
Can anyone suggest a good library or a sensible way to solve this problem? The platform is Java.
Solution
HTML Parser (the org.htmlparser library) is a good HTML parser.
I use it to parse HTML in one of my projects.
You can write your own filters to pick out the HTML you want, so <br> tags should not be difficult to handle.
For CSS, have a look at CssSelectorNodeFilter, which selects nodes using a CSS selector.
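As a rough sketch of the overall approach (walk the HTML, keep track of which formatting tags are currently open, and hand each piece of text to the document writer together with those flags), here is a version that uses the HTML parser bundled with the JDK (javax.swing.text.html) rather than HTML Parser, purely so that it runs without extra jars. The names StyledRunExtractor and Run are made up for illustration, and it only tracks <b>/<strong> and <i>/<em>; style attributes are ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML fragment into text runs with bold/italic flags.
final class StyledRunExtractor {

    // One piece of text plus the formatting that was in effect when it was read.
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    static List<Run> extract(String html) {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Depth counters so that nested tags such as <b><b>x</b></b> stay balanced.
            private int boldDepth;
            private int italicDepth;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (isBold(tag)) boldDepth++;
                if (isItalic(tag)) italicDepth++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (isBold(tag) && boldDepth > 0) boldDepth--;
                if (isItalic(tag) && italicDepth > 0) italicDepth--;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Record the text together with the formatting currently open.
                runs.add(new Run(new String(data), boldDepth > 0, italicDepth > 0));
            }
        };

        try {
            // true = ignore any charset declaration inside the HTML itself.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    private static boolean isBold(HTML.Tag tag) {
        return tag == HTML.Tag.B || tag == HTML.Tag.STRONG;
    }

    private static boolean isItalic(HTML.Tag tag) {
        return tag == HTML.Tag.I || tag == HTML.Tag.EM;
    }

    public static void main(String[] args) {
        for (Run run : extract("Plain <b>bold <i>both</i></b> tail")) {
            System.out.println("[" + run.text + "] bold=" + run.bold + " italic=" + run.italic);
        }
    }
}
```

Each Run could then be mapped onto the setBold/insert style calls from the question. With HTML Parser the same bookkeeping would live in a filter or visitor instead of a ParserCallback.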