Java – performance: iText vs. PDFBox
I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text. I tried both iText and Apache PDFBox and found a big difference in performance: the conversion took 2:521 with iText and 6:117 with PDFBox.
This is my code for PDFBox:

    PDFTextStripper stripper = new PDFTextStripper();
    BUFFER.append(stripper.getText(PDDocument.load(pdf)));

And this is for iText:

    PdfReader reader = new PdfReader(pdf);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
    }

My question is: what does the performance depend on? Is there a way to make PDFBox faster, or should I just use iText? And can you explain how the strategies affect performance?
Solution
One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation). This considerably reduces the resources iText requires. In addition, the event-oriented architecture of iText's text parsing puts a lower burden on resources than PDFBox's approach does. PDFBox also keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
However, the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (in fact, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting a PDF). All of these variants may have different runtime behavior.
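To compare the loading variants mentioned above, a small timing helper is enough. The following is only a sketch: ExtractionTimer, Timed and time are names made up for this illustration, and the placeholder lambda stands in for whatever load-and-strip call you want to measure.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of iText or PDFBox) for timing one extraction.
final class ExtractionTimer {

    /** Holds the extracted text together with the elapsed wall-clock time. */
    static final class Timed {
        final String text;
        final long millis;

        Timed(String text, long millis) {
            this.text = text;
            this.millis = millis;
        }
    }

    /** Runs one extraction and measures it with System.nanoTime(). */
    static Timed time(Callable<String> extraction) {
        long start = System.nanoTime();
        String text;
        try {
            text = extraction.call();
        } catch (Exception e) {
            // Wrap checked exceptions (e.g. IOException from loading the PDF).
            throw new IllegalStateException("extraction failed", e);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed(text, elapsedMillis);
    }

    public static void main(String[] args) {
        // Placeholder extraction; substitute a lambda that loads the PDF with
        // the variant under test and returns the stripped text.
        Timed result = time(() -> "placeholder text");
        System.out.println(result.text.length() + " chars in " + result.millis + " ms");
    }
}
```

A single run also includes class loading and JIT warm-up, so it is worth repeating each variant a few times before drawing conclusions from the numbers.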
iText comes with a simple and a more advanced text extraction strategy. The simple one assumes that the text in the page content stream appears in reading order, while the more advanced one sorts the text. By default, the more advanced one is used, so you could probably speed up iText even more by using the simple strategy. PDFBox, on the other hand, always sorts.
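In iText 5 the two strategies are the classes SimpleTextExtractionStrategy and LocationTextExtractionStrategy, and PdfTextExtractor.getTextFromPage has an overload that takes the strategy as a third argument. To show why sorting costs more, here is a library-free toy model of the two approaches. Chunk, inStreamOrder and sortedByPosition are hypothetical names for this sketch, and the real strategies do more work (spacing and line-break heuristics, for instance).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only; these are not iText or PDFBox classes.
final class TextOrderSketch {

    /** Hypothetical text chunk: a string plus the position where it is drawn. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple approach: trust the content stream order, append chunks as they arrive. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            out.append(c.text);
        }
        return out.toString();
    }

    /** Sorting approach: buffer all chunks, order top to bottom, then left to right. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // The stream draws the lower line first, as some PDF producers do.
        List<Chunk> page = Arrays.asList(
                new Chunk("world", 10f, 700f),
                new Chunk("Hello ", 10f, 720f));
        System.out.println(inStreamOrder(page));    // prints "worldHello "
        System.out.println(sortedByPosition(page)); // prints "Hello world"
    }
}
```

The in-order variant can emit text as it goes and needs no buffering, whereas the sorting variant has to hold every chunk of the page and pay for the sort. That is the trade-off: the simple strategy is cheaper, but only gives readable output when the PDF stores its text in reading order.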