Java PDFBox – read and modify a PDF with special characters (diacritics)

I'm trying to modify a PDF using this method (the first code block: use PDFStreamParser, iterate over the PDFOperator objects, and update the COSString where necessary):

http://www.coderanch.com/t/556009/open-source/PdfBox-Replace-String-double-pdf

I have run into problems with UTF-8 characters (diacritics): when I print the text I want to update, it is displayed as "society?? II na? Ionale" (where each '?' is a code such as 0002 or 0004).

Interestingly:

>When I write the updated PDF file, the characters display correctly (even though I can't detect and replace them).
>If I extract the text using PDFTextStripper's getText(...), the text is extracted perfectly.
>I tried two PDFBox versions: 1.5.0 (as described above) and 1.8.1 (where the final written PDF file does not display the special characters correctly, and a "null" string appears in the document).

What can I do (or configure) in the classes used to update the PDF so that all UTF-8 characters are handled correctly?

Edit 2:

I searched the PDFBox source code for PDFTextStripper and its superclasses, and found how the text is extracted:

At the beginning of the processStream method, we have

graphicsState = new PDGraphicsState(aPage.findCropBox());

When stripping text in processEncodedText, the following PDFont instance is used:

final PDFont font = graphicsState.getTextState().getFont();

The text is then extracted from the byte[] using the following:

String c = font.encode( string,i,codeLength );

The new problem is that when I obtain a PDFont instance with the same two lines of code, I get a null font, so I can't use its encode(...) method. The source code of these classes is here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFStreamEngine.java and http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFTextStripper.java

I'll keep digging.

Solution

You can't simply replace text in a string. I don't say this lightly: I worked on Acrobat many years ago and on the text search tool in its initial version, so I have a deep understanding of text encoding. The main problem is that every string in a PDF is encoded in some way. This is because PDF was created before Unicode was commonly available and has its history in PostScript. PostScript likes to provide very flexible encoding mechanisms for fonts and encourages re-encoding.

Let's step back and understand the whole situation.

A string in a PDF that is shown by a text operator is, by default, encoded as a series of 8-bit characters. To determine the glyph drawn for each byte, the byte is pushed through the font's encoding vector. The encoding vector maps the byte to a glyph name, which is then looked up in the font and drawn on the page. Please note that this description is a half-truth (more on that later).
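The two-step lookup described above can be sketched in a few lines of Java. This is a toy model, not PDFBox code: the class name, the `ENCODING` table, and the `GLYPHS` map are all made up for illustration, and a String stands in for the glyph outline.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a simple 8-bit PDF font: byte -> glyph name -> glyph. */
class EncodingVectorSketch {
    // Hypothetical encoding vector: index is the byte value, entry is a glyph name.
    static final String[] ENCODING = new String[256];
    // Hypothetical font program: glyph name -> the shape drawn (a String stands in for it).
    static final Map<String, String> GLYPHS = new HashMap<>();

    static {
        ENCODING[0x41] = "A";
        ENCODING[0x42] = "B";
        ENCODING[0xC4] = "Adieresis";
        GLYPHS.put("A", "A");
        GLYPHS.put("B", "B");
        GLYPHS.put("Adieresis", "\u00C4");
    }

    /** Resolves each byte of a content-stream string to the glyph it would draw. */
    static String show(byte[] pdfString) {
        StringBuilder drawn = new StringBuilder();
        for (byte b : pdfString) {
            String glyphName = ENCODING[b & 0xFF];                              // byte -> glyph name
            String glyph = (glyphName == null) ? null : GLYPHS.get(glyphName);  // name -> glyph
            drawn.append(glyph == null ? "?" : glyph);                          // unmapped: nothing usable
        }
        return drawn.toString();
    }

    public static void main(String[] args) {
        System.out.println(show(new byte[] {0x41, (byte) 0xC4, 0x42}));
    }
}
```

The point of the sketch is that the bytes in the string mean nothing on their own; change the table and the same bytes draw different glyphs.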

Most applications that generate PDF are very friendly and stick to a standard encoding, such as StandardEncoding or WinAnsiEncoding, most of which are fairly reasonable. Others use a standard encoding plus an encoding delta, which lists the differences between the standard encoding and the one they actually want.
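As a rough illustration of "standard encoding plus delta": in a PDF encoding dictionary the delta is a Differences array, where a number sets the current code and each following name is assigned to the current code, which then advances by one. The sketch below applies such a list to a base table. The class and method names are hypothetical, and the base table is a tiny stand-in, not the real WinAnsiEncoding.

```java
import java.util.Arrays;

/** Sketch: applying a PDF-style Differences list on top of a base encoding. */
class DifferencesSketch {
    /**
     * differences holds Integers (start codes) and Strings (glyph names),
     * e.g. {0x80, "Adieresis", "Odieresis"}.
     */
    static String[] applyDifferences(String[] base, Object[] differences) {
        String[] result = Arrays.copyOf(base, 256);
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;           // a number resets the current code
            } else {
                result[code++] = (String) item;  // a name fills the current code, then advances
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] base = new String[256];         // stand-in for a standard encoding
        base[0x41] = "A";
        Object[] differences = {0x80, "Adieresis", "Odieresis"};
        String[] encoding = applyDifferences(base, differences);
        System.out.println(encoding[0x41] + " " + encoding[0x80] + " " + encoding[0x81]);
    }
}
```

With this delta, Ä lives at byte 0x80 rather than at the 0xC4 one might expect from Latin-1 or WinAnsi.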

Some applications try to be more frugal in the PDF they generate, so they look at the glyphs they actually use and decide to embed only a subset of the font. If they only use uppercase and lowercase Roman letters and digits, they may rebuild the font with only those glyphs, and they may choose to re-index them and supply an encoding vector so that byte 0x00 goes to glyph 'A', 0x01 goes to glyph 'B', and so on.
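To see why subsetting breaks naive text replacement, here is a small sketch under the assumption that the producer hands out codes 0, 1, 2, ... in order of first use. Real producers may number glyphs differently; the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a re-indexed subset font: only the characters used get a code. */
class SubsetFontSketch {
    /** Assigns codes 0, 1, 2, ... to the distinct characters in order of first use. */
    static Map<Character, Integer> buildSubsetEncoding(String usedText) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        for (char c : usedText.toCharArray()) {
            codes.putIfAbsent(c, codes.size());
        }
        return codes;
    }

    /** Encodes replacement text, failing if a character has no glyph in the subset. */
    static byte[] encode(String text, Map<Character, Integer> codes) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = codes.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("no glyph in subset for '" + text.charAt(i) + "'");
            }
            out[i] = (byte) code.intValue();
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Character, Integer> codes = buildSubsetEncoding("ABBA");
        System.out.println(Arrays.toString(encode("BA", codes)));
        encode("CAB", codes); // throws: 'C' was never used, so the subset has no glyph for it
    }
}
```

Writing the ASCII bytes of the new text into such a string would either draw the wrong glyphs or nothing at all.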

Now back to the half-truth. TrueType and OpenType fonts belong to a class of fonts whose glyphs are addressed by character ID (CID). In this case you do get access to Unicode, but there is an additional encoding step: you map the string (now UTF-16BE) to CIDs, which are then used to fetch the glyphs from the font. And for no good reason, Adobe uses PostScript functions for this mapping. Again, this is only a 3/4 truth, because there are different encodings for the legacy handling of Chinese, Japanese, and Korean fonts.
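A minimal sketch of that extra step, assuming fixed two-byte big-endian codes and a plain map standing in for the font's code-to-CID mapping. The numbers in the example are invented, and treating unmapped codes as CID 0 (the "not defined" glyph) is a simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: reading a string as 2-byte codes and mapping each code to a CID. */
class CidLookupSketch {
    static List<Integer> toCids(byte[] pdfString, Map<Integer, Integer> codeToCid) {
        if (pdfString.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < pdfString.length; i += 2) {
            // Combine two bytes into one big-endian code.
            int code = ((pdfString[i] & 0xFF) << 8) | (pdfString[i + 1] & 0xFF);
            cids.add(codeToCid.getOrDefault(code, 0)); // assumption: unmapped -> CID 0
        }
        return cids;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codeToCid = new HashMap<>();
        codeToCid.put(0x0041, 34);   // hypothetical CIDs
        codeToCid.put(0x00C4, 100);
        byte[] pdfString = {0x00, 0x41, 0x00, (byte) 0xC4, 0x12, 0x34};
        System.out.println(toCids(pdfString, codeToCid));
    }
}
```

Two-byte values like these have the same shape as the 0002 and 0004 codes mentioned in the question, which may be a hint that the font there is encoded this way.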

Therefore, before you can blithely put a character into a PDF string for a font, you must ask several questions:

>Is my glyph in the font?
>Is my glyph in the encoding?
>What is the encoding of my glyph?

The answer to any of them may differ from what you expect. So, for example, if you want to insert Ä (Adieresis), you have to check whether the font has a glyph for it (it may not, because the font is a subset). Then the font may have an interesting encoding that does not include the glyph. Finally, the actual byte value for Ä may not be the standard one.
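Those three questions can be written down as one small check. This is only a sketch over plain Java collections: `fontGlyphs` and `encoding` are hypothetical stand-ins for whatever your PDF library exposes for the font's glyph set and encoding vector.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.OptionalInt;
import java.util.Set;

/** Sketch: can this glyph be written into a string for this font, and with which byte? */
class GlyphCheckSketch {
    static OptionalInt codeFor(String glyphName, Set<String> fontGlyphs, String[] encoding) {
        if (!fontGlyphs.contains(glyphName)) {
            return OptionalInt.empty();              // question 1: the glyph is not in the font
        }
        for (int code = 0; code < encoding.length; code++) {
            if (glyphName.equals(encoding[code])) {
                return OptionalInt.of(code);         // question 3: this is the byte to write
            }
        }
        return OptionalInt.empty();                  // question 2: the glyph is not in the encoding
    }

    public static void main(String[] args) {
        Set<String> fontGlyphs = new HashSet<>(Arrays.asList("A", "Adieresis"));
        String[] encoding = new String[256];
        encoding[0x00] = "A";
        encoding[0x07] = "Adieresis";                // not at 0xC4, where one might expect it
        System.out.println(codeFor("Adieresis", fontGlyphs, encoding));
        System.out.println(codeFor("B", fontGlyphs, encoding));
    }
}
```

Only when this kind of check yields a code for every character of the replacement text is an in-place string replacement even possible.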

So when I see someone trying to simply replace a chunk of text in PDF content, all I see is a world of hurt. For most reasonable PDFs this will work 90% of the time, but for anything exotic: good luck. The quirks of PDF text rendering are painful, and it is sometimes easier to think of PDF as a write-only format.
