Parsing Word documents in Java to get the position of pictures in the document
Preface (background introduction):
Apache POI is an open source project of the Apache Software Foundation. It is used to process Office documents and can create and parse files in Word, Excel and PowerPoint formats.
POI offers two APIs for Word documents: HWPF (.doc) and XWPF (.docx). If you are familiar with both, you will understand the pain of parsing Word documents in Java.
The two biggest problems are:
First, the two APIs share no common parent class or interface (their Excel neighbours XSSF and HSSF look down on them for it), so you cannot program against a single interface for both formats;
Second, the official API has no method for the position of pictures relative to the document text. You can get all the pictures in a document, but you cannot tell where they belong, so if you later want to display the document you cannot insert them at the correct place.
For the first point I have no solution. You could look into other technologies, such as JACOB and docx4j, to see whether they do better. However, docx4j seems to handle only Word 2007 documents (.docx).
For the second point, this article gives the author's solution. In fact, that is the purpose of this article.
Note: readers in a hurry can go straight to sections 2 and 3.
1、 Preparatory knowledge
1. The two formats of Word documents correspond to two different storage methods
As we all know, Word documents come in two storage formats: doc and docx.
doc: customarily called the Word 2003 format; it stores data in binary. It is not the focus of today's discussion.
docx: the Word 2007 format; it stores data and formatting as XML.
You may ask: the file name clearly ends in .docx, so how can it be XML?
It's very simple: select a docx file, right-click, and open it with a compression tool. You will see a directory structure rather than a single document.
So you thought docx was a single, complete document. In fact, it is just a compressed archive. (docx: ?_?)
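The claim that a docx file is just an archive can be checked from Java as well. The following is a minimal sketch that uses only the JDK's java.util.zip classes, not POI. The class name DocxEntryLister and the in-memory sample archive are made up for illustration; the sample contains two placeholder entries and is not a valid Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxEntryLister {

    /** Returns the entry names of a ZIP archive (a .docx is one) in stored order. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory; it is NOT a valid Word document. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"word/document.xml", "word/media/image1.png"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Prints one entry name per line.
        listEntries(sampleArchive()).forEach(System.out::println);
    }
}
```

To try it on a real file, pass the bytes of the file (for example from Files.readAllBytes) to listEntries instead of the sample archive.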
2. How XML is laid out inside a Word document:
From the above we know that a docx document is a compressed archive that describes its data in XML. So how is the data in a Word document defined?
To save space, the whole archive is not described in detail here. Only the following file and folder are introduced briefly:
The first is the document.xml file in the word directory, which defines the content of the whole document;
The second is the media folder in the word directory. As the name suggests, the multimedia content of the document is stored in this folder:
Figure 3: word/document.xml (defines the document content)
Figure 4: Contents of the word/media folder
Here are some key parts of document.xml:
A: Document overall structure definition:
B: Document paragraph content:
C: Picture content definition:
Interested readers can study the three XML snippets above. I will go straight to the conclusions:
Word document schema namespace: xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
Document root node: <w:document> marks the start of the entire document
<w:body>: child of document; the main content of the document
<w:p>: child of body; one paragraph of the Word document
<w:r>: child of the p element; a run is a stretch of text within the paragraph that shares the same formatting
<w:t>: child of the run element; the text content of the document
<w:drawing>: child of the run element; defines a picture:
<wp:inline>: child of drawing; I have not studied its exact use in depth
<a:graphic>: defines the picture content
<pic:blipFill>: a descendant of graphic that holds the index of the image content (the r:embed attribute of its <a:blip> child). POI can fetch the resource for the image from this index, and it is the key to obtaining the image's position in the document
Broadly speaking, XWPF parses a docx document by parsing its XML files, keeping all the nodes, converting them into more convenient objects and properties, and exposing them to users through its API.
So we can use the interfaces POI provides to get the document content, analyze the data in it, and work out which paragraph a picture is in. You can also tell which run element the picture follows.
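To make the element hierarchy concrete, here is a small sketch that walks paragraphs and runs and reports where each picture reference sits. It is not POI code: it uses the JDK's DOM parser as a stand-in so that it runs without extra libraries. The class name PicturePositionSketch, the output format and the sample XML (a hand-written, heavily trimmed document.xml) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PicturePositionSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Hand-written sample: paragraph 0 has text only, paragraph 1 has text plus one picture.
    static final String SAMPLE =
        "<w:document xmlns:w='" + W + "' xmlns:a='" + A + "' xmlns:r='" + R + "'"
      + " xmlns:wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'"
      + " xmlns:pic='http://schemas.openxmlformats.org/drawingml/2006/picture'>"
      + "<w:body>"
      + "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
      + "<w:p><w:r><w:t>Figure: </w:t></w:r>"
      + "<w:r><w:drawing><wp:inline><a:graphic><a:graphicData><pic:pic><pic:blipFill>"
      + "<a:blip r:embed='rId5'/>"
      + "</pic:blipFill></pic:pic></a:graphicData></a:graphic></wp:inline></w:drawing></w:r></w:p>"
      + "</w:body></w:document>";

    /** Lists every a:blip as "paragraph=P run=R embed=ID", with indexes counted from 0. */
    static List<String> locate(String documentXml) {
        List<String> found = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            // Simplification: this matches all descendants, so paragraphs nested in
            // text boxes would be counted as well. Fine for a flat sample like this one.
            NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
            for (int p = 0; p < paragraphs.getLength(); p++) {
                NodeList runs = ((Element) paragraphs.item(p)).getElementsByTagNameNS(W, "r");
                for (int r = 0; r < runs.getLength(); r++) {
                    NodeList blips = ((Element) runs.item(r)).getElementsByTagNameNS(A, "blip");
                    for (int b = 0; b < blips.getLength(); b++) {
                        String id = ((Element) blips.item(b)).getAttributeNS(R, "embed");
                        found.add("paragraph=" + p + " run=" + r + " embed=" + id);
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        locate(SAMPLE).forEach(System.out::println);
    }
}
```

For the sample, the single picture is reported in the second paragraph (index 1), in the run after the text "Figure: " (index 1).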
2、 Implementation
The first thing to mention is how XWPF wraps the XML elements:
<w:document> corresponds to the XWPFDocument class
<w:p> corresponds to the XWPFParagraph class
<w:r> corresponds to the XWPFRun class
The wrapping essentially stops at the run level. Because a run has many kinds of child elements, the levels below it are not wrapped in classes of their own.
So all we can do is take the underlying XML definition of an XWPFRun, its CTR object (returned by getCTR()), and use the CTR to read and parse the contents of the run element to obtain the image index.
Next, a word about how the XML elements themselves are defined:
POI uses Apache XMLBeans to parse XML. The technology is not discussed in detail here; the key is to understand two points:
1: After being wrapped by XMLBeans, every element in the XML document implements the XmlObject interface, so you can use this type to receive any child element you obtain;
2: Elements are traversed with an XmlCursor. Which elements the cursor visits is controlled by the path passed to its selectPath method; the path "./*" selects the child elements of the current element;
Based on these two points, the code traverses the child elements of the current element and checks the type of each child, first against <w:drawing> and then against <w:object>.
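The child-element check can be illustrated with a small sketch. Note that this is not the POI/XMLBeans code the article describes: it swaps in the JDK's DOM API so that it runs without POI on the classpath, with the loop over element children playing the role of the "./*" selection. The class name RunChildScanner and the sample run are invented for illustration, and the nesting inside <w:drawing> is abbreviated.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class RunChildScanner {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String V = "urn:schemas-microsoft-com:vml";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Hypothetical sample: one run holding text, a <w:drawing> and a <w:object>.
    static final String SAMPLE_RUN =
        "<w:r xmlns:w='" + W + "' xmlns:a='" + A + "' xmlns:v='" + V + "' xmlns:r='" + R + "'>"
      + "<w:t>caption</w:t>"
      + "<w:drawing><a:graphic><a:graphicData><a:blip r:embed='rId4'/>"
      + "</a:graphicData></a:graphic></w:drawing>"
      + "<w:object><v:shape><v:imagedata r:id='rId7'/></v:shape></w:object>"
      + "</w:r>";

    /** Collects the image relationship ids referenced by the direct children of one run. */
    static List<String> imageIds(Element run) {
        List<String> ids = new ArrayList<>();
        NodeList children = run.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            // Element children only: the DOM counterpart of the "./*" selection.
            if (node.getNodeType() != Node.ELEMENT_NODE || !W.equals(node.getNamespaceURI())) {
                continue;
            }
            Element child = (Element) node;
            if ("drawing".equals(child.getLocalName())) {
                collect(child, A, "blip", "embed", ids);    // first condition: <w:drawing>
            } else if ("object".equals(child.getLocalName())) {
                collect(child, V, "imagedata", "id", ids);  // second condition: <w:object>
            }
        }
        return ids;
    }

    private static void collect(Element scope, String ns, String tag, String attr, List<String> out) {
        NodeList hits = scope.getElementsByTagNameNS(ns, tag);
        for (int i = 0; i < hits.getLength(); i++) {
            String id = ((Element) hits.item(i)).getAttributeNS(R, attr);
            if (!id.isEmpty()) {
                out.add(id);
            }
        }
    }

    /** Parses a string and returns its root element, without checked exceptions. */
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(imageIds(parseRoot(SAMPLE_RUN)));
    }
}
```

With POI, the same two branches would be type checks on the XmlObject instances returned by the cursor rather than comparisons of element names.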
Finally, you may have a question: didn't I say that the <w:drawing> element defines a picture?
Then what is the second condition for?
Smart readers will have guessed already.
You're right! Besides <w:drawing>, a picture in the XML of a docx document can also be defined with the <w:object> element.
Why only these two?
Because when I parsed with only the first form, I found that some pictures were lost, and that is how I found the second. Maybe there are more than two? I don't know. At any rate, two have been enough for me so far.
Perhaps you have run into more cases in practice?
With the XML parsing approach described above, I believe you can read the document correctly and get the index value you want.
More broadly, where POI does not provide an API for something, can we implement it ourselves by parsing the XML? That is for us to explore in practice. I believe time will give us the answer.
Well, now that we have the index value, how do we get the image resource?
POI provides a ready-made method:
The XWPFDocument class has getPictureDataById(String blipID);
it returns an XWPFPictureData object, which is the image resource.
For details, please refer to related blog posts and the API documentation; they are not covered here.
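For readers curious about what the index value points to: inside the archive, the file word/_rels/document.xml.rels maps each relationship id to a target file. The sketch below reads such a mapping with the JDK's DOM parser. It is only an illustration of the underlying data, not a description of how POI implements its lookup. The class name RelationshipLookup and the sample XML are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RelationshipLookup {
    static final String REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String OFFICE_REL =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Hand-written sample in the shape of word/_rels/document.xml.rels.
    static final String SAMPLE_RELS =
        "<Relationships xmlns='" + REL_NS + "'>"
      + "<Relationship Id='rId1' Type='" + OFFICE_REL + "/styles' Target='styles.xml'/>"
      + "<Relationship Id='rId5' Type='" + OFFICE_REL + "/image' Target='media/image1.png'/>"
      + "</Relationships>";

    /** Maps relationship id to target path (relative to the word directory), images only. */
    static Map<String, String> imageTargetsById(String relsXml) {
        Map<String, String> targets = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            NodeList rels = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagNameNS(REL_NS, "Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                // Keep only relationships whose type marks them as images.
                if (rel.getAttribute("Type").endsWith("/image")) {
                    targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse relationships XML", e);
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(imageTargetsById(SAMPLE_RELS));
    }
}
```

In the sample, the id rId5 resolves to media/image1.png, a file in the word/media folder.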
3、 Test:
Code tested with JUnit 4:
Output:
Printing the picture name here shows that I obtained the corresponding resource. If you followed the content above, you will notice that the picture name is in fact the file name of that picture in the word/media folder.
From the corresponding XWPFPictureData object you can get the binary data of the image with the getData() method, so you can save it to a database or a local folder!
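Saving the picture bytes to a local folder might look like the sketch below. It does not use XWPFPictureData; as a stand-in it reads the entries under word/media/ straight from the archive with java.util.zip. The class name MediaExtractor, the sample archive (placeholder bytes, not a real Word document) and the target directory are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class MediaExtractor {

    /** Copies every entry under word/media/ into targetDir and returns the file names written. */
    static List<String> extractMedia(byte[] docxBytes, Path targetDir) {
        List<String> written = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            Files.createDirectories(targetDir);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/media/")) {
                    continue;
                }
                // Keep only the last path segment so an entry cannot escape targetDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                // Files.copy reads up to the end of the current entry and leaves the stream open.
                Files.copy(zip, targetDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
                written.add(fileName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** Builds a stand-in archive with placeholder bytes; it is NOT a valid Word document. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"word/document.xml", "word/media/image1.png"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "docx-media-demo");
        System.out.println(extractMedia(sampleArchive(), dir) + " written to " + dir);
    }
}
```

When working through POI instead, the byte array returned by getData() can be written to disk in the same way with Files.write.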
4、 Others:
With that, the second problem raised at the beginning is solved.
So, what about the first problem?
If your system does not have strict speed requirements, my suggestion is to convert doc documents to docx and then parse them; POI has mature APIs.
If performance matters, you have to write two separate parsing methods, one per format.
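If you do end up with two parsing methods, something has to decide which one to call. One option, sketched below as an assumption rather than as the author's approach, is to look at the first bytes of the file instead of trusting the extension: docx files are ZIP archives and start with the bytes "PK", while binary doc files use the OLE2 container signature. The class name WordFormatSniffer and the enum names are hypothetical.

```java
final class WordFormatSniffer {

    enum Format { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound file signature (shared by .doc, .xls and .ppt).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\3\4" (shared by every ZIP-based format).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Guesses the container format from the leading bytes of a file. */
    static Format detect(byte[] head) {
        if (startsWith(head, OLE2)) {
            return Format.DOC_OLE2;   // would be routed to the HWPF-based parser
        }
        if (startsWith(head, ZIP)) {
            return Format.DOCX_ZIP;   // would be routed to the XWPF-based parser
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04}));
    }
}
```

The signatures only identify the container, so a match says "probably doc" or "probably docx", not that the file really is a Word document.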
So how do you get the relative position of pictures in doc-type Word documents?
I don't know. Or perhaps you can tell me?