How to delete all images / drawings from PDF files and leave only text in Java?

2019-06-08 • Java

I have a PDF file that is the output of an OCR processor. The OCR processor recognizes the images and adds the text to the PDF, but it ends up embedding low-quality images in place of the originals (I don't know why anyone would do this, but they do).

So I want to take this PDF, remove the image streams, and leave the text untouched, so that I can import its pages (using the iText page import function) into a PDF that I create myself with the real images.

A similar question has been asked before. I have tried using another tool (jpedal) to extract the text coordinates, but when I draw the text on my PDF, it does not land in the same position as in the original.

I'd rather do this in Java, but if another tool can do it better, just let me know. It could also be image removal only; I can live with the drawings staying in the PDF.

Solution

I use Apache PDFBox in similar situations.

To be more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

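// Written against the PDFBox 1.x API; the exception types and the
// getAllPages()/findResources() methods used here are 1.x names.
public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Encrypted files must be decrypted (here with an empty password) before editing.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                // Skip pages that have no resource dictionary at all.
                if (resources != null) {
                    // Drop every image XObject registered in this page's resources.
                    resources.getImages().clear();
                }
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Release the resources held by the loaded document.
            document.close();
        }
    }
}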

This should remove images of every type (PNG, JPEG, ...). The result should look like this:

Sample article http://s3.postimage.org/28f6boykk/before.jpg.
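
For context on what "removing an image" means inside the page itself: in a PDF content stream, an image XObject is painted by the Do operator, with the resource name as its operand (for example "/Im0 Do"). The following is a minimal, library-free sketch of dropping those operators from an already-decoded content stream held as text. It is not PDFBox code and not part of the answer above. The class name ImageOpFilter, the method stripImageDraws, and the sample stream are hypothetical, and the sketch assumes the image resource names are already known. It does not handle inline images (BI ... ID ... EI), compressed streams, or a "/Im0 Do" sequence that happens to appear inside a string literal.

```java
import java.util.Set;
import java.util.regex.Pattern;

public class ImageOpFilter {

    // Removes every "/Name Do" operator pair whose name is in imageNames.
    // Everything else in the content stream text is left exactly as it was.
    static String stripImageDraws(String content, Set<String> imageNames) {
        String result = content;
        for (String name : imageNames) {
            // (?<!\S) and (?!\S) make sure we match whole tokens only,
            // so "/Im0 Do" is removed but "/Im01 Do" is not.
            Pattern draw = Pattern.compile(
                    "(?<!\\S)/" + Pattern.quote(name) + "\\s+Do(?!\\S)");
            result = draw.matcher(result).replaceAll("");
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical decoded content stream: one text block, one image draw.
        String content =
                "BT /F1 12 Tf 72 700 Td (Hello) Tj ET\n"
              + "q 200 0 0 100 72 500 cm /Im0 Do Q\n";
        System.out.println(stripImageDraws(content, Set.of("Im0")));
    }
}
```

The text-showing operators (BT ... Tj ... ET) are untouched, which is the property the question cares about: the text stays at its original coordinates because only the image paint operators are taken out.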
