How to delete all images / drawings from PDF files and leave only text in Java?
I have a PDF file that is the output of an OCR processor. The OCR processor recognizes the images and adds text to the PDF, but in the end it places low-quality images in place of the original ones (I don't know why anyone would do this, but they do).
So, I want to take this PDF, remove the image streams, and leave the text alone, so that I can combine it (using the iText page import function) with a PDF I create myself from the real images.
Before someone asks: I have already tried using another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF, it is not in the same position as in the original.
I'd rather do this in Java, but if another tool can do it better, just let me know. It can also be image removal only; I can live with the drawings staying in the PDF.
Solution
I used Apache PDFBox in a similar situation.
To be more specific, try something like this:
    // Written against the PDFBox 1.x API (getAllPages, findResources, COSVisitorException).
    import org.apache.pdfbox.exceptions.COSVisitorException;
    import org.apache.pdfbox.exceptions.CryptographyException;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;

    import java.io.IOException;

    public class Main {
        public static void main(String[] argv) throws COSVisitorException,
                InvalidPasswordException, CryptographyException, IOException {
            PDDocument document = PDDocument.load("input.pdf");

            // Encrypted files must be decrypted (here with an empty password) before editing.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Walk every page and clear the image entries from its resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                resources.getImages().clear();
            }

            document.save("strippedOfImages.pdf");
            document.close(); // release the underlying file handle
        }
    }
It should remove all types of images (PNG, JPEG, ...). It should work like this:
Sample article http://s3.postimage.org/28f6boykk/before.jpg.
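For intuition about what "removing an image" means inside a page: in a PDF content stream, an image is painted by the `Do` operator applied to a named XObject (for example `/Im0 Do`), while text is shown with operators such as `Tj`. The following is only a rough sketch of that idea in plain Java, not what PDFBox does internally and not the code from the answer above. It assumes the content stream has already been decoded to text and that you already know which XObject names refer to images; the class name `ImageDrawFilter`, the method `stripImageDraws`, and the names `Im0` and `Fm1` are made up for illustration. It does not handle inline images (`BI ... EI`) and would be fooled by a string literal that happens to contain `/Name Do`.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageDrawFilter {
    // Matches "/Name Do": the operator that paints the XObject called Name.
    private static final Pattern DO_OP =
            Pattern.compile("/([^\\s/\\[\\]()<>{}%]+)\\s+Do(?=\\s|$)");

    /**
     * Returns the content stream with every "/Name Do" removed whose Name
     * is in imageNames. Everything else (text operators, form XObjects,
     * graphics state) is left untouched.
     */
    public static String stripImageDraws(String content, Set<String> imageNames) {
        Matcher m = DO_OP.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the operator for image names, keep it for anything else.
            String replacement = imageNames.contains(m.group(1)) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical decoded page content: one image, one line of text, one form XObject.
        String content =
                "q 200 0 0 100 50 600 cm /Im0 Do Q\n"
              + "BT /F1 12 Tf 72 700 Td (Hello) Tj ET\n"
              + "q /Fm1 Do Q";
        System.out.println(stripImageDraws(content, Collections.singleton("Im0")));
    }
}
```

In this sketch the text operators are never touched, which is why the text stays at its original coordinates; only the paint call for the listed image names disappears. In practice a PDF library should do the parsing, since real content streams are usually compressed and have a richer syntax than this regular expression covers.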