Converting PDF to text is a useful task with many applications, from
search engines indexing PDF documents to other data processing jobs. I was
looking for a Java-based API to convert PDF to text, in other words a
PDF text parser in Java, and after going through many articles, the PDFBox
project came to my rescue. PDFBox is a library that can handle different
types of PDF documents, including encrypted PDFs, and extract their text.
It also ships with a command-line utility for converting PDF documents to
text.
I needed a reusable Java class to convert PDF documents to text in one of
my projects, and the Java code below does that using the PDFBox Java API.
It takes two command-line parameters: the input PDF file and the output
text file to which the parsed text from the PDF document is written.
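This code was tested with PDFBox 0.7.3. It should also work with other releases that keep the org.pdfbox package names; later Apache PDFBox releases moved to the org.apache.pdfbox packages, so the imports (and possibly some calls) would need updating there. The class can be integrated into other Java applications or used as a command-line utility. The steps to compile and run it are given after the listing.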
Listing 1: PDFTextParser.java
/*
 * PDFTextParser.java
 *
 * Extracts the text of a PDF document with PDFBox 0.7.3
 * and writes it to a text file.
 */

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;

    // PDFTextParser constructor
    public PDFTextParser() {
    }

    // Extract text from a PDF document; returns null if anything fails
    String pdftoText(String fileName) {

        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);

        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the documents whether or not parsing succeeded
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from the PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {

        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
            System.out.println("Done.");
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
    }

    // Extracts text from a PDF document and writes it to a text file
    public static void main(String[] args) {

        if (args.length != 2) {
            System.out.println("Usage: java PDFTextParser <InputPDFFile> <OutputTextFile>");
            System.exit(1);
        }

        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(args[0]);

        if (pdfToText == null) {
            System.out.println("PDF to Text Conversion failed.");
        } else {
            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
            pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
        }
    }
}

Explanation:
The above code takes two command-line parameters: the input PDF file and the output text file. The pdftoText method handles the text parsing, and the writeTexttoFile method writes the parsed text to the output file.
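One thing worth knowing about the writing step: a PrintWriter opened from a file name uses the platform's default character encoding, and its print methods do not throw I/O exceptions. If the extracted text contains non-ASCII characters, an explicit encoding may be safer. The following is only a minimal sketch of that idea. The class name Utf8TextWriter and its write method are hypothetical helpers of my own, not part of PDFBox or of Listing 1, and the choice of UTF-8 is an assumption about what the consumer of the text file expects.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper (not part of PDFBox or Listing 1).
final class Utf8TextWriter {

    private Utf8TextWriter() {
    }

    // Writes text to fileName as UTF-8, replacing any existing file.
    // Unlike PrintWriter.print, a failed write surfaces as an IOException.
    static void write(String text, String fileName) throws IOException {
        Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(fileName), "UTF-8"));
        try {
            out.write(text);
        } finally {
            out.close(); // flushes the buffer; runs even if write() fails
        }
    }
}
```

It could be called in place of the PrintWriter lines, for example Utf8TextWriter.write(pdfText, fileName), with the surrounding try/catch left as it is. The sketch avoids try-with-resources so that it still compiles on JDK 1.6.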
Compiling and Running the code:
I used PDFBox 0.7.3 to compile and run the above code, so you need to add its jars to the classpath in your Java project settings.
1. Download PDFBox-0.7.3.zip from the PDFBox project's download page.
2. Unzip PDFBox-0.7.3.zip.
3. Under the PDFBox-0.7.3 folder, add the jar in the lib directory (PDFBox-0.7.3.jar) and the jars in the external directory (other packages used by PDFBox 0.7.3) to the classpath when compiling and running the code.
Note: I used JDK 1.6 to compile the above code.
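If you prefer not to type the classpath from step 3 by hand, it can be assembled from the two directories. The sketch below is only an illustration of that step: the class name JarClasspath is hypothetical, and the default folder name PDFBox-0.7.3 assumes the zip was extracted into the current directory.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

// Hypothetical helper (not part of PDFBox): builds a classpath string
// from every .jar file found directly inside the given directories.
final class JarClasspath {

    private JarClasspath() {
    }

    static String of(File... dirs) {
        StringBuilder cp = new StringBuilder();
        for (File dir : dirs) {
            File[] jars = dir.listFiles(new FilenameFilter() {
                public boolean accept(File d, String name) {
                    return name.toLowerCase().endsWith(".jar");
                }
            });
            if (jars == null) {
                continue; // not a directory, or it could not be read
            }
            Arrays.sort(jars); // keep the order predictable
            for (File jar : jars) {
                if (cp.length() > 0) {
                    cp.append(File.pathSeparator);
                }
                cp.append(jar.getPath());
            }
        }
        return cp.toString();
    }

    public static void main(String[] args) {
        // Assumed location of the unzipped distribution; pass another path as args[0].
        File home = new File(args.length > 0 ? args[0] : "PDFBox-0.7.3");
        System.out.println(of(new File(home, "lib"), new File(home, "external")));
    }
}
```

The printed string, with the current directory appended, can be passed to javac and java through the -cp option when compiling and running PDFTextParser.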