How to extract text from a PDF file with Apache PDFBox

Question

Using PDFBox 2.0.7, this is how I get the text of a PDF:

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction throws
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

Call it like this:

try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}
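
For reference, here is a minimal self-contained sketch of the same approach with the imports spelled out. It assumes PDFBox 2.0.x on the classpath (in 3.x, PDDocument.load was replaced by Loader.loadPDF). The class name PdfTextExample and the page-range overload of getText are hypothetical and not part of the original answer; setStartPage and setEndPage are PDFTextStripper methods that limit extraction to a range of pages.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExample {

    // Hypothetical helper: extracts the text of pages startPage..endPage
    // (1-based, inclusive) from the given PDF file.
    static String getText(File pdfFile, int startPage, int endPage) throws IOException {
        // try-with-resources closes the document when extraction finishes or fails
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: PdfTextExample <file.pdf>");
            return;
        }
        // Print only the first page of the PDF passed on the command line
        System.out.println(getText(new File(args[0]), 1, 1));
    }
}
```

Restricting the page range is handy for large documents where only a few pages are needed, since the stripper skips the rest.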

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

dependencies {
  implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}

See the Gradle documentation for more on dependency management.

If you want to keep the PDF’s layout in the extracted text, give PDFLayoutTextStripper a try. It is a third-party subclass of PDFTextStripper, not part of PDFBox itself.
