FlaDev/PDFLayoutTextStripper


Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, to extract the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

Download


Step 1. Add the JitPack repository to your build file

Add it in your root settings.gradle at the end of repositories:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url 'https://jitpack.io' }
		}
	}

Add it in your settings.gradle.kts at the end of repositories:

	dependencyResolutionManagement {
		repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
		repositories {
			mavenCentral()
			maven { url = uri("https://jitpack.io") }
		}
	}

Add it to your pom.xml:

	<repositories>
		<repository>
		    <id>jitpack.io</id>
		    <url>https://jitpack.io</url>
		</repository>
	</repositories>

Add it in your build.sbt at the end of resolvers:

	resolvers += "jitpack" at "https://jitpack.io"

Add it in your project.clj at the end of repositories:

	:repositories [["jitpack" "https://jitpack.io"]]

Step 2. Add the dependency

Gradle (build.gradle):

	dependencies {
		implementation 'com.github.FlaDev:PDFLayoutTextStripper:'
	}

Gradle Kotlin DSL (build.gradle.kts):

	dependencies {
		implementation("com.github.FlaDev:PDFLayoutTextStripper:")
	}

Maven (pom.xml):

	<dependency>
	    <groupId>com.github.FlaDev</groupId>
	    <artifactId>PDFLayoutTextStripper</artifactId>
	    <version></version>
	</dependency>

sbt (build.sbt):

	libraryDependencies += "com.github.FlaDev" % "PDFLayoutTextStripper" % ""

Leiningen (project.clj):

	:dependencies [[com.github.FlaDev/PDFLayoutTextStripper ""]]

Readme


# PDFLayoutTextStripper

- Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content of a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

  • Use cases
  • How to install
  • How to use

Use cases

  • Data extraction from a table in a PDF file
  • Data extraction from a form in a PDF file

How to install

1) Install Apache PDFBox through Maven (for example, v1.8.13).

Warning: currently only PDFBox versions earlier than 2.0.0 are compatible with PDFLayoutTextStripper.java.

2) Copy PDFLayoutTextStripper.java into your main Java package.

How to use

package pdftest.pt;

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class test {

	public static void main(String[] args) {
		String string = null;
		// try-with-resources closes the input stream even if parsing fails.
		try (FileInputStream input = new FileInputStream("sample.pdf")) {
			PDFParser pdfParser = new PDFParser(input);
			pdfParser.parse();
			PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
			try {
				// PDFLayoutTextStripper keeps the original layout of the page.
				PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
				string = pdfTextStripper.getText(pdDocument);
			} finally {
				// Release the resources held by the parsed document.
				pdDocument.close();
			}
		} catch (IOException e) { // also covers FileNotFoundException
			e.printStackTrace();
		}
		System.out.println(string);
	}

}
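
Since the extracted text keeps the layout of the page, the rows of a table can often be split into cells after extraction. The following is a minimal sketch of that idea, not part of PDFLayoutTextStripper itself: it assumes that columns in the output are separated by runs of two or more spaces, which may not hold for every document. The class name LayoutTableSplitter, its methods, and the sample text are hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFLayoutTextStripper or PDFBox.
class LayoutTableSplitter {

    // Splits one layout-preserved line into cells. A run of two or more
    // spaces is treated as a column gap (an assumption about the spacing).
    static List<String> splitRow(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    // Splits every non-blank line of the extracted text into a row of cells.
    static List<List<String>> splitTable(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                rows.add(splitRow(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the string returned by getText().
        String text = "Item        Qty    Price\n"
                    + "Blue pen      2     1.50\n";
        for (List<String> row : splitTable(text)) {
            System.out.println(row);
        }
    }
}
```

In practice, the string returned by getText() in the example above would be passed to splitTable(). Cells that contain single spaces (such as "Blue pen") stay intact, but cells separated by only one space would be merged, so the gap threshold may need tuning per document.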