Using the command line tabula extractor tool
Say you have a 1,000-page PDF file (or 1,000 separate PDF files), each page is laid out identically, and you want the same table from the middle of each page. Trying this in the full web version of Tabula may not work because of the file size, or because each file has to be uploaded and processed separately.
You can use tabula-java, the engine that powers Tabula, as a standalone command-line tool to handle these situations.
You’ll need Java installed.
See the README for the latest download link. Simply place the .jar file somewhere you will be able to locate it later.
You can check that Java and the .jar file are set up correctly by running the following command, which prints the command-line help text. (In this example, the tabula-java download has been placed in a folder called target.)
java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar --help
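If the help text does not appear, it can be useful to check the two prerequisites separately. The following is a minimal sketch, not part of tabula-java: `check_setup` is a hypothetical helper name, and the path you pass to it is whatever location you chose for the .jar file.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of tabula-java):
# is java on the PATH, and does the .jar file exist at the given path?
check_setup() {
  local jar=$1
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    return 1
  fi
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  echo "ok: java found and $jar exists"
}
```

For the layout used in this example, you would call it as `check_setup ./target/tabula-0.9.0-jar-with-dependencies.jar`.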
The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)
You can either use the "full" Tabula app to get these coordinates, or manually measure using the Preview app in Mac OS X.
- Download Tabula from http://tabula.technology/ if you haven't already.
- Open Tabula and upload your PDF into the web page that appears.
- Select your table area(s) as usual and proceed to the "Preview & Export Extracted Data" step.
- Under Export Format, select "Script" instead of CSV, and then click "Export" to download the generated code. Save this file somewhere you can find it.
- Open the script you downloaded in a code editor.
- The "Using tabula-extractor with coordinates" section below describes how to use the command-line invocation of Tabula.
- Note that the script export starts each line with tabula instead of java -jar /path/to/tabula.jar. Make sure you edit the script to use this java invocation and the correct path to the downloaded .jar file.
- The generated script contains measurements already filled in, based on what you selected in the Tabula app. You can use this as a starting point to process many documents of the same type, for example a monthly report that is generated as a separate PDF each month, with the table you want located in exactly the same place each time.
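The edit to the exported script can also be done mechanically. This is a rough sketch that assumes, as described above, that the relevant lines begin with the word tabula followed by a space; `use_jar_invocation`, the `JAR` variable, and the file names in the usage note are placeholders chosen for illustration.

```shell
#!/bin/bash
# JAR is a placeholder; set it to the real location of your downloaded .jar.
JAR="/path/to/tabula.jar"

# Read a script on stdin and replace a leading "tabula " on each line with
# the java invocation. Assumes JAR contains no "|" or "&" characters,
# which sed would treat specially in this substitution.
use_jar_invocation() {
  sed "s|^tabula |java -jar $JAR |"
}
```

A possible use is `use_jar_invocation < exported-script.sh > extract.sh`. Lines that do not start with tabula pass through unchanged, so it is still worth reading the result before running it.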
- Open your PDF file in the Preview app.
- Make sure Tools > Rectangular selection is checked.
- Open the inspector by going to Tools > Show inspector.
- Go to the "crop inspector" tab (second from the right; it looks like a ruler).
- Change "Units" to Points.
- Select the area you want on the page.
Note the left, top, height, and width parameters and calculate the following:
- y1 = top
- x1 = left
- y2 = top + height
- x2 = left + width
You'll need these four values in the "Using tabula-extractor with coordinates" section below.
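The arithmetic above is easy to get in the wrong order, since Preview reports left before top while the area argument expects y before x. As a small sketch, the hypothetical helper `preview_to_area` below takes the four Preview values and prints them in y1,x1,y2,x2 order; it uses awk because the measurements are usually fractional.

```shell
#!/bin/bash
# Hypothetical helper: convert Preview's left/top/width/height (in points)
# into the y1,x1,y2,x2 order used for the table area.
preview_to_area() {
  local left=$1 top=$2 width=$3 height=$4
  awk -v l="$left" -v t="$top" -v w="$width" -v h="$height" \
    'BEGIN { printf "%s,%s,%s,%s\n", t, l, t + h, l + w }'
}

# Example with made-up measurements: left=50, top=40, width=700, height=560
preview_to_area 50 40 700 560   # prints 40,50,600,750
```

The numbers in the example call are invented for illustration; substitute the values shown in the inspector for your own selection.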
Using tabula-extractor with coordinates
Open up your terminal. You can now use these coordinates like this:
java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
where:
- $y1, $x1, etc. are the numbers you got above
- $csvfile is the name of a CSV file you'll write the tables out to
- $filename is the name of the PDF file you're reading in.
You can safely ignore any SLF4J: warning messages.
Example:
$ java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"
You can write a script like this to iterate over many identical-format PDFs in a directory:
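#!/bin/bash
# Extract the same table area from every PDF in the directory.
# The quotes around $f keep file names that contain spaces intact.
for f in /path/to/dir/*.pdf; do
    java -jar /path/to/tabula/tabula-0.9.0-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done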
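One possible variation on this loop writes the CSV files into a separate directory, names them report.csv rather than report.pdf.csv, and skips cleanly when the directory contains no PDFs. This is a sketch under assumptions: `csv_name_for`, `extract_all`, `JAR`, and `AREA` are names made up for illustration, and the failure message relies only on java exiting with a non-zero status, which may not cover every kind of extraction problem.

```shell
#!/bin/bash
# Sketch of a batch run that writes CSVs into a separate directory.
# JAR, AREA, and both function names are placeholders for illustration.
JAR="/path/to/tabula/tabula-0.9.0-jar-with-dependencies.jar"
AREA="49.5,52.3285714,599.6571428,743.91428571"

# Map /some/dir/report.pdf to <outdir>/report.csv.
csv_name_for() {
  local pdf=$1 outdir=$2
  printf '%s/%s.csv\n' "$outdir" "$(basename "$pdf" .pdf)"
}

# Run the extractor on every PDF in indir and note any file for which
# the java command exits with a non-zero status.
extract_all() {
  local indir=$1 outdir=$2 f
  mkdir -p "$outdir"
  for f in "$indir"/*.pdf; do
    [ -e "$f" ] || continue   # the glob matched nothing
    java -jar "$JAR" -p all -a "$AREA" -o "$(csv_name_for "$f" "$outdir")" "$f" \
      || echo "failed: $f" >&2
  done
}
```

You would then run something like `extract_all /path/to/dir /path/to/output`. Keeping the output in its own directory means a second run over the input directory never has to step around earlier results.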