What is this?

An extractor that converts PDF files to:

PNG files (one per page)
ALTO XML files (one of the layout-aware output formats Tesseract can write)
JSON files (a direct conversion of the ALTO XML)

How does this work?

It uses shell tools to do the heavy lifting:

ImageMagick for PDF to PNG conversion
Tesseract OCR for PNG to ALTO XML conversion
xsltproc and a stylesheet (to-json.xslt) for XML to JSON conversion
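
The three stages above can be sketched as small shell functions. This is an illustrative sketch, not the contents of run.sh: the function names, the 300 DPI density, the page-%03d file naming and the DRY_RUN switch are all assumptions made for the example. Only the tool names and the to-json.xslt stylesheet come from this repository.

```shell
#!/bin/sh
# Illustrative sketch only; the real run.sh may use different flags and file names.

# Run a command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stage 1: rasterise every PDF page to a numbered PNG (page-000.png, ...).
# The density of 300 is an assumed value. On ImageMagick 7 the command is
# usually `magick` rather than `convert`.
pdf_to_png() {
    run convert -density 300 "$1" "$2/page-%03d.png"
}

# Stage 2: OCR one PNG; the `alto` config makes Tesseract write <base>.xml.
png_to_alto() {
    run tesseract "$1" "${1%.png}" alto
}

# Stage 3: apply the stylesheet to one ALTO file and write <base>.json.
alto_to_json() {
    run xsltproc -o "${1%.xml}.json" to-json.xslt "$1"
}
```

With DRY_RUN=1 set, calling pdf_to_png doc.pdf out prints the ImageMagick command instead of running it, which is a cheap way to check the file naming before processing a large PDF. A full run would call pdf_to_png once, then loop over the generated PNG files with png_to_alto and over the resulting XML files with alto_to_json.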

How do I run this?

The wrapper script, run.sh, works in one of two modes: it can run the commands locally, using packages you have installed on your machine, or it can act as a front end to a Docker image that has all the required packages pre-installed. To see the available options:

./run.sh -h
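
The choice between local tools and Docker could be made along the following lines. This is a hedged sketch rather than the logic run.sh actually uses: pick_backend, docker_command, the image name pdftxt-example and the /data mount point are all hypothetical names chosen for the example.

```shell
#!/bin/sh
# Hypothetical sketch of a local-or-Docker dispatcher; not the actual run.sh.

# Print "local" if every named tool is on PATH, otherwise "docker".
pick_backend() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo docker; return; }
    done
    echo local
}

# Print a docker command line that mounts the given input directory.
# IMAGE is a placeholder image name and /data is an assumed mount point.
docker_command() {
    IMAGE=${IMAGE:-pdftxt-example}
    # Bind mounts need an absolute host path, so resolve the directory first.
    host_dir=$(cd "$1" && pwd)
    echo "docker run --rm -v $host_dir:/data $IMAGE"
}
```

For example, pick_backend convert tesseract xsltproc prints local only when all three tools are installed. Resolving the directory with cd and pwd before mounting it is one simple way to let a Docker front end accept relative paths.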

BraveSirRobin/pdftxt

Folders and files

Latest commit

History

Repository files navigation

What is this?

How does this work?

How do I run this?

About

Resources

Stars

Watchers

Forks

Releases 8

Packages 0

Languages

Packages