pimmer

Exploratory code for PDF image mining. A multi-page PDF is split into pages and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining with added PDF splitting, a simple GUI and queue management.

Install

Make sure you have Git and Docker with docker-compose installed.
Get the latest version of this repository: git clone --depth 1 https://github.com/peterk/pimmer.git.
Copy the example_env file to .env and edit settings.
Make sure you have a folder called data in the project root folder (jobs and resulting image files will end up here). You can map output to a different local folder for the worker in docker-compose.yml.
Run docker-compose up -d. Wait a minute until the queue and worker are up.

The service is now running on http://localhost:7777.

If you plan to process a large number of documents, you can start more workers with docker-compose up -d --scale worker=5 and then post files with curl to the /process/ endpoint:

curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/
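To queue a whole folder of PDFs, the curl call above can be wrapped in a small script. The following is a minimal sketch, not part of pimmer: the helper names `build_curl_command` and `submit_folder` are hypothetical, and it assumes the service is reachable at the address given above and that curl is on the PATH.

```python
import subprocess
from pathlib import Path

# Assumption: the service address from this README plus the /process/ endpoint.
PROCESS_URL = "http://localhost:7777/process/"


def build_curl_command(pdf_path, url=PROCESS_URL):
    """Return the curl argument list that uploads one PDF as the "file" form field."""
    # --fail makes curl exit with a non-zero status on HTTP error responses.
    return ["curl", "--silent", "--fail", "-F", f"file=@{pdf_path}", url]


def submit_folder(folder, url=PROCESS_URL):
    """Post every *.pdf in folder (sorted by name); return the files curl failed on."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_curl_command(pdf, url), capture_output=True)
        if result.returncode != 0:
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    # "testdata" is the folder used in the curl example above.
    for pdf in submit_folder("testdata"):
        print(f"upload failed: {pdf}")
```

Files are posted one at a time; the scaled-up workers pick the jobs off the queue in parallel.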

Please report bugs and feedback in the GitHub issue tracker.

Results

The detected images end up as individual image files in job folders under ./data/results.

Each job folder also contains one JSON file per page with the coordinates of the detected images.
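The per-page coordinate files can be collected for further processing with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `load_page_coordinates` is hypothetical, the files are assumed to sit directly in the job folder with a .json extension, and since the layout of the coordinate data is not described here, each file is returned as parsed without interpreting it.

```python
import json
from pathlib import Path


def load_page_coordinates(job_folder):
    """Map each JSON file name in job_folder to its parsed content."""
    pages = {}
    for json_file in sorted(Path(job_folder).glob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            pages[json_file.name] = json.load(handle)
    return pages


if __name__ == "__main__":
    # Hypothetical job folder name; use one that exists under ./data/results.
    for name, data in load_page_coordinates("data/results/example_job").items():
        print(name, data)
```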

A digitized hat catalog like this:

... results in all the individual hat images:
