Welcome to Doctor, Free Law Project's microservice for converting, extracting and modifiying documents.
The goal of this microservice is to isolate out these tools to let Courtlistener (a django site) be streamlined and easier to maintain. This service is setup to run with gunicorn with a series of endpoints that accept JSON, files, and parameters to transform Audio, Documents as well as extract, modify and replace metadata, text and other data.
In general, CL houses documents scraped and collected from hundreds of sources and these documents take many varied formats and versions.
This tool is designed to be connected securely from CL via a docker network called cl_net_overlay. But it can also be used directly by exposing port 5050. For more about development of the tool see the (soon coming) DEVELOPING.md file.
Assuming you have docker installed run:
docker run -d -p 5050:5050 freelawproject/doctor:latest
This will expose the endpoints on port 5050 with four gunicorn workers.
If you wish to have more gunicorn workers, you'll want to set the DOCTOR_WORKERS environment variable. You can do that with:
docker run -d -p 5050:5050 -e DOCTOR_WORKERS=16 freelawproject/doctor:latest
If you are doing OCR, you will certainly want more workers.
After the image is running, you should be able to test that you have a working environment by running
curl http://localhost:5050
which should return a text response.
Heartbeat detected.
The service currently supports the following tools:
- Convert audio files from wma, ogg, wav to MP3.
- Convert an image or images to a PDF.
- Identify the mime type of a file.
- OCR text from an image PDF.
- Extract text from PDF, RTF, DOC, DOCX, or WPD, HTML, TXT files.
- Create a thumbnail of the first page of a PDF.
- Get page count for a document.
A brief description and curl command for each endpoint is provided below.
Given a pdf, extract the text from it.
curl 'http://localhost:5050/extract/pdf/text/?ocr_available=True' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
This is an important endpoint. Given a document it will extract out the text and return a page count (if possible) along with general metadata used in CL.
curl 'http://localhost:5050/extract/doc/text/' \
-X 'POST' \
-F "file=@doctor/test_assets/vector-pdf.pdf"
or if you need to OCR the document you pass in the ocr_available parameter.
curl 'http://localhost:5050/extract/doc/text/?ocr_available=True' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
Presuming that the request was valid you should receive the following JSON response back.
This returns an JSON Response
And includes the following keys extracted_by_ocr
, err
, page_count
, content
The method accepts PDF (image and vector), DOC, DOCX, HTML, TXT and WPD files.
This method takes a document and returns the page count.
curl 'http://localhost:5050/utils/page-count/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
This will return an HTTP response with page count. In the above example it would return 2.
This method takes a document and returns the mime type.
curl 'http://localhost:5050/utils/mime-type/?mime=False' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
returns as JSON response identifying the document type
{"mimetype": "PDF document, version 1.3"}
and
curl 'http://localhost:5050/utils/mime-type/?mime=True' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
returns as JSON response identifying the document type
{"mimetype": "application/pdf"}
Another example
curl 'http://localhost:5050/utils/mime-type/?mime=True' \
-X 'POST' \
-F "file=@doctor/test_assets/word-doc.doc"
returns
{"mimetype": "application/msword"}
This method is useful for identifying the type of document, incorrect documents and weird documents.
This method will take an image PDF and return the PDF with transparent text overlayed on the document. This allows users to copy and paste (more or less) from our OCRd text.
curl 'http://localhost:5050/utils/add/text/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf" \
-o image-pdf-with-embedded-text.pdf
This endpoint returns the duration of an MP3 file.
curl 'http://localhost:5050/utils/audio/duration/' \
-X 'POST' \
-F "file=@doctor/test_assets/1.mp3"
Given an image or indeterminate length, this endpoint will convert it to a pdf.
curl 'http://localhost:5050/convert/image/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/long-image.tiff" \
--output test-image-to-pdf.pdf
Keep in mind that this curl will write the file to the current directory.
Given a list of urls for images, this endpoint will convert them to a pdf.
curl 'http://localhost:5050/convert/images/pdf/?sorted_urls=%5B%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_1.tiff%22%2C+%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_2.tiff%22%5D' \
-X POST \
-o image.pdf
This returns the binary data of the pdf.
This method is used almost exclusively for financial disclosures.
Thumbnail takes a pdf and returns a png thumbnail of the first page.
curl 'http://localhost:5050/convert/pdf/thumbnail/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf" \
-o test-thumbnail.png
This returns the binary data of the thumbnail.
Keep in mind that this curl will also write the file to the current directory.
Give a PDF and a range or pages, this endpoint will return a zip file containing thumbnails for each page requested. For example if you want thumbnails for the first four pages you
curl 'http://localhost:5050/convert/pdf/thumbnails/?max_dimension=350&pages=%5B1%2C+2%2C+3%2C+4%5D' \
-X 'POST' \
-F "file=@doctor/test_assets/vector-pdf.pdf" \
-o thumbnails.zip
This will return four thumbnails in a zip file.
This endpoint takes an audio file and converts it to an MP3 file. This is used to convert different audio formats from courts across the country and standardizes the format for our end users.
This endpoint also adds the SEAL of the court to the MP3 file and updates the metadata to reflect our updates.
This isn't the cleanest of CURLs because we have to convert the large JSON file to a query string, but for proof of concept here is the result
curl 'http://localhost:5050/convert/audio/mp3/?audio_data=%7B%22court_full_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_short_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_pk%22%3A+%22test%22%2C+%22court_url%22%3A+%22http%3A%2F%2Fwww.example.com%2F%22%2C+%22docket_number%22%3A+%22docket+number+1+005%22%2C+%22date_argued%22%3A+%222020-01-01%22%2C+%22date_argued_year%22%3A+%222020%22%2C+%22case_name%22%3A+%22SEC+v.+Frank+J.+Custable%2C+Jr.%22%2C+%22case_name_full%22%3A+%22case+name+full%22%2C+%22case_name_short%22%3A+%22short%22%2C+%22download_url%22%3A+%22http%3A%2F%2Fmedia.ca7.uscourts.gov%2Fsound%2Fexternal%2Fgw.15-1442.15-1442_07_08_2015.mp3%22%7D' \
-X 'POST' \
-F "file=@doctor/test_assets/1.wma"
This returns the audio file as a file response.
Testing is designed to be run with the docker-compose.dev.yml
file. To see more about testing
checkout the DEVELOPING.md file.