A simple command-line tool to convert PDF files into Markdown format using the Mistral AI OCR API. This tool also extracts embedded images and saves them in a subdirectory relative to the output markdown file.
You can install the package directly from PyPI using pip:
pip install mistral-pdf-to-markdown
If you want to use the pdf2md command from anywhere in your system without activating a specific virtual environment, the recommended way is to use pipx:
- Install pipx (if you don't have it already). Follow the official pipx installation guide. A common method is:
  python3 -m pip install --user pipx
  python3 -m pipx ensurepath
  (Restart your terminal after running ensurepath.)
- Install the package using pipx:
  pipx install mistral-pdf-to-markdown
This installs the package in an isolated environment but makes the pdf2md command globally available.
Alternatively, if you want to install from the source:
- Clone the repository:
  git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
  cd mistral-pdf-to-markdown
- Install dependencies using Poetry:
  poetry install
- Set your Mistral API key. You can set it as an environment variable:
  export MISTRAL_API_KEY='your_api_key_here'
  Alternatively, you can create a .env file in the project root directory with the following content:
  MISTRAL_API_KEY=your_api_key_here
  You can also pass the API key directly using the --api-key option.
- Run the conversion:
The convert command processes a single PDF file.
poetry run pdf2md convert <path/to/your/document.pdf> [options]
Or, if you have activated the virtual environment (poetry shell):
pdf2md convert <path/to/your/document.pdf> [options]
Options for Single File Conversion:
- --output or -o: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input PDF but with a .md extension (e.g., document.md).
- --api-key: Provide the Mistral API key directly.
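The three ways of supplying the API key (the --api-key option, the MISTRAL_API_KEY environment variable, and a .env file) can be pictured with the sketch below. It is not the tool's actual code: the function names are hypothetical, and the precedence shown (command line first, then environment, then .env) is an assumption, since the README does not state which source wins.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_dotenv_value(dotenv_path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=value .env file, or None."""
    if not dotenv_path.is_file():
        return None
    for raw_line in dotenv_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes, as in KEY='value'.
            return value.strip().strip("'\"")
    return None


def resolve_api_key(
    cli_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
    dotenv_path: Path = Path(".env"),
) -> Optional[str]:
    """Pick an API key. ASSUMED order: --api-key, environment, .env file."""
    if cli_value:
        return cli_value
    env_value = environ.get("MISTRAL_API_KEY")
    if env_value:
        return env_value
    return read_dotenv_value(dotenv_path, "MISTRAL_API_KEY")
```

If none of the three sources provides a key, the sketch returns None; a real command-line tool would more likely stop with an error message at that point.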
The convert-dir command processes all PDF files in a specified directory.
poetry run pdf2md convert-dir <path/to/directory/with/pdfs> [options]
Or, if you have activated the virtual environment (poetry shell):
pdf2md convert-dir <path/to/directory/with/pdfs> [options]
Options for Directory Conversion:
- --output-dir or -o: Specify the directory where output Markdown files will be saved. If not provided, it defaults to the same directory as the input PDFs.
- --api-key: Provide the Mistral API key directly.
- --max-workers or -w: Maximum number of concurrent conversions (default: 2). Increase this value to process more files in parallel.
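To illustrate what a bounded number of concurrent conversions means in practice, here is a minimal sketch using a thread pool. It is an illustration only: the README does not say how the tool implements concurrency, so the use of ThreadPoolExecutor, the function name convert_directory, and the convert_one callback (standing in for a single PDF conversion) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Optional, Union


def convert_directory(
    input_dir: Union[str, Path],
    convert_one: Callable[[Path, Path], None],  # hypothetical per-file converter
    output_dir: Optional[Union[str, Path]] = None,
    max_workers: int = 2,
) -> Dict[str, Optional[Exception]]:
    """Run convert_one(pdf, markdown_path) for every PDF, max_workers at a time.

    Returns a mapping from PDF file name to None on success, or to the
    exception raised while converting that file.
    """
    input_dir = Path(input_dir)
    # Mirror the documented default: write next to the input PDFs.
    target_dir = Path(output_dir) if output_dir is not None else input_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    # Matching the extension case-insensitively is an assumption.
    pdfs = sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )

    results: Dict[str, Optional[Exception]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(convert_one, pdf, target_dir / f"{pdf.stem}.md"): pdf
            for pdf in pdfs
        }
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                future.result()
                results[pdf.name] = None
            except Exception as exc:  # keep going if one file fails
                results[pdf.name] = exc
    return results
```

Threads suit this kind of workload because each conversion mostly waits on a remote API. A higher worker count does not guarantee a proportional speed-up, since the service may apply its own rate limits.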
Image Handling:
The script will attempt to extract images embedded in the PDF.
- Images are saved in a subdirectory named <output_filename_stem>_images (e.g., if the output is report.md, images will be in report_images/).
- The generated Markdown file will contain relative links pointing to the images in this subdirectory.
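The naming rules described so far (the default .md output name and the <output_filename_stem>_images directory) can be expressed with pathlib as in the sketch below. The helper names are hypothetical and this is not taken from the tool's source; the image file name used in the usage note is likewise made up for illustration.

```python
from pathlib import Path, PurePosixPath
from typing import Union

PathLike = Union[str, Path]


def default_output_path(pdf_path: PathLike) -> Path:
    """Same location and name as the PDF, but with a .md extension."""
    return Path(pdf_path).with_suffix(".md")


def image_dir_for(markdown_path: PathLike) -> Path:
    """Sibling directory named <output_filename_stem>_images."""
    markdown_path = Path(markdown_path)
    return markdown_path.with_name(f"{markdown_path.stem}_images")


def image_link(markdown_path: PathLike, image_name: str, alt_text: str = "") -> str:
    """Markdown image link, relative to the Markdown file's own directory."""
    # Forward slashes keep the link portable across operating systems.
    relative = PurePosixPath(image_dir_for(markdown_path).name) / image_name
    return f"![{alt_text}]({relative})"
```

For example, image_dir_for("output/report.md") gives output/report_images, and image_link("output/report.md", "img-0.jpeg") gives `![](report_images/img-0.jpeg)`. Because the link is relative, the Markdown file and its image directory can be moved together without breaking it.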
Examples:
# Convert a single PDF file (when installed with Poetry)
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md
# Convert a single PDF file (when installed globally with pipx)
pdf2md convert ./my_report.pdf -o ./output/report.md
This command will create:
- ./output/report.md (the Markdown content)
- ./output/report_images/ (a directory containing extracted images)
# Convert all PDF files in a directory (when installed with Poetry)
poetry run pdf2md convert-dir ./pdf_documents/ -o ./markdown_output/ -w 4
# Convert all PDF files in a directory (when installed globally with pipx)
pdf2md convert-dir ./pdf_documents/ -o ./markdown_output/ -w 4
This command will:
- Process all PDF files in the ./pdf_documents/ directory
- Save the resulting Markdown files in the ./markdown_output/ directory
- Process up to 4 files concurrently
- Create image directories for each output file as needed
An example output generated from example.pdf (included in the repository) can be found in example.md, with its corresponding images located in the example_images/ directory.
This project is licensed under the ISC License - see the LICENSE file for details.