This application converts PDF documents into Markdown. It uses `pdf2image` to render each page as an image and OpenAI's language models to extract and transform the text, then writes the contents out as Markdown that adheres to the specified formatting rules.
- Automated Conversion: Processes all PDF files placed in a designated folder, converting them into a series of images, and then extracting the textual content.
- Markdown Formatting: Converts the textual content extracted from the PDF pages into Markdown, following specified formatting rules.
- Table-to-JSON Conversion: Any tables found within the PDF are automatically converted into JSON representations.
- Header/Footer Ignoring: The system attempts to ignore headers and footers in the PDF, focusing on the main content.
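
The folder-based processing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `find_pdfs` and `output_name_for` are hypothetical, and the sketch assumes the input folder is `pdf/` and that only files directly inside it are considered.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_name_for(pdf_path):
    """Map e.g. pdf/document.pdf to the file name document.txt."""
    return Path(pdf_path).stem + ".txt"


if __name__ == "__main__":
    pdf_dir = Path("pdf")  # assumed input folder
    if pdf_dir.is_dir():
        for pdf in find_pdfs(pdf_dir):
            print(f"{pdf.name} -> {output_name_for(pdf)}")
```

Matching the extension case-insensitively means files such as `REPORT.PDF` are picked up as well.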
- PDF to Image Conversion: Each PDF page is converted into an image using `pdf2image`. This ensures that even documents with complex formatting can be reliably processed.
- Text Extraction via LLM: The images are then sent to an OpenAI-powered Large Language Model (LLM). The LLM:
  - Extracts the textual content from the image.
  - Converts the extracted text into Markdown while adhering to the given rules:
    - Ignore Headers and Footers: No extraneous header/footer details will be included.
    - Convert Tables to JSON: Any tabular content is rendered as JSON objects rather than Markdown tables.
- Output Generation: The resulting Markdown content (including embedded JSON for tables) is appended to a `.txt` file with a name corresponding to the original PDF (but without its extension).
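
The steps above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in `pdf_reader.py`: the prompt wording, the PNG encoding, the default model name `gpt-4o`, and all function names are hypothetical. The third-party imports are placed inside the functions that need them, so the encoding helpers can be tried without `pdf2image` or an API key.

```python
import base64
import io

# Assumed prompt wording; the real prompt used by the project may differ.
PROMPT = (
    "Extract the text from this page image and return it as Markdown. "
    "Ignore headers and footers. Render any table as JSON, "
    "not as a Markdown table."
)


def to_data_url(png_bytes):
    """Encode PNG bytes as a base64 data URL."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def build_message_content(png_bytes, prompt=PROMPT):
    """Build a text + image content list in the OpenAI chat format."""
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": to_data_url(png_bytes)}},
    ]


def pdf_pages_as_png(pdf_path):
    """Yield each page of the PDF as PNG bytes (needs pdf2image and Poppler)."""
    from pdf2image import convert_from_path  # third-party, imported lazily

    for page in convert_from_path(pdf_path):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        yield buffer.getvalue()


def page_to_markdown(png_bytes, model_name="gpt-4o"):
    """Send one page image to the model and return its reply text.

    The model name is an assumption; ChatOpenAI reads the API key from
    the OPENAI_API_KEY environment variable.
    """
    from langchain_core.messages import HumanMessage
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model=model_name)
    reply = llm.invoke([HumanMessage(content=build_message_content(png_bytes))])
    return reply.content
```

Each page's reply would then be appended to the output `.txt` file in page order.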
- Python 3.8+ recommended.
- Dependencies:
- pdf2image
- base64 (Standard library)
- os (Standard library)
- langchain-core and langchain-openai for LLM integration.
- An active OpenAI API key with appropriate permissions.
Note: The code references an API key of the form `sk-****************************`; make sure you replace this with your own API key.
- Clone the Repository:

      git clone https://github.com/kkaarrss/pdf-to-markdown.git
      cd pdf-to-markdown

- Create and Activate a Virtual Environment (optional but recommended):

      python3 -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install Dependencies:

      pip install -r requirements.txt

  Ensure that `pdf2image`, `langchain-core`, and `langchain-openai` are listed in `requirements.txt`, along with `pillow` (for `pdf2image`) and any other required packages.
- Set up Poppler (required by `pdf2image`):
  - Linux: Install `poppler-utils` via your package manager (e.g., `sudo apt-get install poppler-utils`).
  - macOS: Use `brew install poppler`.
  - Windows: Download the latest Poppler binaries and add the `bin` directory to your `PATH`.
- Configure Your OpenAI API Key: Replace the placeholder `sk-****************************` in the code with your actual OpenAI API key. Alternatively, you can set it as an environment variable or integrate it using a configuration file.
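
Before the first run, a small preflight check along these lines can confirm that Poppler and the API key are in place. It is a sketch under assumptions: the function names are hypothetical, the key is expected in the `OPENAI_API_KEY` environment variable (the environment-variable option mentioned above), and Poppler is detected by looking for the `pdftoppm` and `pdfinfo` executables that `pdf2image` invokes.

```python
import os
import shutil

POPPLER_TOOLS = ("pdftoppm", "pdfinfo")
PLACEHOLDER_PREFIX = "sk-***"  # matches the placeholder shipped in the code


def missing_poppler_tools(which=shutil.which):
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


def load_api_key(env=None):
    """Read the key from OPENAI_API_KEY, rejecting empty or placeholder values."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError("Set OPENAI_API_KEY to a real OpenAI API key.")
    return key


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    try:
        load_api_key()
        print("OPENAI_API_KEY is set.")
    except RuntimeError as error:
        print(error)
```

Keeping the key in an environment variable also avoids committing it to the repository by accident.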
- Place your PDFs: Place all the PDF files you want to convert into the `pdf/` directory.
- Run the Script:

      python3 pdf_reader.py

- Output: For each PDF, the script will generate a `.txt` file with a name matching the original PDF. This `.txt` file will contain the Markdown-formatted output.
  - Headers and footers are omitted.
  - Tables in the original PDF are represented as JSON objects in the Markdown output.

If you have a file named `document.pdf` in the `pdf/` folder, after running the script, you might see `document.txt` containing the extracted Markdown:

````markdown
# Introduction
This is the introduction text from the PDF.

**JSON Table Representation:**
```json
[
  {"Column1": "Value1", "Column2": "Value2"},
  {"Column1": "Value3", "Column2": "Value4"}
]
```
````
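
Because tables arrive as JSON inside the Markdown, downstream code can pull them back out of the output file. The sketch below assumes each table sits in its own `json`-fenced block, as in the example above; the model's actual output may not always follow that shape, and `extract_json_tables` is a hypothetical helper, not part of the project.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_tables(markdown_text):
    """Return the parsed contents of every json-fenced block in the text."""
    tables = []
    for match in JSON_BLOCK.finditer(markdown_text):
        try:
            tables.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON instead of failing outright.
            continue
    return tables
```

For instance, reading `document.txt` and passing its text to `extract_json_tables` would yield one Python list per table found.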
- No Output File Created: Ensure that the PDF is accessible and that `pdf2image` can read and convert it. Check if `poppler` is correctly installed.
- Incomplete or Incorrect Output: The quality of the OCR and text extraction depends on the clarity of the PDF. Scanned PDFs with poor image quality may yield suboptimal results.
- API Rate Limits or Errors: If you encounter API errors, you may be hitting rate limits or have an invalid API key. Check your OpenAI account and ensure your key is correct.
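
For transient rate-limit errors, one common mitigation is to retry with exponential backoff. The helper below is a generic sketch and not something the project is known to include; `call_with_backoff` and its parameters are hypothetical. In practice `retry_on` should be narrowed to the rate-limit exception raised by the client library in use, since retrying will not help with an invalid API key.

```python
import time


def call_with_backoff(func, attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A per-page request could then be wrapped as `call_with_backoff(lambda: send_page(image))`, where `send_page` stands in for whatever function performs the API call.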
Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.
This project is licensed under the Apache License 2.0.