PDF-to-Markdown Conversion App

This application provides a streamlined method for converting PDF documents into Markdown format. Utilizing pdf2image for image conversion and OpenAI's language models for text extraction and transformation, it automatically processes PDFs and outputs their contents in Markdown, adhering to specified formatting rules.

Features

Automated Conversion: Processes all PDF files placed in a designated folder, converting them into a series of images, and then extracting the textual content.
Markdown Formatting: Converts the textual content extracted from the PDF pages into Markdown, following specified formatting rules.
Table-to-JSON Conversion: Any tables found within the PDF are automatically converted into JSON representations.
Header/Footer Ignoring: The system attempts to ignore headers and footers in the PDF, focusing on the main content.

How It Works

PDF to Image Conversion:
Each PDF page is converted into an image using pdf2image. This ensures that even documents with complex formatting can be reliably processed.
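A minimal sketch of this step, assuming the PDFs sit in a folder such as pdf/. `convert_from_path` is pdf2image's function for rendering a file to PIL images; the helper names `list_pdfs` and `pdf_to_images` and the `dpi=200` setting are illustrative and not taken from pdf_reader.py.

```python
import os


def list_pdfs(folder):
    """Return sorted paths of all .pdf files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )


def pdf_to_images(pdf_path, dpi=200):
    """Render each page of a PDF to a PIL image (requires Poppler)."""
    # Imported lazily so list_pdfs works even without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)
```

A higher `dpi` generally gives the model a sharper page to read, at the cost of larger images and longer uploads.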
Text Extraction via LLM:
The images are then sent to an OpenAI-powered Large Language Model (LLM). The LLM:
- Extracts the textual content from the image.
- Converts the extracted text into Markdown while adhering to the given rules:
  - Ignore Headers and Footers: No extraneous header/footer details will be included.
  - Convert Tables to JSON: Any tabular content is rendered as JSON objects rather than Markdown tables.
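The request to the model might be put together roughly as follows. This is a sketch under assumptions: the prompt wording, the `gpt-4o` model name, and all helper names are hypothetical, and the actual code in pdf_reader.py may differ. It shows a page image being base64-encoded into a data URL and sent alongside the text instructions in one multimodal message.

```python
import base64
import io

# Hypothetical prompt wording; the real prompt may be phrased differently.
PROMPT = (
    "Extract the text from this page and return it as Markdown. "
    "Ignore headers and footers. Convert any tables to JSON."
)


def image_to_png_bytes(image):
    """Serialize a PIL image (as returned by pdf2image) to PNG bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def build_message_content(png_bytes, prompt=PROMPT):
    """Build a text-plus-image content list for a multimodal chat message."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        },
    ]


def page_to_markdown(png_bytes, model="gpt-4o"):
    """Send one page image to the model and return its Markdown reply."""
    # Lazy imports keep the helpers above usable without LangChain installed.
    from langchain_core.messages import HumanMessage
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model=model)  # reads OPENAI_API_KEY from the environment
    reply = llm.invoke([HumanMessage(content=build_message_content(png_bytes))])
    return reply.content
```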
Output Generation:
The resulting Markdown content (including embedded JSON for tables) is appended to a .txt file named after the original PDF, with .txt in place of the .pdf extension (for example, document.pdf produces document.txt).
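The naming and append behaviour can be sketched as below. The function names and the `out_dir` parameter are hypothetical; this README does not state which directory the output file is written to, so the current directory is used here as an assumption.

```python
import os


def output_path_for(pdf_path, out_dir="."):
    """Map 'pdf/report.pdf' to '<out_dir>/report.txt'."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, stem + ".txt")


def append_page(pdf_path, markdown, out_dir="."):
    """Append one page of Markdown to the PDF's output file."""
    path = output_path_for(pdf_path, out_dir)
    # Mode "a" appends, so re-running the script adds to an existing file.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markdown.rstrip("\n") + "\n\n")
    return path
```

Because the file is opened in append mode, converting the same PDF twice leaves two copies of its content in the output; deleting the old .txt file before a re-run avoids that.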

Requirements

Python 3.8+ recommended.
Dependencies:
- pdf2image
- base64 (Standard library)
- os (Standard library)
- langchain-core and langchain-openai for LLM integration.
- An active OpenAI API key with appropriate permissions.
  Note: The code references an API key of the form sk-****************************; make sure you replace this with your own API key.

Installation and Setup

Clone the Repository:

```
git clone https://github.com/kkaarrss/pdf-to-markdown.git
cd pdf-to-markdown
```

Create and Activate a Virtual Environment (optional but recommended):

```
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install Dependencies:
```
pip install -r requirements.txt
```
Ensure that pdf2image, langchain-core, and langchain-openai are listed in requirements.txt, along with pillow (for pdf2image), and other required packages.
Set up Poppler (required by pdf2image):
- Linux: Install poppler-utils via your package manager (e.g., sudo apt-get install poppler-utils).
- macOS: Use brew install poppler.
- Windows: Download the latest Poppler binaries and add the bin directory to your PATH.
Configure Your OpenAI API Key: Replace the placeholder sk-**************************** in the code with your actual OpenAI API key. Alternatively, you can set it as an environment variable or integrate it using a configuration file.
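For the environment-variable route, a minimal sketch is shown here. `OPENAI_API_KEY` is the variable that `ChatOpenAI` from langchain-openai reads by default; the helper name `get_openai_api_key` is hypothetical and not part of pdf_reader.py.

```python
import os


def get_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running the script."
        )
    return key
```

Keeping the key out of the source file also makes it harder to commit it to version control by accident.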

Usage

Place your PDFs: Place all the PDF files you want to convert into the pdf/ directory.
Run the Script:
```
python3 pdf_reader.py
```
Output: For each PDF, the script will generate a .txt file with a name matching the original PDF. This .txt file will contain the Markdown-formatted output.
- Headers and footers are omitted.
- Tables in the original PDF are represented as JSON objects in the Markdown output.
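Since tables arrive as JSON embedded in the Markdown, a downstream script can parse them back into Python objects. The sketch below assumes the model wraps each table in a fenced block tagged json, as in the example in this README; that is not guaranteed for every page, so invalid blocks are skipped. The name `extract_tables` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches fenced blocks tagged json and captures their body.
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_tables(markdown):
    """Return every table found in fenced json blocks, parsed from JSON."""
    tables = []
    for match in JSON_BLOCK.finditer(markdown):
        try:
            tables.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip blocks the model did not emit as valid JSON.
            continue
    return tables
```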

Example

If you have a file named document.pdf in the pdf/ folder, then after running the script you might see a document.txt file containing extracted Markdown such as:

# Introduction

This is the introduction text from the PDF.

**JSON Table Representation:**
```json
[
  {"Column1": "Value1", "Column2": "Value2"},
  {"Column1": "Value3", "Column2": "Value4"}
]
```

Troubleshooting

No Output File Created:
Ensure that the PDF is accessible and that pdf2image can read and convert it. Check if poppler is correctly installed.
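One quick way to check the Poppler installation is to look for its command-line tools on PATH. pdf2image relies on Poppler executables such as `pdftoppm` and `pdfinfo`; the helper name below is hypothetical.

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the Poppler executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means both tools were found; any names returned point to a missing or mis-configured Poppler install.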
Incomplete or Incorrect Output:
The quality of the OCR and text extraction depends on the clarity of the PDF. Scanned PDFs with poor image quality may yield suboptimal results.
API Rate Limits or Errors:
If you encounter API errors, you may be hitting rate limits or have an invalid API key. Check your OpenAI account and ensure your key is correct.
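For transient failures such as rate limits, one common mitigation is to retry with exponential backoff. The sketch below is a generic helper, not something pdf_reader.py is known to include; `call_with_retry` and its parameters are hypothetical.

```python
import time


def call_with_retry(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * (2 ** attempt))
```

Catching every exception keeps the sketch short, but it also retries errors that will never succeed, such as an invalid API key. In practice, narrowing the `except` clause to the client's rate-limit error is the safer choice.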

Contributing

Contributions, bug reports, and feature requests are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the Apache License 2.0.
