LLM-Tag-System README

Introduction

This code uses Tesseract OCR and Named Entity Recognition (NER) to extract text from a PDF file, sort it into specific categories, and then send the categorized text to an API.

Requirements

Python 3.9+
Tesseract OCR engine
pdf2image library (depends on Pillow and requires Poppler)
pytesseract library
Requests library
json module (part of the Python standard library)
Ollama with the llama2 model for local LLM processing

Prerequisites

Before running the script, ensure the following steps are completed:

Move all PDFs into one root folder (e.g., ./All) using mkdir_2.sh.
Distribute the PDF files into sequentially numbered directories using Mkdir.sh.

This ensures that the script can find and process the PDF files correctly.
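
As a rough illustration of the layout these steps produce, the sketch below lists the PDFs inside numbered subfolders of a root folder. The function name find_pdfs is hypothetical and is not taken from FINAL.py. It assumes the subfolder names are plain integers and the files end in .pdf.

```python
from pathlib import Path


def find_pdfs(root_folder):
    """Map each numbered subfolder name to its PDF files, in numeric order.

    Assumption: subfolders are named 1, 2, 3, ... as created by Mkdir.sh.
    Subfolders whose names are not plain integers are ignored.
    """
    root = Path(root_folder)
    numbered = sorted(
        (d for d in root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),  # numeric sort, so "10" comes after "2"
    )
    return {d.name: sorted(d.glob("*.pdf")) for d in numbered}
```

Calling find_pdfs("./All") before running the script is a quick way to check that every PDF ended up in a numbered folder.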

Usage

Modify the root_folder variable: Replace it with the actual path to your root folder.
Modify the stop_keywords list: If you want to halt OCR detection on specific keywords, add them to this list.
Run the script: python FINAL.py
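
The two settings might look like the following once edited. The variable names come from the steps above. The values are placeholders, and the exact form FINAL.py expects is an assumption.

```python
# Placeholder values; replace them with your own before running FINAL.py.
root_folder = "./All"  # root folder from the Prerequisites example
stop_keywords = ["References", "Appendix"]  # hypothetical keywords
```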

How the code works

PDF to Images: The script converts the PDF file into images and extracts text using Tesseract.
Text Preprocessing: The extracted text is processed to remove unnecessary characters and format it for the API.
Category Extraction: The script identifies the stop keywords and removes the lines containing them from the extracted text.
API Submission: The categorized text is sent to an API endpoint.
Subfolder Processing: The script can process multiple subfolders containing PDF files.
JSON Processing: The API returns raw JSON, which requires further processing. The formatted output is stored as formatted_response.json in the same directory as the script.
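
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in FINAL.py: all function names are hypothetical, stop-keyword matching is assumed to be case-insensitive, and the endpoint is assumed to be a local Ollama server at its default address. The standard-library urllib stands in for Requests so the sketch needs no third-party package beyond the OCR libraries.

```python
import json
import re
import urllib.request


def ocr_pdf(pdf_path):
    """Convert each PDF page to an image and OCR it with Tesseract."""
    # Imported here so the text helpers below work without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def clean_text(text):
    """Collapse runs of spaces and tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def drop_stop_lines(text, stop_keywords):
    """Remove every line containing a stop keyword (assumed case-insensitive)."""
    lowered = [kw.lower() for kw in stop_keywords if kw]
    kept = [
        line
        for line in text.splitlines()
        if not any(kw in line.lower() for kw in lowered)
    ]
    return "\n".join(kept)


def build_payload(text, model="llama2"):
    """Build a request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": text, "stream": False}


def send_to_api(payload, url="http://localhost:11434/api/generate"):
    """POST the payload as JSON and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A full run would chain the helpers: pass the result of ocr_pdf through clean_text and drop_stop_lines, wrap it with build_payload, and hand it to send_to_api. With streaming disabled, Ollama should return a single JSON object whose response field holds the generated text.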

Note

The script uses the llama2 model from ollama for enhanced text analysis and categorization.
This code is designed to process a specific PDF file and categorize it based on the provided stop keywords.
You may need to adjust the stop_keywords list based on the specific content of your PDF file.
The script assumes that the Tesseract OCR engine and the Python libraries listed under Requirements are installed.
The script can be modified to process multiple PDF files in a directory.
