Document Intelligence is a proof of concept that automatically processes and analyzes PDF files, emails, and call recordings. It extracts relevant information, analyzes sentiment, clusters and categorizes documents by similarity, and generates summaries. That information is then used to assign the correct account number in an external system. The project aims to digest unstructured documents, such as phone calls and texts, and generate metadata indexes for the correct account in a remote system. The objective is for this to happen end to end, from unknown document to correctly filed record, without human intervention.
The project consists of several Python scripts that accomplish different tasks, including OCR processing, named entity recognition, similarity clustering, sentiment analysis, text summarization, audio transcription, email extraction, key-value pair extraction, and classification.
You can use pip to install the required third-party packages in your Python environment (standard-library modules such as sys, csv, threading, glob, time, warnings, concurrent, re, os, base64, functools, subprocess, and collections ship with Python and need no installation; the fitz module is provided by PyMuPDF):
pip install textblob scikit-learn torch spacy gensim thefuzz transformers numpy chardet stanza google-auth-oauthlib pdf2image sumy flair vaderSentiment fuzzywuzzy PyPDF4 openai nltk pandas requests google-api-python-client rake-nltk PyMuPDF pytesseract
You also need to download the necessary language model for spaCy:
python -m spacy download en_core_web_sm
Additionally, you need to install Poppler and Tesseract on your system and set the necessary dependencies for the Python project. To do this, follow these steps:
Download and install Poppler. For Windows users, up-to-date builds are linked from the pdf2image project page (https://pypi.org/project/pdf2image/).
Move the extracted directory to a desired location on your system.
Add the bin/ directory to your path environment variable.
set PATH=%PATH%;C:\python\autoindex\poppler-0.68.0\bin
Alternatively, you can use the poppler_path = r"c:\path\to\poppler-xx\bin" argument in the convert_from_path function in the extract_text_from_pdf.py script.
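For example, a minimal conversion call with an explicit poppler_path (all paths here are illustrative):

    from pdf2image import convert_from_path

    # Point pdf2image at the Poppler binaries explicitly (Windows example)
    pages = convert_from_path(
        "documents/sample.pdf",
        poppler_path=r"C:\python\autoindex\poppler-0.68.0\bin",
    )
    for i, page in enumerate(pages):
        page.save(f"page_{i}.png", "PNG")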
Download and install Tesseract OCR from its official website (https://github.com/ub-mannheim/tesseract/wiki).
After installation, make Tesseract reachable. On Windows the default installation directory is C:\Program Files\Tesseract-OCR, with the executable tesseract.exe inside it. Add that directory (not the executable itself) to your system's PATH variable, or configure the PyTesseract library to use the installed executable directly.
set PATH=%PATH%;C:\Program Files\Tesseract-OCR
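If you prefer not to modify PATH, point PyTesseract at the executable from Python:

    import pytesseract

    # Tell PyTesseract where the Tesseract binary lives (Windows default path)
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"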
Verify your Tesseract installation from within Python:
    import pytesseract
    print(pytesseract.get_tesseract_version())
You should see a version number such as 4.1.1 or higher.
To run the main script of the project, you can use the following command in your terminal:
python main.py
This will process all PDF documents in the 'documents' folder. The script will then perform the following tasks:
- Extract the text from the PDF file using PyMuPDF and PyTesseract.
- Analyze the text using spaCy, Gensim, TextBlob, Sumy, and Transformers.
- Generate a summary of the document and print it to the terminal.
- Extract the key-value pairs from the document and print them to the terminal.
- Cluster the document based on its similarity to other documents in a predefined corpus and print the cluster label to the terminal.
- Classify the document based on its sentiment and print the sentiment score to the terminal.
- Extract any email addresses from the document and print them to the terminal.
- Transcribe any audio files attached to the document using the OpenAI Whisper API and print the transcription to the terminal.
Each of these scripts can also be run individually as a standalone program.
Programs (Short Version)
Program | Description |
---|---|
extract_text_from_audio.py | Extracts text from audio files. It uses the Google Cloud Speech API to transcribe the audio into text. |
extract_text_from_document.py | Extracts information from documents. It can extract entities, dates, keywords, and summaries from documents. |
extract_text_from_pdf.py | Extracts text from PDF files. It uses the PyPDF4 library to convert PDF files to text. |
extract_key_value_pairs.py | Extracts key-value pairs from text. It uses the spacy library to identify key-value pairs in text. |
extract_named_entities.py | Extracts named entities from text. It uses the spacy library to identify named entities in text. |
optical_character_recognition.py | Extracts text from images using optical character recognition (OCR). It uses the Tesseract OCR library to extract text. |
sentiment_analysis.py | Analyzes the sentiment of text. It uses the VaderSentiment library to analyze the sentiment of text. |
summarize_text.py | Summarizes text. It uses the sumy library to summarize text. |
cluster_documents.py | Groups text into categories using k-means clustering. It uses the scikit-learn library to perform k-means clustering. |
convert_audio_format.py | Converts audio files from one format to another. It uses the ffmpeg library to convert audio files. |
download_email.py | Downloads emails from a mail server. It uses the imaplib library to download emails from a mail server. |
classify_documents.py | Classifies documents into categories. It uses the spacy library to extract features from documents and then uses a machine learning algorithm to classify the documents into categories. |
compare_documents.py | Calculates the similarity between documents. It uses the gensim library to calculate the similarity between documents. |
extract_features_from_text.py | Extracts features from text. It uses the spacy library to extract features from text. |
manage_files.py | Provides utility functions for working with files. It can be used to read, write, and delete files. |
sentiment_analysis_using_flair.py | Analyzes the sentiment of text using the Flair library. |
compare_addresses.py | Compares addresses using the fuzzywuzzy library. It can be used to find similar addresses. |
fuzzy_match_text.py | Performs fuzzy matching on text. It can be used to find similar text strings. |
sentiment_analysis_using_huggingface.py | Analyzes the sentiment of text using the Hugging Face Transformers library. |
extract_information_from_text.py | Extracts information from documents. It can extract entities, dates, keywords, and summaries from documents. |
main.py | This is the main program of the project. It provides a command line interface for the other programs. |
extract_named_entities_using_spacy.py | Extracts named entities from text. It uses the spacy library to identify named entities in text. |
extract_keywords_using_rake.py | Extracts keywords from text. It uses the Rake library to extract keywords from text. |
extract_relations_between_entities.py | Extracts relations between entities from text. It uses the spacy library to identify relations between entities in text. |
cluster_documents_based_on_similarity.py | Clusters documents based on their similarity. It uses the scikit-learn library to perform k-means clustering. |
preprocess_text.py | Preprocesses text. It can be used to clean, tokenize, and normalize text. |
create_topic_model.py | Creates topic models from text. It uses the gensim library to create topic models from text. |
sentiment_analysis_using_vader.py | Analyzes the sentiment of text using the VADER library. |
Programs (Long Version)
extract_text_from_audio.py: A program that performs the following functions:
- Get an audio file as an input argument
- Transcribe the audio file using googleapiclient library
- Write the transcription to a text file with the same name as the audio file
- Called by: main.py or as standalone program
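The transcription backend is named differently in different parts of this README (Whisper in the overview, Google Cloud Speech in the table). As one minimal illustration of the step, here is a sketch using the Whisper endpoint of the pre-1.0 openai package; the file name is a placeholder:

    import openai

    # Whisper transcription sketch (openai<1.0 API style); "call.wav" is a placeholder
    with open("call.wav", "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file)
    print(transcript["text"])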
extract_text_from_document.py: A program that performs the following functions:
- Get a PDF file as an input argument
- Extract text from the PDF file using PyMuPDF library
- Write the extracted text to a text file with the same name as the PDF file
- Called by: extract_text_from_pdf.py
extract_text_from_pdf.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all PDF files in the input directory
- Call extract_text_from_document.py to extract text from each PDF file
- Called by: main.py or as standalone program
extract_key_value_pairs.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Extract key-value pairs from each text file using info_extraction library
- Write the extracted key-value pairs to a CSV file
- Called by: main.py or as standalone program
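The info_extraction helpers are project-internal; as a rough sketch of the idea, a regex-based extractor for simple "Key: value" lines might look like this (the pattern and sample text are assumptions, not the project's actual logic):

    import re

    def extract_key_value_pairs(text):
        # Capture "Key: value" pairs, one per line; real documents need richer rules
        pattern = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}):\s*(.+)$", re.MULTILINE)
        return {key.strip(): value.strip() for key, value in pattern.findall(text)}

    sample = "Account Number: 12345\nCustomer Name: Jane Doe"
    print(extract_key_value_pairs(sample))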
extract_named_entities.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Perform named entity recognition on each text file using spacy, flair, and transformers libraries
- Write the extracted entities to a CSV file
- Called by: main.py or as standalone program
optical_character_recognition.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all PDF files in the input directory
- Convert each PDF file to page images using the pdf2image library
- Perform OCR on each page image using the pytesseract library
- Write the OCR text to a text file with the same name as the source file
- Use threading library to run multiple OCR processes in parallel
- Called by: main.py or as standalone program
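A condensed sketch of that flow, with placeholder file names, using concurrent.futures rather than raw threads for brevity:

    import concurrent.futures
    import pytesseract
    from pdf2image import convert_from_path

    def ocr_pdf(pdf_path):
        # Render each page to an image, then OCR it with Tesseract
        pages = convert_from_path(pdf_path)
        text = "\n".join(pytesseract.image_to_string(page) for page in pages)
        with open(pdf_path.replace(".pdf", ".txt"), "w", encoding="utf-8") as f:
            f.write(text)

    pdf_files = ["documents/a.pdf", "documents/b.pdf"]  # placeholder list
    # Run several OCR jobs in parallel
    with concurrent.futures.ThreadPoolExecutor() as pool:
        list(pool.map(ocr_pdf, pdf_files))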
sentiment_analysis.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Perform sentiment analysis on each text file using vaderSentiment, flair, and transformers libraries
- Write the sentiment scores to a CSV file
- Called by: main.py or as standalone program
summarize_text.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Generate summaries for each text file using sumy, transformers, and openai libraries
- Write the summaries to a CSV file
- Called by: main.py or as standalone program
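For the sumy part, a minimal extractive summary looks roughly like this (the summarizer choice and sentence count are assumptions; the Tokenizer needs nltk's punkt data):

    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lsa import LsaSummarizer

    text = "..."  # document text goes here
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    # Keep the three most salient sentences
    for sentence in summarizer(parser.document, 3):
        print(sentence)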
cluster_documents.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Embed each text file using torch, spacy, gensim, and transformers libraries
- Perform k-means clustering on the embeddings using sklearn library
- Compare the extracted keywords with a list of predefined categories using fuzzywuzzy and thefuzz libraries
- Assign the best matching category to each document
- Write the categories to a CSV file
- Called by: main.py or as standalone program
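A simplified sketch of the clustering step, substituting TF-IDF vectors for the script's transformer embeddings:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["invoice for account 123", "thanks for your help", "overdue payment notice"]
    # Embed documents as TF-IDF vectors (stand-in for the real embeddings)
    vectors = TfidfVectorizer().fit_transform(docs)
    # Group them into two clusters
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    print(kmeans.labels_)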
convert_audio_format.py: A program that performs the following functions:
- Get an audio file as an input argument
- Convert the audio file to a WAV format using subprocess library
- Write the converted audio file to a WAV file with the same name as the original file
- Called by: extract_text_from_audio.py
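A minimal version of that conversion, assuming ffmpeg is installed and on the PATH:

    import subprocess

    src = "recording.mp3"  # placeholder input file
    dst = src.rsplit(".", 1)[0] + ".wav"
    # Shell out to ffmpeg to transcode the audio to WAV
    subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)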
download_email.py: A program that performs the following functions:
- Get an email address and a password as input arguments
- Connect to Gmail API using googleapiclient and google_auth_oauthlib libraries
- Download all attachments from unread emails in a specified folder
- Write the downloaded attachments to a specified directory
- Called by: main.py or as standalone program
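The download itself goes through the Gmail API with OAuth; a skeleton of the connection and the unread-message query (the credentials file name is an assumption):

    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]
    # OAuth consent flow; opens a browser window on first run
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    service = build("gmail", "v1", credentials=creds)

    # List unread messages that carry attachments
    response = service.users().messages().list(
        userId="me", q="is:unread has:attachment").execute()
    print(len(response.get("messages", [])), "matching messages")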
classify_documents.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Embed the text file using torch and transformers libraries
- Load a pre-trained document classification model using torch and transformers libraries
- Predict the document class using the model
- Write the predicted class to a CSV file
- Called by: main.py or as standalone program
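The script loads its own pre-trained model; as a generic stand-in, a zero-shot classifier from transformers plays the same role (the labels are illustrative):

    from transformers import pipeline

    # Zero-shot classification as a stand-in for the project's own model
    classifier = pipeline("zero-shot-classification")
    result = classifier(
        "Your payment of $120 is now 30 days overdue.",
        candidate_labels=["invoice", "complaint", "collection notice"],
    )
    print(result["labels"][0], result["scores"][0])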
compare_documents.py: A program that performs the following functions:
- Get two text files as input arguments
- Preprocess the text files using text_preprocessing library
- Embed the text files using torch and transformers libraries
- Compute the cosine similarity between the embeddings using numpy library
- Write the similarity score to a CSV file
- Called by: main.py or as standalone program
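A compact version of that pipeline, assuming a generic BERT checkpoint for the embeddings:

    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        # Mean-pool the last hidden layer into a single document vector
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        return hidden.mean(dim=1)[0].numpy()

    a, b = embed("first document text"), embed("second document text")
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(round(float(similarity), 3))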
extract_features_from_text.py:
- A module that defines some functions for extracting features from text data, such as TF-IDF vectors, word embeddings, and document embeddings.
manage_files.py:
- A module that defines some functions for working with files, such as reading, writing, deleting, moving, and renaming files.
sentiment_analysis_using_flair.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Load a pre-trained sentiment analysis model using flair library
- Predict the sentiment polarity and score using the model
- Write the predicted sentiment to a CSV file
- Called by: main.py or as standalone program
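The flair portion in miniature:

    from flair.data import Sentence
    from flair.models import TextClassifier

    # Downloads the pre-trained English sentiment model on first use
    classifier = TextClassifier.load("en-sentiment")
    sentence = Sentence("The support team resolved my issue quickly.")
    classifier.predict(sentence)
    # Each label carries a polarity (POSITIVE/NEGATIVE) and a confidence score
    print(sentence.labels)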
compare_addresses.py: A program that performs the following functions:
- Get two addresses as input arguments
- Compare the addresses using fuzzywuzzy library
- Write the comparison score to a CSV file
- Called by: main.py or as standalone program
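For example, token_sort_ratio scores 0-100 and tolerates reordered address parts:

    from fuzzywuzzy import fuzz

    # Word order is ignored, so these two forms score highly
    score = fuzz.token_sort_ratio("12 Main St, Springfield", "Main St 12, Springfield")
    print(score)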
fuzzy_match_text.py: A program that performs the following functions:
- Get a text file and a list of categories as input arguments
- Extract keywords from the text file using rake_nltk library
- Compare the keywords with the categories using fuzzywuzzy and thefuzz libraries
- Assign the best matching category to the text file
- Write the assigned category to a CSV file
- Called by: main.py or as standalone program
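The matching step reduces to something like this (the keywords and categories are illustrative):

    from thefuzz import process

    categories = ["billing", "technical support", "cancellation"]
    keywords = ["invoice overdue", "payment reminder"]
    # Pick the category that best matches any extracted keyword
    best = max((process.extractOne(kw, categories) for kw in keywords),
               key=lambda match: match[1])
    print(best)  # (category, score)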
sentiment_analysis_using_huggingface.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Load a pre-trained sentiment analysis model using transformers library
- Predict the sentiment polarity and score using the model
- Write the predicted sentiment to a CSV file
- Called by: main.py or as standalone program
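The transformers part is essentially a one-line pipeline:

    from transformers import pipeline

    # Uses a default pre-trained sentiment model, downloaded on first run
    analyzer = pipeline("sentiment-analysis")
    print(analyzer("I have been waiting three weeks for a refund."))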
extract_information_from_text.py:
- A module that defines some functions for extracting information from text data, such as entities, dates, keywords, and key-value pairs.
main.py: The main program that calls other programs and performs the following functions:
- Get the input directory and output file name from the command-line arguments
- Check if the input directory exists and is not empty
- Call extract_text_from_pdf.py to extract text from PDF files
- Call optical_character_recognition.py to perform OCR on image files
- Call extract_named_entities.py to perform named entity recognition on text files
- Call cluster_documents.py to perform k-means clustering on text files
- Call summarize_text.py to generate summaries for text files
- Call sentiment_analysis.py to perform sentiment analysis on text files
- Call extract_key_value_pairs.py to extract key-value pairs from text files
- Write the results to a CSV file
extract_named_entities_using_spacy.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform named entity recognition on the text file using spacy, flair, and transformers libraries
- Write the extracted entities to a CSV file
- Called by: main.py or as standalone program
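The spacy portion, minimally:

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Jane Doe emailed Acme Corp. from Berlin on 3 March 2023.")
    for ent in doc.ents:
        print(ent.text, ent.label_)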
extract_keywords_using_rake.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Extract keywords from the text file using rake_nltk library
- Write the extracted keywords to a CSV file
- Called by: main.py or as standalone program
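A minimal rake_nltk run (Rake needs nltk's stopwords and punkt data):

    import nltk
    from rake_nltk import Rake

    nltk.download("stopwords")  # one-time downloads used by Rake
    nltk.download("punkt")

    rake = Rake()
    rake.extract_keywords_from_text(
        "The customer disputes the late fee applied to the March invoice.")
    print(rake.get_ranked_phrases()[:5])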
extract_relations_between_entities.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform relation extraction on the text file using spacy library
- Write the extracted relations to a CSV file
- Called by: main.py or as standalone program
cluster_documents_based_on_similarity.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Embed each text file using torch, spacy, gensim, and transformers libraries
- Perform similarity clustering on the embeddings using similarity_clustering library
- Write the cluster labels to a CSV file
- Called by: main.py or as standalone program
preprocess_text.py:
- A module that defines some functions for preprocessing text data, such as cleaning, tokenizing, lemmatizing, and vectorizing.
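As an illustration of those steps with spacy (the exact choices here are assumptions, not the module's actual behavior):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def preprocess(text):
        # Clean, tokenize, and lemmatize; drop stop words and punctuation
        doc = nlp(text.strip().lower())
        return [tok.lemma_ for tok in doc if not (tok.is_stop or tok.is_punct)]

    print(preprocess("The invoices were paid on time."))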
create_topic_model.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform topic modeling on the text file using gensim library
- Write the extracted topics to a CSV file
- Called by: main.py or as standalone program
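A tiny gensim LDA sketch over pre-tokenized documents:

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["invoice", "payment", "overdue"],
            ["support", "ticket", "password"],
            ["invoice", "refund", "payment"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    # Fit a two-topic model and show its top words
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
    print(lda.print_topics())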
sentiment_analysis_using_vader.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform sentiment analysis on the text file using vaderSentiment library
- Write the sentiment scores to a CSV file
- Called by: main.py or as standalone program
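And the VADER scoring itself:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # Returns neg/neu/pos proportions plus a compound score in [-1, 1]
    print(analyzer.polarity_scores("This is the worst service I have ever had."))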
Libraries
Library | Description |
---|---|
textblob | Natural language processing library that provides a simple API for common NLP tasks such as sentiment analysis and text classification. |
feature_extraction | Library for extracting features from text data. |
text_preprocessing | Library for preprocessing text data, such as cleaning and tokenizing. |
sys | Library for interacting with the Python runtime system. |
sklearn | Machine learning library that provides a wide range of supervised and unsupervised learning algorithms. |
torch | Machine learning library that provides a fast and flexible deep learning framework. |
spacy | Natural language processing library that provides a high-performance statistical NLP engine. |
csv | Library for reading and writing CSV files. |
gensim | Natural language processing library that provides a number of tools for working with large-scale text corpora. |
thefuzz | Library for fuzzy string matching. |
threading | Library for creating and managing threads. |
transformers | Library for natural language processing that provides a number of pre-trained models for tasks such as text classification and question answering. |
numpy | Library for scientific computing that provides a high-performance multidimensional array object. |
chardet | Library for detecting the encoding of a text file. |
stanza | Natural language processing library from Stanford that provides neural NLP pipelines for many languages. |
glob | Library for finding files that match a pattern. |
google_auth_oauthlib | Library for authenticating with Google APIs. |
pdf2image | Library for converting PDF files to images. |
sumy | Library for summarizing text documents. |
info_extraction | Library for extracting information from text documents, such as entities, dates, and keywords. |
time | Library for getting the current time and date. |
warnings | Library for handling warnings. |
flair | Natural language processing library that provides a number of pre-trained models for tasks such as sentiment analysis and named entity recognition. |
vaderSentiment | Library for sentiment analysis. |
concurrent | Library for managing concurrent execution of tasks. |
google | Library for interacting with Google APIs. |
fuzzywuzzy | Library for fuzzy string matching. |
manage_files | Library for working with files. |
PyPDF4 | Library for reading and writing PDF files. |
re | Library for regular expressions. |
os | Library for interacting with the operating system. |
openai | Library for interacting with OpenAI APIs. |
base64 | Library for encoding and decoding binary data in base64 format. |
nltk | Natural language processing library that provides a wide range of NLP tools and resources. |
pandas | Library for data analysis and manipulation. |
similarity_clustering | Library for clustering and finding similar items in a dataset. |
functools | Library for providing higher-order functions and other tools for working with functions. |
requests | Library for making HTTP requests. |
googleapiclient | Library for interacting with Google APIs. |
rake_nltk | Library for natural language processing that provides a number of tools for extracting keywords from text. |
subprocess | Library for launching subprocesses. |
fitz | Library for working with PDF files (the import name of PyMuPDF). |
pytesseract | Library for optical character recognition. |
collections | Library providing specialized container datatypes, such as deque, Counter, defaultdict, and namedtuple. |