Document Intelligence is a proof of concept that automatically processes and analyzes PDF files, emails, and call recordings. It extracts relevant information, analyzes sentiment, clusters and categorizes documents by similarity, and generates summaries. That information is then used to assign the correct account number in an external system. The project aims to digest unstructured documents, such as phone calls and texts, and generate metadata indexes for the correct account in a remote system. The objective is for this to happen end to end, from unknown document to correctly filed record, without human intervention.
The project consists of several Python scripts that accomplish different tasks, including OCR processing, named entity recognition, similarity clustering, sentiment analysis, text summarization, audio transcription, email extraction, key-value pair extraction, and classification.
You can use pip to install the required third-party packages in your Python environment (standard-library modules such as sys, csv, threading, glob, time, warnings, concurrent, re, os, base64, functools, subprocess, and collections ship with Python and need no installation; the fitz module is provided by PyMuPDF):
pip install textblob scikit-learn torch spacy gensim thefuzz transformers numpy chardet stanza google-auth-oauthlib pdf2image sumy flair vaderSentiment fuzzywuzzy PyPDF4 openai nltk pandas requests google-api-python-client rake-nltk PyMuPDF pytesseract
You also need to download the necessary language model for spaCy:
python -m spacy download en_core_web_sm
Additionally, you need to install Poppler and Tesseract on your system and set the necessary dependencies for the Python project. To do this, follow these steps:
Download and install Poppler. For Windows users, up-to-date builds are linked from the pdf2image project page (https://pypi.org/project/pdf2image/).
Move the extracted directory to a desired location on your system.
Add the bin/ directory to your path environment variable.
set PATH=%PATH%;C:\python\autoindex\poppler-0.68.0\bin
Alternatively, you can use the poppler_path = r"c:\path\to\poppler-xx\bin" argument in the convert_from_path function in the extract_text_from_pdf.py script.
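For example, a minimal conversion call with an explicit poppler_path (all paths here are illustrative):

    from pdf2image import convert_from_path

    # Point pdf2image at the Poppler binaries explicitly (Windows example)
    pages = convert_from_path(
        "documents/sample.pdf",
        poppler_path=r"C:\python\autoindex\poppler-0.68.0\bin",
    )
    for i, page in enumerate(pages):
        page.save(f"page_{i}.png", "PNG")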
Download and install Tesseract OCR from its official website (https://github.com/ub-mannheim/tesseract/wiki).
After installation, make Tesseract reachable. On Windows the default installation directory is C:\Program Files\Tesseract-OCR, with the executable tesseract.exe inside it. Add that directory (not the executable itself) to your system's PATH variable, or configure the PyTesseract library to use the installed executable directly.
set PATH=%PATH%;C:\Program Files\Tesseract-OCR
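If you prefer not to modify PATH, point PyTesseract at the executable from Python:

    import pytesseract

    # Tell PyTesseract where the Tesseract binary lives (Windows default path)
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"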
Verify your Tesseract installation from within Python:
    import pytesseract
    print(pytesseract.get_tesseract_version())
You should see a version number such as 4.1.1 or higher.
To run the main script of the project, you can use the following command in your terminal:
python main.py
This will process all PDF documents in the 'documents' folder. The script will then perform the following tasks:
- Extract the text from the PDF file using PyMuPDF and PyTesseract.
- Analyze the text using spaCy, Gensim, TextBlob, Sumy, and Transformers.
- Generate a summary of the document and print it to the terminal.
- Extract the key-value pairs from the document and print them to the terminal.
- Cluster the document based on its similarity to other documents in a predefined corpus and print the cluster label to the terminal.
- Classify the document based on its sentiment and print the sentiment score to the terminal.
- Extract any email addresses from the document and print them to the terminal.
- Transcribe any audio files attached to the document using the OpenAI Whisper API and print the transcription to the terminal.
Each of these scripts can also be run individually as a standalone program.
Programs (Short Version)
Program | Description |
---|---|
extract_text_from_audio.py | Extracts text from audio files. It uses the Google Cloud Speech API to transcribe the audio into text. |
extract_text_from_document.py | Extracts information from documents. It can extract entities, dates, keywords, and summaries from documents. |
extract_text_from_pdf.py | Extracts text from PDF files. It uses the PyPDF4 library to convert PDF files to text. |
extract_key_value_pairs.py | Extracts key-value pairs from text. It uses the spacy library to identify key-value pairs in text. |
extract_named_entities.py | Extracts named entities from text. It uses the spacy library to identify named entities in text. |
optical_character_recognition.py | Extracts text from images using optical character recognition (OCR). It uses the Tesseract OCR library to extract text. |
sentiment_analysis.py | Analyzes the sentiment of text. It uses the VaderSentiment library to analyze the sentiment of text. |
summarize_text.py | Summarizes text. It uses the sumy library to summarize text. |
cluster_documents.py | Groups text into categories using k-means clustering. It uses the scikit-learn library to perform k-means clustering. |
convert_audio_format.py | Converts audio files from one format to another. It uses the ffmpeg library to convert audio files. |
download_email.py | Downloads emails from a mail server. It uses the imaplib library to download emails from a mail server. |
classify_documents.py | Classifies documents into categories. It uses the spacy library to extract features from documents and then uses a machine learning algorithm to classify the documents into categories. |
compare_documents.py | Calculates the similarity between documents. It uses the gensim library to calculate the similarity between documents. |
extract_features_from_text.py | Extracts features from text. It uses the spacy library to extract features from text. |
manage_files.py | Provides utility functions for working with files. It can be used to read, write, and delete files. |
sentiment_analysis_using_flair.py | Analyzes the sentiment of text using the Flair library. |
compare_addresses.py | Compares addresses using the fuzzywuzzy library. It can be used to find similar addresses. |
fuzzy_match_text.py | Performs fuzzy matching on text. It can be used to find similar text strings. |
sentiment_analysis_using_huggingface.py | Analyzes the sentiment of text using the Hugging Face Transformers library. |
extract_information_from_text.py | Extracts information from documents. It can extract entities, dates, keywords, and summaries from documents. |
main.py | This is the main program of the project. It provides a command line interface for the other programs. |
extract_named_entities_using_spacy.py | Extracts named entities from text. It uses the spacy library to identify named entities in text. |
extract_keywords_using_rake.py | Extracts keywords from text. It uses the Rake library to extract keywords from text. |
extract_relations_between_entities.py | Extracts relations between entities from text. It uses the spacy library to identify relations between entities in text. |
cluster_documents_based_on_similarity.py | Clusters documents based on their similarity. It uses the scikit-learn library to perform k-means clustering. |
preprocess_text.py | Preprocesses text. It can be used to clean, tokenize, and normalize text. |
create_topic_model.py | Creates topic models from text. It uses the gensim library to create topic models from text. |
sentiment_analysis_using_vader.py | Analyzes the sentiment of text using the VADER library. |
Programs (Long Version)
extract_text_from_audio.py: A program that performs the following functions:
- Get an audio file as an input argument
- Transcribe the audio file using googleapiclient library
- Write the transcription to a text file with the same name as the audio file
- Called by: main.py or as standalone program
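The transcription backend is named differently in different parts of this README (Whisper in the overview, Google Cloud Speech in the table). As one minimal illustration of the step, here is a sketch using the Whisper endpoint of the pre-1.0 openai package; the file name is a placeholder:

    import openai

    # Whisper transcription sketch (openai<1.0 API style); "call.wav" is a placeholder
    with open("call.wav", "rb") as audio_file:
        transcript = openai.Audio.transcribe("whisper-1", audio_file)
    print(transcript["text"])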
extract_text_from_document.py: A program that performs the following functions:
- Get a PDF file as an input argument
- Extract text from the PDF file using PyMuPDF library
- Write the extracted text to a text file with the same name as the PDF file
- Called by: extract_text_from_pdf.py
extract_text_from_pdf.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all PDF files in the input directory
- Call extract_text_from_document.py to extract text from each PDF file
- Called by: main.py or as standalone program
extract_key_value_pairs.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Extract key-value pairs from each text file using info_extraction library
- Write the extracted key-value pairs to a CSV file
- Called by: main.py or as standalone program
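The info_extraction helpers are project-internal; as a rough sketch of the idea, a regex-based extractor for simple "Key: value" lines might look like this (the pattern and sample text are assumptions, not the project's actual logic):

    import re

    def extract_key_value_pairs(text):
        # Capture "Key: value" pairs, one per line; real documents need richer rules
        pattern = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}):\s*(.+)$", re.MULTILINE)
        return {key.strip(): value.strip() for key, value in pattern.findall(text)}

    sample = "Account Number: 12345\nCustomer Name: Jane Doe"
    print(extract_key_value_pairs(sample))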
extract_named_entities.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Perform named entity recognition on each text file using spacy, flair, and transformers libraries
- Write the extracted entities to a CSV file
- Called by: main.py or as standalone program
optical_character_recognition.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all PDF files in the input directory
- Convert each PDF file to page images using the pdf2image library
- Perform OCR on each page image using the pytesseract library
- Write the OCR text to a text file with the same name as the source file
- Use threading library to run multiple OCR processes in parallel
- Called by: main.py or as standalone program
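A condensed sketch of that flow, with placeholder file names, using concurrent.futures rather than raw threads for brevity:

    import concurrent.futures
    import pytesseract
    from pdf2image import convert_from_path

    def ocr_pdf(pdf_path):
        # Render each page to an image, then OCR it with Tesseract
        pages = convert_from_path(pdf_path)
        text = "\n".join(pytesseract.image_to_string(page) for page in pages)
        with open(pdf_path.replace(".pdf", ".txt"), "w", encoding="utf-8") as f:
            f.write(text)

    pdf_files = ["documents/a.pdf", "documents/b.pdf"]  # placeholder list
    # Run several OCR jobs in parallel
    with concurrent.futures.ThreadPoolExecutor() as pool:
        list(pool.map(ocr_pdf, pdf_files))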
sentiment_analysis.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Perform sentiment analysis on each text file using vaderSentiment, flair, and transformers libraries
- Write the sentiment scores to a CSV file
- Called by: main.py or as standalone program
summarize_text.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Generate summaries for each text file using sumy, transformers, and openai libraries
- Write the summaries to a CSV file
- Called by: main.py or as standalone program
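For the sumy part, a minimal extractive summary looks roughly like this (the summarizer choice and sentence count are assumptions; the Tokenizer needs nltk's punkt data):

    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lsa import LsaSummarizer

    text = "..."  # document text goes here
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    # Keep the three most salient sentences
    for sentence in summarizer(parser.document, 3):
        print(sentence)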
cluster_documents.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Embed each text file using torch, spacy, gensim, and transformers libraries
- Perform k-means clustering on the embeddings using sklearn library
- Compare the extracted keywords with a list of predefined categories using fuzzywuzzy and thefuzz libraries
- Assign the best matching category to each document
- Write the categories to a CSV file
- Called by: main.py or as standalone program
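A simplified sketch of the clustering step, substituting TF-IDF vectors for the script's transformer embeddings:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["invoice for account 123", "thanks for your help", "overdue payment notice"]
    # Embed documents as TF-IDF vectors (stand-in for the real embeddings)
    vectors = TfidfVectorizer().fit_transform(docs)
    # Group them into two clusters
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
    print(kmeans.labels_)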
convert_audio_format.py: A program that performs the following functions:
- Get an audio file as an input argument
- Convert the audio file to a WAV format using subprocess library
- Write the converted audio file to a WAV file with the same name as the original file
- Called by: extract_text_from_audio.py
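A minimal version of that conversion, assuming ffmpeg is installed and on the PATH:

    import subprocess

    src = "recording.mp3"  # placeholder input file
    dst = src.rsplit(".", 1)[0] + ".wav"
    # Shell out to ffmpeg to transcode the audio to WAV
    subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)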
download_email.py: A program that performs the following functions:
- Get an email address and a password as input arguments
- Connect to Gmail API using googleapiclient and google_auth_oauthlib libraries
- Download all attachments from unread emails in a specified folder
- Write the downloaded attachments to a specified directory
- Called by: main.py or as standalone program
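The download itself goes through the Gmail API with OAuth; a skeleton of the connection and the unread-message query (the credentials file name is an assumption):

    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient.discovery import build

    SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]
    # OAuth consent flow; opens a browser window on first run
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    service = build("gmail", "v1", credentials=creds)

    # List unread messages that carry attachments
    response = service.users().messages().list(
        userId="me", q="is:unread has:attachment").execute()
    print(len(response.get("messages", [])), "matching messages")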
classify_documents.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Embed the text file using torch and transformers libraries
- Load a pre-trained document classification model using torch and transformers libraries
- Predict the document class using the model
- Write the predicted class to a CSV file
- Called by: main.py or as standalone program
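The script loads its own pre-trained model; as a generic stand-in, a zero-shot classifier from transformers plays the same role (the labels are illustrative):

    from transformers import pipeline

    # Zero-shot classification as a stand-in for the project's own model
    classifier = pipeline("zero-shot-classification")
    result = classifier(
        "Your payment of $120 is now 30 days overdue.",
        candidate_labels=["invoice", "complaint", "collection notice"],
    )
    print(result["labels"][0], result["scores"][0])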
compare_documents.py: A program that performs the following functions:
- Get two text files as input arguments
- Preprocess the text files using text_preprocessing library
- Embed the text files using torch and transformers libraries
- Compute the cosine similarity between the embeddings using numpy library
- Write the similarity score to a CSV file
- Called by: main.py or as standalone program
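A compact version of that pipeline, assuming a generic BERT checkpoint for the embeddings:

    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        # Mean-pool the last hidden layer into a single document vector
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        return hidden.mean(dim=1)[0].numpy()

    a, b = embed("first document text"), embed("second document text")
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(round(float(similarity), 3))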
extract_features_from_text.py:
- A module that defines some functions for extracting features from text data, such as TF-IDF vectors, word embeddings, and document embeddings.
manage_files.py:
- A module that defines some functions for working with files, such as reading, writing, deleting, moving, and renaming files.
sentiment_analysis_using_flair.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Load a pre-trained sentiment analysis model using flair library
- Predict the sentiment polarity and score using the model
- Write the predicted sentiment to a CSV file
- Called by: main.py or as standalone program
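The flair portion in miniature:

    from flair.data import Sentence
    from flair.models import TextClassifier

    # Downloads the pre-trained English sentiment model on first use
    classifier = TextClassifier.load("en-sentiment")
    sentence = Sentence("The support team resolved my issue quickly.")
    classifier.predict(sentence)
    # Each label carries a polarity (POSITIVE/NEGATIVE) and a confidence score
    print(sentence.labels)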
compare_addresses.py: A program that performs the following functions:
- Get two addresses as input arguments
- Compare the addresses using fuzzywuzzy library
- Write the comparison score to a CSV file
- Called by: main.py or as standalone program
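For example, token_sort_ratio scores 0-100 and tolerates reordered address parts:

    from fuzzywuzzy import fuzz

    # Word order is ignored, so these two forms score highly
    score = fuzz.token_sort_ratio("12 Main St, Springfield", "Main St 12, Springfield")
    print(score)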
fuzzy_match_text.py: A program that performs the following functions:
- Get a text file and a list of categories as input arguments
- Extract keywords from the text file using rake_nltk library
- Compare the keywords with the categories using fuzzywuzzy and thefuzz libraries
- Assign the best matching category to the text file
- Write the assigned category to a CSV file
- Called by: main.py or as standalone program
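The matching step reduces to something like this (the keywords and categories are illustrative):

    from thefuzz import process

    categories = ["billing", "technical support", "cancellation"]
    keywords = ["invoice overdue", "payment reminder"]
    # Pick the category that best matches any extracted keyword
    best = max((process.extractOne(kw, categories) for kw in keywords),
               key=lambda match: match[1])
    print(best)  # (category, score)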
sentiment_analysis_using_huggingface.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Load a pre-trained sentiment analysis model using transformers library
- Predict the sentiment polarity and score using the model
- Write the predicted sentiment to a CSV file
- Called by: main.py or as standalone program
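The transformers part is essentially a one-line pipeline:

    from transformers import pipeline

    # Uses a default pre-trained sentiment model, downloaded on first run
    analyzer = pipeline("sentiment-analysis")
    print(analyzer("I have been waiting three weeks for a refund."))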
extract_information_from_text.py:
- A module that defines some functions for extracting information from text data, such as entities, dates, keywords, and key-value pairs.
main.py: The main program that calls other programs and performs the following functions:
- Get the input directory and output file name from the command-line arguments
- Check if the input directory exists and is not empty
- Call extract_text_from_pdf.py to extract text from PDF files
- Call optical_character_recognition.py to perform OCR on image files
- Call extract_named_entities.py to perform named entity recognition on text files
- Call cluster_documents.py to perform k-means clustering on text files
- Call summarize_text.py to generate summaries for text files
- Call sentiment_analysis.py to perform sentiment analysis on text files
- Call extract_key_value_pairs.py to extract key-value pairs from text files
- Write the results to a CSV file
extract_named_entities_using_spacy.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform named entity recognition on the text file using spacy, flair, and transformers libraries
- Write the extracted entities to a CSV file
- Called by: main.py or as standalone program
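The spacy portion, minimally:

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Jane Doe emailed Acme Corp. from Berlin on 3 March 2023.")
    for ent in doc.ents:
        print(ent.text, ent.label_)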
extract_keywords_using_rake.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Extract keywords from the text file using rake_nltk library
- Write the extracted keywords to a CSV file
- Called by: main.py or as standalone program
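A minimal rake_nltk run (Rake needs nltk's stopwords and punkt data):

    import nltk
    from rake_nltk import Rake

    nltk.download("stopwords")  # one-time downloads used by Rake
    nltk.download("punkt")

    rake = Rake()
    rake.extract_keywords_from_text(
        "The customer disputes the late fee applied to the March invoice.")
    print(rake.get_ranked_phrases()[:5])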
extract_relations_between_entities.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform relation extraction on the text file using spacy library
- Write the extracted relations to a CSV file
- Called by: main.py or as standalone program
cluster_documents_based_on_similarity.py: A program that performs the following functions:
- Get the input directory and output file name from the main program
- Iterate over all text files in the input directory
- Preprocess each text file using text_preprocessing library
- Embed each text file using torch, spacy, gensim, and transformers libraries
- Perform similarity clustering on the embeddings using similarity_clustering library
- Write the cluster labels to a CSV file
- Called by: main.py or as standalone program
preprocess_text.py:
- A module that defines some functions for preprocessing text data, such as cleaning, tokenizing, lemmatizing, and vectorizing.
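As an illustration of those steps with spacy (the exact choices here are assumptions, not the module's actual behavior):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def preprocess(text):
        # Clean, tokenize, and lemmatize; drop stop words and punctuation
        doc = nlp(text.strip().lower())
        return [tok.lemma_ for tok in doc if not (tok.is_stop or tok.is_punct)]

    print(preprocess("The invoices were paid on time."))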
create_topic_model.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform topic modeling on the text file using gensim library
- Write the extracted topics to a CSV file
- Called by: main.py or as standalone program
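A tiny gensim LDA sketch over pre-tokenized documents:

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["invoice", "payment", "overdue"],
            ["support", "ticket", "password"],
            ["invoice", "refund", "payment"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    # Fit a two-topic model and show its top words
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
    print(lda.print_topics())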
sentiment_analysis_using_vader.py: A program that performs the following functions:
- Get a text file as an input argument
- Preprocess the text file using text_preprocessing library
- Perform sentiment analysis on the text file using vaderSentiment library
- Write the sentiment scores to a CSV file
- Called by: main.py or as standalone program
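And the VADER scoring itself:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # Returns neg/neu/pos proportions plus a compound score in [-1, 1]
    print(analyzer.polarity_scores("This is the worst service I have ever had."))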
Libraries
Library | Description |
---|---|
textblob | Natural language processing library that provides a simple API for common NLP tasks such as sentiment analysis and text classification. |
feature_extraction | Library for extracting features from text data. |
text_preprocessing | Library for preprocessing text data, such as cleaning and tokenizing. |
sys | Library for interacting with the Python runtime system. |
sklearn | Machine learning library that provides a wide range of supervised and unsupervised learning algorithms. |
torch | Machine learning library that provides a fast and flexible deep learning framework. |
spacy | Natural language processing library that provides a high-performance statistical NLP engine. |
csv | Library for reading and writing CSV files. |
gensim | Natural language processing library that provides a number of tools for working with large-scale text corpora. |
thefuzz | Library for fuzzy string matching. |
threading | Library for creating and managing threads. |
transformers | Library for natural language processing that provides a number of pre-trained models for tasks such as text classification and question answering. |
numpy | Library for scientific computing that provides a high-performance multidimensional array object. |
chardet | Library for detecting the encoding of a text file. |
stanza | Natural language processing library from Stanford that provides neural NLP pipelines for many languages. |
glob | Library for finding files that match a pattern. |
google_auth_oauthlib | Library for authenticating with Google APIs. |
pdf2image | Library for converting PDF files to images. |
sumy | Library for summarizing text documents. |
info_extraction | Library for extracting information from text documents, such as entities, dates, and keywords. |
time | Library for getting the current time and date. |
warnings | Library for handling warnings. |
flair | Natural language processing library that provides a number of pre-trained models for tasks such as sentiment analysis and named entity recognition. |
vaderSentiment | Library for sentiment analysis. |
concurrent | Library for managing concurrent execution of tasks. |
google | Library for interacting with Google APIs. |
fuzzywuzzy | Library for fuzzy string matching. |
manage_files | Library for working with files. |
PyPDF4 | Library for reading and writing PDF files. |
re | Library for regular expressions. |
os | Library for interacting with the operating system. |
openai | Library for interacting with OpenAI APIs. |
base64 | Library for encoding and decoding binary data in base64 format. |
nltk | Natural language processing library that provides a wide range of NLP tools and resources. |
pandas | Library for data analysis and manipulation. |
similarity_clustering | Library for clustering and finding similar items in a dataset. |
functools | Library for providing higher-order functions and other tools for working with functions. |
requests | Library for making HTTP requests. |
googleapiclient | Library for interacting with Google APIs. |
rake_nltk | Library for natural language processing that provides a number of tools for extracting keywords from text. |
subprocess | Library for launching subprocesses. |
fitz | Library for working with PDF files (the import name of PyMuPDF). |
pytesseract | Library for optical character recognition. |
collections | Library providing specialized container datatypes, such as deque, Counter, defaultdict, and namedtuple. |