An OCR project to extract information about Patient and Prescription details from PDF Documents. Also this project involves creation of a backend server which will process data extraction requests.
- What is OCR?
- Introduction to Project
- Project Execution Steps
- Challenges Faced
- Directory Structure
- If you are cloning my repo?
OCR stands for Optical Character Recognition. It's a technology that enables the conversion of different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. Essentially, OCR software identifies text within images or scanned documents and converts it into machine-readable text.
Machine learning and AI play significant roles in OCR technology. Machine learning (ML) powers OCR to turn images into text. ML algorithms like convolutional neural networks (CNNs) are trained on massive datasets to recognize characters. ML also helps extract key features from images and utilizes language models to understand context and improve accuracy, especially for ambiguous characters. OCR systems continuously learn and adapt to specific domains and languages through ML, ensuring ever-better performance.
Here are some common applications and domains where OCR is used:
-
Document Digitization: OCR is extensively used to convert scanned documents, PDFs, and images into editable and searchable text. This is useful in offices, libraries, and archives for digitizing large volumes of documents for easier storage, retrieval, and sharing.
-
Data Entry Automation: OCR automates data entry processes by extracting text from documents such as invoices, receipts, and forms. This saves time and reduces errors associated with manual data entry tasks.
-
Banking and Finance: OCR is employed in banking for reading checks, processing forms, and extracting information from financial documents. It facilitates faster processing of transactions and improves accuracy in tasks like check reading and automated form filling.
Whenever we go to hospital, we always fill up some kind of forms and our medical history is created using those forms, prescriptions, test reports. Sometimes, this medical history is used for other purposes like claiming health insurance etc
Health Insurance company might receive thousands of such documents from multiple sources and creating a record of useful information from customers medical history is a very cumbersome task and requires huge manpower. And hence this kind of tasks can be sped up using OCR technology.
For this project we have two types of Medical Documents.
- Patient Medical Record
- Prescription
We are going to extract some important fields from these documents.
Though I have been learning Data Science, then why am I doing this project? Mainly there are 3 reasons.
- OCR is a subset of Computer Vision. OCR can be used in an NLP project like summarizing text using LLM.
- This project involves very fundamental concepts of Python programming like OOP and Modular programming which are industry best practices.
- Also this project involves creation of a backend server using FastAPI, which is known for its performance and many world-renowned companies such as Uber, Netflix and Microsoft use FastAPI to build their applications.
- Step 1: Convert pdf to image using
pdf2image
library - Step 2: Preprocess the image (Apply
adaptive thresholding and binarization using OpenCV2
) - Step 3: Extracting text from image by passing it through
tesseract OCR engine
- Step 4: Finding useful information from text using
RegEx
and returning in JSON format - Step 5: Creating a
FastAPI backend server
which serves data extraction requests by accepting a pdf_file, file_format and returning a JSON object. - Step 6: To create a Demo of
frontend UI using Streamlit
and connect it with our FastAPI server using Python Requests module.
- How to use OCR for real world projects and key image processing concepts like thresholding using
OpenCV2
. - Polished up my Python coding skills by using
OOP, code refactoring and modular programming
. - Setting up of a backend server using
FastAPI
framework. - Unit testing using
Pytest
. - How to use
Postman
for API testing. - I could connect Streamlit frontend with FastAPI backend server using
Python requests
module.
- In adapative thresholding, it requires lot of trial and error to reach optimum values of block size and constant.
- Pytest is not properly integrated with VSCode.
- Also I faced path related errors during unit testing even in PyCharm.
- When creating streamlit app, there are very few practical instructions available on internet for connecting it with a backend server and sending files across.
medical-data-extraction
│ .gitignore
│ README.md
│ requirements.txt
│
├───backend
│ │
│ ├───resources
│ │ │
│ │ ├───patient_details
│ │ │ pd_1.pdf
│ │ │ pd_2.pdf
│ │ │
│ │ └───prescription
│ │ pre_1.pdf
│ │ pre_2.pdf
│ │
│ ├───src
│ │ extractor.py
│ │ main.py //Fastapi Backend Server
│ │ parser_generic.py
│ │ parser_patient_details.py
│ │ parser_prescription.py
│ │ utils.py
│ │
│ ├───tests
│ │ test_prescription_parser.py
│ │
│ └───uploads
│
├───frontend
│ app.py //Streamlit app
│
├───Notebooks
│ 01_prescription_parser.ipynb
│ 02_patient_details_parser.ipynb
│ 03_RegEx.ipynb
│
└───reference
tesseract_papar_by_google.pdf
- Install all dependancies from
requirements.txt
- For
pdf2image
you need to downloadpoppler
- Install Tesseract OCR Engine in your PC
- Set required PATHs as per your environment