Medical Data Extraction

An OCR project that extracts patient and prescription details from PDF documents. It also includes a backend server that processes data extraction requests.

Overview

What is OCR?
Introduction to Project
Project Execution Steps
What did I learn through this project?
Challenges Faced
Directory Structure
If you are cloning this repository

1. What is OCR?

OCR stands for Optical Character Recognition. It's a technology that enables the conversion of different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. Essentially, OCR software identifies text within images or scanned documents and converts it into machine-readable text.

Machine learning (ML) plays a significant role in modern OCR. Models such as convolutional neural networks (CNNs) are trained on large datasets to recognize characters and to extract key features from images. Language models add context, which improves accuracy for ambiguous characters. OCR systems can also be trained for specific domains and languages, which improves their performance on those documents.

Here are some common applications and domains where OCR is used:

Document Digitization: OCR is extensively used to convert scanned documents, PDFs, and images into editable and searchable text. This is useful in offices, libraries, and archives for digitizing large volumes of documents for easier storage, retrieval, and sharing.
Data Entry Automation: OCR automates data entry processes by extracting text from documents such as invoices, receipts, and forms. This saves time and reduces errors associated with manual data entry tasks.
Banking and Finance: OCR is employed in banking for reading checks, processing forms, and extracting information from financial documents. It facilitates faster processing of transactions and improves accuracy in tasks like check reading and automated form filling.

2. Introduction to Project

Whenever we go to a hospital, we fill out forms, and our medical history is built from those forms, prescriptions, and test reports. This medical history is sometimes used for other purposes, such as claiming health insurance.

A health insurance company might receive thousands of such documents from multiple sources. Creating a record of the useful information in each customer's medical history by hand is cumbersome and requires a lot of manpower, so this kind of task can be sped up using OCR technology.

For this project we have two types of Medical Documents.

Patient Medical Record
Prescription

We are going to extract some important fields from these documents.

Why this project?

I have been learning Data Science, so why am I doing this project? There are three main reasons.

OCR is a subfield of Computer Vision, and its output can feed NLP projects, such as summarizing text with an LLM.
This project involves fundamental Python programming concepts such as OOP and modular programming, which are industry best practices.
This project also involves creating a backend server with FastAPI, which is known for its performance. Well-known companies such as Uber, Netflix, and Microsoft use FastAPI to build their applications.

3. Project Execution Steps

Step 1: Convert the PDF to images using the pdf2image library.
Step 2: Preprocess each image (binarize it with adaptive thresholding using OpenCV).
Step 3: Extract text from the image by passing it through the Tesseract OCR engine.
Step 4: Find the useful information in the text using RegEx and return it in JSON format.
Step 5: Create a FastAPI backend server that serves data extraction requests by accepting a pdf_file and a file_format and returning a JSON object.
Step 6: Create a demo frontend UI using Streamlit and connect it to the FastAPI server using the Python Requests module.
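
Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in extractor.py: the function names pdf_to_text, parse_fields and extract, the threshold values 61 and 11, and the "Name:", "Address:" and "Refill:" labels are all assumptions made for the example. Real documents will need their own patterns and tuned values.

```python
import json
import re

# Hypothetical field labels; real documents will need their own patterns.
FIELD_PATTERNS = {
    "patient_name": re.compile(r"^Name:[ \t]*(.+)$", re.MULTILINE),
    "address": re.compile(r"^Address:[ \t]*(.+)$", re.MULTILINE),
    "refill": re.compile(r"^Refill:[ \t]*(\d+)", re.MULTILINE),
}


def pdf_to_text(pdf_path: str) -> str:
    """Steps 1-3: render PDF pages, binarize them, and run Tesseract."""
    # Third-party imports are kept local so the regex step below can be
    # used without the OCR stack installed.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    chunks = []
    for page in convert_from_path(pdf_path):  # one PIL image per page
        gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        # Block size 61 and constant 11 are placeholders, not tuned values.
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 61, 11
        )
        chunks.append(pytesseract.image_to_string(binary, lang="eng"))
    return "\n".join(chunks)


def parse_fields(text: str) -> dict:
    """Step 4: pull fields out of OCR text; missing fields become None."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


def extract(pdf_path: str) -> str:
    """Run the whole pipeline and return the fields as a JSON string."""
    return json.dumps(parse_fields(pdf_to_text(pdf_path)))
```

With Poppler and Tesseract installed, calling extract on a file such as backend/resources/prescription/pre_1.pdf would return the JSON string described in Step 4, although the sample labels above may not match that file's layout.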

4. What did I learn through this project?

How to use OCR in real-world projects, and key image processing concepts such as thresholding with OpenCV.
Polished my Python coding skills through OOP, code refactoring, and modular programming.
Setting up a backend server using the FastAPI framework.
Unit testing using Pytest.
Using Postman for API testing.
Connecting a Streamlit frontend to a FastAPI backend server using the Python requests module.
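
For the last point, sending a PDF from the frontend to the backend can look roughly like the sketch below. The URL, the route name extract_from_doc, and the form field names file and file_format are assumptions for illustration. They must match whatever the FastAPI route in main.py actually declares.

```python
# Hypothetical endpoint; replace with the route exposed by the FastAPI server.
API_URL = "http://127.0.0.1:8000/extract_from_doc"


def build_upload(pdf_name: str, pdf_bytes: bytes, file_format: str) -> dict:
    """Build keyword arguments for requests.post as a multipart upload."""
    return {
        # (filename, content, MIME type) tuple for the uploaded file part
        "files": {"file": (pdf_name, pdf_bytes, "application/pdf")},
        # ordinary form field sent alongside the file
        "data": {"file_format": file_format},
        "timeout": 60,
    }


def request_extraction(pdf_name, pdf_bytes, file_format, url=API_URL) -> dict:
    """POST the PDF to the backend and return the parsed JSON response."""
    import requests  # third-party; imported here so build_upload has no dependency

    response = requests.post(url, **build_upload(pdf_name, pdf_bytes, file_format))
    response.raise_for_status()  # surface HTTP errors instead of bad JSON
    return response.json()
```

In a Streamlit app, the object returned by st.file_uploader exposes .name and .getvalue(), so those two values can be passed as pdf_name and pdf_bytes.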

5. Challenges faced during this project

Adaptive thresholding requires a lot of trial and error to reach good values of block size and constant.
Pytest did not integrate properly with VSCode for me.
I also faced path-related errors during unit testing, even in PyCharm.
When creating the Streamlit app, I found very few practical instructions online for connecting it to a backend server and sending files across.
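
The trial and error over block size and constant can be made a little more systematic by generating a grid of candidates and comparing the results side by side. The sketch below is one possible approach, not the process I actually followed. The helper names and the default ranges are assumptions. The one hard rule is that cv2.adaptiveThreshold only accepts an odd block size greater than 1.

```python
from itertools import product


def candidate_params(block_sizes=range(11, 92, 20), constants=(5, 10, 15)):
    """Return (block_size, constant) pairs, dropping invalid block sizes.

    cv2.adaptiveThreshold requires an odd block size greater than 1.
    """
    return [
        (block, const)
        for block, const in product(block_sizes, constants)
        if block > 1 and block % 2 == 1
    ]


def sweep(gray_image, params=None):
    """Threshold one grayscale image with every candidate pair.

    Returns a dict mapping (block_size, constant) to the binarized image,
    so the variants can be inspected or passed to the OCR engine in turn.
    """
    import cv2  # third-party; imported here so candidate_params needs no OpenCV

    results = {}
    for block_size, constant in params or candidate_params():
        results[(block_size, constant)] = cv2.adaptiveThreshold(
            gray_image,
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            block_size,
            constant,
        )
    return results
```

Each result can be written out with cv2.imwrite and compared by eye, which is still manual but avoids editing the two values by hand on every run.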

6. Directory Structure of Project

medical-data-extraction
│   .gitignore
│   README.md
│   requirements.txt
│
├───backend
│   │
│   ├───resources
│   │   │
│   │   ├───patient_details
│   │   │       pd_1.pdf
│   │   │       pd_2.pdf
│   │   │
│   │   └───prescription
│   │           pre_1.pdf
│   │           pre_2.pdf
│   │
│   ├───src
│   │       extractor.py
│   │       main.py              //FastAPI backend server
│   │       parser_generic.py
│   │       parser_patient_details.py
│   │       parser_prescription.py
│   │       utils.py
│   │    
│   ├───tests
│   │       test_prescription_parser.py
│   │
│   └───uploads
│
├───frontend
│       app.py              //Streamlit app
│
├───Notebooks
│       01_prescription_parser.ipynb
│       02_patient_details_parser.ipynb
│       03_RegEx.ipynb
│    
└───reference
        tesseract_papar_by_google.pdf

7. If you are cloning this repository

Install all dependencies from requirements.txt.
For pdf2image, you need to download Poppler.
Install the Tesseract OCR engine on your PC.
- Tesseract installation instructions: GitHub
- Tesseract Windows-specific instructions: GitHub
Set the required PATHs for your environment.
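
If Tesseract or Poppler is not on the system PATH, both libraries accept an explicit location. The sketch below reads the locations from environment variables. The variable names TESSERACT_CMD and POPPLER_PATH and the helper names are assumptions for this example, not something the repository defines.

```python
import os


def resolve_tool_paths(env=None) -> dict:
    """Read optional tool locations from environment variables.

    TESSERACT_CMD and POPPLER_PATH are hypothetical variable names.
    A missing variable yields None, meaning "rely on the system PATH".
    """
    env = os.environ if env is None else env
    return {
        "tesseract_cmd": env.get("TESSERACT_CMD"),
        "poppler_path": env.get("POPPLER_PATH"),
    }


def configure_tesseract(paths: dict) -> None:
    """Point pytesseract at the Tesseract executable, if one was given."""
    import pytesseract  # third-party; imported lazily

    if paths["tesseract_cmd"]:
        pytesseract.pytesseract.tesseract_cmd = paths["tesseract_cmd"]


def load_pages(pdf_path: str, paths: dict):
    """Convert a PDF to images, passing the Poppler folder if one was given."""
    from pdf2image import convert_from_path  # third-party; imported lazily

    # poppler_path=None makes pdf2image look for Poppler on the system PATH.
    return convert_from_path(pdf_path, poppler_path=paths["poppler_path"])
```

On Windows, TESSERACT_CMD would typically hold the full path to tesseract.exe and POPPLER_PATH the folder containing the Poppler binaries.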

Shuaib-S/Prescription-OCR-Data-Extraction-