Shuaib-S/Prescription-OCR-Data-Extraction-

Medical Data Extraction

An OCR project that extracts patient and prescription details from PDF documents. The project also includes a backend server that processes data extraction requests.

Overview

  1. What is OCR?
  2. Introduction to Project
  3. Project Execution Steps
  4. Challenges Faced
  5. Directory Structure
  6. If you are cloning my repo?

What is OCR?

OCR stands for Optical Character Recognition. It is a technology that converts different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. Essentially, OCR software identifies text within images or scanned documents and converts it into machine-readable text.

Machine learning (ML) plays a significant role in modern OCR. Models such as convolutional neural networks (CNNs) are trained on large datasets to recognize characters and extract key features from images, and language models use context to improve accuracy, especially for ambiguous characters. OCR systems can also be trained or fine-tuned for specific domains and languages to improve their performance.

Here are some common applications and domains where OCR is used:

  1. Document Digitization: OCR is extensively used to convert scanned documents, PDFs, and images into editable and searchable text. This is useful in offices, libraries, and archives for digitizing large volumes of documents for easier storage, retrieval, and sharing.

  2. Data Entry Automation: OCR automates data entry processes by extracting text from documents such as invoices, receipts, and forms. This saves time and reduces errors associated with manual data entry tasks.

  3. Banking and Finance: OCR is employed in banking for reading checks, processing forms, and extracting information from financial documents. It facilitates faster processing of transactions and improves accuracy in tasks like check reading and automated form filling.

Introduction to Project

Whenever we visit a hospital, we fill out forms, and our medical history is built from those forms, prescriptions, and test reports. This medical history is sometimes used for other purposes, such as claiming health insurance.

A health insurance company might receive thousands of such documents from multiple sources. Creating a record of the useful information in customers' medical histories by hand is cumbersome and requires a lot of manpower, so this kind of task can be sped up using OCR technology.

For this project we have two types of Medical Documents.

  1. Patient Medical Record
  2. Prescription

We are going to extract some important fields from these documents.

Why this project?

I have been learning Data Science, so why am I doing this project? There are three main reasons.

  1. OCR is a subset of Computer Vision, and its output can feed an NLP project, such as summarizing text using an LLM.
  2. This project involves fundamental Python programming concepts such as OOP and modular programming, which are industry best practices.
  3. This project also involves building a backend server using FastAPI, which is known for its performance and is used by well-known companies such as Uber, Netflix and Microsoft to build their applications.

Project Execution Steps

  • Step 1: Convert the PDF to images using the pdf2image library.
  • Step 2: Preprocess each image (apply adaptive thresholding to binarize it using OpenCV).
  • Step 3: Extract text from the image by passing it through the Tesseract OCR engine.
  • Step 4: Find the useful information in the text using RegEx and return it in JSON format.
  • Step 5: Create a FastAPI backend server that serves data extraction requests by accepting a pdf_file and a file_format and returning a JSON object.
  • Step 6: Create a demo frontend UI using Streamlit and connect it to the FastAPI server using the Python Requests module.
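
Steps 1 to 3 can be sketched as a single function. This is a minimal sketch and not the repository's actual extractor.py: the function name `extract_text`, the default `block_size` and `c` values, and the choice of the Gaussian adaptive method are assumptions made for illustration. It also assumes that pdf2image (with Poppler), opencv-python, numpy and pytesseract (with the Tesseract binary) are installed.

```python
def extract_text(pdf_path, block_size=61, c=11, lang="eng"):
    """Convert a PDF to images, binarize each page and run Tesseract on it.

    block_size and c are hypothetical starting values; they usually need
    tuning per document type.
    """
    # OpenCV requires an odd neighborhood size greater than 1.
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")

    # Third-party imports live inside the function so the module can be
    # imported even on a machine where they are not installed yet.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    # Step 1: one PIL image per PDF page.
    pages = convert_from_path(pdf_path)

    page_texts = []
    for page in pages:
        # Step 2: grayscale, then adaptive thresholding to binarize.
        gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
        binary = cv2.adaptiveThreshold(
            gray,
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            block_size,
            c,
        )
        # Step 3: OCR on the binarized page.
        page_texts.append(pytesseract.image_to_string(binary, lang=lang))

    return "\n".join(page_texts)
```

Calling `extract_text("backend/resources/prescription/pre_1.pdf")` would return the OCR text of that sample document, which Step 4 then searches with RegEx.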

What I Learned

  • How to use OCR for real-world projects, and key image processing concepts such as thresholding using OpenCV.
  • Polished my Python coding skills through OOP, code refactoring and modular programming.
  • Setting up a backend server using the FastAPI framework.
  • Unit testing using Pytest.
  • How to use Postman for API testing.
  • Connecting a Streamlit frontend to a FastAPI backend server using the Python requests module.
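
The connection between the FastAPI backend and a client using the requests module could look roughly like the sketch below. The route name `/extract_from_doc` and the function names are hypothetical, and the endpoint only echoes basic information instead of running the real extraction. The form fields `pdf_file` and `file_format` follow the description in Step 5. It assumes fastapi, python-multipart and requests are installed.

```python
def create_app():
    """Build a FastAPI app with one upload endpoint (hypothetical route name)."""
    from fastapi import FastAPI, File, Form, UploadFile

    app = FastAPI()

    @app.post("/extract_from_doc")
    async def extract_from_doc(
        file_format: str = Form(...),
        pdf_file: UploadFile = File(...),
    ):
        pdf_bytes = await pdf_file.read()
        # Placeholder: a real server would save the file and run the
        # OCR + parsing pipeline here, then return the extracted fields.
        return {
            "file_format": file_format,
            "filename": pdf_file.filename,
            "size_bytes": len(pdf_bytes),
        }

    return app


def build_upload_kwargs(pdf_bytes, filename, file_format):
    """Keyword arguments for requests.post that send a PDF as multipart form data."""
    return {
        "files": {"pdf_file": (filename, pdf_bytes, "application/pdf")},
        "data": {"file_format": file_format},
    }


def request_extraction(url, pdf_bytes, filename, file_format):
    """POST the PDF to the backend and return the decoded JSON response."""
    import requests

    response = requests.post(
        url, timeout=60, **build_upload_kwargs(pdf_bytes, filename, file_format)
    )
    response.raise_for_status()
    return response.json()
```

In a Streamlit app, the bytes and file name would typically come from the object returned by `st.file_uploader`, using its `getvalue()` method and `name` attribute, before being passed to `request_extraction`.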

Challenges Faced

  • Adaptive thresholding requires a lot of trial and error to reach good values for the block size and the constant.
  • Pytest did not integrate properly with VSCode in my setup.
  • I also faced path-related errors during unit testing, even in PyCharm.
  • When creating the Streamlit app, I found very few practical instructions on the internet for connecting it to a backend server and sending files across.
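
To see why the block size and the constant need tuning, the sketch below applies the mean adaptive threshold rule in plain Python: each pixel is compared with the mean of its block_size x block_size neighborhood minus a constant c. It only illustrates the rule and is not OpenCV's implementation, which is far faster and may differ in rounding and border handling. The function name and the toy image are made up for this example.

```python
def adaptive_threshold_mean(image, block_size, c, max_value=255):
    """Binarize a grayscale image (list of rows of ints) with a local-mean threshold."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")

    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp coordinates so border pixels are repeated.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            threshold = total / (block_size * block_size) - c
            row.append(max_value if image[y][x] > threshold else 0)
        result.append(row)
    return result


# Toy 5x5 "page": light background with one dark pixel in the middle.
image = [[200] * 5 for _ in range(5)]
image[2][2] = 50

# Try a few (block_size, c) pairs and count how many pixels turn black.
for block_size, c in [(3, 10), (3, -20), (5, 10)]:
    binary = adaptive_threshold_mean(image, block_size, c)
    dark = sum(pixel == 0 for row in binary for pixel in row)
    print(f"block_size={block_size}, c={c}: {dark} dark pixels")
```

A constant that is too small (here a negative one) pushes the threshold above the background and blackens the whole image, while a suitable pair keeps only the dark pixel. On real scans the same trade-off decides whether faint text survives or background noise is kept.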

Directory Structure

medical-data-extraction
│   .gitignore
│   README.md
│   requirements.txt
│
├───backend
│   │
│   ├───resources
│   │   │
│   │   ├───patient_details
│   │   │       pd_1.pdf
│   │   │       pd_2.pdf
│   │   │
│   │   └───prescription
│   │           pre_1.pdf
│   │           pre_2.pdf
│   │
│   ├───src
│   │       extractor.py
│   │       main.py              //Fastapi Backend Server
│   │       parser_generic.py
│   │       parser_patient_details.py
│   │       parser_prescription.py
│   │       utils.py
│   │    
│   ├───tests
│   │       test_prescription_parser.py
│   │
│   └───uploads
│
├───frontend
│       app.py              //Streamlit app
│
├───Notebooks
│       01_prescription_parser.ipynb
│       02_patient_details_parser.ipynb
│       03_RegEx.ipynb
│    
└───reference
        tesseract_papar_by_google.pdf
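
The file names parser_generic.py, parser_prescription.py and parser_patient_details.py suggest a shared base class with one subclass per document type. Below is a minimal sketch of such a layout combined with RegEx field extraction. The class names, field names, patterns and sample text are all hypothetical and are not taken from the repository.

```python
import json
import re
from abc import ABC, abstractmethod


class MedicalDocParser(ABC):
    """Base class: holds the OCR text and offers a shared RegEx helper."""

    def __init__(self, text):
        self.text = text

    @abstractmethod
    def parse(self):
        """Return the extracted fields as a dictionary."""

    def get_field(self, pattern, flags=0):
        """Return the first capture group of pattern, or None if absent."""
        match = re.search(pattern, self.text, flags)
        return match.group(1).strip() if match else None


class PrescriptionParser(MedicalDocParser):
    """Parser for prescriptions (field labels below are assumed)."""

    def parse(self):
        return {
            "patient_name": self.get_field(r"Name:\s*(.*?)\s+Date:"),
            "address": self.get_field(r"Address:\s*(.*)"),
            "medicines": self.get_field(r"Medicine:\s*(.*)"),
            "directions": self.get_field(r"Directions:\s*(.*)"),
            "refill": self.get_field(r"Refill:\s*(\d+)"),
        }


# Made-up OCR output, used only to exercise the parser.
SAMPLE_TEXT = """Name: Jane Doe Date: 01/02/2023
Address: 12 Example Street
Medicine: Paracetamol 500mg
Directions: One tablet twice a day
Refill: 2 times
"""

if __name__ == "__main__":
    print(json.dumps(PrescriptionParser(SAMPLE_TEXT).parse(), indent=2))
```

With this structure, a patient details parser would be another subclass that reuses `get_field` with its own patterns, and the backend could pick the subclass based on the `file_format` it receives.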
