Patient Data ETL Pipeline

Developed by Sushant Sinha as part of the Bombardier Assessment.

This repository contains a Python ETL (Extract, Transform, Load) pipeline that cleans and processes patient data from a hospital. The pipeline removes protected health information (PHI), handles missing and invalid values, normalizes the data, and stores the cleaned data in a CSV file. It also includes unit tests that check data integrity and correctness.

Features

Removes PHI (names, addresses, etc.) from the dataset.
Handles missing values and invalid data (e.g., NaN, inf, negative values).
Normalizes and cleans the data.
Adds columns for average glucose levels and diabetes diagnosis.
Excludes outliers when calculating mean values.
Stores the cleaned data in a CSV file.
Includes comprehensive unit tests.
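
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in diabetesDiagnosis.py: the column names (`name`, `address`, `glucose`), the helper functions, and the 1.5 * IQR outlier rule are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical PHI column names; the real dataset's schema may differ.
PHI_COLUMNS = ["name", "address"]


def drop_phi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without whichever PHI columns are present."""
    return df.drop(columns=[c for c in PHI_COLUMNS if c in df.columns])


def mask_invalid(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace inf and negative readings in the given columns with NaN."""
    out = df.copy()
    for col in columns:
        values = out[col].astype(float).replace([np.inf, -np.inf], np.nan)
        # where() keeps values that satisfy the condition and sets the rest to NaN.
        out[col] = values.where(values >= 0)
    return out


def mean_without_outliers(values: pd.Series) -> float:
    """Mean of the values inside the 1.5 * IQR fences (an assumed outlier rule)."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    kept = clean[(clean >= q1 - 1.5 * iqr) & (clean <= q3 + 1.5 * iqr)]
    return float(kept.mean())


# Made-up sample rows, used only to exercise the helpers.
raw = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "glucose": [90.0, 100.0, -5.0, np.inf, 110.0, 1000.0],
})
cleaned = mask_invalid(drop_phi(raw), ["glucose"])
glucose_mean = mean_without_outliers(cleaned["glucose"])
print(cleaned)
print(glucose_mean)
```

In this sample, the negative and infinite readings become NaN, and the reading of 1000 falls outside the upper fence, so it does not contribute to the mean.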

Requirements

Python 3.6+
pandas
numpy
unittest (part of the Python standard library; used for running tests)

Installation

Clone the repository:

```
git clone https://github.com/sushant-sinha/Bombardier-Assessment.git
cd Bombardier-Assessment
```

Install the required dependencies:
```
pip install pandas numpy
```

Usage

Place your input CSV file (e.g., patient_data.csv) in the project directory.

Update the file paths in diabetesDiagnosis.py:

```
file_path = 'path_to_your_file/patient_data.csv'
output_file_path = 'path_to_your_output/diabetes_diagnosis_data.csv'
```

Run the ETL script:
```
python diabetesDiagnosis.py
```
The processed data will be saved to the specified output file path.
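
As a rough picture of how the two path variables fit into an extract, transform, load flow, here is a minimal sketch. The `run_pipeline` function and the placeholder transform step are hypothetical and are not taken from diabetesDiagnosis.py; only `file_path` and `output_file_path` mirror the names used above.

```python
import os
import tempfile

import pandas as pd


def run_pipeline(file_path: str, output_file_path: str) -> pd.DataFrame:
    """Hypothetical driver: extract from CSV, transform, load to CSV."""
    df = pd.read_csv(file_path)                      # extract
    df = df.drop(columns=["name"], errors="ignore")  # transform (placeholder step)
    df.to_csv(output_file_path, index=False)         # load
    return df


# Demonstration in a throwaway directory with made-up data, not real patient records.
with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "patient_data.csv")
    output_file_path = os.path.join(tmp, "diabetes_diagnosis_data.csv")
    pd.DataFrame({"name": ["A"], "glucose": [95.0]}).to_csv(file_path, index=False)
    result = run_pipeline(file_path, output_file_path)
    written = pd.read_csv(output_file_path)

print(written)
```

Passing the paths as function arguments, rather than editing them in the script, is one way to keep the same code usable for both real runs and tests.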

Testing

To run the unit tests, use the following command:
```
python testDiabetesDiagnosis.py
```
The tests will verify the correctness of the ETL functions and ensure data integrity.
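
For readers unfamiliar with unittest, a test for one cleaning step might look like the sketch below. The `replace_invalid` function is a hypothetical stand-in, not one of the functions tested in testDiabetesDiagnosis.py.

```python
import unittest

import numpy as np
import pandas as pd


def replace_invalid(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in for an ETL function: inf and negatives become NaN."""
    values = values.astype(float).replace([np.inf, -np.inf], np.nan)
    return values.where(values >= 0)


class TestReplaceInvalid(unittest.TestCase):
    def test_inf_and_negative_become_nan(self):
        result = replace_invalid(pd.Series([1.0, -2.0, np.inf]))
        self.assertEqual(result.isna().tolist(), [False, True, True])

    def test_valid_values_are_unchanged(self):
        result = replace_invalid(pd.Series([0.0, 120.5]))
        self.assertEqual(result.tolist(), [0.0, 120.5])


# Run the tests programmatically so the sketch also works outside a test file.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestReplaceInvalid)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a standalone test file, the last two lines are typically replaced by a call to `unittest.main()` under an `if __name__ == "__main__":` guard, which lets the file be run directly with `python`.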
