Lung Cancer Detection is a project made as part of the Engineer's Thesis "Applications of artificial intelligence in oncology on computer tomography dataset" by Jakub Owczarek, under the guidance of thesis advisor dr hab. inż. Mariusz Mlynarczuk, prof. AGH.
The goal of this project is to process the LIDC-IDRI dataset and evaluate the performance of deep learning models pre-trained on ImageNet by leveraging transfer learning.
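To illustrate the transfer-learning idea, here is a minimal sketch of an ImageNet-pretrained backbone with a new classification head. The backbone choice (ResNet50), input size, and head layout are assumptions for illustration, not the actual model builder used in `src/model`.

```python
# Hedged sketch of transfer learning for nodule classification: a frozen
# ImageNet-pretrained backbone plus a small trainable head.
# Architecture details here are illustrative assumptions, not the thesis code.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_transfer_model(input_shape=(224, 224, 3), weights="imagenet"):
    # Pre-trained backbone without its ImageNet classification head.
    base = ResNet50(include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # freeze pre-trained features for transfer learning
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # e.g. benign vs. malignant
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

After initial training of the head, the backbone can optionally be unfrozen for fine-tuning at a lower learning rate.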
This repository contains the following directories:
- `docs` - markdown files with more detailed descriptions of the project components
- `notebooks` - Jupyter notebooks used for experiments, analysis, visualizations, etc.
- `scripts` - the actual workhorse, with two notable subdirectories:
  - `azure` - scripts for the Azure Virtual Machine and Azure Machine Learning
  - `local` - scripts used for local development
- `src` - the main components of the project:
  - `azure` - utilities specific to Azure services
  - `dataset` - the `DatasetLoader` component used to feed data during model training
  - `model` - model builder and director classes
  - `preprocessing` - classes used for LIDC-IDRI dataset preprocessing
  - `config.py` - constants used throughout the project
- `tests` - a few tests for the project components
This project was created with Azure in mind, so the main scripts are meant to be run on Azure.
- First, download the LIDC-IDRI dataset onto an Azure Virtual Machine. The `azure/virtual_machine/download_dataset.sh` script is meant for this task.
- Then preprocess the dataset into a format suitable for supervised deep learning model training. The `azure/virtual_machine/process_dataset.py` script is meant for this task. The same directory also contains `train_test_split.py`, which should be used to split the processed data.
- Finally, upload the preprocessed dataset to Azure Blob Storage with the `upload_dataset_2.sh` script. There is also an `upload_dataset.sh` script, but it does not use the `azcopy` utility and is too slow.
- With the preprocessed dataset on Azure Blob Storage, the Virtual Machine is no longer necessary. From this dataset an Azure Machine Learning data asset can be created and used during model training.
- To run the actual model training, use the `run_training_job.py` script under `scripts/azure/machine_learing`. This script creates a job on AML that builds, compiles, and trains the desired model.
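The train/test split step above can be sketched as a simple file-level partition of the preprocessed samples. The 80/20 ratio, the `.npy` file extension, and the flat directory layout are illustrative assumptions, not the actual `train_test_split.py` logic.

```python
# Hedged sketch of a file-level train/test split for preprocessed samples.
# Ratio, file extension, and layout are assumptions, not the real script.
import random
import shutil
from pathlib import Path

def split_dataset(processed_dir, out_dir, test_fraction=0.2, seed=42):
    """Copy preprocessed files into train/ and test/ subdirectories."""
    files = sorted(Path(processed_dir).glob("*.npy"))
    random.Random(seed).shuffle(files)  # deterministic shuffle for reproducibility
    n_test = int(len(files) * test_fraction)
    splits = {"test": files[:n_test], "train": files[n_test:]}
    for name, subset in splits.items():
        target = Path(out_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy2(f, target / f.name)
    return {name: len(subset) for name, subset in splits.items()}
```

Splitting at the file level (rather than per-slice) avoids leaking slices of the same scan into both sets, which is the usual concern with medical imaging data.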
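For orientation, the job-submission step might look roughly like the sketch below, using the Azure ML v2 Python SDK (`azure-ai-ml`). The workspace identifiers, environment name, data asset name, and compute target are all placeholders, and this is not the actual `run_training_job.py` code.

```python
# Hedged sketch of submitting a training job to Azure Machine Learning.
# All resource names below are hypothetical placeholders.
from azure.ai.ml import Input, MLClient, command
from azure.identity import DefaultAzureCredential

def build_training_job():
    # Describe the job: source code, entry command, data input,
    # environment, and compute target (names are illustrative only).
    return command(
        code="./src",
        command="python train.py --data ${{inputs.data}}",
        inputs={"data": Input(type="uri_folder",
                              path="azureml:lidc-preprocessed:1")},
        environment="azureml:lung-cancer-train-env:1",
        compute="gpu-cluster",
        experiment_name="lidc-transfer-learning",
    )

if __name__ == "__main__":
    # Submission requires valid workspace credentials (placeholders below).
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace>",
    )
    ml_client.jobs.create_or_update(build_training_job())
```

The data input references the Azure Machine Learning data asset created from the Blob Storage upload in the earlier step, so the compute cluster mounts the preprocessed dataset directly.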
This project is licensed under the MIT License - see the `LICENSE.md` file for details.