Add Training Process for Nodule Detection and Classification - added customized datasets #300
… the grt algorithm with improvements for handling custom data. Using the documented process in the Readme, a developer can prepare custom datasets from radiologists who have annotated series of CT scans. The data should have lesion box annotations in a .csv file using the format specified. An example using a CT scan data set from a Taiwan-based clinic is included. The data should also have cancer/non-cancer labels.
…ded custom annotation file example
def load_scan(dirpath):
    print('loading scan %s' % dirpath)

    if dirpath.startswith('s3://'):
Have you seen urlparse, ooi?
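For reference, a minimal sketch of what this comment points at: the standard library's urlparse splits a location into scheme, bucket, and key, which avoids matching on a string prefix. The helper name split_scan_location is hypothetical and is not part of the PR.

```python
from urllib.parse import urlparse  # Python 3; on Python 2 this lives in the `urlparse` module


def split_scan_location(dirpath):
    """Return (is_s3, bucket, key_or_path) for a scan location.

    Hypothetical helper for illustration; not taken from the PR.
    """
    parsed = urlparse(dirpath)
    if parsed.scheme == 's3':
        # For s3://bucket/some/key, netloc is the bucket and path is '/some/key'.
        return True, parsed.netloc, parsed.path.lstrip('/')
    # Anything else is treated as a local path and returned unchanged.
    return False, None, dirpath
```

With this, split_scan_location('s3://my-bucket/scans/patient1') separates the bucket from the key, while a local directory passes through untouched.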
@@ -0,0 +1,41 @@
function AddSegmentation(SegmentDataFolder, FolderDelimiter, BatchSize, ParFor_flag, IgnoreExisting_flag)
What's the reason for using a language other than Python here?
    return bw


def all_slice_analysis(bw, spacing, cut_num=0, vol_limit=[0.68, 8.2], area_th=6e3, dist_th=62):
Could you add a few docstrings so it's easier to grasp what the functions are expecting and doing? :)
end_time = time.time()

print('elapsed time is %3.2f seconds' % (end_time - start_time))
I think you can achieve the same by appending two '\n' to the previous print statement :)
Also, that last print statement will error in py3 since print is a function.
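A sketch of the two suggestions combined, assuming the intent is simply to leave a gap after the timing line: the newlines are folded into the message, and print is called as a function so the same line runs on Python 3. The function name format_elapsed is made up for this example.

```python
import time


def format_elapsed(start_time, end_time):
    """Build the timing message with two trailing newlines (hypothetical helper)."""
    return 'elapsed time is %3.2f seconds\n\n' % (end_time - start_time)


start_time = time.time()
# ... the work being timed would go here ...
end_time = time.time()

# print() adds its own newline after the two already in the message.
print(format_elapsed(start_time, end_time))
```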
Would you mind converting your code to comply with PEP8? There are a few things that need to be fixed according to flake8 and pycodestyle :)
Thanks for the review @WGierke :)
I think we already have those preprocessing steps. Converting the data to voxels, clipping the Hounsfield units that are soft tissue and rescaling the image is a very common practice among the top solutions. Could you have a look at lung_segmentation.py and improved_lung_segmentation.py? There is already lots of logic there that might be useful for the steps you defined, I think :)
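As a rough illustration of the clipping and rescaling step mentioned here (not the code in lung_segmentation.py or improved_lung_segmentation.py, which is not shown in this thread), the sketch below clips Hounsfield units to a window and maps them linearly to 0-255. The window of -1200 to 600 HU and the function name are assumptions chosen for the example.

```python
def clip_and_rescale(hu_values, hu_min=-1200.0, hu_max=600.0):
    """Clip Hounsfield units to [hu_min, hu_max] and map them linearly to 0..255.

    The default window is an assumption for illustration only.
    """
    rescaled = []
    for hu in hu_values:
        # Values outside the window are pinned to its nearest edge.
        clipped = min(max(hu, hu_min), hu_max)
        fraction = (clipped - hu_min) / (hu_max - hu_min)
        rescaled.append(int(round(fraction * 255.0)))
    return rescaled
```

On real CT volumes this would normally be a vectorized array operation rather than a Python loop; the loop is used here only to keep the example dependency-free.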
@swarm-ai We'll need quite a bit more context for the PR. This is a big PR with very little reference to any of the pieces of the existing project.
What are these? Where do they come from? Could they be expected to come from new CT imagery without hand labeling?
This is extremely interesting, but it is hard to envision how to integrate this when it comes right before the end of the last phase.
We've discussed internally, and have concluded that both of the following points are true:
We're going to close the PR but we encourage community members to use this as a resource to help inform model training and potentially other pieces of the application. The submission will be recognized for this aspect of contribution under the "Community" heading.
Hi @isms, can you give me 1-2 days to work on resolving these issues? I only just saw these comments.
@swarm-ai You are more than welcome to keep working on the PR if you'd like but at this point it won't result in additional points. Feel free to email us directly if you have questions or concerns.
Description
Using the documented process in the Training/Readme, a developer can prepare custom datasets from radiologists who have annotated series of CT scans. The data should have lesion box annotations in a .csv file using the format specified. An example using a CT scan data set from a Taiwan-based clinic is included. The data should also have cancer/non-cancer labels.
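The exact annotation format is defined in the Training/Readme and is not reproduced in this thread. As a sketch only, the snippet below assumes LUNA16-style columns (seriesuid, coordX, coordY, coordZ, diameter_mm) and groups lesion annotations by scan. The column names, the function name, and the sample values are all assumptions.

```python
import csv
import io
from collections import defaultdict


def read_annotations(csv_file):
    """Group lesion annotations by series UID.

    Assumes hypothetical LUNA16-style columns; adjust to the real format.
    """
    lesions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        lesions[row['seriesuid']].append({
            'center': (float(row['coordX']), float(row['coordY']), float(row['coordZ'])),
            'diameter_mm': float(row['diameter_mm']),
        })
    return dict(lesions)


# Usage with an in-memory file holding made-up values.
sample = io.StringIO(
    'seriesuid,coordX,coordY,coordZ,diameter_mm\n'
    'scan-001,-24.0,192.5,-391.0,8.1\n'
    'scan-001,2.4,172.0,-405.5,18.5\n'
)
annotations = read_annotations(sample)
```

Grouping by series keeps every lesion of a scan together, which is convenient when the scan volume is loaded once and all of its annotations are applied in one pass.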
Reference to official issue
Issue #130
Issue #131
Motivation and Context
The motivation is to increase the available training examples so that the concept-to-clinic classifier can handle complex lung cancer cases beyond those in the Luna and LIDC data sets. We have seen improved model accuracy in a preliminary run using additional data sets. A new model is currently being trained and is now on epoch 80.
How Has This Been Tested?
We have run the training process using the Luna, LIDC, and NSCLC-Radiomics data sets. The NSCLC-Radiomics data set contains 422 cases of non-small cell lung cancer. We label these data sets with lesion location information and cancer/non-cancer labels using the software Horos. We then import this data for training alongside the Luna16 and LIDC data sets. Here is a reference link to download the data sets: http://www.cibl-harvard.org/data
CLA