Use BART word embedder to do the abstractive summarization NLP task on radiology reports from MIMIC-CXR v2.0.0.
- clone this repository
- in this repository,
mkdir data
- from MIMIC-CXR download cxr-study-list.csv.gz and place it in the data folder
- also from MIMIC-CXR, download mimic-cxr-reports.zip, unzip it, and place it in the data folder
- run
python preprocessing
, which generates train.csv, validate.csv, and test.csv in the data folder - use the
notebook/*.ipynb
files to run the various BART models- the baseline model creates a summaries.csv file in data/ that contains both ground-truth and generated summaries
- the fine-tuned and fune-tuned with truncation models create a
.txt
file, that must be converted to a.csv
file by runningpython merge_txt_csv.py
- run
python rouge_score
orpython word_count
to get the ROUGE scores and word count of the summaries respectively.
Note: You may need to go into the python file to set the proper path or filename, which is all configurable from variables at the top of the file.