Skip to content

Latest commit

 

History

History
118 lines (84 loc) · 5.89 KB

README.md

File metadata and controls

118 lines (84 loc) · 5.89 KB

Weakly Supervised Clinical Text Classification

4/6/2020 COVID-19 Update

An updated version of this pipeline was used in our Medium blog post on rapidly identifying symptoms of SARS-CoV-2 in ED (emergency department) assessment notes. We've updated this repo to release our labeling functions and code to the research community.

  1. Concept tagging pipeline concept-tagger.py
  2. Jupyter notebook demo for classifying evidence of complex symptoms (e.g., recent international travel). tutorials/COVID19_Travel_Risk_Factor.ipynb

This repo is a work in progress! We're continuing to iterate on our models and labeling functions. Please let us know if you would like to help contribute!

Overview

This library provides tools for rapidly building clinical text classification tasks using weakly supervised machine learning. Obtaining labeled training data is a common roadblock to using machine learning with unstructured medical data such as patient notes. Weakly supervised methods allow domain experts to quickly refine training set construction, enabling the use of modern deep learning without time consuming manual training data curation.

This library enables integrating common clinical text heuristics and other noisy labeling sources for use with Stanford's weak supervision framework Snorkel. We developed these tools while working on our npj Digital Medicine paper "Medical Device Surveillance with Electronic Health Records" focusing on lightweight code that makes it easier to extract real-world patient outcomes from clinical notes.

Features (2/4/2020)

  • Fast text tokenization, sentence boundary detection, and NLP preprocesssing using custom spaCy modules optimized for clinical and biomedical text.
  • Lightweight information extraction pipeline for span and relation classification with multiprocessing support via Joblib.
  • Tagging support for UMLS concepts, TIMEX3, and custom dictionary/regular expression concept matching.
  • Labeling functions for common, clinical concept attribute classification tasks including:
    • Parent Section Header
    • Negation / Hypothetical / Historical
    • Datetime Canonicalization
    • Event Time Delta (in days preceding doc timestamp)
    • Laterality
    • Family vs. Patient-linked Concept

Contents

Installation

All requirements can be installed via conda. To create a new virtual enviornment and install all dependencies

conda create --yes -n rwe python=3.6
conda activate rwe
conda install pytorch==1.1.0 -c pytorch
conda install snorkel==0.9.3 -c conda-forge
conda install --name rwe -c conda-forge -c pytorch --file requirements.txt
python -m spacy download en

Once the environment is configured, you can launch the tutorial notebooks with

conda activate rwe
cd tutorials
jupyter notebook

Tutorials

Fast Document Preprocessing

The spaCy pipeline and documentation for processing large document collections is found at preprocessing/

Building Training Sets

We've provided a tutorial for tagging clinical concepts and doing relational inference with MIMIC-III data at tutorials/

Reproducing Paper Results

This framework was used in our paper "Medical Device Surveillance with Electronic Health Records" where we described two relation extraction tasks {Pain, Anatomy} and {Complication, Implant} used to evaluate the real-world performance of artifical hip replacements using Stanford Health Care data.

The original code was written using Snorkel v0.7 and Snorkel MeTaL which are both now deprecated. We've written a new simple tutorial pipeline for use with the latest release of Snorkel v0.9.4 using MIMIC-III clinical notes. This new library incorporates many of the lessons we learned while building our original models. We strongly recomend using this tutorial as the basis for any new projects.

The complete labeling functions used in the paper are also available for reference

  • {Pain, Anatomy} legacy/pain.py
  • {Complication, Implant} legacy/implant_complications.ipynb

Citations

If you make use of any of these tools, please cite

@article{Callahan2019,
	author  = {Callahan, Alison and 
	           Fries, Jason A. and 
	           R{\'e}, Christopher and
	           Huddleston, James I. and 
	           Giori, Nicholas J. and
	           Delp, Scott and 
	           Shah, Nigam H.},
	title   = {Medical device surveillance with electronic health records},
	journal = {{npj Digital Medicine}},
	volume  = 2,
	number  = 1,
	pages   = 94,
	year    = 2019,
	url     = {https://www.nature.com/articles/s41746-019-0168-z.pdf},
	doi     = {10.1038/s41746-019-0168-z}
}

and

@article{RatnerVLDB2017,
  author    = {Ratner, Alexander and
               Bach, Stephen H. and
               Ehrenberg, Henry R. and
               Fries, Jason A. and
               Wu, Sen and
               R{\'{e}}}, Christopher 
  title     = {Snorkel: Rapid Training Data Creation with Weak Supervision},
  journal   = {{PVLDB}},
  volume    = {11},
  number    = {3},
  pages     = {269--282},
  year      = {2017},
  url       = {http://www.vldb.org/pvldb/vol11/p269-ratner.pdf},
  doi       = {10.14778/3157794.3157797}
}