This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

Generation of the i2b2 PHI dataset for the NLP Sandbox


nlpsandbox/i2b2-phi-dataset


This notebook is now available in the repository nlpsandbox/notebooks.

nlpsandbox.io

Generating the i2b2 PHI dataset for the NLP Sandbox


Introduction

NLPSandbox.io is an open platform for benchmarking modular natural language processing (NLP) tools on both public and private datasets. Academics, students, and industry professionals are invited to browse the available tasks and participate by developing and submitting an NLP Sandbox tool.

One of the datasets used to benchmark the performance of PHI annotators on NLPSandbox.io is the 2014 i2b2 NLP De-identification Challenge Dataset. This dataset is publicly available and can be used by NLP developers to locally test their tools before submitting them to the NLP Sandbox. Once submitted, PHI annotators will be evaluated on the 2014 i2b2 dataset as well as on private datasets provided by different partner organizations, including MCW, Mayo Clinic and UW.

To use the i2b2 dataset to develop your NLP Sandbox PHI annotator, its annotations must first be mapped to the annotations defined by the NLP Sandbox schemas. Because anyone downloading the i2b2 dataset must first agree not to redistribute it, even in a modified form, we cannot publish the mapped dataset. Instead, we provide a dockerized R notebook that you can run to generate the mapped files yourself. The files generated at the end of the notebook can then be pushed to a local or remote instance of the NLP Sandbox Data Node using the NLP Sandbox CLI.

Specification

  • NLP Sandbox schemas version: 1.2.0
  • NLP Sandbox dataset
    • Name: i2b2-phi-dataset
    • Version: 1.2.1

Requirements

Notebooks

| Rmd Notebook | Description | HTML Notebook |
|---|---|---|
| generate-dataset.Rmd | Generation of the i2b2 PHI dataset for the NLP Sandbox. | HTML notebook |

Important: When you write your own notebooks, make sure that no sensitive information ends up being publicly available. Check with the information security officer of your organization to confirm that the approach described here can be applied to your use case.

Usage

  1. Create and edit the configuration file.

    cp .env.example .env
    
  2. Start RStudio. Add the option -d or --detach to run in the background.

    docker compose up
    

RStudio is now available at http://localhost. On the login page, enter the default username (rstudio) and the password specified in .env.

To stop RStudio, enter Ctrl+C followed by docker compose down. If running in detached mode, you will only need to enter docker compose down.
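
The steps above can be wrapped in a few shell helpers. This is only a minimal sketch: the function names `ensure_env`, `start_rstudio`, and `stop_rstudio` are hypothetical and not part of this repository. It assumes that it is run from the repository root, where `.env.example` and the Compose file live. The sketch copies the template only when `.env` is missing, so an already edited configuration is not overwritten.

```shell
# Hypothetical helpers (not shipped with this repository).

# Create .env from the template only if it does not exist yet,
# so that an already edited .env is never overwritten.
ensure_env() {
  if [ ! -f .env ]; then
    cp .env.example .env
  fi
}

# Start RStudio in the background (detached mode).
start_rstudio() {
  ensure_env && docker compose up --detach
}

# Stop RStudio and remove the containers created by `docker compose up`.
stop_rstudio() {
  docker compose down
}
```

After defining these functions in your shell, `start_rstudio` brings RStudio up in detached mode and `stop_rstudio` shuts it down. Remember to edit `.env` (in particular the password) before the first start.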

Configuring the CI/CD workflow

The CI/CD workflow of this repository performs the following actions:

If you fork this repository, you will need to update the environment variables defined at the top of the CI/CD workflow. You also need to create the following GitHub Secrets:

Versioning

GitHub tags

This repository uses semantic versioning to track its releases. It also uses "non-moving" GitHub tags: once a tag has been created, it always points to the same git commit.

GitHub Pages

The artifacts published by this repository are the HTML notebooks, published to GitHub Pages, and the Docker image docker.synapse.org/syn22277123/i2b2-phi-dataset.

The table below describes the GH Pages tags available.

| Tag name | Moving | Description |
|---|---|---|
| latest | Yes | Latest stable release. |
| edge | Yes | Latest commit made to the default branch. |
| edge-<sha> | No | Same as above with the reference to the git commit. |
| <major>.<minor>.<patch> | No | Stable release. |

You should avoid using a moving tag like latest when deploying containers in production, because this makes it hard to track which version of the image is running and hard to roll back.
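
As an illustration of this advice, a deployment script could refuse moving tags before pulling an image. The helper below is a minimal sketch, not something this repository provides: the name `classify_tag` is hypothetical, and the patterns are derived only from the tag table above.

```shell
# Hypothetical helper: classify a tag name according to the tag table.
# Prints "moving", "fixed", or "unknown".
classify_tag() {
  case "$1" in
    latest|edge)
      echo "moving" ;;
    edge-?*)
      # edge-<sha>: pinned to a specific git commit
      echo "fixed" ;;
    *)
      # <major>.<minor>.<patch>: a stable release
      if printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
        echo "fixed"
      else
        echo "unknown"
      fi ;;
  esac
}
```

A production deployment script could then abort unless `classify_tag "$TAG"` prints `fixed`, where `TAG` is whatever variable holds the requested tag.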

License

Apache License 2.0