README

A breakdown of everything in the extractor repository

Installation

Currently, installation has only been tested on Ubuntu versions 14.04 and 16.04.

Instructions

Run the following commands in your terminal:

sudo apt-get install python-pip python-dev build-essential python-tk

sudo -H pip install --upgrade pip

sudo -H pip install --upgrade virtualenv

Now you can install the scripts either from source or directly through pip.

Via pip:

To install via pip, run the command:

sudo -H pip install NHANES_semantic_data_dictionary_annotation

From source:

Install git from the terminal:

sudo apt-get install git

Change to your desired installation directory in the terminal, and then run:

git clone https://github.com/rashidsabbir/extractor.git

Change directory to extractor

cd extractor

And install

sudo -H pip install -e .

Now the NHANES_extractor_exp.py script is ready to run from the terminal. The annotation_engine.py script requires a few more steps, detailed below.

NHANES Extractor

The scripts "NHANES_extractor.py" and "NHANES_extractor_exp.py" automatically extract semantic data dictionaries (SDDs), codebooks, and acq files from web-based sources. However, the automatically extracted SDDs are incomplete and their entries may not always be correct, so human annotation is still necessary. For information about how to annotate SDD files, please see "annotation_instructions.md".

The "NHANES_extractor_exp.py" script is more advanced and accurate, so in general you should run it in preference to "NHANES_extractor.py". The structure of the latter script is simpler, so if you are looking to extend these programs, it is a good resource for getting the gist of how they work.

Only the exp version is available through pip.

Running the extractor will generate 3 directories under the directory from which it was run: acq, codebook, and sdd. The directory "sdd" contains subdirectories named for categories of NHANES data, which themselves contain partially annotated SDD files (as csv). The other two directories do not require further annotation.
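As a rough illustration of the layout described above, the following sketch lists the partially annotated SDD files grouped by category. The helper name list_sdd_files is hypothetical and not part of this repository; it assumes only the sdd/<category>/*.csv layout described in this section.

```python
from pathlib import Path


def list_sdd_files(root="."):
    """Map each category directory under <root>/sdd to its sorted csv file names.

    Hypothetical helper, not part of the extractor itself.
    """
    sdd_dir = Path(root) / "sdd"
    result = {}
    if not sdd_dir.is_dir():
        # The extractor has not been run from this directory yet.
        return result
    for category in sorted(p for p in sdd_dir.iterdir() if p.is_dir()):
        result[category.name] = sorted(f.name for f in category.glob("*.csv"))
    return result


if __name__ == "__main__":
    for category, files in list_sdd_files().items():
        print(category, len(files), "SDD file(s)")
```

Running this from the directory where the extractor was run should give a quick overview of how many SDD files still need annotation in each category.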

To run:

NHANES_extractor_exp.py

This will take a few minutes.

Annotation Engine

The file "annotation_engine.py" is a script designed to streamline the annotation process with a graphical user interface. It extracts information about relevant ontologies from online sources and computes a number of guesses as to the appropriate annotations for any given line in an input SDD file.

To use the annotation engine, you must have access privileges on the CHEAR labkey server. Using your account details, create a .netrc file (_netrc on Windows) in your "home" directory. The contents of ~/.netrc should look like this:

machine chear.tw.rpi.edu

login

password

For security, you may wish to restrict the permissions on .netrc so that only you can read and write it.
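Before launching the engine, it can help to confirm that the file parses and contains an entry for the server. The sketch below uses Python's standard netrc module; the helper names check_netrc and restrict_to_owner are hypothetical and not part of this repository, and the annotation engine itself may read the file differently.

```python
import netrc
import os
import stat


def check_netrc(path, host="chear.tw.rpi.edu"):
    """Return True if `path` parses as a netrc file with a login and password for `host`."""
    try:
        auth = netrc.netrc(path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        # Missing file, unreadable file, or malformed contents.
        return False
    if auth is None:
        # No entry for this host (and no default entry).
        return False
    login, _account, password = auth
    return bool(login) and bool(password)


def restrict_to_owner(path):
    """Set the file mode to owner read/write only (0o600); this has limited effect on Windows."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    print(check_netrc(os.path.expanduser("~/.netrc")))
```

If check_netrc returns False, the login or password line is probably missing or the machine name does not match.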

To run the program, go to your desired directory and execute:

annotation_engine.py

You will encounter two dialog boxes. The first asks you to select an SDD file (as csv) to annotate, using a directory explorer interface. The second asks for your session name; you may choose an identifier or leave it as the default. There will then be a short delay while data is downloaded from the CHEAR labkey server. If you encounter an error at this point, there may be a problem with your .netrc file.

When the engine runs, a GUI will pop up showing the first line of the given SDD along with some information and a set of radio buttons. The column header of the SDD you are annotating appears at the top, with the row below it. There are four types of radio button: the top button produces a placeholder N/A; the indented middle buttons list a number of guesses with an estimate of their confidence; the last entry in the indented section is any non-empty entry that was in the original cell; and the bottom button allows other input not covered by the other buttons. If you wish to input an annotation not shown, type the URI in the top text box, and optionally a label for the URI in the bottom box. On pressing "Enter", your selection will be output to a csv file under the "sessions" directory.

If you exit the annotation engine before it terminates by itself, all entries you have made so far will still be contained in the associated csv file.


This Python code uses the Beautiful Soup package to extract codebook values and Semantic Data Dictionary (SDD) starting points from NHANES documents.

To specify which year's variables to extract, set the starting year in the begin_year variable. As 2013-2014 data is the most up to date and complete at the time of this writing, begin_year has been set to 2013.
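The relationship between begin_year and the data cycle can be sketched as follows. The function cycle_label is hypothetical and only illustrates that a starting year of 2013 corresponds to the 2013-2014 cycle mentioned above; the extractor may build its cycle identifiers differently.

```python
def cycle_label(begin_year):
    """Return the two-year cycle label that starts at begin_year (hypothetical helper)."""
    return "{}-{}".format(begin_year, begin_year + 1)


begin_year = 2013
print(cycle_label(begin_year))  # prints 2013-2014
```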

The *Val variables are used to store the SDD column values.

columnVal stores the name of the NHANES variable, which is required.

labelVal stores the label associated with the NHANES variable, extracted from "SAS Label".

commentVal stores the comment associated with the NHANES variable, extracted from "English Text".

noteVal stores a note associated with the NHANES variable, extracted from "English Instructions".

targetVal stores the target of the variable, extracted from "Target". This column is not in the SDD specification, but is included in the extraction for completeness.

attributeVal stores the attribute associated with the variable, using text matching.

attributeOfVal is used to assign a role.

unitVal is used to assign a unit to the variable as extracted from the label or comment.
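The mapping from codebook fields to the *Val variables can be sketched as below. This is a minimal illustration under stated assumptions, not the extractor's actual code: the function build_row is hypothetical, "Variable Name" is an assumed field name for the source of columnVal, and the unit heuristic (a trailing parenthesised token such as "(kg)") is only one plausible way to read a unit from the label or comment.

```python
import re

# Codebook field -> extractor variable, following the description above.
# "Variable Name" is an assumption; the text only says columnVal holds the variable's name.
FIELD_TO_VAL = {
    "Variable Name": "columnVal",
    "SAS Label": "labelVal",
    "English Text": "commentVal",
    "English Instructions": "noteVal",
    "Target": "targetVal",
}

# Assumed heuristic: a unit is a trailing parenthesised token, e.g. "Weight (kg)".
UNIT_PATTERN = re.compile(r"\(([^()]+)\)\s*$")


def build_row(fields):
    """Build a dict of *Val entries from a dict of codebook fields (hypothetical helper)."""
    row = {val: (fields.get(field) or "").strip() for field, val in FIELD_TO_VAL.items()}
    if not row["columnVal"]:
        raise ValueError("columnVal (the NHANES variable name) is required")
    # Look for a unit in the label first, then in the comment.
    row["unitVal"] = ""
    for text in (row["labelVal"], row["commentVal"]):
        match = UNIT_PATTERN.search(text)
        if match:
            row["unitVal"] = match.group(1)
            break
    return row


if __name__ == "__main__":
    print(build_row({"Variable Name": "BMXWT", "SAS Label": "Weight (kg)"}))
```

Fields that are absent from the input are left as empty strings, which mirrors the idea that the extracted SDD is a starting point to be completed by hand.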

The following variables are placeholders for future code that can be written to assign values to their associated columns: timeVal, entityVal, roleVal, relationVal, inRelationToVal, wasDerivedFromVal, wasGeneratedByVal, hasPositionVal.