Green Team - NIEHS Collaboration #114

Open
karafecho opened this issue Mar 1, 2018 · 27 comments

@karafecho

This issue involves a new collaboration with Charles Schmitt and Shepherd Schurman of NIEHS to investigate the use of the Translator/Reasoner architecture in the context of NIEHS clinical data sources (i.e., EPR, CRU), knowledge sources (i.e., Tox21, DrugMatrix), and new toxicology use cases.

@karafecho
Author

Preliminary meetings have taken place and initial scientific and technical use cases have been developed. ROBOKOP code and data sources are being explored.

@karafecho
Author

Meeting with Charles, Shepherd, and Christine scheduled for 1 pm, Wednesday, April 18.

@karafecho
Author

  • Data usage agreement signed by NIEHS with redlining. Sent back to UNC and awaiting acceptance/modifications in order to obtain HuSH+ data.

  • Stood up a backend service for chemical similarity. Now working on a Smart API for the system; it will then be added to Rosetta.

@karafecho
Author

DUA accepted by NIEHS. HuSH+ dataset received.

@karafecho
Author

karafecho commented Jun 18, 2018

Charles Schmitt's team is planning to develop a workflow to address toxic agent-centric questions such as those listed here. Relevant NIEHS data sources will be identified for eventual exposure as a Translator ChemTox Smart API. Resources are limited, however, so this is a low-priority project for NIEHS.

@karafecho
Author

7/20/2018 - Alex, Steve, Chris, and I met with Charles' Tox group and Data Science group to provide an overview of the Translator program and Green/Gamma's role in the program. We requested their assistance with Module 3, MVP1.

@cpschmitt : Please provide an update.

@schmittcp

08/02/18 - Several updates:

On the clinical side, we're submitting a request this week to the EPR contract team to pull genotype data and linking hashcodes. The genotype data will be limited for now to TLR4-related SNPs. This will allow us to include the genotype data in the UNC clinical service and validate prior findings on the association of TLR4 with asthma and distance to roadway (Shepherd's paper on this was just published). The linking hashcodes are those that UNC has on their own paper through the work that David Borland has done with Tracs on linking UNC and EPR patients (algorithm from Ashok).

Also on the clinical side, we are developing a second use case around immune-mediated diseases with Dr. Fred Miller at NIEHS. I'll work with Kara on the formulation of the use case. Presumably we would take the same approach as with asthma to augment the UNC clinical service.

Also, the EPR contract team is starting to draft documents with UNC for a more general data sharing arrangement between UNC and NIEHS based on several prior conversations with UNC. Dave Peden has agreed to serve as the UNC PI on this.

On the tox side, Resham Kulkarni is taking a first pass at the chemical-to-tox-phenotype use case that Scott Auerbach had mentioned and how it relates to Translator and Module 3. She should have that this week. I'll work on it next, then have Scott review it. Then we'll run it by the Green team.

@karafecho
Author

09/06/18 update

From Charles:

  • I'll have Sue Nolte get a RENCI account (if someone can tell me where we request a RENCI account) and install Tox21 Enricher on a server at RENCI. We should also have a phone call to briefly walk through the database and source code.

  • I'll investigate which of the annotation classes within Enricher have linkages to other ontologies/terminologies and try to find any more documentation on the annotation classes (in particular DrugMatrix).

  • One of us will take the lead on defining an API. I'm happy to start this if you want, but I'm equally happy to review one.

  • RENCI will implement the API

Independent of this, I'll be looking more closely at AOPs for NTP's purposes and will let you know if there's a way to link those into Translator (although I'd encourage you to also look if you haven't).


From Kara:

Consulted with Charles on (1) plans to generate a new ICEES cohort (i.e., new tables) using the EPR, which includes an asthma sub-cohort; and (2) a new EPR use case on immune-mediated disease.

@karafecho
Author

Update on Green/Gamma collaboration with NIEHS (Charles Schmitt), 11/2/2018:

  1. EPR asthma use case - tagged EPR participants in CDWH for incorporation into ICEES; goal is to replicate a prior study; genome-wide sequencing of all EPR participants is expected to be complete by Fall 2019; targeting overlapping population initially (roughly 5000 patients)
  2. EPR immune-mediated disease use case - common risk factors and biological processes across immune-mediated diseases; SMEs to determine risk factors and biological processes; EPR for validation; CTD for clustering on genes, chemicals, diseases/phenotypes; ROBOKOP and neo4j for identifying additional genes, pathways, diseases/phenotypes; ??? tool for visualizing and curating bipartite graphs
  3. Tox21 Enricher - associations between chemicals and Tox21 chemicals (10,000), assays (300); includes DrugMatrix, Leadscope; chemical structure similarity capability; toxicity predictions; need to map assays to ontologies; data to be stored at RENCI, NIEHS to develop API and send dictionary of desired data elements
  4. Not discussed, but referred to during the meeting: Ilya Baldin’s NSF ImPACT award and ICEES as a use case for secure multiparty computation

Modification to approved Green Team IRB protocol on asthma-like patients:

Goal: We seek to add patient-level data elements to inform the existing research study. The new data elements will include a limited set of genotype calls and responses to two survey questions (EPR Health and Exposure Survey, EPR Exposome Survey) that relate to the current asthma use case. The data elements are available from the Environmental Polymorphism Registry (EPR), a longitudinal research study being conducted by the NIEHS. The EPR has enrolled around 19,000 subjects in North Carolina, including approximately 5000 subjects who are in the UNC CDW-H (as determined from a prior UNC-NIEHS study). We plan to add data elements from the EPR only for those subjects who are in both the EPR and the cohort for this study.

The genotype data is focused on a small set of variants (4 total) that relate to the TLR4 pathway. The survey questions include a broad range of questions related to environmental exposures and general health. In a prior EPR-based study, we found significant differences in susceptibility to asthma for patients based on distance to roadways and based on their genotype for these variants. Under this modification, we will extend the existing asthma analysis to include these genotypes and attempt to confirm the prior findings. We also plan to incorporate the survey questions and genotype data into the existing study analysis in order to uncover potential relationships between asthma outcomes, clinical data, environmental exposure measures, survey responses, and genotypes. We note that this work is only meant to extend the set of data elements in the existing study and not to pursue additional research goals.

Safeguards:
In prior work between UNC and EPR, we have developed a method to link records from patients seen at both institutions based on the use of one-way hash codes. The hash codes are derived from identifying information (age, sex, name, address), although once computed the identifying information is no longer needed and cannot be regenerated from the hash codes. We plan to use the hash codes to link the existing UNC study data with the EPR data elements. As UNC and EPR share identical hash codes, we can add the EPR data elements to existing de-identified study data at UNC. Thus, the EPR data elements will not be added to identified patient records, and the NIEHS will not transfer identifying information to UNC. EPR will provide the data elements with the hash codes using secure file transfer.
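For readers unfamiliar with this kind of linkage, here is a minimal sketch of the general idea. It is not the algorithm used by UNC and EPR, which is not described in this thread. The keyed SHA-256 hash, the normalization step, the `link_hash` function, the shared key, and all sample values are assumptions made for illustration only.

```python
import hashlib
import hmac


def link_hash(name, sex, age, address, secret_key):
    """Return a keyed one-way hash of normalized identifying fields.

    Hypothetical scheme: the real UNC/EPR algorithm is not shown in this thread.
    """
    # Normalize case and whitespace so formatting differences do not change the hash.
    fields = [name, sex, str(age), address]
    normalized = "|".join(" ".join(f.lower().split()) for f in fields)
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key that both institutions would agree on out of band.
key = b"shared-secret"

# Each institution computes the hash independently from its own records.
unc_hash = link_hash("Jane Doe", "F", 42, "1 Main St, Chapel Hill NC", key)
epr_hash = link_hash("  jane  DOE", "f", 42, "1 main st, chapel hill nc", key)

# De-identified data keyed only by hash (made-up values).
unc_study = {unc_hash: {"asthma_visits": 3}}
epr_elements = {epr_hash: {"survey_completed": True}}

# Link on the shared hash; no identifying information changes hands.
linked = {
    h: {**unc_study[h], **epr_elements[h]}
    for h in unc_study.keys() & epr_elements.keys()
}
```

Because both sides normalize and hash the same fields in the same way, the two hashes agree and the records can be joined without exchanging names or addresses.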

@karafecho
Author

@schmittcp will work with @lstillwe and Sue Nolte to develop an API for Tox21 Enricher data. Charles, please send Sue's GitHub user name to @rayi113 so that he can add her to this repo.

@karafecho
Author

@schmittcp @szcc @lstillwe : I'm hoping that the three of you can coordinate on the Tox21 Enricher API.

@karafecho
Author

Update, 4/19/19:

  1. Publishing this TIDBIT (see comments here).
  • From Charles: It comes down to whether it's an EPR-enabled-by-Translator paper or a Translator methods paper with EPR as a use case. If the former, I think we need further evidence of the finding. If the latter, I agree we could forgo further evidence and add additional clusters. If the latter, we need to validate our clusters first (which I’m currently working on). From our side, the former is of more interest, but I’ll concede that the latter may be a more practical solution if you all would prefer we do that first.
  • From Kara: The latter would be MUCH quicker and it would highlight both the EPR and Translator. Assuming the clusters are from an asthma subset of the EPR, then I think we should determine rates of co-occurrence of asthma and colitis in i2b2@unc (UNC general population) and in ICEES (UNC subset of patients with asthma-like conditions). This would both provide some insights for your group and strengthen the manuscript.
  2. Moving EPR data to the CDWH in an effort to expand ICEES to include data on subjects in the EPR, including EPR-specific survey findings.
  • From Charles: We are held up right now as UNC requested confirmation from our IRB that our consent is ok with sharing the data with UNC. That has mostly occurred, but a few sign-offs remain.
  3. Comparing results from (2) with those that your folks are generating outside of a firewall via integration of EPR data with data from Green Team's Exposures Services.
  • From Kara: Lisa is exploring whether we can pull additional chemical exposures from the US EPA conUS database, to move beyond PM2.5 and ozone for years 2002-2014.
  4. Replicating published results on roadway exposures/genes/asthma by way of (3).

@karafecho
Author

Update on status of ongoing projects:

  1. NIEHS EPR data (survey data, SNP data) and RENCI Translator ICEES asthma-like cohort (Shepherd, Charles, Kara, Emily, Hao)
  • Replicate NIEHS EPR study using shared patients in EPR and Translator asthma-like cohort (Charles, Kara)

  • Add EPR data to new ICEES tables for years 2010-2016, to be generated using CAMP FHIR PCORnet -> FHIR data conversion pipeline and FHIR PIT data integration pipeline (Kara, Emily, Hao)

  2. NIEHS EPR data and UNC i2b2 data (Shepherd, Charles, Emily, David Borland, Ashok Krishnamurthy)
  • Add EPR flags to overlapping patients in UNC i2b2

  • Analyze agreement between UNC ICD9/10 codes and EPR self-reported diseases

  • Develop a standard procedure for cross UNC-NIEHS studies

  3. IMD subset of NIEHS EPR data and RENCI Translator ROBOKOP (Fred, Shepherd, Charles, Kara); Kara to coordinate with (Chris Bizon, Kenny Morton, Alex Tropsha, Eugene Muratov, Vini Alves, Joyce Borba)
  • ROBOKOP exploration of chemical-disease IMD clusters

  • ROBOKOP exploration of chemical-gene IMD clusters

  4. NIEHS EPR and RENCI Translator Environmental Exposure Services/APIs (Shepherd, Charles, Kara); Kara to coordinate with (Sarav Arunachalam, Alex Valencia Arias)
  • Independent batch pull of exposures data, for integration with EPR data

  • EHP manuscript is in preparation

@xu-hao

xu-hao commented Nov 28, 2019

Latest cross walk data for EPR data results in

  • 213 matches with 2012 fhir data and
  • 42 matches with icees table.

The differences between these two data sets include:

  1. rows with no lat, lon
  2. rows with age >= 90

which are excluded from the ICEES table.
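The two exclusion rules listed above can be sketched as a simple row filter. This is an illustration only, not the ICEES pipeline code. The field names (`patient_id`, `lat`, `lon`, `age`) and the sample rows are hypothetical.

```python
def icees_eligible(row):
    """Return True if a row has coordinates and an age under 90."""
    has_coords = row.get("lat") is not None and row.get("lon") is not None
    return has_coords and row["age"] < 90


# Made-up rows standing in for matched FHIR records.
rows = [
    {"patient_id": "p1", "lat": 35.9, "lon": -79.0, "age": 34},
    {"patient_id": "p2", "lat": None, "lon": None, "age": 51},   # no lat, lon
    {"patient_id": "p3", "lat": 35.8, "lon": -78.6, "age": 92},  # age >= 90
]

kept = [r["patient_id"] for r in rows if icees_eligible(r)]
```

Only `p1` survives both rules in this toy data; the other two rows illustrate the two kinds of exclusion.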

@szcc

szcc commented Nov 29, 2019 via email

@karafecho
Author

@szcc : Sounds great.

Forgive my ignorance, but I don't recognize your user name. Perhaps you can remind me?

@szcc

szcc commented Dec 2, 2019 via email

@karafecho
Author

👍

@karafecho
Author

Emily to resolve hashing issues the week of January 13, 2020.

@karafecho
Author

@xu-hao : I checked with the EPR folks regarding the EPR "HE_COMPLETION_DATE" variable. This is the variable we should use for integration. So, we should examine exposures during the year prior to survey completion for the CMAQ airborne exposures data. For the ACS socio-economic data, we should use the 5-year block in which the survey was completed. For the roadway data, 2016 is the only option. Does this seem reasonable? Will this present any major challenges?

@karafecho
Author

karafecho commented Jan 13, 2020

@xu-hao : To clarify the above plan, we currently have ICEES structured in a somewhat random manner. Specifically, for patient tables, ages are calculated with respect to January 1 of each one-year 'study' period, and exposures and outcomes are examined over the same one-year 'study' period. For visit tables, ages are calculated with respect to visit date, exposures are examined over the 24 hours prior to visit date, and outcomes are examined with respect to the visit itself.

The EPR data are structured differently. Specifically, ages are calculated with respect to the date of survey completion ("HE_COMPLETION_DATE"), and outcomes are reported at the time of survey completion (although some of the variables refer to lifetime metrics). I think it makes the most sense to examine exposures over the one-year period prior to survey completion. If this is too challenging, or will take too much time, then we can compromise, at least for the demonstration project, and examine exposures over the year in which the survey was completed. In other words, if a participant completed the survey on 6/1/2016, then we would examine exposures over the course of 2016.

Does this make sense? If so, which of the above plans is the most feasible?
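To make the two candidate plans concrete, here is a small sketch of the date windows each one implies. The function names are hypothetical, and the choice to start the prior-year window on the day after the previous year's anniversary (so that the window ends on the completion date) is an assumption.

```python
from datetime import date, timedelta


def prior_year_window(completion):
    """Plan 1: the one-year period ending on the survey completion date."""
    try:
        anniversary = completion.replace(year=completion.year - 1)
    except ValueError:
        # Completion fell on Feb 29 and the previous year has no Feb 29.
        anniversary = completion.replace(year=completion.year - 1, day=28)
    return anniversary + timedelta(days=1), completion


def calendar_year_window(completion):
    """Plan 2 (compromise): the calendar year in which the survey was completed."""
    return date(completion.year, 1, 1), date(completion.year, 12, 31)


# The example from the comment above: survey completed on 6/1/2016.
completion = date(2016, 6, 1)
plan1 = prior_year_window(completion)
plan2 = calendar_year_window(completion)
```

For this example, `plan1` spans 2015-06-02 through 2016-06-01, and `plan2` spans all of 2016, so the second plan includes about seven months of exposure after the survey was completed.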

@xu-hao : let's discuss this tomorrow (Tuesday, 1/14/20).

@karafecho
Author

From Emily, new hash matching results, 1/10/2020:

UNCHCS denominator: 2,770,607 patients

NIEHS EPR denominator: 19,388 participants

Matched: 7,233 people (37% of all EPR participants)

@karafecho
Author

From Emily, 1/14/2020:

Crosswalk is up on Rockfish, in /opt/RENCI/output/FHIR. Filename is UNC_NIEHS_XWalk_for_Hao.csv. The UNC identifier should match the patient IDs you have in the FHIR files. The hash is what will match with NIEHS.
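A rough sketch of how such a crosswalk might be used follows. The actual column names and layout of the crosswalk file are not given in this thread, so the `unc_id` and `hash` columns, the in-memory sample CSV, and all IDs and values below are hypothetical.

```python
import csv
import io

# Stand-in for the crosswalk file; real column names are unknown.
crosswalk_csv = io.StringIO(
    "unc_id,hash\n"
    "U001,aaa111\n"
    "U002,bbb222\n"
    "U003,ccc333\n"
)
unc_to_hash = {row["unc_id"]: row["hash"] for row in csv.DictReader(crosswalk_csv)}

# Patient IDs as they might appear in the FHIR files (made up).
fhir_patient_ids = ["U001", "U003", "U999"]

# EPR records keyed by hash (made up).
epr_by_hash = {"aaa111": {"D28_Asthma": 1}, "zzz000": {"D28_Asthma": 0}}

# Two-step join: FHIR patient ID -> hash -> EPR record.
matched = {}
for pid in fhir_patient_ids:
    h = unc_to_hash.get(pid)
    if h in epr_by_hash:
        matched[pid] = epr_by_hash[h]
```

In this toy data only `U001` links through to an EPR record: `U003` has a hash with no EPR counterpart, and `U999` is absent from the crosswalk.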

@karafecho
Author

Update from Kara on EPR asthma cohort data, 01.16.20:

N= 4129 total, all with SNP data

N= 2709 with HE_COMPLETION_DATE
Of those 2709, 4 participants have HE_COMPLETION_DATE in 2014, and 5 have HE_COMPLETION_DATE in 2018, remainder are all 2012 and 2013

N = 2637 with HE_COMPLETION_DATE and D28_Asthma = 0,1
Of those 2637, 1 participant has HE_COMPLETION_DATE in 2014, remainder are all 2012 and 2013
If 5% overlap with UNC ICEES asthma, then N = 132
If 37% overlap with UNC ICEES asthma, then N = 975

Of those 2637, 928 have HE_COMPLETION_DATE and D28_Asthma = 1
If 5% overlap with UNC ICEES asthma, then N = 46
If 37% overlap with UNC ICEES asthma, then N = 343

N = 2593 with D28_Asthma = 0,1 and TLR4_DIST_1X
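The overlap projections above are simple proportions, and they can be reproduced with a few lines of arithmetic. The helper names are hypothetical. The 5% and 37% rates are the two scenarios quoted above, and the 37% figure corresponds to the hash-match rate of 7,233 out of 19,388 reported earlier in this thread. Rounding to the nearest integer is an assumption, so individual results may differ by one from the figures quoted.

```python
def match_rate_pct(matched, denominator):
    """Percentage of the denominator that was matched, rounded to the nearest integer."""
    return round(100 * matched / denominator)


def projected_overlap(n, overlap_pct):
    """Expected number of overlapping participants at a given overlap percentage."""
    return round(n * overlap_pct / 100)


# Hash-match rate reported on 1/10/2020.
rate = match_rate_pct(7233, 19388)

# Projections for the 2637 participants with a completion date and D28_Asthma = 0,1.
low_all = projected_overlap(2637, 5)
high_all = projected_overlap(2637, 37)

# Projections for the 928 participants with D28_Asthma = 1.
low_asthma = projected_overlap(928, 5)
high_asthma = projected_overlap(928, 37)
```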

@karafecho
Author

Additional update from Kara, 01.16.20:

Hao has lat/longs and addresses for all 2709 participants with HE_COMPLETION_DATE, so we can integrate the exposures data for the 2705 participants with HE_COMPLETION_DATE in 2012 or 2013, excluding the 4 participants with HE_COMPLETION_DATE in 2014.

We will use the calendar year prior to HE_COMPLETION_DATE to determine airborne pollutant exposure estimates, using the same calculations for AVG and MAX exposure that we currently use for the ICEES integrated feature tables, but expanding from PM2.5 and ozone to include the eight additional airborne pollutant exposure estimates that we now have.

In other words, we'll calculate one-year exposures over the year prior to survey completion. So, if someone completed the survey on 7/1/2012, then we'll calculate exposures from 7/1/2011 - 7/1/2012 (or 7/2/2011 - 7/1/2012).
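As a rough illustration of the calculation described above, the sketch below takes the average and maximum of daily exposure values that fall inside a window. The actual ICEES AVG and MAX calculations are not shown in this thread, so this simple mean and maximum are assumptions, as are the function name and the made-up daily values. The window uses the second of the two date ranges given above (7/2/2011 - 7/1/2012).

```python
from datetime import date
from statistics import mean


def summarize_exposure(daily_values, start, end):
    """Return AVG and MAX of daily values with start <= day <= end, or None if empty."""
    window = [value for day, value in daily_values.items() if start <= day <= end]
    if not window:
        return None
    return {"avg": mean(window), "max": max(window)}


# Made-up daily PM2.5 estimates for one participant's location.
daily_pm25 = {
    date(2011, 7, 1): 30.0,   # outside the window (one day too early)
    date(2011, 7, 2): 10.0,
    date(2012, 1, 15): 20.0,
    date(2012, 7, 1): 12.0,   # survey completion date
    date(2012, 7, 2): 50.0,   # outside the window (after completion)
}

summary = summarize_exposure(daily_pm25, date(2011, 7, 2), date(2012, 7, 1))
```

The same function could be applied once per pollutant to cover the additional airborne exposure estimates.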

For the ACS data, we will use the 2012-2016 estimates.

For the roadway data, we do not really have a choice, as we only have 2016 data.

With respect to column headers for the EPR data, we'll create two sets: one for the UNC data and one for the EPR data.

We'll integrate the UNC and EPR data over all available years to date, i.e., 2010-2016.

@karafecho
Author

In addition to the above plans for integration of UNC and NIEHS EPR asthma cohort data, we will stand up a private ICEES API at NIEHS.

@[email protected] : Please let @[email protected] and me know how we can move this effort forward as quickly as possible. Thanks!

@karafecho
Author

karafecho commented Jan 29, 2020

Clarification to comment from @xu-hao on November 28, 2019:

Original comment

Latest cross walk data for EPR data results in

213 matches with 2012 fhir data and
42 matches with icees table.

The difference between these two data sets include:

rows with no lat, lon
rows with age >= 90

which are excluded from icees table

Correction

The matches noted above represent a two-way join between the original cross-walk for UNC-EPR data (i.e., just the hash match) and both the UNC FHIR files for asthma cohort and the final ICEES integrated feature tables for asthma cohort, which are derived from the FHIR files. The hash codes for the original cross-walk between UNCHCS and NIEHS EPR did not align; i.e., the UNCHCS hash codes differed in format from the NIEHS EPR hash codes.

8 participants
@stevencox @cpschmitt @xu-hao @szcc @cbizon @karafecho @schmittcp and others