
Neurobagel Data Model limitations


While working on https://github.com/neurobagel/bulk_annotations/issues/10 we discovered a couple of limitations of our data model. The idea of this page is to keep a record of them, so we don't have to rediscover the same limitation twice. Some limitations are expected and intentional: we don't want to model everything we could, because we aim for a balance between usefulness and complexity.

Lack of longitudinal phenotypic info

Phenotypic data are modeled at the subject level, not at the session level. This is mainly due to BIDS: longitudinal phenotypic data are currently not defined in the BIDS spec.

Problems this causes

  • cannot model a change in diagnosis over time. If a participant has a diagnosis of "yes autism" at session 1 but a diagnosis of "no autism" at session 2, we currently have to pick one, and the participant will then show up with all of their sessions under that single diagnostic profile (see the sketch below).
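
To make the problem concrete, here is a minimal sketch (plain Python, not the actual Neurobagel graph model) of what happens when session-level diagnoses are flattened to the subject level; the participant and values are hypothetical:

```python
# Hypothetical session-level records; in the current model, diagnosis can only
# be attached to the subject, not to individual sessions.
sessions = [
    {"participant": "sub-01", "session": "ses-01", "diagnosis": "autism"},
    {"participant": "sub-01", "session": "ses-02", "diagnosis": "no autism"},
]

# We are forced to collapse the two conflicting values into one, e.g. the first:
subject_level_diagnosis = sessions[0]["diagnosis"]  # "autism"

# As a consequence, both ses-01 and ses-02 are returned under the "autism"
# profile, even though that diagnosis no longer applied at ses-02.
```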

"Missing value" is forced to be a catch-all category

For example, I have a column about diagnosis. The correct ontology is "SNOMED-CT". My column has levels such as:

  • Schizophrenia
  • Alzheimer's Disease
  • Sibling with Schizophrenia

The last one, "Sibling with Schizophrenia", has no term in SNOMED-CT (one could argue that this is more a modeling issue than an ontology issue). But in my data dictionary I need to map this value to something, and the only option I currently have is to declare it a "missing value" - which is probably not a good option.
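
Schematically, the annotation for this column ends up looking something like the sketch below (written as a Python dict; the key names follow the general shape of a Neurobagel-style data dictionary but are not meant to be the exact schema, and the raw level codes are made up):

```python
# Schematic annotation for the diagnosis column; key names and raw level codes
# are illustrative, not the exact Neurobagel data dictionary schema.
diagnosis_column = {
    "Description": "Primary diagnosis",
    "Annotations": {
        "IsAbout": {"TermURL": "nb:Diagnosis", "Label": "Diagnosis"},
        "Levels": {
            "SCZ": {"TermURL": "snomed:58214004", "Label": "Schizophrenia"},
            "AD": {"TermURL": "snomed:26929004", "Label": "Alzheimer's disease"},
            # "Sibling with Schizophrenia" has no SNOMED-CT term to point to...
        },
        # ...so the only way to handle it right now is to declare it missing:
        "MissingValues": ["SIB_SCZ"],
    },
}
```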

The alternative would be to "force" annotators to pick the "next closest" term from SNOMED-CT, but this comes at the cost of reduced annotation accuracy.

Lack of model variables to describe:

  • relations. For example, I cannot model that my participant has a sibling with a diagnosis of "Schizophrenia".
  • medication status. E.g. I cannot say: "my participant received Drug XYZ".
  • assessments that aren't tied to a specific tool. E.g. "Body Mass Index" is not a tool, in the same way that "blood pressure" is not a tool, but it is still relevant information.
  • "assessment" and "diagnosis" as categories are probably too narrow. E.g. I can't really expect the Cognitive Atlas to cover a purely clinical test battery like the MS EDSS scores. "Diagnosis" is too narrow because one can differentiate a "Diagnosis" from a "Finding" or "Observation" - for example, "Smoker" is an observation while "Cancer" is a diagnosis - and the model currently only has a place for diagnoses.

Assessment tool availability is too strict

Right now we think of an assessment tool as something a participant either has (successfully completed) or doesn't have (either entirely absent or only partially completed).

The way we implement it is that an assessment tool is counted as "available" for a participant if:

  • there are one or more columns that are partOf the assessment tool, and
  • the participant's row has no missing values in any of those columns

In other words: for the tool to be counted as available, none of the columns that are partOf the tool may have a missing value for that participant (see the sketch below).
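
As a minimal sketch of this rule, assuming a pandas DataFrame of phenotypic data and a hypothetical tool-to-column mapping (column and tool names are made up):

```python
import pandas as pd

# Hypothetical phenotypic table; in practice the tool-to-column mapping comes
# from the partOf annotations in the data dictionary.
pheno = pd.DataFrame(
    {
        "participant_id": ["sub-01", "sub-02"],
        "moca_item1": [3.0, None],
        "moca_item2": [2.0, 4.0],
    }
)
tool_columns = {"MoCA": ["moca_item1", "moca_item2"]}


def tool_available(row: pd.Series, columns: list[str]) -> bool:
    # A tool counts as "available" only if none of its columns are missing.
    return bool(row[columns].notna().all())


pheno["MoCA_available"] = pheno.apply(
    lambda row: tool_available(row, tool_columns["MoCA"]), axis=1
)
# sub-01 -> True, sub-02 -> False (moca_item1 is missing, so the whole tool is
# treated as unavailable, even though moca_item2 was completed)
```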
