The data model is compiled and released as several different artifacts using various tooling.
The figure below provides an overview; refer to the Makefile
for details of the underlying processing.
modules/*.yaml
: These are source files. Refer to Data Model Framework for editing guidelines.NF.jsonld
: This is the main output format used for distribution and downstream apps (Data Curator App).registered-json-schemas/*.json
: These are JSON serializations for a subset of the data model, for native functionality with Synapse platform or wherever JSON seralization is preferred.
graph LR
A[modules/*.yaml] --> B[retold]
B --> C[NF.jsonld]
C --> D[schematic]
D --> E[NF.jsonld]
A --> L[LinkML]
L --> J[*.json]
class B,D tools
%% Legend
subgraph Legend
G[Files]
H[Tools]
end
style A fill:white,stroke:#333,stroke-width:2px;
style C fill:white,stroke:#333,stroke-width:2px;
style E fill:white,stroke:#333,stroke-width:2px;
style G fill:white,stroke:#333,stroke-width:2px;
style J fill:white,stroke:#333,stroke-width:2px;
style B fill:#aaf,stroke:#333,stroke-width:2px
style D fill:#aaf,stroke:#333,stroke-width:2px
style H fill:#aaf,stroke:#333,stroke-width:2px
style L fill:#aaf,stroke:#333,stroke-width:2px
This data model, maintained by the NF-Open Science Initative, provides standard concepts and structure to help describe and manage data and other resources in the NF-OSI community.
One can reference the data model to understand things such as:
- What type of entities there are (e.g. RNA-seq files vs image files, files vs datasets)
- What metadata properties exist for these different types, to help understand and use a data resource
- What are the preferred/standard labels for something as defined by community input and our data managers (e.g. prefer "NF2-related schwannomatosis" vs the now-deprecated "Neurofibromatosis type 2")
Terms in the metadata dictionary are used in the manifests within the Data Curator App and on the NF Data Portal. We welcome contributions from community members, whether you are a professional data modeler, researcher, or student in the NF community.
- For external contributors somewhat familiar with editing code source files directly: After learning about the data model framework below, you can edit the files in the GitHub UI or create a Codespace on main using our devcontainer setup.
- Other ways to contribute: Create an issue to add or revise new terms, templates, and relationships.
The data model is maintained as a subset of the YAML-based Linked Data Modeling Language (LinkML) that is compatible with our internal tool schematic. This subset of LinkML should be easy to get started with.
The data model primarily models different types of biological data and patients/samples and uses templates that collects information for these entities. (It may helpful to also read about Entity-Relation Model.)
To do this the building blocks are:
LinkML term | Schematic Note | Other Note |
---|---|---|
Class | Usually corresponds to "Component"; schematic uses definition to make a "manifest" | Also known as "template" |
Slot | Schematic calls these "attributes"; the range can be typical types (string, int, etc.) or enumerations (below) | Also known as "property" |
Enum (enumeration) | Schematic calls these "valid values" | Also known as "controlled values" |
Classes depend on slots being defined, and some slots depend on enumerated values being defined.
If a class uses a slot that is not defined, the model will error when trying to build; and same with a slot that uses an enumeration.
Slots can be reused across classes; and same with enumerations for slot.
For example, unit
can be reused across any class entities that needs to capture unit information.
Classes have slots (properties).
All classes are grouped under modules/Template
.
Classes can be built upon, so subclasses inherit properties from a parent class.
Here is an small-ish base class definition for a patient:
PatientTemplate:
is_a: Template
description: >
Template for collecting *minimal* individual-level patient data.
slots:
- individualID
- sex
- age
- ageUnit
- species # should be constrained to human
- diagnosis
- nf1Genotype
- nf2Genotype
annotations:
requiresComponent: ''
required: false
Slots have a range, which can be a basic data type such as "string" and "integer", or a set of controlled values (see Enum).
If the range of a slot is not explicit defined, it defaults to "string" (basically free text).
All slots are in the file modules/props.yaml
.
slots:
#...more above
specimenType:
description: Association with some tissue (a mereologically maximal collection of cells that together perform physiological function).
range: SpecimenTypeEnum # Take a look at SpecimenTypeEnum below
required: false
title:
description: Title of a resource.
meaning: https://www.w3.org/TR/vocab-dcat-3/#Property:resource_title
required: true
# ...more below
Here, required
defines globally whether this slot must be filled out.
For example, specimenID
is required, and is included for assay data class definitions that involve a specimen.
For that reason, it is not included at all in individual-level assay data (e.g. behavioral/psychological data).
Note: In situations where "the data meets the template", issues with a required slot might indicate that the template being used doesn't actually "fit" the type of data/entity, and an additional one needs to be defined.
Validation rules are used with slots/attributes. We use schematic validation syntax exactly as documented in this Confluence doc.
Here is an example of a simple validation rule:
age:
annotations:
requiresDependency: ageUnit
validationRules: num
description: A numeric value representing age of the individual. Use with `ageUnit`.
required: false
Here is an example of a more complicated combination rule:
readPair:
annotations:
validationRules: int::inRange 1 2
...
An enumeration is a set of controlled values.
Enums are most of the files modules
, everything except for what's in Templates
and props.yaml
.
enums:
SpecimenTypeEnum:
permissible_values:
cerebral cortex:
description: The outer layer of the cerebrum composed of neurons and unmyelinated nerve fibers. It is responsible for memory, attention, consciousness and other higher levels of mental function.
meaning: http://purl.obolibrary.org/obo/NCIT_C12443
bone marrow:
description: The soft, fatty, vascular tissue that fills most bone cavities and is the source of red blood cells and many white blood cells.
meaning: http://purl.obolibrary.org/obo/BTO_0000141
#...more below
enums:
BooleanEnum:
description: 'Boolean values as Yes/No enums'
permissible_values:
'Yes':
description: 'True'
'No':
description: 'False'
Aside from meta specific to each type (class, slot, or enum) above, terms have common meta, where the most prominent are summarized here:
description
: Description to help understand the term.meaning
: This is a highly recommended and should be an ontology URI that the term maps to.source
: This can be used to supplementmeaning
, but it's most often used when an ontology URI does not exist. It provides a reference such as a publication. For example, a very novel assay might not have a real ontology concept yet but will likely be described in a paper.notes
: Internal editor notes.
-
Create a new branch in the NF-metadata-dictionary repository. (example: patch/add-attribute)
-
Find the yaml file in the new branch where the attribute belongs. The components of the data model are organized in the folder labeled modules.
-
Create a pull request (PR) to merge the branch to "main". Add either @allaway, @anngvu, or @cconrad8 as a reviewer. Creating the PR will:
i. Build the NF.jsonld from the module source files. This is the format needed by schematic. Therefore, you do not need to edit theNF.jsonld
directly, because it will be done automatically by our service bot.ii. If build succeeds, also run some tests to make sure all looks good/generate previews. After some minutes, a test report will appear in the PR that hopefully looks like the below with generated manifest and test results. Note: Trivial changes can skip tests by labeling the PR with skip tests .
-
Make any necessary changes and then merge the new branch that was created to the main branch.
-
Name the release with the convention MAJOR.MINOR.PATCH. These releases start at v1.0.0. Versioning is roughly following semantic versioning concepts where:
- MAJOR: In-use parent attributes are deleted from dictionary or modified, or in-use child attributes are modified in a non-backwards compatible way (e.g.
Neurofibromatosis 1
changed toNeurofibromatosis type 1
). - MINOR: Concepts/parent attributes are added.
- PATCH: Child attributes are added, or unused child/parent attributes are deleted/modified, or definitions/
comments
are added/modified, orvalidation rules
are modified in a backwards compatible way.
- MAJOR: In-use parent attributes are deleted from dictionary or modified, or in-use child attributes are modified in a non-backwards compatible way (e.g.
-
Navigate to data curator app config file to update the version. First, update the staging branch to test on the staging app. Then, checkout the main branch on your fork, update dcc_config.csv for NF so it matches the staging set up, then create a PR to main on the sage-bionetworks repo. Once merged, the changes should be visible on the data curator app.
-
🎉 Congrats! The term is now added to the metadata dictionary.
To build locally, install the schematic package.
For questions or to get help contributing a term, please create an issue.
The "collection" of metadata terms in this repository is made available under a CC0 license. The individual terms and their definitions may be subject to other (permissive) licenses, please see the source for each metadata term for details.