diff --git a/docs/data_prep.md b/docs/data_prep.md
index c2b622d..d2c31b3 100644
--- a/docs/data_prep.md
+++ b/docs/data_prep.md
@@ -5,30 +5,56 @@
 We recommend also checking out [Nipoppy](https://nipoppy.readthedocs.io/), a protocol for standardized organization and processing of clinical-neuroimaging datasets that extends [BIDS](https://bids-specification.readthedocs.io/en/stable/).
 Neurobagel tools are designed to be compatible with data organized according to the Nipoppy specification, although you do not need to use Nipoppy in order to use Neurobagel.
 
-To use the Neurobagel annotation tool,
+To use the Neurobagel annotation tool,
 please prepare the tabular data for your dataset as a single, tab-separated file (`.tsv`).
 
 !!! note
     In the Neurobagel context, _tabular_ or _phenotypic_ data for a dataset refers to any demographic,
-    clinical/behavioural, cognitive, or other non-imaging-derived data of participants
+    clinical/behavioural, cognitive, or other non-imaging-derived data of participants
     which are typically stored in a tabular file format.
 
 ## General requirements for the phenotypic TSV
 
 ### All datasets
-A valid dataset for Neurobagel **must** include a TSV file that describes participant attributes.
-The TSV must contain a minimum of two columns: at least one column must contain subject IDs,
-and at least one column must describe demographic or other phenotypic information
-(for variables currently modeled by Neurobagel, see the [data dictionary section](dictionaries.md)).
+A valid dataset for Neurobagel **MUST** include a TSV file that describes participant attributes.
+
+The TSV MUST contain:
+
+- A minimum of two columns
+- At least one column containing subject IDs
+
+    ??? note "Only one subject ID column can be annotated"
+        Neurobagel currently does not support annotating multiple subject ID columns
+        so you must choose one as the primary ID during annotation
+
+- At least one column that describes demographic or other phenotypic information
+- Unique values in the subject ID column or unique combinations of IDs if both subject and session ID columns are present
+
+The TSV MAY contain:
+
+- A column with session IDs, e.g. if the dataset is longitudinal
+
+    ??? note "Only one session ID column can be annotated"
+        Neurobagel currently does not support annotating multiple session ID columns
+        so you must choose one as the primary ID during annotation
+
+The TSV MUST **NOT** contain:
+
+- Missing values in the columns you plan to annotate as containing the primary subject IDs and session IDs (if available)
+
+For all variables currently modeled by Neurobagel, see the [data dictionary section](dictionaries.md).
 
 ### Datasets with imaging (BIDS) data
 If a dataset has imaging data in [BIDS](https://bids-specification.readthedocs.io/en/stable/) format,
 Neurobagel **additionally** requires that:
 
-- At least one column in the phenotypic TSV contains subject IDs that match the names of [BIDS subject subdirectories](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#filesystem-structure).
-If this condition is not met, you will encounter an error when [running the Neurobagel CLI](cli.md) on your dataset to generate Neurobagel graph-ready files, indicating that your BIDS directory contains subjects not found in your phenotypic file.
+- At least one column in the phenotypic TSV contains subject IDs that
+  match the names of [BIDS subject subdirectories](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#filesystem-structure).
+  If this condition is not met, you will encounter an error
+  when [running the Neurobagel CLI](cli.md) on your dataset to generate Neurobagel graph-ready files,
+  indicating that your BIDS directory contains subjects not found in your phenotypic file.
 
     !!! note
         Subject IDs are case-sensitive and must match BIDS subject IDs exactly
@@ -39,7 +65,10 @@ If this condition is not met, you will encounter an error when [running the Neur
 
 - All BIDS subjects are included in the phenotypic TSV, even if they only have BIDS imaging information.
   Neurobagel does not allow for datasets where subjects have BIDS
-  data but are not represented in the phenotypic TSV (however, subjects who have phenotypic data but no BIDS data are allowed).
+  data but are not represented in the phenotypic TSV
+  (however, subjects who have phenotypic data but no BIDS data are allowed).
+- If the dataset is longitudinal, the session IDs in the phenotypic TSV
+  MAY match the session IDs in the BIDS dataset, but don't have to.
 
 ## Examples of valid phenotypic TSVs
 
@@ -83,16 +112,23 @@ Example TSV:
 (see also the BIDS specification section on [Longitudinal and multi-site studies](https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#longitudinal-and-multi-site-studies)).
 
 ### Multiple participant or session identifier columns
+
-In some cases, there may be a need for more than one set of IDs
+In some cases, there may be a need for more than one set of IDs
 for participants and/or sessions.
 
 For example, if a participant was first enrolled in a behavioural study
-with one type of ID,
+with one type of ID,
 and then later joined an imaging study under a different ID.
 In this case, both types of participant IDs should be recorded in the tabular file.
 
 The only requirement is that **the combination of all ID values for a row is unique**.
 
+!!! Warning "Neurobagel currently supports only one subject ID and one session ID"
+
+    Neurobagel currently does not support annotating multiple subject or session ID columns in the same TSV file.
+    If you have multiple subject or session IDs, you must choose one to use as the primary ID.
+    Additional subject/session ID columns can still be included in the TSV but will be ignored by Neurobagel.
+
 Example **invalid** TSV:
 
 | participant_id | alternative_participant_id | ... |