Commit fa2b5d8

[ENH] Allow participants.tsv to contain a superset of subject directories and subjects listed in phenotype files (#2044)
* Update src/schema/objects/files.yaml

  The participants schema description now contains the comprehensive superset rule from #914.

* Update src/schema/objects/files.yaml

  Committing the good suggestion.

  Co-authored-by: Chris Markiewicz <[email protected]>

* Update src/schema/objects/files.yaml

* doc(schema): Update intersects() to return the intersection if non-empty

* feat(schema): Require participants.tsv to be a superset of sub_dirs/participants

* schema: Improve error messages

---------

Co-authored-by: Chris Markiewicz <[email protected]>
Co-authored-by: Chris Markiewicz <[email protected]>
2 parents 5003d8f + 4ce9fea commit fa2b5d8

4 files changed

+34
-21
lines changed

src/schema/README.md

Lines changed: 14 additions & 14 deletions
@@ -259,20 +259,20 @@ The following operators should be defined by an interpreter:

 The following functions should be defined by an interpreter:

-| Function | Definition | Example | Note |
-| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
-| `count(arg: array, val: any) -> int` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
-| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. See following section for the meanings of rules. | `exists(sidecar.IntendedFor, "subject")` | True if all files in `IntendedFor` exist, relative to the subject directory. |
-| `index(arg: array, val: any) -> int` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
-| `intersects(a: array, b: array) -> bool` | `true` if arguments contain any shared elements | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
-| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
-| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
-| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
-| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
-| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
-| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
-| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
-| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |
+| Function | Definition | Example | Note |
+| ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------------------ |
+| `count(arg: array, val: any) -> int` | Number of elements in an array equal to `val` | `count(columns.type, "EEG")` | The number of times "EEG" appears in the column "type" of the current TSV file |
+| `exists(arg: str \| array, rule: str) -> int` | Count of files in an array that exist in the dataset. String is array with length 1. See following section for the meanings of rules. | `exists(sidecar.IntendedFor, "subject")` | True if all files in `IntendedFor` exist, relative to the subject directory. |
+| `index(arg: array, val: any) -> int` | Index of first element in an array equal to `val`, `null` if not found | `index(["i", "j", "k"], axis)` | The number, from 0-2 corresponding to the string `axis` |
+| `intersects(a: array, b: array) -> array \| bool` | The intersection of arrays `a` and `b`, or `false` if there are no shared values. | `intersects(dataset.modalities, ["pet", "mri"])` | Non-empty array if either PET or MRI data is found in dataset, otherwise false |
+| `allequal(a: array, b: array) -> bool` | `true` if arrays have the same length and paired elements are equal | `intersects(dataset.modalities, ["pet", "mri"])` | True if either PET or MRI data is found in dataset |
+| `length(arg: array) -> int` | Number of elements in an array | `length(columns.onset) > 0` | True if there is at least one value in the onset column |
+| `match(arg: str, pattern: str) -> bool` | `true` if `arg` matches the regular expression `pattern` (anywhere in string) | `match(extension, ".gz$")` | True if the file extension ends with `.gz` |
+| `max(arg: array) -> number` | The largest non-`n/a` value in an array | `max(columns.onset)` | The time of the last onset in an events.tsv file |
+| `min(arg: array) -> number` | The smallest non-`n/a` value in an array | `min(sidecar.SliceTiming) == 0` | A check that the onset of the first slice is 0s |
+| `sorted(arg: array, method: str) -> array` | The sorted values of the input array; defaults to type-determined sort. If method is "lexical", or "numeric" use lexical or numeric sort. | `sorted(sidecar.VolumeTiming) == sidecar.VolumeTiming` | True if `sidecar.VolumeTiming` is sorted |
+| `substr(arg: str, start: int, end: int) -> str` | The portion of the input string spanning from start position to end position | `substr(path, 0, length(path) - 3)` | `path` with the last three characters dropped |
+| `type(arg: Any) -> str` | The name of the type, including `"array"`, `"object"`, `"null"` | `type(datatypes)` | Returns `"array"` |

 #### The `exists()` function

src/schema/meta/expression_tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@
 - expression: type(true)
   result: 'boolean'
 - expression: intersects([1], [1, 2])
-  result: true
+  result: [1]
 - expression: intersects([1], [])
   result: false
 - expression: length([1, 2, 3])
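
The changed test case pins down the new return convention for `intersects()`: the shared values when there are any, otherwise `false`. A minimal Python sketch of that convention follows. It is an illustration rather than the validator's actual implementation, and the ordering and de-duplication of the returned values are assumptions that the schema text does not specify.

```python
from typing import Any, List, Union


def intersects(a: List[Any], b: List[Any]) -> Union[List[Any], bool]:
    """Return the values shared by `a` and `b`, or False if there are none.

    Assumption: shared values are returned once each, in the order they
    first appear in `a`. The schema only states that the intersection is
    returned when it is non-empty.
    """
    shared: List[Any] = []
    for item in a:
        if item in b and item not in shared:
            shared.append(item)
    return shared if shared else False


print(intersects([1], [1, 2]))  # [1]
print(intersects([1], []))      # False
```

Because a non-empty array is truthy, expressions that only test whether two arrays overlap should keep working under this convention, while new checks can also use the returned values.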

src/schema/objects/files.yaml

Lines changed: 4 additions & 0 deletions
@@ -75,6 +75,10 @@ participants:
     followed by a list of optional columns describing participants.
     Each participant MUST be described by one and only one row.

+    The `participant_id` entries MUST be a superset of all subject directories
+    and all `participant_id` entries found among phenotypic and assessment data
+    in the `phenotype/` directory.
+
     Commonly used *optional* columns in `participants.tsv` files are `age`, `sex`,
     `handedness`, `strain`, and `strain_rrid`.

src/schema/rules/checks/dataset.yaml

Lines changed: 15 additions & 6 deletions
@@ -18,26 +18,35 @@ ParticipantIDMismatch:
   issue:
     code: PARTICIPANT_ID_MISMATCH
     message: |
-      Participant labels found in this dataset did not match the values in participant_id column
-      found in the participants.tsv file.
+      Subject directories found in this dataset did not match the values in
+      the participant_id column found in the participants.tsv file.
     level: error
   selectors:
     - path == '/participants.tsv'
   checks:
-    - allequal(sorted(columns.participant_id), sorted(dataset.subjects.sub_dirs))
+    - |
+      allequal(
+        sorted(intersects(columns.participant_id, dataset.subjects.sub_dirs)),
+        sorted(dataset.subjects.sub_dirs)
+      )

 # 51
 PhenotypeSubjectsMissing:
   issue:
     code: PHENOTYPE_SUBJECTS_MISSING
     message: |
-      A phenotype/ .tsv file lists subjects that were not found in the dataset.
+      A phenotype/ .tsv file lists subjects that were not found in
+      the participant_id column found in the participants.tsv file.
     level: error
   selectors:
-    - path == '/dataset_description.json'
+    - path == '/participants.tsv'
     - type(dataset.subjects.phenotype) != 'null'
   checks:
-    - allequal(sorted(dataset.subjects.phenotype), sorted(dataset.subjects.sub_dirs))
+    - |
+      allequal(
+        sorted(intersects(columns.participant_id, dataset.subjects.phenotype)),
+        sorted(dataset.subjects.phenotype)
+      )

 # 214
 SamplesTSVMissing:
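
Both rewritten checks use the same pattern: intersect the `participant_id` column with a required set of labels, then compare the sorted result with the sorted required set. The Python sketch below mimics that pattern outside the schema interpreter. The helper names and the sample labels are hypothetical, and treating a `false` result from `intersects()` as an empty list is an assumption of this sketch, not documented interpreter behavior.

```python
from typing import List, Union


def intersects(a: List[str], b: List[str]) -> Union[List[str], bool]:
    """Shared values of `a` and `b` (once each, in `a` order), or False if none."""
    shared = [x for x in dict.fromkeys(a) if x in b]
    return shared if shared else False


def allequal(a: List[str], b: List[str]) -> bool:
    """True if both lists have the same length and paired elements are equal."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def is_superset(participant_ids: List[str], required: List[str]) -> bool:
    """Mirror of allequal(sorted(intersects(ids, required)), sorted(required))."""
    shared = intersects(participant_ids, required) or []  # assumption: False -> []
    return allequal(sorted(shared), sorted(required))


participant_ids = ["sub-01", "sub-02", "sub-03"]  # hypothetical participants.tsv column
print(is_superset(participant_ids, ["sub-01", "sub-02"]))  # True: extra rows are allowed
print(is_superset(participant_ids, ["sub-01", "sub-04"]))  # False: sub-04 is not listed
```

Under the previous checks the first call would have failed, because `participants.tsv` had to match the subject directories exactly instead of merely containing them.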
