
[MISC] Update guidelines on file formats and multidimensional arrays - for derivatives #1614

Open · wants to merge 8 commits into `master`
src/derivatives/introduction.md (35 changes: 31 additions & 4 deletions)

@@ -93,10 +93,11 @@ in [Derived dataset and pipeline description][derived-dataset-description].

## File format specification

Derived data may be resampled into structures that are not well-handled by the
raw data formats.
In this section, we describe standard formats that SHOULD be adhered to when
appropriate, and the extensions they should have.
Generally, derivative data formats SHOULD be the same as for raw data.
For instance, raw EEG data stored in the `.edf` format SHOULD also be stored in that format when averaged.
However, derived data may be resampled into structures that are not well handled by the raw data formats.
For such scenarios, this section describes standard formats that SHOULD be adhered to when appropriate,
and the extensions they should have.

### GIFTI Surface Data Format

@@ -123,6 +124,32 @@ or combinations of data arrays.
Unless otherwise stated, bare `.gii` extensions SHOULD NOT be used
for GIFTI files.

### Multidimensional arrays: HDF5 and Zarr

For multidimensional arrays, the following file formats are RECOMMENDED:

- [HDF5](https://www.hdfgroup.org/solutions/hdf5/)
- [Zarr](https://zarr.dev/)

HDF5 and Zarr container files (note that `.zarr` is typically a directory) SHOULD contain the data only (with the field `data`).
This `data` field SHOULD be treated as a "virtual directory tree" with a depth of one level,
containing BIDS paths at the level of the multidimensional file
(that is, the `.zarr` directory root or the `.h5` file).
BIDS path rules MUST be applied as though these paths existed within the dataset.
Metadata about the multidimensional array SHOULD be documented in the associated JSON sidecar file.
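
For illustration, a minimal Python sketch of one possible reading of this paragraph, in which `data` is a single HDF5 group whose member name is a BIDS-style path relative to the `.h5` file (the filenames, array shape, and member name below are hypothetical, and `h5py`/`numpy` are assumed):

```Python
import json

import h5py
import numpy as np

# Hypothetical 4D cross-coherence array: node x node x time x frequency band.
relmat = np.zeros((64, 64, 200, 5), dtype="float32")

with h5py.File("sub-001_task-listening_meas-crosscoherence_relmat.h5", "w") as f:
    # One possible reading: a single top-level "data" group acting as a
    # one-level "virtual directory tree" of BIDS-style names.
    data = f.create_group("data")
    data.create_dataset("sub-001_task-listening_meas-crosscoherence_relmat", data=relmat)

# Metadata about the array lives in the JSON sidecar, not in the container.
sidecar = {
    "Description": "Cross-coherence between EEG channels (illustrative only)",
    "Dimensions": ["node", "node", "time", "frequency band"],
}
with open("sub-001_task-listening_meas-crosscoherence_relmat.json", "w") as fp:
    json.dump(sidecar, fp, indent=2)
```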
Comment on lines +134 to +139

Collaborator

Sorry for taking so long to review this. I think this is roughly what's being proposed here (using raw data as an example):

```Text
dataset/
  sub-01/
    anat.zarr/
      .zgroup
      sub-01_T1w/
        .zarray
        .zattrs
        ...
  sub-02/
    anat.zarr/
      .zgroup
      sub-02_T1w/
        .zarray
        .zattrs
        ...
```
This repackaging of BIDS data inside a hierarchical data format feels very radical and will require tools to be rewritten to understand entire datasets, as opposed to specific derivative files. I suspect that this is not what was actually intended, so I think it would be very helpful to see examples of the intent.

I see basically two cases that should be addressed:

  1. Existing BIDS-supported formats are built on HDF5 (.nwb, .snirf) or Zarr (.ome.zarr). When considering options for new formats, these should be prioritized to reduce the expansion of necessary tooling.
  2. For generic multidimensional array outputs, HDF5 and Zarr can be treated as extensions of .tsv files. Where TSV files with a header row represent a collection of named 1D arrays, an HDF5/Zarr container contains named N-D arrays that are not constrained to have a common shape. For simplicity, it is encouraged to use a collection of names at the root, which are to be described in a sidecar JSON. For example, to output raw model parameters for an undefined model, one might use:
```Text
sub-<label>/<datatype>/<entities>_<suffix>.zarr/
    .zgroup
    alpha/
        .zarray
        ...
    beta/
        .zarray
        ...
sub-<label>/<datatype>/<entities>_params.json
```

And the JSON file would contain:

```JSON
{
  "alpha": {
    "Description": "alpha parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  },
  "beta": {
    "Description": "beta parameter for XYZ model, fit using ABC estimation process",
    "Units": "arbitrary"
  }
}
```
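
A minimal sketch of how this proposed layout could be produced, assuming the `zarr` and `numpy` Python packages (the filenames, array names, and shapes below are illustrative, not prescribed):

```Python
import json

import numpy as np
import zarr

rng = np.random.default_rng(0)
alpha = rng.normal(size=(64,))    # named 1D array
beta = rng.normal(size=(64, 3))   # named 2D array; shapes need not match

# Named N-D arrays at the root of the .zarr store (writes .zgroup/.zarray).
zarr.save_group("sub-01_params.zarr", alpha=alpha, beta=beta)

# Each named array is described in the sidecar JSON.
sidecar = {
    "alpha": {"Description": "alpha parameter for XYZ model", "Units": "arbitrary"},
    "beta": {"Description": "beta parameter for XYZ model", "Units": "arbitrary"},
}
with open("sub-01_params.json", "w") as fp:
    json.dump(sidecar, fp, indent=2)
```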

If this was the intent, I'm happy to propose alternative text.

Collaborator Author
I am not sure I understand your example; those arrays were meant for 'stuff' that does not fit the current formats. Why would one start allowing packing of current data? Maybe in BIDS 2.0, but it seems too radical at this stage.

Collaborator
The initial example is the best I can make of the current text. I don't know what is being described here.


Example of preprocessed data (here `relmat` indicates a relational matrix stored as a 4D array: node × node × time × frequency band):
```Text
└─ derivatives/
   ├─ descriptions.tsv
   └─ sub-001/
      └─ eeg/
         ├─ sub-001_task-listening_desc-preproc_eeg.edf
         ├─ sub-001_task-listening_desc-preproc_eeg.json
         ├─ sub-001_task-listening_meas-crosscoherence_relmat.h5
         └─ sub-001_task-listening_meas-crosscoherence_relmat.json
```
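
To make the `relmat`/sidecar pairing concrete, a consumer could read the pair back roughly like this (a sketch assuming `h5py`, run from the derivatives dataset root, and reusing the single-`data`-group reading sketched earlier; the member name is hypothetical):

```Python
import json

import h5py

base = "sub-001/eeg/sub-001_task-listening_meas-crosscoherence_relmat"

# The JSON sidecar documents the multidimensional array.
with open(base + ".json") as fp:
    meta = json.load(fp)

# The HDF5 container holds only the data, here assumed under a "data" group.
with h5py.File(base + ".h5", "r") as f:
    group = f["data"]
    name = next(iter(group))     # first (and only) member under this reading
    relmat = group[name][...]    # node x node x time x frequency band

print(meta.get("Description", ""), relmat.shape)
```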

<!-- Link Definitions -->

[definitions]: ../common-principles.md#definitions