Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validator would fail on "legit" "n/a" value #1026

Closed
yarikoptic opened this issue Oct 3, 2024 · 11 comments
Closed

validator would fail on "legit" "n/a" value #1026

yarikoptic opened this issue Oct 3, 2024 · 11 comments

Comments

@yarikoptic
Copy link
Contributor

on "spacetop" dataset (which is yet to be made public) there is .tsv files with n/a in some .tsv's, which is the way BIDS mandates to code "missing values". See https://bids-specification.readthedocs.io/en/stable/common-principles.html#tabular-files which states

String values containing tabs MUST be escaped using double quotes. Missing and non-applicable values MUST be coded as n/a. Numerical values MUST employ the dot (.) as decimal separator and MAY be specified in scientific notation, using e or E to separate the significand from the exponent. TSV files MUST be in UTF-8 encoding.

I filed

ATM for my sample dataset (not yet public on openneuro, attn @jungheejung - remind me if we do have it somewhere public meanwhile), a version of hed-validator (#1025) blows with

(deno) yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop$ ~/bin/hed-validator .
Using HEDTOOLS version: {'date': '2024-06-14T17:02:33-0500', 'dirty': False, 'error': None, 'full-revisionid': '940e75ddcedd5a14910098b60277413edc3c024e', 'version': '0.5.0'}
Traceback (most recent call last):
  File "/home/yoh/bin/hed-validator", line 71, in <module>
    main()
  File "/home/yoh/bin/hed-validator", line 37, in main
    issue_list = bids.validate(check_for_warnings=args.check_for_warnings)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/hed/tools/bids/bids_dataset.py", line 84, in validate
    issues += files.validate_datafiles(self.schema, check_for_warnings=check_for_warnings)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/hed/tools/bids/bids_file_group.py", line 157, in validate_datafiles
    issues += data_obj.contents.validate(hed_schema, extra_def_dicts=extra_def_dicts, name=name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/hed/models/base_input.py", line 359, in validate
    validation_issues = tab_validator.validate(self, self._mapper.get_def_dict(hed_schema, extra_def_dicts), name,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/hed/validator/spreadsheet_validator.py", line 63, in validate
    data_new._dataframe = df_util.sort_dataframe_by_onsets(data.dataframe)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/hed/models/df_util.py", line 118, in sort_dataframe_by_onsets
    df_copy['_temp_onset_sort'] = df_copy['onset'].astype(float)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
           ^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yoh/miniconda3/envs/deno/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'n/a'

note that it is also unclear on which file it blows -- so might be worth providing some feedback (logging ERROR level or just catch/enhance exception while working on a specific file) on which file it happens.

I think it is likely on n/a in onset field in some _events.tsv, e.g.

(deno) yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop$ git grep '^n/a' | grep _events.tsv | head
sub-0001/ses-02/func/sub-0001_ses-02_task-faces_acq-mb8_run-01_events.tsv:n/a   n/a     rating_mouse_trajectory n/a     intensity       disgust male    African old     n/a
sub-0001/ses-02/func/sub-0001_ses-02_task-faces_acq-mb8_run-01_events.tsv:n/a   n/a     rating_mouse_trajectory n/a     intensity       happy   male    EA      young   n/a
sub-0001/ses-02/func/sub-0001_ses-02_task-faces_acq-mb8_run-02_events.tsv:n/a   n/a     rating_mouse_trajectory n/a     sex     happy   male    WC      old     n/a

FWIW -- we did not see similar crash while running deno bids-validator (I guess uses JS version?).

@VisLab
Copy link
Member

VisLab commented Oct 3, 2024

Could you amplify on which validator you are using?

I'm pretty sure that the python validator treats missing values (n/a) values in the onset column as a fatal error. What does it mean if an events.tsv field has an n/a onset value?

@yarikoptic
Copy link
Contributor Author

It is Python one as from this PR https://github.com/hed-standard/hed-python/pull/1025/files

Means that we don't know when it was exactly presented, likely between events with timings present. @jungheejung could elaborate more

@VisLab
Copy link
Member

VisLab commented Oct 3, 2024

hmmm -- since it has been established that event files in BIDS do not have to be ordered by onset, I don't think it would be meaningful to allow onset times to be "n/a". It used to be explicitly so in the spec. It is implied in the description under Task Events in that it explicitly says that duration allows n/a but does not say that for onset. Both are required columns.

@yarikoptic
Copy link
Contributor Author

well, "nothing forbids" it to be n/a ATM, and as for meaningful -- might still be useful to know that there was some event of that kind but timing information was lost, in comparison to completely leaving having such an event out.

A somewhat odd example: earthquake ;) you would know it did happen during the scan but would not know when exactly... Or a little more realistic -- "someone passed with a metal roller bad by the MEG scanner room" during session ;)
I think it would be meaningful to annotate those in _events.tsv as that is when (in that scanning session) it happened. And nothing in BIDS forbids me from doing so ATM.

So indeed -- could be argued, but allowed for in BIDS ATM, thus could be present/used, and tool should ideally be robust to encounter such cases

@VisLab
Copy link
Member

VisLab commented Oct 3, 2024

This doesn't address the issue.... what does that event mean? It seems that the implication is that this event has a mystery time that is between the times of the two events in which it appears between in the tsv. This is problematic because events don't have to be sorted by event time. If you sort -- those will appear at the beginning or the end and you will have lost the association.

Other numeric columns can have n/a, no problem, but it doesn't seem reasonable that events --- which are supposed to be aligned on the time line should be allowed to have no onset times.

A potential way to handle this is to introduce a "foggy" time notion. We could introduce a tag that says the value of something is known to within +/- some amount (e.g., Error-bar/#). If an n/a in the onset column is to mean that it was just something that happened on some unknown time during the experiment, then this specific meaning must be spelled out really explicitly in BIDS and I can't help thinking it should go somewhere else.

@sappelhoff @effigies @Remi-Gau --- the question here is whether the onset columns in events.tsv files should be allowed to have n/a's and if so, what would that mean? Do you care to weigh in on this?

@effigies
Copy link

effigies commented Oct 3, 2024

IMO it would mean that the dataset author accepts undefined treatment of those rows. A tool that needs values to operate can justifiably drop missing values.

@VisLab
Copy link
Member

VisLab commented Oct 4, 2024

@yarikoptic -- we'll fix the issue as follows --- clearly the validator should not throw an exception. For now we will ignore the rows with n/a onsets and generate a warning that these are being ignored. I think the meaning of n/a onsets for events.tsv needs to be clarified in the BIDS specification. It is possible that in the future we could validate those rows as "non-time" -- as we do with validation of other .tsv files such as participants.tsv.

@yarikoptic
Copy link
Contributor Author

Sounds great! Thank you in advance!

@yarikoptic
Copy link
Contributor Author

@VisLab Hi Kay, not to be pushy, but I wonder when do you think you would have time to address this issue? we would like to make our BIDS dataset sparkling clean for HED ;)

@VisLab
Copy link
Member

VisLab commented Oct 11, 2024 via email

@VisLab
Copy link
Member

VisLab commented Oct 11, 2024

Fixed by PR #1028. Let me know if there are issues.

The validator now treats n/a onset rows as though there were no onset column. It validates the HED but does not allow event processes that have onsets and offsets. Let me know if you encounter any issues. Thx

@VisLab VisLab closed this as completed Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants