
ENH: Crawl dataset's metadata only once and before Nipype's workflow #1317

Merged: 9 commits merged into master on Aug 16, 2024

Conversation

oesteban (Member)

Aggregate dataset-wise operations that typically traverse the list of input files (datalad get, extracting the biggest file size, and metadata extraction) into a single step.

Resolves #1316.
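
For context, a minimal sketch of what such a single-pass crawl could look like, assuming plain BIDS JSON sidecars sitting next to each NIfTI; `crawl_dataset` is a hypothetical name and the `datalad get` step is omitted, so this is not the code in this PR:

```python
import json
from pathlib import Path


def crawl_dataset(bold_files):
    """Collect sidecar metadata and the largest file size in a single traversal."""
    metadata = {}
    biggest_bytes = 0
    for bold in map(Path, bold_files):
        # BIDS sidecar sits next to the NIfTI: strip .gz, then swap .nii for .json
        sidecar = bold.with_suffix("").with_suffix(".json")
        if sidecar.exists():
            metadata[str(bold)] = json.loads(sidecar.read_text())
        biggest_bytes = max(biggest_bytes, bold.stat().st_size)
    return metadata, biggest_bytes / (1024 ** 3)  # largest file size, in GB
```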

@oesteban force-pushed the fix/me-metadadata-multiplicity branch 3 times, most recently from 82597a5 to ef40015 on August 12, 2024 22:30
@oesteban changed the title from ENH: Run dataset-wise operations only once to ENH: Crawl dataset's metadata only once and before Nipype's workflow on Aug 14, 2024
@oesteban (Member, Author)

@effigies @mgxd it'd be nice to get feedback on this one - maybe not so much on the code itself (if time is tight), but on the idea of pushing all metadata crawling to just once at the beginning. This way:

  • we avoid the costly ReadSidecarJSON nodes, especially with multi-echo data (which multiplies the cost by the number of echoes),
  • we take advantage of iterating over input files to estimate the biggest file size in GB, and we also count all echoes of a run as a single file (in some scenarios you are likely to have to load all echoes in memory at once -- a similar rationale would apply to magnitude and phase, btw),
  • we still leverage Nipype's caching, and
  • we can iterate over inputs in parallel with asyncio in a rather efficient way (a sketch of this pattern follows below).
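
For illustration, a minimal sketch of the asyncio pattern mentioned in the last bullet, assuming JSON sidecars sit next to the input files; `read_sidecar` and `crawl` are hypothetical names, not the functions added by this PR:

```python
import asyncio
import json
from pathlib import Path


async def read_sidecar(nifti_path: Path) -> dict:
    """Read one JSON sidecar in a worker thread so the event loop is not blocked."""
    sidecar = nifti_path.with_suffix("").with_suffix(".json")
    return await asyncio.to_thread(lambda: json.loads(sidecar.read_text()))


async def crawl(nifti_paths):
    """Fetch all sidecars concurrently, preserving the input order."""
    return await asyncio.gather(*(read_sidecar(Path(p)) for p in nifti_paths))


# Usage: metadata = asyncio.run(crawl(list_of_bold_files))
```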

WDYT?

This PR is still marked as a draft and I'm testing locally -- the base implementation should be solid enough to take a look.

@effigies (Member)

Yeah, no objections to the overall strategy.

@mgxd (Contributor) commented Aug 14, 2024

So essentially this shifts to keeping all metadata in memory? What if the fields get pared down to only the ones relevant to the pipeline?

@oesteban force-pushed the fix/me-metadadata-multiplicity branch 5 times, most recently from 2b6049f to 233d7bd on August 15, 2024 08:31
@oesteban (Member, Author)

So essentially this shifts to keeping all metadata in memory?

Yes, that's correct. If memory becomes a concern (although the metadata's size should be negligible), you can set them to None and only load them when needed from the pickle file (a sketch of this pattern follows below).
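
For illustration, a minimal sketch of that drop-and-reload idea, assuming the crawled metadata is pickled to disk once up front; `LazyMetadata` and `cache_path` are hypothetical names, not MRIQC's implementation:

```python
import pickle
from pathlib import Path


class LazyMetadata:
    """Hold metadata in memory until dropped, then reload it from disk on demand."""

    def __init__(self, metadata: dict, cache_path: Path):
        self._cache_path = cache_path
        self._metadata = metadata
        cache_path.write_bytes(pickle.dumps(metadata))  # persist once, up front

    def drop(self):
        """Release the in-memory copy if memory becomes a concern."""
        self._metadata = None

    @property
    def metadata(self) -> dict:
        if self._metadata is None:  # reload lazily from the pickle file
            self._metadata = pickle.loads(self._cache_path.read_bytes())
        return self._metadata
```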

What if the fields get pared down to only the ones relevant to the pipeline?

Not sure I understand -- you mean further filtering metadata within the new loop so only the relevant metadata is kept?

@oesteban force-pushed the fix/me-metadadata-multiplicity branch 4 times, most recently from f5c2037 to 991eb50 on August 15, 2024 15:33
@mgxd (Contributor) commented Aug 15, 2024

I realize I'm not familiar with what mriqc actually needs the metadata for - is it using the information to calculate something, or just aggregating it into the report?

@oesteban force-pushed the fix/me-metadadata-multiplicity branch 5 times, most recently from fe47aa2 to de9629d on August 15, 2024 22:11
@oesteban (Member, Author)

I realize I'm not familiar with what mriqc actually needs the metadata for - is it using the information to calculate something, or just aggregating it into the report?

Aggregating it into the report. However, fMRIPrep does something similar, as it attaches all the metadata to the output. We could have some dictionary of relevant metadata, or leave it to the user to find non-critical, unmodified metadata (in fMRIPrep).

MRIQC does filter some metadata when submitting to the webapi, but I'm a bit sceptical that will actually shave off a lot of memory.
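
For illustration, a minimal sketch of what paring the metadata down to a dictionary of relevant fields could look like; the field list below is an arbitrary example, not MRIQC's actual selection:

```python
# Hypothetical whitelist of the sidecar fields the pipeline/report would keep.
RELEVANT_FIELDS = ("RepetitionTime", "EchoTime", "FlipAngle", "Manufacturer")


def pare_metadata(metadata: dict) -> dict:
    """Drop every field not explicitly listed as relevant."""
    return {key: metadata[key] for key in RELEVANT_FIELDS if key in metadata}
```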

@oesteban force-pushed the fix/me-metadadata-multiplicity branch from da3b412 to cf1ea8f on August 16, 2024 04:31
@oesteban marked this pull request as ready for review on August 16, 2024 04:32
@oesteban merged commit 0cf1ae6 into master on Aug 16, 2024
15 checks passed
@oesteban deleted the fix/me-metadadata-multiplicity branch on August 16, 2024 16:06
Development

Successfully merging this pull request may close these issues.

Multi-echo BOLD pipeline: potential waste relating/after metadata extraction