
Develop file structures and scripts for automated metadata extraction #54

Bankso opened this issue Feb 26, 2024 · 1 comment
Bankso commented Feb 26, 2024

This issue will need to be broken down further, but I wanted to write everything down together so it can be reviewed in context.

Short description: we should create a system that automatically extracts metadata from files uploaded to contributor Synapse projects.

Goals:

  • design, write, and test code that extracts metadata from data/auxiliary files and adds it to manifests
  • design Synapse project folder structures and file content requirements that support automated metadata extraction

Capturing metadata from data files or (seemingly random) processing outputs is a non-trivial task that takes significant time and attention, even for someone familiar with the data type and how it was processed. Tools that extract this information and map it into a data model would reduce the time, effort, and expertise required. As we move toward large, complex data sets (like spatial profiling and multiplexed imaging), metadata requirements will only grow more substantial, so addressing this now will limit the amount of poorly annotated or unannotated data that gets deposited in repositories.

The two main parts would be the file organization/content requirements and the scripts to extract metadata.

File organization and content requirements

  • have defined folder structures/file relationships for assay data stored in Synapse projects
  • have defined supplemental file content/structure requirements, likely tied to a specific data processing pipeline or method but as generalizable as possible. These could be expanded over time to fit different approaches/protocols
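To make the first point concrete, a defined folder structure could be checked programmatically before extraction runs. The sketch below assumes a hypothetical layout (folder names `raw`, `processed`, `supplemental` and the allowed extensions are illustrative, not an agreed-upon convention):

```python
from pathlib import Path

# Hypothetical layout for one assay dataset in a contributor project.
# Folder names and allowed extensions are placeholders for illustration.
EXPECTED_LAYOUT = {
    "raw": [".fastq.gz", ".ome.tiff"],          # primary data files
    "processed": [".bam", ".csv", ".tsv"],      # pipeline outputs
    "supplemental": [".json", ".txt", ".csv"],  # sample sheets, QC reports
}

def check_layout(project_root: str) -> list[str]:
    """Return a list of problems found against the expected layout."""
    problems = []
    root = Path(project_root)
    for folder, extensions in EXPECTED_LAYOUT.items():
        subdir = root / folder
        if not subdir.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        for f in subdir.iterdir():
            if f.is_file() and not any(f.name.endswith(ext) for ext in extensions):
                problems.append(f"unexpected file type: {folder}/{f.name}")
    return problems
```

A check like this could run before extraction so that the "find, clean, copy" scripts can rely on files being where the structure says they are.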

Scripts to extract metadata

  • generally, this will be a collection of "find, clean, copy" functions that take structured input and pull the requested fields into the metadata model
  • in some cases, metadata will correspond to calculated values. These could be computed on the fly instead of being extracted
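A minimal "find, clean, copy" function might look like the sketch below: find a known field in structured input, clean the raw value, and copy it under the corresponding term in the metadata model. The field names and model terms here are assumptions for illustration, not taken from an actual schema:

```python
# Illustrative mapping from field names as they appear in a (hypothetical)
# sample sheet to terms in the metadata model.
FIELD_MAP = {
    "Sample_ID": "specimenID",
    "Library_Prep": "libraryPrep",
    "Read_Length": "readLength",
}

def find_clean_copy(rows):
    """Pull requested fields from structured rows into manifest records.

    rows: iterable of dicts, e.g. from csv.DictReader over a sample sheet.
    """
    records = []
    for row in rows:
        record = {}
        for source_key, model_key in FIELD_MAP.items():
            value = row.get(source_key, "").strip()  # find + clean
            if value:
                record[model_key] = value            # copy into model term
        records.append(record)
    return records
```

Calculated values (the second bullet) would slot in the same way, with a small function computing the value instead of looking it up.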

For some file types with standardized structure (e.g., OME-TIFF, FASTQ), automated metadata extraction is well established, so we can adopt existing methods where compatible.
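As one example of a standardized format, Illumina-style (Casava 1.8+) FASTQ headers encode instrument, run, and flowcell in fixed colon-separated positions, so extraction is a simple parse. The model term names in the sketch (`instrument`, `runNumber`, etc.) are assumptions:

```python
def parse_fastq_header(header: str) -> dict:
    """Extract run metadata from an Illumina-style (Casava 1.8+) FASTQ header.

    Header layout: @instrument:run:flowcell:lane:tile:x:y read:filter:control:index
    Only the first four positional fields are captured here.
    """
    left = header.lstrip("@").split(" ")[0]
    fields = left.split(":")
    keys = ["instrument", "runNumber", "flowcellID", "lane"]  # assumed model terms
    return dict(zip(keys, fields))
```

OME-TIFF would be handled similarly, but by reading the embedded OME-XML block rather than a positional header.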

Extracting metadata from auxiliary files (e.g., sample sheets, quality control reports) generated by instruments and various R/Python packages is more difficult, since file formats and data structures are not necessarily consistent between implementations, but I think this is where we can make the most gains.
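One way to cope with that inconsistency is a tolerant scanner that accepts several separator styles (`:`, `=`, tab) and maps tool-specific synonyms onto one canonical term. The synonym table and canonical names below are made up for illustration:

```python
import re

# Synonyms for the same concept across different tools' reports (illustrative).
SYNONYMS = {
    "total reads": "totalReads",
    "number of reads": "totalReads",
    "reads passing filter": "readsPassingFilter",
}

# Matches "key: value", "key = value", or "key<TAB>value" lines.
LINE_RE = re.compile(r"^\s*(?P<key>[^:=\t]+?)\s*[:=\t]\s*(?P<value>.+?)\s*$")

def scan_report(lines):
    """Scan a loosely structured text report for known metadata fields."""
    found = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't look like key/value pairs
        key = m.group("key").strip().lower()
        if key in SYNONYMS:
            found[SYNONYMS[key]] = m.group("value")
    return found
```

Each new tool or pipeline would then only require extending the synonym table, not writing a new parser.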

Bankso commented Mar 4, 2024

I wanted to note that we should also integrate data structure and content validation, potentially via DCQC. Some areas where this applies:

  • verifying file formats/structures are consistent with expectations
  • verifying QC metrics are within range/expectations
  • verifying that all file linkages are defined in metadata, where applicable
  • verifying that identifiers are consistent across different metadata/file types
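Two of these checks are simple enough to sketch: range checks on QC metrics, and identifier consistency across metadata types. The metric names, thresholds, and manifest structure below are placeholders, not a DCQC schema:

```python
# Hypothetical acceptable ranges for QC metrics; names and bounds are
# placeholders, not agreed-upon thresholds.
QC_RANGES = {
    "alignmentRate": (0.7, 1.0),
    "duplicationRate": (0.0, 0.3),
}

def check_qc_metrics(metrics: dict) -> list[str]:
    """Flag QC metrics that are missing or outside their expected ranges."""
    failures = []
    for name, (low, high) in QC_RANGES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif not (low <= value <= high):
            failures.append(f"{name}: {value} outside [{low}, {high}]")
    return failures

def check_identifier_consistency(manifests: dict) -> set[str]:
    """Return IDs that do not appear in every manifest.

    manifests: mapping of manifest name -> iterable of specimen IDs.
    """
    id_sets = [set(ids) for ids in manifests.values()]
    return set.union(*id_sets) - set.intersection(*id_sets)
```

Checks like these could run alongside extraction, so validation failures surface before data is released.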

In some cases, it may be necessary to include metadata components that are strictly associated with validation. I think this is a reasonable use case, but we should ensure that QC-related metrics are easy to obtain before implementing.
