
Develop file structures and scripts for automated metadata extraction #54

Bankso opened this issue Feb 26, 2024 · 1 comment
Bankso commented Feb 26, 2024

This issue will need to be broken down further, but I wanted to write everything down together so it can be reviewed in context.

Short description: we should create a system that automatically extracts metadata from files uploaded to contributor Synapse projects.

Goals:

  • design, write, and test code that extracts metadata from data/auxiliary files and adds it to manifests
  • design Synapse project folder structures and file content requirements that support automated metadata extraction

Capturing metadata from data files or (seemingly random) processing outputs is a non-trivial task that takes significant time and attention, even for someone familiar with the data type and how it was processed. Tools that extract this information and map it into a data model would reduce the time, effort, and expertise required. As we move toward large, complex data sets (like spatial profiling and multiplexed imaging), metadata requirements will only grow more substantial, so addressing this now will limit the amount of poorly annotated or unannotated data that gets deposited in repositories.

The two main parts would be the file organization/content requirements and the scripts to extract metadata.

File organization and content requirements

  • have defined folder structures/file relationships for assay data stored in Synapse projects
  • have defined supplemental file content/structure requirements, likely tied to a specific data processing pipeline or method but as generalizable as possible. These could be expanded over time to fit different approaches/protocols
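To make the first point concrete, a defined folder structure could be checked programmatically before extraction runs. The sketch below assumes a hypothetical layout (folder names `raw`, `processed`, `supplemental` and the allowed extensions are illustrative, not an agreed-upon convention):

```python
from pathlib import Path

# Hypothetical layout for one assay dataset in a contributor project.
# Folder names and allowed extensions are placeholders for illustration.
EXPECTED_LAYOUT = {
    "raw": [".fastq.gz", ".ome.tiff"],          # primary data files
    "processed": [".bam", ".csv", ".tsv"],      # pipeline outputs
    "supplemental": [".json", ".txt", ".csv"],  # sample sheets, QC reports
}

def check_layout(project_root: str) -> list[str]:
    """Return a list of problems found against the expected layout."""
    problems = []
    root = Path(project_root)
    for folder, extensions in EXPECTED_LAYOUT.items():
        subdir = root / folder
        if not subdir.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        for f in subdir.iterdir():
            if f.is_file() and not any(f.name.endswith(ext) for ext in extensions):
                problems.append(f"unexpected file type: {folder}/{f.name}")
    return problems
```

A check like this could run before extraction so that the "find, clean, copy" scripts can rely on files being where the structure says they are.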

Scripts to extract metadata

  • generally, this will be a collection of "find, clean, copy" functions that take structured input and pull the requested fields into the metadata model
  • in some cases, metadata will correspond to calculated values. These could be computed on the fly instead of being extracted
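A minimal "find, clean, copy" function might look like the sketch below: find a known field in structured input, clean the raw value, and copy it under the corresponding term in the metadata model. The field names and model terms here are assumptions for illustration, not taken from an actual schema:

```python
# Illustrative mapping from field names as they appear in a (hypothetical)
# sample sheet to terms in the metadata model.
FIELD_MAP = {
    "Sample_ID": "specimenID",
    "Library_Prep": "libraryPrep",
    "Read_Length": "readLength",
}

def find_clean_copy(rows):
    """Pull requested fields from structured rows into manifest records.

    rows: iterable of dicts, e.g. from csv.DictReader over a sample sheet.
    """
    records = []
    for row in rows:
        record = {}
        for source_key, model_key in FIELD_MAP.items():
            value = row.get(source_key, "").strip()  # find + clean
            if value:
                record[model_key] = value            # copy into model term
        records.append(record)
    return records
```

Calculated values (the second bullet) would slot in the same way, with a small function computing the value instead of looking it up.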

For some file types with standardized structure (e.g., OME-TIFF, FASTQ), automated metadata extraction is well established, so we can adopt existing methods where compatible.
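As one example of a standardized format, Illumina-style (Casava 1.8+) FASTQ headers encode instrument, run, and flowcell in fixed colon-separated positions, so extraction is a simple parse. The model term names in the sketch (`instrument`, `runNumber`, etc.) are assumptions:

```python
def parse_fastq_header(header: str) -> dict:
    """Extract run metadata from an Illumina-style (Casava 1.8+) FASTQ header.

    Header layout: @instrument:run:flowcell:lane:tile:x:y read:filter:control:index
    Only the first four positional fields are captured here.
    """
    left = header.lstrip("@").split(" ")[0]
    fields = left.split(":")
    keys = ["instrument", "runNumber", "flowcellID", "lane"]  # assumed model terms
    return dict(zip(keys, fields))
```

OME-TIFF would be handled similarly, but by reading the embedded OME-XML block rather than a positional header.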

Extracting metadata from auxiliary files (e.g., sample sheets, quality control reports) generated by instruments and various R/Python packages is more difficult, since file formats and data structures are not necessarily consistent between implementations, but I think this is where we can make the most gains.
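One way to cope with that inconsistency is a tolerant scanner that accepts several separator styles (`:`, `=`, tab) and maps tool-specific synonyms onto one canonical term. The synonym table and canonical names below are made up for illustration:

```python
import re

# Synonyms for the same concept across different tools' reports (illustrative).
SYNONYMS = {
    "total reads": "totalReads",
    "number of reads": "totalReads",
    "reads passing filter": "readsPassingFilter",
}

# Matches "key: value", "key = value", or "key<TAB>value" lines.
LINE_RE = re.compile(r"^\s*(?P<key>[^:=\t]+?)\s*[:=\t]\s*(?P<value>.+?)\s*$")

def scan_report(lines):
    """Scan a loosely structured text report for known metadata fields."""
    found = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that don't look like key/value pairs
        key = m.group("key").strip().lower()
        if key in SYNONYMS:
            found[SYNONYMS[key]] = m.group("value")
    return found
```

Each new tool or pipeline would then only require extending the synonym table, not writing a new parser.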

Bankso commented Mar 4, 2024

I wanted to note that we should also integrate data structure and content validation, potentially via DCQC. Some areas where this applies:

  • verifying file formats/structures are consistent with expectations
  • verifying QC metrics are within range/expectations
  • verifying that all file linkages are defined in metadata, where applicable
  • verifying that identifiers are consistent across different metadata/file types
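Two of these checks are simple enough to sketch: range checks on QC metrics, and identifier consistency across metadata types. The metric names, thresholds, and manifest structure below are placeholders, not a DCQC schema:

```python
# Hypothetical acceptable ranges for QC metrics; names and bounds are
# placeholders, not agreed-upon thresholds.
QC_RANGES = {
    "alignmentRate": (0.7, 1.0),
    "duplicationRate": (0.0, 0.3),
}

def check_qc_metrics(metrics: dict) -> list[str]:
    """Flag QC metrics that are missing or outside their expected ranges."""
    failures = []
    for name, (low, high) in QC_RANGES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif not (low <= value <= high):
            failures.append(f"{name}: {value} outside [{low}, {high}]")
    return failures

def check_identifier_consistency(manifests: dict) -> set[str]:
    """Return IDs that do not appear in every manifest.

    manifests: mapping of manifest name -> iterable of specimen IDs.
    """
    id_sets = [set(ids) for ids in manifests.values()]
    return set.union(*id_sets) - set.intersection(*id_sets)
```

Checks like these could run alongside extraction, so validation failures surface before data is released.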

In some cases, it may be necessary to include metadata components that are strictly associated with validation. I think this is a reasonable use case, but we should ensure that QC-related metrics are easy to obtain before implementing.
