This issue will need to be broken down further, but I wanted to write everything down together, so it can be reviewed in context.
Short description: we should create a system that automatically extracts metadata from files uploaded to contributor Synapse projects.
Goals:
design, write, and test code that extracts metadata from data/auxiliary files and adds it to manifests (a rough end-to-end sketch follows this list)
design Synapse project folder structures and file content requirements that support automated metadata extraction
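As a rough end-to-end sketch of the first goal, the snippet below walks an upload directory, runs a placeholder extractor over each file, and writes the results to a CSV manifest. The function names (`extract_metadata`, `write_manifest`), the directory and manifest paths, and the metadata keys are assumptions for illustration, not an existing API.

```python
# Minimal sketch: extract metadata per file, then write a CSV manifest.
# All names and paths here are placeholders.
import csv
from pathlib import Path


def extract_metadata(path: Path) -> dict:
    """Return a dict of metadata fields for a single data/auxiliary file."""
    # Placeholder extractor; real versions would dispatch on file type
    # (OME-TIFF, FASTQ, sample sheet, QC report, ...), as sketched below.
    return {"filename": path.name, "fileSizeBytes": path.stat().st_size}


def write_manifest(manifest_path: Path, records: list[dict]) -> None:
    """Write extracted metadata records to a manifest CSV."""
    fieldnames = sorted({key for record in records for key in record})
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


records = [extract_metadata(p) for p in Path("upload_dir").glob("**/*") if p.is_file()]
write_manifest(Path("manifest.csv"), records)
```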
Capturing metadata from data or (seemingly random) processing outputs is a non-trivial task that requires significant time and attention, even for someone familiar with the data type and how it was processed. Providing tools that extract this information and map it into a data model would lower the time, effort, and expertise required. As we move towards large, complex data sets (like spatial profiling and multiplexed imaging), metadata requirements will only become more substantial. Addressing this now would limit the amount of poorly annotated or unannotated data that gets deposited in repositories.
The two main parts would be the file organization/content requirements and the scripts to extract metadata.
File organization and content requirements
have defined folder structures/file relationships for assay data stored in Synapse projects
have defined supplemental file content/structure requirements, likely tied to a specific data processing pipeline or method but as generalizable as possible. These could also be expanded over time to fit different approaches/protocols (a hypothetical layout spec follows this list)
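One way to make these requirements machine-checkable is a declarative layout spec per assay type, which upload-checking scripts could read. Everything in the example below (assay name, folder names, file patterns, supplemental file names) is a hypothetical placeholder, not an agreed-upon standard.

```python
# Hypothetical declarative layout spec for one assay type. A small check
# script could compare an uploaded Synapse folder tree against entries
# like this one; all names and patterns here are placeholders.
ASSAY_LAYOUTS = {
    "single_cell_rna_seq": {
        "required_folders": ["raw", "processed", "supplemental"],
        "file_patterns": {
            "raw": ["*.fastq.gz"],
            "processed": ["*.h5ad", "*.csv"],
            "supplemental": ["sample_sheet.csv", "qc_report.*"],
        },
    },
}
```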
Scripts to extract metadata
generally, this will be a set of "find, clean, copy" functions that take structured input and pull the information requested by the metadata model (see the sketch after this list)
in some cases, metadata will correspond to calculated values. These could be computed on the fly instead of being extracted
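A minimal sketch of the "find, clean, copy" pattern, plus one on-the-fly calculated value, is below. The source file (a JSON run summary), the key names, and the helper names are hypothetical.

```python
# Sketch of the "find, clean, copy" pattern: locate a field in structured
# input, normalize it, and copy it into the metadata record.
import hashlib
import json
from pathlib import Path


def find_clean_copy(source: dict, source_key: str, target: dict, target_key: str) -> None:
    """Copy one field from parsed input into the metadata record, with light cleanup."""
    value = source.get(source_key)
    if value is not None:
        target[target_key] = str(value).strip()


def extract_from_run_summary(path: Path) -> dict:
    """Pull requested fields from a (hypothetical) JSON run summary."""
    source = json.loads(path.read_text())
    metadata: dict = {}
    find_clean_copy(source, "instrument", metadata, "platform")
    find_clean_copy(source, "run_date", metadata, "assayDate")
    # Some metadata values are calculated on the fly rather than extracted,
    # e.g. a checksum used for integrity checks and file linkage.
    metadata["md5"] = hashlib.md5(path.read_bytes()).hexdigest()
    return metadata
```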
For some file types with standardized structure (e.g., OME-TIFF, FASTQ), automated metadata extraction is well established, so we can adopt existing methods where compatible.
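As an example of the standardized-format case: OME-TIFF files embed an OME-XML header that existing libraries can read. The sketch below assumes the `tifffile` package and the 2016-06 OME schema namespace; the attribute names (SizeX, SizeC, PhysicalSizeX) come from the OME Pixels element, and the output keys are placeholders.

```python
# Sketch of extraction from a standardized format (OME-TIFF), assuming the
# tifffile package and the 2016-06 OME schema namespace.
import xml.etree.ElementTree as ET

import tifffile

OME_NS = "{http://www.openmicroscopy.org/Schemas/OME/2016-06}"


def extract_ome_tiff_metadata(path: str) -> dict:
    with tifffile.TiffFile(path) as tif:
        ome_xml = tif.ome_metadata  # OME-XML string, or None for plain TIFFs
    if ome_xml is None:
        return {}
    pixels = ET.fromstring(ome_xml).find(f"{OME_NS}Image/{OME_NS}Pixels")
    if pixels is None:
        return {}
    return {
        "imageSizeX": pixels.get("SizeX"),
        "imageSizeY": pixels.get("SizeY"),
        "channelCount": pixels.get("SizeC"),
        "physicalSizeX": pixels.get("PhysicalSizeX"),
    }
```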
Extracting metadata from auxiliary files (e.g., sample sheets, quality control reports) generated by instruments and various R/Python packages is more difficult, since the file formats and data structures are not necessarily consistent between implementations, but I think this is where we can make the most gains.
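As an example of the auxiliary-file case, the sketch below parses an Illumina-style sample sheet with bracketed [Header]/[Data] sections. The section and column conventions are common but not universal, and real sheets vary between instruments and software versions, which is exactly why these parsers need to stay configurable.

```python
# Sketch of a parser for an Illumina-style sample sheet; returns the [Header]
# key/value pairs and the [Data] table as metadata-ready dicts.
import csv


def parse_sample_sheet(path: str) -> dict:
    header: dict = {}
    samples: list[dict] = []
    data_columns: list[str] = []
    section = None
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row or not any(cell.strip() for cell in row):
                continue
            if row[0].startswith("["):
                # Section marker line, e.g. "[Header]" or "[Data]".
                section = row[0].strip("[]").lower()
                data_columns = []
                continue
            if section == "header" and len(row) >= 2:
                header[row[0].strip()] = row[1].strip()
            elif section == "data":
                if not data_columns:
                    data_columns = [cell.strip() for cell in row]
                else:
                    samples.append(dict(zip(data_columns, row)))
    return {"header": header, "samples": samples}
```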
I wanted to note that we should also integrate data structure and content validation, potentially via DCQC. Some areas where this applies (sketched after the list):
verifying file formats/structures are consistent with expectations
verifying QC metrics are within range/expectations
verifying that all file linkages are defined in metadata, where applicable
verifying that identifiers are consistent across different metadata/file types
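The sketch below illustrates the kinds of checks listed above as plain functions; it does not show the DCQC API itself, and the threshold values, linkage keys, and identifier columns would come from the data model.

```python
# Rough sketch of the validation checks listed above, written as plain
# functions rather than DCQC suites; all keys and thresholds are placeholders.
def qc_metric_in_range(value: float, low: float, high: float) -> bool:
    """Verify a QC metric is within the expected range."""
    return low <= value <= high


def missing_linkages(record: dict, linkage_keys: list[str]) -> list[str]:
    """Return linkage fields that are missing or empty in a metadata record."""
    return [key for key in linkage_keys if not record.get(key)]


def inconsistent_identifiers(records_a: list[dict], records_b: list[dict], key: str) -> set:
    """Return identifiers present in one metadata table but not the other."""
    ids_a = {record.get(key) for record in records_a}
    ids_b = {record.get(key) for record in records_b}
    return ids_a ^ ids_b
```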
In some cases, it may be necessary to include metadata components that are strictly associated with validation. I think this is a reasonable use case, but we should ensure that QC-related metrics are easy to obtain before implementing.