Suggest moving metadata into the header #11

Open
delocalizer opened this issue Nov 2, 2022 · 3 comments

Comments

@delocalizer

delocalizer commented Nov 2, 2022

Having two separate files seems like a recipe for confusion and error.

  1. The metadata isn't like an index file or md5sum that can easily be regenerated if lost; it's inextricably linked to the actual data because it contains fields like pvalueIsNegLog10, sortedByGenomicLocation and, vitally, genomeAssembly. You can't safely update one file without the other.
  2. The dataFileMd5sum guarantees integrity in one direction only: it detects a changed data file, not changed metadata. If I update the metadata file to refer to a new reference but forget to lift over the coordinates in the summary file, no error will be raised.
  3. If you already have the checksum, having dataFileName in the metadata is redundant, fragile, and cumbersome. People rename files all the time. It would be unexpected behaviour to most users that if they rename the summary file on the filesystem they need to update an internal field value in the metadata file.
  4. Tools to process two-part formats are more complicated to write and without very explicit guidance from the spec, authors make their own assumptions ("the metadata is always present adjacent to the summary", "the metadata has the same basename as the summary" etc). This leads to a messy and confusing ecosystem.
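
Points 2 and 3 can be illustrated with a minimal Python sketch. It assumes the metadata has already been loaded into a plain dict; the field names dataFileName, dataFileMd5sum and genomeAssembly come from the discussion above, while the helper names, file names and file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_name(directory: Path, metadata: dict) -> Optional[Path]:
    """Locate the data file via dataFileName; breaks when the file is renamed."""
    candidate = directory / metadata["dataFileName"]
    return candidate if candidate.is_file() else None


def find_by_checksum(directory: Path, metadata: dict) -> Optional[Path]:
    """Locate the data file via dataFileMd5sum; independent of the file name."""
    for candidate in sorted(directory.iterdir()):
        if candidate.is_file() and md5_of(candidate) == metadata["dataFileMd5sum"]:
            return candidate
    return None


def demo() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        data = directory / "sumstats.tsv"  # hypothetical file name
        data.write_text("chromosome\tbase_pair_location\tp_value\n1\t12345\t0.01\n")
        metadata = {
            "dataFileName": "sumstats.tsv",
            "dataFileMd5sum": md5_of(data),
            "genomeAssembly": "GRCh38",
        }

        # Point 3: a rename on the filesystem breaks the name-based lookup.
        data.rename(directory / "my_study.tsv")
        print("by name:    ", find_by_name(directory, metadata))
        print("by checksum:", find_by_checksum(directory, metadata))

        # Point 2: editing the metadata alone is not caught by the checksum.
        metadata["genomeAssembly"] = "GRCh37"
        print("still matches:", find_by_checksum(directory, metadata) is not None)


if __name__ == "__main__":
    demo()
```

After the rename, only the checksum lookup still finds the file, and the checksum keeps matching even though genomeAssembly no longer describes the coordinates in the data.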

I'm reluctant to argue by anecdote but "everyone I know" hates two-part file formats, and from my experience, for good reason.

@ljwh2
Collaborator

ljwh2 commented Mar 20, 2023

Thanks for the detailed and clear comments.

The current state of the field is that many summary statistics files either lack key information (particularly effect allele and EAF), which hinders downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and to encourage data sharing. We believe that, at this point in time, the community will benefit from a definition of these data fields that can be applied to the simple tsv format described here, to GWAS-VCF, or to any other file format. We are updating the manuscript to focus on the data content and make this clearer.

It’s clear that including metadata in the header is an optimal choice for data integrity. With respect to the GWAS Catalog, we believe that the risks in separating the data and metadata are already limited by sharing data via a FAIR resource (i.e. fully accessioned and controlled with respect to update). We heard in our working groups that it could be a big stretch for some users to use a format with metadata in the header, presenting an additional overhead and barrier to sharing and/or use of the data, which would be counterproductive. Therefore we don’t feel it’s appropriate to commit resources to change our ingest pipelines to adopt such a file format at the current time. However we will continue to monitor the situation as the field evolves and more tooling becomes available.
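
For readers weighing the overhead of "metadata in the header", here is a minimal Python sketch of reading such a file. The layout is purely hypothetical (lines starting with `#` hold `key: value` metadata, followed by an ordinary tab-separated table); it is not defined by GWAS-SSF or GWAS-VCF, and the column names and function name are illustrative only.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_with_header_metadata(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split a single-file payload into (metadata, rows).

    Assumed, hypothetical layout: '#key: value' lines, then a TSV table.
    """
    metadata: Dict[str, str] = {}
    table_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Everything before the first ':' is the key, the rest is the value.
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            table_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    return metadata, list(reader)


example = (
    "#genomeAssembly: GRCh38\n"
    "#pvalueIsNegLog10: false\n"
    "chromosome\tbase_pair_location\tp_value\n"
    "1\t12345\t0.01\n"
)

metadata, rows = read_with_header_metadata(example)
print(metadata["genomeAssembly"], rows[0]["p_value"])
```

Because the metadata travels inside the same file, renaming or copying the file cannot separate the two.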

@delocalizer
Author

delocalizer commented Mar 21, 2023

Fair enough; you know your userbase.

It still worries me that there is no explicit reference genome info in the main data file. Perhaps as a compromise, allow and encourage the "chromosome" column to be a RefSeq accession instead of just a number. That way you implicitly but unambiguously specify the species and reference build, e.g. NC_000001.11 for human chromosome 1, GRCh38 patch 14.
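
A rough Python sketch of how a tool might use that: it checks whether the "chromosome" value looks like a RefSeq chromosome accession and, if so, looks up the build. The lookup table here is illustrative and holds only the two human chromosome 1 accessions (NC_000001.10 for GRCh37, NC_000001.11 for GRCh38); a real tool would need a complete table from an authoritative source, and the function and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    species: str
    assembly: str
    chromosome: str


# Illustrative subset only: human chromosome 1 in two builds.
KNOWN_ACCESSIONS = {
    "NC_000001.10": Reference("Homo sapiens", "GRCh37", "1"),
    "NC_000001.11": Reference("Homo sapiens", "GRCh38", "1"),
}

# RefSeq chromosome accessions look like NC_ + six digits + '.' + version.
ACCESSION_PATTERN = re.compile(r"^NC_\d{6}\.\d+$")


def infer_reference(chromosome_value: str) -> Optional[Reference]:
    """Return the reference implied by a 'chromosome' value, if any.

    A bare number such as '1' carries no build information, so None is
    returned; the same applies to accessions missing from the table.
    """
    if not ACCESSION_PATTERN.match(chromosome_value):
        return None
    return KNOWN_ACCESSIONS.get(chromosome_value)


for value in ("1", "NC_000001.10", "NC_000001.11"):
    print(value, "->", infer_reference(value))
```

The bare "1" yields nothing, whereas the two accessions resolve to different builds of the same chromosome, which is the ambiguity the suggestion is meant to remove.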

@marcora

marcora commented Mar 21, 2023

Why not have something like GWAS-VCF for submission and storage and have Excel (with one tab for associations and one tab for metadata) as an additional download option from the website? I assume that users who have difficulty using a format with metadata in the header are those who use Excel as their main "bioinformatics" tool and won't be happy with YAML either.

Whatever GWAS Catalog picks as a format is going to become the de facto standard for bioinformaticians and tool developers and, in my humble opinion, picking it based on the needs/skills of "Excel bioinformaticians" is neither the best approach nor a solution to the problems that have afflicted the field so far.


3 participants