Trim MAF columns #612

Open
kpjonsson opened this issue Aug 29, 2019 · 2 comments
Labels
backburner probably won't address in a near future enhancement New feature or request

Comments

@kpjonsson
Member

Per discussion between @md and me and feedback from @cband:

Currently, the final somatic MAF contains 279 columns. Not all of these are necessary, and several could be omitted or collapsed into single columns to reduce file size and make the important columns easier to find. This can be done inside the pipeline (https://github.com/mskcc/vaporware/blob/develop/containers/vcf2maf/filter-somatic-maf.R and the corresponding germline filter script). Some columns that VEP/vcf2maf output by default are pretty much useless.

Here are some suggested changes:

  • Evaluate which default columns carry no meaning; see, for example, Sequence_Source, Validation_Method, Score, Tumor_Sample_UUID, Matched_Norm_Sample_UUID.
  • Prune the gnomAD/ExAC columns added by VEP/vcf2maf since we're doing this annotation ourselves in the pipeline.
  • Related to the above, columns 141-173 in the current iteration all come from the pipeline annotation with gnomAD allele frequencies and counts. This annotation already has columns for the individual subpopulation maximum (non_cancer_AF_popmax/non_cancer_AC_popmax) as well as for the overall non-cancer population (non_cancer_AF/non_cancer_AC).
  • The variant-caller specific metadata (see https://github.com/mskcc/vaporware/blob/develop/docs/variant-annotation-and-filtering.md) can be collapsed into a single column per caller.
  • Similarly, the raw counts can be collapsed into single comma-, colon-, or semicolon-separated columns.
  • Facets clonality annotation can be collapsed into fewer columns.
  • Possibly true for the neoantigen prediction annotation too, although I'm not too familiar with it.
  • Some columns that are added by the hotspot annotation can be removed.
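
As a rough sketch of what the drop-and-collapse steps could look like: the snippet below is Python with only the standard library, for illustration, whereas the actual filter script in the pipeline is R. The list of dropped columns comes from the first bullet above. The `t_counts` grouping, the choice of `t_depth`/`t_ref_count`/`t_alt_count` as "raw counts", the `:` separator, and the `trim_maf` helper are all assumptions made for the example, not what the pipeline does.

```python
import csv
import io

# Columns named above as candidates that carry no meaning.
DROP_COLUMNS = {
    "Sequence_Source",
    "Validation_Method",
    "Score",
    "Tumor_Sample_UUID",
    "Matched_Norm_Sample_UUID",
}

# Hypothetical grouping: merge these count columns into one new column.
# The columns the pipeline would actually collapse may differ.
COLLAPSE = {
    "t_counts": ["t_depth", "t_ref_count", "t_alt_count"],
}


def trim_maf(rows, fieldnames, drop=DROP_COLUMNS, collapse=COLLAPSE, sep=":"):
    """Return (new_fieldnames, new_rows) with columns dropped and collapsed."""
    merged = {col for cols in collapse.values() for col in cols}
    kept = [f for f in fieldnames if f not in drop and f not in merged]
    new_fields = kept + list(collapse)
    new_rows = []
    for row in rows:
        out = {f: row[f] for f in kept}
        for name, cols in collapse.items():
            # Missing source columns become empty fields rather than errors.
            out[name] = sep.join(row.get(col, "") for col in cols)
        new_rows.append(out)
    return new_fields, new_rows


# Tiny in-memory example (made-up data, not real pipeline output).
maf_text = (
    "Hugo_Symbol\tScore\tt_depth\tt_ref_count\tt_alt_count\n"
    "TP53\t.\t100\t60\t40\n"
)
reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
fields, rows = trim_maf(list(reader), reader.fieldnames)
```

Collapsing this way trades column count for an extra parsing step downstream, so the separator would need to be documented alongside the MAF.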

Keep in mind:

  • The "official" MAF file spec sheet (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format) has changed after the MC3 initiative. It's not necessary, in my opinion, to keep all of these columns. There is no single valid MAF format.
  • The MAF files look different in the somatic and germline settings (hotspot and OncoKB annotation, plus the gnomAD population annotation has a slightly different meaning in this context). Not all of my suggestions above are equally applicable to both.
  • Similarly, they look different for exomes vs. genomes (only in the gnomAD annotation).
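
Since the rules would differ by context, one option is to keep a separate set of drop patterns per setting. The following is a minimal Python sketch under stated assumptions: the `ExAC_*`, `gnomAD_*`, and `non_cancer_*` patterns are guesses at how the columns are named, the somatic/germline split shown is purely illustrative, and `select_columns` is a hypothetical helper. Only the four always-kept column names are taken from the suggestions above.

```python
from fnmatch import fnmatchcase

# Summary columns named above; kept even if a drop pattern matches them.
ALWAYS_KEEP = {
    "non_cancer_AF_popmax",
    "non_cancer_AC_popmax",
    "non_cancer_AF",
    "non_cancer_AC",
}

# Illustrative only: which patterns apply in which setting is an assumption.
DROP_PATTERNS = {
    "somatic": ["ExAC_*", "gnomAD_*", "non_cancer_*"],
    "germline": ["ExAC_*", "gnomAD_*"],
}


def select_columns(fieldnames, context):
    """Return the columns to keep for the given context, in original order."""
    patterns = DROP_PATTERNS[context]
    return [
        f
        for f in fieldnames
        if f in ALWAYS_KEEP or not any(fnmatchcase(f, p) for p in patterns)
    ]


# Made-up header for demonstration.
cols = [
    "Hugo_Symbol",
    "ExAC_AF",
    "gnomAD_AFR_AF",
    "non_cancer_AF_popmax",
    "non_cancer_AC_afr",
    "non_cancer_AF",
]
somatic_cols = select_columns(cols, "somatic")
germline_cols = select_columns(cols, "germline")
```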
@kpjonsson added the "help wanted" and "postRelease" labels on Aug 29, 2019
@evanbiederstedt
Contributor

I view this as something CCS will always be fiddling with, which is great and good for science.

We simply version the pipeline, and run it based on improved versions.

@gongyixiao added the "backburner" and "enhancement" labels and removed the "help wanted" label on Apr 24, 2020
@gongyixiao
Collaborator

#928


5 participants