Full dataset for 32 TCGA cancers from Oncogene 2024 new pipeline #3
Hi @hermidalc, apologies for the delayed reply. I'm happy to help provide you with files but first have some comments:
To summarize, the tables we currently recommend would (i) be raw or ConQuR-corrected, (ii) be separate for WGS and RNA-Seq, (iii) comprise human-associated taxa with high coverage, and (iv) derive from direct alignments against RS210-clean. Table S13 of the paper provides the raw (WGS and RNA-Seq) abundances of RS210-clean taxa with ≥50% aggregate coverage (note: Table S6 contains the taxonomic names for the genome IDs in Table S13). However, I can find and share the ConQuR-corrected WGS table here if that would be helpful. Is that what you would prefer?
Edit: There are creative ways to get ConQuR to correct for >1 batch variable at a time, such as concatenating multiple factors into a single vector in R, but we did not publish these data (in part because we had limited time to reassure ourselves that the correction acted appropriately). In theory, doing this would help create the kind of single abundance table you're asking about. If this is something you want, I can point you in the right direction.
Thanks very much, Greg, for the detailed breakdown and explanation. After giving it some thought, I believe the last option would be ideal: a joint ConQuR abundance table for all the samples. I would definitely appreciate any advice or help in creating this matrix by, as you said, batch-correcting for >1 factor with ConQuR.
Hi @hermidalc, sure, I'm happy to help with that. (Apologies again for the delay.) The basic idea is to concatenate the individual batch variables into one string per sample, followed by factorizing the options. ConQuR can then accept those factors as pseudo-batches to correct (assuming sufficient representation among the pseudo-batches); note that your reference batch will have to reflect one of the concatenated batch names in the ConQuR function. For example, to simultaneously correct over sequencing center, sequencing platform, and experimental strategy in TCGA:
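As a rough illustration of the label-building step only (this is not the code from the original reply, and it does not call ConQuR, which is an R package), a minimal Python sketch could look like the following. The field names `center`, `platform`, and `strategy`, the toy sample records, and the `MIN_SAMPLES` threshold are all assumptions made for the example. In R, the same labels could be built with `paste()` and `factor()`.

```python
from collections import Counter

# Hypothetical per-sample metadata; field names and values are illustrative only.
samples = [
    {"sample_id": "s1", "center": "CenterA", "platform": "HiSeq", "strategy": "WGS"},
    {"sample_id": "s2", "center": "CenterA", "platform": "HiSeq", "strategy": "RNA-Seq"},
    {"sample_id": "s3", "center": "CenterB", "platform": "HiSeq", "strategy": "WGS"},
    {"sample_id": "s4", "center": "CenterA", "platform": "HiSeq", "strategy": "WGS"},
]

BATCH_FIELDS = ("center", "platform", "strategy")


def pseudo_batch(sample, fields=BATCH_FIELDS, sep="_"):
    """Concatenate the individual batch variables into one label per sample."""
    return sep.join(sample[field] for field in fields)


labels = [pseudo_batch(s) for s in samples]

# Factorize: map each distinct label to an integer code (sorted for reproducibility).
levels = sorted(set(labels))
codes = [levels.index(label) for label in labels]

# Check representation: sparsely populated pseudo-batches may be hard to correct.
counts = Counter(labels)
MIN_SAMPLES = 2  # assumed threshold for this sketch, not a ConQuR requirement
sparse = [label for label, n in counts.items() if n < MIN_SAMPLES]

# The reference batch has to be one of the concatenated labels;
# here the most populated pseudo-batch is picked as one possible choice.
reference = counts.most_common(1)[0][0]
```

Inspecting `counts` and `sparse` before running the correction is one way to check the "sufficient representation" caveat mentioned above.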
The above could take a long time to run across thousands of samples without multiplexing across cores (n=32 above), and the vanilla ConQuR version proposed by its authors (https://github.com/wdl2459/ConQuR) did not implement the [...]
I hope that helps! Let me know if you have any other questions.
Hi Greg, I was looking through the RS210 data (Table S13) from your paper. There are 1189 OGUs in Table S13; however, taxonomic information was found for only 823 of these in Table S6. I just wanted to check: how do I get the taxonomic information for the other 366 OGUs? Hope you can help. Thanks,
Hi @kennyyeo13, happy to help and apologies for any confusion. I'm attaching here the full list of OGUs and their respective taxonomic lineages for RS210 (total # is 29,648 OGUs). I have limited time the rest of this upcoming week but hope this addresses your question.
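For anyone repeating this check, a minimal sketch of the lookup might be as follows. It assumes the OGU IDs from the abundance table and the lineage list have already been loaded into a Python list and dict; the IDs and lineage strings below are made up for illustration, and `split_by_taxonomy` is a hypothetical helper name.

```python
def split_by_taxonomy(ogu_ids, lineage_by_ogu):
    """Return (annotated, missing): lineages for known OGUs, and IDs with no lineage."""
    annotated = {ogu: lineage_by_ogu[ogu] for ogu in ogu_ids if ogu in lineage_by_ogu}
    missing = [ogu for ogu in ogu_ids if ogu not in lineage_by_ogu]
    return annotated, missing


# Toy inputs (hypothetical IDs and lineages, not taken from the paper's tables).
abundance_table_ogus = ["G001", "G002", "G003"]
partial_lineages = {"G001": "Bacteria;Firmicutes", "G003": "Bacteria;Proteobacteria"}
full_lineages = dict(partial_lineages, G002="Bacteria;Actinobacteria")

annotated, missing = split_by_taxonomy(abundance_table_ogus, partial_lineages)
print(f"{len(annotated)} annotated, {len(missing)} missing: {missing}")

# Re-running against the complete lineage list should leave nothing missing.
_, still_missing = split_by_taxonomy(abundance_table_ogus, full_lineages)
```

Keeping the missing IDs as a list preserves the order of the abundance table, which makes it easier to spot-check them by hand.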
Dear Greg - sorry to bother again, could you provide the ConQuR-corrected WGS data matrix you referred to above? Thanks very much!
@hermidalc ConQuR-corrected WGS data matrix: rs210PanFinal5050_Nonzero_HiSeq_WGS_CQ_15Oct23.csv Please let me know if not. Additionally, the code showing how this correction was performed is here.
Thanks Greg, sorry if I'm missing it somewhere, but where do I find how the sample IDs map to GDC UUIDs or your older "s" sample identifiers?
@hermidalc Quick reply for now: you should be able to use Table S14 of the Sepich-Poore et al. 2024 Oncogene manuscript (it will be the WGS subset of that metadata).
Sorry, I now see that the raw counts are likely Table S13 in the Oncogene supplement? That definitely looks right, as it's not the same as the CQ-corrected counts.
Correct, Table S13 of the Oncogene supplement contains the raw data. Just note those are the human-associated hits that had >= 50% cumulative genome coverage.
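To make the ">= 50% cumulative genome coverage" criterion concrete, here is a small sketch of one way such a filter could be computed. It is not the paper's pipeline. It assumes that, for each genome, the aligned regions pooled across samples are available as half-open `(start, end)` intervals and that genome lengths are known; all names and numbers are illustrative.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of a genome covered by the union of half-open [start, end) intervals."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close out the previous merged run.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def high_coverage_genomes(intervals_by_genome, length_by_genome, threshold=0.5):
    """Return genome IDs whose pooled coverage fraction meets the threshold."""
    return sorted(
        genome
        for genome, intervals in intervals_by_genome.items()
        if covered_fraction(intervals, length_by_genome[genome]) >= threshold
    )


# Toy example with hypothetical genome IDs and 100 bp genomes.
pooled = {"G001": [(0, 30), (20, 60)], "G002": [(0, 10), (5, 20)]}
lengths = {"G001": 100, "G002": 100}
kept = high_coverage_genomes(pooled, lengths)
```

Merging intervals before summing avoids double-counting positions hit by reads from several samples, which is the point of using cumulative rather than per-sample coverage.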
Sorry for all the bother; as you can imagine, we are very closely examining, comparing, and vetting data sources. We built our own pipeline (https://github.com/hermidalc/tcga-wgs-kraken-microbial-quant) that is well suited to the specific needs of our Nat Commun study. The biggest difference in our pipeline is using Kraken2 + Bracken instead of KrakenUniq, since we are using the output data to do microbiome analysis between samples/groups and want more accurate microbial abundance estimates in each sample. I don't know if you have any feedback on that given all your experience. I compared our pipeline's WGS raw counts against Ge et al. bioRxiv 2024 (the Salzberg lab continuation of Gihawi et al. https://github.com/yge15/TCGA_Microbial_Content) and we get similar data, coverage, and sparsity when looking at the overlapping samples at both species and genus levels. I would like to do a comparison against the raw data from your Oncogene 2024 paper. I think it would be your KrakenUniq RS210 raw counts that went through the additional host filtering steps? Is that available somewhere?
Hi @hermidalc, some quick comments:
Note that there are issues with the above, as the cutoffs vary directly with the sequencing depth of a sample (see Fig. 4 in that paper). Also, for reasons unknown to me, neither Gihawi et al. nor Ge et al. apply this filter (which is what made KrakenUniq useful) to their data, or at least they make no mention of which parameters they chose. The closest mention I've found is the following statement on their GitHub:
Thank you for the detailed and very thoughtful responses! FYI, they added the KrakenUniq algorithm to Kraken2 some time ago and it can be run alongside Kraken2, i.e. "Kraken2Uniq", with the option [...]
Your commentary on why to use k-mer-based approaches over direct metagenomic alignments when there aren't many reads to align makes a good point, definitely something worth looking into further.
Dear Greg - thanks for your continued work on this and for this follow-up paper with the updated pipeline. I'm the lead author of Hermida et al. Nat Commun 2022, which used your original Poore et al. Nature 2020 dataset, specifically the "Kraken-TCGA-Voom-SNM-Plate-Center-Filtering-Data.csv" and "Metadata-TCGA-Kraken-17625-Samples.csv" from ftp://ftp.microbio.me/pub/cancer_microbiome_analysis.
Do you have the microbial abundances from this updated Oncogene 2024 pipeline for all 32 TCGA cancers in a format similar to the two files mentioned above?