Information about Supplementary_Table2_datasetsQC.xlsx #34

Open
smanfri opened this issue Jun 29, 2023 · 2 comments
Comments

smanfri commented Jun 29, 2023

Good morning,

I'm a Computer Science student at Università degli Studi di Milano, and for my thesis I am assessing several pipelines for the analysis of SARS-CoV-2 samples.
To select the best pipeline for our requirements, I'm using the benchmark datasets available here.
I found Supplementary_table2 in your paper (Xiaoli L, Hagey JV, Park DJ, Gulvik CA, Young EL, Alikhan N-F, Lawsin A, Hassell N, Knipe K, Oakeson KF, Retchless AC, Shakya M, Lo C-C, Chain P, Page AJ, Metcalf BJ, Su M, Rowell J, Vidyaprakash E, Paden CR, Huang AD, Roellig D, Patel K, Winglee K, Weigand MR, Katz LS. 2022. Benchmark datasets for SARS-CoV-2 surveillance bioinformatics. PeerJ 10:e13821 http://doi.org/10.7717/peerj.13821), and I would also like to use the data it contains for my evaluations (not only the in.tsv file available for every dataset).
I'm writing here because I can't work out how the 'Total reads' column is calculated. I computed this value with FastQC (the 'Total Sequences' field) and I also counted the reads in the original FASTQ files, but neither number matches the ones published in Supplementary_table2.
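
For anyone who wants to reproduce the direct read count mentioned above, here is a minimal sketch in Python. It assumes standard four-line FASTQ records (no wrapped sequence lines and no trailing blank lines) and handles both plain and gzip-compressed files. The function name `count_fastq_reads` is hypothetical and is not part of the benchmark repository or of FastQC.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly 4 lines per record.

    Assumption: the file is a standard, unwrapped FASTQ. Files ending in
    ".gz" are opened with gzip; everything else is read as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated
        # file or a format that breaks the 4-line assumption.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For paired-end data, calling the function on each mate file separately makes it easy to see whether a published total matches one file or the sum of both.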

Do you know why the numbers are different? Is it possible that Supplementary_table2 is outdated with respect to the current version of the dataset?
If so, which version of the dataset corresponds to Supplementary_table2 and was used in your paper?

Thank you very much for your time :)

Best regards,
Sara Manfredi

lskatz (Collaborator) commented Jul 19, 2023

Hi, thank you for identifying this discrepancy. Although I can't promise to fix this right now, it would help if you could post some of the values you are finding in FastQC alongside what you are seeing in the supplementary table. Thank you for your help.

smanfri (Author) commented Jul 21, 2023

Hi,
thank you for the response.
In the attached file, I compare the total reads reported in Supplementary table 2 with the values found by FastQC (version 0.11.8).
Note that:

  • The file contains the results for the Illumina sequences of the “CoronaHiT-rapid” benchmark.
  • Even when we take the values FastQC reports for the raw samples (before trimming), the values in Supplementary table 2 are always larger.
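
A comparison like the one in the attached file could be scripted roughly as follows. This is only a sketch: it assumes the text of FastQC's `fastqc_data.txt` report is available, where the "Total Sequences" value appears as a tab-separated line, and it assumes the 'Total reads' value from the supplementary table is entered by hand. The function names `parse_total_sequences` and `compare_counts` are hypothetical.

```python
def parse_total_sequences(fastqc_data_text):
    """Return the 'Total Sequences' value from the text of fastqc_data.txt."""
    for line in fastqc_data_text.splitlines():
        if line.startswith("Total Sequences"):
            # The line has the form "Total Sequences<TAB><count>".
            return int(line.split("\t")[1])
    raise ValueError("no 'Total Sequences' line found")


def compare_counts(table_total, fastqc_total):
    """Report the difference and ratio between the two read counts."""
    ratio = table_total / fastqc_total if fastqc_total else None
    return {"difference": table_total - fastqc_total, "ratio": ratio}
```

Looking at the ratio per sample, and not only the difference, may help narrow down the cause. A ratio close to 2.0, for instance, would be consistent with the table counting both mates of a pair while FastQC reports each file separately. That is only one possibility to check, not something established in this thread.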

Thank you for your attention,
Sara
Total-reads_Supplementary-table2_VS_FastQC.xlsx
