Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to join the output for a whole genome analysis #64

Open
pabloangulo7 opened this issue Sep 26, 2022 · 5 comments
Open

How to join the output for a whole genome analysis #64

pabloangulo7 opened this issue Sep 26, 2022 · 5 comments

Comments

@pabloangulo7
Copy link

Hello,

I am using SNPGenie for a Genome-wide SNPs dataset of 70 individuals of a haploid organism consisting of 14 chromosomes, using as input a VCFfile. For the requirements of the tool, I have split the fasta, gtf and vcf files into 14 parts, one per chromosome and applied the command. Also, this process has to be repeated again for the minus strand (-) genes. My question is whether the output for the minus strand can be joined directly to the output of the plus strand and finally, once the whole process is completed, how can all the data be put together to calculate the dn/ds ratio of the whole genome in the population.

Best regards and thanks in advance

Pablo

@singing-scientist
Copy link
Contributor

Greetings, Pablo! Thanks very much for the question. That all sounds correct; the requirement to run twice (once for each strand) is a bit of a hack, unfortunately necessary because SNPGenie was originally written for one strand only. The genes of each strand can simply be combined as you say -- but be aware that, if there are any overlapping antisense genes (i.e., genes in reverse complement reading frames that overlap the same genome positions), then some genome positions/regions will be represented more than once. Also, some of the individual SITES results will be redundant. Thus, the answer to how the data should be combined depends in part on how you want to treat overlapping genes from separate strands (if they exist). Certainly for non-protein-coding sites, you'll only want to include them once, and results will be redundant between the strands.

For the whole (coding) genome, the question is whether you want the mean value for all sites (codon is the unit) or of all genes (gene is the unit). For the first, you could simply sum numerators and denominators, i.e., mean_dN = sum(N_diffs)/sum(N_sites) and mean_dS = sum(S_diffs)/sum(S_sites). This would reflect the "typical protein-coding site". Contrarily, if you're looking for an average of all genes, you could simply take the means of genes, e.g., mean_dN = mean(vector of all genes' dN). This would reflect the "typical gene". Let me know if that makes sense!

Yours,
Chase

@pabloangulo7
Copy link
Author

pabloangulo7 commented Oct 10, 2022

Thank you very much, sorry for the late reply but my data is very large and it took me a long time to finish the analysis. I already have the results for each of the chromosomes and for the minus and plus strand. Then following the instructions you gave me in the previous answer, I was about to calculate the dN/dS ratio for each chromosome and for the whole genome, but to make sure I do it correctly, to do the calculation, for example, for all the sites, I would have to use the file called "codon_results.txt" and use the columns of N_diffs, S_diffs, N_Sites and S_Sites, right? And to do the calculation by genes, would it be the same but with the file "product_results.txt"?

Thanks in advance

@singing-scientist
Copy link
Contributor

Hello! Both codon_results and product_results should yield the same results, i.e., product_results are just the sum of codon_results, for each gene. The question is whether you want to add all the N_diffs, S_diffs, N_Sites and S_Sites together, then calculate dN/dS (mean of all coding sites); or first calculate dN/dS for each product, and take the mean of those values (mean of all genes). Let me know if that helps!

C

@pabloangulo7
Copy link
Author

Thank you very much, I think I understand it better now. On the other hand, I was reviewing the results and there is one thing that doesn't add up. In one of the "site_results.txt" files, one of the SNPs is missing, which for some reason the program has filtered out, but I don't quite know why. In the temporary file of the sample (generated by SNPGenie) "temp_snp_report_VY02.txt" this variant does appear. So, there would be a total of 37 SNPs although only 36 SNPs are analyzed in the results. I attach the files, and specifically it is the variant at site 1580099 (line 27 in the file). Let me know if you need more info.

Thanks in advance
temp_snp_report_VY02.txt
site_results.txt

@singing-scientist
Copy link
Contributor

Greetings, Pablo! Indeed I see what you mean and confirm the SNP at 1580099 is present in temp but not the sites file. It is not really possible for me to troubleshoot this without the input; if you'd like to supply the 3 files and command that give rise to this result I'd be happy to check it out and determine what's happening.

Regarding dN or dS equal to 0, ideally one would perform a statistical test to determine if dN=dS can be rejected with significance. Normally this is done with bootstrapping, but when dS=0 there's no point. Thus as a rule of thumb, I generally consider that if dS (which reflects mutation rate) is 0, there's not enough variation to have statistical power to detect a difference between dN and dS. I'd report the ratio as undefined (certainly don't infer dN>dS or positive selection usually). If dN is 0 that's a different story (again because dS reflects mutation rate) and one can infer strong purifying selection given a proper statistical test. Let me know if that helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants