Use data.table package to parse vcf #31
Comments
I've found that the gap between genotype data as it's stored and exchanged (e.g.
@CreRecombinase yes, I totally agree it is better to just have "documentation/examples of converting files to data matrices". However, these examples should show people how to convert file formats within R, not like what I did before by using `vcftools`. I think having a file-format-converting wrapper is conceptually ideal, but practically challenging -- vcf is just one commonly used file format for genotype/haplotype data -- if we write a wrapper for vcf, we will probably be asked to provide wrappers for other formats like bgen (UKBB format, https://www.well.ox.ac.uk/~gav/bgen_format/index.html).
@CreRecombinase one thing related to this thread: what other genotype file formats have you worked with so far? I wonder if you could list them here as follows:
I think "Phase" in "Phase 1" and "Phase 3" refer to phases of the 1000 genomes project. I would consider both of those "VCF" (although the version of the VCF standard you might come across for Phase 1 data might be different than that for Phase 3 data). There are two other formats that I come across with any real frequency:
After those two I'd put the formats that are what you might call "intermediate maturity":
After that it's a pretty long tail. To name a few:
@CreRecombinase Thanks for sharing this! Now I agree more that we had better stick to an "n x p" genotype data matrix as the input.
Currently `ldshrink` assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from statisticians' perspective. However, public genotype/haplotype data from 1000 Genomes are stored in vcf format. In the past I first used `vcftools` to convert vcf data to `IMPUTE2` format (which is indeed a p-by-n matrix), and then transposed the `IMPUTE2`-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh. This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like `vcftools` before doing any LD-related operations in `ldshrink`.

It seems that now we can use `data.table` (https://cran.r-project.org/web/packages/data.table) to directly convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b. Here is my own example.
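A minimal sketch of the idea, assuming a bi-allelic VCF whose genotype fields start with GT; the file name `chr22.vcf.gz` and the helper `gt_to_dosage` are placeholders:

```r
library(data.table)

## fread can search for the "#CHROM" line and use it as the header row;
## reading a .vcf.gz directly requires the R.utils package to be installed.
vcf <- fread("chr22.vcf.gz", skip = "#CHROM", header = TRUE)

## The first 9 columns are the fixed VCF fields; the rest are one column per sample.
sample_cols <- names(vcf)[-(1:9)]
snp_ids     <- vcf$ID

## Convert GT strings ("0|1", "1/1", ...) into ALT-allele counts {0, 1, 2}.
gt_to_dosage <- function(gt) {
  gt <- sub(":.*", "", gt)          # keep only the GT subfield
  alleles <- tstrsplit(gt, "[|/]")
  as.integer(alleles[[1]]) + as.integer(alleles[[2]])
}

## p-by-n table of dosages (rows = variants, columns = samples) ...
dosage <- vcf[, lapply(.SD, gt_to_dosage), .SDcols = sample_cols]

## ... transposed into the n-by-p numeric matrix that ldshrink expects.
X <- t(as.matrix(dosage))
colnames(X) <- snp_ids
```

(Missing genotypes, haploid calls, and multi-allelic sites would need extra handling; this is only meant to show the shape of the conversion.)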
The benefit of using `data.table` here is two-fold: i) users don't have to leave R and use `vcftools` to get the n-by-p genotype matrix from vcf data; ii) `data.table` is a well-maintained and constantly upgraded package that can handle large datasets efficiently (at least based on my past experience).

Hence, we can either add a wrapper that uses `data.table` to parse vcf for `ldshrink` users, or, at a minimum, simply provide a vignette showing how to use `data.table` to parse vcf.

Finally, there exists a package `vcfR` (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).
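For reference, a rough, untested sketch of what the `vcfR` route might look like, using its `read.vcfR()` and `extract.gt()` functions (the file name is again a placeholder):

```r
library(vcfR)

vcf <- read.vcfR("chr22.vcf.gz")
gt  <- extract.gt(vcf, element = "GT")   # p-by-n matrix of "0|1"-style strings

## Count ALT alleles per genotype, then transpose to the n-by-p orientation.
dose <- apply(gt, c(1, 2), function(g) sum(as.integer(strsplit(g, "[|/]")[[1]])))
X    <- t(dose)
```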