Pipeline for processing 1000 Genomes Phase 3 data sets

This is a snakemake pipeline for downloading and processing the 1000 Genomes Project (1kGP) Phase 3 data sets for use in genomic analyses. Ideally, you would integrate this into another workflow with the use of snakemake's module statement, although you can use it in its own right, too. The workflow contains rules which will:

download the variant call data in vcf.gz format and verify their checksums
filter samples by ancestry
remove related samples
filter variants to retain SNPs only
filter variants by MAF
filter variants by QC parameters: genotype missingness, sample missingness, HWE test statistic
filter variants to remove strand-ambiguous (A/T and C/G) SNPs
prune variants by LD
remove the MHC
add the centimorgan field to plink-style bfiles (currently only for hg38) for the calculation of LD scores
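
To make a few of the variant filters above concrete, here is a minimal Python sketch. It is not the pipeline's own code, only an illustration of the logic: it keeps SNPs, drops strand-ambiguous allele pairs (A/T and C/G), and applies a minor allele frequency threshold. The `Variant` record, the function names, and the 0.01 threshold are all assumptions made for this example.

```python
from typing import NamedTuple

# Allele pairs that are their own reverse complement, so the strand
# cannot be inferred from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


class Variant(NamedTuple):
    """Hypothetical minimal variant record, used only for this illustration."""
    variant_id: str
    ref: str
    alt: str
    alt_freq: float  # frequency of the ALT allele among the retained samples


def is_snp(v: Variant) -> bool:
    """A SNP has a single-base REF allele and a single-base ALT allele."""
    return all(len(a) == 1 and a.upper() in "ACGT" for a in (v.ref, v.alt))


def is_strand_ambiguous(v: Variant) -> bool:
    """True for A/T and C/G SNPs."""
    return frozenset((v.ref.upper(), v.alt.upper())) in AMBIGUOUS_PAIRS


def maf(v: Variant) -> float:
    """Minor allele frequency: the smaller of the two allele frequencies."""
    return min(v.alt_freq, 1.0 - v.alt_freq)


def keep(v: Variant, maf_threshold: float = 0.01) -> bool:
    """Apply the SNP-only, strand-ambiguity, and MAF filters in turn."""
    return is_snp(v) and not is_strand_ambiguous(v) and maf(v) >= maf_threshold


variants = [
    Variant("rs_a", "A", "G", 0.20),   # kept
    Variant("rs_b", "A", "T", 0.30),   # dropped: strand-ambiguous
    Variant("rs_c", "C", "T", 0.995),  # dropped: MAF is 0.005
    Variant("rs_d", "AT", "A", 0.40),  # dropped: indel, not a SNP
]
kept = [v.variant_id for v in variants if keep(v)]
print(kept)
```

The order of the checks mirrors the order of the list above, but the result would be the same in any order because each filter is independent of the others.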

Of course, there's much more you can do with these data and plink, but I intend this workflow to be a 'stub' of sorts which can be extended in the downstream direction with additional rules after import as a module. Hopefully these rules suffice to get you started.

If you're using the pipeline for the first time, I recommend testing it with chromosome 22 (chr22) which, as the smallest chromosome, should allow a quick test run.

The assembly wildcard allows you to specify either the hg19 or hg38 reference genome. Prefer hg38 where possible: the hg38 data were generated through high-coverage resequencing of the 1kGP samples, and hg38 is itself a more accurate assembly that has been available since 2013. hg19 is ancient, so stop using it if you can! See workflow/rules/wildcard_constraints.smk for the other wildcards, many of which can take only a restricted set of values (like an enum) or must match a regular expression.
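
As a rough illustration of how such constraints behave, the following Python sketch checks candidate wildcard values against regular expressions. Only the hg19/hg38 restriction on `assembly` comes from the text above; the `chr` pattern and the `is_valid` helper are hypothetical, and the real definitions live in workflow/rules/wildcard_constraints.smk. snakemake's own matching is done on whole file paths, so treat this as an analogy rather than its implementation.

```python
import re

# Hypothetical constraints for illustration only; the pipeline's real ones
# are defined in workflow/rules/wildcard_constraints.smk.
CONSTRAINTS = {
    "assembly": r"hg19|hg38",              # enum-like: exactly one of two values
    "chr": r"chr(?:[1-9]|1\d|2[0-2]|X)",   # assumed pattern for chromosome names
}


def is_valid(wildcard: str, value: str) -> bool:
    """Return True if the whole value matches the wildcard's constraint."""
    return re.fullmatch(CONSTRAINTS[wildcard], value) is not None


for name, value in [("assembly", "hg38"), ("assembly", "hg18"),
                    ("chr", "chr22"), ("chr", "chr23")]:
    print(f"{name}={value}: {'ok' if is_valid(name, value) else 'rejected'}")
```

A value that fails its constraint simply never matches a rule's output pattern, which is why a typo in a wildcard usually surfaces as a "missing rule" style error rather than a validation message.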

Dependencies for the software are provided in a Docker image hosted on Docker Hub. snakemake should pull this automatically so there's no need to build the image yourself, although I have included the Dockerfile under the docker directory so you can see what it contains. The image is derived from the rocker/r-ver image.

pandas is required for some run rules and is provided in the conda environment defined in envs/global.yaml. Your profile should list both deployment methods so that the Docker image (run via apptainer) and conda are both used:

software-deployment-method:
  - apptainer
  - conda

At the moment, the conda statement needed to inject this dependency into the snakemake process running the workflow is omitted, because it appears not to play nicely with the module statement I use to import this workflow into other workflows (e.g. GWAS_tools). For now, I'm afraid you'll need to have pandas installed in whichever environment snakemake itself is running in.

I don't define the mem_mb resource for each rule as that number is highly platform-specific, but I do recommend setting it as follows in your profile:

default-resources:
  mem_mb: threads * <RAM per core>

For example, the Intel 'Cascade Lake' CPUs I use on my local cluster provide 3420MB per core, so I set mine to:

default-resources:
  mem_mb: threads * 3420

This allows for more parsimonious rules in the smk files. In my experience, the plink-based rules will generally run faster the more you pump up the thread count.
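
The arithmetic behind this setting is simple, and the sketch below evaluates it for a few thread counts. `mem_mb_for` is a hypothetical helper written for this example, and 3420 is just the per-core figure quoted above for one cluster, so substitute your own platform's value.

```python
RAM_PER_CORE_MB = 3420  # per-core figure quoted above; highly platform-specific


def mem_mb_for(threads: int, ram_per_core_mb: int = RAM_PER_CORE_MB) -> int:
    """Memory request implied by `mem_mb: threads * <RAM per core>`."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return threads * ram_per_core_mb


for threads in (1, 4, 16):
    print(f"{threads:>2} threads -> {mem_mb_for(threads)} MB")
```

Tying memory to the thread count in this way means a rule only has to declare `threads`, and its memory request scales with it.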

Importing this workflow in another `snakemake` workflow using `module`

I wrote this workflow so it could be plugged into others with use of the module statement. I do this like so in the GWAS_tools workflow:

module kGP_pipeline:
    snakefile: github("twillis209/1kGP_pipeline", path = "workflow/Snakefile", commit = 'master')
    config: config['1kGP_pipeline']

The config handling is currently a bit hacky: rather than the module reading its own config file (i.e. the one in 1kGP_pipeline/workflow/config/config.yaml), I have to copy the contents of that file to the config file for GWAS_tools and reference that when overwriting the 1kGP_pipeline config with the config: statement. This doesn't scale very well when importing GWAS_tools into still larger workflows as I like to do, but I'm optimistic I can find a better solution within snakemake.
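
The config arrangement described above can be pictured with plain dictionaries. In this sketch the importing workflow's config carries a copy of the module's settings under the `1kGP_pipeline` key, and that sub-dictionary is what gets handed to the module. The keys inside it (`maf`, `r2`), the `gwas_setting` key, and the `missing_keys` helper are invented for illustration and are not taken from the real config.yaml.

```python
# Stand-in for the parsed config of the importing workflow (e.g. GWAS_tools).
parent_config = {
    "gwas_setting": "example",   # hypothetical parent-level setting
    "1kGP_pipeline": {           # contents copied from the module's own config.yaml
        "maf": 0.01,             # hypothetical key
        "r2": 0.2,               # hypothetical key
    },
}

# Plain-Python equivalent of `config: config['1kGP_pipeline']` in the module statement.
module_config = parent_config["1kGP_pipeline"]


def missing_keys(required, cfg):
    """List the required keys that were not copied into the nested config."""
    return sorted(k for k in required if k not in cfg)


# A check like this can catch settings that were forgotten during the manual copy.
print(missing_keys(["maf", "r2", "hwe"], module_config))
```

Because the copy is manual, every workflow that imports this one must repeat it one level further up, which is the scaling problem noted above.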

Outstanding issues (17/10/24)

I am able to run this locally, but the cluster I use seems not to allow plink to create log files. I hope this is just some idiosyncrasy of cluster configuration, but it may relate to apptainer's interaction with the file system, something that can be configured via the snakemake CLI. I'm in the process of troubleshooting this and apparently I'm not the only one with this issue.

If you happen upon this repo and have a problem, please open an issue.
