Pascal N Timshel edited this page Nov 11, 2020 · 8 revisions

SNPs and genes

Q: What genes are used to prioritize cell-types?
A: Only genes in the specificity input that also exist in the gene coordinate file (data/gene_coordinates.GRCh37.ensembl_v91.txt) contribute to the cell-type prioritization. The genes in this file are protein-coding autosomal genes.
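
As a rough illustration of this filtering, the sketch below keeps only the genes from a specificity input that also appear in a gene coordinate table. The function name and the toy gene IDs are hypothetical, and the sketch assumes genes are matched on identical ID strings; it is not CELLECT's own code.

```python
def genes_used_for_prioritization(specificity_genes, coordinate_genes):
    """Return the specificity-input genes that also occur in the coordinate file.

    Hypothetical helper: both inputs are assumed to be iterables of gene ID strings.
    """
    allowed = set(coordinate_genes)
    # Preserve the input order and drop genes that have no coordinates.
    return [gene for gene in specificity_genes if gene in allowed]


# Toy example with made-up gene IDs.
specificity_genes = ["ENSG_A", "ENSG_B", "ENSG_C"]
coordinate_genes = ["ENSG_A", "ENSG_C", "ENSG_D"]

kept = genes_used_for_prioritization(specificity_genes, coordinate_genes)
dropped = [gene for gene in specificity_genes if gene not in set(kept)]
print(f"{len(kept)} genes kept, {len(dropped)} dropped: {dropped}")
```

Running a check like this on your own specificity input before starting CELLECT shows how many genes will be silently ignored.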

Q: What SNPs are used to prioritize cell-types?
A: When you munge the GWAS summary stat file with the --merge-alleles argument, your SNPs are restricted to those contained in HapMap3. This is currently our recommended workflow. For the technical details on why HapMap3 SNPs are recommended and sufficient, see the supplementary material of the S-LDSC paper (Finucane et al., Nat. Genet. 2015). For CELLECT-MAGMA you do not have to restrict to HapMap3 SNPs, but we encourage users to check how robust their cell-type prioritization results are to this SNP filtering.
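
To get a feel for how much of a GWAS survives this restriction, a minimal sketch like the one below can help. It assumes the summary statistics are available as a list of dicts with an "SNP" key holding the rsID, and that the reference SNP list is a plain list of rsIDs; the function name and the toy rsIDs are hypothetical, and this is not the munging script itself.

```python
def restrict_to_snplist(sumstats, snplist):
    """Keep only the summary-stat rows whose rsID occurs in snplist.

    Hypothetical helper: sumstats is a list of dicts with an "SNP" key.
    """
    allowed = set(snplist)
    return [row for row in sumstats if row["SNP"] in allowed]


# Toy data with made-up rsIDs and p-values.
sumstats = [
    {"SNP": "rs1", "P": 0.01},
    {"SNP": "rs2", "P": 0.50},
    {"SNP": "rs3", "P": 1e-8},
]
hapmap3_like = ["rs1", "rs3", "rs9"]

filtered = restrict_to_snplist(sumstats, hapmap3_like)
fraction_kept = len(filtered) / len(sumstats)
print(f"{len(filtered)} of {len(sumstats)} SNPs kept ({fraction_kept:.0%})")
```

For CELLECT-MAGMA, comparing results obtained with and without such a restriction is one way to check robustness to the SNP filtering.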

Q: Does it matter which genome build my GWAS input is from?
A: No. Internally CELLECT uses SNPs' rsIDs, and not chromosomal coordinates, so it does not matter what genome build the GWAS summary stat file is from.

Q: Are SNPs located in the MHC region used for cell-type prioritization?
A: By default, no. CELLECT-LDSC does not support inclusion of signal from the MHC (see this thread for an explanation of why). For CELLECT-MAGMA the MHC signal is excluded by default, but you can change this in the config.yml file.
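
If you want to see how many of your own SNPs fall in the MHC, the sketch below flags them by position. The boundaries used here (chromosome 6, 25-34 Mb on GRCh37) are an assumption based on a commonly used extended-MHC definition and are not taken from CELLECT; the function name and toy SNPs are hypothetical. Unlike rsID matching, a coordinate check like this depends on the genome build.

```python
# Assumed MHC boundaries (GRCh37, chromosome 6, 25-34 Mb); adjust to your own definition.
MHC_CHROM = "6"
MHC_START = 25_000_000
MHC_END = 34_000_000


def in_mhc(chrom, pos):
    """Return True if a SNP at (chrom, pos) lies inside the assumed MHC window."""
    return str(chrom) == MHC_CHROM and MHC_START <= pos <= MHC_END


# Toy SNPs as (rsID, chromosome, base-pair position); all values are made up.
snps = [
    ("rs_a", "6", 30_000_000),
    ("rs_b", "6", 40_000_000),
    ("rs_c", "1", 30_000_000),
]

outside_mhc = [rsid for rsid, chrom, pos in snps if not in_mhc(chrom, pos)]
print(outside_mhc)
```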

Nomenclature

Q: What is meant by "annotation"?
A: We use the word "annotation" in two contexts:

  1. Genomic 'annotation'. A continuous or discrete feature that can be used to 'annotate' or partition the genome. In CELLECT we use cell-type expression specificity to generate cell-type annotations, which are the basic units of the CELLECT prioritization analysis.
  2. 'Annotation' file (.annot.gz for CELLECT-LDSC or .genes.annot for CELLECT-MAGMA). For details on the file formats, see the LDSC documentation or MAGMA documentation.

Speed

Q: CELLECT seems to be hanging at "Building DAG of jobs...".
A: The first thing to try is updating snakemake, as newer versions of snakemake are considerably faster. Building the workflow DAG can take a long time if you are analyzing many cell-type annotations and/or GWAS, so you may want to analyze fewer combinations of GWAS and cell-type annotations. Building the DAG can also be sped up by disabling the heritability analysis mode (available in CELLECT-LDSC only).
