MESA is a method to call ethnicity-specific associations for biobank-scale multi-ethnic GWAS. This repository stores the C++ implementation of the statistical method used in my MPhil thesis "Identification of Population-specific Associations in Multi-ethnic Genome-wide Association Studies".
Reasons to use MESA:
- You want to detect ethnicity-specific associations in GWAS with admixed populations
- The data set you are analyzing comes from a biobank-scale GWAS
- You want to analyze multiple binary or continuous phenotypes
- You want to estimate admixture proportions of individuals
- You want to estimate population-specific allele frequencies
- You want a computationally efficient method that does all of the above simultaneously

Requirements:
- A Linux environment
- Intel CPU that supports OpenMP and AVX (Tested on Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz)
- GCC >= 9 and build-essential
- Eigen >= 3.3.7 from here

To build and run MESA:
- Specify the variables `BUILD_DIR`, `SRC_DIR`, and `EIGEN_DIR` in `makefile`
- Specify the user options in `common.h` if necessary
- Run `make` at the project root to build MESA
- Export the environment variable `OMP_NUM_THREADS` before using MESA, e.g. `export OMP_NUM_THREADS=4`
- Run `./build/mesa` to see the available options
- Follow the instructions here to see how you may apply MESA to the data set you would like to analyze
Usage: mesa -i INPUT [--cov COV_FILE] -N NUM_INDIVIDUALS -K NUM_ANCESTRIES -G NUM_SNPS -o OUTPUT_PREFIX [OPTION]...
Required arguments:
-i, --in INPUT Path of genotype matrix file (String).
Accepted file type: tsv, ttsv, bed
e.g. -i /home/User/genotype.bed
-N NUM_INDIVIDUALS Number of individuals to be tested
(Positive integer)
-K NUM_ANCESTRIES Number of ancestries to be fitted
(Positive integer)
-G NUM_SNPS Number of SNPs to be tested (Positive
integer)
-o, --out OUTPUT_PREFIX Prefix of path of output file
(String). For example, option
'-o /home/User/output' will produce
files such as '/home/User/output_p.tsv'
Optional arguments:
--cov COV_FILE Path of phenotype matrix file (String)
--batch-size Number of subsampled SNPs (Positive
integer, default: 1000)
--newton-iter X Maximum number of Newton's steps = X
(Positive integer, default: 10)
--em-iter X Maximum number of E-steps and M-steps = X
(Positive integer, default: 30)
--epochs X Number of passes through whole data set = X
(Positive integer, default: 1)
--continue Whether to use existing estimates
extracted from OUTPUT_PREFIX_p.tsv, etc
--no-pretrain Start training with no warm-up
--se-only Calculate standard errors using
existing estimates extracted from
OUTPUT_PREFIX_p.tsv,
OUTPUT_PREFIX_q.tsv and
OUTPUT_PREFIX_effect.tsv
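
The values passed to `-N` and `-G` have to match the input data. A minimal sketch of how they could be obtained, assuming the `bed` input is a PLINK binary fileset whose companion `.fam` file has one line per individual and whose `.bim` file has one line per SNP (this README does not confirm that layout). The helper names `count_lines`, `count_lines_in_file`, and `self_check` are hypothetical and are not part of MESA.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Count the lines in a text stream. A final line without a trailing
// newline still counts; blank lines are counted as well.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}

// Count the lines of a file on disk, e.g. a .fam file (candidate value
// for -N) or a .bim file (candidate value for -G). This mapping is an
// assumption about the input format, not documented MESA behaviour.
std::size_t count_lines_in_file(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return count_lines(in);
}

// Self-check on made-up in-memory records (two individuals).
void self_check() {
    std::istringstream fam("fam1 ind1 0 0 1 -9\nfam2 ind2 0 0 2 -9\n");
    assert(count_lines(fam) == 2);
}
```

With a fileset named `genotype`, `count_lines_in_file("genotype.fam")` and `count_lines_in_file("genotype.bim")` would give candidate values for `-N` and `-G` under the assumption above.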

Output files:
- `*_p.tsv`: a G-by-K matrix of estimated baseline allele frequency intercept terms. The (g, k)-th entry corresponds to in the literature
- `*_q.tsv`: an N-by-K matrix of estimated ancestry proportions. The (i, k)-th entry corresponds to in the literature
- `*_effect.tsv`: a G-by-(K * cov_num) matrix of estimated effect sizes, where 'cov_num' is the number of phenotypes tested. The (g, (k-1)*cov_num + r)-th entry corresponds to the estimate of in the literature
- `*_effect_se.tsv`: a G-by-(K * cov_num) matrix of standard errors of estimated effect sizes, where 'cov_num' is the number of phenotypes tested. The (g, (k-1)*cov_num + r)-th entry corresponds to the standard error estimate of in the literature
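
The column layout of the effect and standard-error matrices can be sketched as a pair of index helpers. The mapping (k-1)*cov_num + r comes from the description above, with 1-based ancestry index k and phenotype index r. The Wald-style z-score and normal two-sided p-value at the end are an assumption about how one might combine an effect with its standard error; they are not confirmed to be the test MESA itself performs. All function and type names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// 1-based column holding ancestry k (1..K) and phenotype r (1..cov_num).
std::size_t effect_column(std::size_t k, std::size_t r, std::size_t cov_num) {
    assert(k >= 1 && r >= 1 && r <= cov_num);
    return (k - 1) * cov_num + r;
}

struct AncestryPhenotype {
    std::size_t k;  // 1-based ancestry index
    std::size_t r;  // 1-based phenotype index
};

// Inverse mapping: recover (k, r) from a 1-based column index.
AncestryPhenotype split_effect_column(std::size_t col, std::size_t cov_num) {
    assert(col >= 1 && cov_num >= 1);
    return {(col - 1) / cov_num + 1, (col - 1) % cov_num + 1};
}

// Assumption: a Wald-style statistic, effect divided by its standard error.
double wald_z(double effect, double se) {
    return effect / se;
}

// Two-sided p-value under a standard normal reference distribution:
// 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
double two_sided_p(double z) {
    return std::erfc(std::fabs(z) / std::sqrt(2.0));
}
```

For example, with cov_num = 3 the entry for ancestry 2 and phenotype 1 sits in column `effect_column(2, 1, 3)`, which is column 4, in both `*_effect.tsv` and `*_effect_se.tsv`.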