SAIGE Hands On Practice (09 2021 updated)

weizhouUMICH edited this page Sep 10, 2021 · 1 revision

In this tutorial, we will perform single-variant association tests using SAIGE, which accounts for sample relatedness and unbalanced case-control ratios in binary phenotypes and scales to large data sets.
We will first use the terminal to run Linux commands and then use the RStudio console to run R scripts that generate the QQ plots.

Get Ready

Install the SAIGE R package

  • Please always refer to this GitHub page for up-to-date installation instructions:

https://github.com/weizhouUMICH/SAIGE#how-to-install-and-run-saige-and-saige-gene

Example data folder

https://github.com/weizhouUMICH/SAIGE/tree/master/extdata

Running SAIGE for single variant association tests

SAIGE has two steps

[Image: 2 steps in SAIGE]

Step 1: fitting the null logistic/linear mixed model

  • For binary traits, a null logistic mixed model will be fitted (--traitType=binary).
  • For quantitative traits, a null linear mixed model will be fitted (--traitType=quantitative). If the phenotype is not normally distributed, it needs to be inverse normalized (--invNormalize=TRUE).

Input files

  1. Genotype file for constructing the genetic relationship matrix (GRM)
    SAIGE takes PLINK binary files for the genotypes and assumes that the .bed, .bim and .fam files share the same prefix. This example PLINK file set contains genotypes for 128,868 genetic markers of 1,000 samples (100 families with 10 members each):
    ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2.bed ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2.bim ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2.fam
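Because Step 1 only receives the shared prefix, a missing .bim or .fam file is easy to overlook. The following is a minimal sketch of a pre-flight check; the helper names check_plink_prefix and plink_counts are hypothetical and not part of SAIGE. It relies on the PLINK convention that .fam has one line per sample and .bim one line per marker.

```shell
# check_plink_prefix PREFIX
# Verifies that PREFIX.bed, PREFIX.bim and PREFIX.fam all exist.
# Prints each missing file and returns non-zero if any is missing.
check_plink_prefix() {
    prefix=$1
    missing=0
    for ext in bed bim fam; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}"
            missing=1
        fi
    done
    return "$missing"
}

# plink_counts PREFIX
# Counts samples (lines in .fam) and markers (lines in .bim).
plink_counts() {
    samples=$(awk 'END { print NR }' "$1.fam")
    markers=$(awk 'END { print NR }' "$1.bim")
    echo "samples=${samples} markers=${markers}"
}

# Example call with the tutorial's prefix (uncomment to run):
# check_plink_prefix ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2
# plink_counts ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2
```

If the description above holds for your copy of the example data, plink_counts should report 1000 samples and 128868 markers.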

  2. Phenotype file (contains non-genetic covariates, if any, such as gender and age)
    The file can be either space- or tab-delimited and must have a header. It must contain one column for sample IDs and one column for the phenotype, and it may contain columns for non-genetic covariates.
    Note: The current version of SAIGE does not support categorical covariates with more than two categories. For such covariates, please use dummy variables.

#To take a look at the file
head ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt
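The note about dummy variables can be made concrete with a small awk helper. This is only a sketch: the function name add_dummies and the covariate name batch used in the usage line are hypothetical, and the example phenotype file is not claimed to contain such a column. The helper appends one 0/1 column per listed level; the level you leave out serves as the reference category. Output is written space-delimited.

```shell
# add_dummies FILE COLUMN LEVELS
# Appends one 0/1 indicator column per level in the comma-separated LEVELS.
# The level omitted from LEVELS acts as the reference category.
add_dummies() {
    awk -v col="$2" -v levels="$3" '
        BEGIN { n = split(levels, lev, ",") }
        NR == 1 {
            # locate the categorical column by its header name
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            $1 = $1                      # rebuild the line with single spaces
            line = $0
            for (j = 1; j <= n; j++) line = line " " col "_" lev[j]
            print line
            next
        }
        {
            $1 = $1
            line = $0
            for (j = 1; j <= n; j++) line = line " " ($c == lev[j] ? 1 : 0)
            print line
        }
    ' "$1"
}

# Hypothetical usage: a three-level covariate "batch" with levels A, B, C,
# using A as the reference:
# add_dummies pheno.txt batch B,C > pheno_with_dummies.txt
```

The new columns (batch_B and batch_C in the hypothetical usage) would then be listed in --covarColList in place of the original categorical column.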

Fit the null logistic mixed model for binary traits (--traitType=binary):

step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2 \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_binary \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary \
        --IsOverwriteVarianceRatioFile=TRUE \
        --nThreads=4
  • --plinkFile Path to plink file for creating the genetic relationship matrix (GRM)
  • --phenoFile Path to the phenotype file
  • --phenoCol Column name for phenotype to be tested in the phenotype file
  • --covarColList List of covariates (comma separated), matching column names in the phenotype file
  • --sampleIDColinphenoFile Column name of sample IDs in the phenotype file
  • --traitType binary/quantitative
  • --outputPrefix Path and prefix of the output files
  • --IsOverwriteVarianceRatioFile Whether to overwrite the variance ratio file if the file already exists
  • --nThreads Number of CPUs to be used for Step 1. Note: the computation time decreases roughly linearly as nThreads increases
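Since --phenoCol, --covarColList and --sampleIDColinphenoFile must all match column names in the phenotype file header, it can help to verify them before starting a job. A minimal sketch follows; check_columns is a hypothetical helper name, not a SAIGE command.

```shell
# check_columns FILE COL1,COL2,...
# Reports every requested column name that is absent from the header line
# and returns non-zero if any is missing.
check_columns() {
    awk -v want="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) seen[$i] = 1
            n = split(want, w, ",")
            for (j = 1; j <= n; j++)
                if (!(w[j] in seen)) { print "missing column: " w[j]; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

# Usage with the column names from the command above (uncomment to run):
# check_columns ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt IID,y_binary,x1,x2
```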

For a more detailed parameter list, run

step1_fitNULLGLMM.R --help

If the job above has run successfully, the screen output ends with the following text. The last decimal places may differ between machines because of rounding, e.g. 0.9340962 instead of 0.9340965.
[Image: screen output]

Note: for quantitative traits that are not normally distributed, specify --invNormalize=TRUE together with --traitType=quantitative to apply inverse normalization.
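For a quantitative trait, the Step 1 call differs from the binary example only in a few flags. The sketch below prints such a command instead of running it, so that it can be inspected first. It assumes that the quantitative phenotype column is named y_quantitative; this name is an assumption, so please check the header of the phenotype file. All flags are the ones described above, and step1_quantitative_cmd is a hypothetical helper name.

```shell
# ASSUMPTION: the quantitative phenotype column is called y_quantitative.
# Check the header of your phenotype file and change PHENO_COL if needed.
PHENO_COL=y_quantitative

# Prints (does not run) a Step 1 command for a quantitative trait.
step1_quantitative_cmd() {
cat <<EOF
step1_fitNULLGLMM.R \\
    --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2 \\
    --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \\
    --phenoCol=${PHENO_COL} \\
    --covarColList=x1,x2 \\
    --sampleIDColinphenoFile=IID \\
    --traitType=quantitative \\
    --invNormalize=TRUE \\
    --outputPrefix=./output/example_quantitative \\
    --IsOverwriteVarianceRatioFile=TRUE \\
    --nThreads=4
EOF
}

step1_quantitative_cmd
```

Once the printed command looks right, it can be pasted into the terminal or run with `step1_quantitative_cmd | sh`.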

Output files

  1. Model file containing the null model fitting results in an R list named modglmm (input for step 2)
    ./output/example_binary.rda
  2. Variance ratio file (input for step 2)
    ./output/example_binary.varianceRatio.txt
  3. Association result file for the subset of randomly selected markers used to estimate the variance ratio (temporary file, not used in later steps)
    ./output/example_binary_30markers.SAIGE.results.txt

Step 2: performing single-variant association tests

  • For binary traits, saddlepoint approximation (SPA) will be used to account for case-control imbalance.
  • Dosages/genotypes of the genetic variants to be tested can be provided in VCF, BGEN or SAV format.

Input files

  1. Dosage/genotype file containing dosages or genotypes of the markers to test
    SAIGE supports different formats for dosages: VCF, BCF, BGEN and SAV. We will use BGEN in this example.
    • BGEN
      ./input/nfam_1000_n.index_1_MAF_0.01.bgen ./input/nfam_1000_n.index_1_MAF_0.01.bgen.bgi
  2. Sample file
    This file contains one column of sample IDs, in the same order as the samples in the dosage file. No header is included. The file is needed ONLY for BGEN input.

#To take a look at the file
head ./input/samplefileforBgen.txt

  3. Model file from step 1
    ./output/example_binary.rda
  4. Variance ratio file from step 1
    ./output/example_binary.varianceRatio.txt
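The sample file requirements (a single column, no header) can be partly checked from the terminal. The sketch below uses a hypothetical helper, check_sample_file, which flags lines that do not contain exactly one field and otherwise prints the number of IDs. It cannot tell whether the first line is a header, so that still needs a look with head.

```shell
# check_sample_file FILE
# Flags every line that does not contain exactly one field.
# If all lines are fine, prints the number of sample IDs.
check_sample_file() {
    awk '
        NF != 1 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { if (!bad) print NR " sample IDs"; exit bad + 0 }
    ' "$1"
}

# Usage with the tutorial's sample file (uncomment to run):
# check_sample_file ./input/samplefileforBgen.txt
```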

Perform single-variant association tests

step2_SPAtests.R \
  --bgenFile=./input/nfam_1000_n.index_1_MAF_0.01.bgen \
  --bgenFileIndex=./input/nfam_1000_n.index_1_MAF_0.01.bgen.bgi \
  --sampleFile=./input/samplefileforBgen.txt \
  --chrom=1 \
  --minMAF=0.0001 \
  --minMAC=20 \
  --GMMATmodelFile=./output/example_binary.rda \
  --varianceRatioFile=./output/example_binary.varianceRatio.txt \
  --SAIGEOutputFile=./output/example_binary.SAIGE.bgen.txt \
  --numLinesOutput=1000 \
  --IsOutputAFinCaseCtrl=TRUE
  • --bgenFile Path to bgen file.
  • --bgenFileIndex Path to the .bgi file (index of the bgen file)
  • --sampleFile Path to the file that contains one column for IDs of samples in the dosage file
  • --chrom chromosome of the markers in bgen. Note only one chromosome can be tested in each job.
  • --minMAF Minimum minor allele frequency for markers to be tested
  • --minMAC Minimum minor allele count for markers to be tested
  • --GMMATmodelFile Path to the input file containing the glmm model, which is output from step 1
  • --varianceRatioFile Path to the input file containing the variance ratio, which is output from the step 1
  • --SAIGEOutputFile Path to the output file containing assoc test results
  • --numLinesOutput Number of markers to be output to the file each time
  • --IsOutputAFinCaseCtrl whether to output allele frequency in cases and controls for binary traits
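Because only one chromosome can be tested per job, a genome-wide run is usually organised as one Step 2 job per chromosome. The sketch below prints one command per autosome without executing anything. The per-chromosome file layout (./input/mydata_chr<N>.bgen) and the helper names are hypothetical; the tutorial data itself only covers the single BGEN file shown above.

```shell
# HYPOTHETICAL layout: one BGEN file per chromosome, named
# ./input/mydata_chr<N>.bgen, each with a matching .bgi index.
# The sample, model and variance ratio files are the ones from this tutorial.
step2_cmd_for_chrom() {
    chr=$1
    echo "step2_SPAtests.R" \
        "--bgenFile=./input/mydata_chr${chr}.bgen" \
        "--bgenFileIndex=./input/mydata_chr${chr}.bgen.bgi" \
        "--sampleFile=./input/samplefileforBgen.txt" \
        "--chrom=${chr}" \
        "--minMAF=0.0001" \
        "--minMAC=20" \
        "--GMMATmodelFile=./output/example_binary.rda" \
        "--varianceRatioFile=./output/example_binary.varianceRatio.txt" \
        "--SAIGEOutputFile=./output/example_binary.chr${chr}.SAIGE.bgen.txt" \
        "--numLinesOutput=1000" \
        "--IsOutputAFinCaseCtrl=TRUE"
}

# Print one command line per autosome (dry run; nothing is executed).
print_all_step2_cmds() {
    c=1
    while [ "$c" -le 22 ]; do
        step2_cmd_for_chrom "$c"
        c=$((c + 1))
    done
}

print_all_step2_cmds
```

Each printed line is a complete command, so the output can be saved to a file and submitted line by line to whatever job scheduler is available.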

For a more detailed parameter list, run

step2_SPAtests.R --help

If the job above has run successfully, the screen output ends with the following text. numPassMarker: 4950 indicates that 4950 markers passed the filters (--minMAF, --minMAC) and were tested.
[Image: screen output]

Output file

A file with association test results

#To take a look at the file
head ./output/example_binary.SAIGE.bgen.txt

NOTE:

Association results are with regard to Allele2. For binary traits, the header of the output file:

CHR POS rsID SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar

  • CHR: chromosome
  • POS: genome position
  • rsID: rs ID for variant (only output for bgen input)
  • SNPID: variant ID
  • Allele1: allele 1
  • Allele2: allele 2
  • AC_Allele2: allele count of allele 2
  • AF_Allele2: allele frequency of allele 2
  • imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
  • N: sample size
  • BETA: effect size of allele 2
  • SE: standard error of BETA
  • Tstat: score statistic of allele 2
  • p.value: p value (with SPA applied for binary traits)
  • p.value.NA: p value when SPA is not applied (only for binary traits)
  • Is.SPA.converge: whether SPA is converged or not (only for binary traits)
  • varT: estimated variance of score statistic with sample relatedness incorporated
  • varTstar: variance of score statistic without sample relatedness incorporated
  • AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
  • AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
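With the column names above, the results can be filtered from the terminal without hard-coding column positions. The following sketch looks up the p.value column by name and keeps rows below a threshold; filter_by_pvalue is a hypothetical helper name, and the 5e-8 threshold in the usage line is the commonly used genome-wide significance level, not a SAIGE setting. Rows whose p-value is NA are skipped.

```shell
# filter_by_pvalue FILE THRESHOLD
# Prints the header plus every row whose p.value is below THRESHOLD.
filter_by_pvalue() {
    awk -v thr="$2" '
        NR == 1 {
            # find the p.value column by its header name
            for (i = 1; i <= NF; i++) if ($i == "p.value") p = i
            if (!p) { print "no p.value column" > "/dev/stderr"; exit 1 }
            print
            next
        }
        $p != "NA" && ($p + 0) < (thr + 0)
    ' "$1"
}

# Usage on the Step 2 output (uncomment to run):
# filter_by_pvalue ./output/example_binary.SAIGE.bgen.txt 5e-8
```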

Generate QQ plots for association p-values

We need to use an R script for plotting, so please switch to the console in RStudio.
[Image: RStudio console]

The output results contain the p values with (p.value) and without (p.value.NA) SPA adjustment. Now, we generate QQ plots for Score tests with and without SPA.

#Set the working directory in Rstudio console to the extdata folder. Please change the path if the extdata folder is not in your home directory on the cluster
setwd("~/extdata/") 

#read in the output file in the output folder in extdata
data = read.table("./output/example_binary.SAIGE.bgen.txt", header=TRUE)

#function to generate a QQ plot
plot_qq<-function(pval,title)
{
	pval.ord<-pval[order(pval)]
	p.unif<-(1:length(pval))/length(pval)
	plot(-log10(p.unif),-log10(pval.ord),xlab="-log10(Expected p-values)",ylab="-log10(Observed p-values)",main=title,pch=16)
	abline(a=0,b=1)
}

par(mfrow=c(1,2))
plot_qq(data$p.value.NA,title="SAIGE without SPA")
plot_qq(data$p.value,title="SAIGE with SPA")

[Image: QQ plot]
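The numbers behind the QQ plot can also be inspected as text from the terminal. The sketch below mirrors what plot_qq computes: p-values are sorted in ascending order and the i-th smallest of n is paired with the expected value i/n, both on the -log10 scale. The helper name qq_points is hypothetical, and NA p-values are dropped.

```shell
# qq_points FILE COLUMN
# Prints "expected observed" pairs of -log10 p-values, smallest p-value first,
# using the same expected quantiles i/n as the R function plot_qq.
qq_points() {
    awk -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "NA" { print $c }
    ' "$1" | sort -g | awk '
        { p[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%.4f %.4f\n", log(NR / i) / log(10), 0 - log(p[i]) / log(10)
        }
    '
}

# Usage on the Step 2 output (uncomment to run):
# qq_points ./output/example_binary.SAIGE.bgen.txt p.value | head
```

Points whose observed value lies well above the expected value in the first rows correspond to the upper-right tail of the plot.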

Other useful options and notes

  • --condition Genetic marker IDs (chr:pos_ref/alt for VCF/SAV, or marker ID for BGEN) separated by commas, e.g. chr3:101651171_C/T,chr3:101651186_G/A, for conditional analysis.
  • --minMAF, --minMAC Minimum minor allele frequency and minimum minor allele count for markers to be tested. Whichever of the two thresholds is stricter (higher) will be used.
  • --minInfo Minimum imputation score for imputed markers to be tested
  • Conditional analysis based on summary statistics can be performed (--condition) in Step 2 with dosage/genotype input in VCF, BGEN or SAV format.
  • To query and test a subset of markers
    • both variant IDs and ranges of chromosome positions can be specified for BGEN input (--idstoExcludeFile, --idstoIncludeFile, --rangestoExcludeFile, --rangestoIncludeFile)
    • a range of chromosome positions can be specified for VCF/SAV input (--chrom, --start, --end).
  • For VCF/SAV input, --chrom MUST be specified and the string needs to be exactly the same as in the VCF/SAV, such as "01" or "chr1".
  • For VCF/SAV input, --vcfField=DS to test dosages and --vcfField=GT to test genotypes
  • To drop samples with missing genotypes/dosages, set --IsDropMissingDosages=TRUE. If it is FALSE, missing genotypes/dosages will be mean imputed.
  • --sampleFile is used to specify a file with sample IDs for the BGEN file. Please DO NOT include a header in the file. SAIGE versions >= 0.38 do not need --sampleFile if VCF files are used.
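To see how the --minMAF and --minMAC thresholds from the list above interact, both can be put on the same scale: for diploid autosomal markers, a minor allele frequency MAF in N samples corresponds to a minor allele count of 2 × N × MAF. The sketch below does this arithmetic for the Step 2 example, assuming N = 1000 samples as in the tutorial's phenotype file; effective_thresholds is a hypothetical helper name.

```shell
# effective_thresholds N MIN_MAF MIN_MAC
# Converts minMAF to an allele count (2 * N * MAF) and reports which
# threshold is the stricter one for a diploid sample of size N.
effective_thresholds() {
    awk -v n="$1" -v maf="$2" -v mac="$3" 'BEGIN {
        mac_from_maf = 2 * n * maf
        eff = (mac + 0 > mac_from_maf) ? mac + 0 : mac_from_maf
        printf "MAC implied by minMAF: %g\n", mac_from_maf
        printf "effective minimum MAC: %g\n", eff
        printf "effective minimum MAF: %g\n", eff / (2 * n)
    }'
}

# Values from the Step 2 example: --minMAF=0.0001 --minMAC=20, N assumed 1000
effective_thresholds 1000 0.0001 20
```

Under that assumption, --minMAF=0.0001 alone would admit markers with a count as low as 0.2, so --minMAC=20 is the stricter filter and corresponds to a minor allele frequency of 0.01.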

Useful options specifically for binary phenotypes

  • --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
  • --IsOutputNinCaseCtrl=TRUE can be specified to output sample sizes of cases and controls for each marker
  • --IsOutputHetHomCountsinCaseCtrl=TRUE can be specified to output heterozygous and homozygous counts in cases and controls