Configuration

Abdulrahman Alasiri edited this page Jun 24, 2021 · 19 revisions

Introduction

LoFTK first predicts LoF variants using VEP together with LOFTEE. Next, it works in parallel to calculate LoF variants (heterozygous and homozygous) and LoF genes (1-copy and 2-copy losses). Finally, it creates a report of descriptive statistics.

  • To run any analysis, you need to set up the configuration file, LoF.config. This page walks through the configuration file section by section.
  • You can use the #-sign to comment out lines in the configuration file; the bash-based scripts will not read these lines.
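
As a minimal sketch of the commenting behaviour, assume the bash-based scripts load the configuration by sourcing it (this page does not confirm the mechanism). The temporary file and its contents below are hypothetical stand-ins for LoF.config:

```shell
# Hypothetical demo: write a tiny config to a temporary file and source it.
demo_config="$(mktemp)"
cat > "$demo_config" <<'EOF'
# This whole line is a comment and is ignored.
PROJECTNAME="demo_project"
#ASSEMBLY="GRCh38"
ASSEMBLY="GRCh37"
EOF

# '.' (source) runs the file in the current shell, so only the
# uncommented assignments take effect.
. "$demo_config"
echo "PROJECTNAME=${PROJECTNAME} ASSEMBLY=${ASSEMBLY}"

rm -f "$demo_config"
```

The commented-out GRCh38 line has no effect, so ASSEMBLY ends up as GRCh37.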

Configuration file

Software

You will need to set the paths to all supporting tools on the server, for instance:

LOFTOOLKIT=/path/to/LOFTK
PERL=/path/to/perl
VEP=/path/to/ensembl-vep/vep
ENSEMBL=/path/to/ensembl-vep
LOFTEE=/path/to/loftee
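
Before submitting jobs, it can help to confirm that each configured path actually exists. The helper below is a hypothetical sketch and not part of LoFTK; check_tool is an illustrative name, and the demo calls use placeholder paths so the block runs anywhere:

```shell
# check_tool NAME PATH: report whether PATH exists on this machine.
check_tool() {
    if [ -e "$2" ]; then
        echo "OK       $1=$2"
    else
        echo "MISSING  $1=$2"
        return 1
    fi
}

# Demo with placeholder paths; in practice you would pass "$PERL", "$VEP",
# "$ENSEMBL", "$LOFTEE" and "$LOFTOOLKIT" after loading LoF.config.
check_tool ROOT_DEMO "/"
check_tool MISSING_DEMO "/no/such/path/loftk_demo" || echo "fix this path in LoF.config"
```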

Memory & Time

Since we are using the SLURM system, we need to provide memory and time settings for the various steps in the process. For instance, we predicted LoF variants/genes by splitting the analyses per chromosome, which works for us with approximately 2,000 samples and 61,085,699 SNPs (imputed data). You will probably have to increase these settings if your sample is larger. In our case, the time and memory for the analysis of one chromosome are as follows:

## Convert allele probes files (IMPUTE2 output) to VCF.
QUEUE_PROB2VCF_CONFIG="05:00:00"
VMEM_PROB2VCF_CONFIG="40G"
## LoF annotation (VEP, LOFTEE).
QUEUE_ANNOTATION_CONFIG="10:00:00"
VMEM_ANNOTATION_CONFIG="90G"
## Calculation of LoF genes.
QUEUE_LOF_GENE_CONFIG="15:00:00"
VMEM_LOF_GENE_CONFIG="90G"
## Calculation of LoF variants.
QUEUE_LOF_SNP_CONFIG="15:00:00"
VMEM_LOF_SNP_CONFIG="90G"
## Statistical output
QUEUE_STAT_DESC_CONFIG="02:00:00"
VMEM_STAT_DESC_CONFIG="90G"
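
How LoFTK hands these values to SLURM is not shown on this page. As a rough sketch, assume each QUEUE_* value becomes the wall-clock limit (sbatch --time) and each VMEM_* value the memory request (sbatch --mem). The function build_sbatch_cmd and the job script name annotate_chr1.sh are hypothetical; the command is only printed, not submitted:

```shell
# Values as they would appear in LoF.config.
QUEUE_ANNOTATION_CONFIG="10:00:00"
VMEM_ANNOTATION_CONFIG="90G"

# build_sbatch_cmd TIME MEM SCRIPT: print (do not run) an sbatch command line.
build_sbatch_cmd() {
    echo "sbatch --time=$1 --mem=$2 $3"
}

# annotate_chr1.sh is a hypothetical per-chromosome job script.
build_sbatch_cmd "$QUEUE_ANNOTATION_CONFIG" "$VMEM_ANNOTATION_CONFIG" "annotate_chr1.sh"
```

Printing the command first makes it easy to check the time and memory requests before anything reaches the scheduler.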

Notifications

You can be notified when (sub-)analytical steps begin, end, or, worse, are aborted. You can set this as follows:

# REQUIRED: mailing settings
# your e-mail address; you'll get an email when the job has ended or when it was aborted
# 'BEGIN' Mail is sent at the beginning of the job;
# 'END' Mail is sent at the end of the job;
# 'FAIL' Mail is sent when the job fails;
# 'REQUEUE' Mail is sent when the job is requeued;
# 'ALL' equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT;
# 'NONE' No mail is sent.
YOUREMAIL="your_email"
MAILSETTINGS="END,FAIL"
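
A typo in MAILSETTINGS is easy to miss, so a small check against the values listed above may be useful. This is a hypothetical helper, not part of LoFTK; the final line assumes the two settings correspond to the sbatch options --mail-user and --mail-type, and it only prints them:

```shell
YOUREMAIL="your_email"
MAILSETTINGS="END,FAIL"

# validate_mailsettings LIST: succeed only if every comma-separated
# entry is one of the mail types listed on this page.
validate_mailsettings() {
    old_ifs="$IFS"
    IFS=','
    for mail_type in $1; do
        case "$mail_type" in
            BEGIN|END|FAIL|REQUEUE|ALL|NONE) ;;
            *)
                IFS="$old_ifs"
                echo "unknown mail type: $mail_type"
                return 1
                ;;
        esac
    done
    IFS="$old_ifs"
    return 0
}

if validate_mailsettings "$MAILSETTINGS"; then
    # Assumed mapping to sbatch options; printed only, nothing is submitted.
    echo "--mail-user=${YOUREMAIL} --mail-type=${MAILSETTINGS}"
fi
```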

The top part of your configuration file should look like the one below.

### CONFIGURATION FILE FOR LoF TOOLKIT ###
# Precede your comments with a #-sign.

# Set the directory variables, the order doesn't matter.
# Don't end the directory variables with '/' (forward-slash)!

### --- SYSTEM SETTINGS --- ###
# REQUIRED: Path_to where LoFToolKit resides on the server.
LOFTOOLKIT=/path/to/LOFTK

# REQUIRED: Path_to support programs on the server.
PERL=/path/to/perl
VEP=/path/to/ensembl-vep/vep
ENSEMBL=/path/to/ensembl-vep
LOFTEE=/path/to/loftee

### --- SLURM SETTINGS --- ###
## Convert allele probes files (IMPUTE2 output) to VCF.
QUEUE_PROB2VCF_CONFIG="05:00:00"
VMEM_PROB2VCF_CONFIG="40G"
## LoF annotation (VEP, LOFTEE).
QUEUE_ANNOTATION_CONFIG="10:00:00"
VMEM_ANNOTATION_CONFIG="90G"
## Calculation of LoF genes.
QUEUE_LOF_GENE_CONFIG="15:00:00"
VMEM_LOF_GENE_CONFIG="90G"
## Calculation of LoF variants.
QUEUE_LOF_SNP_CONFIG="15:00:00"
VMEM_LOF_SNP_CONFIG="90G"
## Statistical output
QUEUE_STAT_DESC_CONFIG="02:00:00"
VMEM_STAT_DESC_CONFIG="90G"


# REQUIRED: mailing settings
# your e-mail address; you'll get an email when the job has ended or when it was aborted
# 'BEGIN' Mail is sent at the beginning of the job;
# 'END' Mail is sent at the end of the job;
# 'FAIL' Mail is sent when the job fails;
# 'REQUEUE' Mail is sent when the job is requeued;
# 'ALL' equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT;
# 'NONE' No mail is sent.
YOUREMAIL="your_email"
MAILSETTINGS="END,FAIL"

Analysis settings

Folder structure

You have probably organized your work in folders; here you can set these. You should set a ROOTDIR and provide a PROJECTNAME. These two variables are used to create two new folders in the ROOTDIR: [PROJECTNAME]_Files_for_VCF_LoF and [PROJECTNAME]_LoF_output.

You can add this to the configuration file:

### --- ANALYSIS SETTINGS --- ###
# REQUIRED: Path_to a directory where the main analysis directory resides.
ROOTDIR=/path/to/your_input_data

PROJECTNAME="project_name"
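
To illustrate how the two folder names follow from these variables, here is a small sketch. The variable names VCF_DIR and OUT_DIR are hypothetical and may differ from what LoFTK uses internally; stripping a trailing slash is an extra safeguard added here, in line with the advice not to end directory variables with '/':

```shell
ROOTDIR="/path/to/your_input_data"
PROJECTNAME="project_name"

# Drop a single trailing '/' if present, so the joined paths stay clean.
ROOTDIR="${ROOTDIR%/}"

# Hypothetical variable names for the two derived folders.
VCF_DIR="${ROOTDIR}/${PROJECTNAME}_Files_for_VCF_LoF"
OUT_DIR="${ROOTDIR}/${PROJECTNAME}_LoF_output"

echo "$VCF_DIR"
echo "$OUT_DIR"
```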

Analysis specifics

There are some specific settings that depend on the type of analysis you will run:

  • Data type

    • Set DATA_TYPE to the type of your input data: genotype, exome or genome.
  • Input file format
    LoFTK accepts only two input file formats:

    • IMPUTE2 output format

    ❗ If you set FILE_FORMAT to IMPUTE2, please set the INFO score cutoff (INFO) and Probability cutoff (PROB).

    • VCF

    ❗ If you set FILE_FORMAT to VCF, indicate whether your input data have been phased by setting PHASE_STATUS to yes or no.

  • Select the assembly version
    LoFTK supports the Homo sapiens (human) genome assemblies GRCh37 and GRCh38; you have to choose one of them.

  • Set chromosomes range
    We highly recommend splitting data per chromosome, or even into chunks per chromosome. Here, you need to set the range of chromosomes. For instance, to analyze chromosomes 1 to 22, set CHROMOSOMES="$(seq 1 22)"; to analyze specific chromosomes that do not form a range, such as chr 1, 4, 7 and 22, set CHROMOSOMES="1 4 7 22".
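
Both forms of CHROMOSOMES expand to a whitespace-separated list, which a shell loop can walk through. The loop below is only an illustration of that behaviour; how LoFTK iterates internally is an assumption:

```shell
# A contiguous range: seq prints one number per line, and the unquoted
# expansion in the for loop splits on that whitespace.
CHROMOSOMES="$(seq 1 22)"
count=0
for CHR in $CHROMOSOMES; do
    count=$((count + 1))
done
echo "range covers ${count} chromosomes"

# An explicit, non-contiguous list works the same way.
CHROMOSOMES="1 4 7 22"
for CHR in $CHROMOSOMES; do
    echo "would process chr${CHR}"
done
```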

# REQUIRED: Set data type and input file format:
# Set data type, choose one of these options [genotype/exome/genome]
DATA_TYPE="genotype"

# Set input file format, choose one of these options [IMPUTE2/VCF] # IMPUTE2 must include required files (hap|allele_probe|info|sample)
FILE_FORMAT="VCF"

# If you set FILE_FORMAT to IMPUTE2, please set the INFO score cutoff (default: 0.4) and Probability cutoff (default: 0.05)
INFO=0.8
PROB=0.05

# If you set FILE_FORMAT to VCF, have your input data been phased? [yes/no]
PHASE_STATUS="yes"

# REQUIRED: Select the assembly version, choose one of these options [GRCh37/GRCh38]
ASSEMBLY="GRCh37"

# REQUIRED: Set chromosomes range, e.g. CHROMOSOMES="$(seq 1 22)"
CHROMOSOMES="$(seq 1 22)"
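
Since DATA_TYPE, FILE_FORMAT and ASSEMBLY each accept only a fixed set of values, a quick check before launching an analysis can catch misspellings. The function validate_settings is a hypothetical sketch, not part of LoFTK, and it checks only the options listed in the comments above:

```shell
# validate_settings DATA_TYPE FILE_FORMAT ASSEMBLY
# Returns 0 only if all three values are among the documented options.
validate_settings() {
    case "$1" in
        genotype|exome|genome) ;;
        *) echo "invalid DATA_TYPE: $1"; return 1 ;;
    esac
    case "$2" in
        IMPUTE2|VCF) ;;
        *) echo "invalid FILE_FORMAT: $2"; return 1 ;;
    esac
    case "$3" in
        GRCh37|GRCh38) ;;
        *) echo "invalid ASSEMBLY: $3"; return 1 ;;
    esac
    echo "settings look valid"
}

validate_settings "genotype" "VCF" "GRCh37"
validate_settings "genotype" "BAM" "GRCh37" || echo "fix the configuration before submitting jobs"
```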