Configuration

Abdulrahman Alasiri edited this page Jun 22, 2021 · 19 revisions

Introduction

  • To run any analysis, you first need to set up the configuration file, LoF.config. This page walks through each part of that file.
  • Use the #-sign to comment out lines in the configuration file; commented lines are not read by the bash-based scripts.
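As a minimal sketch of why commented lines are ignored, assume the scripts load the configuration by sourcing it as shell code (an assumption; the page only states that the scripts are bash-based). The temporary file and its values below are illustrative only:

```shell
#!/usr/bin/env bash
# Sketch: write a tiny config with one commented-out line, then source it.
# The temporary file and the values are placeholders, not real LoFTK paths.
config_file="$(mktemp)"
cat > "$config_file" <<'EOF'
# PERL=/this/line/is/commented/out
PERL=/path/to/perl
EOF

# Sourcing runs the file as shell code, so lines starting with '#' are skipped.
. "$config_file"
echo "PERL is set to: $PERL"
rm -f "$config_file"
```

Only the uncommented assignment takes effect, so disabling a setting is a matter of prefixing its line with #.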

Configuration file

Software

You need to set the paths to all supporting tools on the server, for instance:

LOFTOOLKIT=/path/to/LOFTK
PERL=/path/to/perl
VEP=/path/to/ensembl-vep/vep
ENSEMBL=/path/to/ensembl-vep
LOFTEE=/path/to/loftee
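Before running anything, it can help to confirm that each configured path exists and does not end with a forward slash (the configuration file itself asks for no trailing '/'). The helper below, check_path, is a hypothetical name and not part of LoFTK; the values are the placeholders from the example above:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LoFTK): check one configured path.
# $1 = variable name, $2 = its value.
check_path() {
  case "$2" in
    */) echo "TRAILING SLASH  $1=$2"; return 1 ;;
  esac
  if [ -e "$2" ]; then
    echo "OK              $1=$2"
  else
    echo "MISSING         $1=$2"
    return 1
  fi
}

# Placeholder values; replace them with the real paths on your server.
LOFTOOLKIT=/path/to/LOFTK
PERL=/path/to/perl
VEP=/path/to/ensembl-vep/vep
ENSEMBL=/path/to/ensembl-vep
LOFTEE=/path/to/loftee

problems=0
check_path LOFTOOLKIT "$LOFTOOLKIT" || problems=$((problems + 1))
check_path PERL "$PERL" || problems=$((problems + 1))
check_path VEP "$VEP" || problems=$((problems + 1))
check_path ENSEMBL "$ENSEMBL" || problems=$((problems + 1))
check_path LOFTEE "$LOFTEE" || problems=$((problems + 1))
echo "$problems problem(s) found"
```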

Memory & Time

Since we use the SLURM system, we need to provide memory and time settings for the various steps in the process. For instance, we predicted LoF variants/genes by splitting the analyses per chromosome, which works for us with ~2,000 samples. You will probably have to adjust these settings if your sample is larger. In our case, the time and memory for the analysis of one chromosome are as follows:

## Convert allele probes files (IMPUTE2 output) to VCF.
QUEUE_PROB2VCF_CONFIG="05:00:00"
VMEM_PROB2VCF_CONFIG="40G"
## LoF annotation (VEP, LOFTEE).
QUEUE_ANNOTATION_CONFIG="10:00:00"
VMEM_ANNOTATION_CONFIG="90G"
## Calculation of LoF genes.
QUEUE_LOF_GENE_CONFIG="15:00:00"
VMEM_LOF_GENE_CONFIG="90G"
## Calculation of LoF variants.
QUEUE_LOF_SNP_CONFIG="15:00:00"
VMEM_LOF_SNP_CONFIG="90G"
## Statistical output
QUEUE_STAT_DESC_CONFIG="02:00:00"
VMEM_STAT_DESC_CONFIG="90G"
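To illustrate what these values mean, here is a rough sketch of how a time and memory pair could be handed to SLURM through the standard sbatch options --time and --mem. How LoFTK actually passes them is not shown on this page, and annotate_chr.sh is a hypothetical job script name:

```shell
#!/usr/bin/env bash
# Values taken from the annotation step in the example above.
QUEUE_ANNOTATION_CONFIG="10:00:00"   # wall-clock limit, hours:minutes:seconds
VMEM_ANNOTATION_CONFIG="90G"         # memory request

# Build the submission command as a string so it can be inspected first.
# annotate_chr.sh is a hypothetical job script used only for illustration.
sbatch_cmd="sbatch --time=${QUEUE_ANNOTATION_CONFIG} --mem=${VMEM_ANNOTATION_CONFIG} annotate_chr.sh"
echo "$sbatch_cmd"
```

Printing the command rather than running it makes it easy to check the limits before any job is submitted.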

Notifications

You can be notified when (sub-)analytical steps begin or end, or worse, are aborted. You can set this as follows:

# REQUIRED: mailing settings
# your e-mail address; you'll get an email when the job has ended or when it was aborted
# 'BEGIN' Mail is sent at the beginning of the job;
# 'END' Mail is sent at the end of the job;
# 'FAIL' Mail is sent when the job fails or is aborted;
# 'REQUEUE' Mail is sent when the job is requeued;
# 'ALL' equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT;
# 'NONE' No mail is sent.
YOUREMAIL="your_email"
MAILSETTINGS="END,FAIL"
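A typo in MAILSETTINGS is easy to miss, so a small check against the options listed above can be useful. This is only a sketch; valid_mail_settings is a hypothetical helper and not part of LoFTK:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every comma-separated entry
# is one of the mail options listed in the configuration comments.
valid_mail_settings() {
  old_ifs="$IFS"
  IFS=','
  for item in $1; do
    case "$item" in
      BEGIN|END|FAIL|REQUEUE|ALL|NONE) ;;
      *)
        IFS="$old_ifs"
        echo "Unknown mail setting: $item"
        return 1
        ;;
    esac
  done
  IFS="$old_ifs"
  return 0
}

MAILSETTINGS="END,FAIL"
if valid_mail_settings "$MAILSETTINGS"; then
  echo "MAILSETTINGS looks valid: $MAILSETTINGS"
fi
```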

The top part of your configuration file should look like the one below.

### CONFIGURATION FILE FOR LoF TOOLKIT ###
# Precede your comments with a #-sign.

# Set the directory variables, the order doesn't matter.
# Don't end the directory variables with '/' (forward-slash)!

### --- SYSTEM SETTINGS --- ###
# REQUIRED: Path_to where LoFToolKit resides on the server.
LOFTOOLKIT=/path/to/LOFTK

# REQUIRED: Path_to support programs on the server.
PERL=/path/to/perl
VEP=/path/to/ensembl-vep/vep
ENSEMBL=/path/to/ensembl-vep
LOFTEE=/path/to/loftee

### --- SLURM SETTINGS --- ###
## Convert allele probes files (IMPUTE2 output) to VCF.
QUEUE_PROB2VCF_CONFIG="05:00:00"
VMEM_PROB2VCF_CONFIG="40G"
## LoF annotation (VEP, LOFTEE).
QUEUE_ANNOTATION_CONFIG="10:00:00"
VMEM_ANNOTATION_CONFIG="90G"
## Calculation of LoF genes.
QUEUE_LOF_GENE_CONFIG="15:00:00"
VMEM_LOF_GENE_CONFIG="90G"
## Calculation of LoF variants.
QUEUE_LOF_SNP_CONFIG="15:00:00"
VMEM_LOF_SNP_CONFIG="90G"
## Statistical output
QUEUE_STAT_DESC_CONFIG="02:00:00"
VMEM_STAT_DESC_CONFIG="90G"


# REQUIRED: mailing settings
# your e-mail address; you'll get an email when the job has ended or when it was aborted
# 'BEGIN' Mail is sent at the beginning of the job;
# 'END' Mail is sent at the end of the job;
# 'FAIL' Mail is sent when the job fails or is aborted;
# 'REQUEUE' Mail is sent when the job is requeued;
# 'ALL' equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT;
# 'NONE' No mail is sent.
YOUREMAIL="your_email"
MAILSETTINGS="END,FAIL"

Analysis settings

Folder structure

You have probably organized your work in folders; here you can set them. Set a ROOTDIR and provide a PROJECTNAME. These two variables are used to create two new folders in the ROOTDIR: [PROJECTNAME]_Files_for_VCF_LoF and [PROJECTNAME]_LoF_output.

You can add this to the configuration file:

### --- ANALYSIS SETTINGS --- ###
# REQUIRED: Path_to a directory where the main analysis directory resides.
ROOTDIR=/path/to/your_input_data

PROJECTNAME="project_name"
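As a small illustration of the naming scheme described above, the two folder paths can be derived from ROOTDIR and PROJECTNAME as follows. The variable names files_dir and output_dir are hypothetical and used only for this sketch:

```shell
#!/usr/bin/env bash
# Placeholder values from the example configuration.
ROOTDIR=/path/to/your_input_data
PROJECTNAME="project_name"

# Folder names follow the pattern described in the text; the variable
# names files_dir and output_dir are illustrative only.
files_dir="${ROOTDIR}/${PROJECTNAME}_Files_for_VCF_LoF"
output_dir="${ROOTDIR}/${PROJECTNAME}_LoF_output"

echo "Input files folder: $files_dir"
echo "Output folder:      $output_dir"
```

Because the folder names are joined to ROOTDIR with a '/', a ROOTDIR that already ends in '/' would produce a doubled slash, which is one reason to leave it off.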

Analysis specifics

There are some specific settings that depend on the type of analysis you will run:

  • Data type

    • Choose the type of your input data by setting DATA_TYPE to one of: genotype, exome, or genome.
  • Input file format
    Only two file formats are accepted by LoFTK:

    • IMPUTE2 output format

    ❗ If you set FILE_FORMAT to IMPUTE2, please set the INFO score cutoff (INFO) and Probability cutoff (PROB).

    • VCF

    ❗ If you set FILE_FORMAT to VCF, indicate whether your input data have been phased by setting PHASE_STATUS to yes or no.

  • Select the assembly version
    LoFTK supports both the GRCh37 and GRCh38 Homo sapiens (human) genome assemblies. You have to choose one of them.

  • Set chromosomes range
    We highly recommend splitting data per chromosome, or even into chunks per chromosome. Here, you need to set the range of chromosomes. For instance, to analyze chromosomes 1 to 22, set CHROMOSOMES="$(seq 1 22)"; to analyze specific chromosomes that do not form a range, such as chr 1, 4, 7 and 22, set CHROMOSOMES="1 4 7 22".
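The two ways of setting CHROMOSOMES can be compared with a short sketch. It assumes the value is consumed by an unquoted for loop, which is a common bash idiom but is not confirmed by this page; the variables count and selected are illustrative:

```shell
#!/usr/bin/env bash
# A range: seq prints 1..22, one number per line.
CHROMOSOMES="$(seq 1 22)"
count=0
for chr in $CHROMOSOMES; do   # unquoted on purpose, so the value splits on whitespace
  count=$((count + 1))
done
echo "Range covers $count chromosomes"

# An explicit list of chromosomes that do not form a range.
CHROMOSOMES="1 4 7 22"
selected=""
for chr in $CHROMOSOMES; do
  selected="${selected}chr${chr} "
done
echo "Selected: $selected"
```

Both forms end up as a whitespace-separated list, so a per-chromosome loop treats them the same way.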

# REQUIRED: Set data type and input file format:
# Set data type, choose one of these options [genotype/exome/genome]
DATA_TYPE="genotype"

# Set input file format, choose one of these options [IMPUTE2/VCF] # IMPUTE2 must include the required files (hap|allele_probe|info|sample)
FILE_FORMAT="VCF"

# IF you set FILE_FORMAT to IMPUTE2, please set the INFO score cutoff (default: 0.4) and Probability cutoff (default: 0.05)
INFO=0.8
PROB=0.05

# If you set FILE_FORMAT to VCF, have your input data been phased? [yes/no]
PHASE_STATUS="yes"

# REQUIRED: Select the assembly version, choose one of these options [GRCh37/GRCh38]
ASSEMBLY="GRCh37"

# REQUIRED: Set chromosomes range, e.g. CHROMOSOMES="$(seq 1 22)"
CHROMOSOMES="$(seq 1 22)"
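Since several of these settings only accept a fixed set of options, a quick sanity check can catch typos before a long run. The sketch below is not part of LoFTK; is_one_of is a hypothetical helper, and the values mirror the example configuration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the first argument equals any later argument.
is_one_of() {
  value="$1"
  shift
  for option in "$@"; do
    [ "$value" = "$option" ] && return 0
  done
  return 1
}

# Values from the example configuration above.
DATA_TYPE="genotype"
FILE_FORMAT="VCF"
PHASE_STATUS="yes"
ASSEMBLY="GRCh37"

errors=0
is_one_of "$DATA_TYPE" genotype exome genome || { echo "Invalid DATA_TYPE: $DATA_TYPE"; errors=$((errors + 1)); }
is_one_of "$FILE_FORMAT" IMPUTE2 VCF || { echo "Invalid FILE_FORMAT: $FILE_FORMAT"; errors=$((errors + 1)); }
is_one_of "$ASSEMBLY" GRCh37 GRCh38 || { echo "Invalid ASSEMBLY: $ASSEMBLY"; errors=$((errors + 1)); }

# PHASE_STATUS only matters when the input is VCF.
if [ "$FILE_FORMAT" = "VCF" ]; then
  is_one_of "$PHASE_STATUS" yes no || { echo "Invalid PHASE_STATUS: $PHASE_STATUS"; errors=$((errors + 1)); }
fi

echo "$errors invalid setting(s)"
```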
