FeaturePred is a tool designed to simplify the process of predicting features (e.g. gene expression) in a target sample. It is designed to work with FUSION formated SNP-weights files and PLINK formatted genotype data. It uses a FUSION released script to convert the weights files into a PLINK .SCORE file, which is then used to predict the feature in a target sample using PLINK. The script harmonises the target sample to the reference data automatically and efficiently handles very large target sample datasets.
Access to weights files and more information on FUSION can be found here.
- R and the required packages:
install.packages(c('data.table','optparse','foreach','doMC'))
- FUSION software:
git clone https://github.com/gusevlab/fusion_twas.git
-
FUSION LD reference data (download)
-
pigz software for parallel gz compression
-
Target sample genetic data
- Binary PLINK format (.bed/.bim/.fam)
- RSIDs should match the FUSION LD reference data (1000 Genomes phase 3)
-
FUSION formatted SNP-weights
- Can be downloaded from the FUSION website
- To make your own FUSION format SNP-weights, see the FUSION website and a pipeline that I wrote here
Flag | Description | Default |
---|---|---|
--PLINK_prefix | Path to genome-wide PLINK binaries (.bed/.bim/.fam) [required] | NA |
--PLINK_prefix_chr | Path to per chromosome PLINK binaries (.bed/.bim/.fam) [required] | NA |
--weights | Path for .pos file describing features [required] | NA |
--weights_dir | Directory containing the weights listed in the .pos file [required] | NA |
--ref_ld_chr | Path to FUSION 1KG reference [required] | NA |
--score_files | Path to SCORE files corresponding to weights [optional] | NA |
--ref_expr | Path to reference expression data [optional] | NA |
--n_cores | Specify the number of cores available for parallel computing [optional] | 1 |
--memory | RAM available in MB [optional] | 2000 |
--plink | Path to PLINK software [required] | NA |
--save_score | Specify as T if temporary .SCORE files should kept [optional] | TRUE |
--save_ref_expr | Save reference expression data [optional] | TRUE |
--output | Name of output directory | NA |
--pigz | Path to pigz binary [required] | NA |
--targ_pred | Set to FALSE to create SCORE file and expression reference only [optional] | TRUE |
--ref_maf | Path to per chromosome PLINK freq files [required] | NA |
--chr | Specify chromosome number [optional] | NA |
In the specified output directory, the following files will be produced:
Name | Description |
---|---|
FeaturePredictions_<PANEL>_chr<chr>.txt.gz .csv | Space delimited file containing FID, IID, and the predicted values for each feature. |
FeaturePredictions.log | Log file. |
SCORE_failed.txt | Text file listing weights which couldn't be converted to a .SCORE file (if any). |
Prediction_failed.txt | Text file listing features that couldn't be predicted (if any). |
These examples use the weights and .pos file provided here.
Rscript FeaturePred.V2.0.R \
--PLINK_prefix_chr FUSION/LDREF/1000G.EUR. \
--weights test_data/CMC.BRAIN.RNASEQ/CMC.BRAIN.RNASEQ.pos \
--weights_dir test_data/CMC.BRAIN.RNASEQ \
--ref_ld_chr FUSION/LDREF/1000G.EUR. \
--plink plink \
--ref_maf FUSION/LDREF/1000G.EUR. \
--pigz pigz \
--chr 1 \
--output demo
sbatch -p shared -n 6 --mem 50G Rscript FeaturePred.V2.0.R \
--PLINK_prefix_chr FUSION/LDREF/1000G.EUR. \
--weights test_data/CMC.BRAIN.RNASEQ/CMC.BRAIN.RNASEQ.pos \
--weights_dir test_data/CMC.BRAIN.RNASEQ \
--ref_ld_chr FUSION/LDREF/1000G.EUR. \
--plink plink \
--ref_maf FUSION/LDREF/1000G.EUR. \
--pigz pigz \
--chr 1 \
--output demo \
--n_cores 6 \
--memory 50000
This script was written by Dr Oliver Pain ([email protected])
If you have any questions or comments use the google group.