Skip to content

Variant effect prediction

Anusri Pampari edited this page Jul 3, 2023 · 8 revisions

Note: chrombpnet snp_score command will be deprecated in future versions, please refer to https://github.com/kundajelab/variant-scorer for the updated tool.

The following assumptions are made with this script - make changes accordingly if the assumptions dont hold.

  • The script is designed to work only with SNPs - so make sure reference and alternate allele are just one character.
  • The script returns the following three effect scores -
    • log counts difference in alternate allele and reference allele predictions (log_counts_diff)
    • Sum of absolute difference in log probabilities per base (log_probs_diff_abs_sum) between alternate allele and reference allele predictions
    • Jensenshannon distance between alternate allele profile probability predictions and reference allele profile probability predictions (probs_jsd_diff).
  • Carefully read the input format of SNP_DATA below.
  • The input sequence length (inputlen) is inferred from the model. In chromBPNet models the inputlen used is an even number to ensure symmetry. So here we insert the allele at inputlen//2 locus (assuming 0-based indexing of sequence [0,inputlen)) - which means that the sequence left of allele is 1bp longer than the sequence right of allele.
  • If the reference/alternate allele are at the edge of chromosome - preventing us from generating an inputlen sequence - we will ignore this SNP and print the locus information of the ignored SNP. This might result in the final number of output SNPs being predicted on being smaller than the given input SNPs. Read the output format section to see how this effects the final output.

Usage

chrombpnet snp_score [-h] -snps SNP_DATA -m MODEL_H5 -g GENOME -op OUTPUT_PREFIX [-bs BATCH_SIZE] [-dm DEBUG_MODE_ON]

Input Format

required arguments:
  -snps SNP_DATA, --snp-data SNP_DATA
                        Path to a tsv output with the following information in columns - chr, position to insert allele (0-based), ref
                        allele, alt allele
  -m MODEL_H5, --model-h5 MODEL_H5
                        Path model .h5 file
  -g GENOME, --genome GENOME
                        Genome fasta
  -op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Output prefix for bigwig files

optional arguments:
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size to use for prediction
  -dm DEBUG_MODE_ON, --debug-mode-on DEBUG_MODE_ON
                        Use this mode to print the flanks of first five SNP insert locations
  • SNP_DATA: A TSV file with the following 5 columns - chr, position (0-based) to insert the allele, reference allele, alternate alllele, meta-data. You can leave the meta-data empty too.
    • Meta-data column can be used to provide the following information such as p-value significance of the SNP, observed effect scores etc. This information will be added as column information to the output (see below) which you can use for downtream analysis. Provide any SNP related information to this column as comma separated values.
    • The reference genome loader used is pyfaidx which is 0-based and hence the allele position provided in snp_data_tsv is expected to be 0-based too. Always check if you are inserting the allele correctly by using the code in the debug_mode_on.
    • Make sure there are no duplicate rows in this file.
    • Example SNP_DATA with 7 snp insertions looks as follows. The meta-data column provides observed effect scores comma separated with ground truth labels.
chr17   18967175        A       G       1.9114519946647748,1
chr4    176935912       C       A       3.2978830709253413,1
chr1    144534082       C       T       -2.6229086754131736,1
chr17   19015380        T       A       5.703283573989607,1
chr1    191859377       A       G       0.021914150011206716,0
chr1    157776262       T       G       0.007600049165843743,0
chr1    179303792       G       A       0.0013620075140122916,0
  • DEBUG_MODE_ON: Takes a value of 1 or 0. This is by default set to 0. When set to 1 we score only the first 5 SNPs in SNP_DATA. In addition we also print to the console the right and left flank of the locus where the SNP is being inserted in. You can check if these flanks match with the flanks as reported in existing databases such as dbSNP (https://www.ncbi.nlm.nih.gov/snp/). If it does not your position values in SNP_DATA will need correction.

Output Format

  • output_prefix_snp_scores.tsv: A TSV file with 8 columns - five columns copied in from the snp_data_tsv along with with the following three added columns - log_counts_diff, log_probs_diff_abs_sum, probs_jsd_diff which represent the following -
    • log_counts_diff: log counts difference in alternate allele and reference allele predictions
    • log_probs_diff_abs_sum: Sum of absolute difference in log probabilites per base between alternate allele and reference allele predictions
    • probs_jsd_diff: Jensenshannon distance between alternate allele profile probability predictions and reference allele profile probability predictions. The number of rows in this TSV file can be less than rows provided in the snp_data_tsv - this is because we are skipping reference/alternate allele that fall at the edge of chromosome preventing us from generating an inputlen sequence.
  • output_prefix_predictions_at_snp.pkl: A pickle file containing a dictionary with the following keys - rsids, ref_logcount_preds, alt_logcount_preds, ref_prob_preds, alt_prob_preds. This pickle stores the model predictions at each of the SNP locations.
    • rsids consists of a list of strings - each value formed by concatenating the following 5 values (Chr, position (0-based) to insert the allele, reference allele, alternate alllele, meta-data) seperated by a underscore. An example rsid will look like this - chr1_100054_A_G_0.55 when meta data is provided, when meta-data is empty it looks like this - chr1_100054_A_G_
    • ref_logcount_preds: Consists of the log count predictions when the reference allele is inserted. Has the same length as the rsids
    • ref_logcount_preds: Consists of the log count predictions when the alternate allele is inserted. Has the same length as the rsids
    • ref_prob_preds: Consists of the profile probability predictions when the reference allele is inserted. Has the same length as the rsids
    • alt_prob_preds: Consists of the profile probability predictions when the alternate allele is inserted. Has the same length as the rsids