MonSTR is a toolkit for identifying and analyzing de novo tandem repeat mutations. The core code for calling de novos and parsing VCF files is modified from that originally written by Thomas Willems in the HipSTR repository.
Authors:
- Ileena Mitra [email protected]
- Melissa Gymrek [email protected]
Table of contents:
- Installation instructions
- Basic usage
- Preparing your input VCF file
- Detailed filtering options
- Specifying mutation models
- Output file formats
- Additional scripts
- Citing MonSTR
You can install MonSTR using the following commands:
git clone https://github.com/gymreklab/STRDenovoTools
cd STRDenovoTools
mkdir build
cd build
cmake ..
make
sudo make install
You will need to have a C-compiler and cmake installed. Type MonSTR --help
to check that MonSTR installed successfully.
MonSTR takes as input a VCF file generated by HipSTR or GangSTR and a pedigree file in .fam format describing family relationships. It outputs text files describing mutation probabilities at each locus in each trio.
Important The VCF files must contain genotype likelihoods unless running in naive mode. The VCFs must have been run on all samples in a family jointly so that their genotype likelihood fields contain info for the same set of alleles. See "Preparing your VCF file" below. Sample IDs in the pedigree file must match exactly to those in the VCF header.
Below shows a basic MonSTR command:
MonSTR \
--strvcf test.vcf.gz \
--fam family.fam \
[--gangstr] \
--out test
By default, a HipSTR VCF is expected. If using a GangSTR VCF, use the option --gangstr
. VCFs must be sorted and indexed using tabix. This can be accomplished by:
bgzip file.vcf
tabix -p vcf file.vcf.gz
The following example command can be run using test data available in the TestData
folder of this repository:
MonSTR \
--strvcf TestData/Q3.filtered.vcf.gz \
--fam TestData/TestQuadFam.ped \
--out test \
--gangstr
Type MonSTR --help
, or the documentation below, to see more detailed information about which options are available.
By default, MonSTR computes a joint likelihood of all possible genotype configurations at each locus in trio. To do so, it must have detailed genotype likelihood information specifying the likelihood of each possible diploid genotype, and the set of possible genotypes considered must be identical in each family member. Therefore, when generating your input VCF file with HipSTR or GangSTR, you must:
- Run HipSTR or GangSTR jointly on all family members. If they are genotyped separately, each VCF file might contain different sets of alternate alleles, making GL fields incomparable across files. Note, you may include multiple families in a single run and output VCF file, as long as families are not split across VCFs.
- If using HipSTR, use the
--output-gls
option to include the GL field in the VCF output. This is a standard VCF fielf for specifying genotype likelihoods (see VCF spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf) - If using GangSTR, use the
--include-ggl
option to include the GGL field. This is a modified version of the GL field to accommodate the larger number of alternate alleles considered by GangSTR. It is described on the GangSTR webpage.
The sample IDs in the VCF output file must match those in the .fam file and should not be repeated across families.
Note, GangSTR only estimates the number of repeat units. It will not output sequence differences, or differences from the reference allele consisting of a non-integer number of repeats. HipSTR can detect sequence differences, or non-integer changes in repeat copy number. To collapse alleles or the same length, or round alleles to the nearest repeat copy number, use the following options:
--combine-alleles-by-length
: Collapse alleles of the same length to one.--round-alleles
: Round allele lengths to nearest repeat unit.
MonSTR provides the following filtering options:
General filters:
--min-coverage <INT>
: Discard calls with less than this much coverage, based on the VCF FORMAT:DP field. Default: 0.--min-score <FLOAT>
: Discard calls with less than this score, based on the VCF FORMAT:Q field. Default: 0.--filter-hom
: Filter calls where the child is homozygous for new allele.--require-all-chilren
: Discard loci in family where not all children have calls. For example, if processing a quad family, but only one child has a call, discard the locus.--family <STR>
: Restrict to analyzing this family (based on the family ID in the .fam file).--require-num-children <INT>
: Require family to have this many children. Default: 0.--region <STR>
: Restrict to loci in this region (chrom:start-end).--period <INT>
: Restrict to loci with this repeat unit length.--max-num-alleles <INT>
: Filter loci with more than this many alleles. Default: 25.
HipSTR-specific filters:
--min-span-coverage <INT>
: Discard calls with less than this many spanning reads, based on the VCF FORMAT:MALLREADS field. Default: 0.--min-supp-reads <INT>
: Discard calls with an allele supported by less than this many reads, based on the VCF FORMAT:MALLREADS field. Default: 0.
GangSTR-specific filters:
--min-num-encl-child <INT>
: Require this many enclosing reads supporting de novo allele. Based on the VCF FORMAT:ENCLREADS field. Default: 0.--max-num-encl-parent <INT>
: Discard if more than this many enclosing reads support the de novo allele in parent. Based on the VCF FORMAT:ENCLREADS field. Default: 1000.--max-perc-encl-parent <FLOAT>
: Discard if more than this percentage enclosing reads support de novo allele in parent. Based on the VCF FORMAT:ENCLREADS field. Default: 1.0.--min-encl-match <FLOAT>
: Discard if fewer than this percentage of enclosing reads match the call. Based on the VCF FORMAT:ENCLREADS and FORMAT:REPCN fields. Default: 0.0.--min-total-encl <INT>
: Require this many total enclosing reads in each sample. Based on the VCF FORMAT:ENCLREADS field. Default: 0.
MonSTR will only process a trio of all samples (mother, family, and child) pass all specified filters.
By default, MonSTR uses a statistical model to compute a posterior probabilty of mutation at each TR in each child ("Model-based calling" below). You may also use a naive method ("Naive calling") to identify mutations.
MonSTR computes a a joint likelihood across parent and child genotypes for all possible genotype configurations compatible with either Mendelian inheritance or a mutation occurring. It then uses these likelihoods to compute the posterior probability of a mutation. Besides genotype likelihoods, which are computed by GangSTR or HipSTR, there are three additional components to the mutation model used:
-
Prior genotype probabilities, which give the probability to see each possible genotype in the parents. By default, a uniform prior is used, assuming each genotype has equal probability. This not a very realistic prior, but is simple, and we have found using more complex genotype priors makes little to no difference in the results. To instead base prior genotype probabilities on allele frequencies computed from samples in your VCF, use the option
--use-pop-priors
. This option is only supported for HipSTR input. -
Transition probabilities, which give the probability of mutating from one allele size to another. Here allele size always refers to the number of repeat units. If using GangSTR, or HipSTR sequence-based alleles, MonSTR will assume a uniform probability to mutate from one allele to each other possible alleles. If using HipSTR with the
--combine-alleles-by-length
option, you can additionally specify more complex transition parameters, based on mutation parameters described in Gymrek, et al. 2017.- GeomP describes the parameter for the geometric step size distribution. If this is equal to 1, all steps are a single unit. If it is less than 1, larger step sizes are allowed.
--default-geomp <FLOAT>
can be used to set a global GeomP value (default: 1.0). - Beta describes a length dependent directionality bias. If Beta=0, then there is no bias. If Beta > 0, long alleles (relative to a "central" allele) tend to contract and short alleles tend to expand.
--default-beta <FLOAT>
and--default-central <INT>
can be used to specify a global Beta and central allele value. (defaults Beta=0.0, central allele=0). The central allele is given in terms of the number of repeats relative to the reference genome.
- GeomP describes the parameter for the geometric step size distribution. If this is equal to 1, all steps are a single unit. If it is less than 1, larger step sizes are allowed.
Alternatively you can set per-locus parameters as described below.
- Prior mutation probabilities, which give the per locus per generation mutation rate of a TR. These are used to compute posterior probabilities of mutation. You can set a global log10 mutation rate to be used for all loci with
--defult-prior <FLOAT>
(default -3). Alternatively you can set per-locus parameters as described below.
In addition to providing global mutation parameters, you may also specify a file with per-locus parameters using the option --mutation-models <FILE>
. This file should be whitespace delimited with a single line per locus and no header line. The file should have the following columns present in each line (in order): chromosome, TR start position, TR end position, repeat unit length (bp), log10 mutation rate, beta, geomp, central_allele. The start coordinate must match exactly to the VCF INFO:START field in the HipSTR VCF for each locus.
In addition to model-based mutation calling, MonSTR implements a naive mutation detection algorithm.
-
Specify the
--naive
option to simply check if calls follow Mendelian inheritance. In this case the posterior probabilities for mutations are always set to 1, else 0 for no mutation. -
Specify the option
--naive-expansions-frr <int1,int2>
to implement a naive calling method to identify candidate expansions. This looks for<int1>
FRR (fully repetitive reads) in the child and 0 in the parent. If there are no FRR reads present, look for loci with at least<int2>
flanking reads in the child greater than the largest parent allele. This only works with GangSTR input. To see a full description of FRR and flanking reads, see the GangSTR documentation and publication. This option uses information from the VCF FORMAT:FLNKREADS and FORMAT:RC fields. Posteriors for candidate expansions are set to -1.
If analyzing mutations on the X chromosome, specify the --chrX
option. This assumes the entire VCF is for chromosome X. Sex information must be available for children. You can use either model-based or naive mutation calling on chromosome X.
If using --chrX
, it is recommended that males have haploid calls. If you are using GangSTR, you can use --bam-samps
and --samp-sex
to perform sex-aware calling of sex chromosomes.
MonSTR outputs files <OUTPREFIX>.all_mutations.tab
and <OUTPREFIX>.locus_summary.tab
(<OUTPREFIX>
is specified by the --out
option). The columns in these files are described below.
The following options also control which loci/mutations are output:
--posterior-threshold <FLOAT>
: Posterior probability threshold to call a mutation. Default: 0.9.--include-invariant
: Output info for loci even if there are no alternate alleles present. By default, invariant loci are excluded.--output-all-loci
: Output all calls to the all_mutations output file, regardless of posterior score. This can be useful if you want to analyze properties as a function of posterior threshold, or want to decide later on an appropriate posterior threshold.
This file contains one line per TR per trio. It contains the following columns:
Column header | Description |
---|---|
chrom | The chromosome of the TR |
pos | The postion of the TR |
period | The length of the repeat unit (in bp) |
prior | The prior probability of mutation |
family | The family ID of the trio (taken from the .fam file) |
child | The sample ID of the child in the trio |
phenotype | The phenotype of the child, taken from the .fam file |
posterior | The posterior probability of mutation in this child at this TR locus. If using naive calling, this is always set to 1 for mutations, else 0. If using naive expansion detection, this is set to -1 for candidate expansions. |
newallele | The allele arising from a de novo mutation (number of repeat units). This field is only meaningful if there was a mutation inferred. |
mutsize | The size of the mutation (number of repeat units). This field is only meaningful if there was a mutation inferred. |
inparents | Set to 1 if the new allele is found in the parents, else 0. This field is only meaningful if there was a mutation inferred. |
poocase | Gives the inferred parent of origin of the mutation, if a mutation was inferred. 0: unknown; 1: Mendelian (no denovo); 2: New allele from father; 3: New allele from mother; 4: Unclear |
isnew | Set to 1 if the new alleles is not found in any control samples in the input VCF. Otherwise set to 0. This field is only meaningful if there was a mutation inferred. |
case_count | Count of the new allele in cases (phenotype=2). This field is only meaningful if there was a mutation inferred. |
ctrl_count | Count of the new allele in controls (phenotype=1). This field is only meaningful if there was a mutation inferred. |
unk_count | Count of the new allele in samples with unknown phenotype (phenotype=0). This field is only meaningful if there was a mutation inferred. |
child_gt | Genotype of the child sample. |
mat_gt | Genotype of the mother. |
pat_gt | Genotype of the father. |
encl_child | Number of enclosing reads of the new allele in the child. This field is only meaningful if there was a mutation inferred. |
encl_mother | Number of enclosing reads of the new allele in the mother. This field is only meaningful if there was a mutation inferred. |
encl_father | Number of enclosing reads of the new allele in the father. This field is only meaningful if there was a mutation inferred. |
encl_parent | Number of encloseing reads of the new allele in the parent the mutation came from (see poocase). |
long_mother | Indicates whether the longer vs. shorter allele was transmitted from the mother. This is only relevant if the mother is heterozygous. 0=unknown; 1=shorter; 2=longer. |
long_father | Indicates whether the longer vs. shorter allele was transmitted from the father. This is only relevant if the father is heterozygous. 0=unknown; 1=shorter; 2=longer. |
This file contains a single line per TR. It contains the following columns:
Column header | Description |
---|---|
chrom | The chromosome of the TR |
pos | The postion of the TR |
period | The length of the repeat unit (in bp) |
ref_allele_size | The number of repeats in the reference genome |
num_alleles_bylength | The number of possible alleles at the locus, collapsing all alleles with the same length. |
num_alleles_byseq | The number of possible alleles at the locus. For GangSTR, this will always be the same as num_alleles_by_length since it only considers length differences. |
het_by_seq | Heterozygosity of the locus |
total_children | The total number of children (trios) analyzed at the locus. |
total_mutations | The total number of mutations called at the locus. |
total_mutation_rate | The estimated mutation rate at the locus, based on the number of observed mutations and transmissions. |
affected_children | The total number of affected children (trios) analyzed at the locus. |
affected_mutations | The total number of mutations called at the locus in affected samples. |
affected_new_mutations | The total number of mutations called at the locus in affected samples resulting in alleles unobserved in the rest of the cohort. |
affected_mutation_rate | The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only affected samples. |
unaffected_children | The total number of unaffected children (trios) analyzed at the locus. |
unaffected_mutations | The total number of mutations called at the locus in unaffected samples. |
unaffected_new_mutations | The total number of mutations called at the locus in unaffected samples resulting in alleles unobserved in the rest of the cohort. |
unaffected_mutation_rate | The estimated mutation rate at the locus, based on the number of observed mutations and transmissions and considering only unaffected samples. |
p-value | P-value based on a Fisher's exact test testing if mutations are enriched in affecteds vs. unaffecteds. |
children_with_mutations | Comma-separated list of children with mutations. Format of each child is famid:phenotype:newallele:controlcount,casecount,unknowncount. |
aff_tdt | Keep track of how often the long vs. short allele were transmitted from mother and father in affected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file. |
unaff_tdt | Keep track of how often the long vs. short allele were transmitted from mother and father in unaffected children. Format is: counts for 0=unknown/1=short/2=long mother:father. This summarizes counts from long_mother and long_father in the all mutations file. |
See scripts/
for additional scripts:
qc_denovos.py
is a script for further filtering the output of<OUTPREFIX>.all_mutations.tab
, for example to remove loci, families or children with an outlier number of total mutations. Typepython3 scripts/qc_denovos.py
to see full usage information.
If you use this method for your research, please cite: Mitra, et al. 2020 https://www.biorxiv.org/content/10.1101/2020.03.04.974170v1