__ _.._
.-'__`-._.'.--.'.__.,
/--' '-._.' '-._./
/__.--._.--._.'``-.__/
'._.-'-._.-._.-''-..'
- Dhar A, Ralph DK, Minin VN, Matsen FA IV (2020) A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis. PLoS Comput Biol 16(8): e1008030 (https://doi.org/10.1371/journal.pcbi.1008030).
Linearham's dependencies are described in the Dockerfile.
We recommend that you run linearham inside a Docker container, since this will make installation much easier (if you're new to Docker, read this).
However, you can also install the dependencies by hand, in which case you should clone the repository and run each command in the Dockerfile that's on a line starting with RUN
(treat WORKDIR
as cd
).
The more similar your system is to that described by the Dockerfile's FROM
line (at the moment, debian), the easier this will be.
It's best to start by running an interactive session in the container:
docker run -it quay.io/matsengrp/linearham /bin/bash
.
You can also create a shell script with your linearham commands for docker to run, and put it in place of /bin/bash
(use test.sh
as an example).
If you want your run to be reproducible, choose a tag of the form v<stuff>
from quay.io, then specify it like so: quay.io/matsengrp/linearham:v<stuff>
.
To access your own data from within the container, or to persist your output beyond the container, you must use volumes by specifying -v
.
We recommend using this convention (other paths may not work):
- Choose a directory outside of Docker that contains both your input data directory and your desired output directory
- Mount it as a volume to
/linearham/work
inside the container:-v /your/local/path:/linearham/work
- All commands inside the container referencing paths inside that directory should do so via
/linearham/work
, e.g. setting--outdir=/linearham/work/output
inside the container will persist your output outside the container in/your/local/path/output
.
Note that because Docker must run as root, this means that you will be writing to the directory on your host machine as root, so a) be very careful and b) don't choose anything anywhere near /
(something like /home/user/several/sub/dirs
is good).
All linearham actions are run using scons in the main linearham directory.
Available actions are --run-partis
, --run-linearham
, and --build-partis-linearham
.
Note that because of the way scons parses arguments, you must always use an =
sign in all args: --arg=val
.
For the same reason, you also have to spell args exactly right, e.g. writing --arg-nam
instead of --arg-name
will silently ignore it.
The input for Linearham is a partis output file.
If you've already run partis to create this file, you only need to run --run-linearham
; if not, you can have linearham run partis for you with --run-partis
.
This runs partis on an input sequence file:
scons --run-partis --fasta-path=<file> --locus={igh|igk|igl} --outdir=<dir> [--parameter-dir=<dir>]
--fasta-path
can be any file type that partis --infname
handles (see partis help).
Partis uses a directory with fitted sample-specific parameters (--parameter-dir), which if not already present will be automatically inferred based on the sequences in the fasta file.
These parameters are more accurate if inferred on the entire repertoire of many clonal families; thus because linearham runs on only a single family at a time it is better if you can use parameters cached in a previous partis run on the entire repertoire, and then pass them to linearham with --parameter-dir
.
However, if you don't, the automatically-inferred parameters will still work fine, they'll just be somewhat less accurate (since they'll only be based on the one family).
Other partis-related arguments:
option | description |
---|---|
--all-clonal-seqs |
If set, attempts to force all sequences in the fasta file into the same clonal family; otherwise it runs partis partition to infer the clonal families. "Attempts" means everything will end up together that doesn't have, say, different cdr3 lengths or wildly different naive sequences. |
--locus |
Which immunoglobulin locus (defaults to igh )? |
--outdir |
The output directory (defaults to output ). |
Once you have a partis output file, whether you made it separately or with linearham, you can run linearham itself:
scons --run-linearham --outdir=<dir> [--partis-yaml-file=<file>] [--parameter-dir=<dir>]
If there is one clonal family (i.e. cluster) in the partis output file, linearham will run on that.
If there is more than one, you'll have to select a cluster to run on using several options.
In such a case you'll likely first want to run linearham with no cluster selection, in which case it'll print a list of the available clusters and exit (see also parse_cluster.py
below).
Partis performs clustering hierarchically, so its output stores a list of partitions, where each partition divides the sequences in the repertoire into clonal families (clusters).
By default, linearham looks in the best (most likely) of these partitions, but you can specify the (zero-based) index of a different one with --partition-index
.
Within a partition, you can specify a cluster either by (zero-based) index with --cluster-index
, or with the unique id of a particular sequence in the cluster with --cluster-seed-unique-id
(see partis --seed-unique-id for more info).
Options to specify the cluster on which to run:
option | description |
---|---|
--partition-index |
zero-based index of partition from which to select the cluster on which to run (defaults to most likely partition) |
--cluster-index |
zero-based index of cluster on which to run in the selected partition |
--cluster-seed-unique-id |
choose the cluster in which the sequence with this name is found |
--lineage-unique-ids |
same as --cluster-seed-unique-id, but also goes on to perform detailed lineage/mutation analysis. (see also below) |
Other options:
option | description |
---|---|
--partis-yaml-file |
Path to the partis output file that is linearham's input. Defaults to the location in --outdir to which the linearham --run-partis action will have written it (if it ran) |
--outdir |
The output directory (defaults to output ). |
--parameter-dir |
Directory from which linearham reads partis hmm files. If not set, it defaults to the location in --outdir used by --run-partis . As for --run-partis (above), parameters will be much more accurate if you cache them with partis beforehand on the entire repertoire, but if this isn't possible they'll be inferred automatically on the one family on which you're running linearham, which should be fine. |
If you don't have a clonal family/cluster of interest, or are not sure how to identify it using these options, you can run scripts/parse_cluster.py to work it out.
For example, running:
./scripts/parse_cluster.py lib/partis/test/reference-results/partition-new-simu.yaml --fasta-output-file parsed_cluster.fa --yaml-output-file parsed_cluster.yaml | less -RS
will print a table of available clusters in the best partition similar to this:
available clusters in partition at index 29 (best):
index size unique_ids
0 71 [...]
1 11 [...]
2 262 [...]
3 4 [...]
Using the indices from this table, you can specify the corresponding clusters to Linearham. Running on the cluster with 262 sequences from the above table would look like:
scons --run-linearham --cluster-index=2 <args.. >
You can also figure out which sequences are in which clusters with the partis view-output
action piped to less -RS
.
Other linearham-related arguments:
option | list? | description |
---|---|---|
--template-path |
no | The RevBayes template path (defaults to templates/revbayes_template.rev). |
--mcmc-iter |
yes | How many RevBayes MCMC iterations should we use (defaults to 10000)? |
--mcmc-thin |
yes | What RevBayes MCMC thinning frequency should we use (defaults to 10)? |
--tune-iter |
yes | How many RevBayes tuning iterations should we use (defaults to 5000)? |
--tune-thin |
yes | What RevBayes tuning thinning frequency should we use (defaults to 100)? |
--num-rates |
yes | The number of gamma rate categories (defaults to 4). |
--burnin-frac |
yes | What fraction of MCMC burnin should we use (defaults to 0.1)? |
--subsamp-frac |
yes | What bootstrap sampling fraction should we use (defaults to 0.05)? |
--rng-seed |
yes | The random number generator (RNG) seed (defaults to 0). |
--asr-pfilters |
no | The ancestral sequence posterior probability thresholds (defaults to 0.1). |
--no-nestly-subdirs |
no | if set, all output files are written directly to --outdir, rather than to a nested series of subdirs. Useful if you'd rather handle directory structure with the code that's calling linearham, and/or you don't plan to run many different combinations of mcmc parameters. |
For the arguments that can be specified as a (,
-separated) list (see middle column), linearham will run revbayes separately, writing to separate nested output directories, for all combinations of all such parameters.
For more information on these arguments, run scons --help
.
This compiles linearham, partis, and other dependencies. You'll only need to run this if you've either modified some source code or you're installing without docker.
Running linearham consists of a series of steps, whose precedence and running is handled by scons. See also below for more detail on the various inputs and outputs of each step.
step | command | description |
---|---|---|
get linearham info | lib/partis/bin/partis get-linearham-info |
reformat the information in all annotations in the partis output file for use by subsequent linearham steps, writes to partis_run.yaml |
select single cluster | scripts/parse_cluster.py |
pull annotation for single specified cluster out of partis_run.yaml , and write it to cluster.yaml and its sequences to cluster_seqs.fasta |
make revbayes input | scripts/generate_revbayes_rev_file.py |
use seqs in cluster_seqs.fasta and template revbayes config templates/revbayes_template.rev to write revbayes config for this run to revbayes_run.rev |
run revbayes | lib/revbayes/projects/cmake/rb |
run revbayes with config file revbayes_run.rev , writing output to revbayes_run.stdout.log . This step is usually by far the slowest; you can adjust e.g. the mcmc options above to trade off speed for confidence/accuracy. |
run phylo hmm | _build/linearham/linearham --pipeline |
run actual linearham phylo hmm, using cluster.yaml , <--parameter-dir> , and revbayes_run.trees to write lh_revbayes_run.trees |
collect run statistics | scripts/run_bootstrap_asr_ess.R |
collects info from lh_revbayes_run.trees and cluster_seqs.fasta to write three output files: linearham_run.{trees,log,ess} |
calculate naive seq stats | scripts/tabulate_naive_probs.py |
collect info from linearham_run.trees to write aa_naive_seqs.{png,fasta,dnamap} |
calculate lineage info | scripts/tabulate_lineage_probs.py |
collect info from linearham_run.trees and aa_naive_seqs.fasta to write lineage summary info to aa_lineage_seqs.{pfilter0.1.dot,fasta,dnamap,pfilter0.1.png} (only run if --lineage-unique-ids is set) |
write git version info | git rev-parse |
write commit/tag info to enable reproducibility |
write final annotations | scripts/write_lh_annotations.py |
Use original partis annotation in cluster.yaml and linearham stats in linearham_run.log to write final linearham annotations to linearham_annotations_{best,all}.yaml |
Most of the output files you're likely to need are by default in the mcmc*/burninfrac*/
subdir of --outdir
:
e.g. burninfrac0.1_subsampfrac0.05/
file | format | description |
---|---|---|
linearham_run.log | tsv | posterior samples of annotations/naive sequences and parameters for the phylogenetic substitution and rate variation models |
linearham_run.trees | newick | posterior tree samples with ancestral sequence annotations (formatted for use by Dendropy) |
linearham_run.ess | tsv | approximate effective sample sizes for each field in linearham_run.log |
aa_naive_seqs.fasta | fasta | each sampled naive amino acid sequence and its associated posterior probability |
aa_naive_seqs.dnamap | fasta (ish) | map from each sampled naive amino acid sequence to its corresponding set of nucleotide naive sequences and posterior probabilities |
aa_naive_seqs.png | png | logo plot of naive amino acid sequence posterior probability using WebLogo to visualize per-site uncertainties |
linearham_annotations_all.yaml | yaml | annotations corresponding to the posterior tree samples (collapsing unique annotations, and with posterior probabilites set in 'logprob' key) |
linearham_annotations_best.yaml | yaml | most likely annotation, i.e. the one that corresponded to the largest number of posterior tree samples |
If --no-nestly-subdirs
is set, instead of the cluster-*/mcmc*/burninfrac*/lineage_*
subdirs, all files are written to the top-level dir (i.e. the calling program must specify a different dir in order to run with different parameters).
Every posterior tree sample corresponds to one sampled annotation; however before writing to linearham_annotations_all.yaml
, duplicate annotations are collapsed.
Each resulting unique annotation is assigned a probability proportional to the number of times it was sampled.
These unique annotations are sorted by the resulting new 'logprob' key (in descending order) and written to linearham_annotations_all.yaml
.
Every sampled tree that contributed to each unique annotation is also added to that annotation (as a list in annotation['tree-info']['linearham']['trees']
).
If --lineage-unique-ids
is specified, there will also be additional lineage-specific output files, by default in subdirectories like lineage_<uid>/
:
file | format | description |
---|---|---|
aa_lineage_seqs.fasta | fasta | for each intermediate ancestor in the lineage of the sequence with the specified id, the sampled amino acid sequence and its associated posterior probability |
aa_lineage_seqs.dnamap | fasta(ish) | for each intermediate ancestor of the lineage of the sequence with the specified id, map from sampled amino acid sequence to its corresponding set of nucleotide sequences and posterior probabilities |
aa_lineage_seqs.pfilterX.svg | svg | posterior probability lineage graphic made with Graphviz, where X is the posterior probability cutoff for the sampled sequences (see details below). |
The posterior probability lineage plot (aa_lineage_seqs.pfilterX.svg
) summarizes all the inferred ancestral sequences (and transitions among them) that were observed in the sampled trees.
Each node represents an amino acid sequence: either inferred ("naive" if it was a naive sequence in any tree, otherwise "inter") or the seed sequence.
The node's percent label (which also determines the color weight) is the fraction of trees in which it was observed, whether as naive or intermediate (so note that this can be larger than the probability in a naive sequence's name in the fasta file above, since in this plot we include also instances where it was an intermediate).
Edges are labelled with the mutations separating their two nodes, and with the percent of transitions (in all trees) from the parent node that connect to the child node (which also determines the color weight).
Edges with probability below are not plotted, so if you want to see more detail you should decrease this (note that this means plotted numbers generally don't add exactly to 100).
Most of the rest of the files in --outdir
are just used to pass information among various linearham steps.
By default, in the top-level dir are:
- the partis output file that was used as input for linearham (e.g.
partis_run.yaml
), which contains partis-inferred clonal families (clusters) and annotations (including inferred naive sequence) - sub dirs for each cluster on which linearham was run, with name of form
cluster-N/
for the cluster indexN
Within each cluster's subdir cluster-N/
are:
- a fasta file
cluster-N/cluster_seqs.fasta
with each of the cluster's input sequences, as well as itspartis
-inferred naive sequence- Note that the input sequences have SHM indels "reversed" (reverted to their state in the naive rearrangement), and non-variable regions (V/J framework) trimmed off.
- This is equivalent to assuming that all shm indels occured at the tips of the tree, which is often not a good assumption, but standard phylogenetic approaches do not handle indels, so if you care about indels you'll need to handle them separately/by hand.
- a partis yaml output file
cluster.yaml
resulting from pulling just this cluster out of the original partis output file that was used as input - the linearham output dir
cluster-N/mcmciter<stuff>
where<stuff>
records the exact options of the revbayes run e.g.mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/
e.g. mcmciter10000_mcmcthin10_tuneiter5000_tunethin100_numrates4_rngseed0/
:
file | format | description |
---|---|---|
revbayes_run.trees | tsv | results from tree sampling iterations, including phylogenetic substitution model and rate variation parameters and Newick trees |
revbayes_run.log | tsv | results from tree sampling iterations but branch lengths are listed in tabular form rather than in a Newick tree |
revbayes_run.rev | Rev | generated RevBayes script for tree sampling |
revbayes_run.stdout.log | txt | stdout log of RevBayes tree sampling run |
lh_revbayes_run.trees | tsv | results from tree sampling iterations plus V(D)J recombination information and contribution of each tree to posterior estimation |
burninfrac<stuff> |
dir | subdirectory for results from the run_bootstrap_asr_ess.R step onwards, i.e. final results |
One way to visualize the various output naive sequences and their probabilities is with lib/partis/bin/cf-linearham.py
, which takes as input a linearham output dir and a partis output file (the latter preferably created with the --calculate-alternative-annotations
option set).
It then prints an ascii-art comparison of the amino acid and nucleotide naive sequences, as well as (for partis) a rundown of the alternative gene calls and their probabilities (the most likely of which was presumably input to linearham).
More info here.