Accurate Bayesian reconstruction of cancer phylogenies from bulk sequencing. An implementation of the forest structured Chinese restaurant process with a Dirichlet prior on the node parameters.
PhyClone is currently in development so the following procedure has a few steps.
-
Ensure you have a working
conda
or (preferably)mamba
installation. You can do this by installing Miniforge -
Install the required dependencies using mamba/conda. We will create a new
conda
environment with the dependencies. Download the environment.yaml file, and navigate into its save location. Run the following command:
mamba env create --file environment.yaml
- Activate the
conda
environment.
mamba activate phyclone
Note
You will have to do this whenever you open a new terminal and want to run PhyClone.
- Install PhyClone
pip install git+https://github.com/Roth-Lab/phyclone.git
Or, SSH command:
pip install git+ssh://[email protected]/Roth-Lab/phyclone.git
- If everything worked PhyClone should be available on the command line.
phyclone --help
PhyClone analysis has two possible input files:
- main input file (Required)
- cluster file
Caution
While PhyClone analysis can be minimally run with only the main input.tsv file, it is strongly recommended to provide a cluster file as well.
To run a PhyClone analysis you will need to prepare an input file. The file should be in tab delimited tidy data frame format and have the following columns.
Tip
There is an example file in examples/data/mixing.tsv
- mutation_id - Unique identifier for the mutation. This is free form but should match across all samples.
Warning
PhyClone will remove any mutations without entries for all detected samples. If you have mutations with no data in some samples set their counts to 0.
-
sample_id - Unique identifier for the sample.
-
ref_counts - Number of reads matching the reference allele.
-
alt_counts - Number of reads matching the alternate allele.
-
major_cn - Major copy number of segment overlapping mutation.
-
minor_cn - Minor copy number of segment overlapping mutation.
-
normal_cn - Total copy number of segment in healthy tissue. For autosome this will be two and male sex chromosome one.
You can include the following optional columns.
- tumour_content - The tumour content (cellularity) of the sample. Default value is 1.0 if column is not present.
Note
In principle this could be different for each mutation/sample. However, in most cases it should be the same for all mutations in a sample.
- error_rate - Sequencing error rate. Default value is 0.001 if column is not present.
Important
Though not strictly required to run PhyClone, this file is strongly recommended.
Tip
While any mutation pre-clustering method can be used, we recommend PyClone-VI. Both due to its established strong performance, and its output format which can be fed directly into PhyClone 'as-is'.
The file should be in tab delimited tidy data frame format and have the following columns.
-
mutation_id - Unique identifier for the mutation.
This is free form but should match across all samples and must match the identifiers provided in the main input file.
-
sample_id - Unique identifier for the sample.
-
cluster_id - Cluster that the mutation has been assigned to.
You can include the following optional column:
-
outlier_prob - (Prior) probability that the cluster/mutation is an outlier.
Default value is made equal to the current analysis
--outlier-prob
option if column is not present.
Note
The --outlier-prob
option carries a default value of 0.0
Tip
There is an example file in examples/data/mixing_clusters.tsv
PhyClone analyses are broken into two parts.
First, sampling is performed using the run
sub-command.
Second, a MAP tree is constructed from the trace using the map
sub-command.
Sampling can be run as follows
phyclone run -i INPUT.tsv -c CLUSTERS.tsv -o TRACE.pkl
which will take the INPUT.tsv
and (optionally) the CLUSTERS.tsv
file, as described above and write the trace file TRACE.pkl
in a Python pickle format.
The -n
command can be used to control the number of iterations of sampling to perform.
The -b
command can be used to control the number of burnin iterations to perform.
Note
Burnin is done using a heuristic strategy of unconditional SMC. All samples from the burnin are discarded as they will not target the posterior.
The -d
command can be used to select the emission density.
As in PyClone the binomial
and beta-binomial
densities are available.
Important
Unlike PyClone, PhyClone does not estimate the precision parameter of the Beta-Binomial. This parameter can be set with the --precision flag.
To build the final results of PhyClone you can run the map
command as follows.
phyclone map -i TRACE.pkl -t TREE.nwk -o TABLE.tsv
where TRACE.pkl
is the result from the previous step, TREE.nwk
is the output clone tree in newick format, and TABLE.tsv
the assignment of mutations to clones.
PhyClone is licensed under the GPL v3, see the LICENSE.txt file for details.