Fast visualisation of the population structure of pathogens using Stochastic Cluster Embedding.
Paper:
Lees JA, Tonkin-Hill G, Yang Z, Corander J. Mandrake: visualizing microbial population structure by embedding millions of genomes into a low-dimensional representation. Philosophical Transactions of The Royal Society B. 2022;377: 20210237.
https://doi.org/10.1098/rstb.2021.0237
Documentation available at: https://mandrake.readthedocs.io/en/latest/
See https://mandrake.readthedocs.io/en/latest/installation.html for more details on installation.
- Install miniconda.
- Run
conda create -n mandrake_env mandrake
to install into a clean environment.
- Run
conda activate mandrake_env
to use the environment.
Refer to the conda-forge documentation if you want to install a CUDA (GPU) enabled version.
You will need some dependencies, which you can install through conda:
conda create -n mandrake_env python
conda env update -n mandrake_env --file environment.yml
conda activate mandrake_env
You can then clone this repository, and run:
python setup.py install
To compile the GPU code you will need the CUDA toolkit installed.
If you have the ability to compile CUDA (e.g. nvcc is available) you should see the message:
CUDA found, compiling both GPU and CPU code
otherwise only the CPU version will be compiled:
CUDA not found, compiling CPU code only
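Before building from source, it can help to check whether a CUDA compiler is visible on your PATH. The following is a minimal Python sketch and not the check performed by setup.py, whose detection logic is not shown here. The helper name cuda_compiler_available is hypothetical, and the only assumption is that the compiler executable is called nvcc.

```python
import shutil
from typing import Optional


def cuda_compiler_available(path: Optional[str] = None) -> bool:
    """Return True if an executable named nvcc can be found.

    `path` is a search path string in the same format as the PATH
    environment variable; None means the current PATH is used.
    Hypothetical helper for illustration, not part of mandrake.
    """
    return shutil.which("nvcc", path=path) is not None


if __name__ == "__main__":
    if cuda_compiler_available():
        print("nvcc found: the build may be able to compile the GPU code")
    else:
        print("nvcc not found: expect a CPU-only build")
```

Finding nvcc does not guarantee that the GPU build succeeds; it only indicates which of the two messages you are more likely to see.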
After installing, an example command would look like this:
mandrake --sketches sketchlib.h5 --kNN 500 --cpus 4 --maxIter 1000000
This would use a file sketchlib.h5 created by pp-sketchlib to calculate accessory distances using 500 nearest neighbours.
Output can be found in numerous files prefixed mandrake.embedding*.
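If you drive mandrake from a Python script, the example command can be assembled as an argument list and the output files collected afterwards. This is a sketch under assumptions: the helper names build_mandrake_command and find_embedding_outputs are hypothetical, only the flags shown in the example command are used, and the output prefix mandrake.embedding is taken from the description above.

```python
import glob
import os
import shutil
import subprocess
from typing import List


def build_mandrake_command(sketches: str, knn: int = 500, cpus: int = 4,
                           max_iter: int = 1000000) -> List[str]:
    """Assemble the example command line as a list of arguments."""
    return [
        "mandrake",
        "--sketches", sketches,
        "--kNN", str(knn),
        "--cpus", str(cpus),
        "--maxIter", str(max_iter),
    ]


def find_embedding_outputs(directory: str = ".") -> List[str]:
    """List files in `directory` whose names start with mandrake.embedding."""
    pattern = os.path.join(glob.escape(directory), "mandrake.embedding*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    cmd = build_mandrake_command("sketchlib.h5")
    print(" ".join(cmd))
    # Only run when mandrake is installed and the input file exists.
    if shutil.which("mandrake") and os.path.exists("sketchlib.h5"):
        subprocess.run(cmd, check=True)
    for output_path in find_embedding_outputs():
        print(output_path)
```

Passing the arguments as a list avoids shell quoting problems if the sketch file path contains spaces.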
Other useful arguments include:
- --alignment: use a fasta alignment to calculate distances
- --accessory: use a presence/absence file (Rtab or similar) to calculate distances
- --distances: use a .npz file from a previous run and skip straight to the embedding step
- --labels: give labels to colour the output by
- --perplexity: change the perplexity of the preprocessing (similar to t-SNE)
- --animate: produce a video of the optimisation
- --use-gpu: use a GPU for the run. Make sure to increase --n-workers.
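For readers unfamiliar with the term, perplexity in t-SNE is defined as 2 raised to the Shannon entropy (in bits) of a point's neighbour probability distribution, so it can be read as an effective number of neighbours. The sketch below computes this quantity for a toy distribution. It illustrates the t-SNE definition only and is not taken from mandrake's code; the function name perplexity is hypothetical.

```python
import math
from typing import Sequence


def perplexity(probs: Sequence[float]) -> float:
    """Return 2**H(P), where H is the Shannon entropy of P in bits.

    The input weights are normalised to sum to 1 before use.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        if p > 0:  # zero-probability terms contribute nothing
            q = p / total
            entropy -= q * math.log2(q)
    return 2.0 ** entropy


if __name__ == "__main__":
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(perplexity([1.0, 0.0, 0.0, 0.0]))
```

A uniform distribution over four neighbours has perplexity 4, while a distribution concentrated on a single neighbour has perplexity 1. Larger values therefore correspond to each point taking more neighbours into account.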
See the documentation for more details.