Phylogenetic Assignment of Named Global Outbreak LINeages
pangolin 2.0 comes with major updates, including a significant speedup and assignment based on machine learning (affectionately described as pangoLEARN).
pangolin now comes in web-application form thanks to the Centre for Genomic Pathogen Surveillance! Find it at https://pangolin.cog-uk.io/.
- Requirements
- Install pangolin
- Check the install worked
- Updating pangolin
- Updating from pangolin v1.0 to pangolin v2.0
- Usage
- Output
- pangoLEARN description
- Recall rate
- SNPs associated with a given lineage
- Source data
- Authors
- Citing pangolin
- References
Pangolin runs on MacOS and Linux. The conda environment recipe may not build on Windows (I haven't tested it), but pangolin can be run using the Windows Subsystem for Linux.
- Some version of conda (we use Miniconda3), which can be downloaded from here
- Your query fasta file
- Clone this repository and
cd pangolin
conda env create -f environment.yml
conda activate pangolin
python setup.py install
- That's it
For troubleshooting the install, see the pangolin wiki.
Note: we recommend using pangolin in the conda environment specified in the environment.yml file as per the instructions above. If you can't use conda for some reason, bear in mind the data files are hosted in two separate repositories at
- cov-lineages/lineages
- cov-lineages/pangoLEARN
You will need to pip install them alongside the other dependencies for pangolin (details found in environment.yml).
Type (in the pangolin environment):
pangolin -v
pangolin -pv
pangolin -lv
and you should see the versions of pangolin, pangoLEARN and the lineages data release printed, respectively.
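The same check can be scripted, for example to record which versions a run used. This is a minimal sketch that relies only on the three flags shown above; the names `VERSION_FLAGS` and `report_versions` are illustrative and not part of pangolin.

```python
import shutil
import subprocess

# Flags taken from the commands above; the keys are only display labels.
VERSION_FLAGS = {
    "pangolin": "-v",
    "pangoLEARN": "-pv",
    "lineages": "-lv",
}


def report_versions(executable="pangolin"):
    """Return {label: version text}, or an empty dict if the executable is not on PATH."""
    if shutil.which(executable) is None:
        return {}
    versions = {}
    for label, flag in VERSION_FLAGS.items():
        result = subprocess.run(
            [executable, flag],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        # Version text may arrive on stdout or stderr, so take whichever is non-empty.
        versions[label] = (result.stdout or result.stderr).strip()
    return versions


if __name__ == "__main__":
    for label, text in report_versions().items():
        print(label + ": " + text)
```

Run it inside the activated pangolin environment; outside it, the function simply returns an empty dict.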
Note: pangolin is being worked on intensively, so even if you have previously installed it, we recommend you check for updates before running.
To update pangolin, pangoLEARN, and lineages automatically to the latest stable release:
conda activate pangolin
pangolin --update
Alternatively, this can be done manually:
conda activate pangolin
- git pull
pulls the latest changes from github
- python setup.py install
re-installs pangolin
- conda env update -f environment.yml
updates the conda environment (you're unlikely to need to do this, but just in case!)
- pip install git+https://github.com/cov-lineages/pangoLEARN.git --upgrade
updates if there is a new data release
- pip install git+https://github.com/cov-lineages/lineages.git --upgrade
updates if there is a new data release; this is the legacy data repo and is unlikely to have tagged releases in the future
- If you specify the data path (-d), it should now point to the pangoLEARN data directory instead of lineages, for example:
-d /home/vix/miniconda3/envs/pangolin/lib/python3.6/site-packages/pangoLEARN/data
- The columns in the output file have also changed, unless running --legacy
- No longer UFBootstrap, aLRT or lineages_version
- New fields: probability and pangoLEARN_version
- Activate the environment
conda activate pangolin
- Run pangolin <query>, where <query> is the name of your input file.
pangolin: Phylogenetic Assignment of Named Global Outbreak LINeages

positional arguments:
  query                 Query fasta file of sequences to analyse.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory. Default: current working directory
  --outfile OUTFILE     Optional output file name. Default: lineage_report.csv
  -d DATA, --data DATA  Data directory minimally containing a fasta alignment
                        and guide tree
  -n, --dry-run         Go through the motions but don't actually run
  --tempdir TEMPDIR     Specify where you want the temp stuff to go. Default:
                        $TMPDIR
  --no-temp             Output all intermediate files, for dev purposes.
  --max-ambig MAXAMBIG  Maximum proportion of Ns allowed for pangolin to
                        attempt assignment. Default: 0.5
  --min-length MINLEN   Minimum query length allowed for pangolin to attempt
                        assignment. Default: 10000
  --panGUIlin           Run web-app version of pangolin
  --assign-using-tree   LEGACY: Use original phylogenetic assignment methods
                        with guide tree. Note, will be significantly slower
                        than pangoLEARN
  --write-tree          Output a phylogeny for each query sequence placed in
                        the guide tree. Only works in combination with legacy
                        `--assign-using-tree`
  -t THREADS, --threads THREADS
                        Number of threads
  -p, --include-putative
                        Include the bleeding edge lineage definitions in
                        assignment
  --verbose             Print lots of stuff to screen
  -v, --version         show program's version number and exit
  -lv, --lineages-version
                        show lineages's version number and exit
  -pv, --pangoLEARN-version
                        show pangoLEARN's version number and exit
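If you call pangolin from a script, it can help to assemble the command from named options. The sketch below uses only options that appear in the help text above; the function name `build_pangolin_command` and the file and directory names are hypothetical.

```python
import shlex


def build_pangolin_command(query, outdir=None, outfile=None, threads=None, max_ambig=None):
    """Assemble an argument list for pangolin from options documented in its help text."""
    command = ["pangolin", query]
    if outdir is not None:
        command += ["--outdir", str(outdir)]
    if outfile is not None:
        command += ["--outfile", str(outfile)]
    if threads is not None:
        command += ["--threads", str(threads)]
    if max_ambig is not None:
        command += ["--max-ambig", str(max_ambig)]
    return command


# Hypothetical input file and output directory, for illustration only.
command = build_pangolin_command("my_sequences.fasta", outdir="results", threads=4)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed directly to `subprocess.run` inside the activated pangolin environment.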
Your output will be a csv file with the taxon name and assigned lineage, with one line for each sequence in the fasta file provided.
Example:
Taxon | Lineage | support | pangoLEARN_version | status | note |
---|---|---|---|---|---|
Virus1 | B.1 | 80 | 2020-04-27 | passed_qc | |
Virus2 | A.1 | 65 | 2020-04-27 | passed_qc | |
Virus3 | A.3 | 100 | 2020-04-27 | passed_qc | |
Virus4 | B.1.4 | 82 | 2020-04-27 | passed_qc | |
Virus5 | None | 0 | 2020-04-27 | fail | N_content:0.80 |
Virus6 | None | 0 | 2020-04-27 | fail | seq_len:0 |
Virus7 | None | 0 | 2020-04-27 | fail | failed to map |
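A report like this is easy to post-process. The following sketch reads a report and separates sequences that passed QC from those that failed. It assumes the csv header uses the column names shown in the example above (compared case-insensitively), which may not match the real file exactly; the helper names are illustrative.

```python
import csv
import io


def load_report(handle):
    """Read a lineage report and normalise header names to lower case."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        })
    return rows


def split_by_status(rows):
    """Separate rows whose status is passed_qc from all other rows."""
    passed = [row for row in rows if row["status"] == "passed_qc"]
    failed = [row for row in rows if row["status"] != "passed_qc"]
    return passed, failed


# Two rows copied from the example table, written as csv (header names assumed).
EXAMPLE = """Taxon,Lineage,support,pangoLEARN_version,status,note
Virus1,B.1,80,2020-04-27,passed_qc,
Virus5,None,0,2020-04-27,fail,N_content:0.80
"""

passed, failed = split_by_status(load_report(io.StringIO(EXAMPLE)))
for row in failed:
    print(row["taxon"], "failed:", row["note"])
```

For a real run, replace the `io.StringIO` example with `open("lineage_report.csv", newline="")`.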
Legacy phylogenetics output example:
Taxon | Lineage | aLRT | UFbootstrap | lineages_version | status | note |
---|---|---|---|---|---|---|
Virus1 | B.1 | 80 | 82 | 2020-04-27 | passed_qc | |
Virus2 | A.1 | 65 | 95 | 2020-04-27 | passed_qc | |
Virus3 | A.3 | 100 | 100 | 2020-04-27 | passed_qc | |
Virus4 | B.1.4 | 82 | 73 | 2020-04-27 | passed_qc | |
Virus5 | None | 0 | 0 | 2020-04-27 | fail | N_content:0.80 |
Virus6 | None | 0 | 0 | 2020-04-27 | fail | seq_len:0 |
Resources for interpreting the aLRT and UFbootstrap output can be found here and here.
pangoLEARN is an alternative, machine-learning-based algorithm for lineage assignment, implemented as of pangolin 2.0. The new algorithm gives a major speedup, as the phylogenetic approach was struggling to scale with the increasing number of lineages that needed to be represented in the guide tree. It also takes into account all of the diversity present within a lineage rather than just a representative few sequences. As a consequence, recall and precision have improved significantly for large lineages, and we are continuing to develop more sophisticated approaches to machine learning for lineage assignment.
The current version of pangoLEARN uses multinomial logistic regression, but the pipeline has been written so that, as more complex models are developed, the user will be able to choose which model to use to assign their lineages.
To explain the model we're currently using: a standard regression fits a line to a set of training data to model a linear relationship between variables of interest, whereas a logistic regression fits a sigmoid (S-shaped) function to the training data in order to tell two different classes apart. A multinomial logistic regression extends standard logistic regression so that it can classify more than two classes. Each potential assignment (i.e. lineage) is modeled as a set of n-1 independent binary choices (sigmoid functions), where n is the number of classes.
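One textbook way to write down the "n-1 binary choices" view is the pivot formulation: n-1 classes each get a weight vector, and the remaining (pivot) class acts as the reference. The sketch below illustrates that formulation with made-up weights. It is not pangoLEARN's code, which uses scikit-learn, and the function name is hypothetical.

```python
import math


def class_probabilities(features, weights):
    """Return probabilities for n classes given n-1 weight vectors.

    The last class is the pivot: its score is fixed at zero, so every other
    class is effectively compared against it.
    """
    scores = [
        sum(w * x for w, x in zip(weight_vector, features))
        for weight_vector in weights
    ]
    exp_scores = [math.exp(score) for score in scores]
    denominator = 1.0 + sum(exp_scores)
    probabilities = [value / denominator for value in exp_scores]
    probabilities.append(1.0 / denominator)  # probability of the pivot class
    return probabilities


# Toy example: 2 features, 3 classes (so 2 weight vectors). Values are invented.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
print(class_probabilities([2.0, 0.5], toy_weights))
```

With n = 2 this reduces to the ordinary sigmoid of a single score, which is the standard logistic regression described above.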
The model was trained using 30,000 SARS-CoV-2 sequences from GISAID (acknowledgements here), with lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 14 hours on our systems (this may change with different hardware). The model was built using the standard scikit-learn implementation of multinomial logistic regression. The code for this process is available in the cov-lineages/cov-support repository.
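To make "one-hot encoded" concrete, here is a minimal sketch of encoding a nucleotide sequence. The four-letter alphabet and the choice to encode any other character (such as N) as all zeros are assumptions made for illustration; the actual training code in cov-lineages/cov-support may treat ambiguous bases differently.

```python
ALPHABET = "ACGT"  # assumption: only unambiguous bases get their own column


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Flatten a sequence into a list of 0/1 values, len(alphabet) values per base."""
    encoded = []
    for base in sequence.upper():
        # Characters outside the alphabet (e.g. N) become an all-zero group.
        encoded.extend(1 if base == symbol else 0 for symbol in alphabet)
    return encoded


print(one_hot_encode("ACN"))
```

Each genome position contributes `len(alphabet)` features, which is why the number of parameters to train grows so quickly with genome length and the number of lineages.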
Multinomial logistic regression is an extremely commonly used model, as it assigns probabilities to class assignments simply and intuitively. However, it does not incorporate any hierarchical structure, and we are currently developing new models that do. Despite the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracies for smaller lineages, the logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.
Pre-pangolin 2.0: Of 9,843 GISAID sequences assigned lineages by hand (taking sequence, phylogeny and metadata into account), pangolin accurately assigns the lineage of 97.85% of those sequences. Of the sequences that were not recalled correctly, 74.5% had a UFbootstrap of 0 and an aLRT of 0. We're continuing to work to improve this recall rate, but recommend interpreting the pangolin output cautiously with due attention to the UFbootstrap and aLRT values.
Given that SARS-CoV-2 is relatively slow evolving for an RNA virus and there is still not a huge amount of diversity, missing or ambiguous data at key residues may lead to incorrect placement within the guide tree. We have a filter in place that by default will not call a lineage for any sequence with >50% N-content, but this can be made more conservative with the command line option --max-ambig.
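The filter can be pictured as a simple pre-check on each sequence. The sketch below uses the documented defaults for --max-ambig (0.5) and --min-length (10000) and mimics the note format from the example output. The function name and the exact comparisons (strictly greater than, strictly less than) are assumptions, not pangolin's actual implementation.

```python
DEFAULT_MAX_AMBIG = 0.5     # documented default for --max-ambig
DEFAULT_MIN_LENGTH = 10000  # documented default for --min-length


def qc_check(sequence, max_ambig=DEFAULT_MAX_AMBIG, min_length=DEFAULT_MIN_LENGTH):
    """Return (passed, note) for one sequence, using length and N-content thresholds."""
    length = len(sequence)
    if length == 0 or length < min_length:
        return False, "seq_len:{}".format(length)
    n_content = sequence.upper().count("N") / length
    if n_content > max_ambig:
        return False, "N_content:{:.2f}".format(n_content)
    return True, ""


# A short toy sequence, so the length threshold is lowered for the demonstration.
print(qc_check("NNNNNNNNAC", min_length=5))
```

Passing a smaller `max_ambig` (for example 0.1) makes the check more conservative, in the same way as lowering --max-ambig on the command line.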
pangolin 2.0 onwards:
Recall and supporting statistics were generated by training a model with the same procedure as above on 75% of the data and holding out the remaining 25% as testing data. Smaller lineages may have lower recall rates due to the very small sample sizes in the test set.
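To illustrate why small lineages give noisy recall figures, here is a sketch of computing per-lineage recall on a held-out test set. The labels are invented for the example and the function name is hypothetical; this is not the evaluation code used to produce the published statistics.

```python
from collections import Counter


def recall_per_lineage(true_lineages, predicted_lineages):
    """Fraction of test sequences of each true lineage that were assigned correctly."""
    totals = Counter(true_lineages)
    correct = Counter(
        truth
        for truth, predicted in zip(true_lineages, predicted_lineages)
        if truth == predicted
    )
    return {lineage: correct[lineage] / total for lineage, total in totals.items()}


# Invented test-set labels: four B.1 sequences and a single A.3 sequence.
truth = ["B.1", "B.1", "B.1", "B.1", "A.3"]
predicted = ["B.1", "B.1", "B.1", "A.1", "A.1"]
print(recall_per_lineage(truth, predicted))
```

A lineage with a single test sequence can only score 0 or 1, so one misassignment moves its recall from 100% to 0%.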
The model trains coefficients for each input parameter, for each potential lineage assignment. A particularly large coefficient in a particular lineage’s sigmoid function indicates a stronger association between that location and that lineage. A particularly negative coefficient in a particular lineage’s sigmoid function indicates the opposite. In other words, we can pick up SNPs that are strongly associated with or strongly negatively associated with a given lineage. This information is hosted for download from the pangoLEARN data repository.
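As an illustration of how such coefficients could be read, the sketch below ranks the features of one lineage by coefficient. The feature labels and numbers are invented, and no particular file format for the pangoLEARN download is assumed; you would need to load the real coefficients into a dict yourself.

```python
def strongest_associations(coefficients, k=1):
    """Return the k most positive and k most negative (feature, coefficient) pairs."""
    ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[::-1][:k]
    return most_positive, most_negative


# Hypothetical coefficients for a single lineage, keyed by "position:base".
toy_coefficients = {
    "pos100:T": 2.5,
    "pos200:G": 0.1,
    "pos300:C": -1.8,
    "pos400:A": 0.0,
}

positive, negative = strongest_associations(toy_coefficients, k=1)
print("associated with lineage:", positive)
print("associated against lineage:", negative)
```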
pangolin runs a multinomial logistic regression model trained against lineage assignments based on GISAID data.
Legacy pangolin runs using a guide tree and alignment hosted at cov-lineages/lineages. Some of this data is sourced from GISAID, but anonymised and encrypted to fit with guidelines. Appropriate permissions have been given and acknowledgements for the teams that have worked to provide the original SARS-CoV-2 genome sequences to GISAID are also hosted here.
Pangolin was created by Áine O'Toole, JT McCrone and Emily Scher. It uses lineages from Rambaut et al. 2020.
There is a publication in prep for pangolin, but in the meantime please link to this GitHub repository, github.com/cov-lineages/pangolin, if you have used pangolin in your research.
The following external software is run as part of pangolin:
L.-T. Nguyen, H.A. Schmidt, A. von Haeseler, B.Q. Minh (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300
D.T. Hoang, O. Chernomor, A. von Haeseler, B.Q. Minh, L.S. Vinh (2018) UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol., 35:518–522. https://doi.org/10.1093/molbev/msx281
Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, Olivier Gascuel, New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0, Systematic Biology, Volume 59, Issue 3, May 2010, Pages 307–321, https://doi.org/10.1093/sysbio/syq010
K. Katoh, D.M. Standley (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30:772-780.
Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, 15 September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191
Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.