Phylogenetic Analysis

Overview

This directory contains the scripts, XML files and RMarkdown notebooks needed to:

  • Estimate time-calibrated trees in BEAST
  • Estimate state transitions between UK and non-UK branches in the trees in BEAST (DTA)
  • Extract UK transmission lineages
  • Create related figures and tables

Minimal results are also included. Some scripts may need adjusting to suit the local setup.

The structure of the phylogenetic analysis directory is shown below:

phylogenetic/
├── reports
├── results
│   ├── combined_beast_dta
│   └── xml
│       ├── dta
│       │   ├── A-DTA-20200818.xml
│       │   ├── B-DTA-20200818.xml
│       │   ├── B.1.1-DTA-20200818.xml
│       │   ├── B.1.X-DTA-20200818.xml
│       │   └── B.1.pruned-DTA-20200818.xml
│       ├── timetrees
│       │   ├── A.fixedRootPrior.skygrid-20200720.xml
│       │   ├── B.1.1.fixedRootPrior.skygrid-20200720.xml
│       │   ├── B.1.X.fixedRootPrior.skygrid-20200720.xml
│       │   ├── B.1.pruned.fixedRootPrior.skygrid-20200720.xml
│       │   └── B.fixedRootPrior.skygrid-20200720.xml
│       └── preliminary_analysis.xml
├── scripts
└── README.md

Input data

Sequence IDs for the genomes used in these analyses can be found in data/phylogenetic/metadata.csv, with the corresponding acknowledgements in data/phylogenetic/GISAID_acknowledgements.csv. The sequences themselves can be downloaded from COG-UK and GISAID.

The phylogenetic trees used as input data in the timetrees/ analyses can be found in data/phylogenetic/. They were estimated using the COG-UK phylogenetic pipeline, grapevine (commit 11bff38, https://github.com/COG-UK/grapevine).

BEAST analysis

  1. Preliminary analysis: Run preliminary_analysis.xml
  2. Time trees: Run XML files in timetrees/
  3. DTA: Run XML files in dta/

The XML files should be run using the BEAST development branch approximateTreeLikelihood (commit c8cc55d4).
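The run order above can be sketched as a small helper that lists the XML files stage by stage. This is only an illustration and not part of the repository: the function name `beast_run_order`, the `results/xml` root passed to it, and the assumption that a BEAST build from the branch above is available as an executable called `beast` are all hypothetical.

```python
from pathlib import Path


def beast_run_order(xml_root):
    """Return the BEAST XML files under xml_root in the order they should be run.

    Assumes the layout shown in the directory tree: preliminary_analysis.xml
    at the top level, followed by timetrees/*.xml and then dta/*.xml.
    """
    root = Path(xml_root)
    ordered = []

    # Step 1: the preliminary analysis, if the file is present.
    preliminary = root / "preliminary_analysis.xml"
    if preliminary.is_file():
        ordered.append(preliminary)

    # Steps 2 and 3: time trees first, then the DTA analyses.
    for stage in ("timetrees", "dta"):
        ordered.extend(sorted((root / stage).glob("*.xml")))

    return ordered


if __name__ == "__main__":
    for xml_file in beast_run_order("results/xml"):
        # "beast" is an assumed executable name; adjust to the local install.
        print("beast", xml_file)
```

The script only prints the commands it would run, so the list can be checked before anything is launched.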

Reports

Run the RMarkdown notebooks below to extract UK transmission lineages and reproduce figures and tables.

  1. extractLineages.Rmd: Extract TMRCAs and other summary statistics of the UK transmission lineages across all posterior trees from the BEAST analyses.
  2. extractLineagesMCC.Rmd: Extract TMRCAs and other summary statistics of the UK transmission lineages from the BEAST MCC trees.
  3. lineageSummary.Rmd: Plot summary statistics and figures of the UK transmission lineages extracted from the BEAST DTA analyses.
  4. importationSummary.Rmd: Plot figures about the dataset, infections in different countries and inbound travellers. Also apply the importation lag model to the UK transmission lineage TMRCAs and plot figures with lineage importations. This notebook requires some of the input data and results of the epidemiological analyses.
  5. lineageSimilarity.Rmd: Compare similarity of lineage assignments across posterior trees and the MCC tree using the Jaccard index.
  6. lineageBreakdown.Rmd: Plot breakdowns of UK transmission lineages over time (using only the assignment on the MCC trees).
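For reference, the Jaccard index mentioned for lineageSimilarity.Rmd measures the overlap of two sets A and B as |A ∩ B| / |A ∪ B|. The sketch below computes it in Python for two made-up sets of genome IDs. It illustrates the index itself and is not the notebook's implementation; the function name, the sample IDs and the convention for two empty sets are assumptions.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical genome IDs assigned to the same lineage on two different trees.
mcc_lineage = {"genome_1", "genome_2", "genome_3"}
posterior_lineage = {"genome_2", "genome_3", "genome_4"}

# 2 shared genomes out of 4 distinct genomes overall.
print(jaccard_index(mcc_lineage, posterior_lineage))  # 0.5
```

A value of 1 means the two assignments contain exactly the same genomes, and 0 means they share none.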

Output

  • DTA output: Log files, MCC trees and subsampled posterior tree files (thinned to only 200 posterior trees) can be found in results/xml/dta/output/.
  • UK transmission lineages: The extracted UK transmission lineages can be found in results/combined_beast_dta/. Only the summary .csv files are provided on this repository.
    • clusters_DTA.csv.xz: Summary statistics of UK transmission lineages across all 2000 posterior trees.
    • clusterSamples_DTA.csv.xz: Assignment of UK genomes to transmission lineages across all 2000 posterior trees.
    • clusters_DTA_MCC_0.5.csv: Summary statistics of UK transmission lineages on the MCC trees using a posterior probability threshold of 0.5. Note that the assignment of genomes to transmission lineages is based on the MCC (summary) tree(s), and therefore does not include the statistical uncertainty that is present in the posterior set of trees.
    • clusters_DTA_MCC_0.5_shifted.csv: As above, but including importation lag.
    • clusterSamples_DTA_MCC_0.5.csv: Assignment of UK genomes to transmission lineages on the MCC trees using a posterior probability threshold of 0.5. Note that the assignment of genomes to transmission lineages is based on the MCC (summary) tree(s), and therefore does not include the statistical uncertainty that is present in the posterior set of trees.
  • Figures: Output figures are stored in results/combined_beast_dta/figures/
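The .csv.xz summary files are xz-compressed CSV files and can be read without decompressing them first. The following is a minimal sketch using only the Python standard library. The helper name `read_xz_csv` is hypothetical, and because this README does not list the column names, rows are keyed by whatever header the file contains.

```python
import csv
import lzma


def read_xz_csv(path):
    """Yield each row of an xz-compressed CSV file as a dict keyed by column name."""
    # Text mode ("rt") lets the csv module read the decompressed stream directly.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


# Example usage (path taken from the list above):
#
#   rows = read_xz_csv("results/combined_beast_dta/clusters_DTA.csv.xz")
#   first = next(rows)
#   print(list(first))  # column names found in the file
```

Reading row by row keeps memory use low, which helps because these files summarise lineages across many posterior trees.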