Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression
This repository contains data, code, and figures generated for the manuscript:
Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). [Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression](https://www.biorxiv.org/content/10.1101/2023.12.11.571168). bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168
The preprint is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.11.571168
The Notebooks folder contains the code for all analyses in the preprint, from pre-processing of the raw data to final figure generation. Each notebook links directly to Google Colaboratory, where it can be executed.
Large datasets are stored on Caltech Data and can be accessed under the DOIs 10.22002/krqmp-5hy81 and 10.22002/k7xqw-88d74.
An interactive Krona plot shows all viruses expressed above the QC threshold in macaque cells that passed quality control, broken down by animal, timepoint, taxonomy, and the fraction of positive cells occupied by each virus. Code to reproduce the Krona plot is also provided.
The precomputed_refs folder contains precomputed reference indices for detecting viral RNA in sequencing data by alignment to the optimized PalmDB, with the human (or mouse) genome and transcriptome masked.
A description of kallisto, bustools, and kb-python, including tutorials for their use, can be found here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164
# 1. Install kb-python (optional: install gget to fetch the host genome and transcriptome)
pip install kb-python gget
# 2. Download optimized PalmDB reference files
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt
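Before building the index, it can be worth confirming that both downloads completed. The following is a minimal Python sketch and not part of the original workflow. It assumes that the FASTA file uses standard `>` header lines and that the t2g file is a tab-separated table with the sequence ID in the first column and the ID it is grouped under in the second. All function names are hypothetical.

```python
from collections import Counter


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def read_t2g(path):
    """Read a two-column, tab-separated t2g file into {sequence_id: group_id}.

    Assumption: column 1 is the sequence ID, column 2 the ID it is grouped under.
    """
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or not fields[0]:
                continue  # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def largest_groups(mapping, n=5):
    """Return the n group IDs with the most member sequences."""
    return Counter(mapping.values()).most_common(n)
```

If every sequence has a t2g entry, `count_fasta_records("palmdb_rdrp_seqs.fa")` and `len(read_t2g("palmdb_clustered_t2g.txt"))` should agree. A mismatch may point to an incomplete download.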
# 3. Create reference index (+ optional masking of the host, here human, genome using the D-list)
# Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
  --aa \
  --d-list $(gget ref --ftp -w dna homo_sapiens) \
  -i index.idx --workflow custom \
  palmdb_rdrp_seqs.fa
# 4. Align sequencing reads
# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
  --aa \
  -i index.idx -g palmdb_clustered_t2g.txt \
  --parity single \
  -x default \
  $USER_DATA.fastq.gz
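Once alignment has finished, a first look at the results could be to sum the counts per target ID. The sketch below is an illustration under stated assumptions, not the analysis used in the preprint. It assumes that the count matrix is a Matrix Market coordinate file with cells as rows and targets as columns, accompanied by a text file listing one target ID per column. With kb-python these are typically `counts_unfiltered/cells_x_genes.mtx` and `counts_unfiltered/cells_x_genes.genes.txt` in the output directory, but the paths are passed in as arguments because they depend on how `kb count` was run. The function names are hypothetical.

```python
from collections import defaultdict


def read_lines(path):
    """Return the non-empty lines of a text file, one ID per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def total_counts_per_target(mtx_path, ids_path):
    """Sum counts over all cells (rows) for each target (column).

    Assumption: a Matrix Market coordinate file with cells as rows and
    targets as columns, plus a text file with one target ID per column.
    """
    targets = read_lines(ids_path)
    totals = defaultdict(float)
    with open(mtx_path) as handle:
        size_seen = False
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # header and comment lines
            if not size_seen:
                size_seen = True  # first data line holds: rows cols nonzeros
                n_cols = int(line.split()[1])
                if n_cols != len(targets):
                    raise ValueError("column count does not match ID list")
                continue
            _row, col, value = line.split()[:3]
            totals[targets[int(col) - 1]] += float(value)  # indices are 1-based
    return dict(totals)


def top_targets(totals, n=10):
    """Return the n targets with the highest summed counts."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

The IDs returned here should correspond to the second column of `palmdb_clustered_t2g.txt`. The plain-text parser keeps the sketch dependency-free; for larger matrices, a library reader such as `scipy.io.mmread` is likely a better fit.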