Updating documentation
afonsoguerra committed Jan 4, 2024
1 parent 60022f5 commit 1cd0ebe
Showing 2 changed files with 181 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.simg

181 changes: 179 additions & 2 deletions README.md
@@ -1,11 +1,28 @@
# AYUKA: A toolkit for fast viral genotyping using whole genome sequencing.

AYUKA is a fast viral genotyping toolkit that can analyze raw sequencing reads to determine the viral genotypes present in a sample. It was developed by Jose Afonso Guerra-Assuncao at University College London.

## Installation

AYUKA can be installed via the Singularity container available on the [GitHub releases page](https://github.com/afonsoguerra/AYUKA). Alternatively, it can be installed from source by following the instructions in the Singularity recipe file.

The key dependencies are:

- Perl ([https://www.perl.org/](https://www.perl.org/))
- R ([https://www.r-project.org/](https://www.r-project.org/))
- LaTeX ([https://www.latex-project.org/](https://www.latex-project.org/))
- Jellyfish ([http://www.genome.umd.edu/jellyfish.html](http://www.genome.umd.edu/jellyfish.html))
- SQLite3 ([https://www.sqlite.org/index.html](https://www.sqlite.org/index.html))

## Usage

AYUKA takes as input raw sequencing reads in FASTQ or FASTA format. It can process gzipped files directly.

Please find the Singularity container with the full codebase on the releases tab.

The software can be invoked by running:

```./ayuka22-222.simg```
```./ayuka.simg```

or

@@ -25,3 +42,163 @@ Associated databases available on Zenodo:
José Afonso Guerra-Assunção. (2022). AYUKA Genotyper Databases - Human Adenovirus (HAdV) (22-222) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6521576


## Usage

The basic usage is:

```
AYUKA.pl --seqs sample.fq.gz
```

To analyze paired-end reads:

```
AYUKA.pl --seqs sample_R1.fq.gz,sample_R2.fq.gz
```

Or:

```
AYUKA.pl --seqs sample_R1.fq.gz --seqs sample_R2.fq.gz
```

A text file listing multiple sample files can also be provided:

```
AYUKA.pl --fqList list_of_samples.txt
```
In this last case, each line of the list file should contain the files for one sample.
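
A list file like this could be generated with a short shell loop. The following is only a sketch: the function name `make_fq_list` is hypothetical, it assumes paired files are named `*_R1.fq.gz` and `*_R2.fq.gz`, and it assumes that the files for one sample are written comma-separated on a single line (mirroring the `--seqs` syntax above). The separator AYUKA actually expects should be checked against the tool's own help output.

```shell
#!/bin/sh
# Hypothetical helper: print one line per sample found in directory $1.
# ASSUMPTION: mates are comma-separated on one line, as with --seqs.
make_fq_list() {
    dir=$1
    for r1 in "$dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue          # no matches: the glob stays literal
        r2="${r1%_R1.fq.gz}_R2.fq.gz"     # expected name of the mate file
        if [ -e "$r2" ]; then
            printf '%s,%s\n' "$r1" "$r2"  # paired-end sample
        else
            printf '%s\n' "$r1"           # single-end sample
        fi
    done
}

# Example:
#   make_fq_list reads > list_of_samples.txt
#   AYUKA.pl --fqList list_of_samples.txt
```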


### Filtering Options

`-min_depth`: Minimum k-mer depth to filter noise [default 10]

`-min_bin_frac`: Minimum fraction of bins to be covered [default 0.1]

### Output Options

`-outputDir`: Output directory [default current working directory]

`-outputFile`: Output file name prefix [default random ID]

### Advanced Options

`-database`: Specify custom k-mer database

`-threads`: Number of threads for Jellyfish

`-skipReport`: Skip PDF report generation

`-pipeline`: Skip report and positional plots for pipelines
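
The options above can be combined in a single call. The sketch below only prints the command line (a dry run) so that it can be inspected first. The function name `ayuka_cmd` and all values (depth 20, 4 threads, the `results` folder) are illustrative assumptions, as is the guess that each option takes its value as the next argument and that `-skipReport` is a plain switch.

```shell
#!/bin/sh
# Print (dry run) an AYUKA command line for one sample.
# The flags are the ones listed above; the values are illustrative only.
ayuka_cmd() {
    sample=$1   # output file prefix, e.g. sample1
    reads=$2    # e.g. sample1_R1.fq.gz,sample1_R2.fq.gz
    echo AYUKA.pl --seqs "$reads" \
        -min_depth 20 \
        -outputDir results \
        -outputFile "$sample" \
        -threads 4 \
        -skipReport
}

# Example:
#   ayuka_cmd sample1 sample1_R1.fq.gz,sample1_R2.fq.gz
```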

## Output

AYUKA produces two main types of output:

1. A PDF report with tables and plots summarizing the analysis

2. Tab-delimited text files with the detailed genotyping results

The PDF report contains:

- Genotype table with p-values
- Coverage and depth estimates
- Plots showing k-mer distributions across the genome

The text files contain the raw data used to generate the report.

## Applications

Some key applications of AYUKA include:

- Rapid preliminary analysis of clinical viral sequencing samples prior to mapping to the best reference
- Distinguishing mixed infections from recombinants
- Detection of virus reads in RNA-seq experiments
- Monitoring viral engineering experiments

Its speed (under 1 minute per sample) and ability to run on a laptop make AYUKA suitable for outbreak monitoring and routine screening. The results can guide more in-depth bioinformatics analyses when needed.

## Databases

AYUKA requires a pre-built k-mer database for the viral genotypes of interest. Databases are available for:

- Human adenoviruses
- Adeno-associated viruses (AAV)
- Epstein-Barr virus (EBV)


## Citation

If you use AYUKA in your work, please cite:

Guerra-Assuncao et al. "AYUKA: A toolkit for fast viral genotyping using whole genome sequencing" *bioRxiv* 2022. https://doi.org/10.1101/2022.09.07.506755


# Creating an AYUKA Database

AYUKA requires a pre-built k-mer database to perform genotyping. The `ayuka.database.conf` configuration file controls database creation.

## Parameters

- **VirusType** - The type of virus to build the database for, e.g. Adenovirus. This selects the appropriate reference sequences.

- **KMERSize** - The k-mer length to use. Larger k-mers are more specific but reduce sensitivity. Typical values are 35-285. Note: this should always be less than the length of the sequencing reads being processed. See the manuscript for more details on the effects of k-mer size on sensitivity, specificity, memory consumption and speed.

- **SpeciesInfoFile** - A two-column TSV file listing the genotype IDs and GenBank accessions to include. See `adenovirus.database.info` for an example.

- **BuildThreads** - Number of CPU threads to use. Parallelizes k-mer counting.

- **Number_of_bins** - Number of bins to divide each genome into for positional plots. 100 is a good default.

- **BuildFolder** - Output folder for the database files.

- **NCBI_NT_filter_string** - A string to filter the NT database with. Removes k-mers from unrelated organisms.
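
Before a build, it may be worth confirming that the `SpeciesInfoFile` really is a two-column TSV. The check below is a sketch under assumptions: the function name `check_species_info` is hypothetical, blank lines are tolerated, and no header line is expected. It does not verify that the accessions exist.

```shell
#!/bin/sh
# Return 0 if every non-empty line of file $1 has exactly two non-empty
# tab-separated fields; otherwise report the offending lines and return 1.
check_species_info() {
    awk -F '\t' '
        NF == 0 { next }                          # skip blank lines
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated fields\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example:
#   check_species_info adenovirus.database.info && echo "info file looks OK"
```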

## Considerations

- Include representative and high-quality RefSeq genomes covering all genotypes of interest

- Adjust k-mer size based on read length and specificity needed

- Use all available threads to parallelize counting and speed up database build

- Filter NT database stringently to keep only informative k-mers

- Use ~100 bins for full genome plots while limiting output file size

- Standardized build folder naming helps manage multiple databases

- Check that genotypes are divergent enough before including them together

- Compare number of shared vs specific k-mers to optimize parameters
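
The constraint that the k-mer size must stay below the read length can be checked against real data before settling on `KMERSize`. This is a minimal sketch, not part of AYUKA: the function names `min_read_length` and `kmer_fits_reads` are hypothetical, and the input is assumed to be standard four-line FASTQ records, optionally gzipped.

```shell
#!/bin/sh
# Print the shortest read length in a FASTQ file (plain or gzipped).
# Assumes 4-line FASTQ records, so the sequence is line 2 of every 4.
min_read_length() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 {
                    n = length($0)
                    if (min == "" || n < min) min = n
                }
                END { print min + 0 }'
}

# Succeed only if k-mer size $1 is smaller than every read in file $2.
kmer_fits_reads() {
    [ "$1" -lt "$(min_read_length "$2")" ]
}

# Example:
#   kmer_fits_reads 35 sample.fq.gz || echo "k-mer size too large for these reads"
```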


## Building the Database

AYUKA provides a Makefile and Singularity container to streamline database building.

The Makefile and config file should be placed in the same folder.

The `ayuka.simg` main Singularity container should also be present in this folder or symbolically linked.

To build the database, run:

```
make database
```

This will launch the `ayuka.simg` container and perform the database build process using the parameters in `ayuka.database.conf`.

The Makefile will:

- Retrieve reference genomes from GenBank
- Run Jellyfish to count k-mers
- Annotate and filter the k-mer set
- Save the database files

The Singularity container bundles all the required dependencies in a portable runtime. This ensures a consistent environment for database building.

The Makefile automates the entire pipeline from config to final database. Together with the container, it provides a turn-key solution for generating custom AYUKA genotyping databases.
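
The folder layout described above can be verified before starting a build. The sketch below assumes the Makefile is literally named `Makefile`; the other two file names come from this README, and the function name `build_ready` is hypothetical.

```shell
#!/bin/sh
# Check that build folder $1 holds the files the build expects.
# ASSUMPTION: the Makefile is named "Makefile"; the other names are
# taken from the README.
build_ready() {
    dir=$1
    missing=0
    for f in Makefile ayuka.database.conf ayuka.simg; do
        # -e follows symbolic links, so a linked ayuka.simg is accepted
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   ln -s /path/to/ayuka.simg build/ayuka.simg
#   build_ready build && make -C build database
```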
