Codon Usage Bias from RNA-sequencing data (CUBseq) is a fully automatic pipeline that produces robust estimates of codon usage frequencies at the transcriptome level. CUBseq can be used for any organism with an NCBI taxonomy ID, available RNA-sequencing data and a reference genome/annotation. The end result is a dataset of transcriptome-wide sequences with variants built in, allowing CUBseq to provide codon relative frequencies as well as raw counts at codon and amino acid resolution for custom downstream codon usage analysis.
- Large-scale transcriptome-wide codon usage analysis.
- Generation of transcriptome-derived codon usage tables (expressed as relative frequency and frequency per thousand).
- Quantification of transcriptome-wide genes.
- Robust identification of high expression genes.
- Reconstruction of transcriptomes per sample using variant calls.
- Analysis of mutation frequency per sample across the transcriptome and at gene level.
- Comparison of codon frequency with tRNA abundance.
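As an illustration of the two codon usage table formats listed above, the following is a minimal sketch that counts in-frame codons in a coding sequence and expresses them as relative frequency and frequency per thousand. The function names and the toy sequence are hypothetical, and relative frequency is assumed here to mean the share of all counted codons; CUBseq's own implementation may define or compute it differently.

```python
from collections import Counter


def count_codons(cds):
    """Count in-frame codons in a coding sequence (trailing partial codons are ignored)."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA alphabet
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_usage_table(counts):
    """Express raw codon counts as relative frequency and frequency per thousand.

    Assumption: both are computed against the total number of counted codons.
    """
    total = sum(counts.values())
    return {
        codon: {
            "count": n,
            "relative_frequency": n / total,
            "per_thousand": 1000 * n / total,
        }
        for codon, n in sorted(counts.items())
    }


# Toy coding sequence (hypothetical): ATG GCT GCT AAA TAA
table = codon_usage_table(count_codons("ATGGCTGCTAAATAA"))
for codon, row in table.items():
    print(codon, row["count"], round(row["relative_frequency"], 3), round(row["per_thousand"], 1))
```

In practice the counts would be accumulated over all sequences in a transcriptome before normalising, rather than over a single sequence.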
Note: Before running the workflow, you will need to have Nextflow installed; see the Nextflow documentation for installation instructions.
nextflow pull stracquadaniolab/cubseq-nf -r main
A nextflow.config configuration file will need to be created where parameters are defined, as specified below in Configuring CUBseq. This configuration file will need to be created in the same directory where the pipeline will be run. An example configuration file is provided in example-nextflow.config.
Once the configuration file is set up, the minimal command required to run CUBseq is:
nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity -c conf/nextflow.config
Alternatively, you can define parameters and call custom profiles (examples are available in example-nextflow.config) directly in the nextflow run command:
nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity,cell -c conf/nextflow.config --resultsDir ./results/test-run
In this example, we call a profile, cell, which we defined in our config file to specify the executor, RAM/CPU requirements and error strategy for each process. We also specify a custom results directory path to save output files to.
To run CUBseq you will need to specify a number of paths for storing results, and provide appropriate parameter options based on the organism being analysed. These parameters need to be defined in a configuration file called nextflow.config. Required parameters are indicated with an asterisk; the rest of the parameters are optional.
Parameter | Description |
---|---|
resultsDir | Directory where all results are stored [default: "./results/"]. |
Paths to genome files | |
genome.reference * | Path to genome reference (FASTA) file [example: "data/genome/ecoli.fa"]. |
genome.annotation * | Path to genome annotation (GTF/GFF/GFF3) file [example: "data/genome/ecoli.gff"]. |
ENA metadata retrieval parameters | |
taxonId * | NCBI taxonomy ID of organism to be analysed [default: "562"]. |
limitSearch | Limit number of records output from ENA search query [default: 0]. |
removeRun | Remove run by specifying its run accession [default: "NULL", example: "SRR13894889"]. |
max_sra_bytes | Remove runs whose size (sra_bytes) exceeds this value [default: "55000000000"]. |
dateMin | Set minimum date (YYYY/MM/DD) to filter runs by (inclusive) [default: "1950-01-01"]. |
dateMax | Set maximum date (YYYY/MM/DD) to filter runs by (inclusive), uses current date by default [default: "FALSE"]. |
STAR align parameters | |
star.sjdbOverhang | The "--sjdbOverhang" option of STAR, specifying the length of genomic sequence on each side of the junctions; refer to the STAR documentation for more detail. Here, we use STAR's default option [default: "100"]. |
star.genomeSAindexNbases * | The "--genomeSAindexNbases" option of STAR, specifying the length (bases) of the SA pre-indexing string. This must be scaled down for small genomes, using the formula min(14, log2(GenomeLength)/2 - 1) [default: "10"]. |
star.alignIntronMax | The "--alignIntronMax" option of STAR, specifying maximum intron size [default: "1"]. |
star.limitBAMsortRAM | The "--limitBAMsortRAM" option of STAR, specifying maximum available RAM (bytes) [default: "2342750981"]. |
star.outBAMsortingBinsN | The "--outBAMsortingBinsN" option of STAR, specifying the number of genome bins for coordinate-sorting [default: "50"]. |
featureCounts parameters | |
featureCounts.type.feature | The "-t" option of featureCounts, specifying feature type(s) in a GTF annotation to be used for read mapping. Multiple types should be separated by "," with no space in between [default: "exon"]. |
featureCounts.type.attribute | The "-g" option of featureCounts, specifying attribute type in the GTF annotation [default: "gene_id"]. |
Freebayes parameters | |
freebayes.ploidy * | The "--ploidy" option of Freebayes, specifying the default ploidy for the organism used in the analysis [default: "1"]. |
freebayes.args | Additional Freebayes arguments, refer to their documentation [default: ""]. |
bcftools parameters | |
bcftools.filter_vcf.args | Additional bcftools filter arguments for filtering the VCF file, refer to their documentation [default: 'QUAL>20 && TYPE="snp"', note the use of quotation marks here]. |
Salmon indexing parameters | |
salmon.index.args | Additional arguments for salmon indexing, refer to their documentation [default: ""]. |
Salmon quantification parameters | |
salmon.quant.libtype | The "--libType" option of Salmon quant, specifying library type. CUBseq sets this to "Automatic" detection by default. Refer to their documentation for more information [default: "A"]. |
salmon.quant.args | Additional arguments for salmon quant, refer to their documentation [example: "--writeUnmappedNames"]. |
tximport parameters | |
summarize_to_gene.counts_from_abundance | Generate counts from abundances in tximport [default: "no"]. |
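The scaling formula given for star.genomeSAindexNbases in the table above, min(14, log2(GenomeLength)/2 - 1), can be evaluated with a short script. This is a minimal sketch: the function name is hypothetical, rounding the result down to an integer is an assumption made here, and the genome length used (4,641,652 bp, the E. coli K-12 MG1655 reference) is only an illustrative input.

```python
import math


def genome_sa_index_nbases(genome_length):
    """Evaluate min(14, log2(GenomeLength)/2 - 1) for a genome length in bases.

    Assumption: the non-integer result is rounded down.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    return min(14, int(math.log2(genome_length) / 2 - 1))


# Illustrative input: E. coli K-12 MG1655 reference genome length
ecoli_value = genome_sa_index_nbases(4_641_652)
print(ecoli_value)
```

For a genome of this size the sketch gives 10, which agrees with the default listed for star.genomeSAindexNbases; much larger genomes are capped at 14.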
CUBseq results are stored in the following directories:
- results/metadata/metadata.csv: file containing the ENA metadata of RNA sequencing runs.
- results/bams/: directory containing the BAM files, as processed by STAR.
- results/featureCounts/: directory containing featureCounts gene quantification results per sample and summary statistics.
- results/freebayes-vcf/: directory containing VCF files, as processed by Freebayes.
- results/vcf/: directory containing filtered VCF files, as processed by bcftools norm and bcftools filter.
- results/transcriptome-consensus/: directory containing consensus transcriptomes in FASTA format.
- results/wt-transcriptome/: directory containing the wild-type transcriptome, as generated by gffread.
- results/mut-transcriptome/: directory containing the reconstructed mutated transcriptomes per sequencing run, as processed by gffread.
- results/salmon-quant/: directory containing gene abundance results per sequencing run, as processed by salmon quantification.
- results/dataset/: directory containing the tximport RDS file that summarises salmon quantification results at the gene level (expressed as a TPM matrix).
- results/gene-rank-analysis/: directory containing results of CUBseq's gene rank analysis.
- results/heg-mut-transcriptome/: directory of FASTA files per sequencing run, containing only highly expressed genes.
- results/protein-mut-transcriptome/: directory of FASTA files per sequencing run, containing transcriptome-wide (i.e. all protein-coding) genes.
- results/cu-data/: directory containing codon usage count data for highly expressed genes and protein-coding genes, as well as from the Kazusa and CoCoPUTs databases (if available).
- results/summarise-codon-counts/: directory containing codon counts summarised at codon and amino acid resolution.
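To illustrate what summarising codon counts at amino acid resolution can look like in a downstream analysis, here is a minimal sketch that groups raw codon counts by encoded amino acid and computes each codon's share among its synonymous codons. The input dictionary, the function name and the use of the standard genetic code are assumptions for illustration; they do not describe the format of the files CUBseq writes or how it performs the summary.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def summarise_by_amino_acid(codon_counts):
    """Return (total count per amino acid, each codon's share within its amino acid)."""
    aa_totals = defaultdict(int)
    for codon, n in codon_counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    within_aa = {
        codon: n / aa_totals[CODON_TO_AA[codon]]
        for codon, n in codon_counts.items()
        if aa_totals[CODON_TO_AA[codon]] > 0
    }
    return dict(aa_totals), within_aa


# Hypothetical raw counts for three codons
aa_totals, within_aa = summarise_by_amino_acid({"GCT": 2, "GCC": 6, "AAA": 4})
print(aa_totals)
print(within_aa)
```

Stop codons are reported under "*" in this sketch; a real analysis may prefer to exclude them.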
- Anima Sutradhar ([email protected]): developer and maintainer.
- Giovanni Stracquadanio ([email protected]): principal investigator.
If you have any questions, issues or feature requests, please get in touch using the emails above or by posting an Issue.