Downstream analyses

Analysing the output of Corekaburra is highly dependent on the biological question driving the enquiry. Here we aim to provide you with tools to aid your analysis and with ideas on how to approach such questions. If you find variations or new ways of querying the outputs that are not described here, feel free to post them in the Issues section and we will look to include them here.

Querying outputs

Standard command-line tools like grep, cut, uniq and others are very useful when querying the output of Corekaburra. They allow users to pose and answer questions with minimal effort. In the following section we will walk through ways of querying output files that we have found useful.

Get core-pairs with specific accessory gene between them

The following line will give you all core pairs and the number of times each of them has accessory_gene_x encoded between them.
grep 'accessory_gene_x' core_core_accessory_gene_content.tsv | cut -f 2,3 | sort | uniq -c

Get all accessory genes between pair of core genes

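When searching for a pair of core genes we use \t (tab) to separate the two core genes, as the file is tab-delimited. Plain grep does not interpret \t as a tab, so the -P option (Perl-compatible regular expressions, available in GNU grep) is used.
grep -P 'core_gene_1\tcore_gene_2' core_core_accessory_gene_content.tsv | cut -f 4 | sort | uniq -c
If you know that a pair of core genes should give you a result but it does not, try swapping the order of the two core genes, as the core genes within each pair in the output are sorted.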
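
As an alternative to grep, awk can compare whole columns. This avoids partial matches (a pattern for core_gene_2 also matching a gene named core_gene_20, for example) and lets both orderings of the pair be tested at once. The following is a minimal sketch on a made-up toy file; the gene and genome names are hypothetical, and the column layout (genome, core gene 1, core gene 2, accessory gene) is an assumption inferred from the commands in this section, so check the header of your own file.

```shell
# Hypothetical toy file: genome, core gene 1, core gene 2, accessory gene (assumed layout)
tmp=$(mktemp)
printf 'genome_a\tcore_gene_1\tcore_gene_2\tacc_1\n'   > "$tmp"
printf 'genome_b\tcore_gene_1\tcore_gene_2\tacc_1\n'  >> "$tmp"
printf 'genome_b\tcore_gene_1\tcore_gene_2\tacc_2\n'  >> "$tmp"
printf 'genome_a\tcore_gene_1\tcore_gene_20\tacc_3\n' >> "$tmp"

# Match columns 2 and 3 exactly, in either order, then count each accessory gene
result=$(awk -F'\t' -v a='core_gene_1' -v b='core_gene_2' \
  '($2 == a && $3 == b) || ($2 == b && $3 == a) { print $4 }' "$tmp" | sort | uniq -c)
echo "$result"
rm -f "$tmp"
```

The row for core_gene_20 is skipped because the comparison is on the full column value rather than on a substring.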

Get all accessory genes next to core gene and sequence break

As core-pairs containing a sequence break are not sorted, we have to make grep look for both arrangements of a core gene (core_gene_1 in this case) and a sequence break. This is done with | separating the two patterns we are looking for. The pattern is quoted so that the shell passes it to grep unchanged, and -P (GNU grep) is used so that \t matches a tab.
grep -P 'core_gene_1\tSequence_break|Sequence_break\tcore_gene_1' core_core_accessory_gene_content.tsv | cut -f 4 | sort | uniq -c

Get accessory genes between specific core-pair in a genome

Searching for accessory genes in a specific genome is like searching for accessory genes between a specified core-pair, but with an additional column required for the genome in question (again separated by \t, which needs grep's -P option to be read as a tab).
grep -P 'genome_x\tcore_gene_1\tcore_gene_2' core_core_accessory_gene_content.tsv | cut -f 4

Get distribution of accessory genes between core-pair

This line is useful to get an idea of how many genomes have accessory genes inserted between a core-pair, and how many accessory genes are inserted, across a dataset.
grep -P 'core_gene_1\tcore_gene_2' low_frequency_gene_placement.tsv | cut -f 5 | sort | uniq -c

Get lines for core pair that have accessory genes inserted

A good way of isolating genomes with inserted accessory genes for further analysis is using grep's -v option to exclude matches. This is done in combination with \t and $ (end of line) to signify that it is the last column we want to match. As elsewhere, -P (GNU grep) is needed for \t to be read as a tab.
grep -P 'core_gene_1\tcore_gene_2' low_frequency_gene_placement.tsv | grep -vP '\t0$'
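
The same filter can also be sketched with awk, comparing the last column as a number instead of matching text at the end of the line. The toy file below is made up: the names are hypothetical, and the column layout (two core genes, genome, distance, accessory count) is an assumption based on the field numbers used in this section, so adjust the field numbers to your own file.

```shell
# Hypothetical toy file: core gene 1, core gene 2, genome, distance, accessory count (assumed layout)
tmp=$(mktemp)
printf 'core_gene_1\tcore_gene_2\tgenome_a\t100\t0\n'    > "$tmp"
printf 'core_gene_1\tcore_gene_2\tgenome_b\t1450\t3\n'  >> "$tmp"
printf 'core_gene_1\tcore_gene_2\tgenome_c\t2300\t10\n' >> "$tmp"
printf 'core_gene_3\tcore_gene_4\tgenome_a\t80\t2\n'    >> "$tmp"

# Keep rows for the chosen core pair where the accessory count (column 5) is above zero
inserted=$(awk -F'\t' '$1 == "core_gene_1" && $2 == "core_gene_2" && $5 > 0' "$tmp")
echo "$inserted"
rm -f "$tmp"
```

A numeric comparison also makes it simple to raise the threshold, for instance to $5 >= 5, when only larger insertions are of interest.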

Get distribution of basepairs between core-pair

Sometimes it is useful to see whether the number of base pairs between two genes varies across genomes. For two core genes, a distribution of this distance can be produced by:
grep -P 'core_gene_1\tcore_gene_2' low_frequency_gene_placement.tsv | cut -f 4 | sort | uniq -c
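
When there are many distinct distances, a short summary can be easier to read than the full list of counts. The sketch below computes the number of observations, minimum, maximum and mean for one core pair. The toy data and names are hypothetical, and the column layout (distance in column 4) is assumed from the command above.

```shell
# Hypothetical toy file: core gene 1, core gene 2, genome, distance, accessory count (assumed layout)
tmp=$(mktemp)
printf 'core_gene_1\tcore_gene_2\tgenome_a\t100\t0\n'   > "$tmp"
printf 'core_gene_1\tcore_gene_2\tgenome_b\t100\t0\n'  >> "$tmp"
printf 'core_gene_1\tcore_gene_2\tgenome_c\t250\t1\n'  >> "$tmp"
printf 'core_gene_1\tcore_gene_2\tgenome_d\t1450\t3\n' >> "$tmp"
printf 'core_gene_3\tcore_gene_4\tgenome_a\t80\t2\n'   >> "$tmp"

# Pull the distances for one core pair, sort them numerically, then summarise.
# After sorting, the first value is the minimum and the last value seen is the maximum.
stats=$(awk -F'\t' '$1 == "core_gene_1" && $2 == "core_gene_2" { print $4 }' "$tmp" \
  | sort -n \
  | awk 'NR == 1 { min = $1 }
         { sum += $1; max = $1 }
         END { if (NR > 0) printf "n=%d min=%d max=%d mean=%.1f\n", NR, min, max, sum / NR }')
echo "$stats"
rm -f "$tmp"
```

A large gap between the minimum and the maximum is a quick hint that some genomes carry an insertion between the two core genes.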

Analysis ideas

Most often an analysis of the Corekaburra output starts with the core_pair_summary.csv file. This file provides a good overview with multiple sources of information (see output section for more).

A general approach

A good way of analysing a pan-genome in more depth is to sort the core_pair_summary.csv file by max_acc and go through each core-pair region, down to a minimum number of accessory genes (3, 5, or 10, depending on your aim), noting what is in each region (mobile genetic elements, other accessory genes, potential relation to function). This process is laborious, but it can to some extent be automated.
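
The sorting and thresholding step could be sketched on the command line as follows. Only the max_acc column name comes from the text above; the other column names, the pair names and the values in the toy file are made up for illustration. The sketch looks up the position of max_acc from the header rather than hard-coding it, and it assumes a simple CSV without quoted fields that contain commas.

```shell
# Hypothetical toy summary; only the max_acc column name is taken from the text
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
core_pair,n_genomes,max_acc
pair_1,10,2
pair_2,10,12
pair_3,10,5
pair_4,10,0
EOF

min_acc=3  # minimum number of accessory genes for a region to be inspected

# Find which column holds max_acc by reading the header line
col=$(head -n 1 "$tmp" | tr ',' '\n' | grep -n -x 'max_acc' | cut -d ':' -f 1)

# Sort the data rows by that column (numeric, descending) and keep those at or above the threshold
top=$(tail -n +2 "$tmp" \
  | sort -t ',' -k "${col},${col}nr" \
  | awk -F',' -v c="$col" -v m="$min_acc" '$c >= m')
echo "$top"
rm -f "$tmp"
```

The resulting list of core pairs can then be worked through region by region, starting with the largest.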

Looking for mobile genetic elements?

Some mobile genetic elements (prophages, integrative conjugative elements, transposons etc.) can be large in terms of the number of genes they encode. Because of this, sorting the core_pair_summary.csv file by max_acc makes it easier to determine which core-pair regions to examine. Using a tool like Magphi with the nucleotide sequences of core-pairs as seed sequences, it can be easy to pull out and examine regions of interest.

Looking for genomic inversions

Genomic inversions/rearrangements have been linked to adaptations in a population. One example can be found in Enterococcus faecium. It is possible to identify inversions using the Corekaburra output, but the process can require some wrangling of the output files. First, candidate core-pairs produced by inversions must be identified. Next, the genomes containing these pairs must be found. This process is of course easier in complete genomes, where synteny is complete. However, it can be done in incomplete genomes with more work and potential follow-up experiments.
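
One possible heuristic for these two steps, offered as an untested sketch and not as a method prescribed by Corekaburra, is to treat core pairs seen in only a few genomes as candidates and then list the genomes that carry them. The input here is a hypothetical three-column table (genome, core gene 1, core gene 2) with one row per neighbouring core pair per genome; it would have to be assembled from the output files first. In the toy data, genome g4 has the segment between core genes A and D inverted.

```shell
# Hypothetical table: genome, core gene 1, core gene 2 (one row per neighbouring pair per genome)
tmp=$(mktemp)
for g in g1 g2 g3; do
  printf '%s\tA\tB\n%s\tB\tC\n%s\tC\tD\n' "$g" "$g" "$g" >> "$tmp"
done
printf 'g4\tA\tC\ng4\tB\tC\ng4\tB\tD\n' >> "$tmp"

max_genomes=1  # a pair seen in at most this many genomes is treated as a candidate

# Step 1: count how many rows (genomes) each core pair occurs in and keep the rare ones
candidates=$(cut -f 2,3 "$tmp" | sort | uniq -c \
  | awk -v m="$max_genomes" '$1 <= m { print $2 "\t" $3 }')

# Step 2: list the genomes that contain any candidate pair
genomes=$(printf '%s\n' "$candidates" \
  | awk -F'\t' 'NR == FNR { want[$1 FS $2] = 1; next }
                ($2 FS $3) in want { print $1 }' - "$tmp" \
  | sort -u)
echo "$genomes"
rm -f "$tmp"
```

Rare pairs can also arise from assembly errors or from gene gain and loss, so candidates found this way still need manual inspection.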

Wilder but untested ideas

Increasing the power of GWAS or association-like studies?
Could accessory genes be summarised based on co-occurrence across genomes and core regions to decrease the number of statistical tests carried out in association studies, therefore increasing power?
Understand genome structure without having a complete genome
If you have enough genomes that are not closed/complete, is it then possible to get a 'pseudo' structure of the genome by using core gene synteny from all genomes? This would require the genomes to assemble across different regions, so that they correct and overlap differing parts of the 'pseudo' structure. This is complicated by genomes not having one single structure or layout due to rearrangements, and by some parts of genomes being more likely to assemble poorly (regions with rRNAs and tRNAs, for example).
Having a pseudo structure could help in scaffolding draft genomes into a complete genome, with some caveats.