Initial data loading

Workflow

Making the sqlite tables & loading loci table

Spacers and direct repeats

CRISPR-associated genes

Leader region

Making the sqlite tables & loading loci table

making the database tables

$ CLdb_makeDB.pl

By default, the database file is named 'CLdb.sqlite'
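To confirm that the tables were created, the database file can be inspected directly. The sketch below is not part of CLdb; it uses only Python's standard `sqlite3` module, and the helper name `list_tables` is hypothetical. It assumes nothing about the CLdb schema and simply reports whatever tables exist.

```python
import os
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the sorted names of all tables in an SQLite database file."""
    # sqlite3.connect() would silently create a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables("CLdb.sqlite")` after running CLdb_makeDB.pl should return a non-empty list if the database was built.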

loading the loci table

The 'loci' table has all of the basic info on each CRISPR locus (for formatting, see Database setup).

$ CLdb_loadLoci.pl -d CLdb.sqlite < loci.txt

adding number of scaffolds to the loci table

This is only needed if the loci table includes genomes containing more than one sequence (multiple scaffolds, chromosomes, etc.).

$ CLdb_addScaffoldCount.pl -d CLdb.sqlite

Spacers and direct repeats

loading arrays and direct repeats to their respective tables

Array files must be designated in the loci table! Array files should be located in $CLdb_home/array (see Database setup).

$ CLdb_loadArrays.pl -d CLdb.sqlite

grouping spacers and direct repeats (groups with same sequence)

Spacers and DRs with exactly the same sequence (reverse-complement accounted for) are placed in the same group (i.e., the sequences are de-replicated).

$ CLdb_groupArrayElements.pl -d CLdb.sqlite -s -r
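The grouping idea can be illustrated with a short sketch. This is not the CLdb implementation; it is one common way to de-replicate sequences while treating a sequence and its reverse-complement as identical, by reducing each sequence to a canonical form. The function names and the example spacers are hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse-complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one strand as the representative: the lexicographically smaller."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def group_identical(seqs):
    """Group names whose sequences are identical on either strand.

    seqs: dict mapping name -> sequence.
    Returns a sorted list of sorted name lists, one per group.
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[canonical(seq)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up spacers: s2 is the reverse-complement of s1.
spacers = {"s1": "ACGTTG", "s2": "CAACGT", "s3": "ACGTTA"}
print(group_identical(spacers))  # [['s1', 's2'], ['s3']]
```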

pseudo-hierarchical clustering of spacers & DRs (good for plotting loci)

Same as CLdb_groupArrayElements.pl, but many sequence identity cutoffs are used.

$ CLdb_hclusterArrays.pl -d CLdb.sqlite -s -r
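As a rough illustration of clustering at several identity cutoffs, the sketch below runs a simple greedy clustering once per cutoff. The clustering method that CLdb_hclusterArrays.pl actually uses is not described here, so everything in this block is an assumption: identity is computed as the fraction of matching positions between equal-length, ungapped sequences, reverse-complements are not considered, and all names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions (assumes equal-length, ungapped input)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, cutoff):
    """Assign each sequence to the first cluster whose representative it matches.

    seqs: dict mapping name -> sequence (processed in insertion order).
    Returns a dict mapping name -> cluster index.
    """
    representatives = []
    assignment = {}
    for name, seq in seqs.items():
        for idx, rep in enumerate(representatives):
            if identity(seq, rep) >= cutoff:
                assignment[name] = idx
                break
        else:
            # No existing cluster is close enough: start a new one.
            representatives.append(seq)
            assignment[name] = len(representatives) - 1
    return assignment


def cluster_at_cutoffs(seqs, cutoffs):
    """Run the greedy clustering once for each identity cutoff."""
    return {cutoff: greedy_cluster(seqs, cutoff) for cutoff in cutoffs}
```

Lower cutoffs merge more sequences into the same cluster, which is what makes the result useful for plotting related loci at different levels of similarity.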

calculating direct repeat consensus sequences

Direct repeats can vary within a CRISPR array, especially at the trailer end. This step calculates and loads the consensus of the direct repeats for each locus.

$ CLdb_loadDRConsensus.pl -d CLdb.sqlite
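A consensus can be pictured as a per-position majority vote. The sketch below is not the CLdb method; it assumes all direct repeats of a locus have the same length (or have already been aligned), and it breaks ties alphabetically so the result is deterministic. The function name and the example repeats are hypothetical.

```python
from collections import Counter


def consensus(repeats):
    """Return the per-column majority sequence of equal-length repeats."""
    if not repeats:
        raise ValueError("no repeats given")
    length = len(repeats[0])
    if any(len(r) != length for r in repeats):
        raise ValueError("repeats must all have the same length")
    out = []
    for column in zip(*repeats):
        counts = Counter(column)
        # sorted() makes ties resolve to the alphabetically first character.
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)


# Made-up repeats: the last one is degenerate, as might occur at the trailer end.
print(consensus(["GTTTCAGA", "GTTTCAGA", "GTTTTAGC"]))  # GTTTCAGA
```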

pairwise blast of all spacers

Spacers may partially overlap in sequence, potentially due to the acquisition of partially overlapping protospacers. Pairwise blasts of spacers can help identify this phenomenon.

$ CLdb_spacerPairwiseBlast.pl -d CLdb.sqlite
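To make "partially overlapping" concrete, the sketch below finds the longest exact suffix-prefix overlap between two spacers. This is a simplified stand-in for illustration only, not a replacement for BLAST: it ignores mismatches and the reverse strand, and the function name, the `min_len` parameter, and the example sequences are hypothetical.

```python
def longest_overlap(a, b, min_len=5):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


# Made-up spacers sharing the 6 bases 'GGGTTT' at their junction.
print(longest_overlap("AAACCCGGGTTT", "GGGTTTACACAC"))  # 6
```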

CRISPR-associated genes

getting genes in CRISPR locus region (defined in Loci table)

Gene info is obtained from the genbank file specified in the loci table (for more info, see Database setup).

$ CLdb_getGenesInLoci.pl -d CLdb.sqlite > gene_table.txt

Optional: manually curate the 'gene_alias' column values before loading the table into CLdb.

loading genes into the Genes table

$ CLdb_loadGenes.pl -d CLdb.sqlite < gene_table.txt

get and load genes in one command

Just run this if you don't need to manually curate the gene table.

$ CLdb_getGenesInLoci.pl -d CLdb.sqlite | CLdb_loadGenes.pl -d CLdb.sqlite

Leader region

getting potential leader regions

Potential leader regions can be identified from degeneracies in the direct repeats (this assumes that degeneracies are more frequent at the trailer end). Alternatively, both regions adjacent to the CRISPR array can be written.

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna

getting potential leader regions for just 1 subtype

$ CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna

identifying leaders

Align the potential leader regions and determine the point in the alignment where conservation ends.

$ mafft --adjustdirection leaders_IB.fna > leaders_IB_aln.fna

- '--adjustdirection' is needed because leaders can be reverse-complemented relative to each other.
- If 2 leaders are written for a single locus, remove the one that does not align (there can be only one!).
- Determine where leader conservation ends.
  - For example: conservation ends 50bp from the end of the alignment.
  - This unconserved portion will be trimmed off of the leader region when it is added to the database.
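Finding where conservation ends is usually done by eye, but a rough estimate can be computed from the alignment. The sketch below is only an aid under stated assumptions: the aligned sequences are equal-length strings, the unconserved region lies at the end of the alignment, gaps count as their own character, and the threshold is an arbitrary choice. The function names are hypothetical, and the result is in alignment columns, which can differ from bp when gaps are present.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common character, per column."""
    n = len(aligned)
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*aligned)]


def trailing_unconserved(aligned, threshold=0.8):
    """Count columns at the end of the alignment that fall below threshold.

    Counting stops at the first column (from the end) that meets the threshold.
    """
    count = 0
    for value in reversed(column_conservation(aligned)):
        if value >= threshold:
            break
        count += 1
    return count
```

A single column that is conserved by chance near the end will stop the count early, so the number should be checked against the alignment before it is used as a trim length.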

loading identified leader regions

$ CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 test_leader_Ib.fna test_leader_Ib_aln.fna

- '-t 50' = trim off the last 50bp of unconserved sequence in the alignment.
  - The 50bp are trimmed from the side farthest from the array.
- Both the aligned and unaligned sequences are needed because mafft can alter orientation during alignment (--adjustdirection).

grouping leaders (100% sequence identity)

Works much like CLdb_groupArrayElements.pl

$ CLdb_groupLeaders.pl -da CLdb.sqlite
