Initial data loading
Making the sqlite tables & loading loci table
Spacers and direct repeats
- Loading arrays
- Grouping spacers and direct repeats
- Pseudo-hierarchical clustering
- Direct repeat consensus sequence
- Pairwise blast of all spacers
CRISPR-associated genes
Leader region
$ CLdb_makeDB.pl
By default, the database file is named 'CLdb.sqlite'.
The 'loci' table has all of the basic info on each CRISPR locus (for formatting, see Database setup).
$ CLdb_loadLoci.pl -d CLdb.sqlite < loci.txt
This step is only needed if the loci table includes genomes that contain more than one sequence (multiple scaffolds, chromosomes, etc.).
$ CLdb_addScaffoldCount.pl -d CLdb.sqlite
Array files must be designated in the loci table! Array files should be located in $CLdb_home/array (see Database setup).
$ CLdb_loadArrays.pl -d CLdb.sqlite
Spacers and direct repeats (DRs) with exactly the same sequence (reverse-complement accounted for) are placed in the same group (i.e., the sequences are de-replicated).
$ CLdb_groupArrayElements.pl -d CLdb.sqlite -s -r
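The strand-aware de-replication described above can be illustrated with a minimal Python sketch. It is not the code used by CLdb_groupArrayElements.pl; the function names and the example spacers are hypothetical, and it only shows the idea of treating a sequence and its reverse complement as the same group.

```python
from collections import defaultdict
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Strand-independent key: the smaller of the sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def group_exact(seqs: Dict[str, str]) -> Dict[int, List[str]]:
    """Assign identical sequences (either strand) to the same numbered group."""
    by_key = defaultdict(list)
    for name, seq in seqs.items():
        by_key[canonical(seq)].append(name)
    # Group IDs are arbitrary; here they follow first-seen order, starting at 1.
    return {i: names for i, names in enumerate(by_key.values(), start=1)}


# Hypothetical spacers: s2 is the reverse complement of s1, s3 differs by one base.
spacers = {"s1": "ATGCCGTA", "s2": "TACGGCAT", "s3": "ATGCCGTT"}
groups = group_exact(spacers)
```

In this toy input, s1 and s2 fall into one group and s3 into another, because s3 matches neither strand of s1 exactly.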
Same as CLdb_groupArrayElements.pl, but multiple sequence identity cutoffs are used.
$ CLdb_hclusterArrays.pl -d CLdb.sqlite -s -r
Direct repeats can vary in a CRISPR, especially at the trailer end. Calculate and load the consensus of the direct repeats for each locus.
$ CLdb_loadDRConsensus.pl -d CLdb.sqlite
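To make the idea of a per-locus consensus concrete, here is a minimal Python sketch of a column-wise majority consensus. It assumes the direct repeats of a locus are already the same length (or have been aligned); how CLdb_loadDRConsensus.pl actually handles length differences or ties is not described here, so treat the function name and the tie-breaking rule as assumptions.

```python
from collections import Counter
from typing import List


def column_consensus(repeats: List[str]) -> str:
    """Majority base at each position across equal-length repeats.

    Ties are broken in favor of the base seen first in the column
    (an arbitrary choice made for this sketch).
    """
    if not repeats:
        raise ValueError("at least one repeat is required")
    length = len(repeats[0])
    if any(len(r) != length for r in repeats):
        raise ValueError("repeats must all be the same length (align them first)")

    consensus = []
    for column in zip(*repeats):
        counts = Counter(base.upper() for base in column)
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical locus: the last repeat (trailer end) is degenerate at two positions.
repeats = ["GTTTCAATCC", "GTTTCAATCC", "GTTTCAATCC", "GTTTCAATTA"]
consensus = column_consensus(repeats)
```

The degenerate trailer-end repeat is outvoted at both variable positions, so the consensus equals the common repeat sequence.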
Spacers may partially overlap in sequence, potentially because partially overlapping protospacers were acquired. Pairwise blasts of spacers can help identify this phenomenon.
$ CLdb_spacerPairwiseBlast.pl -d CLdb.sqlite
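As a rough illustration of what partial overlap between spacers looks like, the following Python sketch reports spacer pairs that share a long exact run of bases. It is a stand-in technique (longest common substring via the standard-library difflib module), not BLAST and not what CLdb_spacerPairwiseBlast.pl does; it ignores mismatches and the reverse strand, and the names and the `min_len` threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def longest_shared_run(a: str, b: str) -> str:
    """Longest exact substring shared by two sequences (same strand only)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def overlapping_pairs(spacers: Dict[str, str], min_len: int = 10) -> List[Tuple[str, str, str]]:
    """Return (name1, name2, shared_run) for pairs sharing at least min_len bases."""
    hits = []
    for (name1, seq1), (name2, seq2) in combinations(sorted(spacers.items()), 2):
        run = longest_shared_run(seq1, seq2)
        if len(run) >= min_len:
            hits.append((name1, name2, run))
    return hits


# Hypothetical spacers: the end of sp1 overlaps the start of sp2.
spacers = {
    "sp1": "ACGTACGTTTGACCATGGAT",
    "sp2": "TTGACCATGGATCCGGAATC",
    "sp3": "GGGGGCCCCCAAAAATTTTT",
}
hits = overlapping_pairs(spacers, min_len=10)
```

Only the sp1/sp2 pair passes the threshold in this toy input. A real analysis would use the blast-based script above, which also tolerates mismatches.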
Gene info is obtained from the GenBank file specified in the loci table (for more info, see Database setup).
$ CLdb_getGenesInLoci.pl -d CLdb.sqlite > gene_table.txt
Optional: manually curate the 'gene_alias' column values before loading the table into CLdb.
$ CLdb_loadGenes.pl -d CLdb.sqlite < gene_table.txt
Alternatively, just run this single command if you don't need to manually curate the gene table.
$ CLdb_getGenesInLoci.pl -d CLdb.sqlite | CLdb_loadGenes.pl -d CLdb.sqlite
Potential leader regions can be identified from degeneracies in the direct repeats (this assumes that degeneracies are more frequent at the trailer end). Alternatively, both regions adjacent to the CRISPR array can be written.
$ CLdb_getLeaderRegions.pl -d CLdb.sqlite > possible_leaders.fna
$ CLdb_getLeaderRegions.pl -d CLdb.sqlite -q "AND subtype='I-B'" > leaders_IB.fna
Align the potential leader regions and determine the point in the alignment where conservation ends.
$ mafft --adjustdirection leaders_IB.fna > leaders_IB_aln.fna
- '--adjustdirection' is needed because leaders can be reverse-complemented relative to each other.
- If 2 leaders are written for a single locus, remove the one that does not align (there can be only one!).
- Determine where leader conservation ends.
- For example: conservation ends 50bp from the end of the alignment.
- This unconserved portion will be trimmed off of the leader region when it is added to the database.
$ CLdb_loadLeaders.pl -d CLdb.sqlite -t 50 leaders_IB.fna leaders_IB_aln.fna
- '-t 50' = trim off the last 50bp of unconserved sequence in the alignment.
  - The 50bp are trimmed from the side farthest from the array.
- Both the aligned and unaligned sequences are needed because mafft can alter orientation during alignment (--adjustdirection).
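The relationship between trimming alignment columns and trimming the actual leader sequence can be sketched in Python. This is only an illustration under stated assumptions, not the logic of CLdb_loadLeaders.pl: it assumes the unconserved region sits at the end of the alignment, that the unaligned sequence has already been put in the same orientation as its aligned counterpart, and that mafft marks sequences it reverse-complemented with an '_R_' name prefix. All function names are hypothetical.

```python
def was_reversed(aligned_name: str) -> bool:
    """True if the alignment name carries the '_R_' prefix mafft adds with --adjustdirection."""
    return aligned_name.startswith("_R_")


def bases_in_trimmed_columns(aligned_seq: str, trim_cols: int) -> int:
    """Count real bases (non-gap characters) in the last trim_cols alignment columns."""
    if trim_cols <= 0:
        return 0
    tail = aligned_seq[-trim_cols:]
    return sum(1 for char in tail if char != "-")


def trim_leader(unaligned_seq: str, aligned_seq: str, trim_cols: int) -> str:
    """Remove from the unaligned leader the bases that fall in the trimmed columns.

    Alignment columns include gaps, so trimming N columns removes
    at most N bases from any one sequence.
    """
    n_bases = bases_in_trimmed_columns(aligned_seq, trim_cols)
    return unaligned_seq[:len(unaligned_seq) - n_bases]


# Hypothetical leader: 14 alignment columns, 10 real bases.
aligned = "ATGC--ATTA--GC"
unaligned = aligned.replace("-", "")
trimmed = trim_leader(unaligned, aligned, trim_cols=5)
```

Here the last 5 alignment columns contain 2 gaps, so only 3 bases are removed from the unaligned sequence. This is one reason both the aligned and unaligned files are useful when loading leaders.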
CLdb_groupLeaders.pl works much like CLdb_groupArrayElements.pl.
$ CLdb_groupLeaders.pl -da CLdb.sqlite