Database Setup
1) Loci.txt table (tab-delimited)
The loci table designates specific metadata for each CRISPR array and/or CAS gene operon. This table is filled out after identifying CRISPRs with CRISPRFinder and optionally identifying the subtype.
This table can easily be made in Excel and then saved as a tab-delimited file (example).
columns needed
- Locus_ID
- Taxon_ID
- Taxon_Name
- Subtype*
- Locus_Start
- Locus_End
- CAS_Start*
- CAS_End*
- Array_Start*
- Array_End*
- Array_status**
- CAS_status***
- Genbank_file
- Array_File*
- Author
- File_Creation_Date*
* blank values allowed
** Possible values: "present", "absent"
*** Possible values: "intact", "absent", "broken", "shuffled"
"broken" = some genes missing
"shuffled" = gene order rearranged
"Taxon_ID" = FIG_ID or any other unique identifier for the genome
"Locus_ID" field must have unique locus identifiers (e.g. locus1, locus2, locus3)
optional columns
- fasta_file
- scaffold_name
- leader_start
- leader_end
- leader_sequence
- PAM_start
- PAM_end
"fasta_file" is only needed if the genbank files do not contain the genome nucleotide sequence
"scaffold_name" specifies the scaffold that the locus is on. Without this field value, 'CLDB_ONE_CHROMOSOME' will be used.
"leader_*" fields are for specifying known leader regions for loci
"PAM_*" fields are for specifying known PAM regions for loci
2) Array.txt table files
The array files designate the spacers and direct repeats in each CRISPR array, as identified by CRISPRFinder. To make the tables, copy and paste the array tables from CRISPRFinder (example).
columns needed
- Start position
- Direct repeat sequence
- Spacer sequence
- End position
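A pasted array table can be sanity-checked before it is loaded. The sketch below is a hypothetical helper (`check_array_file` is not part of CLdb). It assumes the file is tab-delimited, has the four columns in the order listed above, writes positions as plain integers, and has at most one header row. Adjust it if your CRISPRFinder tables look different.

```shell
# Hypothetical helper, not part of CLdb.
# Assumes: tab-delimited, 4 columns (start, repeat, spacer, end), optional header row.
# Prints one line per problem; returns non-zero if any problem was found.
check_array_file() {
  awk -F'\t' '
    { sub(/\r$/, "") }                          # tolerate Windows line endings
    /^[ \t]*$/ { next }                         # skip blank lines
    NR == 1 && $1 !~ /^[0-9]+$/ { next }        # skip a header row, if present
    NF != 4 {
      print "line " NR ": expected 4 columns, found " NF; bad = 1; next
    }
    $1 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
      print "line " NR ": start/end position is not an integer"; bad = 1
    }
    $2 == "" { print "line " NR ": empty direct repeat"; bad = 1 }
    END { exit (bad + 0) }
  ' "$1"
}

# Usage (the file name is only an example):
#   check_array_file "$CLdb_HOME/array/array1.txt"
```

The spacer column is deliberately not checked, since this page does not say whether it may be empty.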
3) Genbank files for each genome containing the CRISPR loci loaded into CLdb
These are needed to identify locations of CAS genes (determined by 'CAS_Start' and 'CAS_End' fields in loci.txt table).
4) Fasta files of each genome (only needed if genome sequence information is not included in the genbank files).
The genome sequence is needed to extract sequence information on the entire CRISPR array, leader regions, or spacer blast hits. If the fasta files are not provided, scripts that require them will try to make them automatically from the genbank files.
The CLdb directory name for this example will be 'CLdb_test' in your home directory. Use the following commands:
$ CLdb_HOME="$HOME/CLdb_test/"
$ mkdir "$CLdb_HOME"
$ cd "$CLdb_HOME"
$ mkdir genbank
- move/copy/symlink genbank files into the $CLdb_HOME/genbank/ directory
$ mkdir array
- move/copy/symlink array files into the $CLdb_HOME/array/ directory
(optional) $ mkdir fasta
- (optional) move/copy/symlink genome fasta files into the $CLdb_HOME/fasta/ directory
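Once the directories are populated, it may help to check the loci table against the rules in section 1 and against the files on disk. The sketch below is hypothetical and not part of CLdb. `check_loci_table` assumes header names can be matched case-insensitively and that status values are written exactly as listed in section 1. `check_referenced_files` assumes the Genbank_file, Array_File and fasta_file columns hold bare file names located in the genbank/, array/ and fasta/ directories.

```shell
# Hypothetical helpers, not part of CLdb. Each prints one line per problem.

# check_loci_table LOCI_FILE: returns non-zero if any problem was found.
check_loci_table() {
  awk -F'\t' '
    BEGIN {
      r = "Locus_ID Taxon_ID Taxon_Name Subtype Locus_Start Locus_End"
      r = r " CAS_Start CAS_End Array_Start Array_End Array_status CAS_status"
      r = r " Genbank_file Array_File Author File_Creation_Date"
      n = split(r, req, " ")
    }
    { sub(/\r$/, "") }                       # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) col[tolower($i)] = i
      for (j = 1; j <= n; j++)
        if (!(tolower(req[j]) in col)) { print "missing column: " req[j]; bad = 1 }
      if (bad) exit                          # rows cannot be checked without the header
      id = col["locus_id"]; ar = col["array_status"]; ca = col["cas_status"]
      next
    }
    /^[ \t]*$/ { next }                      # skip blank lines
    {
      if ($id == "") { print "line " NR ": empty Locus_ID"; bad = 1 }
      else if (seen[$id]++) { print "duplicate Locus_ID: " $id; bad = 1 }
      if ($ar != "present" && $ar != "absent") {
        print "line " NR ": invalid Array_status: " $ar; bad = 1
      }
      if ($ca !~ /^(intact|absent|broken|shuffled)$/) {
        print "line " NR ": invalid CAS_status: " $ca; bad = 1
      }
    }
    END { exit (bad + 0) }
  ' "$1"
}

# check_referenced_files LOCI_FILE CLDB_DIR: lists files named in the table
# that are absent from CLDB_DIR/genbank, CLDB_DIR/array or CLDB_DIR/fasta.
# Empty output means every referenced file was found.
check_referenced_files() {
  awk -F'\t' '
    { sub(/\r$/, "") }
    NR == 1 {
      for (i = 1; i <= NF; i++) col[tolower($i)] = i
      g = col["genbank_file"]; a = col["array_file"]; f = col["fasta_file"]
      next
    }
    g && $g != "" { print "genbank/" $g }
    a && $a != "" { print "array/" $a }
    f && $f != "" { print "fasta/" $f }
  ' "$1" | sort -u | while IFS= read -r rel; do
    [ -e "$2/$rel" ] || echo "missing: $rel"
  done
}

# Usage (the location of loci.txt is an assumption):
#   check_loci_table "$CLdb_HOME/loci.txt"
#   check_referenced_files "$CLdb_HOME/loci.txt" "$CLdb_HOME"
```

Because `-e` follows symlinks, a broken symlink is reported as missing, which is useful when the files were symlinked rather than copied.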
- If no scaffold names are provided in the loci.txt table, 'CLDB_ONE_CHROMOSOME' is used for the 'Scaffold' field.
- Scaffold names in genbank LOCUS IDs (e.g. 'LOCUS scaffold72_1_size14107-refined') and genome fasta files should match the scaffold names in the loci.txt table!
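A scaffold-name mismatch is easy to miss by eye, so a quick comparison can help. The sketch below is a hypothetical helper (`check_scaffold_names` is not part of CLdb). It assumes the loci table has a header row containing Genbank_file and scaffold_name columns, and that the scaffold name is the second whitespace-separated token on each genbank LOCUS line. It covers only the genbank files; genome fasta headers would need a similar check.

```shell
# Hypothetical helper, not part of CLdb.
# check_scaffold_names LOCI_FILE GENBANK_DIR
# Reports scaffold_name values with no matching LOCUS name in the genbank
# file named on the same row; returns non-zero if any are found.
check_scaffold_names() {
  awk -F'\t' -v dir="$2" '
    { sub(/\r$/, "") }                       # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) col[tolower($i)] = i
      g = col["genbank_file"]; s = col["scaffold_name"]
      if (!g || !s) exit                     # no scaffold names given: nothing to compare
      next
    }
    $g != "" && $s != "" {
      path = dir "/" $g
      if (!(path in loaded)) {               # read each genbank file only once
        loaded[path] = 1
        while ((getline line < path) > 0)
          if (line ~ /^LOCUS/) { split(line, w, " "); names[path, w[2]] = 1 }
        close(path)
      }
      if (!((path, $s) in names)) {
        print "scaffold " $s " not found in " $g; bad = 1
      }
    }
    END { exit (bad + 0) }
  ' "$1"
}

# Usage (the location of loci.txt is an assumption):
#   check_scaffold_names "$CLdb_HOME/loci.txt" "$CLdb_HOME/genbank"
```

Rows with an empty scaffold_name are skipped, since those loci fall back to 'CLDB_ONE_CHROMOSOME'.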