File and Test Organization
####Data Sets####
Root: brain:~/local_projects/paladin/data_sets
Description
This directory contains:
- References
  - 6 MCBS913 data sets (FASTA and GFF)
  - NT and AA translations of both versions of the UniProt DB (full, and filtered to exclude the 6 data sets above)
- Reads
  - PE sets for each of the 6 MCBS913 data sets
  - SE set of the previous 6 concatenated, used as simulated metagenomic reads (metareads.fq)
  - PE set from a real metagenome (Jun_MW4)
All tests use symbolic links to these files. Read-mapping index files (PAC/BWT/SA) are stored in each individual test's directory, not with the data set.
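The symbolic-link layout described above could be set up with a short helper like the following. This is only a sketch: the function name `link_data_files` and the file patterns (`*.fasta`, `*.gff`, `*.fq`) are assumptions for illustration, not part of the existing scripts.

```python
from pathlib import Path


def link_data_files(data_root, test_dir, patterns=("*.fasta", "*.gff", "*.fq")):
    """Symlink matching files found under data_root into test_dir.

    The patterns are assumed extensions; adjust them to the real file names.
    Index files (PAC/BWT/SA) are not linked, since each test keeps its own.
    """
    data_root = Path(data_root).expanduser()
    test_dir = Path(test_dir).expanduser()
    test_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    for pattern in patterns:
        for src in sorted(data_root.rglob(pattern)):
            dest = test_dir / src.name
            # Leave any existing file or link untouched.
            if not dest.exists() and not dest.is_symlink():
                dest.symlink_to(src.resolve())
            linked.append(dest)
    return linked
```

For example, `link_data_files("~/local_projects/paladin/data_sets", "~/local_projects/paladin/test-seed_length/1")` would populate one test subdirectory with links while leaving the data set directory unchanged.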
####Seed Testing####
Root: brain:~/local_projects/paladin/test-seed_length
Description
Testing the relationship between read mapped percentages and seed length
Instructions
- genIndices.sh indexes all references.
- alignSeed.sh runs the tests for all single-genome read sets (1-3 below).
- alignMetagenome.sh runs the tests for the metagenome read set (4 below).
Notes
Each subdirectory under the root directory is named with a numeral identifying the read set being run against the references. Each of the 3 references is also stored within each subdirectory. Outputs are written to each directory as samstat files, which should be compiled into a single CSV file with the sam2csv script. Values are as follows (Reads, References):
1. AcidovoraxAvenaeATCC19860
   - Acidovorax_citrulli_AAC00_1_uid58429_NC_008752 (0.4%)
   - Variovorax_paradoxus_EPS_uid62107_NC_014931 (15.3%)
   - Thiomonas_intermedia_K12_uid48825_NC_014153 (31.1%)
2. EscherichiaColiStrK-12SubstrMG1655
   - Escherichia_coli_042_uid161985_NC_017626 (0.5%)
   - Yersinia_pestis_A1122_uid158119_NC_017168 (15.4%)
   - Haemophilus_parainfluenzae_T3T1_uid72801_NC_015964 (31%)
3. StaphylococcusEpidermidisATCC12228
   - Staphylococcus_pasteuri_SP1_NC_022737 (3.8%)
   - Macrococcus_caseolyticus_JCSC5402_NC_011995 (17%)
   - Bacillus_cellulosilyticus_DSM2522_NC_014829 (N/A)
4. Metagenome
   - Iterates through directories/sets above
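The notes above call for compiling the per-directory samstat files into a single CSV. A minimal sketch of that step follows. It is not the sam2csv script itself: it assumes each samstat file holds `samtools flagstat` output and ends in `.samstat`, and the function and column names are made up for illustration.

```python
import csv
import re
from pathlib import Path

# Lines of interest in `samtools flagstat` output (format assumed):
#   "1000 + 0 in total (QC-passed reads + QC-failed reads)"
#   "250 + 0 mapped (25.00% : N/A)"
TOTAL_RE = re.compile(r"^(\d+) \+ \d+ in total")
MAPPED_RE = re.compile(r"^(\d+) \+ \d+ mapped \(")


def parse_flagstat(text):
    """Return (total_reads, mapped_reads, mapped_percent) from flagstat text."""
    total = mapped = None
    for line in text.splitlines():
        match = TOTAL_RE.match(line)
        if match:
            total = int(match.group(1))
            continue
        match = MAPPED_RE.match(line)
        if match:
            mapped = int(match.group(1))
    if total is None or mapped is None:
        raise ValueError("input does not look like flagstat output")
    percent = 100.0 * mapped / total if total else 0.0
    return total, mapped, percent


def compile_samstats(root, out_stream):
    """Write one CSV row per *.samstat file found under root."""
    root = Path(root)
    writer = csv.writer(out_stream)
    writer.writerow(["file", "total_reads", "mapped_reads", "mapped_percent"])
    for path in sorted(root.rglob("*.samstat")):
        total, mapped, percent = parse_flagstat(path.read_text())
        writer.writerow(
            [path.relative_to(root).as_posix(), total, mapped, f"{percent:.2f}"]
        )
```

Calling `compile_samstats(root_dir, sys.stdout)` and redirecting the output would give one row per read set and reference combination, with the numbered subdirectory visible in the `file` column.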
####ORF Length Testing####
Root: brain:~/local_projects/paladin/test-orf_length
Description
Testing the relationship between read mapped percentages and minimum ORF length filtering. NOTE - this test is likely deprecated with new algorithm variants.
Instructions
- genIndices.sh indexes all references.
- alignOrfs.sh runs the tests for all single-genome read sets (1-3 below).
- alignMetagenome.sh runs the tests for the metagenome read set (4 below).
Notes
Each subdirectory under the root directory is named with a numeral identifying the read set being run against the references. Each of the 3 references is also stored within each subdirectory. Outputs are written to each directory as samstat files, which should be compiled into a single CSV file with the sam2csv script. Values are as follows (Reads, References):
1. AcidovoraxAvenaeATCC19860
   - Acidovorax_citrulli_AAC00_1_uid58429_NC_008752 (0.4%)
   - Variovorax_paradoxus_EPS_uid62107_NC_014931 (15.3%)
   - Thiomonas_intermedia_K12_uid48825_NC_014153 (31.1%)
2. EscherichiaColiStrK-12SubstrMG1655
   - Escherichia_coli_042_uid161985_NC_017626 (0.5%)
   - Yersinia_pestis_A1122_uid158119_NC_017168 (15.4%)
   - Haemophilus_parainfluenzae_T3T1_uid72801_NC_015964 (31%)
3. StaphylococcusEpidermidisATCC12228
   - Staphylococcus_pasteuri_SP1_NC_022737 (3.8%)
   - Macrococcus_caseolyticus_JCSC5402_NC_011995 (17%)
   - Bacillus_cellulosilyticus_DSM2522_NC_014829 (N/A)
4. Metagenome
   - Iterates through directories/sets above
####No Hidden Stop Count per Frame Testing####
Root: brain:~/local_projects/paladin/test-no_hidden_stop_count
Description
Using PALADIN variant 1, index all 6 frames for the combined MCBS913 data set, as well as the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The number of frames with no hidden stop codons is then counted.
Instructions
- Run a PALADIN index using the all-6-frame index variant
- Run ~/repos/paladin/Scripts/countNoHiddenStop.py file.pro startLength, endLength, stepLength
- Redirect the output to a CSV file; it includes column headings
Notes
The results of this test can be found in "No Hidden Stop Counts.xlsx"
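To make the counting step concrete, here is a rough sketch of one way such a count could work. It is not the logic of countNoHiddenStop.py, which is not shown here. The assumptions: the .pro file is FASTA-style, the frame number is the first character after `>`, stop codons appear as `*`, and for each length from the start value to the end value a sequence counts if its first N residues contain no stop.

```python
from collections import Counter


def read_pro(lines):
    """Yield (frame, sequence) pairs from FASTA-style lines.

    The frame is taken as the first header character after '>' (assumed layout).
    """
    frame, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if frame is not None:
                yield frame, "".join(chunks)
            frame, chunks = line[1:2], []
        else:
            chunks.append(line)
    if frame is not None:
        yield frame, "".join(chunks)


def count_no_stop(records, start, end, step, stop="*"):
    """For each length N, count per frame the sequences whose first N residues
    contain no stop symbol. Sequences shorter than N are skipped."""
    lengths = list(range(start, end + 1, step))
    counts = {n: Counter() for n in lengths}
    for frame, seq in records:
        for n in lengths:
            if len(seq) >= n and stop not in seq[:n]:
                counts[n][frame] += 1
    return counts


def to_rows(counts):
    """Turn the counts into CSV-ready rows, with a heading row first."""
    frames = sorted({f for c in counts.values() for f in c})
    rows = [["length"] + [f"frame_{f}" for f in frames]]
    for n in sorted(counts):
        rows.append([n] + [counts[n][f] for f in frames])
    return rows
```

The rows from `to_rows` can be written with `csv.writer(sys.stdout).writerows(...)` and redirected to a file, matching the redirect step in the instructions.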
####Order of Likelihood of Stop Codons by Frame####
Root: brain:~/local_projects/paladin/test-stop_likelihood
Description
Using PALADIN variant 1, index all 6 frames for the combined MCBS913 data set, as well as the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The likelihood of stop codons per frame is then reported in a matrix view.
Instructions
- Run a PALADIN index using the all-6-frame index variant
- Run ~/repos/paladin/Scripts/stopLikelihoodCounts.py file.pro
- Redirect the output to a CSV file
Notes
The results of this test can be found in "Stop Stats.xlsx"
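As an illustration of what a per-frame stop likelihood could look like, the sketch below computes stops per residue for each frame and then orders the frames from most to least stop-dense. This is one plausible reading of the test, not the actual logic of stopLikelihoodCounts.py. The `(frame, sequence)` input pairs and the `*` stop symbol are assumptions.

```python
from collections import defaultdict


def stop_rate_by_frame(records, stop="*"):
    """Return {frame: stops per residue} for (frame, aa_sequence) pairs."""
    stops = defaultdict(int)
    residues = defaultdict(int)
    for frame, seq in records:
        stops[frame] += seq.count(stop)
        residues[frame] += len(seq)
    # Frames with no residues at all are left out to avoid dividing by zero.
    return {f: stops[f] / residues[f] for f in sorted(residues) if residues[f]}


def rank_frames(rates):
    """Order frames from highest to lowest stop rate (ties broken by name)."""
    return sorted(rates, key=lambda f: (-rates[f], f))
```

With frame 0 being the correctly aligned frame, one would expect it to sit at or near the end of the `rank_frames` ordering, though that is an expectation rather than a result reported here.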
####Order of Likelihood of Stop Codons by GC Content####
Root: brain:~/local_projects/paladin/test-stoplikelihood2
Description
Using PALADIN variant 1, index all 6 frames of the UniProt DB. The frame number is used as the first character in the sequence header of each AA sequence, with 0 being the correctly aligned reading frame for the protein in question. The GC content is used as the second field in the sequence header. The likelihood of stop codons per GC content is then reported in a matrix view.
Instructions
- Run a PALADIN index using the all-6-frame index variant and the testing index protein generation function
- Run ~/repos/paladin/Scripts/stopLikelihoodCountsGC.py file.pro Order
- Redirect to CSV file
Notes
The results of this test can be found in "Stop Stats.xlsx"
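For readers unfamiliar with the GC-based grouping, the sketch below shows how GC content can be computed from a nucleotide sequence, binned, and used to aggregate stop rates. It is illustrative only and does not reproduce stopLikelihoodCountsGC.py. The 10-point bin width, the `*` stop symbol, and all function names are assumptions.

```python
from collections import defaultdict


def gc_content(nt_seq):
    """Fraction of G and C among the A/C/G/T bases of a nucleotide sequence."""
    seq = nt_seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_bin(gc_fraction, width_pct=10):
    """Lower edge (in percent) of the bin holding gc_fraction; 100% joins the top bin."""
    pct = int(round(gc_fraction * 100))
    return min(pct // width_pct * width_pct, 100 - width_pct)


def stop_rate_by_gc(records, stop="*", width_pct=10):
    """Return {gc_bin: stops per residue} for (gc_fraction, aa_sequence) pairs."""
    stops = defaultdict(int)
    residues = defaultdict(int)
    for gc, seq in records:
        key = gc_bin(gc, width_pct)
        stops[key] += seq.count(stop)
        residues[key] += len(seq)
    return {k: stops[k] / residues[k] for k in sorted(residues) if residues[k]}
```

Working in integer percent for the bin edges avoids floating-point surprises at bin boundaries, which is why `gc_bin` rounds before dividing.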
####All Alignment Tests####
Root: brain:~/local_projects/paladin/test-alignXXX
Description
All alignment tests are run using the following automated pipeline:
- Run a PALADIN index using the appropriate variant (stderr and runtime are reported in the .LOG file)
- Run the alignment using the appropriate variant, with the output redirected to a .SAM file (stderr and runtime are reported in the .LOG file)
- Convert .SAM to .BAM
- Flagstats are saved to a .SAMSTAT file
- listMappedCDS.py looks up the corresponding GFF CDS entry for each mapped read in the SAM file, and saves this list to a .CDS file
- listMappedCDS.py looks up the corresponding GFF CDS entry for each mapped read in the SAM file and, for each corresponding mapping in UniProt, saves the UniProt and RefSeq IDs to a .CDSMAP file
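The SAM-to-BAM and flagstat steps of the pipeline above can be sketched with standard `samtools` commands. This is a hedged illustration rather than the actual pipeline script: it assumes `samtools` is on the PATH, and the helper names and lower-case output file names are chosen here for the example. The PALADIN index and align steps are omitted because their exact command lines are not given in this document.

```python
import subprocess
from pathlib import Path


def sam_to_bam_cmd(sam_path):
    """Build the `samtools view` command that converts a SAM file to BAM."""
    sam = Path(sam_path)
    bam = sam.with_suffix(".bam")
    return ["samtools", "view", "-b", "-o", str(bam), str(sam)]


def flagstat_cmd(bam_path):
    """Build the `samtools flagstat` command for a BAM file."""
    return ["samtools", "flagstat", str(bam_path)]


def run_post_alignment(sam_path):
    """Convert the SAM file to BAM, then save the flagstat report beside it."""
    sam = Path(sam_path)
    bam = sam.with_suffix(".bam")
    subprocess.run(sam_to_bam_cmd(sam), check=True)
    result = subprocess.run(
        flagstat_cmd(bam), check=True, capture_output=True, text=True
    )
    # The .samstat suffix mirrors the .SAMSTAT naming used in this document.
    sam.with_suffix(".samstat").write_text(result.stdout)
    return bam
```

Using `check=True` makes a failed conversion stop the run immediately instead of producing an empty or misleading samstat file.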
Notes
"cat file.cds | sort | uniq | wc -l" can be run to see the number of distinct CDS entries corresponding to reads that were successfully mapped. Other operations can be performed on the .CDSMAP file for UniProt mapping information.
The results of these tests can be found in "PALADIN Test Stats.xlsx"
- Align1 - PALADIN variant 1, MCBS913 metagenome reads, UniProt DB (full and filtered), seed lengths 9 and 11
- Align2 - PALADIN variant 2, MCBS913 metagenome reads, UniProt DB (full and filtered), seed lengths 9 and 11
- BWA - BWA, MCBS913 metagenome reads, UniProt DB (full and filtered)
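When the shell pipeline from the notes ("cat file.cds | sort | uniq | wc -l") is inconvenient, for example inside a larger analysis script, the same count can be obtained in Python. The sketch below assumes one CDS entry per line. Unlike the shell version, it ignores blank lines.

```python
def count_unique_cds(lines):
    """Count distinct non-blank lines, roughly `sort | uniq | wc -l`."""
    return len({line.rstrip("\n") for line in lines if line.strip()})


def count_unique_cds_in_file(path):
    """Convenience wrapper that reads the entries from a .cds file."""
    with open(path) as handle:
        return count_unique_cds(handle)
```

For example, `count_unique_cds_in_file("file.cds")` returns the number of distinct CDS entries hit by at least one mapped read.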