added locidex documentation

phac-nml · Oct 1, 2024 · 604e5ed
1 parent 63a8076
commit 604e5ed

Showing 2 changed files with 37 additions and 35 deletions.
diff --git a/docs/usage/configuration.md b/docs/usage/configuration.md
@@ -35,7 +35,7 @@ All Command line arguments and defaults can be set and/or altered in the `nextfl
 ## Quality control report configuration
 > **WARNING:** Tread carefully here, as this will require modification of the `nextflow.config` file. **Make sure you have saved a backup of your `nextflow.config` file before changing these options.**
 
-### QCReport field desciption
+### QCReport field description
 The section of interest is the `QCReport` fields in the params section of the `nextflow.config`. There are multiple sections with values that can be modified, or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data**, so you may need to adjust settings for Nanopore or PacBio data.
 
 An example of the QCReport structure is shown below. With annotation describing the values.
@@ -51,7 +51,7 @@ QCReport {
         min_nr_contigs = 1 // the minimum number of contigs a sample is allowed to have, a value of 1 works as a sanity check
         max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. Too many contigs could indicate a bad assembly or contamination
         min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field
-        max_length = 6000000 // The maxmimum genome length the organism in the search field is allowed to have
+        max_length = 6000000 // The maximum genome length the organism in the search field is allowed to have
         max_checkm_contamination = 3.0 // The maximum level of allowed contamination allowed by CheckM
         min_average_coverage = 30 // The minimum average coverage allowed
     }
@@ -83,7 +83,7 @@ VAR_NAME { // Replace VAR name with the genus name of your sample, only use ASCI
     min_n50 = // Set your minimum n50 value
     max_n50 = // Set a maximum n50 value
     min_nr_contigs = // Set a minimum number of contigs
-    max_nr_contigs = // The maximum number of contings
+    max_nr_contigs = // The maximum number of contigs
     min_length = // Set a minimum genome length
     max_length = // set a maximum genome length
     max_checkm_contamination = // Set a maximum level of contamination to use
@@ -150,13 +150,13 @@ After having my values filled out, I can simply add them to the QCReport section
 ```
 
 ## Quality Control Fields
-This section affects the behaviours of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
+This section affects the behavior of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
 
 Each value in the QC report fields contains the following fields.
 
 - Field name
     - path: path to the information in the summary report JSON
-    - coerce_type: Type to be coreced too, can be a Float, Integer, or Bool
+    - coerce_type: Type to be coerced to; can be a Float, Integer, or Bool
     - compare_fields: A list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified it will be assumed you wish to check that a value is in between a range of values.
     - comp_type: The comparison type specified, 'ge' for greater or equal, 'le' for less than or equal, 'bool' for true or false or 'range' for checking if a value is between two values.
     - on: A boolean value for disabling a comparison
@@ -192,7 +192,7 @@ The directory of a database set for Locidex contains the following structure as
 
 An example `manifest.json` file can be found in the mikrokondo [test data sets here](https://github.com/phac-nml/mikrokondo/tree/main/tests/data/databases/locidex_dbs).
 
-Internally the `manifest.json` contains the following structure. Modifications to what `locidex manifest` outputs can be made as long as all fields populated. In the below example the `manifest.json` file generated by locidex has been modified to create two seperate entries for *Escherichia coli* and *Shigella*.
+Internally the `manifest.json` contains the following structure. Modifications to what `locidex manifest` outputs can be made as long as all fields are populated. In the example below, the `manifest.json` file generated by locidex has been modified to create two separate entries for *Escherichia coli* and *Shigella*.
 
 ```
 {
@@ -270,7 +270,9 @@ Internally the `manifest.json` contains the following structure. Modifications t
 ### How automated selection works.
 Mikrokondo is able to identify the species that a sample represents internally, but in order to identify the correct WgMLST scheme to use for allele calling the top-level key in the `manifest.json` file must be a name that can be parsed from the speciation output of Mash or Kraken2 e.g. *Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia* etc.
 
-Mikrokono will then be able to match the bacterial name outputs to what is in the `manifest.json`. In the following example below the three bacteria (*Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia coli*) would all be matched to the correct scheme:
+> **Note:** The database and organism names are not case sensitive.
+
+Mikrokondo will then be able to match the bacterial name outputs to what is in the `manifest.json`. In the example below, the three bacteria (*Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia coli*) would all be matched to the correct scheme:
 ```
 {
   "Salmonella": [
@@ -285,4 +287,4 @@ Mikrokono will then be able to match the bacterial name outputs to what is in th
 }
 ```
 
-This is because mikrokondo looks for the best exact match from the database names in the output species name. So spurious tokens like the `_A` in *Campylobacter_A anatolicus* would be removed and the `Campylobacter` database would be selected.
+This is because mikrokondo looks for the best exact match from the database names in the output species name. Spurious tokens like the `_A` in *Campylobacter_A anatolicus* are removed, so the `Campylobacter` database is selected. For *Salmonella enterica*, the key `Salmonella` overlaps entirely with the *Salmonella* of the species name, so the `Salmonella` database is selected. If there were a `Salmonella Enterica` database, it would be selected over the generic `Salmonella` scheme. The `Escherichia coli` database is selected for *Escherichia coli* because the two are a 100% match.
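
The selection rule described in the last added paragraph can be illustrated with a minimal Python sketch. This is not mikrokondo's actual implementation. The function names `tokenize` and `select_database`, the splitting on whitespace and underscores, and the "most matching tokens wins" scoring are all assumptions made for illustration, chosen so that the three documented examples behave as described.

```python
import re
from typing import Iterable, List, Optional


def tokenize(name: str) -> List[str]:
    """Lower-case a name and split it on whitespace and underscores.

    Assumption: splitting on underscores is how the `_A` suffix in
    `Campylobacter_A` gets separated from the genus name.
    """
    return [tok for tok in re.split(r"[\s_]+", name.lower()) if tok]


def select_database(species: str, database_names: Iterable[str]) -> Optional[str]:
    """Return the database name that best matches a species string.

    Hypothetical rule: a database name is a candidate only if every one of
    its tokens occurs in the species name; among candidates, the one with
    the most tokens (the most specific one) wins. Comparison is case
    insensitive. Returns None when nothing matches.
    """
    species_tokens = tokenize(species)
    best_name: Optional[str] = None
    best_score = 0
    for name in database_names:
        key_tokens = tokenize(name)
        if key_tokens and all(tok in species_tokens for tok in key_tokens):
            if len(key_tokens) > best_score:
                best_name, best_score = name, len(key_tokens)
    return best_name


# Top-level keys as they might appear in a manifest.json
keys = ["Salmonella", "Campylobacter", "Escherichia coli"]

for species in ["Salmonella enterica", "Campylobacter_A anatolicus", "Escherichia coli"]:
    print(species, "->", select_database(species, keys))
```

Under these assumptions, adding a `Salmonella Enterica` key to `keys` would make it win over `Salmonella` for *Salmonella enterica*, because it matches two tokens rather than one.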