added locidex documentation
mattheww95 committed Oct 1, 2024
1 parent 63a8076 commit 604e5ed
Showing 2 changed files with 37 additions and 35 deletions.
18 changes: 10 additions & 8 deletions docs/usage/configuration.md
@@ -35,7 +35,7 @@ All Command line arguments and defaults can be set and/or altered in the `nextfl
## Quality control report configuration
> **WARNING:** Tread carefully here, as this will require modification of the `nextflow.config` file. **Make sure you have saved a backup of your `nextflow.config` file before changing these options.**
### QCReport field desciption
### QCReport field description
The section of interest is the `QCReport` fields in the params section of the `nextflow.config`. There are multiple sections with values that can be modified, or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data**, so you may need to adjust settings for Nanopore or PacBio data.
An example of the QCReport structure is shown below, with annotations describing the values.
@@ -51,7 +51,7 @@ QCReport {
min_nr_contigs = 1 // the minimum number of contigs a sample is allowed to have, a value of 1 works as a sanity check
max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. Too many contigs could indicate a bad assembly or contamination
min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field
max_length = 6000000 // The maxmimum genome length the organism in the search field is allowed to have
max_length = 6000000 // The maximum genome length the organism in the search field is allowed to have
max_checkm_contamination = 3.0 // The maximum level of contamination allowed, as reported by CheckM
min_average_coverage = 30 // The minimum average coverage allowed
}
@@ -83,7 +83,7 @@ VAR_NAME { // Replace VAR name with the genus name of your sample, only use ASCI
min_n50 = // Set your minimum n50 value
max_n50 = // Set a maximum n50 value
min_nr_contigs = // Set a minimum number of contigs
max_nr_contigs = // The maximum number of contings
max_nr_contigs = // The maximum number of contigs
min_length = // Set a minimum genome length
max_length = // set a maximum genome length
max_checkm_contamination = // Set a maximum level of contamination to use
@@ -150,13 +150,13 @@ After having my values filled out, I can simply add them to the QCReport section
```
## Quality Control Fields
This section affects the behaviours of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
This section affects the behavior of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
Each value in the QC report fields contains the following fields.
- Field name
- path: path to the information in the summary report JSON
- coerce_type: Type to be coreced too, can be a Float, Integer, or Bool
- coerce_type: Type to coerce the value to; can be a Float, Integer, or Bool
- compare_fields: A list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified it will be assumed you wish to check that a value is in between a range of values.
- comp_type: The comparison type specified, 'ge' for greater or equal, 'le' for less than or equal, 'bool' for true or false or 'range' for checking if a value is between two values.
- on: A boolean value for disabling a comparison
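
As a rough illustration of how the fields listed above could fit together, here is a minimal Python sketch. It is not mikrokondo's implementation: the function names `check_field` and `to_bool`, the `COERCERS` table, the inclusive treatment of `range`, and the way a `bool` comparison is evaluated are all assumptions made for this example.

```python
# Hypothetical sketch of a QC field check; not taken from the mikrokondo source.

def to_bool(value):
    """Coerce common true/false spellings to a Python bool (assumed behaviour)."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)

# Assumed mapping from the documented coerce_type names to Python callables.
COERCERS = {"Float": float, "Integer": int, "Bool": to_bool}

def check_field(value, coerce_type, comp_type, thresholds, on=True):
    """Return True/False for a QC comparison, or None when the check is disabled.

    `thresholds` stands in for the values looked up through compare_fields.
    """
    if not on:
        return None
    typed = COERCERS[coerce_type](value)
    if comp_type == "ge":
        return typed >= thresholds[0]
    if comp_type == "le":
        return typed <= thresholds[0]
    if comp_type == "bool":
        return typed == to_bool(thresholds[0])
    if comp_type == "range":
        low, high = thresholds
        return low <= typed <= high  # inclusive bounds are an assumption
    raise ValueError(f"unknown comp_type: {comp_type}")

# Example using the QCReport values shown earlier in this document.
coverage_ok = check_field("45.2", "Float", "ge", [30])
contigs_ok = check_field(700, "Integer", "range", [1, 500])
```

With the example thresholds, a coverage of 45.2 passes the `ge` check against `min_average_coverage = 30`, while 700 contigs falls outside the `min_nr_contigs` to `max_nr_contigs` range.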
@@ -192,7 +192,7 @@ The directory of a database set for Locidex contains the following structure as
An example `manifest.json` file can be found in the mikrokondo [test data sets here](https://github.com/phac-nml/mikrokondo/tree/main/tests/data/databases/locidex_dbs).
Internally the `manifest.json` contains the following structure. Modifications to what `locidex manifest` outputs can be made as long as all fields populated. In the below example the `manifest.json` file generated by locidex has been modified to create two seperate entries for *Escherichia coli* and *Shigella*.
Internally, the `manifest.json` contains the following structure. Modifications to what `locidex manifest` outputs can be made as long as all fields are populated. In the example below, the `manifest.json` file generated by locidex has been modified to create two separate entries for *Escherichia coli* and *Shigella*.
```
{
@@ -270,7 +270,9 @@ Internally the `manifest.json` contains the following structure. Modifications t
### How automated selection works.
Mikrokondo is able to identify the species that a sample represents internally, but in order to identify the correct WgMLST scheme to use for allele calling the top-level key in the `manifest.json` file must be a name that can be parsed from the speciation output of Mash or Kraken2 e.g. *Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia* etc.
Mikrokono will then be able to match the bacterial name outputs to what is in the `manifest.json`. In the following example below the three bacteria (*Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia coli*) would all be matched to the correct scheme:
> **Note:** The database and organism names are not case sensitive.
Mikrokondo will then be able to match the bacterial name outputs to what is in the `manifest.json`. In the example below, the three bacteria (*Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia coli*) would all be matched to the correct scheme:
```
{
"Salmonella": [
@@ -285,4 +287,4 @@ Mikrokono will then be able to match the bacterial name outputs to what is in th
}
```
This is because mikrokondo looks for the best exact match from the database names in the output species name. So spurious tokens like the `_A` in *Campylobacter_A anatolicus* would be removed and the `Campylobacter` database would be selected.
This is because mikrokondo looks for the best exact match from the database names in the output species name. Spurious tokens like the `_A` in *Campylobacter_A anatolicus* are removed, so the `Campylobacter` database would be selected. For *Salmonella enterica*, the key `Salmonella` overlaps entirely with the genus name, so the `Salmonella` database would be selected. If there were a `Salmonella Enterica` database, it would be selected over the generic `Salmonella` scheme. The `Escherichia coli` database would be selected for *Escherichia coli* as they are a 100% match.
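
The selection behaviour described above can be approximated with the short Python sketch below. This is only an illustration under stated assumptions, not the code mikrokondo runs: the helper names `normalize` and `select_scheme` are hypothetical, suffix tokens are assumed to look like `_A`, and "best match" is assumed to mean the database name that matches the most leading words of the species name, ignoring case.

```python
import re

def normalize(name):
    """Lowercase a name and strip suffix tokens such as the `_A` in `Campylobacter_A`."""
    return [re.sub(r"_[a-z]+$", "", token) for token in name.lower().split()]

def select_scheme(species, database_names):
    """Return the database name matching the most leading words of `species`.

    Returns None when no database name matches. Hypothetical helper.
    """
    species_tokens = normalize(species)
    best, best_len = None, 0
    for db_name in database_names:
        db_tokens = normalize(db_name)
        matches = species_tokens[:len(db_tokens)] == db_tokens
        if matches and len(db_tokens) > best_len:
            best, best_len = db_name, len(db_tokens)
    return best

# Top-level keys from the example manifest above.
databases = ["Salmonella", "Campylobacter", "Escherichia coli"]
chosen = [select_scheme(s, databases) for s in (
    "Salmonella enterica",
    "Campylobacter_A anatolicus",
    "Escherichia coli",
)]
```

Under these assumptions, adding a `Salmonella Enterica` key to `databases` would make it win over `Salmonella` for *Salmonella enterica*, because it matches two words instead of one.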