diff --git a/.gitignore b/.gitignore
index b5fe70e8..2ccedd43 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ docs/TODO.md
.git.bak
assets/schema_input_nfv2.0.0.json
nextflow_schema_nfv2.json
+.vscode
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8e82a777..13c5e445 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -3,6 +3,12 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## Unreleased
+
+### `Updated`
+
+- Documentation and workflow diagram have been updated. [PR 123](https://github.com/phac-nml/mikrokondo/pull/123)
+
## [0.4.2] - 2024-09-25
### `Fixed`
diff --git a/docs/images/Thumbs.db b/docs/images/Thumbs.db
deleted file mode 100644
index ba755609..00000000
Binary files a/docs/images/Thumbs.db and /dev/null differ
diff --git a/docs/images/mikrokondo_mermaid.svg b/docs/images/mikrokondo_mermaid.svg
index 1bce78d5..6ef9dabe 100644
--- a/docs/images/mikrokondo_mermaid.svg
+++ b/docs/images/mikrokondo_mermaid.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/docs/usage/configuration.md b/docs/usage/configuration.md
index a7cf2095..f63d1c2e 100644
--- a/docs/usage/configuration.md
+++ b/docs/usage/configuration.md
@@ -1,182 +1,290 @@
-# Configuration
-## Configuration files overview
-
-The following files contain configuration settings:
-
-- `conf/base.config`: Where cpu, memory and time parameters can be set for the different workflow processes. **You will likely need to adjust parameters within this file for your computing environment**.
-
-- `conf/modules.config`: contains error strategy, output director structure and execution instruction parameters. **It is unadvised to alter this file unless involved in pipeline development, or tuning to a system.**
-
-- `nextflow.config`: contains default tool settings that tie to CLI options. These options can be directly set within the `params` section of this file in cases when a user has optimized their pipeline usage and has identified the flags they will use every time the pipeline is run.
-
-### Base configuration (conf/base.config)
-Within this file computing resources can be configured for each process. Mikrokondo uses labels to define resource requirements for each process, here are their definitions:
-
-- `process_single`: processes requiring only a single core and low memory (e.g., listing of directories).
-- `process_low`: processes that would typically run easily on a small laptop (e.g., staging of data in a Python script).
-- `process_medium`: processes that would typically run on a desktop computer equipped for playing newer video games (Memory or computationally intensive applications that can be parallelized, e.g., rendering, processing large files in memory or running BLAST).
-- `process_high`: processes that would typically run on a high performance desktop computer (Memory or computationally intensive application, e.g., performing *de novo* assembly or performing BLAST searches on large databases).
-- `process_long`: modifies/overwrites the amount of time allowed for any of the above processes to allow for certain jobs to take longer (e.g., performing *de novo* assembly with less computational resources or performing global alignments on divergent sequences).
-- `process_high_memory`: modifies/overwrites the amount of memory given to any process and grant significantly more memory to any process (Aids in metagenomic assembly or clustering of large datasets).
-
-For actual resource amounts allotted to each process definition, see the `conf/base.config` file _Process-specific resource requirements_ section.
-
-### Hardcoded tool configuration (nextflow.config)
-All Command line arguments and defaults can be set and/or altered in the `nextflow.config` file, _params_ section. For a full list of parameters to be altered please refer to the `nextflow.config` file in the repo. Some common arguments have been listed in the [Common command line arguments](/usage/useage/#common-command-line-arguments) section of the docs and further description of tool parameters can also be found in [tool specific parameters](/usage/tool_params/).
-
-> **Example:** if your laboratory typically sequences using Nanopore chemistry "r1041_e82_400bps_hac_v4.2.0", the following code would be substituted in the _params_ section of the `nextflow.config` file:
->
->```
->nanopore_chemistry = "r1041_e82_400bps_hac_v4.2.0" // Note the quotes around the value
->```
->
->With this change, you would no longer need to explicitly state the nanopore chemistry as an extra CLI argument when running mikrokondo.
-
-## Quality control report configuration
-> **WARNING:** Tread carefully here, as this will require modification of the `nextflow.config` file. **Make sure you have saved a back up of your `nextflow.config` file before playing with these option**
-
-### QCReport field desciption
-The section of interest is the `QCReport` fields in the params section of the `nextflow.config`. There are multiple sections with values that can be modified or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data** so you may need to adjust settingS for Nanopore or Pacbio data.
-
-An example of the QCReport structure is shown below. With annotation describing the values.
->**NOTE:** The values below do not affect the running of the pipeline, these values only affect the final quality messages output by the pipeline.
-```
-QCReport {
- escherichia // Generic top level name fo the field, it is name is technically arbitrary but it nice field name keeps things organized
- {
- search = "Escherichia coli" // The phrase that is searched for in the species_top_hit field mentioned above. The search is for containment so if you wanted to look for E.coli and E.albertii you could just set the value too "Escherichia"
- raw_average_quality = 30 // Minimum raw average quality of all bases in the sequencing data. This value is generated before the decontamination procedure.
- min_n50 = 95000 // The minimum n50 value allowed from quast
- max_n50 = 6000000 // The maximum n50 value allowed from quast
- min_nr_contigs = 1 // the minimum number of contigs a sample is allowed to have, a value of 1 works as a sanity check
- max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. to many contigs could indicate a bad assembly or contamination
- min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field
- max_length = 6000000 // The maxmimum genome length the organism in the search field is allowed to have
- max_checkm_contamination = 3.0 // The maximum level of allowed contamination allowed by CheckM
- min_average_coverage = 30 // The minimum average coverage allowed
- }
- // DO NOT REMOVE THE FALLTRHOUGH FIELD AS IT IS NEEDED TO CAPTURE OTHER ORGANISMS
- fallthrough // The fallthrough field exist as a default value to capture organisms where no quality control data has been specified
- {
- search = "No organism specific QC data available."
- raw_average_quality = 30
- min_n50 = null
- max_n50 = null
- min_nr_contigs = null
- max_nr_contigs = null
- min_length = null
- max_length = null
- max_checkm_contamination = 3.0
- min_average_coverage = 30
- }
-}
-```
-
-### Example adding quality control data for *Salmonella*
-
-If you wanted to add quality control data for *Salmonella* you can start off by using the template below:
-
-```
-VAR_NAME { // Replace VAR name with the genus name of your sample, only use ASCII (a-zA-Z) alphabet characters in the name and replace spaces, punctuation and other special characters with underscores (_)
- search = "Search phrase" // Search phrase for your species top_hit, Note the quotes
- raw_average_quality = // 30 is a default value please change it as needed
- min_n50 = // Set your minimum n50 value
- max_n50 = // Set a maximum n50 value
- min_nr_contigs = // Set a minimum number of contigs
- max_nr_contigs = // The maximum number of contings
- min_length = // Set a minimum genome length
- max_length = // set a maximum genome length
- max_checkm_contamination = // Set a maximum level of contamination to use
- min_average_coverage = // Set the minimum coverage value
-}
-```
-
-For *Salmonella* I would fill in the values like so.
-```
-salmonella {
- search = "Salmonella"
- raw_average_quality = 30
- min_n50 = 95000
- max_n50 = 6000000
- min_nr_contigs = 1
- max_nr_contigs = 200
- min_length = 4400000
- max_length = 6000000
- max_checkm_contamination = 3.0
- min_average_coverage = 30
-}
-```
-
-After having my values filled out, I can simply add them to the QCReport section in the `nextflow.config` file.
-
-```
- QCReport {
- escherichia {
- search = "Escherichia coli"
- raw_average_quality = 30
- min_n50 = 95000
- max_n50 = 6000000
- min_nr_contigs = 1
- max_nr_contigs = 500
- min_length = 4500000
- max_length = 6000000
- max_checkm_contamination = 3.0
- min_average_coverage = 30
- } salmonella { // NOTE watch the opening and closing brackets
- search = "Salmonella"
- raw_average_quality = 30
- min_n50 = 95000
- max_n50 = 6000000
- min_nr_contigs = 1
- max_nr_contigs = 200
- min_length = 4400000
- max_length = 6000000
- max_checkm_contamination = 3.0
- min_average_coverage = 30
- }
- fallthrough {
- search = "No organism specific QC data available."
- raw_average_quality = 30
- min_n50 = null
- max_n50 = null
- min_nr_contigs = null
- max_nr_contigs = null
- min_length = null
- max_length = null
- max_checkm_contamination = 3.0
- min_average_coverage = 30
- }
- }
-```
-
-## Quality Control Fields
-This section affects the behaviours of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
-
-TODO test what happens if no quality msg is available for the bool fields types.
-
-Each value in the QC report fields contains the following fields.
-
-- Field name
- - path: path to the information in the summary report JSON
- - coerce_type: Type to be coreced too, can be a Float, Integer, or Bool
- - compare_fields: A list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified it will be assumed you wish to check that a value is in between a range of values.
- - comp_type: The comparison type specified, 'ge' for greater or equal, 'le' for less than or equal, 'bool' for true or false or 'range' for checking if a value is between two values.
- - on: A boolean value for disabling a comparison
- - low_msg: A message for if a value is less than its compared value (optional)
- - high_msg: A message for if value is above a certain value (optional)
-
-An example of what these fields look like is:
-
-```
-QCReportFields {
- raw_average_quality {
- path = [params.raw_reads.report_tag, "combined", "qual_mean"]
- coerce_type = 'Float'
- compare_fields = ['raw_average_quality']
- comp_type = "ge"
- on = true
- low_msg = "Base quality is poor, resequencing is recommended."
- }
-}
-```
-
+# Configuration
+## Configuration files overview
+
+The following files contain configuration settings:
+
+- `conf/base.config`: where CPU, memory and time parameters can be set for the different workflow processes. **You will likely need to adjust parameters within this file for your computing environment**.
+
+- `conf/modules.config`: contains error strategy, output directory structure and execution instruction parameters. **Altering this file is not advised unless you are involved in pipeline development or are tuning the pipeline to a system.**
+
+- `nextflow.config`: contains default tool settings that tie to CLI options. These options can be directly set within the `params` section of this file in cases when a user has optimized their pipeline usage and has identified the flags they will use every time the pipeline is run.
+
+### Base configuration (conf/base.config)
+Within this file computing resources can be configured for each process. Mikrokondo uses labels to define resource requirements for each process. Their definitions are:
+
+- `process_single`: processes requiring only a single core and low memory (e.g., listing of directories).
+- `process_low`: processes that would typically run easily on a small laptop (e.g., staging of data in a Python script).
+- `process_medium`: processes that would typically run on a desktop computer equipped for playing newer video games (Memory or computationally intensive applications that can be parallelized, e.g., rendering, processing large files in memory or running BLAST).
+- `process_high`: processes that would typically run on a high performance desktop computer (Memory or computationally intensive application, e.g., performing *de novo* assembly or performing BLAST searches on large databases).
+- `process_long`: modifies/overwrites the amount of time allowed for any of the above processes to allow for certain jobs to take longer (e.g., performing *de novo* assembly with less computational resources or performing global alignments on divergent sequences).
+- `process_high_memory`: modifies/overwrites the amount of memory given to any process, granting it significantly more memory (aids in metagenomic assembly or clustering of large datasets).
+
+For actual resource amounts allotted to each process definition, see the `conf/base.config` file _Process-specific resource requirements_ section.
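+
+As a rough sketch of how a label maps to resources, a Nextflow `withLabel` selector can be used in the `process` scope. The values below are illustrative assumptions only, not mikrokondo's defaults; check `conf/base.config` for the real settings before changing anything:
+
+```
+process {
+    // Illustrative values only; these are NOT mikrokondo's defaults
+    withLabel: process_low {
+        cpus   = 2
+        memory = 8.GB
+        time   = 4.h
+    }
+    // A label that only overrides memory, leaving cpus and time untouched
+    withLabel: process_high_memory {
+        memory = 200.GB
+    }
+}
+```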
+
+### Hardcoded tool configuration (nextflow.config)
+All command line arguments and defaults can be set and/or altered in the _params_ section of the `nextflow.config` file. For a full list of parameters that can be altered, please refer to the `nextflow.config` file in the repo. Some common arguments are listed in the [Common command line arguments](/usage/useage/#common-command-line-arguments) section of the docs, and further descriptions of tool parameters can be found in [tool specific parameters](/usage/tool_params/).
+
+> **Example:** if your laboratory typically sequences using Nanopore chemistry "r1041_e82_400bps_hac_v4.2.0", the following code would be substituted in the _params_ section of the `nextflow.config` file:
+>
+>```
+>nanopore_chemistry = "r1041_e82_400bps_hac_v4.2.0" // Note the quotes around the value
+>```
+>
+>With this change, you would no longer need to explicitly state the nanopore chemistry as an extra CLI argument when running mikrokondo.
+
+## Quality control report configuration
+> **WARNING:** Tread carefully here, as this will require modification of the `nextflow.config` file. **Make sure you have saved a backup of your `nextflow.config` file before changing these options.**
+
+### QCReport field description
+The section of interest is the `QCReport` fields in the params section of the `nextflow.config`. There are multiple sections with values that can be modified, or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data**, so you may need to adjust settings for Nanopore or PacBio data.
+
+An example of the QCReport structure is shown below, with annotations describing the values.
+>**NOTE:** The values below do not affect the running of the pipeline, these values only affect the final quality messages output by the pipeline.
+```
+QCReport {
+ escherichia // Generic top-level name for the field. The name is technically arbitrary, but a descriptive field name keeps things organized
+ {
+ search = "Escherichia coli" // The phrase that is searched for in the species_top_hit field. The search is for containment, so if you wanted to look for E. coli and E. albertii you could just set the value to "Escherichia"
+ raw_average_quality = 30 // Minimum raw average quality of all bases in the sequencing data. This value is generated before the decontamination procedure.
+ min_n50 = 95000 // The minimum n50 value allowed from quast
+ max_n50 = 6000000 // The maximum n50 value allowed from quast
+ min_nr_contigs = 1 // the minimum number of contigs a sample is allowed to have, a value of 1 works as a sanity check
+ max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. Too many contigs could indicate a bad assembly or contamination
+ min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field
+ max_length = 6000000 // The maximum genome length the organism in the search field is allowed to have
+ max_checkm_contamination = 3.0 // The maximum level of contamination allowed, as reported by CheckM
+ min_average_coverage = 30 // The minimum average coverage allowed
+ }
+ // DO NOT REMOVE THE FALLTHROUGH FIELD AS IT IS NEEDED TO CAPTURE OTHER ORGANISMS
+ fallthrough // The fallthrough field exists as a default value to capture organisms where no quality control data has been specified
+ {
+ search = "No organism specific QC data available."
+ raw_average_quality = 30
+ min_n50 = null
+ max_n50 = null
+ min_nr_contigs = null
+ max_nr_contigs = null
+ min_length = null
+ max_length = null
+ max_checkm_contamination = 3.0
+ min_average_coverage = 30
+ }
+}
+```
+
+### Example adding quality control data for *Salmonella*
+
+If you wanted to add quality control data for *Salmonella* you can start off by using the template below:
+
+```
+VAR_NAME { // Replace VAR_NAME with the genus name of your sample. Only use ASCII (a-zA-Z) alphabet characters in the name and replace spaces, punctuation and other special characters with underscores (_)
+ search = "Search phrase" // Search phrase for your species top_hit, Note the quotes
+ raw_average_quality = // 30 is a default value please change it as needed
+ min_n50 = // Set your minimum n50 value
+ max_n50 = // Set a maximum n50 value
+ min_nr_contigs = // Set a minimum number of contigs
+ max_nr_contigs = // The maximum number of contigs
+ min_length = // Set a minimum genome length
+ max_length = // set a maximum genome length
+ max_checkm_contamination = // Set a maximum level of contamination to use
+ min_average_coverage = // Set the minimum coverage value
+}
+```
+
+For *Salmonella* I would fill in the values like so.
+```
+salmonella {
+ search = "Salmonella"
+ raw_average_quality = 30
+ min_n50 = 95000
+ max_n50 = 6000000
+ min_nr_contigs = 1
+ max_nr_contigs = 200
+ min_length = 4400000
+ max_length = 6000000
+ max_checkm_contamination = 3.0
+ min_average_coverage = 30
+}
+```
+
+After having my values filled out, I can simply add them to the QCReport section in the `nextflow.config` file.
+
+```
+ QCReport {
+ escherichia {
+ search = "Escherichia coli"
+ raw_average_quality = 30
+ min_n50 = 95000
+ max_n50 = 6000000
+ min_nr_contigs = 1
+ max_nr_contigs = 500
+ min_length = 4500000
+ max_length = 6000000
+ max_checkm_contamination = 3.0
+ min_average_coverage = 30
+ } salmonella { // NOTE watch the opening and closing brackets
+ search = "Salmonella"
+ raw_average_quality = 30
+ min_n50 = 95000
+ max_n50 = 6000000
+ min_nr_contigs = 1
+ max_nr_contigs = 200
+ min_length = 4400000
+ max_length = 6000000
+ max_checkm_contamination = 3.0
+ min_average_coverage = 30
+ }
+ fallthrough {
+ search = "No organism specific QC data available."
+ raw_average_quality = 30
+ min_n50 = null
+ max_n50 = null
+ min_nr_contigs = null
+ max_nr_contigs = null
+ min_length = null
+ max_length = null
+ max_checkm_contamination = 3.0
+ min_average_coverage = 30
+ }
+ }
+```
+
+## Quality Control Fields
+This section affects the behavior of the final summary quality control messages and is defined in the `QCReportFields` section of the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**.
+
+Each entry in `QCReportFields` contains the following fields.
+
+- Field name
+ - path: path to the information in the summary report JSON
+ - coerce_type: Type to be coerced to, can be a Float, Integer, or Bool
+ - compare_fields: A list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified it will be assumed you wish to check that a value is in between a range of values.
+ - comp_type: The comparison type specified, 'ge' for greater or equal, 'le' for less than or equal, 'bool' for true or false or 'range' for checking if a value is between two values.
+ - on: A boolean value for disabling a comparison
+ - low_msg: A message to report if a value is less than its compared value (optional)
+ - high_msg: A message to report if a value is above its compared value (optional)
+
+An example of what these fields look like is:
+
+```
+QCReportFields {
+ raw_average_quality {
+ path = [params.raw_reads.report_tag, "combined", "qual_mean"]
+ coerce_type = 'Float'
+ compare_fields = ['raw_average_quality']
+ comp_type = "ge"
+ on = true
+ low_msg = "Base quality is poor, resequencing is recommended."
+ }
+}
+```
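+
+For comparison, an entry using the `range` comparison type might look like the sketch below. The entry name `length`, the `path` elements and both messages are hypothetical placeholders for illustration; only `min_length` and `max_length` are taken from the `QCReport` fields described earlier:
+
+```
+QCReportFields {
+    length { // hypothetical entry name
+        path = ["hypothetical_tool", "total_length"] // placeholder path, not taken from mikrokondo
+        coerce_type = 'Integer'
+        compare_fields = ['min_length', 'max_length'] // two values, so a range is checked
+        comp_type = "range"
+        on = true
+        low_msg = "Assembly is shorter than expected." // placeholder message
+        high_msg = "Assembly is longer than expected." // placeholder message
+    }
+}
+```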
+
+## Locidex Manifest File
+Automated selection of allele calling databases is supported within mikrokondo. This is accomplished with the help of Locidex itself, which offers a utility to generate a `manifest.json` file.
+
+The directory of a database set for Locidex has the following structure, as the `manifest.json` keeps track of paths relative to the location of the manifest file itself:
+```
+--|
+ |- Database 1
+ |- Database 2
+ |- Database n
+ |- manifest.json
+```
+
+An example `manifest.json` file can be found in the mikrokondo [test data sets here](https://github.com/phac-nml/mikrokondo/tree/main/tests/data/databases/locidex_dbs).
+
+Internally the `manifest.json` contains the following structure. Modifications to what `locidex manifest` outputs can be made as long as all fields are populated. In the example below, the `manifest.json` file generated by Locidex has been modified to create two separate entries for *Escherichia coli* and *Shigella*, both pointing to the same database.
+
+```
+{
+ "Salmonella": [
+ {
+ "path": "wgmlst_salmonella",
+ "config": {
+ "db_name": "Salmonella",
+ "db_version": "1.0.0",
+ "db_date": "2024-03-17",
+ "db_author": "Tester",
+ "db_desc": "Salmonella Database",
+ "db_num_seqs": 51251,
+ "is_nucl": true,
+ "is_prot": true,
+ "nucleotide_db_name": "nucleotide",
+ "protein_db_name": "protein"
+ }
+ }
+ ],
+ "Escherichia coli": [
+ {
+ "path": "wgmlst_escherichia_shigella",
+ "config": {
+ "db_name": "EC and Shigella",
+ "db_version": "1.0.0",
+ "db_date": "2024-04-30",
+ "db_author": "Tester",
+ "db_desc": "Shigella and E.coli",
+ "db_num_seqs": 57692,
+ "is_nucl": true,
+ "is_prot": true,
+ "nucleotide_db_name": "nucleotide",
+ "protein_db_name": "protein"
+ }
+ }
+ ],
+ "Shigella": [
+ {
+ "path": "wgmlst_escherichia_shigella",
+ "config": {
+ "db_name": "EC and Shigella",
+ "db_version": "1.0.0",
+ "db_date": "2024-04-30",
+ "db_author": "Tester",
+ "db_desc": "Shigella and E.coli",
+ "db_num_seqs": 57692,
+ "is_nucl": true,
+ "is_prot": true,
+ "nucleotide_db_name": "nucleotide",
+ "protein_db_name": "protein"
+ }
+ }
+ ],
+ "Listeria Monocytogenes": [
+ {
+ "path": "wgmlst_listeria",
+ "config": {
+ "db_name": "Listeria Monocytogenes wgMLST",
+ "db_version": "1.0.0",
+ "db_date": "2024-04-16",
+ "db_author": "Tester",
+ "db_desc": "Listeria Monocytogenes wgMLST",
+ "db_num_seqs": 22404,
+ "is_nucl": true,
+ "is_prot": true,
+ "nucleotide_db_name": "nucleotide",
+ "protein_db_name": "protein"
+ }
+ }
+ ]
+}
+```
+
+### How automated selection works
+Mikrokondo is able to identify the species that a sample represents internally. However, in order to identify the correct wgMLST scheme to use for allele calling, the top-level key in the `manifest.json` file must be a name that can be parsed from the speciation output of Mash or Kraken2, e.g. *Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia* etc.
+
+> **Note:** The database and organism names are not case sensitive.
+
+Mikrokondo will then be able to match the bacterial name outputs to what is in the `manifest.json`. In the following example, the three bacteria (*Salmonella enterica*, *Campylobacter_A anatolicus*, *Escherichia coli*) would all be matched to the correct scheme:
+```
+{
+ "Salmonella": [
+ ...
+ ],
+ "Escherichia coli": [
+ ...
+ ],
+ "Campylobacter": [
+ ...
+ ]
+}
+```
+
+This is because mikrokondo looks for the best exact match of a database name within the output species name. Spurious tokens like the `_A` in *Campylobacter_A anatolicus* are removed, so the `Campylobacter` database would be selected. For *Salmonella enterica*, the key `Salmonella` overlaps entirely with the genus name, so the `Salmonella` database would be selected. If there were a `Salmonella Enterica` database, it would be selected over the generic `Salmonella` scheme. The `Escherichia coli` database would be selected for *Escherichia coli* as they are a 100% match.
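+
+To illustrate the preference for more specific keys, a hypothetical manifest containing both a genus-level and a species-level key could look like the sketch below. The `Salmonella enterica` entry is an assumption for illustration and is not part of the test data:
+
+```
+{
+    "Salmonella": [
+        ...
+    ],
+    "Salmonella enterica": [
+        ...
+    ]
+}
+```
+
+With such a manifest, a sample identified as *Salmonella enterica* would be expected to use the `Salmonella enterica` entry, while other *Salmonella* species would fall back to the generic `Salmonella` entry.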
diff --git a/docs/usage/tool_params.md b/docs/usage/tool_params.md
index b9999ef6..7bc39444 100644
--- a/docs/usage/tool_params.md
+++ b/docs/usage/tool_params.md
@@ -26,7 +26,7 @@ A custom Python script that gathers quality metrics for each fastq file.
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do no touch this unless doing pipeline development.**
### Coreutils
-In cases where a process uses bash scripting only, Nextflow by default will utilize system binaries when they are available and no container is specified. For reproducability, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via below commands:
+In cases where a process uses bash scripting only, Nextflow by default will utilize system binaries when they are available and no container is specified. For reproducibility, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via the settings below:
- coreutils
- singularity: coreutils singularity container
@@ -34,7 +34,7 @@ In cases where a process uses bash scripting only, Nextflow by default will util
### Python
-Some scripts require Python3, therefore a well tested Python3 container is provided for reproducability. However, as all the scripts within mikrokondo use only the standard library you can swap these containers to use any python interpreter version. For instance, swapping in **pypy3** may result a massive performance boost from the scripts, though this is currently untested.
+Some scripts require Python3, therefore a well tested Python3 container is provided for reproducibility. However, as all the scripts within mikrokondo use only the standard library, you can swap these containers to use any Python interpreter version. For instance, swapping in **pypy3** may result in a massive performance boost from the scripts, though this is currently untested.
- python3
- singularity: Python3 singularity container
@@ -44,7 +44,7 @@ Some scripts require Python3, therefore a well tested Python3 container is provi
Kat was previously used to estimate genome size, however at the time of writing KAT appears to be only infrequently updated and newer versions would have issues running/sometimes giving an incorrect output due to failures in peak recognition. Therefore, KAT has been removed from the pipeline, It's code still remains but it **will be removed in the future**.
### Seqtk
-Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo. The usage of seqtk to convert a fasta to a fastq is needed in certain typing tools requiring reads as input (this was a design decision for generalizability of the pipeline).
+Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo. The usage of seqtk to convert a fasta to a fastq is needed in certain typing tools requiring reads as input (this was a design decision to keep the pipeline generalizable).
- seqtk
- singularity: singularity container for seqtk
@@ -58,7 +58,7 @@ Seqtk is used for both the sub-sampling of reads and conversion of fasta files t
Fastp is fast and widely used program for gathering of read quality metrics, adapter trimming, read filtering and read trimming. FastP has extensive options for configuration which are detailed in their documentation, but sensible defaults have been set. **Adapter trimming in Fastp is performed using overlap analysis, however if you do not trust this you can specify the sequencing adapters used directly in the additional arguments for Fastp**.
- fastp
- - singulartiy: singularity container for FastP
+ - singularity: singularity container for FastP
- docker: docker container for FastP
- fastq_ext: extension of the output Fastp trimmed reads, do not touch this unless doing pipeline development.
- html_ext: Extension of the html report output by fastp, do no touch unless doing pipeline development.
@@ -104,10 +104,10 @@ Flye is used for assembly of Nanopore data.
- log_ext: the file extension for the Flye log files. Do not alter this field unless doing pipeline development
- json_ext: the file extension for the Flye json files. Do not alter this field unless doing pipeline development
- **polishing_iterations**: The number of polishing iterations for Flye.
- - ext_args: Extra commandline options to pass to Flye
+ - ext_args: Extra command line options to pass to Flye
### Spades
-Usef for paired end read assembly
+Used for paired end read assembly
- spades
- singularity: Singularity container for spades
@@ -120,7 +120,7 @@ Usef for paired end read assembly
- outdir: The name of the output directory for assemblies. Do not alter this field unless doing pipeline development
### FastQC
-This is a defualt tool added to nf-core pipelines. This feature will likely be removed in the future but for those fond of it, the outputs of FastQC still remain.
+This is a default tool added to nf-core pipelines. This feature will likely be removed in the future but for those fond of it, the outputs of FastQC still remain.
- fastqc
- html_ext: The file extension of the fastqc html file. Do not alter this field unless doing pipeline development
@@ -145,12 +145,12 @@ Assemblies can be prevented from going into further analyses based on the Quast
- quast_filter
- n50_field: The name of the field to search for and filter. Do not alter this field unless doing pipeline development.
- n50_value: The minimum value the field specified is allowed to contain.
- - nr_contigs_field: The name of field in the Quast report to fiter on. Do not alter this field unless doing pipeline development.
+ - nr_contigs_field: The name of field in the Quast report to filter on. Do not alter this field unless doing pipeline development.
- nr_contigs_value: The minimum number of contigs an assembly must have to proceed further through the pipeline.
- sample_header: The column name in the Quast output containing the sample information. Do not alter this field unless doing pipeline development.
### CheckM
-CheckM is used within the pipeline for assesing contamination in assemblies.
+CheckM is used within the pipeline for assessing contamination in assemblies.
- checkm
- singularity: Singularity container containing CheckM
@@ -189,13 +189,13 @@ Run Torsten Seemann's seven gene MLST program.
- mlst
- singularity: Singularity container for mlst.
- docker: Docker container for mlst.
- - **args**: Addtional arguments to pass to mlst.
+ - **args**: Additional arguments to pass to mlst.
- tsv_ext: Extension of the mlst tabular file. Do not alter this field unless doing pipeline development.
- json_ext: Extension of the mlst output JSON file. Do not alter this field unless doing pipeline development.
- report_tag: Name of the data outputs in the final report. Do not alter this field unless doing pipeline development.
### Mash
-Mash is used repeatedly througout the pipeline for estimation of genome size from reads, contamination detection and for determining the final species of an assembly.
+Mash is used repeatedly throughout the pipeline for estimation of genome size from reads, contamination detection and for determining the final species of an assembly.
- mash
- singularity: Singularity container for mash.
@@ -233,14 +233,14 @@ This step is used to remove contaminants from read data, it exists to perform de
- docker: Docker container used to perform dehosting, this container contains minimap2 and samtools.
- phix_fa: The path to file containing the phiX fasta.
- homo_sapiens_fa: The path to file containing the human genomes fasta.
- - pacbio_mg: The path to file containg the pacbio sequencing control.
+ - pacbio_mg: The path to file containing the pacbio sequencing control.
- output_ext: The extension of the deconned fastq files. Do not alter this field unless doing pipeline development.
- mega_mm2_idx: The path to the minimap2 index used for dehosting. Do not alter this field unless doing pipeline development.
- mm2_illumina: The arguments passed to minimap2 for Illumina data. Do not alter this field unless doing pipeline development.
- mm2_pac: The arguments passed to minimap2 for Pacbio Data. Do not alter this field unless doing pipeline development.
- mm2_ont: The arguments passed to minimap2 for Nanopore data. Do not alter this field unless doing pipeline development.
- samtools_output_ext: The extension of the output from samtools. Do not alter this field unless doing pipeline development.
- - samtools_singletons_ext: The extension of singelton reads from samtools. Do not alter this field unless doing pipeline development.
+ - samtools_singletons_ext: The extension of singleton reads from samtools. Do not alter this field unless doing pipeline development.
- output_ext: The name of the files output from samtools. Do not alter this field unless doing pipeline development.
- output_dir: The directory where deconned reads are placed. Do not alter this field unless doing pipeline development.
@@ -248,8 +248,8 @@ This step is used to remove contaminants from read data, it exists to perform de
Minimap2 is used frequently throughout the pipeline for decontamination and mapping reads back to assemblies for polishing.
- minimap2
- - singularity: The singularity container for minimap2, the same one is used for contmaination removal.
- - docker: The Docker container for minimap2, the same one is used for contmaination removal.
+ - singularity: The singularity container for minimap2, the same one is used for contamination removal.
+ - docker: The Docker container for minimap2, the same one is used for contamination removal.
- index_outdir: The directory where created indices are output. Do not alter this field unless doing pipeline development.
- index_ext: The file extension of create indices. Do not alter this field unless doing pipeline development.
@@ -273,7 +273,7 @@ Racon is used as a first pass for polishing assemblies.
- outdir: The directory containing the polished sequences. Do not alter this field unless doing pipeline development.
### Pilon
-Pilon was added to the pipeline, but it is run iteratively which at the time of writing this pipeline was not well supported in Nextflow so a seperate script and containers are provided to utilize Pilon. The code for Pilon remains in the pipeline so that when able to do so easily, iterative Pilon polishing can be integrated directly into the pipeline.
+Pilon was added to the pipeline, but it is run iteratively, which at the time of writing this pipeline was not well supported in Nextflow, so a separate script and containers are provided to utilize Pilon. The code for Pilon remains in the pipeline so that iterative Pilon polishing can be integrated directly once it becomes easy to do so.
### Pilon Iterative Polishing
This process is a wrapper around minimap2, samtools and Pilon for iterative polishing containers are built **but if you ever have problems with this step, disabling polishing will fix your issue (at the cost of polishing)**.
@@ -290,9 +290,9 @@ This process is a wrapper around minimap2, samtools and Pilon for iterative poli
- bai_ext: Bam index file extension. Do not alter this field unless doing pipeline development.
- changes_ext: File extensions for the pilon output containing the changes applied to the assembly. Do not alter this field unless doing pipeline development.
- changes_outdir: The output directory for the pilon changes. Do not alter this field unless doing pipeline development.
- - max_memory_multiplier: On failure this program will try again with more memory, the mulitplier is the factor that the amount of memory passed to the program will be increased by. Do not alter this field unless doing pipeline development.
- - **max_polishing_illumina**: Number of iterations for polishing an illuina assembly with illumina reads.
- - **max_polishing_nanopre**: Number of iterations to polish a Nanopore assembly with (will use illumina reads if provided).
+ - max_memory_multiplier: On failure this program will try again with more memory, the multiplier is the factor that the amount of memory passed to the program will be increased by. Do not alter this field unless doing pipeline development.
+ - **max_polishing_illumina**: Number of iterations for polishing an illumina assembly with illumina reads.
+ - **max_polishing_nanopore**: Number of iterations to polish a Nanopore assembly with (will use illumina reads if provided).
- **max_polishing_pacbio**: Number iterations to polish assembly with (will use illumina reads if provided).
### Medaka Polishing
@@ -301,12 +301,12 @@ Medaka is used for polishing of Nanopore assemblies, make sure you specify a med
- medaka
- singularity: Singularity container with Medaka.
- docker: Docker container with Medaka.
- - model: This parameter will be autofilled with the model specified at the top level by the `nanopore_chemistry` option. Do not alter this field unless doing pipeline development.
+ - model: This parameter will be auto-filled with the model specified at the top level by the `nanopore_chemistry` option. Do not alter this field unless doing pipeline development.
- fasta_ext: Polished fasta output. Do not alter this field unless doing pipeline development.
- batch_size: The batch size passed to medaka, this can improve performance. Do not alter this field unless doing pipeline development.
### Unicycler
-Unicycler is an option provided for hybrid assembly, it is a great option and outputs an excellent assembly but it requires **A lot** of resources. Which is why the alternate hybrid assembly option using Flye->Racon->Pilon is available. As well there can be a fairly cryptic Spades error generated by Unicycler that usaully relates to memory usage, it will typically say something involving `tputs`.
+Unicycler is an option provided for hybrid assembly. It outputs an excellent assembly but requires **a lot** of resources, which is why the alternate hybrid assembly option using Flye->Racon->Pilon is available. Unicycler can also generate a fairly cryptic Spades error that usually relates to memory usage; it will typically say something involving `tputs`.
- unicycler
- singularity: The Singularity container containing Unicycler.
@@ -337,8 +337,8 @@ StarAMR provides annotation of antimicrobial resistance genes within your data.
- staramr
- singularity: The singularity container containing staramr.
- - docker: The Docker container containing starmar.
- - **db**: The database for StarAMR. The default value of `null` tells the pipeline to use the database included in the StarAMR container. However you can specify a path to a valid StarAMR datbase and use that instead.
+ - docker: The Docker container containing StarAMR.
+ - **db**: The database for StarAMR. The default value of `null` tells the pipeline to use the database included in the StarAMR container. However you can specify a path to a valid StarAMR database and use that instead.
- tsv_ext: File extension of the reports from StarAMR. Do not alter this field unless doing pipeline development.
- txt_ext: File extension of the text reports from StarAMR. Do not alter this field unless doing pipeline development.
- xlsx_ext: File extension of the excel spread sheet from StarAMR. Do not alter this field unless doing pipeline development.
@@ -347,7 +347,7 @@ StarAMR provides annotation of antimicrobial resistance genes within your data.
- report_tag: The field name of StarAMR in the final summary report. Do not alter this field unless doing pipeline development.
- header_p: Indicates the final report from StarAMR contains a header line. Do not alter this field unless doing pipeline development.
-### Bakta
+## Bakta
Bakta is used to provide annotation of genomes, it is very reliable but it can be slow.
- bakta
@@ -368,7 +368,7 @@ Bakta is used to provide annotation of genomes, it is very reliable but it can b
- txt_ext: The file extension of the txt report. Do not alter this field unless doing pipeline development.
- min_contig_length: The minimum contig length to be annotated by Bakta. This can be set from the command line using the argument `--ba_min_contig_length`.
-### Bandage
+## Bandage
Bandage is included to make bandage plots of the initial assemblies e.g. Spades, Flye or Unicycler. These images can be useful in determining the quality of an assembly.
- bandage
@@ -377,7 +377,7 @@ Bandage is included to make bandage plots of the initial assemblies e.g. Spades,
- svg_ext: The extension of the SVG file created by bandage. Do not alter this field unless doing pipeline development.
- outdir: The output directory of the bandage images.
-### Subtyping Report
+## Subtyping Report
All sub typing report tools contain a common report tag so that they can be identified by the program.
- subtyping_report
@@ -394,6 +394,11 @@ ECTyper is used to perform *in-silico* typing of *Escherichia coli* and is autom
- txt_ext: Text file extension of ECTyper output. Do not alter this field unless doing pipeline development.
- report_tag: Report tag for ECTyper data. Do not alter this field unless doing pipeline development.
- header_p: denotes if the table output from ECTyper contains a header. Do not alter this field unless doing pipeline development.
+ - ec_opid: The minimum percent identity required to determine an O antigen's presence. It must be an integer.
+ - ec_opcov: The minimum percent coverage of the O antigen. It must be an integer.
+ - ec_hpid: The minimum percent identity required to determine an H antigen's presence. It must be an integer.
+ - ec_hcov: The minimum percent coverage of the H antigen. It must be an integer.
+ - ec_enable_verification: A boolean value to enable species verification in ECTyper.
### Kleborate
Kleborate performs automatic typing of *Kelbsiella*.
@@ -415,7 +420,7 @@ Performs typing of *Staphylococcus* species.
- report_tag: The report tag for Spatyper. Do not alter this field unless doing pipeline development.
- header_p: denotes whether or not the output table contains a header. Do not alter this field unless doing pipeline development.
- repeats: An optional file specifying repeats can be passed to Spatyper.
- - repeat_order: An optional file containing a repeat ordet to pass to Spatyper.
+ - repeat_order: An optional file containing a repeat order to pass to Spatyper.
### SISTR
*In-silico Salmonella* serotype prediction.
@@ -463,9 +468,60 @@ Code still remains but it will likely be removed later on.
- report_tag: The report tag for Shigatyper. Do not alter this field unless doing pipeline development.
- header_p: Denotes if the report output contains a header. Do not alter this field unless doing pipeline development.
-### Kraken2 Contig Binning
-Bins contigs based on the Kraken2 output for contaminated/metagenomic samples. This is implemeted by using a custom script.
+## Kraken2 Contig Binning
+Bins contigs based on the Kraken2 output for contaminated/metagenomic samples. This is implemented by using a custom script.
- kraken_bin
- - **taxonomic_level**: The taxonomic level to bin contigs at. Binning at species level is not recommended the default is to bin at a genus level which is specied by a character of `G`. To bin at a higher level such as family you would specify `F`.
+ - **taxonomic_level**: The taxonomic level to bin contigs at. Binning at species level is not recommended; the default is to bin at the genus level, which is specified by the character `G`. To bin at a higher level such as family you would specify `F`.
- fasta_ext: The extension of the fasta files output. Do not alter this field unless doing pipeline development.
+
+## Locidex (Allele Calling)
+Parameters for use of locidex in allele calling.
+
+- Locidex
+ - singularity: The Singularity container containing Locidex.
+ - docker: The path to the Docker container containing Locidex.
+ - private_repository: The path to the Docker container containing Locidex in a private repository (this helps in cloud execution environments).
+ - min_evalue: See `--lx_min_evalue`.
+ - min_dna_len: See `--lx_min_dna_len`.
+ - min_aa_len: See `--lx_min_aa_len`.
+ - max_dna_len: See `--lx_max_dna_len`.
+ - max_aa_len: See `--lx_max_aa_len`.
+ - min_dna_ident: See `--lx_min_dna_ident`.
+ - min_aa_ident: See `--lx_min_aa_ident`.
+ - min_dna_match_cov: See `--lx_min_dna_match_cov`.
+ - min_aa_match_cov: See `--lx_min_aa_match_cov`.
+ - max_target_seqs: See `--lx_max_target_seqs`.
+ - extraction_mode: See `--lx_extraction_mode`.
+ - report_mode: See `--lx_report_mode`.
+ - report_prop: See `--lx_report_prop`.
+ - report_max_ambig: See `--lx_report_max_ambig`.
+ - report_max_stop: See `--lx_report_max_stop`.
+ - allele_database: See `--lx_allele_database`.
+ - date_format_string: The date format used in parsing the locidex `manifest.json` file. Do not alter this field unless doing pipeline development.
+ - manifest_db_path: Do not alter this field unless doing pipeline development.
+ - manifest_config_key: The name of key holding config data. Do not alter this field unless doing pipeline development.
+ - manifest_config_name: The name field to use in the locidex `manifest.json` file for db identification. Do not alter this field unless doing pipeline development.
+ - manifest_config_version: Config key field containing the version information for locidex. Do not alter this field unless doing pipeline development.
+ - manifest_name: The name of the `manifest.json` file for locidex. Do not alter this field unless doing pipeline development.
+ - config_data_file: The name of the locidex database file containing config information. Do not alter this field unless doing pipeline development.
+ - database_config_value_date: Name of the field containing the date in the locidex `manifest.json`. Do not alter this field unless doing pipeline development.
+ - extracted_seqs_suffix: Extracted sequences file suffix. Do not alter this field unless doing pipeline development.
+ - seq_store_suffix: Seq store suffix. Do not alter this field unless doing pipeline development.
+ - gbk_suffix: Extension name of the generated GBK file. Do not alter this field unless doing pipeline development.
+ - extraction_dir: Directory name of the locidex extract outputs. Do not alter this field unless doing pipeline development.
+ - report_suffix: Report suffix of the locidex outputs. Do not alter this field unless doing pipeline development.
+ - db_config_output_name: Output name of the selected database used for locidex. Do not alter this field unless doing pipeline development.
+ - report_tag: The report tag for Locidex Report. Do not alter this field unless doing pipeline development.
+
+## Locidex Summary
+The information used in creating a summary of the locidex outputs.
+
+- locidex_summary
+ - report_tag: The report tag for the locidex summary. Do not alter this field unless doing pipeline development.
+ - data_key: The field containing the relevant data to summarize. Do not alter this field unless doing pipeline development.
+ - data_profile_key: The key containing the profile information. Do not alter this field unless doing pipeline development.
+ - data_sample_key: The name of the key containing the sample info. Do not alter this field unless doing pipeline development.
+ - missing_allele_value: The field used for the missing allele value. Do not alter this field unless doing pipeline development.
+ - **reportable_alleles**: A list of alleles whose presence or absence is shown in the final output.
+ - report_exclude_fields: Fields to exclude from the final summary report. Do not alter this field unless doing pipeline development.
diff --git a/docs/usage/usage.md b/docs/usage/usage.md
index 32e2035f..0ea8fcc7 100644
--- a/docs/usage/usage.md
+++ b/docs/usage/usage.md
@@ -112,6 +112,8 @@ Numerous steps within mikrokondo can be turned off without compromising the stab
- `--skip_metagenomic_detection`: Skips classification of sample as metagnomic and forces a sample to be analyzed as an isolate.
- `--skip_raw_read_metrics`: Prevents generation of raw read metrics, e.g. metrics generated about the reads before any trimming or filtering is performed.
- `--skip_mlst`: Skip seven gene MLST.
+- `--skip_length_filtering_contigs`: Skip length filtering of contigs based on the `--qt_min_contig_length` parameter.
+- `--skip_allele_calling`: Skip allele calling with Locidex.
#### Datasets
Different databases/pre-computed files are required for usage within mikrokondo. These can be downloaded or created by the user, and if not configured within the `nextflow.config` file they can be passed in as files with the following command-line arguments.
@@ -122,6 +124,14 @@ Different databases/pre-computed files are required for usage within mikrokondo.
- `--kraken2_db`: Kraken2 database that can be used for speciation and binning of meta-genomically assembled contigs.
- `--staramr_db`: An optional StarAMR database to be passed in, it is recommended to use the database packaged in the container.
+#### Allele Scheme Options
+Allele scheme selection parameters.
+
+- `--override_allele_scheme`: Provide the path to an allele scheme (currently only locidex is supported) that will be used for all samples provided, i.e. no automated allele database selection is performed and this scheme is applied to every sample.
+- `--lx_allele_database`: A path to a `manifest.json` file used by locidex for automated allele selection. This option cannot be used alongside `--override_allele_scheme`.
+ >**Note:** Provide only the path to the directory containing the `manifest.json` file, e.g. `some/directory`, **NOT** `some/directory/manifest.json`.
+
+
#### FastP Arguments
For simplicity parameters affecting FastP have been moved to the top level. Each argument matches one listed within the [FastP](https://phac-nml.github.io/mikrokondo/usage/tool_params/#fastp) usage section with only a `fp_` being appended to the front of the argument. For a more detailed description of what each argument does please review the tool specific parameters for [FastP](https://phac-nml.github.io/mikrokondo/usage/tool_params/#fastp) here.
@@ -148,11 +158,44 @@ Top level parameters that can be passed to Quast.
- `--qt_min_contig_length`: Minimum length of a contig to be analyzed within Quast.
-#### Mash parameters
+#### Mash Parameters
Top level parameters to be passed to Mash.
- `--mh_min_kmer`: The minimum time a kmer needs to appear to be used in genome size estimation by mash.
+#### ECTyper Parameters
+Top level parameters to pass to ECTyper. Each argument corresponds to one within ECTyper.
+
+- `--ec_opid`: The minimum percent identity required to determine an O antigen's presence. It must be an integer.
+- `--ec_opcov`: The minimum percent coverage of the O antigen. It must be an integer.
+- `--ec_hpid`: The minimum percent identity required to determine an H antigen's presence. It must be an integer.
+- `--ec_hcov`: The minimum percent coverage of the H antigen. It must be an integer.
+- `--ec_enable_verification`: A boolean value to enable species verification in ECTyper.
+
+#### SISTR Parameters
+Top level parameters for SISTR.
+
+- `--sr_full_cgmlst`: A boolean value (default is true) to use the full cgMLST set of alleles for SISTR which includes some highly similar alleles.
+
+
+#### Locidex Parameters
+Top level parameters for Locidex, the currently implemented allele caller. Do note that Locidex uses blast internally, so many of the parameters correspond to blast options.
+
+- `--lx_min_evalue`: Minimum e-value required for a match.
+- `--lx_min_dna_len`: Global minimum query length of DNA strand.
+- `--lx_min_aa_len`: Global minimum query length of an Amino Acid strand.
+- `--lx_max_dna_len`: Global maximum query length of DNA strand.
+- `--lx_max_aa_len`: Global maximum query length of Amino Acid strand.
+- `--lx_min_dna_ident`: Global minimum DNA percent identity required for match. (float).
+- `--lx_min_aa_ident`: Global minimum Amino Acid percent identity required for match. (float).
+- `--lx_min_dna_match_cov`: Global minimum DNA percent hit coverage identity required for match (float).
+- `--lx_min_aa_match_cov`: Global minimum Amino Acid hit coverage identity required for match (float).
+- `--lx_max_target_seqs`: Maximum number of sequence hits per query.
+- `--lx_extraction_mode`: Different ways to run locidex (Options: snps, trim, raw, extend).
+- `--lx_report_mode`: Allele profile assignment (Options: normal or conservative).
+- `--lx_report_prop`: Metadata label to use for aggregation. Only alphanumeric characters, underscores and dashes are allowed in names.
+- `--lx_report_max_ambig`: Maximum number of ambiguous characters allowed in a sequence.
+- `--lx_report_max_stop`: Maximum number of internal stop codons allowed in a sequence.
#### Containers
diff --git a/mkdocs.yml b/mkdocs.yml
index 90e19e1a..0b8f34a2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,6 +1,15 @@
site_name: mikrokondo
theme:
name: material
+ palette:
+ - scheme: default
+ toggle:
+ icon: material/brightness-7
+ name: Switch to dark mode
+ - scheme: slate
+ toggle:
+ icon: material/brightness-4
+ name: Switch to light mode
features:
- navigation.tabs
- navigation.tabs.sticky
diff --git a/nextflow_schema.json b/nextflow_schema.json
index e059f396..16b48dbc 100644
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -613,14 +613,14 @@
"lx_min_dna_ident": {
"type": "number",
"default": 80,
- "description": "Global minimum DNA percent identity required for match. (Float value)",
+ "description": "Global minimum DNA percent identity required for match. (float).",
"minimum": 0,
"maximum": 100
},
"lx_min_aa_ident": {
"type": "number",
"default": 80,
- "description": "Global minimum Amino Acid percent identiy required for match. (Float value)",
+ "description": "Global minimum Amino Acid percent identity required for match. (float)",
"minimum": 0,
"maximum": 100
},
@@ -667,7 +667,7 @@
"lx_report_prop": {
"type": "string",
"default": "locus_name",
- "description": "Metadata label to use for aggregation. Only alphanumeric characters, underscores and dashes are allowed in names",
+ "description": "Metadata label to use for aggregation. Only alphanumeric characters, underscores and dashes are allowed in names.",
"pattern": "^[A-Za-z0-9_-]*$"
},
"lx_report_max_ambig": {