Determines which reference sequence is more likely to be present in a given sample
seq_typing is a tool that determines the type of a given sample using either a read mapping approach or a sequence Blast search against a set of reference sequences.
For the read mapping approach, the sample's reads are mapped to the given reference sequences using Bowtie2, parsed with Samtools and analysed via ReMatCh. Based on the length of the sequence covered and its depth of coverage, seq_typing returns the type associated with the reference sequence that is most likely to be present. Among the sequences that pass the defined thresholds, the selected one is the sequence covered to the greatest extent, then with the highest depth of coverage, then with the highest identity (criteria applied hierarchically in this order).
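The hierarchical selection described above can be sketched in Python as follows. This is a minimal illustration, not seq_typing's actual code: the `ReferenceStats` container and the `select_reference` function are hypothetical names, the coverage and depth defaults are taken from the `--minGeneCoverage` and `--minDepthCoverage` help text, and the identity threshold of 80 is an arbitrary value chosen only for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ReferenceStats:
    """Per-reference mapping summary (hypothetical container, not seq_typing's own)."""
    name: str
    covered: float   # percentage of the reference sequence covered (0-100)
    depth: float     # mean depth of coverage
    identity: float  # percentage identity of the covered sequence (0-100)


def select_reference(stats: Iterable[ReferenceStats],
                     min_identity: float,
                     min_covered: float = 60.0,
                     min_depth: float = 2.0) -> Optional[ReferenceStats]:
    """Return the best reference passing all thresholds, or None if none passes."""
    passing = [s for s in stats
               if s.covered >= min_covered
               and s.depth >= min_depth
               and s.identity >= min_identity]
    if not passing:
        return None
    # Tuples compare element by element, which gives the hierarchical order:
    # coverage first, then depth of coverage, then identity.
    return max(passing, key=lambda s: (s.covered, s.depth, s.identity))


candidates = [
    ReferenceStats("wzx_208_AF529080_O26", 100.0, 223.3, 100.0),
    ReferenceStats("wzy_192_AF529080_O26", 100.0, 281.9, 100.0),
]
best = select_reference(candidates, min_identity=80.0)  # 80.0 is illustrative only
print(best.name if best else "NT")
```

Both candidates are fully covered, so the tie is broken by depth of coverage and the second sequence is chosen.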
For the Blast approach (when using sequence fasta files), the hit selected for each DB sequence is the best Blast hit. The best hit is defined by the largest alignment length, highest similarity, lowest E-value, lowest number of gaps, and largest reference sequence length (applied hierarchically in this order). The criteria for selecting a sequence are the same as those used with the read mapping approach (although the depth of coverage will always be 1).
In both cases, manual curation and sequence type definition are required to produce the reference sequences database.
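The best-hit ranking described for the Blast approach can be sketched as a single sort key. This is an illustration under assumptions: `BlastHit` and `best_hit` are hypothetical names, and the field names are not taken from seq_typing or from any Blast output format.

```python
from typing import NamedTuple, Optional, Sequence


class BlastHit(NamedTuple):
    """One Blast hit against a DB sequence (hypothetical field names)."""
    subject: str
    align_length: int
    similarity: float
    evalue: float
    gaps: int
    subject_length: int


def best_hit(hits: Sequence[BlastHit]) -> Optional[BlastHit]:
    """Pick the best hit using the hierarchical order given in the text."""
    if not hits:
        return None
    # Larger is better for length and similarity; smaller is better for
    # E-value and gaps, so those two are negated inside the key.
    return max(hits, key=lambda h: (h.align_length, h.similarity,
                                    -h.evalue, -h.gaps, h.subject_length))


example_hits = [
    BlastHit("ref_a", 1000, 99.0, 0.0, 2, 1000),
    BlastHit("ref_b", 1000, 99.0, 0.0, 0, 1000),
    BlastHit("ref_c", 900, 100.0, 0.0, 0, 1000),
]
print(best_hit(example_hits).subject)
```

Here `ref_c` loses on alignment length despite its higher similarity, and `ref_b` beats `ref_a` because it has fewer gaps.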
- Illumina Fastq files
OR
- Sequence fasta file
For get_stx_db.py script:
ReMatCh:
git clone https://github.com/B-UMMI/ReMatCh.git
cd ReMatCh
python3 setup.py install
NOTE:
If you don't have permission for global system installation, try the following install command instead:
python3 setup.py install --user
Blast+:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-*-x64-linux.tar.gz
tar xf ncbi-blast-*-x64-linux.tar.gz
rm ncbi-blast-*-x64-linux.tar.gz
cd ncbi-blast-*/bin
# Temporarily add Blast binaries to the PATH
export PATH="$(pwd -P):$PATH"
# Permanently add Blast binaries to the PATH
echo "export PATH=\"$(pwd -P):\$PATH\"" >> ~/.profile
git clone https://github.com/B-UMMI/seq_typing.git
cd seq_typing
python3 setup.py install
NOTE:
If you don't have permission for global system installation, try the following install command instead:
python3 setup.py install --user
usage: seq_typing.py [-h] [--version] {reads,index,assembly,blast} ...
Determines which reference sequence is more likely to be present in a given
sample
optional arguments:
-h, --help Show this help message and exit
--version Version information
Subcommands:
Valid subcommands
{reads,index,assembly,blast}
Additional help
reads reads --help
index index --help
assembly assembly --help
blast blast --help
- index module: Creates Bowtie2 index. This is useful when running the same reference sequences file for different read datasets.
- reads module: Run seq_typing.py using fastq files. If running multiple samples using the same reference sequences file, consider running the seq_typing.py index module first.
- blast module: Creates Blast DB. This is useful when running the same DB sequence file for different assemblies.
- assembly module: Run seq_typing.py using a fasta file. If running multiple samples using the same DB sequence file, consider running the seq_typing.py blast module first.
Creates Bowtie2 index.
This is useful when running the same reference sequences file for different read datasets.
usage: seq_typing.py index [-h]
-r /path/to/reference.fasta ... | --org escherichia coli
[-o /path/to/output/directory/] [-j N]
Creates Bowtie2 index. This is useful when running the same reference
sequences file for different reads dataset.
optional arguments:
-h, --help show this help message and exit
Required one of the following options:
-r --reference /path/to/reference.fasta ...
Path to reference sequences files. If more than one
file is passed, a Bowtie2 index for each file will be
created. (default: None)
--org escherichia coli
Organism option with reference sequences provided
("seqtyping/reference_sequences/" folder) together
with seq_typing.py for typing (default: None)
General facultative options:
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./) (default: .)
-j N, --threads N Number of threads to use (default: 1) (default: 1)
Run seq_typing.py using fastq files.
usage: seq_typing.py reads [-h]
-f /path/to/input/file.fq.gz ...
-r /path/to/reference_sequence.fasta ... | --org escherichia coli
[-s sample-ID] [-o /path/to/output/directory/] [-j N]
[--typeSeparator _]
[--extraSeq N] [--minCovPresence N]
[--minCovCall N] [--minGeneCoverage N]
[--minDepthCoverage N] [--minGeneIdentity N]
[--bowtieAlgo="--very-sensitive-local"] [--maxNumMapLoc N]
[--doNotRemoveConsensus] [--saveNewAllele] [--typeNotInNew]
[--debug] [--resume]
Run seq_typing.py using fastq files. If running multiple samples using the
same reference sequences file, consider use first "seq_typing.py index"
module.
optional arguments:
-h, --help show this help message and exit
Required options:
-f --fastq /path/to/input/file.fq.gz ...
Path to single OR paired-end fastq files. If two files
are passed, they will be assumed as being the paired
fastq files
Required one of the following options:
-r --reference /path/to/reference_sequence.fasta ...
Path to reference sequences files. If Bowtie2 index was
already produced, only provide the file name that ends
with ".1.bt2", but without this termination (for
example, for a Bowtie2 index
"/file/sequences.fasta.1.bt2", only provide
"/file/sequences.fasta"). If no Bowtie2 index files
are found, those will be created in --outdir. If more
than one file is passed, a type for each file will be
determined. Give the files name in the same order that
the type must be determined. (default: None)
--org escherichia coli
Organism option with reference sequences provided
together with seq_typing.py for typing
("seqtyping/reference_sequences/" folder)
General facultative options:
-s --sample sample-ID
Sample name (default: sample)
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./)
-j --threads N Number of threads to use (default: 1)
--typeSeparator _ Last single character separating the general sequence
header from the last part containing the type (default: _)
--extraSeq N Sequence length added to both ends of target sequences
(usefull to improve reads mapping to the target one)
that will be trimmed in ReMatCh outputs
(default when not using --org: 0)
--minCovPresence N Reference position minimum coverage depth to consider
the position to be present in the sample
(default when not using --org: 5)
--minCovCall N Reference position minimum coverage depth to perform a
base call (default when not using --org: 10)
--minGeneCoverage N Minimum percentage of target reference sequence
covered to consider a sequence to be present (value
between [0, 100]) (default when not using --org: 60)
--minDepthCoverage N Minimum depth of coverage of target reference sequence
to consider a sequence to be present (default: 2)
--minGeneIdentity N Minimum percentage of identity of reference sequence
covered to consider a gene to be present (value
between [0, 100]). One INDEL will be considered as one
difference
--bowtieAlgo="--very-sensitive-local"
Bowtie2 alignment mode. It can be an end-to-end
alignment (unclipped alignment) or local alignment
(soft clipped alignment). Also, can choose between
fast or sensitive alignments. Please check Bowtie2
manual for extra information:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
This option should be provided between quotes and
starting with an empty space
(like --bowtieAlgo " --very-fast") or using equal
sign (like --bowtieAlgo="--very-fast")
(default when not using --org: "--very-sensitive-local")
--maxNumMapLoc N Maximum number of locations to which a read can map
(sometimes useful when mapping against similar sequences)
(default when not using --org: 1)
--saveNewAllele Save the new allele found for the selected type
(default: false)
--typeNotInNew Do not save the type of the selected sequence in the header
of the new allele (when writing uses the "--typeSeparator").
(default: false)
--doNotRemoveConsensus
Do not remove ReMatCh consensus sequences
--debug Debug mode: do not remove temporary files
--resume Resume seq_typing.py reads
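The quoting advice for --bowtieAlgo can be understood with a small Python sketch. It assumes that the option is parsed with Python's argparse, which the layout of the help text suggests but which is not confirmed here; the parser below is a stand-in, not seq_typing's own, and stripping the leading space is a choice made only for this example.

```python
import argparse

# Stand-in parser with a single option named like the one in the help text.
parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("--bowtieAlgo", type=str, default="--very-sensitive-local")

# Accepted: the value is attached to the option with an equal sign.
equal_form = parser.parse_args(["--bowtieAlgo=--very-fast"]).bowtieAlgo

# Accepted: the leading space stops argparse from reading the value as an option.
space_form = parser.parse_args(["--bowtieAlgo", " --very-fast"]).bowtieAlgo.strip()

# Rejected: the bare value looks like another option, so argparse reports
# "expected one argument" and exits.
try:
    parser.parse_args(["--bowtieAlgo", "--very-fast"])
    bare_form_accepted = True
except SystemExit:
    bare_form_accepted = False

print(equal_form, space_form, bare_form_accepted)
```

Both accepted forms yield the same Bowtie2 preset string, while the bare form fails before any analysis starts.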
Creates Blast DB.
This is useful when running the same DB sequence file for different assemblies.
usage: seq_typing.py blast [-h]
-t nucl
-f /path/to/db.sequences.fasta ... | --org escherichia coli
[-o /path/to/output/directory/] [--extraSeq N]
Creates Blast DB. This is useful when running the same DB sequence file for
different assemblies.
optional arguments:
-h, --help show this help message and exit
Required one of the following options:
-f --fasta /path/to/db.sequences.fasta ...
Path to DB sequences files. If more than one file is
passed, a Blast DB for each file will be created.
--org escherichia coli
Organism option with DB sequences files provided
("seqtyping/reference_sequences/" folder) together with
seq_typing.py for typing
Required option for --fasta:
-t nucl, --type nucl Blast DB type (available options: nucl, prot)
General facultative options:
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./)
--extraSeq N Sequence length added to both ends of target sequences
(usefull when analysing data by reads mapping)
that will be trimmed for Blast analysis.
(default when not using --org: 0)
Run seq_typing using a fasta file.
If running multiple samples using the same DB sequence file, consider running the seq_typing.py blast module first.
usage: seq_typing.py assembly [-h]
-f /path/to/query/assembly_file.fasta
-b /path/to/Blast/db.sequences.file ... -t nucl | --org escherichia coli
[-s sample-ID] [-o /path/to/output/directory/] [-j N]
[--typeSeparator _] [--extraSeq N] [--minGeneCoverage N]
[--minGeneIdentity N] [--saveNewAllele] [--typeNotInNew]
[--debug] [--resume]
Run seq_typing.py using a fasta file. If running multiple samples using the
same DB sequence file, consider use first "seq_typing.py blast"
module.
optional arguments:
-h, --help show this help message and exit
Required options:
-f /path/to/query/assembly_file.fasta, --fasta /path/to/query/assembly_file.fasta
Path to fasta file containing the query sequences from
which the types should be assessed
Required one of the following options:
-b --blast /path/to/Blast/db.sequences.file ...
Path to DB sequences files. If Blast DB was already
produced, only provide the file that do not end with
".n*" something (do not use for example
/blast_db.sequences.fasta.nhr). If no Blast DB is
found for the DB sequence file, one will be created in
--outdir. If more than one Blast DB file is passed, a
type for each file will be determined. Give the files
in the same order that the type must be determined.
--org escherichia coli
Organism option with DB sequences files provided
("seqtyping/reference_sequences/" folder) together with
seq_typing.py for typing
Required option for --blast:
-t --type nucl Blast DB type (available options: nucl, prot)
General facultative options:
-s --sample sample-ID
Sample name (default: sample)
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./)
-j --threads N Number of threads to use (default: 1)
--typeSeparator _ Last single character separating the general sequence
header from the last part containing the type (default: _)
--extraSeq N Sequence length added to both ends of target sequences
(usefull when analysing data by reads mapping)
that will be trimmed for Blast analysis.
--minGeneCoverage N Minimum percentage of target reference sequence
covered to consider a sequence to be present (value
between [0, 100]) (default when not using --org: 60)
--minGeneIdentity N Minimum percentage of identity of reference sequence
covered to consider a gene to be present (value
between [0, 100])
--saveNewAllele Save the new allele found for the selected type
(default: false)
--typeNotInNew Do not save the type of the selected sequence in the header
of the new allele (when writing uses the "--typeSeparator").
(default: false)
--debug Debug mode: do not remove temporary files
--resume Resume seq_typing.py assembly
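The --typeSeparator behaviour (the type is the part of the sequence header after the last separator character) can be sketched in Python as below. `type_from_header` is a hypothetical helper written for illustration, not a function from seq_typing; the example headers are the sequence names shown in the report tables of this document.

```python
def type_from_header(header: str, type_separator: str = "_") -> str:
    """Return the text after the last occurrence of the separator."""
    header = header.lstrip(">").strip()
    # rpartition splits on the LAST separator, so earlier separators in the
    # general part of the header are left untouched.
    _, found, seq_type = header.rpartition(type_separator)
    if not found:
        raise ValueError(f"separator {type_separator!r} not found in {header!r}")
    return seq_type


print(type_from_header("wzy_192_AF529080_O26"))
print(type_from_header("stx1A:15:AF461168:A:seqTyping_stx1a"))
```

Because only the last separator matters, reference headers can freely contain the separator character in their general part, as long as the type itself does not.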
For the following organisms, reference sequences are provided.
- Serotyping:
  - Escherichia coli
  - Staph agr (Staphylococcus aureus, agr typing)
  - Haemophilus influenzae
  - GBS sero (Group B Streptococcus, Streptococcus agalactiae, serotype)
  - Dengue virus (with genotype information)
- Other types:
  - GBS pili (Group B Streptococcus, Streptococcus agalactiae, pili typing)
  - GBS surf (Group B Streptococcus, Streptococcus agalactiae, surface protein typing)
  - stx subtyping (Escherichia coli stx subtyping)
Use the --org option with one of these organism options.
Serotyping Haemophilus influenzae using the provided reference sequences (which use only one reference sequences file):
seq_typing.py reads --org Haemophilus influenzae \
--fastq sample_1.fq.gz sample_2.fq.gz \
--outdir sample_out/ \
--threads 2
Serotyping Escherichia coli using the provided reference sequences (which use two reference sequences files):
seq_typing.py reads --org Escherichia coli \
--fastq sample_1.fq.gz sample_2.fq.gz \
--outdir sample_out/ \
--threads 2
Type one sample with a user's own set of reference sequences (using, for example, single-end reads):
seq_typing.py reads --reference references/Ecoli/O_type.fasta references/Ecoli/H_type.fasta \
--fastq sample.fq.gz \
--outdir sample_out/ \
--threads 2
When running the same reference sequences files for different read datasets, the Bowtie2 index files can be produced beforehand to speed up the analysis.
Example using the provided Dengue virus reference sequences (which use only one reference sequences file):
seq_typing.py index --org Dengue virus \
--outdir index_out/ \
--threads 2
# Run seq_typing using created database
seq_typing.py reads --reference index_out/1_GenotypesDENV_14-05-18.fasta \
--fastq sample_1.fq.gz sample_2.fq.gz \
--outdir sample_out/ \
--threads 2
The following examples show how to use a user's own reference sequences files. If many samples will be analysed using the same reference sequences file, running a preliminary seq_typing.py index step is advisable.
Run seq_typing without previous construction of reference database:
seq_typing.py reads --reference references/O_type.fasta references/H_type.fasta \
--fastq sample_1.fq.gz sample_2.fq.gz \
--outdir sample_out/ \
--threads 2
Run seq_typing with a preliminary step for Bowtie2 index production (useful when running multiple samples with the same reference sequences file):
# Preliminary step for Bowtie2 index construction.
seq_typing.py index --reference references/O_type.fasta references/H_type.fasta \
--outdir index_out/ \
--threads 2
# Run seq_typing using created database
seq_typing.py reads --reference index_out/O_type.fasta index_out/H_type.fasta \
--fastq sample_1.fq.gz sample_2.fq.gz \
--outdir sample_out/ \
--threads 2
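When scripting runs against an existing index, the value expected by --reference (the index file name that ends with ".1.bt2", without that ending) can be derived as in the following Python sketch. Both helper functions are hypothetical and written only for illustration; they are not part of seq_typing.

```python
from pathlib import Path

INDEX_SUFFIX = ".1.bt2"  # ending named in the --reference help text


def reference_argument(index_file: str) -> str:
    """Strip the '.1.bt2' ending to get the value to pass to --reference."""
    if not index_file.endswith(INDEX_SUFFIX):
        raise ValueError(f"{index_file!r} does not end with {INDEX_SUFFIX!r}")
    return index_file[:-len(INDEX_SUFFIX)]


def has_bowtie2_index(reference: str) -> bool:
    """Check whether the first Bowtie2 index file exists next to the reference."""
    return Path(reference + INDEX_SUFFIX).is_file()


print(reference_argument("/file/sequences.fasta.1.bt2"))
```

A wrapper script could call `has_bowtie2_index` on each reference to decide whether a preliminary seq_typing.py index step is still needed.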
Type Dengue virus using assemblies with the provided reference sequences (which use only one reference sequences file):
seq_typing.py assembly --org Dengue virus \
--fasta sample.fasta \
--outdir sample_out/ \
--threads 2
When running the same database for different samples, a single Blast database should be produced first to speed up the analysis.
Example using the provided Escherichia coli reference sequences (which use two reference sequences files):
seq_typing.py blast --org Escherichia coli \
--outdir db_out/
# Run seq_typing using created database
seq_typing.py assembly --blast db_out/1_O_type.fasta db_out/2_H_type.fasta \
--type nucl \
--fasta sample.fasta \
--outdir sample_out/ \
--threads 2
For a user's own reference sequences files, seq_typing requires the construction of the reference database. seq_typing will construct the reference DB while analysing the sample's sequences. If many samples will be analysed using the same reference sequences file, running a preliminary seq_typing.py blast step is advisable.
Run seq_typing without previous construction of reference database:
seq_typing.py assembly --blast references/O_type.fasta references/H_type.fasta \
--type nucl \
--fasta sample.fasta \
--outdir sample_out/ \
--threads 2
Run seq_typing with a preliminary step for reference DB construction (useful when running multiple samples with the same reference sequences file):
# Preliminary step for reference DB construction.
seq_typing.py blast --fasta references/O_type.fasta references/H_type.fasta \
--type nucl \
--outdir db_out/
# Run seq_typing using created database
seq_typing.py assembly --blast db_out/O_type.fasta db_out/H_type.fasta \
--type nucl \
--fasta sample.fasta \
--outdir sample_out/ \
--threads 2
A specific script (ecoli_stx_subtyping.py) was created for E. coli stx subtyping in order to accommodate the possible existence of stx2 paralogs.
It works very similarly to seq_typing.py.
usage: ecoli_stx_subtyping.py [-h] [--version] {reads,assembly,blast} ...
Gets E. coli stx subtypes
optional arguments:
-h, --help Show this help message and exit
--version Version information
Subcommands:
Valid subcommands
{reads,assembly}
Additional help
reads reads --help
assembly assembly --help
Run ecoli_stx_subtyping.py using fastq files.
usage: ecoli_stx_subtyping.py reads [-h]
-f /path/to/input/file.fq.gz ...
-r /path/to/reference_sequence.fasta ... | --org stx subtyping
[--stx2covered N] [--stx2identity N]
[--sample sample-ID] [-o /path/to/output/directory/] [-j N]
[--typeSeparator _]
[--extraSeq N] [--minCovPresence N]
[--minCovCall N] [--minGeneCoverage N]
[--minDepthCoverage N] [--minGeneIdentity N]
[--bowtieAlgo="--very-sensitive-local"] [--maxNumMapLoc N]
[--doNotRemoveConsensus] [--saveNewAllele] [--typeNotInNew]
[--debug] [--resume]
Run ecoli_stx_subtyping.py using fastq files
optional arguments:
-h, --help show this help message and exit
Required options:
-f --fastq /path/to/input/file.fq.gz ...
Path to single OR paired-end fastq files. If two files
are passed, they will be assumed as being the paired
fastq files
Required one of the following options:
-r --reference 1_virulence_db.stx1_subtyping.fasta 2_virulence_db.stx2_subtyping.fasta
Path to stx subtyping reference sequences (if not want to use
the ones provided together with seq_typing.py)
--org stx subtyping To use stx subtyping reference sequences provided
together with seq_typing.py
ecoli_stx_subtyping specific facultative options:
--stx2covered N Minimal percentage of sequence covered to consider
extra stx2 subtypes (value between [0, 100]) (default: 100)
--stx2identity N Minimal sequence identity to consider extra stx2
subtypes (value between [0, 100]) (default: 99.5)
General facultative options:
-s --sample sample-ID
Sample name (default: sample)
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./)
-j --threads N Number of threads to use (default: 1)
--typeSeparator _ Last single character separating the general sequence
header from the last part containing the type (default: _)
--extraSeq N Sequence length added to both ends of target sequences
(usefull to improve reads mapping to the target one)
that will be trimmed in ReMatCh outputs (default: 0)
--minCovPresence N Reference position minimum coverage depth to consider
the position to be present in the sample (default: 5)
--minCovCall N Reference position minimum coverage depth to perform a
base call (default: 10)
--minGeneCoverage N Minimum percentage of target reference sequence
covered to consider a sequence to be present (value
between [0, 100]) (default: 60)
--minDepthCoverage N Minimum depth of coverage of target reference sequence
to consider a sequence to be present (default: 2)
--minGeneIdentity N Minimum percentage of identity of reference sequence
covered to consider a gene to be present (value
between [0, 100]). One INDEL will be considered as one
difference
--bowtieAlgo="--very-sensitive-local"
Bowtie2 alignment mode. It can be an end-to-end
alignment (unclipped alignment) or local alignment
(soft clipped alignment). Also, can choose between
fast or sensitive alignments. Please check Bowtie2
manual for extra information:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
This option should be provided between quotes and
starting with an empty space
(like --bowtieAlgo " --very-fast") or using equal
sign (like --bowtieAlgo="--very-fast")
(default when not using --org: "--very-sensitive-local")
--maxNumMapLoc N Maximum number of locations to which a read can map
(sometimes useful when mapping against similar sequences)
(default when not using --org: 1)
--saveNewAllele Save the new allele found for the selected type
(default: false)
--typeNotInNew Do not save the type of the selected sequence in the header
of the new allele (when writing uses the "--typeSeparator").
(default: false)
--doNotRemoveConsensus
Do not remove ReMatCh consensus sequences
--debug Debug mode: do not remove temporary files
--resume Resume seq_typing.py reads
Run ecoli_stx_subtyping using a fasta file.
usage: ecoli_stx_subtyping.py assembly [-h]
-f /path/to/query/assembly_file.fasta
-b /path/to/Blast/db.sequences.file ... -t nucl | --org stx subtyping
[--stx2covered N] [--stx2identity N]
[--sample sample-ID] [-o /path/to/output/directory/] [-j N]
[--typeSeparator _] [--extraSeq N] [--minGeneCoverage N]
[--minGeneIdentity N] [--saveNewAllele] [--typeNotInNew]
[--debug] [--resume]
Run ecoli_stx_subtyping.py using a fasta file. If running multiple samples using the
same DB sequence file, consider use first "seq_typing.py blast"
module.
optional arguments:
-h, --help show this help message and exit
Required options:
-f /path/to/query/assembly_file.fasta, --fasta /path/to/query/assembly_file.fasta
Path to fasta file containing the query sequences from
which the stx subtypes should be assessed
Required one of the following options:
-b --blast 1_virulence_db.stx1_subtyping.fasta 2_virulence_db.stx2_subtyping.fasta
Path to stx subtyping DB sequence file (if not want to use
the ones provided together with seq_typing.py).
If Blast DB was already produced (using "seq_typing.py blast"
module) only provide the file that do not end with ".n*"
something (do not use for example
/blast_db.sequences.fasta.nhr). If no Blast DB is
found for the DB sequence file, one will be created in
--outdir. If more than one Blast DB file is passed, a
type for each file will be determined. Give the files
in the same order that the type must be determined.
--org stx subtyping To use stx subtyping reference sequences provided
together with seq_typing.py
Required option for --blast:
-t --type nucl Blast DB type (available options: nucl, prot)
ecoli_stx_subtyping specific facultative options:
--stx2covered 95 Minimal percentage of sequence covered to consider
extra stx2 subtypes (value between [0, 100]) (default: 100)
--stx2identity 95 Minimal sequence identity to consider extra stx2
subtypes (value between [0, 100]) (default: 99.5)
General facultative options:
-s --sample sample-ID
Sample name (default: sample)
-o --outdir /path/to/output/directory/
Path to the directory where the information will be
stored (default: ./)
-j --threads N Number of threads to use (default: 1)
--typeSeparator _ Last single character separating the general sequence
header from the last part containing the type (default: _)
--extraSeq N Sequence length added to both ends of target sequences
(usefull when analysing data by reads mapping)
that will be trimmed for Blast analysis.
--minGeneCoverage N Minimum percentage of target reference sequence
covered to consider a sequence to be present (value
between [0, 100]) (default: 60)
--minGeneIdentity N Minimum percentage of identity of reference sequence
covered to consider a gene to be present (value
between [0, 100])
--saveNewAllele Save the new allele found for the selected type
(default: false)
--typeNotInNew Do not save the type of the selected sequence in the header
of the new allele (when writing uses the "--typeSeparator").
(default: false)
--debug Debug mode: do not remove temporary files
--resume Resume seq_typing.py reads
To construct the stx subtyping Blast DB, use the seq_typing.py blast module:
seq_typing.py blast --org stx subtyping
Updated stx subtyping reference sequences can be obtained from the VirulenceFinder DB Bitbucket account. A specific script was created to get the most recent stx reference sequences.
usage: get_stx_db.py [-h] [--version]
[-o /path/to/output/directory/]
Gets STX sequences from virulencefinder_db to produce a STX subtyping DB.
optional arguments:
-h, --help show this help message and exit
--version Version information
General facultative options:
-o --outdir /path/to/output/directory/
Path to the directory where the sequences will be
stored (default: ./)
Usage example
get_stx_db.py --outdir /path/output/directory/
What is a (Docker) container?
"(...) is a tool that can package an application and its dependencies in a virtual container that can run on any Linux server," Lyman explained. "This helps enable flexibility and portability on where the application can run, whether on premise, public cloud, private cloud, bare metal, etc." From here.
Why are containers useful?
"(...) Docker containers technology allows you to write self-contained and truly reproducible computational pipelines." From here.
For detailed information on how to run seq_typing using containers, please check here.
seq_typing.report.txt
Text file with the typing result. If it was not possible to determine a type for a given reference file, NT (for Non-Typeable) will be returned for that file.
Example of E. coli serotyping (two reference files):
O157:H7
Example of Dengue virus serotyping and genotyping (only one reference file):
3-III
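For downstream scripts, the report line can be split back into one type per reference file. The Python sketch below assumes, based only on the examples in this section, that the per-file types are joined with ":" and that NT marks a file without a determined type; `parse_report_line` is a hypothetical helper, and it would not work for types that themselves contain ":".

```python
from typing import List, Optional


def parse_report_line(line: str) -> List[Optional[str]]:
    """Split a report line into per-file types, mapping NT to None."""
    # Assumption: ':' joins the types of the successive reference files.
    types = line.strip().split(":")
    return [None if t == "NT" else t for t in types]


print(parse_report_line("O157:H7"))
print(parse_report_line("3-III"))
```

The order of the returned list follows the order in which the reference files were given on the command line.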
seq_typing.report_types.tab
Tabular file with detailed results:
- General fields
  - sequence_type: type of the results reported. Three values can be found here:
    - selected: the reference sequences selected for the reported typing result.
    - other_probable_type: other reference sequences that could have been selected because they fulfil the selection thresholds.
    - most_likely: the most likely reference sequences when no reference sequence fulfils the selection thresholds.
  - reference_file: the reference file where the sequences came from.
  - type: the type associated with the reference sequence
  - sequence: reference sequence name
  - sequenced_covered: percentage of the reference sequence covered
  - coverage_depth: mean depth of coverage of the reference sequence positions present (1 if an assembly was used)
  - sequence_identity: percentage identity of the reference sequence covered
- Assembly fields (filled with NA if reads were used)
  - query: name of the provided sequence that had a hit with the given reference sequence
  - q_start: hit starting position of the provided sequence
  - q_end: hit ending position of the provided sequence
  - s_start: hit starting position of the reference sequence
  - s_end: hit ending position of the reference sequence
  - evalue: hit E-value
  - gaps: number of gaps in the hit
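As an illustration of the sequenced_covered and coverage_depth fields for the reads approach, the Python sketch below computes both from a list of per-position depths. It is one reading of the field descriptions together with the --minCovPresence option, not ReMatCh's actual computation; `coverage_summary` is a hypothetical function.

```python
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int],
                     min_cov_presence: int = 5) -> Tuple[float, float]:
    """Return (percentage of positions present, mean depth of those positions)."""
    if not depths:
        raise ValueError("the reference sequence has no positions")
    # A position counts as present when its depth reaches the threshold.
    present = [d for d in depths if d >= min_cov_presence]
    covered = 100.0 * len(present) / len(depths)
    mean_depth = sum(present) / len(present) if present else 0.0
    return covered, mean_depth


print(coverage_summary([10, 20, 0, 30]))
```

In this toy case three of four positions are present, and the mean depth is taken over those three positions only.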
Example of E. coli serotyping (two reference files) using reads:
#sequence_type | reference_file | type | sequence | sequenced_covered | coverage_depth | sequence_identity | query | q_start | q_end | s_start | s_end | evalue | gaps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
selected | O_type.fasta | O26 | wzy_192_AF529080_O26 | 100.0 | 281.95405669599216 | 100.0 | NA | NA | NA | NA | NA | NA | NA |
selected | H_type.fasta | H11 | fliC_269_AY337465_H11 | 99.4546693933197 | 51.76490747087046 | 99.86291980808772 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | O_type.fasta | O26 | wzx_208_AF529080_O26 | 100.0 | 223.3072050673001 | 100.0 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | H_type.fasta | H11 | fliC_276_AY337472_H11 | 98.84117246080436 | 37.52551724137931 | 99.86206896551724 | NA | NA | NA | NA | NA | NA | NA |
Example of Dengue virus serotyping and genotyping (only one reference file) using assembly:
#sequence_type | reference_file | type | sequence | sequenced_covered | coverage_depth | sequence_identity | query | q_start | q_end | s_start | s_end | evalue | gaps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
selected | 1_GenotypesDENV_14-05-18.fasta | 3-III | gb:EU529683#...#Subtype:3-III#Host:Human#seqTyping_3-III | 100.0 | 1 | 99.223 | NODE_1_length_10319_cov_2021.782660 | 138 | 10307 | 10170 | 1 | 0.0 | 0 |
other_probable_type | 1_GenotypesDENV_14-05-18.fasta | 1-V | gb:GQ868570#...#Subtype:1-V#Host:Human#seqTyping_1-V | 100.0 | 1 | 99.479 | NODE_2_length_10199_cov_229.028848 | 13 | 10188 | 1 | 10176 | 0.0 | 0 |
other_probable_type | 1_GenotypesDENV_14-05-18.fasta | 4-II | gb:GQ868585#...#Subtype:4-II#Host:Human#seqTyping_4-II | 100.0 | 1 | 99.38 | NODE_4_length_10182_cov_29.854132 | 13 | 10173 | 1 | 10161 | 0.0 | 3 |
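The tabular report can be loaded with Python's csv module, as in the sketch below. It assumes that the file on disk is tab-delimited (suggested by the ".tab" extension, while the tables in this document are rendered with "|") and that the header line starts with "#" as shown above. `read_report_types` and `selected_types` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, Iterable, List

COLUMNS = ["#sequence_type", "reference_file", "type", "sequence",
           "sequenced_covered", "coverage_depth", "sequence_identity",
           "query", "q_start", "q_end", "s_start", "s_end", "evalue", "gaps"]


def read_report_types(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read the report into dictionaries, dropping the '#' from the header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{key.lstrip("#"): value for key, value in row.items()
             if key is not None}
            for row in reader]


def selected_types(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each reference file to the type of its selected sequence."""
    return {row["reference_file"]: row["type"]
            for row in rows if row["sequence_type"] == "selected"}


# Small in-memory stand-in for seq_typing.report_types.tab (values from the
# E. coli reads example in this document).
example = "\n".join("\t".join(fields) for fields in [
    COLUMNS,
    ["selected", "O_type.fasta", "O26", "wzy_192_AF529080_O26",
     "100.0", "281.95405669599216", "100.0"] + ["NA"] * 7,
    ["selected", "H_type.fasta", "H11", "fliC_269_AY337465_H11",
     "99.4546693933197", "51.76490747087046", "99.86291980808772"] + ["NA"] * 7,
    ["other_probable_type", "O_type.fasta", "O26", "wzx_208_AF529080_O26",
     "100.0", "223.3072050673001", "100.0"] + ["NA"] * 7,
]) + "\n"

rows = read_report_types(io.StringIO(example))
print(selected_types(rows))
```

For a real run, `io.StringIO(example)` would be replaced by an open handle to the report file in the output directory.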
new_allele/
Folder with a subfolder named after the reference file from which the new allele was found. The novel allele is stored inside a file named after the selected type. If it is not possible to retrieve the entire sequence of the new allele, the "_partial" string is added to the header. The header of the sequence contains the sample name (the default is "sample") and the selected type separated by the --typeSeparator character (this behaviour can be deactivated with the --typeNotInNew option).
When extra/flanking sequences are used around the target sequence, and the full length of such extra/flanking sequences could be retrieved, a new file ending with ".extra_seq.fasta" is created (not yet implemented for reads).
Example
For Dengue virus serotyping and genotyping:
/outdir/
    seq_typing.report.txt
    seq_typing.report_types.tab
    new_allele/
        1_GenotypesDENV_14-05-18.fasta/
            3-III.fasta
                >sample_partial_3-III
                ATGTAAGCATGAGGTCACCAT ...
            3-III.extra_seq.fasta
                >sample_partial_3-III
                CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...
    run.20190131-162341.log
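The header rules described for new_allele/ can be sketched in Python as follows. This is an illustration inferred from the description and the example headers in this document; `new_allele_header` is a hypothetical function, and the exact position of the "_partial" string in seq_typing's output is assumed from those examples.

```python
def new_allele_header(sample: str,
                      seq_type: str,
                      type_separator: str = "_",
                      partial: bool = False,
                      type_not_in_new: bool = False) -> str:
    """Build the fasta header for a new allele."""
    header = sample
    if partial:
        # Added when the entire sequence of the new allele was not retrieved.
        header += "_partial"
    if not type_not_in_new:
        # The selected type is appended after the --typeSeparator character.
        header += type_separator + seq_type
    return ">" + header


print(new_allele_header("sample", "3-III", partial=True))
print(new_allele_header("sample", "stx2c"))
```

With --typeNotInNew, the type part is left out, so only the sample name (and "_partial" when applicable) remains in the header.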
run.*.log
Running log file.
seq_typing.ecoli_stx_subtyping.txt
Text file with the typing result. The secondary results for stx2 genes are presented in parentheses.
Example:
stx1a:stx2c(stx2d)
NOTE: For the stx2 gene, the stx2a, stx2c and stx2d variants are grouped together as stx2acd because these subtypes are the most potent in causing HUS and are difficult to separate from each other with the methods currently in use.
seq_typing.ecoli_stx_subtyping.report_types.tab
Tabular file with detailed results similar to the above seq_typing.report_types.tab file:
Example (using reads):
#sequence_type | reference_file | type | sequence | sequenced_covered | coverage_depth | sequence_identity | query | q_start | q_end | s_start | s_end | evalue | gaps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
selected | 1_virulence_db.stx1_subtyping.fasta | stx1a | stx1A:15:AF461168:A:seqTyping_stx1a | 100.0 | 65.37447257383967 | 100.0 | NA | NA | NA | NA | NA | NA | NA |
selected | 2_virulence_db.stx2_subtyping.fasta | stx2c | stx2B:15:AB071845:C:seqTyping_stx2c | 100.0 | 19.377777777777776 | 100.0 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | 1_virulence_db.stx1_subtyping.fasta | stx1c | stx1B:11:AB071620:C:seqTyping_stx1c | 100.0 | 21.64814814814815 | 99.25925925925925 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | 1_virulence_db.stx1_subtyping.fasta | stx1a | stx1B:14:AM230663:A:seqTyping_stx1a | 100.0 | 45.06666666666667 | 100.0 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | 2_virulence_db.stx2_subtyping.fasta | stx2c | stx2B:10:EF441604:C:seqTyping_stx2c | 100.0 | 17.2 | 99.25925925925925 | NA | NA | NA | NA | NA | NA | NA |
other_probable_type | 2_virulence_db.stx2_subtyping.fasta | stx2d | stx2B:11:FM998840:D:seqTyping_stx2d | 100.0 | 9.996296296296297 | 99.62962962962963 | NA | NA | NA | NA | NA | NA | NA |
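One plausible reading of how the table above relates to the reported string is sketched below in Python: extra stx2 subtypes are taken from the other_probable_type rows of the stx2 file that reach the --stx2covered and --stx2identity thresholds and differ from the selected subtype. This is an assumption made for illustration, not ecoli_stx_subtyping.py's confirmed logic; `stx_result` is a hypothetical function and the "," used to join several extra subtypes is also hypothetical.

```python
from typing import Dict, List


def stx_result(rows: List[Dict[str, str]], stx2_file: str,
               stx2_covered: float = 100.0, stx2_identity: float = 99.5) -> str:
    """Format 'stx1:stx2(extra stx2 subtypes)' from report rows."""
    # Selected type per reference file, in order of appearance.
    selected = {row["reference_file"]: row["type"]
                for row in rows if row["sequence_type"] == "selected"}
    extra: List[str] = []
    for row in rows:
        if (row["sequence_type"] == "other_probable_type"
                and row["reference_file"] == stx2_file
                and row["type"] != selected.get(stx2_file)
                and row["type"] not in extra
                and float(row["sequenced_covered"]) >= stx2_covered
                and float(row["sequence_identity"]) >= stx2_identity):
            extra.append(row["type"])
    parts = []
    for reference_file, seq_type in selected.items():
        if reference_file == stx2_file and extra:
            seq_type += "(" + ",".join(extra) + ")"  # ',' is a hypothetical joiner
        parts.append(seq_type)
    return ":".join(parts)


STX1 = "1_virulence_db.stx1_subtyping.fasta"
STX2 = "2_virulence_db.stx2_subtyping.fasta"


def make_row(kind: str, ref: str, seq_type: str, covered: str, identity: str) -> Dict[str, str]:
    return {"sequence_type": kind, "reference_file": ref, "type": seq_type,
            "sequenced_covered": covered, "sequence_identity": identity}


# Values copied from the example table (only the columns used here).
table = [
    make_row("selected", STX1, "stx1a", "100.0", "100.0"),
    make_row("selected", STX2, "stx2c", "100.0", "100.0"),
    make_row("other_probable_type", STX1, "stx1c", "100.0", "99.25925925925925"),
    make_row("other_probable_type", STX1, "stx1a", "100.0", "100.0"),
    make_row("other_probable_type", STX2, "stx2c", "100.0", "99.25925925925925"),
    make_row("other_probable_type", STX2, "stx2d", "100.0", "99.62962962962963"),
]
print(stx_result(table, STX2))
```

Under these assumptions the example table yields the same string as the example given for seq_typing.ecoli_stx_subtyping.txt, and raising --stx2identity above the stx2d identity would drop the part in parentheses.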
new_allele/
Folder with a subfolder named after the reference file from which the new allele was found. The novel allele is stored inside a file named after the selected type. If it is not possible to retrieve the entire sequence of the new allele, the "_partial" string is added to the header. The header of the sequence contains the sample name (the default is "sample") and the selected type separated by the --typeSeparator character (this behaviour can be deactivated with the --typeNotInNew option).
When extra/flanking sequences are used around the target sequence, and the full length of such extra/flanking sequences could be retrieved, a new file ending with ".extra_seq.fasta" is created (not yet implemented for reads).
Example:
/outdir/
    seq_typing.ecoli_stx_subtyping.txt
    seq_typing.ecoli_stx_subtyping.report_types.tab
    new_allele/
        2_virulence_db.stx2_subtyping.fasta/
            stx2c.fasta
                >sample_stx2c
                ATGTAAGCATGAGGTCACCAT ...
            stx2c.extra_seq.fasta
                >sample_stx2c
                CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...
        1_virulence_db.stx1_subtyping.fasta/
            stx1a.extra_seq.fasta
                >sample_partial
                CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...
    run.20190131-162341.log
run.*.log
Running log file.
MP Machado, J Halkilahti, I Mendes, M Pinto, E Lizarazo, JP Gomes, M Ramirez, M Rossi, JA Carrico. seq_typing GitHub https://github.com/B-UMMI/seq_typing
Miguel Machado
[email protected]