-
Notifications
You must be signed in to change notification settings - Fork 23
denoise_seq
Sequences from Next Generation Sequencing projects contain errors introduced by polymerases during growth, amplification, and sequencing, but also errors introduced by the sequencing platforms such as miscalled bases and homopolymer errors. For amplicon type sequencing you can used the abundance of sequences from a particular region to level out these errors. This is done by denoise_seq which clusters all sequences with quality scores from the stream using Usearch at a specified clustering identity threshold. These clusters are subsequenctly aligned using Ustar, and for each aligned cluster a consensus sequence is determined along with mean quality scores for included residues.
denoise_seq should be used on seperate amplicon experiments only (i.e. after demultiplexing).
The consensus determination is controlled by a number of parameters. Consider the below alignment:
0 1 2 3 4 5 6
123456789012345678901234567890123456789012345678901234567890 <- position
---------GCTTCAGCATCGACTACGACtACGACTATCAGCTACGATCcGATCGGATCg
--------AGCTTCAGCATCGACTACGACtACGACTATCAGCTACGATCcGATCGGATCg
-------TAGCTTCAGCATCGACTACGACtACGACTATCAGCTACGATCGGATCGGATCg
------CTAGCTTCAGCATCGACTACGACtACGACTATCAGCTACGATCGGATCGGATCg
-----ACTAGCTTCAGCATCGACTACGACtACGACTATCAGCTACGATCGGATCGGATCg
----CACTAGCATCtGCATCGACTACGAC-ACGACTATCAGCTACGATCGGATCGGATCg
---gCACTAGCATC-GCATCGACTACGAC-ACGACTATC-GCTACGATCcGATCGGATCg
--cgCACTAGCATC-GCATCGACTACGAC-ACGACTATC-GCTACGATCGGATCGGATCg
-acgCACTAGCATC-GCATCGACTACGAC-ACGACTATC-GCTACGATCGGATCGGATC-
tacgCACTAGCATC-GCATCGACTACGAC-ACGACTATC-GCTACGATCGGATCGGATC-
----CACTAGCWTCAGCATCGACTACGAC-ACGACTATCAGCTACGATCGGATCGGATC- <- consensus
The sequence minimum (the -s
switch) indicates that columns with less than a specified number
of sequences present are removed. Thus, using -s 2
then columns 1 and 2 in the above alignment
will be deleted as indicated by the lower case letters.
The residue minimum (the -r
switch) indicates what percentage of residues that must be present,
not considering gaps, to be reported in the consensus. If a column contains 5 A's and 5 T's both of
these constitute more than 30% (default value) and the appropriate ambiguity residue (W) is
reported (column 12). If a column contain 5 A's and 1 T then the T constitute less than 30% and an
A is reported (column 15). Setting this value > 0.5 will result in the supression of ambiguity
codes unless there is ambiguity codes in the input sequences.
The gap maximum (the -g
switch) is used to remove columns that contain more than a specified
percentage of gaps. Thus, using the default value 0.4 will remove columns 1-4 and 30, but not 40.
The quality minimum (the -q
switch) remove all residues with a quality score below 10 (default).
Quality scores are not indicated in the above alignment, but row 1,2 and 7 in column 50 is shown in
lower case indicating that these residues were masked by this criteria.
The quality mean (the -Q
switch) remove columns where the mean quality score for non-gap
residues is below 15 (default value). Which could have been the case for column 60.
Note that the above filters may remove the same residues, and that the execution order of the filters are the same order as listed above.
The output of denoise_seq is consensus sequences from each cluster with mean quality scores
for each column. The cluster number is indicated by the CLUSTER
key and the number of sequences
in each cluster is indicated by the CLUSTER_COUNT
:
SEQ_NAME: 129_2
SEQ: CGTAGACTAGCCTACGGGGGCGAGCAGTGGGGAAGTAATTCAAAACACGATGCAGCGACGCCG
SEQ_LEN: 63
SCORES: eeeeeeeccb^]]YVNNNNNUORRRWWYTUUUYYQVYZ\WKKKLLLLMPPOPPPPQPPPPPPN
CLUSTER: 129
CLUSTER_COUNT: 2
---
Warning: denoise_seq is in a testing phase and rapid changes can be expected. Notice that using
the --verbose
switch will dump the aligned clusters allowing visual inspection. Lowercase residues
in the verbose output are deleted during the determination of the consensus sequence.
Usearch must be installed in order for denoise_seq to work.
Read more here:
http://www.drive5.com/usearch/
... | denoise_seq [options]
[-? | --help] # Print full usage description.
[-i <float> | --cluster_ident=<float>] # Minimum global cluster identity - Default=0.97
[-c <uint> | --cluster_min=<uint>] # Minimum cluster size to process - Default=1
[-s <uint> | --sequence_min=<uint>] # Minimum sequence coverage in alignment - Default=1
[-r <float> | --residue_min=<uint>] # Minimum frequency of residues - Default=0.3
[-g <float> | --gap_max=<float>] # Maximum gap ratio in columns - Default=0.4
[-q <uint> | --quality_min=<uint>] # Minimum quality score - Default=10
[-Q <uint> | --quality_mean=<uint>] # Minimum column quality mean score - Default=15
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
We can denoise an amplicon experiment in a FASTQ file using denoise_seq like this:
read_fastq -i input.fq | denoise_seq | write_fastq -o clusters.fq
Thus denoised consensus sequences representing each of the clusters will be written to clusters.fq
It is possible to run denoise_seq iteratively for two or three rounds:
read_fastq -i input.fq | denoise_seq | denoise_seq | denoise_seq | write_fastq -o clusters.fq
Warning the behaviour of usearch with ambiguity codes is not clear so remove these from the input. Also, outputting ambiguity codes can be prevented by setting -r > 0.5.
For iterative denoising, you can inflate the clusters using duplicate_record to get the correct cluster counts at the end:
read_fastq -i input.fq |
denoise_seq -i 1 -r 0.55 |
split_vals -k SEQ_NAME -K CLUSTER,CLUSTER_COUNT |
duplicate_record -k CLUSTER_COUNT |
denoise_seq -i 0.97 -c 5 |
write_fastq -o consensus.fq
Martin Asser Hansen - Copyright (C) - All rights reserved.
February 2012
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
denoise_seq is part of the Biopieces framework.