Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: mask_seq

Description

mask_seq masks sequences in the stream using either soft masking (default) or hard masking. Soft masking converts every residue whose quality score is below a specified cutoff to lower case, while hard masking replaces each such residue with an N. The sequences are the values of SEQ keys and the quality scores are the values of SCORES keys. The SCORES are encoded as ASCII characters from '@' to 'h', representing scores from 0 to 40.
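To make the score encoding concrete, here is a minimal Python sketch of the decoding step. It is not part of Biopieces: the function name `decode_scores` is hypothetical, and the offset of 64 is an assumption inferred from the statement that '@' to 'h' represent scores 0 to 40.

```python
def decode_scores(scores, offset=64):
    """Convert an ASCII-encoded SCORES string to a list of decimal scores.

    Assumption: '@' (ASCII 64) encodes score 0 and 'h' (ASCII 104)
    encodes score 40, so each score is the character code minus 64.
    """
    return [ord(char) - offset for char in scores]


print(decode_scores("@ATh"))  # [0, 1, 20, 40]
```

With the default cutoff of 20, a residue scored 'T' (20) would be left unmasked, while one scored 'S' (19) would be masked.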

Read more here:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

Usage

... | mask_seq [options]

Options

[-?          | --help]               #  Print full usage description.
[-c <int>    | --cutoff=<int>]       #  Cutoff used for soft masking low scoring sequence  -  Default=20
[-h          | --hardmask]           #  Hard mask instead of soft mask.
[-I <file!>  | --stream_in=<file!>]  #  Read input stream from file                        -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output stream to file                        -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTQ entry in the file test.fq:

@HWI-EAS157_20FFGAAXX:2:1:888:434
TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT
+HWI-EAS157_20FFGAAXX:2:1:888:434
@ABCDEFGHIJKLMNOPQRSTUVWhgfedcba`_^]\[ZYX

We can read in this sequence using read_fastq and then soft mask it with mask_seq like this:

read_fastq -i test.fq | mask_seq

SCORES: @ABCDEFGHIJKLMNOPQRSTUVWhgfedcba`_^]\[ZYX
SEQ: ttggtcgctcgctccgcgacCTCAGATCAGACGTGGGCGAT
SEQ_LEN: 41
SEQ_NAME: HWI-EAS157_20FFGAAXX:2:1:888:434
---

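Using the -c switch we can change the cutoff. With a cutoff of 25, the first 24 residues (scores 0 to 23) and the last residue (score 24, encoded as 'X') are masked: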

read_fastq -i test.fq | mask_seq -c 25

SCORES: @ABCDEFGHIJKLMNOPQRSTUVWhgfedcba`_^]\[ZYX
SEQ: ttggtcgctcgctccgcgacctcaGATCAGACGTGGGCGAt
SEQ_LEN: 41
SEQ_NAME: HWI-EAS157_20FFGAAXX:2:1:888:434
---

Using the -h switch for hard masking:

read_fastq -i test.fq | mask_seq -h

SEQ_NAME: HWI-EAS157_20FFGAAXX:2:1:888:434
SEQ: NNNNNNNNNNNNNNNNNNNNCTCAGATCAGACGTGGGCGAT
SEQ_LEN: 41
SCORES: @ABCDEFGHIJKLMNOPQRSTUVWhgfedcba`_^]\[ZYX
---
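The three examples above can be reproduced with a short Python sketch of the masking rule. This is an illustrative re-implementation under stated assumptions, not the Biopieces source: the function `mask_seq` and its parameters `cutoff`, `hardmask`, and `offset` are hypothetical names chosen to mirror the command-line options, and the offset of 64 is inferred from the '@' to 'h' encoding.

```python
def mask_seq(seq, scores, cutoff=20, hardmask=False, offset=64):
    """Mask residues whose decoded quality score is below the cutoff.

    Hypothetical sketch: soft masking lowercases the residue,
    hard masking replaces it with 'N'.
    """
    if len(seq) != len(scores):
        raise ValueError("SEQ and SCORES must have the same length")
    masked = []
    for residue, char in zip(seq, scores):
        if ord(char) - offset < cutoff:
            # Low-quality position: mask it.
            masked.append("N" if hardmask else residue.lower())
        else:
            # Score is at or above the cutoff: keep the residue as-is.
            masked.append(residue)
    return "".join(masked)


seq = "TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT"
# Raw string so the backslash in the score line is taken literally.
scores = r"@ABCDEFGHIJKLMNOPQRSTUVWhgfedcba`_^]\[ZYX"

print(mask_seq(seq, scores))                 # soft mask, cutoff 20
print(mask_seq(seq, scores, cutoff=25))      # soft mask, cutoff 25
print(mask_seq(seq, scores, hardmask=True))  # hard mask, cutoff 20
```

The three printed lines should match the SEQ values shown in the examples above, which suggests that a residue is masked only when its score is strictly below the cutoff.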

See also

read_fastq

scores_to_dec

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

August 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

mask_seq is part of the Biopieces framework.

http://www.biopieces.org
