Martin Asser Hansen edited this page Oct 2, 2015 · 9 revisions

Biopiece: read_fastq

Description

read_fastq reads sequence entries from FASTQ files. Each sequence entry consists of 4 lines:

  1. sequence name after @
  2. sequence
  3. quality score name after + (optional)
  4. quality scores in ASCII
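
The four-line layout above can be illustrated with a minimal Python sketch. It is not read_fastq's own code: the function name `parse_fastq` and the error handling are assumptions made for illustration, and the record keys simply mirror the ones read_fastq emits.

```python
from typing import Dict, Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per 4-line FASTQ entry (hypothetical helper)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between entries
        seq = next(it, None)     # line 2: sequence
        plus = next(it, None)    # line 3: '+' with optional name
        scores = next(it, None)  # line 4: quality scores in ASCII
        if seq is None or plus is None or scores is None:
            raise ValueError("truncated FASTQ entry")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ entry near {header!r}")
        if len(seq) != len(scores):
            raise ValueError("sequence and quality lengths differ")
        yield {
            "SEQ_NAME": header[1:],  # name without the leading '@'
            "SEQ": seq,
            "SEQ_LEN": len(seq),
            "SCORES": scores,
        }
```

Any iterable of lines works as input, for example an open file handle.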

Paired-end sequence data can be read from separate files using the -j switch, so that the sequences become interleaved in the stream.
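
Interleaving here means that each entry from the first file is followed by its mate from the second file. The Python sketch below shows the idea; the name `interleave` is hypothetical, and raising an error when the two inputs differ in length is an assumption, since this page does not say how read_fastq handles that case.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def interleave(first: Iterable[T], second: Iterable[T]) -> Iterator[T]:
    """Alternate entries from two inputs: a1, a2, b1, b2, ..."""
    sentinel = object()
    it1, it2 = iter(first), iter(second)
    while True:
        a = next(it1, sentinel)
        b = next(it2, sentinel)
        if a is sentinel and b is sentinel:
            return  # both inputs exhausted together
        if a is sentinel or b is sentinel:
            # Assumption: unequal inputs are treated as an error.
            raise ValueError("paired inputs hold different numbers of entries")
        yield a
        yield b
```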

Quality scores range from -5 to 41 and are encoded as ASCII characters 33 to 74 (! .. J) for Phred/Sanger or 59 to 104 (; .. h) for Solexa/Illumina (<1.8).
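
Decoding a score character amounts to subtracting the encoding base from its ASCII value. A minimal Python sketch, assuming the base (33 or 64) is already known; `decode_scores` is a hypothetical name and not part of Biopieces:

```python
from typing import List


def decode_scores(scores: str, base: int) -> List[int]:
    """Convert ASCII-encoded quality scores to decimal values."""
    if base not in (33, 64):
        raise ValueError("base must be 33 or 64")
    return [ord(char) - base for char in scores]
```

With base 33 the characters ! and J decode to 0 and 41; with base 64 the characters ; and h decode to -5 and 40.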

If no encoding is supplied, read_fastq analyzes the first sequence entry, tries to determine automatically which encoding was used, and validates that this encoding fits the following 1000 entries.

  • sanger - base 33
  • solexa - base 64
  • illumina1.3 - base 64
  • illumina1.5 - base 64
  • illumina1.8 - base 33
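
One way such auto-detection could work is sketched below in Python. This is an assumption-laden illustration, not read_fastq's actual algorithm: the page only states that the first entry is analyzed and the following 1000 entries are validated. The decision rule (characters below ASCII 59 imply base 33, characters above ASCII 74 imply base 64) and all function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Optional


def guess_base(scores: str) -> Optional[int]:
    """Guess the base from one score string, or None if ambiguous."""
    if not scores:
        return None
    lowest = min(map(ord, scores))
    highest = max(map(ord, scores))
    if lowest < 59:
        return 33  # below ';' is impossible in base 64
    if highest > 74:
        return 64  # above 'J' is impossible in base 33
    return None  # characters 59..74 fit both encodings


def fits(scores: str, base: int) -> bool:
    """Check that every character lies in the range valid for base."""
    low, high = (33, 74) if base == 33 else (59, 104)
    return all(low <= ord(char) <= high for char in scores)


def detect_base(score_lines: Iterable[str], limit: int = 1000) -> int:
    """Guess from the first entry, then validate up to `limit` more."""
    it = iter(score_lines)
    first = next(it, None)
    if first is None:
        raise ValueError("no entries to analyze")
    base = guess_base(first)
    if base is None:
        raise ValueError("encoding is ambiguous; supply it explicitly")
    for scores in islice(it, limit):
        if not fits(scores, base):
            raise ValueError(f"scores {scores!r} do not fit base {base}")
    return base
```

When the guess is ambiguous, the sketch gives up rather than picking a base; forcing the encoding with the -e switch covers that situation.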

The resulting records look like this:

SEQ_NAME: test
SEQ: ccccccccccccccccccccccccccccccccccccccccc
SEQ_LEN: 41
SCORES: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
---
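
The record layout shown above (one "KEY: value" line per field, closed by a --- line) is simple to produce. A minimal Python sketch, where `format_record` is a hypothetical helper and not part of Biopieces:

```python
from typing import Dict


def format_record(record: Dict[str, object]) -> str:
    """Render a record as 'KEY: value' lines terminated by '---'."""
    lines = [f"{key}: {value}" for key, value in record.items()]
    lines.append("---")  # record delimiter
    return "\n".join(lines) + "\n"
```

For example, `format_record({"SEQ_NAME": "test", "SEQ": "cc", "SEQ_LEN": 2, "SCORES": "!!"})` yields the four field lines in insertion order followed by the delimiter.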

Input files may be compressed with gzip or bzip2.
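
How read_fastq recognizes compressed input is not described on this page. A common approach, sketched here in Python as an assumption, is to inspect the leading magic bytes of the file and pick the matching decompressor; `open_maybe_compressed` is a hypothetical name.

```python
import bz2
import gzip
from typing import TextIO


def open_maybe_compressed(path: str) -> TextIO:
    """Open a plain, gzip, or bzip2 file for reading as text."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip magic bytes
        return gzip.open(path, "rt")
    if magic == b"BZh":  # bzip2 magic bytes
        return bz2.open(path, "rt")
    return open(path, "rt")
```

Checking the content rather than the file extension means a compressed file is still read correctly when it lacks a .gz or .bz2 suffix.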

For more about the FASTQ format:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

Usage

read_fastq [options] -i <FASTQ file(s)>

Options

[-?          | --help]               #  Print full usage description.
[-i <files!> | --data_in=<files!>]   #  Comma separated list of files or glob expression to read.
[-j <files!> | --data_in2=<files!>]  #  Similar to -i but for paired-end data.
[-n <uint>   | --num=<uint>]         #  Limit number of records to read.
[-e <string> | --encoding=<string>]  #  Encoding <auto|base_33|base_64>  -  Default=auto
[-I <file!>  | --stream_in=<file!>]  #  Read input stream from file      -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output stream to file      -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

To read all FASTQ entries in the file test.fq do:

read_fastq -i test.fq

To read a limited number of entries use the -n switch:

read_fastq -i test.fq -n 10

To enforce the encoding use the -e switch:

read_fastq -i test.fq -e base_64

To read in paired-end sequence data:

read_fastq -i exp_A_1.fq,exp_B_1.fq,exp_C_1.fq -j exp_A_2.fq,exp_B_2.fq,exp_C_2.fq

See also

mask_seq

scores_to_dec

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

October 2010

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

read_fastq is part of the Biopieces framework.

http://www.biopieces.org
