Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: write_fastq_files

Description

write_fastq_files writes sequences from the data stream to multiple FASTQ files, one file per distinct value of a specified key. All FASTQ type records containing the specified key are written to the file named after that record's value of the key.

write_fastq_files supports gzip/bzip2 output.

For more about the FASTQ format:

http://en.wikipedia.org/wiki/FASTQ_format

Usage

... | write_fastq_files [options]

Options

[-?          | --help]               #  Print full usage description.
[-k <string> | --key=<string>]       #  Key for separating records and naming files.
[-d <dir!>   | --directory=<dir!>]   #  Target directory.
[-p <string> | --prefix=<string>]    #  Optional prefix for file names.
[-e <string> | --encoding=<string>]  #  Encoding <base_33|base_64>   -  Default=base_33
[-x          | --no_stream]          #  Do not emit records.
[-Z <string> | --compress=<string>]  #  Compress output using <gzip|bzip2>.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.
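
The base_33 and base_64 values of --encoding refer to the ASCII offset used for quality scores: a Phred score q is stored as the character with code q + 33 or q + 64, respectively. The Python sketch below is not part of Biopieces; convert_scores is a hypothetical helper that only illustrates how a score string in one encoding maps to the other, assuming plain Phred scores in both.

```python
def convert_scores(scores: str, src_offset: int = 64, dst_offset: int = 33) -> str:
    """Re-encode an ASCII quality string from one Phred offset to another."""
    converted = []
    for char in scores:
        quality = ord(char) - src_offset  # numeric quality score
        if quality < 0:
            raise ValueError(f"character {char!r} is below offset {src_offset}")
        converted.append(chr(quality + dst_offset))
    return "".join(converted)


if __name__ == "__main__":
    # First ten score characters of the first record in test.fastq,
    # assumed here to be base_64 encoded.
    print(convert_scores("ffeaffd`ce"))  # prints GGFBGGEADF
```

Passing src_offset=33 and dst_offset=64 performs the reverse conversion.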

Examples

Consider the following FASTQ entries in the file test.fastq.

@ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
TTCGGCATCGGCGGCGACGTTGGCGGCGGGGCCGGGCGGGTCGANNNCAT
+ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
ffeaffd`ce`eecccKLT`bT^]bYHV^BBBBBBBBBBBBBBBBBBBBB
@ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
CATGGCGTATGCCAGACGGCCAGAACGATGGCCGCCGGGCTTCANNNAAG
+ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
eeeecac^dddddddeffe`f`fdece\aefeeffcccc\`a``BBBBBB
@ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
CGGTACTGATCGAGTGTCAGGCTGTTGATCGCCGCGGGCGGGGGTNNGAC
+ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
db`dadddddeeeeedeeeeccdddfffffcdaddbac`d_BBBBBBBBB
@ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
CTGATGCATGAAGATAGTCGGATGCACAATATACACGGCTAACGCNNAGG
+ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`

We can separate these sequences into different files based on the trailing /1 and /2 in the sequence name. First read_fastq reads in the data, then split_vals creates a key holding the value of the trailing number, and finally write_fastq_files writes files named after the values of that key:

read_fastq -i test.fastq | split_vals -k SEQ_NAME -d '/' | write_fastq_files -d Test_dir -k SEQ_NAME_1

Because -x is not given, the records are also emitted to the data stream:

SCORES: ffeaffd`ce`eecccKLT`bT^]bYHV^BBBBBBBBBBBBBBBBBBBBB
SEQ: TTCGGCATCGGCGGCGACGTTGGCGGCGGGGCCGGGCGGGTCGANNNCAT
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
SEQ_NAME_1: 1
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC
---
SCORES: eeeecac^dddddddeffe`f`fdece\aefeeffcccc\`a``BBBBBB
SEQ: CATGGCGTATGCCAGACGGCCAGAACGATGGCCGCCGGGCTTCANNNAAG
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
SEQ_NAME_1: 1
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC
---
SCORES: db`dadddddeeeeedeeeeccdddfffffcdaddbac`d_BBBBBBBBB
SEQ: CGGTACTGATCGAGTGTCAGGCTGTTGATCGCCGCGGGCGGGGGTNNGAC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
SEQ_NAME_1: 2
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CTGATGCATGAAGATAGTCGGATGCACAATATACACGGCTAACGCNNAGG
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
SEQ_NAME_1: 2
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC
---

The resulting directory tree looks like this:

Test_dir/
|-- 1.fastq
`-- 2.fastq

Note that the target directory (here Test_dir) must already exist. One can use . to denote the current directory, but that is probably not a good idea, since one file is created for every distinct value of the key.
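
For readers who want to see the grouping logic spelled out, the pipeline above can be approximated in plain Python. This is only a sketch under stated assumptions, not the Biopieces implementation: parse_fastq, split_by_key and write_groups are hypothetical names, entries are assumed to span exactly four lines, and the "+" line is written without the repeated name.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def parse_fastq(lines):
    """Yield (name, seq, scores) tuples from four-line FASTQ entries."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, scores = lines[i:i + 4]
        yield header[1:], seq, scores  # drop the leading '@'


def split_by_key(records, delimiter="/"):
    """Group records by the second field of the name (like SEQ_NAME_1)."""
    groups = defaultdict(list)
    for name, seq, scores in records:
        parts = name.split(delimiter)
        if len(parts) < 2:
            continue  # records without the key are not written
        groups[parts[1]].append((name, seq, scores))
    return groups


def write_groups(groups, directory):
    """Write one <key>.fastq file per group into an existing directory."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"target directory {directory} must exist")
    for key, records in groups.items():
        with open(directory / f"{key}.fastq", "w") as handle:
            for name, seq, scores in records:
                handle.write(f"@{name}\n{seq}\n+\n{scores}\n")


if __name__ == "__main__":
    sample = "@read_a/1\nACGT\n+\nffff\n@read_b/2\nTTGA\n+\neeee\n"
    with tempfile.TemporaryDirectory() as tmp:
        write_groups(split_by_key(parse_fastq(sample.splitlines())), tmp)
        print(sorted(path.name for path in Path(tmp).iterdir()))
```

Like write_fastq_files, the sketch refuses to write when the target directory is missing rather than creating it.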

In the next example we add an optional prefix to the file names using the -p option, compress the output with bzip2, and suppress the record stream with -x:

read_fastq -i test.fastq | split_vals -k SEQ_NAME -d '/' | write_fastq_files -d Test_dir -k SEQ_NAME_1 -p Pair -Z bzip2 -x

And the output directory tree:

Test_dir/
|-- Pair_1.fastq.bz2
`-- Pair_2.fastq.bz2
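
The file names in the two trees above follow a simple pattern: an optional prefix joined to the key value with an underscore, then .fastq, then a compression suffix. The Python sketch below mirrors that pattern and writes a bzip2-compressed file with the standard bz2 module. The helper names are hypothetical, and the .gz suffix for gzip is an assumption, since only the bzip2 case appears in the example.

```python
import bz2
import tempfile
from pathlib import Path


def output_path(directory, key, prefix=None, compress=None):
    """Build the output file name for one key value."""
    # ".gz" for gzip is assumed; ".bz2" matches the tree shown above.
    suffix = {None: "", "gzip": ".gz", "bzip2": ".bz2"}[compress]
    stem = f"{prefix}_{key}" if prefix else str(key)
    return Path(directory) / f"{stem}.fastq{suffix}"


def write_bz2_fastq(path, records):
    """Write (name, seq, scores) tuples as bzip2-compressed FASTQ text."""
    with bz2.open(path, "wt") as handle:
        for name, seq, scores in records:
            handle.write(f"@{name}\n{seq}\n+\n{scores}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = output_path(tmp, "1", prefix="Pair", compress="bzip2")
        write_bz2_fastq(path, [("read_a/1", "ACGT", "ffff")])
        print(path.name)  # prints Pair_1.fastq.bz2
```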

See also

read_fastq

split_vals

write_fastq

write_fastq_files

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

October 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

write_fastq_files is part of the Biopieces framework.

http://www.biopieces.org
