write_fastq_files
write_fastq_files writes sequences from the data stream to multiple FASTQ files, split on a specified key. All FASTQ-type records containing the specified key are written to files named after the value of that key.
write_fastq_files supports gzip/bzip2 output.
For more about the FASTQ format:
http://en.wikipedia.org/wiki/FASTQ_format
... | write_fastq_files [options]
[-? | --help] # Print full usage description.
[-k <string> | --key=<string>] # Key for separating records and naming files.
[-d <dir!> | --directory=<dir!>] # Target directory.
[-p <string> | --prefix=<string>] # Optional prefix for file names.
[-e <string> | --encoding=<string>] # Encoding <base_33|base_64> - Default=base_33
[-x | --no_stream] # Do not emit records.
[-Z <string> | --compress=<string>] # Compress output using <gzip|bzip2>.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTQ entries in the file test.fastq:
@ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
TTCGGCATCGGCGGCGACGTTGGCGGCGGGGCCGGGCGGGTCGANNNCAT
+ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
ffeaffd`ce`eecccKLT`bT^]bYHV^BBBBBBBBBBBBBBBBBBBBB
@ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
CATGGCGTATGCCAGACGGCCAGAACGATGGCCGCCGGGCTTCANNNAAG
+ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
eeeecac^dddddddeffe`f`fdece\aefeeffcccc\`a``BBBBBB
@ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
CGGTACTGATCGAGTGTCAGGCTGTTGATCGCCGCGGGCGGGGGTNNGAC
+ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
db`dadddddeeeeedeeeeccdddfffffcdaddbac`d_BBBBBBBBB
@ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
CTGATGCATGAAGATAGTCGGATGCACAATATACACGGCTAACGCNNAGG
+ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
We can separate these sequences into different files based on the trailing /1 and /2 in the sequence name. First read_fastq reads in the data, then split_vals produces a key holding the value of the trailing number, and finally write_fastq_files writes files based on that key:
read_fastq -i test.fastq | split_vals -k SEQ_NAME -d '/' | write_fastq_files -d Test_dir -k SEQ_NAME_1
SCORES: ffeaffd`ce`eecccKLT`bT^]bYHV^BBBBBBBBBBBBBBBBBBBBB
SEQ: TTCGGCATCGGCGGCGACGTTGGCGGCGGGGCCGGGCGGGTCGANNNCAT
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1
SEQ_NAME_1: 1
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC
---
SCORES: eeeecac^dddddddeffe`f`fdece\aefeeffcccc\`a``BBBBBB
SEQ: CATGGCGTATGCCAGACGGCCAGAACGATGGCCGCCGGGCTTCANNNAAG
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1
SEQ_NAME_1: 1
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC
---
SCORES: db`dadddddeeeeedeeeeccdddfffffcdaddbac`d_BBBBBBBBB
SEQ: CGGTACTGATCGAGTGTCAGGCTGTTGATCGCCGCGGGCGGGGGTNNGAC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/2
SEQ_NAME_1: 2
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CTGATGCATGAAGATAGTCGGATGCACAATATACACGGCTAACGCNNAGG
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/2
SEQ_NAME_1: 2
SEQ_NAME_0: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC
---
And the resulting directory tree will look like this:
Test_dir/
|-- 1.fastq
`-- 2.fastq
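To make the grouping concrete, the following is a rough stand-in written in plain awk. It is not how write_fastq_files is implemented; it is only a sketch that assumes strict four-line FASTQ records and sequence names ending in /1 or /2. The function name split_fastq_by_mate is hypothetical.

```shell
# Hypothetical helper: split a FASTQ file into <outdir>/<key>.fastq,
# where <key> is the text after the last "/" in each sequence name.
# Assumes every record is exactly four lines and that <outdir> exists.
split_fastq_by_mate() {
    in="$1"
    outdir="$2"
    awk -v dir="$outdir" '
        NR % 4 == 1 {                     # header line of each record
            n = split($0, parts, "/")     # split the name on "/"
            key = parts[n]                # trailing field, e.g. "1" or "2"
            out = dir "/" key ".fastq"    # file named after the key value
        }
        { print > out }                   # write all four lines of the record
    ' "$in"
}
```

Calling split_fastq_by_mate test.fastq Test_dir on the example file would be expected to give the same two file names as above, since the text after the last / matches the SEQ_NAME_1 value for these reads.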
Notice that the Test_dir directory must exist. One can use . to denote the current directory, but that is probably not a good idea.
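Since the target directory has to exist beforehand, a small preparatory step can be run first. This is a minimal sketch using the standard mkdir command; the directory name Test_dir is taken from the example above.

```shell
# Create the target directory; -p makes this a no-op if it already exists.
mkdir -p Test_dir

# The pipeline from the example can then be run against it, e.g.:
# read_fastq -i test.fastq | split_vals -k SEQ_NAME -d '/' | write_fastq_files -d Test_dir -k SEQ_NAME_1
```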
Here we add an optional prefix to the file names using the -p option, compress the output with bzip2, and suppress the emitting of records with -x:
read_fastq -i test.fastq | split_vals -k SEQ_NAME -d '/' | write_fastq_files -d Test_dir -k SEQ_NAME_1 -p Pair -Z bzip2 -x
And the output directory tree:
Test_dir/
|-- Pair_1.fastq.bz2
`-- Pair_2.fastq.bz2
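One way to sanity-check the compressed files is to count the records in each. The sketch below relies on the standard bzcat and wc tools rather than on any Biopieces command, and it assumes each record occupies exactly four lines, as in the example input. The function name count_fastq_records is hypothetical.

```shell
# Hypothetical helper: print the number of FASTQ records in each
# bzip2-compressed file given, assuming four lines per record.
count_fastq_records() {
    for f in "$@"; do
        [ -e "$f" ] || continue          # skip patterns that matched no file
        lines=$(bzcat "$f" | wc -l)      # total number of lines in the file
        echo "$f: $((lines / 4)) records"
    done
}

count_fastq_records Test_dir/Pair_*.fastq.bz2
```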
Martin Asser Hansen - Copyright (C) - All rights reserved.
October 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
write_fastq_files is part of the Biopieces framework.