
This is a small library and a collection of clients to perform various operations on FASTQ files (such as merging partial overlaps, or quality filtering). For now it only works with:

  • Illumina output files generated by CASAVA version 1.8.0 or higher (it will support earlier versions of CASAVA soon),
  • Paired-end runs (it will support single runs soon).

At this point the documentation is very limited; please don't hesitate to send me an e-mail if you have any questions (meren [at] mbl.edu).

Config File Format

The same config file format is used by all clients under the scripts directory that require a config file as input. The following is a config file template (there is also a sample in the codebase):

[general]
project_name = project name
researcher_email = [email protected]
input_directory = test_input
output_directory = test_output


[files]
pair_1 = pair_1_aaa, pair_1_aab, pair_1_aac, pair_1_aad, pair_1_aae, pair_1_aaf 
pair_2 = pair_2_aaa, pair_2_aab, pair_2_aac, pair_2_aad, pair_2_aae, pair_2_aaf

[prefixes]
pair_1_prefix = ^....TACGCCCAGCAGC[C,T]GCGGTAA.
pair_2_prefix = ^CCGTC[A,T]ATT[C,T].TTT[G,A]A.T

Two critical things in the [general] section are input_directory and output_directory:

  • input_directory: Full path to the directory where FASTQ files reside.
  • output_directory: Full path to the directory where the output of the operation you run on this config will be stored. Since Illumina runs involve huge files, the codebase is quite conservative to protect users from simple mistakes that may result in big losses: if you don't create the output_directory yourself, you will get an error (it will not be generated automatically), and if the output_directory already contains a file with the same name as one of the outputs, you will get an error (it will not be overwritten). project_name is used as a prefix in the naming convention for output files, so it is wise to choose something descriptive and UNIX-compatible.

The files section is where you list the files to be found in the input_directory. File names must be comma separated, and the pair_1 and pair_2 lists should be ordered to match.

The prefixes section is optional. More explanation about prefixes will be added here soon.
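
For orientation, the template above is plain INI syntax, so it can be read with Python's ConfigParser module. The following is only a sketch of the format and of the sanity checks described above (the config file name is made up, and this is not the library's own loader); the prefix values are compiled as regular expressions here simply because they are written as patterns anchored with ^:

import os
import re
import ConfigParser

config = ConfigParser.ConfigParser()
config.read('my_config.ini')  # hypothetical config file name

input_directory = config.get('general', 'input_directory')
output_directory = config.get('general', 'output_directory')
project_name = config.get('general', 'project_name')

# output_directory must exist already; nothing in it will be overwritten:
if not os.path.exists(output_directory):
    raise RuntimeError('output_directory does not exist; please create it first')

# file names are comma separated, and pair_1 / pair_2 must match in order:
pair_1_files = [f.strip() for f in config.get('files', 'pair_1').split(',')]
pair_2_files = [f.strip() for f in config.get('files', 'pair_2').split(',')]
assert len(pair_1_files) == len(pair_2_files)

# the optional prefixes are patterns anchored to the start of each read:
if config.has_section('prefixes'):
    pair_1_prefix = re.compile(config.get('prefixes', 'pair_1_prefix'))
    pair_2_prefix = re.compile(config.get('prefixes', 'pair_2_prefix'))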

Quality Filtering

The actual purpose of this library is to perform very stringent quality control on Illumina results. However, at this point I will mention only two published quality filtering approaches. Both methods are implemented to operate on paired-end reads.

Minoche et al.

The quality filtering approach suggested by Minoche et al. is implemented in the analyze-illumina-quality-minoche script. The output of the script includes these files:

  • project_name-STATS.txt (a file that contains all the numbers about the quality filtering process; an example output can be seen below)
  • project_name-QUALITY_PASSED_R1.fa (pair 1's that passed quality filtering)
  • project_name-QUALITY_PASSED_R2.fa (matching pair 2's)
  • project_name-READ_IDs.cPickle.z (gzipped cPickle object for Python that keeps the fate of read IDs; this file may be required by other scripts in the library for purposes such as visualization, or extracting a particular group of reads from the original FASTQ files)
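
Since project_name-READ_IDs.cPickle.z is a gzipped cPickle object, it can be loaded back into Python roughly as sketched below; the file name follows the naming convention above, but the internal structure of the object is not documented here, so the sketch stops at loading it:

import gzip
import cPickle

# replace 'project_name' with the project_name set in the [general] section
read_ids = cPickle.load(gzip.open('project_name-READ_IDs.cPickle.z'))

# the object keeps the fate of every read ID; inspect it interactively
# before relying on any particular layout:
print(type(read_ids))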

If the program is run with the --visualize-quality-curves option, these files will also be generated in the output directory:

  • project_name-PASSED.png (visualization of mean quality scores per tile for pairs that passed the quality filtering)
  • project_name-FAILED_REASON_C33.png (visualization of mean quality scores per tile for pairs that failed quality filtering due to C33 filtering (C33: less than 2/3 of bases were Q30 or higher in the first half of the read following the B-tail trimming))
  • project_name-FAILED_REASON_N.png (same as above, but for pairs that contained an ambiguous base after B-tail trimming)
  • project_name-FAILED_REASON_P.png (same as above, but for pairs that were too short after B-tail trimming)
  • project_name-Q_DICT.cPickle.z (gzipped cPickle object for Python that holds mean quality scores for each group of reads)
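
project_name-Q_DICT.cPickle.z can be loaded the same way. If you want to draw your own curves instead of relying on the PNG files, something along these lines should work, assuming (and this is an assumption, not something documented above) that the object maps each group of reads to its per-position mean quality scores:

import gzip
import cPickle
import matplotlib.pyplot as plt

q_dict = cPickle.load(gzip.open('project_name-Q_DICT.cPickle.z'))

# ASSUMPTION: q_dict maps a group name (e.g. 'PASSED') to a list of
# per-position mean quality scores; inspect the object and adjust if not.
for group, mean_qualities in q_dict.items():
    plt.plot(mean_qualities, label=group)

plt.xlabel('position in read')
plt.ylabel('mean quality score')
plt.legend()
plt.show()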

Example project_name-STATS.txt output:

$ cat 9022_B9-STATS.txt
number of pairs analyzed      : 122929
total pairs passed            : 109041 (%88.70 of all pairs)
  total pair_1 trimmed        : 6476 (%5.94 of all passed pairs)
  total pair_2 trimmed        : 9059 (%8.31 of all passed pairs)
total pairs failed            : 13888 (%11.30 of all pairs)
  pairs failed due to pair_1  : 815 (%5.87 of all failed pairs)
  pairs failed due to pair_2  : 12193 (%87.80 of all failed pairs)
  pairs failed due to both    : 880 (%6.34 of all failed pairs)
  FAILED_REASON_P             : 12223 (%88.01 of all failed pairs)
  FAILED_REASON_N             : 38 (%0.27 of all failed pairs)
  FAILED_REASON_C33           : 1627 (%11.72 of all failed pairs)
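
Note that the percentages above use different denominators: lines marked "of all pairs" are relative to the number of pairs analyzed, while the trimming and FAILED_REASON breakdowns are relative to the passed and failed totals, respectively. A couple of the numbers reproduced for clarity:

pairs_analyzed = 122929
pairs_passed = 109041
pairs_failed = 13888

# 'of all pairs' percentages are relative to the number of pairs analyzed:
print('%.2f' % (100.0 * pairs_passed / pairs_analyzed))  # 88.70
print('%.2f' % (100.0 * pairs_failed / pairs_analyzed))  # 11.30

# the FAILED_REASON breakdown is relative to the number of failed pairs:
print('%.2f' % (100.0 * 12223 / pairs_failed))           # 88.01 (FAILED_REASON_P)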

Example PNG files:

[Example output image: mean quality score curves]

Bokulich et al.

The quality filtering approach suggested by Bokulich et al. is implemented in the analyze-illumina-quality-bokulich script. The output of the script includes these files:

  • project_name-STATS.txt
  • project_name-QUALITY_PASSED_R1.fa
  • project_name-QUALITY_PASSED_R2.fa
  • project_name-READ_IDs.cPickle.z

If the program is run with the --visualize-quality-curves option, these files will also be generated in the output directory:

  • project_name-PASSED.png
  • project_name-FAILED_REASON_P.png (visualization of mean quality scores per tile for pairs that failed quality filtering for being too short after quality trimming)
  • project_name-FAILED_REASON_N.png (same as above, but for pairs that had more ambiguous bases than n after quality trimming)
  • project_name-Q_DICT.cPickle.z

Example project_name-STATS.txt output:

number of pairs analyzed      : 122929
total pairs passed            : 111598 (%90.78 of all pairs)
  total pair_1 trimmed        : 1994 (%1.79 of all passed pairs)
  total pair_2 trimmed        : 9227 (%8.27 of all passed pairs)
total pairs failed            : 11331 (%9.22 of all pairs)
  pairs failed due to pair_1  : 738 (%6.51 of all failed pairs)
  pairs failed due to pair_2  : 10159 (%89.66 of all failed pairs)
  pairs failed due to both    : 434 (%3.83 of all failed pairs)
  FAILED_REASON_P             : 11299 (%99.72 of all failed pairs)
  FAILED_REASON_N             : 32 (%0.28 of all failed pairs)

Example PNG files:

[Example output image: mean quality score curves]

(to be continued)