sequence command

Nacho edited this page Jun 30, 2015 · 1 revision

The 'sequence' command processes FastQ sequence files either in a local scenario or on a Hadoop cluster.

Assuming you are in the hpg-bigdata folder, type the following command to see the available sequence sub-commands for the Hadoop scenario:

$ build/bin/ sequence

Usage: sequence <subcommand> [options]

     convert  Converts FastQ files to different big data formats such as Avro
     stats    Calculates different stats from sequencing data

For a local scenario, use the corresponding script:

$ build/bin/ sequence

Usage: sequence <subcommand> [options]

     convert  Converts FastQ files to different big data formats such as Avro

Sub-command: convert

Converts FastQ files to big data formats such as Avro, following the GA4GH schema models.

Hadoop scenario:

$ build/bin/ sequence convert -h

Usage: sequence convert [options]

      -x, --compression    STRING     Accepted values: snappy, deflate, bzip2, xz, null. Default: snappy [snappy]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
      -h, --help                      This parameter prints this help [false]
          --conf           STRING     Set the configuration file [null]
      -v, --verbose        BOOLEAN    This parameter set the level of the logging [false]
    * -i, --input          STRING     HDFS input file in FastQ format [null]
    * -o, --output         STRING     HDFS output file to store the FastQ sequences according to the GA4GH/Avro model [null]


$ hadoop fs -mkdir /test
$ hadoop fs -copyFromLocal build/data/test.fq /test
$ hadoop fs -ls /test
Found 1 items
-rw-r--r--   1 jtarraga supergroup      29290 2015-06-30 15:52 /test/test.fq
$ hadoop fs -mkdir /out
$ build/bin/ sequence convert -i /test/test.fq -o /out/test.fq.avro
$ hadoop fs -ls /out/test.fq.avro
Found 2 items
-rw-r--r--   1 jtarraga supergroup          0 2015-06-30 15:54 /out/test.fq.avro/_SUCCESS
-rw-r--r--   1 jtarraga supergroup       9912 2015-06-30 15:54 /out/test.fq.avro/part-r-00000.avro
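
The listing above shows a `_SUCCESS` marker next to the Avro part file, which is how Hadoop jobs usually signal that they finished. A minimal sketch of checking for that marker from a script is given below. It relies only on the standard `hadoop fs -test -e` check; the `HADOOP_CMD` variable and the `job_succeeded` helper are hypothetical names introduced for illustration and are not part of hpg-bigdata.

```shell
#!/bin/sh
# Sketch: confirm that a Hadoop job finished by looking for its _SUCCESS marker.
# HADOOP_CMD is a hypothetical override for the hadoop binary; it defaults to
# the "hadoop" command found on the PATH.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# job_succeeded DIR
# Returns 0 if DIR/_SUCCESS exists in HDFS, non-zero otherwise.
job_succeeded() {
    "$HADOOP_CMD" fs -test -e "$1/_SUCCESS"
}

# Example use, with the output path from the conversion above:
#
#   if job_succeeded /out/test.fq.avro; then
#       "$HADOOP_CMD" fs -ls /out/test.fq.avro
#   else
#       echo "no _SUCCESS marker found in /out/test.fq.avro" >&2
#   fi
```

Checking the marker rather than the part file avoids picking up partial output from a job that failed midway.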

Local scenario:

$ build/bin/ sequence convert -h

Usage: sequence convert [options]

      --conf           STRING     Set the configuration file [null]
  -x, --compression    STRING     Accepted values: snappy, deflate, bzip2, xz, null. Default: snappy [snappy]
  -v, --verbose        BOOLEAN    This parameter set the level of the logging [false]
  -h, --help                      This parameter prints this help [false]
* -i, --input          STRING     Local input file in FastQ format [null]
  -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
* -o, --output         STRING     Local output file to store the FastQ sequences according to the GA4GH/Avro model [null]


$ mkdir /tmp/out
$ build/bin/ sequence convert -i build/data/test.fq -o /tmp/out/test.fq.avro
$ ls -ltr /tmp/out/test.fq.avro 
-rw-rw-r-- 1 jtarraga jtarraga 9924 jun 30 16:00 /tmp/out/test.fq.avro

Sub-command: stats

Calculates different stats from sequencing data.

Hadoop scenario:

$ build/bin/ sequence stats -h

Usage: sequence stats [options]

    * -o, --output         STRING     Local output directory to save stats results in JSON format  [null]
    * -i, --input          STRING     HDFS input file containing the FastQ sequences stored according to the GA4GH/Avro model) [null]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
      -h, --help                      This parameter prints this help [false]
          --conf           STRING     Set the configuration file [null]
      -k, --kmers          INTEGER    Compute k-mers (according to the indicated length) [0]
      -v, --verbose        BOOLEAN    This parameter set the level of the logging [false]


$ mkdir /tmp/out-fastq-stats
$ build/bin/ sequence stats -i /out/test.fq.avro/part-r-00000.avro -o /tmp/out-fastq-stats/ --kmers 7
$ ls -ltr /tmp/out-fastq-stats/
total 8
-rw-r--r-- 1 jtarraga jtarraga 5813 jun 30 16:07 stats.json
$ cat /tmp/out-fastq-stats/stats.json 
{"num_reads": 100, "num_A": 3662, "num_T": 3756, "num_G": 2567, "num_C": ...
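
The counters in stats.json can be post-processed with ordinary shell tools. The sketch below pulls single integer fields out of the file and derives the GC percentage from the four base counts. It assumes the flat `"key": number` layout visible in the excerpt above and uses plain text matching rather than a real JSON parser; `stat_field` and `gc_percent` are hypothetical helper names, and bases other than A, T, G and C are ignored.

```shell
#!/bin/sh
# stat_field FILE KEY
# Print the first integer stored under "KEY" in FILE.
# Text match only: assumes the '"key": number' layout of stats.json.
stat_field() {
    grep -o "\"$2\": *[0-9][0-9]*" "$1" | head -n 1 | sed 's/.*: *//'
}

# gc_percent FILE
# Print 100 * (G + C) / (A + T + G + C) with two decimals.
# Returns 1 if the counts are missing or sum to zero.
gc_percent() {
    a=$(stat_field "$1" num_A)
    t=$(stat_field "$1" num_T)
    g=$(stat_field "$1" num_G)
    c=$(stat_field "$1" num_C)
    awk -v a="${a:-0}" -v t="${t:-0}" -v g="${g:-0}" -v c="${c:-0}" 'BEGIN {
        total = a + t + g + c
        if (total == 0) exit 1
        printf "%.2f\n", 100 * (g + c) / total
    }'
}

# Example use, with the output file from the run above:
#
#   stat_field /tmp/out-fastq-stats/stats.json num_reads
#   gc_percent /tmp/out-fastq-stats/stats.json
```

For anything beyond a few top-level counters, a proper JSON tool is a safer choice than text matching.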