Tim Stevens edited this page Apr 6, 2018 · 15 revisions

NucProcess

Installation

NucProcess does not require any special installation and may be run directly from its download location (e.g. after cloning the GitHub repository), though all the component files must reside in the same directory.

This software uses Python version 2 or 3 and requires that the NumPy package is installed and available to the Python version that runs NucProcess.

NumPy is available in bundled Python distributions like Anaconda or Canopy and in most Linux distributions' package managers, or it can be installed on most UNIX-like systems using pip:

pip install numpy
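To confirm that NumPy is visible to the same Python interpreter that will run NucProcess, a check along the following lines may help. This is a minimal sketch and not part of NucProcess; the helper name numpy_version is made up for illustration.

```python
def numpy_version():
    """Return the installed NumPy version string, or None if it cannot be imported."""
    try:
        import numpy
    except ImportError:
        return None
    return numpy.__version__


if __name__ == "__main__":
    version = numpy_version()
    if version is None:
        print("NumPy is not available to this Python interpreter")
    else:
        print("NumPy version " + version)
```

Run it with the same python executable that is used for nuc_process, since different Python installations on one machine can have different packages.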

Commands

To run NucProcess, issue the nuc_process command with the command line options described below. The options -i (input FASTQ files) and -g (genome reference) are mandatory, though it is usual to also use -re1 (primary restriction enzyme; the default is MboI), -o (root name of output files) and -re2 (secondary restriction enzyme in double-digest experiments). Chromatin contact output is created in the NCC data format.

The split_fastq_barcodes command can be used to split FASTQ files that represent many cells, each with a different barcode sequence, into separate paired read files.

The nuc_contact_map command takes the contact data from NCC format files to make all-chromosome contact map graphics in SVG format. This is automatically run on the main output of NucProcess, but can be run as required on any NCC format file.

The nuc_contact_probability command takes the contact data from one or more NCC format files to create log plots of contact probability versus sequence separation for intra-chromosomal contacts.
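To illustrate what a plot of contact probability versus sequence separation involves, the sketch below bins a plain list of intra-chromosomal separations (in base pairs) into logarithmically spaced bins and normalises the counts to a probability density. It is an assumption-laden illustration, not the method used by nuc_contact_probability: the function name, the bins_per_decade parameter and the normalisation are all hypothetical, and it does not read NCC files.

```python
import math


def log_binned_probability(separations, bins_per_decade=4):
    """Return (lower, upper, probability density) rows for log-spaced separation bins."""
    seps = [s for s in separations if s > 0]
    counts = {}
    for s in seps:
        # Bin index grows by one for each 1/bins_per_decade step in log10(separation)
        idx = int(math.floor(math.log10(s) * bins_per_decade))
        counts[idx] = counts.get(idx, 0) + 1

    total = float(len(seps))
    rows = []
    for idx in sorted(counts):
        lower = 10.0 ** (idx / float(bins_per_decade))
        upper = 10.0 ** ((idx + 1) / float(bins_per_decade))
        # Divide by bin width so that wide bins at large separations are not over-weighted
        density = counts[idx] / (total * (upper - lower))
        rows.append((lower, upper, density))
    return rows


for lower, upper, density in log_binned_probability([1200, 1500, 20000, 2000000], 1):
    print("{:.0f}-{:.0f} bp: {:.3e}".format(lower, upper, density))
```

Plotting the density against the bin position on log-log axes gives the kind of curve described above; empty bins are simply omitted in this sketch.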

Barcoded input

Any barcoded input FASTQ files must first be split into separate samples/cells before running nuc_process. We provide split_fastq_barcodes to achieve this.

However, the somewhat more basic splitFastqBarcodes.py script mentioned in the primary reference is still available and may be run as follows:

python splitFastqBarcodes.py MULTIPLEXED_DATA_r_1.fq MULTIPLEXED_DATA_r_2.fq

This will generate paired FASTQ files of the form:

MULTIPLEXED_DATA_r_1_CGC.fq
MULTIPLEXED_DATA_r_2_CGC.fq
MULTIPLEXED_DATA_r_1_TAA.fq
MULTIPLEXED_DATA_r_2_TAA.fq

where the file names are tagged with the corresponding barcode sequence. These demultiplexed files can then be used as input to NucProcess, specifying only one barcode for each run.
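The naming pattern above can be reproduced with a few lines of Python, which may be handy when scripting one nuc_process run per barcode. This is only a sketch of the file naming shown in the example; the helper names are hypothetical and the code does not perform any demultiplexing itself.

```python
import os


def barcode_tagged_name(fastq_path, barcode):
    """Insert a barcode tag before the file extension of a FASTQ path."""
    root, ext = os.path.splitext(fastq_path)
    return "{}_{}{}".format(root, barcode, ext)


def demultiplexed_pairs(read_1_path, read_2_path, barcodes):
    """Return a (read 1, read 2) pair of tagged file names for each barcode."""
    return [(barcode_tagged_name(read_1_path, bc),
             barcode_tagged_name(read_2_path, bc)) for bc in barcodes]


pairs = demultiplexed_pairs("MULTIPLEXED_DATA_r_1.fq",
                            "MULTIPLEXED_DATA_r_2.fq",
                            ["CGC", "TAA"])
for read_1, read_2 in pairs:
    print("{} {}".format(read_1, read_2))
```

Each resulting pair corresponds to one cell and would be passed to a separate nuc_process run via the -i option.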

Basic operation

Typical first-time use, which creates a genome index and RE-digest files:

nuc_process -f /chromosomes/*.fa -o CELL_NAME -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 12 -g GENOME_DIR/GENOME_BUILD_NAME -i DATA_DIR/SEQUENCING_DATA_r_?.fq

(substituting upper-case values)

Typical use thereafter:

nuc_process -o CELL_NAME -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 8 -g GENOME_DIR/GENOME_BUILD_NAME -i DATA_DIR/SEQUENCING_DATA_r_?.fq

For the above commands:

-f /chromosomes/*.fa states that all FASTA files (ending in .fa) in the /chromosomes/ directory will be used for creation of the genome index and RE cut site files

-o specifies that CELL_NAME will be used for naming the NCC format output. In this case the main output contact file will be CELL_NAME.ncc

-v specifies verbose output of processing progress

-a specifies that ambiguous contact files are generated: CELL_NAME_ambig.ncc in this case

-k specifies that all the intermediate processing files are kept: filtered NCC files, clipped FASTQ files and the main Bowtie2 mapping SAM files

-re1 is the primary restriction enzyme type at the ligation junction (see enzymes.conf)

-re2 is the secondary restriction enzyme used to release the fragments (this option is not used for sonication/Nextera-based protocols)

-s is the valid molecule/fragment size range, as used in the DNA sequencing

-n is the number of parallel CPU cores to use with Bowtie2

-g GENOME_BUILD_NAME is the root name for the Bowtie2 genome index without any file extension and in this case would refer to files GENOME_BUILD_NAME.1.bt2, GENOME_BUILD_NAME.rev.1.bt2 etc.

-i SEQUENCING_DATA_r_?.fq is a wild-card expression matching the two input FASTQ files (though two separate file names, separated by a space, can be specified instead). In this case the expression would match SEQUENCING_DATA_r_1.fq and SEQUENCING_DATA_r_2.fq, the paired sequence read files.
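When many cells are processed, it can be convenient to assemble the command from a script. The following sketch builds a nuc_process argument list using only the options described above; the function build_nuc_process_command and its parameter names are hypothetical and not part of NucProcess, and the command is printed rather than executed.

```python
import glob


def build_nuc_process_command(cell_name, genome_index, fastq_paths,
                              re1="MboI", re2=None, size_range="150-2000",
                              num_cpu=8, fasta_paths=None):
    """Assemble a nuc_process argument list from the options described above."""
    if len(fastq_paths) != 2:
        raise ValueError("Expected exactly two paired FASTQ files")

    cmd = ["nuc_process"]
    if fasta_paths:
        # Only given on first use, when the genome index and RE-digest files are created
        cmd += ["-f"] + list(fasta_paths)
    cmd += ["-o", cell_name, "-v", "-a", "-k", "-re1", re1]
    if re2:
        # Left out for sonication/Nextera-based protocols
        cmd += ["-re2", re2]
    cmd += ["-s", size_range, "-n", str(num_cpu), "-g", genome_index, "-i"]
    cmd += list(fastq_paths)
    return cmd


# The shell normally expands the wild-card; glob does the equivalent from Python
fastq_paths = sorted(glob.glob("DATA_DIR/SEQUENCING_DATA_r_?.fq"))
if len(fastq_paths) == 2:
    cmd = build_nuc_process_command("CELL_NAME", "GENOME_DIR/GENOME_BUILD_NAME",
                                    fastq_paths, re2="AluI")
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call once the upper-case placeholder values have been substituted with real paths.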

To generate contact map graphics from an NCC format file:

nuc_contact_map -i CELL_NAME.ncc

By default this will generate the output graphics file CELL_NAME_contact_map.svg. A different output file name may be specified via the -o option.
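For scripts that need to locate the generated graphic, the default name can be derived from the input file name as in the example above. This is a small sketch with a hypothetical helper; it assumes the SVG is written alongside the NCC file, which is inferred from the example rather than confirmed here.

```python
import os


def default_contact_map_path(ncc_path):
    """Derive the default contact map SVG name from an NCC file path."""
    root, _ext = os.path.splitext(ncc_path)
    return root + "_contact_map.svg"


print(default_contact_map_path("CELL_NAME.ncc"))
```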

Versions

There are two main versions of NucProcess: a stable release that corresponds to the primary publications, and a development version where new features are being added and tested. We recommend using the stable release under normal circumstances.

The current stable version is available here: Release 1.0

The development version is available here: Development master

Citations

If you use NucProcess in published work, please cite the following reference:

Stevens et al. Nature. 2017 Apr 6;544(7648):59-64 PMID:28289288

Licensing

NucProcess is licensed under the GNU Lesser General Public License v3.0.
