Home
NucProcess does not require any special installation and may be run directly from its download location (e.g. after cloning the GitHub repository), though all the component files must reside in the same directory.
This software uses Python version 2 or 3 and requires that the NumPy package is installed and available to the Python version that runs NucProcess.
NumPy is available in bundled Python distributions like Anaconda or Canopy and in most Linux distributions' package managers, or it can be installed on most UNIX-like systems using pip:
pip install numpy
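Because NumPy must be visible to the same Python interpreter that runs NucProcess, it can help to check this before a first run. The following is a minimal sketch, not part of NucProcess; the function name numpy_available is hypothetical and chosen only for illustration.

```python
import sys


def numpy_available():
    """Return True if NumPy can be imported by the current interpreter."""
    try:
        import numpy  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


# Report which interpreter was checked, since several may be installed.
print("Interpreter:", sys.executable)
print("NumPy importable:", numpy_available())
```

Run this with the same python executable that will be used for NucProcess; if it reports that NumPy is not importable, install NumPy for that interpreter.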
To run NucProcess, issue the nuc_process command with the command line options described below. The options -i (input FASTQ files) and -g (genome reference) are mandatory, though it is usual to also use -re1 (primary restriction enzyme; the default is MboI), -o (root name of output files) and -re2 (secondary restriction enzyme in double-digest experiments). Chromatin contact output is created in the NCC data format.
The split_fastq_barcodes command can be used to split FASTQ files that represent many cells, each with a different barcode sequence, into separate paired read files.
The nuc_contact_map command takes the contact data from NCC format files to make all-chromosome contact map graphics in SVG format. This is automatically run on the main output of NucProcess, but can be run as required on any NCC format file.
The nuc_contact_probability command takes the contact data from one or more NCC format files to create log plots of contact probability versus sequence separation for intra-chromosomal contacts.
Any barcoded input FASTQ files must first be split into separate samples/cells before running nuc_process. We provide split_fastq_barcodes to achieve this.
However, the somewhat more basic splitFastqBarcodes.py script mentioned in the primary reference is still available and may be run as follows:
python splitFastqBarcodes.py MULTIPLEXED_DATA_r_1.fq MULTIPLEXED_DATA_r_2.fq
This will generate paired FASTQ files of the form:
MULTIPLEXED_DATA_r_1_CGC.fq MULTIPLEXED_DATA_r_2_CGC.fq
MULTIPLEXED_DATA_r_1_TAA.fq MULTIPLEXED_DATA_r_2_TAA.fq
where the file names are tagged with the corresponding barcode sequence. These demultiplexed files can then be used as input to NucProcess, specifying only one barcode for each run.
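The naming pattern shown above can be sketched in Python, for example to predict which demultiplexed files to pass to NucProcess for a given barcode. This is only an illustration inferred from the two example file names; the function names are hypothetical and the real script may handle other cases differently.

```python
import os


def barcode_tagged_name(fastq_path, barcode):
    """Insert _BARCODE before the file extension, as in the example above."""
    root, ext = os.path.splitext(fastq_path)
    return "{}_{}{}".format(root, barcode, ext)


def barcode_tagged_pair(read1_path, read2_path, barcode):
    """Return the pair of demultiplexed file names for one barcode."""
    return (barcode_tagged_name(read1_path, barcode),
            barcode_tagged_name(read2_path, barcode))


# Example using the file names and barcodes from the text above.
for code in ("CGC", "TAA"):
    print(barcode_tagged_pair("MULTIPLEXED_DATA_r_1.fq",
                              "MULTIPLEXED_DATA_r_2.fq", code))
```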
Typical first-time use, which creates a genome index and RE-digest files:
nuc_process -f /chromosomes/*.fa -o CELL_NAME -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 12 -g GENOME_DIR/GENOME_BUILD_NAME -i DATA_DIR/SEQUENCING_DATA_r_?.fq
(substituting upper-case values)
Typical use thereafter:
nuc_process -o CELL_NAME -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 8 -g GENOME_DIR/GENOME_BUILD_NAME -i DATA_DIR/SEQUENCING_DATA_r_?.fq
For the above commands:
-f /chromosomes/*.fa states that all FASTA files (ending in .fa) in the /chromosomes/ directory will be used for creation of the genome index and RE cut site files.
-o CELL_NAME specifies that CELL_NAME will be used for naming the NCC format output. In this case the main output contact file will be CELL_NAME.ncc
-v specifies verbose output of processing progress.
-a specifies to generate ambiguous contact files: CELL_NAME_ambig.ncc in this case.
-k specifies to keep all the intermediate processing files: filtered NCC files, clipped FASTQ files and the main Bowtie2 mapping SAM files.
-re1 is the primary restriction enzyme type at the ligation junction (see enzymes.conf).
-re2 is the secondary restriction enzyme used to release the fragments (option not used for sonication/Nextera based protocol).
-s is the valid molecule/fragment size range, as used in the DNA sequencing.
-n is the number of parallel CPU cores to use with Bowtie2.
-g GENOME_BUILD_NAME is the root name for the Bowtie2 genome index without any file extension and in this case would refer to files GENOME_BUILD_NAME.1.bt2, GENOME_BUILD_NAME.rev.1.bt2 etc.
-i SEQUENCING_DATA_r_?.fq is a wild-card expression matching the two input FASTQ files (though two separate file names, separated by a space, can be specified). In this case the expression would match SEQUENCING_DATA_r_1.fq and SEQUENCING_DATA_r_2.fq, the paired sequence read files.
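When many cells are processed, the "typical use thereafter" command can be assembled programmatically. The sketch below builds the argument list using only the options described above; the function name build_nuc_process_command and its parameter names are hypothetical, and it only constructs the command, it does not run it.

```python
import shlex


def build_nuc_process_command(cell_name, genome_index, fastq_paths,
                              re1="MboI", re2=None, size_range="150-2000",
                              num_cpu=8, verbose=True, ambiguous=True,
                              keep_files=True):
    """Assemble a nuc_process argument list from the options listed above."""
    fastq_paths = list(fastq_paths)
    if len(fastq_paths) != 2:
        raise ValueError("Expected two paired FASTQ files for -i")

    cmd = ["nuc_process", "-o", cell_name]
    if verbose:
        cmd.append("-v")      # verbose progress output
    if ambiguous:
        cmd.append("-a")      # also write CELL_NAME_ambig.ncc
    if keep_files:
        cmd.append("-k")      # keep intermediate files
    cmd += ["-re1", re1]
    if re2:                   # omit for sonication/Nextera based protocols
        cmd += ["-re2", re2]
    cmd += ["-s", size_range, "-n", str(num_cpu), "-g", genome_index, "-i"]
    cmd += fastq_paths
    return cmd


cmd = build_nuc_process_command(
    "CELL_NAME", "GENOME_DIR/GENOME_BUILD_NAME",
    ["DATA_DIR/SEQUENCING_DATA_r_1.fq", "DATA_DIR/SEQUENCING_DATA_r_2.fq"],
    re2="AluI")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be passed to subprocess.call, assuming nuc_process is on the PATH. Listing the two FASTQ files explicitly avoids relying on shell wild-card expansion.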
To generate contact map graphics from an NCC format file:
nuc_contact_map -i CELL_NAME.ncc
This will generate the output graphics file CELL_NAME_contact_map.svg. However, the output file name may be specified via the -o option.
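For scripting, the default output name can be predicted from the input name. The following sketch assumes the default is formed by replacing the .ncc extension with _contact_map.svg, which is inferred only from the single example above; the function name is hypothetical.

```python
import os


def default_contact_map_name(ncc_path):
    """Guess the default SVG name for an NCC file (assumed naming rule)."""
    root, _ext = os.path.splitext(ncc_path)
    return root + "_contact_map.svg"


print(default_contact_map_name("CELL_NAME.ncc"))
```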
There are two main versions of NucProcess: a stable release that corresponds to the primary publications, and a development version where new features are being added and tested. We recommend using the stable release under normal circumstances.
The current stable version is available here: Release 1.0
The development version is available here: Development master
If you use NucProcess in published work, please cite the following reference:
Stevens et al. Nature. 2017 Apr 6;544(7648):59-64 PMID:28289288
NucProcess is licensed under GNU Lesser General Public License v3.0