NucProcess
----------
NucProcess is a Python program to perform single-cell Hi-C sequence processing
of paired read data. This software takes paired FASTQ sequence read files
(representing Hi-C data for only ONE cell), a reference genome sequence and
knowledge of experimental parameters (restriction enzyme type and the range of
size of DNA fragments sequenced in the library) to create processed, single-cell
Hi-C contact files. By default, the output is generated in Nuc Chromatin Contact
(NCC) format, which is a simple text-based format described below. A processing
report document and a contact map will also be generated for each run, both in
SVG format, which can be viewed in most web browsers.
To run NucProcess, issue the 'nuc_process' command with the command line options
described below. The options -i (input FASTQ files) and -g (genome reference)
are mandatory, though it is usual to also use -re1 (primary restriction enzyme;
the default is MboI), -o (root name of output files) and -re2 (secondary
restriction enzyme in double-digest experiments).
The 'nuc_contact_map' command takes the contact data from NCC format files to
make all-chromosome contact map graphics in SVG format. This is automatically
run on the main output of NucProcess, but can be run as required on any NCC
format file.
The 'nuc_contact_probability' command takes the contact data from one or more
NCC format files to create log plots of contact probability versus sequence
separation for intra-chromosomal contacts.
IMPORTANT: If the files for the genome index and the corresponding restriction
enzyme (RE) cut location files have not been created they will be generated
automatically by NucProcess so long as the -f option is specified. This option
specifies the location of complete chromosomal sequences in FASTA format,
typically by using a wild-card specification. Once these files are present the -f
option need not be specified. If it is specified, it will not trigger the
re-creation of the files unless the -m or -x options are used (or the files have
been deleted).
NOTE: the names of the chromosomes used by NucProcess are determined by the file
names of the chromosome sequence files that were used to build the genome index
and RE files (and these must match). The chromosome sequence files should be
tagged with the chromosome name after an underscore but before the file
extension. For example "chr7" will be taken from a file named
"mm_ref_GRCm38.p2_chr7.fa".
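The naming rule above can be illustrated with a short Python sketch. This is not
NucProcess's own code: the function name chromo_name_from_path is hypothetical,
and the sketch assumes a single file extension (e.g. ".fa", not ".fa.gz").

```python
import os

def chromo_name_from_path(fasta_path):
    """Return the text after the last underscore and before the file extension.

    Illustrative only; assumes one extension such as ".fa".
    """
    file_name = os.path.basename(fasta_path)
    root, _ext = os.path.splitext(file_name)  # e.g. "mm_ref_GRCm38.p2_chr7"
    return root.rsplit("_", 1)[-1]            # text after the last underscore

print(chromo_name_from_path("/chromosomes/mm_ref_GRCm38.p2_chr7.fa"))
```

Checking file names this way before building the genome index may help catch
chromosome sequence files that would yield unexpected chromosome names.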
Citations
---------
If you use NucProcess in published work, please cite the following reference:
Stevens et al. Nature. 2017 Apr 6;544(7648):59-64 [PMID:28289288]
Barcoded input
--------------
The splitFastqBarcodes.py script is provided to split FASTQ files that represent
many cells, each with a different barcode sequence, into separate paired read
files. The script can be run as follows, specifying the names of the two paired,
multiplex FASTQ read files after the script:
python splitFastqBarcodes.py MULTIPLEXED_DATA_r_1.fq MULTIPLEXED_DATA_r_2.fq
This will generate paired FASTQ files of the form:
MULTIPLEXED_DATA_r_1_CGC.fq MULTIPLEXED_DATA_r_2_CGC.fq
MULTIPLEXED_DATA_r_1_TAA.fq MULTIPLEXED_DATA_r_2_TAA.fq
where the file names are tagged with the corresponding barcode sequence. These
demultiplexed files can then be used as input to NucProcess, specifying only one
barcode for each run.
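As a rough sketch of how the demultiplexed files might be grouped for separate
runs, the Python below pairs file names by barcode and prints one nuc_process
command per barcode. The file-name pattern is inferred from the example names
above, the function pair_by_barcode is hypothetical, and the genome path and
CELL_ output prefix are placeholders.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the example file names:
# <root>_r_<1 or 2>_<BARCODE>.fq
BARCODE_FILE = re.compile(r"_r_([12])_([ACGTN]+)\.fq$")

def pair_by_barcode(file_names):
    """Group demultiplexed FASTQ names into {barcode: (read_1, read_2)}."""
    reads = defaultdict(dict)
    for name in file_names:
        match = BARCODE_FILE.search(name)
        if match:
            read_num, barcode = match.groups()
            reads[barcode][read_num] = name
    # Keep only barcodes for which both reads of the pair were found
    return {bc: (r["1"], r["2"]) for bc, r in sorted(reads.items())
            if set(r) == {"1", "2"}}

names = ["MULTIPLEXED_DATA_r_1_CGC.fq", "MULTIPLEXED_DATA_r_2_CGC.fq",
         "MULTIPLEXED_DATA_r_1_TAA.fq", "MULTIPLEXED_DATA_r_2_TAA.fq"]

for barcode, (fq_1, fq_2) in pair_by_barcode(names).items():
    # One run per barcode; the genome path and output prefix are placeholders
    print("nuc_process -i %s %s -g /genome/GENOME_BUILD -o CELL_%s"
          % (fq_1, fq_2, barcode))
```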
Running NucProcess
------------------
Typical first-time use:
nuc_process -f /chromosomes/*.fa -o CELL_1 -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 12 -g /genome/GENOME_BUILD -i /data/SEQUENCING_DATA_r_?.fq
Typical use thereafter:
nuc_process -o CELL_1 -v -a -k -re1 MboI -re2 AluI -s 150-2000 -n 8 -g /genome/GENOME_BUILD -i /data/SEQUENCING_DATA_r_?.fq
For the above commands:
-f /chromosomes/*.fa states that all FASTA files (ending in .fa) in the
/chromosomes/ directory will be used for creation of the genome index and RE cut
site files
-o specifies CELL_1 will be used for naming the output. In this case the main
output contact file will be CELL_1.ncc
-v specifies verbose output of processing progress
-a specifies to generate ambiguous contact files: CELL_1_ambig.ncc in this case
-k specifies to keep all the intermediate processing files: Filtered NCC files,
clipped FASTQ files and the main Bowtie2 mapping SAM files
-re1 is the primary restriction enzyme type at the ligation junction (see
enzymes.conf)
-re2 is the secondary restriction enzyme used to release the fragments (option
not used for Nextera based protocol)
-s is the valid molecule/fragment size range, as used in the DNA sequencing
-n is the number of parallel CPU cores to use with Bowtie2
-g GENOME_BUILD is the root name for the Bowtie2 genome index without any file
extension and in this case would refer to files GENOME_BUILD.1.bt2,
GENOME_BUILD.rev.1.bt2 etc.
-i SEQUENCING_DATA_r_?.fq is a wild-card expression matching the two input FASTQ
files (though two separate file names, separated by a space, can be specified).
In this case the expression would match SEQUENCING_DATA_r_1.fq and
SEQUENCING_DATA_r_2.fq - the paired sequence read files.
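To check which files a wild-card expression of this kind would match, the
standard Python fnmatch module can be used as below. This only illustrates the
"?" wild-card (exactly one character); it makes no claim about how NucProcess
or the shell expands the expression, and the extra candidate names are invented
for the example.

```python
from fnmatch import fnmatchcase

pattern = "SEQUENCING_DATA_r_?.fq"

# Candidate names; the last two are invented to show what is NOT matched
candidates = [
    "SEQUENCING_DATA_r_1.fq",
    "SEQUENCING_DATA_r_2.fq",
    "SEQUENCING_DATA_r_10.fq",    # "?" matches one character only
    "SEQUENCING_DATA_r_1.fq.gz",  # extra suffix after ".fq"
]

matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```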
To generate contact map graphics from an NCC format file:
nuc_contact_map -i CELL_1.ncc
This will generate the output graphics file CELL_1_contact_map.svg. However, the
output file name may be specified via the -o option.
NCC data format
---------------
The main NCC output format for contact data takes the form of space-separated
lines, where each line represents a different chromosomal contact that pairs two
chromosomal regions. This format specifies more than just the chromosome contact
map: it also records the original read locations within the (ligated) primary RE
digest fragments, strand information, ambiguity information and information to
relate the data back to the original FASTQ input files. Accordingly this format
can be used for data filtering and validation.
The columns of NCC files correspond to:
Name of chromosome A
First base position of sequence read A
Last base position of sequence read A
5' base position of primary RE fragment containing read A
3' base position of primary RE fragment containing read A
The strand of sequence read A
Name of chromosome B
First base position of sequence read B
Last base position of sequence read B
5' base position of primary RE fragment containing read B
3' base position of primary RE fragment containing read B
The strand of sequence read B
The number of the ambiguity group to which the paired reads belong
The ID number of the read pair in the original FASTQ files
Whether read pairs are swapped relative to original FASTQ files
i.e.:
chr_a start_a end_a re1_a_start re1_a_end strand_a chr_b start_b end_b re1_b_start re1_b_end strand_b ambig_group pair_id swap_pair
For example two lines could be:
chr10 100002828 100002899 100002733 100003107 + chr10 100015771 100015700 100015676 100016001 - 1498612 1534464 0
chr10 100007729 100007658 100007551 100008354 - chr13 107185630 107185700 107184984 107185698 + 1009602 1032357 1
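A minimal Python sketch for reading such lines is given below. It is based
only on the column order listed above; the names NccContact and parse_ncc_line
are hypothetical and are not part of NucProcess. Only the base position columns
are converted to integers; the remaining columns are kept as text.

```python
from collections import namedtuple

# Column names in the order given above
NCC_FIELDS = ("chr_a start_a end_a re1_a_start re1_a_end strand_a "
              "chr_b start_b end_b re1_b_start re1_b_end strand_b "
              "ambig_group pair_id swap_pair").split()

# Base position columns, converted to int; all other columns stay as strings
POSITION_FIELDS = {"start_a", "end_a", "re1_a_start", "re1_a_end",
                   "start_b", "end_b", "re1_b_start", "re1_b_end"}

NccContact = namedtuple("NccContact", NCC_FIELDS)

def parse_ncc_line(line):
    """Split one whitespace-separated NCC line into a named record."""
    values = line.split()
    if len(values) != len(NCC_FIELDS):
        raise ValueError("Expected %d columns, found %d"
                         % (len(NCC_FIELDS), len(values)))
    converted = [int(value) if name in POSITION_FIELDS else value
                 for name, value in zip(NCC_FIELDS, values)]
    return NccContact(*converted)

contact = parse_ncc_line(
    "chr10 100002828 100002899 100002733 100003107 + "
    "chr10 100015771 100015700 100015676 100016001 - 1498612 1534464 0")

# An intra-chromosomal contact has the same chromosome on both sides
print(contact.chr_a == contact.chr_b, contact.start_a, contact.start_b)
```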
Command line options for nuc_process
------------------------------------
usage: nuc_process [-h] [-i FASTQ_FILE [FASTQ_FILE ...]] [-g GENOME_FILE]
[-re1 ENZYME] [-re2 ENZYME] [-s SIZE_RANGE] [-n CPU_COUNT]
[-r COUNT] [-o NCC_FILE] [-oa NCC_FILE] [-or REPORT_FILE]
[-b EXE_FILE] [-q SCHEME] [-qm MIN_QUALITY] [-m] [-p]
[-pt PAIRED_READ_TAGS PAIRED_READ_TAGS] [-x]
[-f FASTA_FILES [FASTA_FILES ...]] [-a] [-k] [-sam]
[-l SEQUENCE] [-z] [-v] [-u] [-c GENOME_COPIES]
Chromatin contact paired-read Hi-C processing module for Nuc3D and NucTools
optional arguments:
-h, --help show this help message and exit
-i FASTQ_FILE [FASTQ_FILE ...]
Input paired-read FASTQ files to process. Accepts
wildcards that match paired files. If more than two
files are input, processing will be run in batch mode
using the same parameters.
-g GENOME_FILE Location of genome index files to map sequence reads
to without any file extensions like ".1.bt2" etc. A new
index will be created with the name if the index is
missing and genome FASTA files are specified
-re1 ENZYME Primary restriction enzyme (for ligation junctions).
Default: MboI. Available: AluI, BglII, DpnII, HindIII,
MboI
-re2 ENZYME Secondary restriction enzyme (if used). Available:
AluI, BglII, DpnII, HindIII, MboI
-s SIZE_RANGE Allowed range of sequenced molecule sizes, e.g.
"150-1000", "100,800" or "200" (no maximum)
-n CPU_COUNT Number of CPU cores to use in parallel
-r COUNT Minimum number of sequencing repeats required to
support a contact
-o NCC_FILE Optional output name for NCC format chromosome contact
file. This option will be ignored if more than two
paired FASTQ files are input (i.e. for batch mode);
automated naming will be used instead.
-oa NCC_FILE Optional output name for ambiguous contact NCC file.
This option will be ignored if more than two paired
FASTQ files are input (i.e. for batch mode); automated
naming will be used instead.
-or REPORT_FILE Optional output name for SVG format report file. This
option will be ignored if more than two paired FASTQ
files are input (i.e. for batch mode); automated
naming will be used instead.
-b EXE_FILE Path to bowtie2 (read aligner) executable (will be
searched for if not specified)
-q SCHEME Use a specific FASTQ quality scheme (normally not set
and deduced automatically). Available: phred33,
phred64, solexa
-qm MIN_QUALITY Minimum acceptable FASTQ quality score in range 0-40
for clipping 3' end of reads. Default: 10
-m Force a re-mapping of genome restriction enzyme sites
(otherwise cached values will be used if present)
-p The input data is multi-cell/population Hi-C; single-
cell processing steps are avoided
-pt PAIRED_READ_TAGS PAIRED_READ_TAGS
When more than two FASTQ files are input (batch mode),
the substrings/tags which differ between paired FASTQ
file paths. Default: r_1 r_2
-x, --reindex Force a re-indexing of the genome (given appropriate
FASTA files)
-f FASTA_FILES [FASTA_FILES ...]
Specify genome FASTA files for genome index building
(accepts wildcards)
-a Whether to report ambiguously mapped contacts
-k Keep any intermediate files (e.g. clipped FASTQ etc).
-sam Write paired contacts files to SAM format
-l SEQUENCE Seek a specific ligation junction sequence (otherwise
this is guessed from the primary restriction enzyme)
-z GZIP compress any output FASTQ files
-v, --verbose Display verbose messages to report progress
-u Whether to only accept uniquely mapping genome
positions and not attempt to resolve certain classes
of ambiguous mapping where a single perfect match is
found.
-c GENOME_COPIES Number of whole-genome copies, e.g. for S2 phase;
Default 1.
Note: enzymes.conf can be edited to add further restriction enzyme cut-site
definitions.
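The three forms accepted by the -s option ("150-1000", "100,800" or "200") can
be interpreted along the lines of the Python sketch below. This is an
illustration of the documented forms only: the function parse_size_range is
hypothetical and the parsing inside NucProcess may differ.

```python
import re

def parse_size_range(text):
    """Return (minimum, maximum) sizes; maximum is None if not given."""
    # Accept a dash, a comma or whitespace between the two numbers
    parts = [part for part in re.split(r"[-,\s]+", text.strip()) if part]
    sizes = sorted(int(part) for part in parts)

    if len(sizes) == 1:
        return sizes[0], None  # a single value means no maximum
    if len(sizes) == 2:
        return sizes[0], sizes[1]
    raise ValueError("Size range must contain one or two numbers: %r" % text)

for example in ("150-1000", "100,800", "200"):
    print(example, parse_size_range(example))
```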
Command line options for nuc_contact_map
----------------------------------------
usage: nuc_contact_map [-h] [-i NCC_FILE [NCC_FILE ...]] [-o SVG_FILE_TAG]
[-w SVG_WIDTH] [-s BIN_SIZE] [-b] [-c RGB_COLOR]
Chromatin contact (NCC format) Hi-C contact map display module for Nuc3D and
NucTools
optional arguments:
-h, --help show this help message and exit
-i NCC_FILE [NCC_FILE ...] Input NCC format chromatin contact file(s). Wildcards accepted
-o SVG_FILE_TAG Optional name tag to put at end of SVG format contact
map file. Use "-" to print SVG to stdout rather than
make a file. Default: "_contact_map"
-w SVG_WIDTH SVG document width
-s BIN_SIZE Sequence region size represented by each small square (the
resolution) in megabases. Default is 5 kb
-b Specifies that the contact map should have a black background
(default is white)
-c RGB_COLOR Optional main color for the contact points as a 24-bit
hexadecimal RGB code e.g. "#0080FF" (with quotes)
For further help email [email protected] or [email protected]
Command line options for nuc_contact_probability
------------------------------------------------
usage: nuc_contact_probability [-h] [-i NCC_FILE [NCC_FILE ...]] [-o SVG_FILE]
[-w SVG_WIDTH] [-s KB_BIN_SIZE]
Chromatin contact (NCC format) probability vs sequence separation graph module
for Nuc3D and NucTools
optional arguments:
-h, --help show this help message and exit
-i NCC_FILE [NCC_FILE ...]
Input NCC format chromatin contact file(s). Wildcards
accepted
-o SVG_FILE Output SVG format file. Use "-" to print SVG to stdout
rather than make a file.
-w SVG_WIDTH SVG document width
-s KB_BIN_SIZE Sequence region size in kilobases for calculation of
contact probabilities. Default is 100 (kb)
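The idea of contact probability versus sequence separation can be sketched in
Python as below. This is a simplified illustration and not the code used by
nuc_contact_probability: the function name is hypothetical, the separation is
taken between the first base positions of the two reads, and the counts are
simply divided by the total number of intra-chromosomal contacts. The actual
program may choose positions and normalise differently.

```python
from collections import Counter

def separation_probabilities(ncc_lines, bin_size_kb=100):
    """Return {bin start in kb: fraction of intra-chromosomal contacts}."""
    bin_size = bin_size_kb * 1000  # kilobases to base pairs
    counts = Counter()

    for line in ncc_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or incomplete lines
        chr_a, start_a = fields[0], int(fields[1])
        chr_b, start_b = fields[6], int(fields[7])
        if chr_a != chr_b:
            continue  # only intra-chromosomal contacts are counted
        counts[abs(start_b - start_a) // bin_size] += 1

    total = sum(counts.values())
    if not total:
        return {}
    return {index * bin_size_kb: count / total
            for index, count in sorted(counts.items())}

lines = [
    "chr10 100002828 100002899 100002733 100003107 + "
    "chr10 100015771 100015700 100015676 100016001 - 1498612 1534464 0",
    "chr10 100007729 100007658 100007551 100008354 - "
    "chr13 107185630 107185700 107184984 107185698 + 1009602 1032357 1",
]
print(separation_probabilities(lines))
```

With the two example NCC lines, only the first (chr10 to chr10) is counted; the
second pairs chr10 with chr13 and is skipped.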