Copyright (c) 2015, Bo Wang, Academy of Mathematics and Systems Science,
Chinese Academy of Sciences, Beijing 100190, China
3Dec User Guide
Please note that the module for quality scores has not been fully investigated yet. We will
update it in future versions.
Please contact Bo Wang ([email protected]) for any problems, bugs or suggestions.
3Dec
Prerequisites for operating systems:
The executable files were built under Ubuntu 14.04 LTS and tested to work well on
Red Hat Enterprise Linux 7.2 (Maipo). However, we do not guarantee that they run as expected
on other systems, especially older ones. If the executable files do not work,
please rebuild them from the source code or contact us for help.
To run the base caller, execute the binary file 3Dec.linux (under Linux) or 3Dec
(built from source code).
Type "3Dec --help" to see the manual.
3Dec-train:
This module has not been fully investigated yet. 3Dec can run the
base-calling scheme without it; it is kept here only as an option.
Be careful when using it.
3Dec-train is used to train a new model for Phred quality scores. The default model
used by 3Dec was trained on the first tile of the BlindCall Hiseq2000 PhiX dataset
(distributed with the paper PMID 24413520). Training the model requires an entire tile
in which all reads have a known reference. Mismatches between the short reads and the
reference will cause the quality scores to be underestimated.
Before training, run 3Dec with the arguments -q -t to generate the corrected
intensity file "cifname.cif", then align the generated .fastq file to the reference of the
short reads using mapping software such as bowtie2 or BWA. The mapping results should
be stored in "samname.sam" in the same order as the reads. The model file "modelname" used by
3Dec (with option -m) can then be generated by typing the command:
$ 3Dec-train cifname samname modelname
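The "same order" requirement can be checked before training. The following is a minimal Python sketch and not part of 3Dec. It assumes that each FASTQ record spans exactly four lines, that the read name is the first whitespace-delimited token of the header line, and that the SAM file holds one primary record per read (secondary and supplementary records, SAM flags 0x100 and 0x800, are skipped). The function names are hypothetical.

```python
def fastq_names(lines):
    """Yield read names from FASTQ lines (4 lines per record assumed)."""
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.strip():
            # Header looks like "@name optional description"
            yield line[1:].split()[0]


def sam_names(lines):
    """Yield the QNAME of every primary record in SAM lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800)
        yield fields[0]


def same_order(fastq_lines, sam_lines):
    """Return True if both inputs list the same reads in the same order."""
    return list(fastq_names(fastq_lines)) == list(sam_names(sam_lines))
```

Both arguments can be open file handles, for example the generated .fastq file and "samname.sam". If bowtie2 is run with several threads, its --reorder option is meant to keep the SAM output in input order.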
For example, to train the model using the first tile of the BlindCall
dataset, follow these steps:
1) Download the dataset (links are provided in "Test data") and unpack it;
2) Install sequence alignment software, such as bowtie2 or BWA;
3) Download the bacteriophage PhiX174 reference, which is available from NCBI;
4) Run 3Dec with the arguments -q -t to generate the intensity file and the fastq file:
$ 3Dec -q -t -f -s -c 1,101 --osubfix _clean -i ./PhiX174_UMD_HiSeq_201305/Data/Intensities/L004 s_4_1113
Two files, s_4_1113_clean.cif and s_4_1113_clean.fastq, will then be generated;
5) Align the sequences in s_4_1113_clean.fastq to the reference using the alignment software,
and write the results in SAM format to a file, for example s_4_1113.sam. To make the model more
accurate, one can modify the reference for SNPs based on the alignment result;
6) Train the model with the following command:
$ 3Dec-train s_4_1113_clean s_4_1113_clean s_4_1113.sam s_4_1113.model
Then the coefficients of the model will be stored in the file s_4_1113.model, which can
be used in 3Dec by the argument "-m s_4_1113.model".
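Because mismatches against the reference lower the estimated quality scores, it can help to look at the overall edit rate of the alignment before training. The following is a rough Python sketch, not part of 3Dec. It assumes that the aligner writes the optional NM:i tag (edit distance, which counts mismatches and indels) for mapped reads, as bowtie2 and BWA normally do. The function name is hypothetical.

```python
def edit_rate(sam_lines):
    """Return total edit distance divided by total read bases, or None."""
    edits = 0
    bases = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x900:
            continue  # unmapped, secondary or supplementary record
        nm = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        if nm is None:
            continue  # no edit distance reported for this record
        edits += nm
        bases += len(fields[9])  # length of the read sequence
    return edits / bases if bases else None
```

An unexpectedly high rate may indicate that the reference differs from the sequenced sample, for example because of SNPs.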
locs2pos
locs2txt converts a cluster location file in "locs" format into plain text (_pos.txt format).
For details, see locs2pos/Readme.txt.
Please note that although 3Dec supports _pos.txt, it has not been tested on Miseq data yet.
Unexpected results may occur in this version. Use it on Miseq data CAREFULLY.
Building & Installation from source code
The makefile works under Linux (Ubuntu and Red Hat). The current version may not
support other operating systems well.
Prerequisites
The following tools should be installed in the system:
1) make
2) gcc
Recommended version: 4.8.2+ (which is the version we used)
The following libraries should be either installed in the system or provided
in the "include" folder. Details are explained in the next part.
3) liblinear
This library can be downloaded at
http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
Required version: 2.0+
4) Eigen
This library can be downloaded at
http://eigen.tuxfamily.org/index.php?title=Main_Page
Required version: 3.2+
Recommended version: 3.2.3+
(Copies of liblinear and Eigen are placed in the folder ./include. Users may replace
them with other versions or modify them.)
Build & Installation
Open a terminal, change directory to the root of the package, and first type:
$ make clean
to remove previously generated files.
Next type
$ make dependency=included
The executable files will be built in ./bin.
Then type:
$ sudo make install
The executable files will be copied to $(DESTDIR) (default: /usr/local/bin).
The installation directory can be changed to DESTFOLDER by typing
$ make install DESTDIR=DESTFOLDER
As an alternative, you can also compile them using libraries that are
already installed in the system by:
$ make
If your compiler does not support OpenMP, please add the argument "openmp=disabled"
to make to disable the parallel feature:
$ make openmp=disabled
The liblinear API differs among liblinear-1.9X, liblinear-1.9- and liblinear-2.0+, which
has caused us trouble. If you encounter compile errors for 3Dec-train, please update
liblinear to 2.0+, or try re-running make with
the argument "DEFINES=-D_liblinear_1":
$ make DEFINES=-D_liblinear_1
Test data:
Two datasets are available for testing this program:
Hiseq2000 Phix174 dataset:
This dataset contains the cluster intensity data for 3 tiles. It was distributed along
with BlindCall and can be downloaded at
ftp://ftp.cbcb.umd.edu/pub/data/hcorrada/BlindCall_data.tar.gz
or be obtained by wget:
$ wget ftp://ftp.cbcb.umd.edu/pub/data/hcorrada/BlindCall_data.tar.gz
GAII Phix174 dataset:
This dataset contains about 5 tiles. Each tile includes ~100,000 single-end reads of 37
sequencing cycles. It can be obtained at
https://1drv.ms/u/s!Alz39M_owi523324TpJHTVcf2eM9
Commands for the paper:
CIF files with corrected spatial crosstalk were generated by the commands:
(Hiseq2000 Phix174 dataset)
$ 3Dec -t -s -c 1,101 -i ./PhiX174_UMD_HiSeq_201305/Data/Intensities/L004 -o outputfolder s+
(GAII Phix174 dataset)
$ 3Dec -t -L -i ./GAII-ABCtoy -o outputfolder s+
The .fastq files were generated by the commands:
(Hiseq2000 Phix174 dataset)
$ 3Dec -q -f -s -c 1,101 -i ./PhiX174_UMD_HiSeq_201305/Data/Intensities/L004 -o outputfolder s+
(GAII Phix174 dataset)
$ 3Dec -q -f -L -i ./GAII-ABCtoy -o outputfolder s+
Manual (printed by "3Dec --help")
Type "3Dec --help" to show the help.
Usage:
  3Dec [options]* {-t -q | -r} <name|pattern> [name|pattern]* ...
    -t       outputs spatial-crosstalk-corrected CIF files.
    -q       outputs called sequences in Fastq format.
    -r       outputs called sequences in Fastq format, re-estimating matrices
             after correcting spatial crosstalk (slower but more accurate).
    name     specifies the tile name to be processed.
    pattern  a pattern XYZ+ specifies all tile names beginning with XYZ.
Options:
  -l           specifies the suffix (or the file extension) of the input
               location files in the next argument. (Default: .clocs)
  -L           short for [-l _pos.txt].
  -s           input CIFs are separated (e.g. when the input is an Illumina
               run folder): each cycle is in a subfolder. (Default: input
               intensities from a single file.)
  -S           output CIFs are separated. Ignored for [-q] or [-r].
               (Default: outputs intensities in a single CIF file.)
  -e           specifies the total number of ends. (Default: only one end.)
               Data are processed independently for each end.
  -c           specifies the begin and end cycles for each end in the next
               argument. Must be set after [-e]. E.g. [-c 1,101,102,109,110,210]
               specifies the cycles for the 3 ends of [-e 3].
  -i           specifies the input directory in the next argument. Default:
               current folder.
  -o           specifies the output directory in the next argument. Default:
               current folder.
  -m           specifies the .model file used for Phred score prediction. For
               details, see the help of 3Dec-train.
  -n           does not correct ACC if -q or -t.
  -f           reduces iterations for later blocks when estimating phasing. This
               reduces calculation time while slightly reducing accuracy.
  -p           specifies the number of processes used. Default: OpenMP default value.
  --inpath     the same as [-i].
  --loctype    the same as [-l].
  --outpath    the same as [-o].
  --version    prints the 3Dec version.
  --inprefix   prefix for input
  --oprefix    prefix for output
  --insubfix   suffix for input
  --osubfix    suffix for output
The argument following each of these four options specifies the extra part of the
input or output CIF (fastq) file names relative to the location file
names. The four options add a prefix or suffix to the I/O file names.
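To illustrate how the -c argument and the prefix/suffix options can be read, here is a minimal Python sketch. It is not the 3Dec implementation. The function names are hypothetical, and the naming order (prefix, tile name, suffix, extension) is an assumption inferred from the --osubfix _clean example above, which produces s_4_1113_clean.cif.

```python
def parse_cycles(spec, ends=1):
    """Split a -c value such as "1,101,102,109" into (begin, end) pairs."""
    values = [int(v) for v in spec.split(",")]
    if len(values) != 2 * ends:
        raise ValueError("expected %d numbers, got %d" % (2 * ends, len(values)))
    pairs = list(zip(values[0::2], values[1::2]))
    for begin, end in pairs:
        if begin > end:
            raise ValueError("begin cycle %d is after end cycle %d" % (begin, end))
    return pairs


def io_name(tile, extension, prefix="", subfix=""):
    """Compose an I/O file name from a tile name (assumed naming order)."""
    return prefix + tile + subfix + extension
```

For example, parse_cycles("1,101,102,109,110,210", ends=3) gives one pair of cycles per end, and io_name("s_4_1113", ".fastq", subfix="_clean") gives the fastq name from the training example.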
Examples:
3Dec -i ./L001 -o ./output -q s_1_1101 s_1_12+
This command reads the location file s_1_1101.clocs in directory ./L001, then reads the CIF file s_1_1101.cif in the same directory, performs base calling and writes s_1_1101.fastq to directory ./output. It then searches the directory ./L001 for all files matching the name pattern s_1_12*.clocs, reads the CIF file with the same tile name for each, and writes the fastq files to ./output.
3Dec -i ./L001 -o ./L001 -s -S -c 1,101 -t s+
This command searches the directory ./L001 for all location files matching the name pattern s*.clocs. For each location file sA.clocs, it reads the separated CIF files ./L001/C1.1/sA.cif, ./L001/C2.1/sA.cif, ... , ./L001/C101.1/sA.cif, corrects spatial crosstalk in them, and writes the corrected CIF files back (overwriting the original files).
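The tile selection and the separated CIF layout described in the two examples can be sketched in Python as follows. This only illustrates the behaviour described above and is not the 3Dec source code. The function names are hypothetical, and sorting the matched tiles is an assumption.

```python
def match_tiles(pattern, filenames, loc_suffix=".clocs"):
    """Return tile names selected by a plain name or an 'XYZ+' pattern."""
    tiles = sorted(
        name[: -len(loc_suffix)] for name in filenames if name.endswith(loc_suffix)
    )
    if pattern.endswith("+"):
        prefix = pattern[:-1]  # 'XYZ+' means every tile beginning with XYZ
        return [t for t in tiles if t.startswith(prefix)]
    return [t for t in tiles if t == pattern]


def separated_cif_paths(in_dir, tile, begin, end):
    """List per-cycle CIF paths for option -s: one subfolder C<cycle>.1 per cycle."""
    return ["%s/C%d.1/%s.cif" % (in_dir, c, tile) for c in range(begin, end + 1)]
```

Here filenames stands for the listing of the input directory, for example the result of os.listdir("./L001").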
Licence
3Dec is subject to the Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International Public License. A copy of the licence is included with the software. You can
also obtain one at http://creativecommons.org/licenses/by-nc-sa/4.0/.
Please note that the source code in the "include" folder is subject to different licences
such as MPL, MIT or BSD, and the author of 3Dec does not hold its copyright.
The licences can be found within or alongside the corresponding files.