-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
205 lines (141 loc) · 8.26 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
ORFanFinder
(c) Yin Lab, 2014
Contents:
I. Installation
II. Input
III. Running
IV. Output
V. A Note on Using New Genomes
=========================================================================
I. Installation
Compiling ORFanFinder requires the g++ compiler and GNU make. All
Unix style operating systems ship with g++ and make by default. Windows
users should look into the Cgywin program to get g++ and GNU make.
While in the ORFanFinder directory, run the 'make' command. This will create
the ORFanFinder executable. Do the same in the ORFanDbFormat directory to
create the ORFanDbFormat executable. To make these programs availble
system wide, copy both execuatables into a directory within your PATH
(usually /usr/bin, /bin, /usr/local/bin)
Note: Preformatted databases and taxonomy files, along with example outputs
are available at cys.bios.niu.edu/orfanfinder/download.
------------------------------------------------------------------------
II. Input
Several input files are required for ORFanFinder. The following
entries list the files required, a description, and a small example
of the expected input.
1. BLAST output
This is a file that was given as output from NCBI BLAST
in tabular format. The argument "-outfmt 6" or "-outfmt 7" for
blast+ will give data in this format.
Data should look like:
ATCG00500.1|PACid:19637947 ACCD_ARATH 99.80 488 1 0 1 488 1 488 0.0 1006
ATCG00500.1|PACid:19637947 ACCD_CAPBU 96.31 488 14 1 1 488 1 484 0.0 967
ATCG00500.1|PACid:19637947 ACCD_CRUWA 95.29 488 19 1 1 488 1 484 0.0 954
2. ID file
This file contains a list of all of the gene IDs in the query
genome used in the BLAST.
Data should look like:
ATCG00500.1|PACid:19637947
ATCG00510.1|PACid:19637948
ATCG00280.1|PACid:19637949
3. Nodes file
This file is a tab-delimited file that lists each rank's taxonomy
id, that rank's parent's id, and the name of the rank. This
information stems from NCBI's taxonomy database, but individual
files can be generated so long as they follow this format, and
as long as every rank included in the database BLASTed against
is included.
Data should look like:
1 1 no rank
2 131567 superkingdom
6 335928 genus
7 6 species
4. Database File
The database file is generated with the ORFanDbFormat tool
that is included in this package. The file is a binary file.
Several databases can be compiled alongside the program. See
Section I. and README.Format for further details.
5. Names
The names file is an optional, tab-delimited file that
lists every taxonomy id in the database and its name. This file
is based off of one that comes from NCBI's taxonomy database,
but an individual one can be generated so long as they follow
this format, and as long as every rank included in the database
BLASTed against is included.
Data should look like:
1 root
2 Bacteria
6 Azorhizobium
7 Azorhizobium caulinodans
-------------------------------------------------------------------------
III. Running
ORFanFinder has several arguments that must be passed in, as well as
several options. For each of the file inputs, the data required has
been explained above.
[-query blast_filename] Tabular BLAST output file name
[-id id_list_filename] List of all query genes
[-tax int_val] The taxonomy id of the query genome. For example,
if A. thaliana were being run, the argument would be 3702.
[-db database_file] Database file created by ORFDbFormat
[-nodes nodes_filename] The file is tab-delimited and gives each
rank's taxonomy id, its parent's id
[-out out_filename] The name of the file to write output to.
<-names names_filename> This file is tab-delimited and give's
each taxonomy id and its name. Optional. If passed, aditional output is
given.
<-threads int_val> The number of worker threads to use. The total
threads used will be 3 greater than this number. Optional, defaults to 1.
Note: Preformatted databases and taxonomy files, along with example outputs
are available at cys.bios.niu.edu/orfanfinder/download.
Ex:
ORFanFinder -query Athaliana.bl -id Athaliana.id -tax 3702 \
-db nr.hdb -nodes nodes -threads 3 -out Athaliana.orf
--------------------------------------------------------------------------
IV. Output
Output from ORFanFinder is tab-delimited. With generic output, the file
will look like so:
gi|16127995|ref|NP_414542.1| family ORFan
gi|16127997|ref|NP_414544.1| superkingdom ORFan
gi|16127998|ref|NP_414545.1| superkingdom ORFan
The first column is the gene ID, the second column is the ORFan level
of the gene.
If the names argument was provided, aditional output will have been
generated. Instead, output will look like so:
gi|16127995|ref|NP_414542.1| family ORFan - Enterobacteriaceae [genus,2]Escherichia(Enterobacteriaceae),Shigella(Enterobacteriaceae), [species,3]Escherichia coli(Escherichia),Shigella flexneri(Escherichia),Shigella sonnei(Shigella), [subspecies,13](Escherichia coli),Escherichia coli O145:H28(Escherichia coli),Escherichia coli SE15(Escherichia coli),Escherichia coli IHE3034(Escherichia coli),Escherichia coli O103:H2(Escherichia coli),Escherichia coli W(Escherichia coli),Escherichia coli D6-113.11(Escherichia coli),Escherichia coli ETEC H10407(Escherichia coli),Escherichia coli BL21(DE3)(Escherichia coli),Escherichia coli ECC-1470(Escherichia coli),Escherichia coli O111:H-(Escherichia coli),Escherichia coli DH1(Escherichia coli),Escherichia coli B(Escherichia coli),
The first column still gives the gene ID. The second column displays
the ORFan level, but now also gives the name of that rank. The following
column display extra information for each hit rank. The data looks like:
[rankName,numGenesHit]hitGene(parentGene),hitGene(parentGene)...
Inside the brackets are the name of the rank and the number of hits to
that rank. Following it are each of the taxons that were hit in that rank.
Inside the parentheses is the taxon that is the parent to that taxon. Each
subsequent entry is separated by a comma.
--------------------------------------------------------------------------
V. A Note on Using New Genomes
By default, using genomes not in NCBI's Taxanomy database will not be recognized by ORFanFinder.
This is likely to be the case when using newly sequenced genomes, although other circumstances can
create this issue as well. This can be solved by using a custom nodes file as explained in
section II.2. At the very minimum, the nodes file must contain all genomes in the database and all
genomes that will be queried.
Oftentimes you will just need to run a new genome, and use the same database. In this case
the best solution will like be to use a modified version of the included nodes file. For example,
say that we wish to run Anabaena sp. WA113 with ORFanFinder. As it is not in Taxonomy, this will
not work. The first step is to retrieve an identifier for this strain. If possible, it is best to
use an NCBI taxonomy number. If one cannot be obtained, an arbitrary number may be used. The only
requirement is that it is unique and does not conflict with another line in the file. If you use
an arbitrary number, note that it could conflict with future NCBI numbers in Taxonomy. In this
example we use the Taxonomy number 1710889. The next step is to find the parent. In this case
we find Anabaena sp. in NCBI Taxonomy, with Taxonomy number 1167. If we were unable to find it
we would reapeat these steps for the taxon above our query. Finally, we need to determine
the taxon level. In this case we're dealing with a subspecies. Now that we have this information,
we add it to the end of the nodes file, tab delimited.
Eg:
1710889 1167 subspecies
At this point you could run ORFanFinder on Anabaena sp. WA113 safely. Make certain
to use this procedure on all necessary genomes in order to run ORFanFinder correctly.
If creating a database, this needs to be done for every genome. Make sure that root is ID
1 and has a parent of 1. Everything else can be identified however you want so long as it
has a unique name and a valid parent.
If you're trying to create the nodes file from a newer copy of the nodes.dmp, you can do so
with:
sed 's/\t[|]\t/\t/g' nodes.dmp | cut -f1,2,3