<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>GenomeThreader Frequently Asked Questions (FAQ)</title>
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<h1><i>GenomeThreader</i> Frequently Asked Questions (FAQ)</h1>
<ol>
<li><a href="#q1">How can I estimate the memory requirements of
<i>GenomeThreader</i>?</a></li>
<li><a href="#q2">How can I reduce the memory requirements of
<i>GenomeThreader</i>?</a></li>
<li><a href="#q3">How can I split the input files to reduce the memory
consumption and utilize multiple CPUs?</a></li>
</ol>
<p><b> (1) How can I estimate the memory requirements of
<i>GenomeThreader</i>?</b>
<a name="q1"></a>
</p>
<blockquote>
<p>
Memory consumption is driven mainly by the size of the input files, the number
of stored spliced alignments, and the maximum size of the Dynamic Programming
matrix. The genomic file(s), which are used to construct the index, need about
8 times their uncompressed size. The EST input file(s) need about 2 times
their uncompressed size. The number of stored spliced alignments is important
because all of them are held in main memory (their total size depends on your
use case, see the statistics at the end of the <i>GenomeThreader</i> output).
The maximum size of the Dynamic Programming matrix is very important,
because a large matrix can create a "space peak" that makes you run out of
memory. See the next question for advice on how to limit the size of the matrix.
</p>
</blockquote>
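<blockquote>
<p>
The two factors for the input files can be turned into a quick
back-of-the-envelope calculation. The following Python sketch is only an
illustration and not part of <i>GenomeThreader</i>: the function names and the
example file sizes are made up. The result covers the input files only; the
stored spliced alignments and the Dynamic Programming matrix come on top.
</p>
<pre>
```python
import os

# Rough multipliers taken from the answer above.
GENOMIC_FACTOR = 8  # genomic input (index): about 8 times the uncompressed size
EST_FACTOR = 2      # EST input: about 2 times the uncompressed size


def estimate_input_memory(genomic_bytes, est_bytes):
    """Return a rough estimate (in bytes) of the memory taken by the input files.

    Both arguments are lists of uncompressed file sizes in bytes.
    """
    return GENOMIC_FACTOR * sum(genomic_bytes) + EST_FACTOR * sum(est_bytes)


def estimate_from_files(genomic_files, est_files):
    """Same estimate, reading the sizes of uncompressed files from disk."""
    return estimate_input_memory(
        [os.path.getsize(name) for name in genomic_files],
        [os.path.getsize(name) for name in est_files],
    )


if __name__ == "__main__":
    # Hypothetical sizes: one 120 MB genomic file and one 50 MB EST file.
    mb = 1024 * 1024
    total = estimate_input_memory([120 * mb], [50 * mb])
    print("about %d MB for the input files alone" % (total // mb))
```
</pre>
<p>
For the hypothetical sizes in the example this gives 8 x 120 MB + 2 x 50 MB =
1060 MB. Treat the number as a starting point, not as a guarantee.
</p>
</blockquote>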
<p><b> (2) How can I reduce the memory requirements of <i>GenomeThreader</i>?</b>
<a name="q2"></a>
</p>
<blockquote>
<p>
Set option <tt>-gcmaxgapwidth</tt> to a value suited to your species to reduce
the maximum possible size of the Dynamic Programming matrix.
If you do not use option <tt>-introncutout</tt>, you can use
<tt>-autointroncutout</tt> to prevent "space peaks" caused by large Dynamic
Programming tables.
You can also split up the input files, as described in the next question.
</p>
</blockquote>
<p><b> (3) How can I split the input files to reduce the memory consumption and
utilize multiple CPUs?</b> <a name="q3"></a>
</p>
<blockquote>
<p>
The basic strategy is to use <tt>gth</tt> with option <tt>-intermediate</tt>
on different subsets of the input and combine the results afterwards.
The input files can be split with the <tt>gt splitfasta</tt> tool contained in
the <a href="http://genometools.org"><i>GenomeTools</i></a> package (an
open-source collection of bioinformatics tools).
To determine the right sizes for the genomic and EST/protein input files, keep
the answer to <a href="#q1">question (1)</a> in mind.
A common strategy in practice is to leave the genomic input files untouched
(if you can afford it memory-wise) and split the EST/protein input files into
chunks of 50 MB.
</p>
<p>
There are two possible formats to store the intermediate results: XML and GFF3
(options <tt>-xmlout</tt> and <tt>-gff3out</tt>, respectively).
XML output is lossless but takes more resources to process. GFF3 is much more
resource-efficient, but it only makes sense if you want GFF3 output in the
end, because the other formats cannot be reconstructed from GFF3 output.
To combine XML intermediate files, use <tt>gthconsensus</tt>, which is described
in the <i>GenomeThreader</i> <a href="doc/gthmanual.pdf">manual</a>.
Using <tt>gthconsensus</tt> is quite memory-intensive, because it takes the
intermediate XML files and reconstructs the alignments exactly as they were
during the <tt>gth</tt> run. That is, each spliced alignment needs to be
stored in memory.
</p>
<p>
If you don't need the full alignments but can live with the
structure in GFF3, the following strategy is recommended:
Call <tt>gth</tt> with the options <tt>-intermediate</tt> and
<tt>-gff3out</tt>. This gives you the spliced alignments predicted by
<i>GenomeThreader</i> in GFF3 format.
You can then postprocess these spliced alignments with tools from
the <a href="http://genometools.org"><i>GenomeTools</i></a> package.
You cannot convert the GFF3 output back to the intermediate form, but
you can perform the same analysis that <tt>gthconsensus</tt> performs with
tools operating on the GFF3 output.
To do so, sort the intermediate GFF3 files with <tt>gt gff3 -sort</tt> and
merge them with <tt>gt merge</tt> afterwards.
Then compute the consensus spliced alignments with <tt>gt csa</tt>.
If you also want to predict coding sequences, add them to the GFF3 with
<tt>gt cds</tt>.
Finally, <tt>gt filter</tt> allows you to filter spliced alignments according
to their scores.
</p>
</blockquote>
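<blockquote>
<p>
The GFF3 strategy from question (3) can be sketched as a small Python driver
that builds the command lines in order. This is a sketch under assumptions, not
a tested pipeline: all file names are hypothetical, and the options
<tt>-targetsize</tt>, <tt>-genomic</tt>, <tt>-cdna</tt> and <tt>-o</tt> are not
named in this FAQ, so check them against the help output of your installed
<tt>gt</tt> and <tt>gth</tt> versions. <tt>gt cds</tt> and <tt>gt filter</tt>
are left out because they need options this FAQ does not cover.
</p>
<pre>
```python
import subprocess


def splitfasta_command(est_file, chunk_mb=50):
    """gt splitfasta call that splits est_file into chunks of about chunk_mb MB.

    Assumption: -targetsize takes the chunk size in MB.
    """
    return ["gt", "splitfasta", "-targetsize", str(chunk_mb), est_file]


def gth_command(genomic, est_chunk, out_file):
    """gth call that writes intermediate results for one chunk in GFF3 format.

    Assumption: -genomic, -cdna and -o select the inputs and the output file.
    """
    return ["gth", "-genomic", genomic, "-cdna", est_chunk,
            "-intermediate", "-gff3out", "-o", out_file]


def postprocess_commands(gff3_files, merged="merged.gff3", csa="csa.gff3"):
    """Return (argv, stdout_path) pairs: sort each file, merge, compute consensus."""
    steps = []
    sorted_files = []
    for path in gff3_files:
        sorted_path = path + ".sorted"
        steps.append((["gt", "gff3", "-sort", path], sorted_path))
        sorted_files.append(sorted_path)
    steps.append((["gt", "merge"] + sorted_files, merged))
    steps.append((["gt", "csa", merged], csa))
    return steps


def run_step(argv, stdout_path):
    """Run one command and write its standard output to stdout_path."""
    with open(stdout_path, "w") as out:
        subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Dry run with hypothetical file names: print the commands, do not run them.
    print(" ".join(splitfasta_command("est.fas")))
    outputs = []
    for number, chunk in enumerate(["est.fas.1", "est.fas.2"], start=1):
        out_file = "chunk%d.gff3" % number
        print(" ".join(gth_command("genomic.fas", chunk, out_file)))
        outputs.append(out_file)
    for argv, stdout_path in postprocess_commands(outputs):
        print(" ".join(argv), "(output to %s)" % stdout_path)
```
</pre>
<p>
The <tt>gth</tt> calls are independent of each other, so they can be
distributed over several CPUs or machines; only the postprocessing steps need
all GFF3 files in one place. To execute a postprocessing step instead of
printing it, pass its pair to <tt>run_step</tt>.
</p>
</blockquote>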
<div id="footer">
Copyright © 2003-2016 <a href="mailto:[email protected]">
Gordon Gremme</a>. Last update: 2016-09-20
</div>
</body>
</html>