<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<title></title>
</head>
<body>
<div class="sright">
<div class="sub-content-head">
Maui Scheduler
</div>
<div id="sub-content-rpt" class="sub-content-rpt">
<div class="tab-container docs" id="tab-container">
<div class="topNav">
<div class="docsSearch"></div>
<div class="navIcons topIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="14.0troubleshootingandsysmaintenance.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="14.5troubleshootingclients.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="14.7troubleshootingjobs.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
<h1>14.6 Tracking System Failures</h1>
<h2>14.6.1 System Failures</h2>
<p>The scheduler has a number of dependencies which may cause failures if not satisfied. These dependencies are in the areas of disk space, network access, memory, and processor utilization.</p>
<h3>14.6.1.1 Disk Space</h3>
<p>The scheduler utilizes a number of files. If the file system is full or otherwise inaccessible, the following behaviors might be noted:</p>
<table border="2">
<tr>
<td><b>File</b></td>
<td><b>Failure</b></td>
</tr>
<tr>
<td><strong>maui.pid</strong></td>
<td>scheduler cannot perform <i>single instance</i> check</td>
</tr>
<tr>
<td><strong>maui.ck*</strong></td>
<td>scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, etc.</td>
</tr>
<tr>
<td><strong>maui.cfg</strong></td>
<td>scheduler cannot load local configuration</td>
</tr>
<tr>
<td><strong>log/*</strong></td>
<td>scheduler cannot log activities</td>
</tr>
<tr>
<td><strong>stats/*</strong></td>
<td>scheduler cannot write job records</td>
</tr>
</table>
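<p>A minimal sketch of a free-space check for the file system holding the files listed above. The <b>MAUI_HOME</b> variable, its default path, and the 100 MB threshold are illustrative assumptions, not scheduler settings; substitute the home directory of the local installation.</p>
<pre>
#!/bin/sh
# Hypothetical location of the scheduler home directory (assumption).
MAUI_HOME="${MAUI_HOME:-/usr/local/maui}"

# Print the available space, in kilobytes, on the file system holding $1.
# df -P selects the portable one-line-per-file-system output format.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Warn if fewer than MIN_KB kilobytes remain (threshold is an assumption).
MIN_KB=102400
if [ -d "$MAUI_HOME" ]; then
    free=$(avail_kb "$MAUI_HOME")
    if [ "$free" -lt "$MIN_KB" ]; then
        echo "WARNING: only ${free} KB free under $MAUI_HOME"
    fi
fi
</pre>
<p>A check of this kind can be run periodically so that a filling file system is noticed before checkpoint, log, or statistics writes begin to fail.</p>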
<h3>14.6.1.2 Network</h3>
<p>The scheduler utilizes a number of socket connections to perform basic functions. Network failures may affect the following facilities.</p>
<table border="2">
<tr>
<td><b>Network Connection</b></td>
<td><b>Failure</b></td>
</tr>
<tr>
<td>scheduler client</td>
<td>scheduler client commands fail</td>
</tr>
<tr>
<td>resource manager</td>
<td>scheduler is unable to load/update information regarding nodes and jobs</td>
</tr>
<tr>
<td>allocation manager</td>
<td>scheduler is unable to validate account access or reserve/debit account balances</td>
</tr>
</table>
<h3>14.6.1.3 Memory</h3>
<p>Depending on cluster size and configuration, the scheduler may require up to 50 MB of memory on the server host. If inadequate memory is available, multiple aspects of scheduling may be negatively affected. The scheduler log files should indicate whether memory failures were detected and mark any such messages with the <i>ERROR</i> or <i>ALERT</i> keywords.</p>
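<p>The keyword search described above might be sketched as follows. The <b>MAUI_LOG</b> variable and the default log path are assumptions made for illustration; the actual log location depends on the local configuration.</p>
<pre>
#!/bin/sh
# Print the lines of a log file that carry the ERROR or ALERT keywords.
# $1 is the path to a scheduler log file.
find_alerts() {
    grep -E 'ERROR|ALERT' "$1"
}

# Hypothetical log location (assumption); adjust for the local installation.
LOG="${MAUI_LOG:-/usr/local/maui/log/maui.log}"
if [ -r "$LOG" ]; then
    # Show only the 20 most recent matching lines.
    find_alerts "$LOG" | tail -n 20
fi
</pre>
<p>The search matches the keywords anywhere in a line, so it reports every failure marked this way, not only memory failures.</p>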
<h3>14.6.1.4 Processor Utilization</h3>
<p>On a heavily loaded system, the scheduler may appear sluggish and unresponsive. However, no direct failures should result from this slowdown. Indirect failures may include timeouts of peer services (such as the resource manager or allocation manager) or timeouts of client commands. All timeouts should be recorded in the scheduler log files.</p>
<h2>14.6.2 Internal Errors</h2>
<p>The Maui scheduling system contains features to assist in diagnosing internal failures. If the scheduler exits unexpectedly, the scheduler logs may provide information regarding the cause. If no reason can be determined, use of a debugger may be required.</p>
<h3>14.6.2.1 Logs</h3>
<p>The first step after any unexpected exit is to check the last few lines of the scheduler log. In many cases, the scheduler may have exited due to misconfiguration or detected system failures. The last few lines of the log should indicate why the scheduler exited and what changes would be required to correct the situation. If the scheduler did not intentionally exit, increasing the <a href="a.fparameters.html">LOGLEVEL</a> parameter to <b>7</b> or higher may help isolate the problem.</p>
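<p>As an illustration, raising the log level would typically be a one-line entry in <b>maui.cfg</b>. This fragment is a sketch that assumes the usual <i>parameter value</i> layout of that file; it is not taken from a shipped configuration.</p>
<pre>
# maui.cfg fragment (illustrative): raise logging verbosity
LOGLEVEL 7
</pre>
<p>Higher log levels produce considerably larger log files, so the disk space considerations described earlier apply while the setting is in effect.</p>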
<h3>14.6.2.2 Tracing the Failure with a Debugger</h3>
<p>If the scheduler did not intentionally exit due to detected environmental conditions, use of a debugger may assist in pursuing the problem further. The fastest method to isolate such situations is to launch the scheduler under the debugger and run it until the failure occurs. Use of the <b>MAUIDEBUG</b> environment variable will prevent the scheduler from <i>backgrounding</i> itself and allow the debugger to remain attached. The example below describes a standard debugging session.</p>
<pre>
> export MAUIDEBUG=yes
> gdb maui
(gdb) r
...
signal SIGILL received
(gdb) where
#0 MPBSJobStart() MPBSI.c:2013
#1 MRMJobStart() MRM.c:1107
#2 main() MSys.c:372
</pre>
<p>The debugger output should locate the source of the failure and help isolate the root cause.</p>
<h2>14.6.3 Reporting Failures</h2>
<p>If an internal failure is detected on your system, the information of greatest value to developers in isolating the problem will be the output of the gdb <b>where</b> subcommand and a printout of all variables associated with the failure. A level 7 log covering the failure can also help determine the environment that caused it. This information should be sent to <a mailto="[email protected]">[email protected]</a>.</p>
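<p>One possible way to collect the variables associated with a failure is sketched below, using standard gdb subcommands once the debugger has stopped at the failure. Output is omitted because it depends on the failure; the frame number is only an example.</p>
<pre>
(gdb) where
(gdb) info args
(gdb) info locals
(gdb) frame 1
(gdb) info locals
(gdb) bt full
</pre>
<p>Here <b>info args</b> and <b>info locals</b> print the arguments and local variables of the selected stack frame, <b>frame 1</b> selects the caller's frame, and <b>bt full</b> prints the complete backtrace together with the local variables of every frame.</p>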
<div class="navIcons bottomIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="14.0troubleshootingandsysmaintenance.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="14.5troubleshootingclients.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="14.7troubleshootingjobs.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
</div>
</div>
</div>
<div class="sub-content-btm"></div>
</div>
</body>
</html>