\subsection{Job Impact}
\label{s:jobimpact}
%%%%%%%%%%%%%%%%%%%%%%%%%
We have annotated messages relating to conditions that may affect performance, including
thermal throttling events, exceeded power budgets, and memory errors. Example annotation
descriptions include:
\begin{itemize}
\item \scriptsize\texttt{Correctable memory error. This may result in degraded performance}\normalsize
\item \scriptsize\texttt{Blade or Cabinet controller taking correctable memory errors. This may affect performance.}\normalsize
\item \scriptsize\texttt{Package temperature above threshold (too hot). The CPU clock has been throttled. Should result in all threads for all cores will be throttled. This may affect application performance.}\normalsize
\end{itemize}
Of particular interest on XC systems is the ability to power cap. Here it matters not only when
the power budget is exceeded, but also when caps are applied or fail to be applied. Our annotation
system enables such events to be exposed to the user.
Many of these messages are identified by commands in the \texttt{commands} file or in the \texttt{controller} logs
and therefore are not typically released to users.
Since our move to SLURM, there has been no automatic reporting of cap settings to the user.
Currently, our users must manually \texttt{cat /sys/cray/pm\_counters/power\_cap}
on each node to see the current cap setting, or poll it to catch dynamic changes at runtime.
Examples of annotated messages include commands annotated as:
\begin{itemize}
\item \scriptsize\texttt{applying a power profile}\normalsize
\item \scriptsize\texttt{enforcing a power limit}\normalsize
\end{itemize}
and error messages annotated as:
\begin{itemize}
\item \scriptsize\texttt{Error getting initial node power status. This may affect power capping.}\normalsize
\item \scriptsize\texttt{Error disabling power monitor. This may affect power capping.}\normalsize
\item \scriptsize\texttt{Node Error setting power budget. This may affect power capping.}\normalsize
\end{itemize}
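The manual polling of \texttt{/sys/cray/pm\_counters/power\_cap} mentioned above can be illustrated with a minimal sketch. It is not part of our annotation infrastructure. The function names \texttt{read\_power\_cap} and \texttt{watch\_power\_cap}, the one-second default interval, and the \texttt{max\_polls} parameter are assumptions made for illustration. The file contents are treated as an opaque string, and changes shorter than the polling interval are missed.

```python
import time
from pathlib import Path

# Path taken from the text; it is only expected to exist on Cray XC compute nodes.
POWER_CAP_PATH = Path("/sys/cray/pm_counters/power_cap")


def read_power_cap(path=POWER_CAP_PATH):
    """Return the raw contents of the power_cap file with whitespace stripped."""
    return Path(path).read_text().strip()


def watch_power_cap(path=POWER_CAP_PATH, interval_s=1.0, max_polls=None,
                    sleep=time.sleep):
    """Yield (timestamp, value) whenever the cap file's contents change.

    The first reading is always yielded. `interval_s` and `max_polls` are
    illustrative parameters, not settings of any Cray or SLURM tool.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        value = read_power_cap(path)
        if value != last:
            yield time.time(), value
            last = value
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
```

On a compute node, \texttt{for ts, cap in watch\_power\_cap(): print(ts, cap)} would print the initial setting and each later change. The loop has to run on every node of a job, which is the per-node manual effort noted above.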
A fundamental goal of the annotations is to expose information to users that will help them
understand why performance and power limits did not behave as expected.
Basic capabilities enabled by our infrastructure include the ability to determine annotations occurring during a given job
and jobs running while an annotation occurred. Examples of each are given in Figure~\ref{f:powerbudget}.
In the top of the figure, all annotations during a job are queried. The power budget is exceeded
multiple times on multiple nodes, which may affect performance. The cap is not a hard limit: the capping mechanism
keeps the average power draw over a window of a few seconds at or below the cap, but
instantaneous values may exceed it. Excursions over the cap are not made available to the user.
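The difference between an instantaneous excursion and a violation of the windowed average can be sketched as follows. This only illustrates the arithmetic and is not the capping mechanism itself. The function name \texttt{cap\_report}, the fixed sample spacing, and the window length in samples are assumptions.

```python
from collections import deque


def cap_report(samples_w, cap_w, window):
    """Classify power samples against a cap.

    Returns two lists of sample indices: those where the instantaneous
    value exceeds the cap, and those where the mean of the trailing
    `window` samples exceeds the cap. Samples are assumed equally spaced.
    """
    recent = deque(maxlen=window)
    instantaneous = []
    windowed = []
    for i, power in enumerate(samples_w):
        recent.append(power)
        if power > cap_w:
            instantaneous.append(i)
        # Only judge the average once a full window of samples is available.
        if len(recent) == window and sum(recent) / window > cap_w:
            windowed.append(i)
    return instantaneous, windowed
```

With hypothetical samples \texttt{[300, 360, 310, 300, 355, 305]}, a cap of 350, and a window of three samples, two samples exceed the cap while no windowed average does.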
In the bottom of the figure, jobs running while an annotation occurred are queried. In this case,
while the annotation was specific to node \texttt{c0-0c1s3n2}, the job itself was running on 96 nodes.
Facilitating the tracing of how the impact of an event propagates is one of the design goals of this work.
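Both queries reduce to a test for overlapping time intervals, sketched below. This is not the implementation behind \texttt{get.py}. The dictionary keys follow the column headers in Figure~\ref{f:powerbudget}, while the function names and the choice to match on time alone, without filtering by node, are assumptions.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    """Parse a timestamp in the format shown in the figure."""
    return datetime.strptime(ts, FMT)


def overlaps(start_a, end_a, start_b, end_b):
    """True if the closed intervals [start_a, end_a] and [start_b, end_b] intersect."""
    return start_a <= end_b and start_b <= end_a


def annotations_during_job(job, annotations):
    """Annotations whose time span intersects the job's run time."""
    js, je = parse(job["Start"]), parse(job["End"])
    return [a for a in annotations
            if overlaps(parse(a["starttime"]), parse(a["endtime"]), js, je)]


def jobs_during_annotation(annotation, jobs):
    """Jobs whose run time intersects the annotation's time span."""
    s, e = parse(annotation["starttime"]), parse(annotation["endtime"])
    return [j for j in jobs
            if overlaps(parse(j["Start"]), parse(j["End"]), s, e)]


# Values copied from the figure.
job = {"JobId": "163510", "Start": "2015-05-03 00:51:24", "End": "2015-05-03 01:19:15"}
annotations = [
    {"id": 855158, "starttime": "2015-05-03 01:12:16", "endtime": "2015-05-03 01:12:16"},
    {"id": 864565, "starttime": "2015-05-03 01:12:21", "endtime": "2015-05-03 01:12:21"},
]
```

Matching on time alone returns every job running when the annotation occurred, including jobs on unrelated nodes. Restricting results to jobs whose node list contains the annotated component would need the per-job node list, which the figure does not show.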
\begin{figure*}
\begin{annol}
# query for annotations during JobId 163510
python get.py -j 163510 -f table annotations
id authorid starttime endtime description logfiles LDcategory components balerpatternid
<@\textbf{\textcolor{red}{855158 acg 2015-05-03 01:12:16 2015-05-03 01:12:16 Node power budget exceeded. controllermessages PW ["c0-0c1s3n2"] 2871}}@>
<@\textbf{\textcolor{red}{864565 acg 2015-05-03 01:12:21 2015-05-03 01:12:21 Node now within power budget after it was exceeded. controllermessages PW ["c0-0c1s3n2"] 2872}}@>
<@\textbf{\textcolor{red}{855159 acg 2015-05-03 01:12:22 2015-05-03 01:12:22 Node power budget exceeded. controllermessages PW ["c0-0c0s11n0"] 2871}}@>
<@\textbf{\textcolor{red}{855160 acg 2015-05-03 01:12:23 2015-05-03 01:12:23 Node power budget exceeded. controllermessages PW ["c0-0c1s3n1"] 2871}}@>
<@\textbf{\textcolor{red}{...Occurs multiple times for multiple nodes...}}@>
# query for jobs running during annotation 855158
python get.py --jobs 855158 -f table annotations
JobId UID JobName NumNodes Start End
<@\textbf{\textcolor{red}{('163510', XXX, ``'xhpl''', 96, '2015-05-03 00:51:24', '2015-05-03 01:19:15')}}@>
\end{annol}
\caption{Annotations provide access to power state information in otherwise unavailable logs. Basic implementation
capabilities include discovery of annotations during a job (top) and jobs running while an annotation occurred (bottom).}
\label{f:powerbudget}
\end{figure*}