-
Notifications
You must be signed in to change notification settings - Fork 22
/
README.txt
254 lines (185 loc) · 9.76 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
================================================================================
perf-libs-tools Copyright 2017-21 Arm Limited All rights reserved
================================================================================
Tools to support Arm Performance Libraries
------------------------------------------
This project provides tools to enable users of HPC applications to understand
which routines from Arm Performance Libraries are being called.
The main component of this suite is the logging library that produces output to
summarize the key information about routines called. A selection of Python
scripts for visualizating some of this data is also included.
Licensing
---------
This project is distributed under an Apache 2.0 license, available in the file
LICENSE. All inbound contributions will also be under this same licence.
Compiling
---------
1) Ensure that you are using the latest version of the logging library to match
your build of Arm Performance Libraries. This version matches 21.0. Note
you can build using either GCC or Arm Compiler, however an Arm Performance
Libraries or Arm Compiler module must have been loaded to add the correct
directory containing 'armpl.h' to your path.
2) From this top level directory type "make".
You should now have a "lib/libarmpl-summarylog.so" created.
Usage
-----
1) Use LD_PRELOAD to pick-up the newly created logging library:
export LD_PRELOAD=$PWD/lib/libarmpl-summarylog.so
If your application is itself threaded you should use the library
libarmpl_mp-summarylog.so instead as this will enable correct logging from
the different threads all calling Arm Performance Libraries functions
concurrently.
2) When you are building your application ensure that you are linking in
the shared library (e.g. libarmpl_lp64_mp.so)
rather than the static library (e.g. libarmpl_lp64_mp.a).
3) Run your application as normal. Output files will be produced in the /tmp/
directory when the program completes. These will be called
/tmp/armplsummary_<pid>.apl where <pid> is the process ID.
Note an extra line is added to the output of your program reading
Arm Performance Libraries output summary stored by default in files named
/tmp/armplsummary_16776.apl
with the appropriate PID included. The root of this can be modified using
the environment variable ARMPL_SUMMARY_FILEROOT.
4) For certain applications we have seen situations where programs do not
terminate as expected, for instance if they are killed, crash, or are forked
from Python. In these cases it can be useful to have intermediate output
files written during the execution. In order to use this functionality use
the environment variable ARMPL_SUMMARY_FREQ to set an initial frequency of
library calls between each output. By default this frequency grows by 10%
each output in order to limit the overhead in cases where the user has
little knowledge of the number of cases expected.
If using this frequency output, these extra output files will have an
additional subscript in the name written, e.g.
/tmp/armplsummary_16776_01.apl
This is because a terminating program may die mid-write of the output file,
hence allowing multiple file names to be iterated around means you can check
which was the latest output.
Note we have also included the library "lib/libarmpl-logger.so" which is used in
the same way, however this gives much more verbose output producing an output
line for every single library call. The file root for output here is set
using ARMPL_LOGGING_FILEROOT.
Output
------
The output files generated list the following information:
Column 1 : "Routine:"
Column 2 : <routine name>
Column 3 : "nCalls:"
Column 4 : <number of calls>
Column 5 : "Mean_time"
Column 6 : <mean time of this call>
Column 7 : "nUserCalls:"
Column 8 : <number of top level calls>
Column 9 : "Mean_user_time:"
Column 10 : <mean time of this top-level call>
Column 11 : "Inputs:"
Column 12 onwards : input parameters to routine, integers then characters
That is, at the end of the execution of a program a summary list is produced
itemizing each of the routines called, detailing the input cases that it has
been called with, and then calculating the mean time taken for those inputs.
Only the integer and character parameters are recorded in this output. This is
because the input cases of matrix contents should not normally be relevant to
understanding the cases used. It does mean certain input parameters, such as
ALPHA and BETA for a DGEMM case, are missed, however this is an acceptable
situation for the automatic creation of a logging library.
Note that we now provide two sets of timing results: inclusive and top-level.
Typically users may call a routine such as DPOTRF, but internally they may
in turn call many other LAPACK or BLAS functions.
Tools
-----
In the "tools/" directory are some example scripts to produce graphical summaries
of the output produced from the logging library. They are written using the
Python package MatPlotLib, and we recommend running the visualization part of
that script on a local machine rather than on a remote box.
At present we are seeing these as a two-stage process:
(1) process the /tmp/armplsummary_<pid>.apl output
(2) visualize this output
There will be many more useful visualizations that can be done with this data,
and we look forward to receiving additional contributions.
All tools, at present, are written to be run from the "tools" directory so you
need to
cd tools
before following the instructions below.
---Overall library usage---
./process_summary.py <input files>
This produces information about all library calls made. A high-level
summary is returned, folllowed by break-downs of all the calls to each
library component (BLAS, LAPACK, FFT). This data is split into output
summarising both the total of calls and time spent in the function, and
also the number and time spent in this function from just user-called
invocations.
For FFT calls this data also matches planning and execution stages
enabling detailed profliing of fftw_execute() calls with only a pointer.
An additional file generated is the BLAS summary data, previously
produced by "./process-blas.sh", which can be fed into "./blas_usage.py",
see below.
Input: One or more /tmp/armplsummary_<pid>.apl files.
Output: <text> and /tmp/armpl.blas
---DGEMM calls---
./process-dgemm.sh <input files>
This produces summary information about GEMM calls made in an application.
Input: One or more /tmp/armplsummary_<pid>.apl files.
Output: /tmp/armpl.dgemm
./heat_dgemm.py
This visualizes the data from an armpl.dgemm file.
Usage: heat_dgemm.py [-h] [-i INFILE] [-l]
Optional arguments:
-h, --help Show help message and exit
-i INFILE, --input INFILE Input file
-l, --legend Show graph legend
What does it show?
Two rows, each with two heat maps and a pie chart
The top row shows the number of calls made.
The bottom row shows the time spent.
The heat maps show the proportion of calls/time at each problem dimension
for matix A (left) and B (right).
The pie charts show the proportion of calls/time for the given transpose
options for matrices A and B.
Input: /tmp/armpl.dgemm
Output: A matplotlib window showing usage.
---BLAS calls---
./process-blas.sh
DEPRECATED -- use "./process-summary.py" instead
This produces summary information about BLAS calls made in an application.
Input: One or more /tmp/armplsummary_<pid>.apl files.
Output: /tmp/armpl.blas
./blas_usage.py
This visualizes the data from an armpl.blas file generated by
'process_summary.py'.
Usage: blas_usage.py [-h] [-l] [-n] [-x] [-i INFILE]
Optional arguments:
-h, --help Show help message and exit
-i INFILE, --input INFILE Input file
-l, --legend Show graph legend
-n, --normalize Normalize bars to percentages of max
-x, --exclude Exclude unrepresented functions
What does it show?
Two sets of three rows of bar charts:
Top set: Number of calls
Bottom set: Time spent in these calls
For each set:
Top row: BLAS level 1 routines
Middle row: BLAS level 2 routines
Bottom row: BLAS level 3 routines
On each row different families of functions are shown, for example GEMM.
For each family different bars show the number of routines of:
single precision real (red)
double precision real (blue)
single precision complex (yellow)
double precision complex (green)
Useful additional option include "-x" to exclude routines that are never
called and "-n" to normalise time taken against the most expensive
routine.
Input: /tmp/armpl.blas.
Output: A matplotlib window showing usage.
Known issues
------------
* If being used with GCC it is necessary to remove the functions cdotc_, cdotu_,
zdotc_ and zdotu_ from src/PROTOTYPES. For convenience these are the first
four lines of that file.
* The top level call timings do not work when called from within a nested OpenMP
parallel regions. Set your code to use only one level of OpenMP.
* For certain codes that create enormous numbers of FFTW plans then it may
be necessary to prevent trying to match plans with executes. This situation
would result in a significant, and worsening, run-time performance of the
application. This is not detailed here, but instructions can be made
available upon request.