\chapter{Interoperation}
\label{chap:interop}
In this chapter we briefly discuss the most common scenarios where Legion programs need to interoperate with other systems, relying throughout on examples from the {\tt legion/examples} directory of the Legion repository.
\section{MPI}
\label{sec:mpi}
Legion has well-developed support for interoperation with MPI. The essentials of the approach are:
\begin{itemize}
\item The top-level Legion task is control-replicated, with the number of shards equal to the number of ranks of MPI.
\item Legion and MPI time-slice the machine: One of MPI or Legion is running at any given time, while the other runtime
waits.
\item Data can be shared between MPI and Legion by attaching an MPI buffer as a region with simultaneous coherence (with the correct layout constraint
to ensure the buffer contents are interpreted correctly by Legion). The data can be moved back and forth between a shard of the top-level
task and the corresponding MPI rank using the producer-consumer synchronization discussed in Section~\ref{sec:simultaneous}; a sketch of such an attach appears just after this list.
\end{itemize}
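Attaching an MPI buffer (the last item above) might look like the following minimal sketch. This is not code from the example; the buffer pointer {\tt mpi\_buffer}, the region {\tt lr}, and the field {\tt FID\_DATA} are hypothetical names used only for illustration:
\begin{lstlisting}
// Sketch: attach an existing MPI-allocated buffer as an external
// instance backing logical region lr (one field, SOA layout).
std::vector<FieldID> fields(1, FID_DATA);
AttachLauncher launcher(LEGION_EXTERNAL_INSTANCE, lr, lr);
launcher.attach_array_soa(mpi_buffer, false /*column major*/, fields);
PhysicalRegion pr = runtime->attach_external_resource(ctx, launcher);
// ... producer-consumer synchronization with the MPI rank goes here ...
// Detaching flushes any updates back to the MPI buffer.
runtime->detach_external_resource(ctx, pr);
\end{lstlisting}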
MPI interoperation is illustrated in {\tt legion/examples/mpi\_with\_ctrl\_repl/mpi\_with\_ctrl\_repl.cc}:
\begin{lstlisting}
MPILegionHandshake handshake;
...
// This is the preferred way of using handshakes in Legion
IndexLauncher worker_launcher(WORKER_TASK_ID, launch_bounds, TaskArgument(NULL, 0), args_map);
// We can use our handshake as a phase barrier
// Record that we will wait on this handshake
worker_launcher.add_wait_handshake(handshake);
// Advance the handshake to the next version
handshake.advance_legion_handshake();
// Then record that we will arrive on this version
worker_launcher.add_arrival_handshake(handshake);
// Launch our worker task
// No need to wait for anything
runtime->execute_index_space(ctx, worker_launcher);
\end{lstlisting}
In this excerpt, we see that the synchronization between MPI and Legion is wrapped in a {\tt MPILegionHandshake} object (line 1). The handshake encapsulates a phase barrier and is used similarly (see Section~\ref{sec:simultaneous}), but a handshake also knows how to work with MPI.
An index task launcher is built to run the Legion-side work (line 4) and its execution is deferred until the MPI side signals it is done running (line 7). Just like a phase barrier, handshakes have generations so that they can be reused multiple times, typically across
iterations of a loop. The handshake is advanced to the next generation (line 9) and when the index tasks are finished the (new generation of the) handshake is signaled to restart the MPI side (line 11).
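The handshake also provides blocking calls that can be invoked directly from a Legion task, {\tt legion\_wait\_on\_mpi} and {\tt legion\_handoff\_to\_mpi}; the launcher-based form above is preferred because it never blocks the top-level task and lets the runtime defer the launched work. A minimal sketch of the blocking style (not taken from the example):
\begin{lstlisting}
// Blocking alternative inside a Legion task: wait for MPI to hand off,
// do the Legion-side work for this iteration, then hand control back.
handshake.legion_wait_on_mpi();
// ... Legion work goes here ...
handshake.legion_handoff_to_mpi();
\end{lstlisting}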
The MPI side of the interface is symmetric. From the same example:
\begin{lstlisting}
for (int i = 0; i < total_iterations; i++)
{
printf("MPI Doing Work on rank %d\n", rank); // MPI work goes here
if (strict_bulk_synchronous_execution)
MPI_Barrier(MPI_COMM_WORLD);
// Perform a handoff to Legion, this call is
// asynchronous and will return immediately
handshake.mpi_handoff_to_legion();
...
// Wait for Legion to hand control back,
// This call will block until a Legion task
// running in this same process hands control back
handshake.mpi_wait_on_legion();
if (strict_bulk_synchronous_execution)
MPI_Barrier(MPI_COMM_WORLD);
}
\end{lstlisting}
MPI uses the same handshake object as Legion. Note that the
call to {\tt mpi\_wait\_on\_legion} blocks until Legion arrives at the handshake; the other arrive/wait handshake methods are asynchronous.
Because the MPI side blocks while it is waiting on Legion, it is not concerned with the generation of the handshake, so the generation should only be
advanced by the Legion side to allow for deferred execution of Legion tasks.
\section{OpenMP}
\label{sec:openmp}
Legion provides a straightforward model of interoperation with OpenMP. Legion tasks may use OpenMP pragmas internally to exploit
multiple threads in a single kernel. Legion tasks that use OpenMP should be mapped to OMP processors, which can be enforced by
adding an OMP constraint when the task is registered.
Under the hood Legion interoperates with OpenMP by directly implementing OpenMP functionality. Only a subset of OpenMP is supported,
but the support extends to the most commonly used features, particularly {\tt omp parallel for}.
The program {\tt legion/omp\_saxpy} illustrates typical uses of OpenMP in Legion programs. In this code, the leaf tasks (the tasks
that do not call other tasks) include OpenMP pragmas. For example, in {\tt simple\_blas.cc} a dot product operation is defined:
\begin{lstlisting}
template <>
float BlasTaskImplementations<float>::dot_task_cpu(const Task *task,
const std::vector<PhysicalRegion> &regions,
Context ctx, Runtime *runtime)
{
IndexSpace is = regions[1].get_logical_region().get_index_space();
Rect<1> bounds = runtime->get_index_space_domain(ctx, is);
const FieldAccessor<READ_ONLY,float,1,coord_t,
Realm::AffineAccessor<float,1,coord_t> > fa_x(regions[0], task->regions[0].instance_fields[0]);
const FieldAccessor<READ_ONLY,float,1,coord_t,
Realm::AffineAccessor<float,1,coord_t> > fa_y(regions[1], task->regions[0].instance_fields[0]);
float acc = 0;
#pragma omp parallel for reduction(+:acc) if(blas_do_parallel)
for(int i = bounds.lo[0]; i <= bounds.hi[0]; i++)
acc += fa_x[i] * fa_y[i];
return acc;
}
\end{lstlisting}
Note that unlike most other examples in this manual, this code uses an {\tt AffineAccessor} for the fields. Affine accessors support indexing into regions like arrays, which is necessary in this example because the different
iterations of the loop will be split across multiple OpenMP threads---we cannot use an iterator here, as an iterator by definition defines a sequential order of access to the elements iterated over. The {\tt axpy} task in the same file gives another example of using
OpenMP pragmas within Legion tasks.
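A minimal sketch of such an axpy-style kernel is shown below, assuming the same setup of {\tt regions} and {\tt bounds} as in the dot product task above; it is not the exact code from {\tt simple\_blas.cc}, and the field IDs {\tt FID\_X}, {\tt FID\_Y} and the scalar {\tt alpha} are illustrative names:
\begin{lstlisting}
// Sketch: y += alpha * x over the bounds of this task's subregion,
// with the loop iterations split across OpenMP threads.
const FieldAccessor<READ_ONLY,float,1,coord_t,
  Realm::AffineAccessor<float,1,coord_t> > fa_x(regions[0], FID_X);
const FieldAccessor<READ_WRITE,float,1,coord_t,
  Realm::AffineAccessor<float,1,coord_t> > fa_y(regions[1], FID_Y);
#pragma omp parallel for
for(int i = bounds.lo[0]; i <= bounds.hi[0]; i++)
  fa_y[i] += alpha * fa_x[i];
\end{lstlisting}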
The code for registering the tasks that use OpenMP is in {\tt simple\_blas.inl}:
\begin{lstlisting}
{
dot_task_id = Runtime::generate_static_task_id();
TaskVariantRegistrar tvr(dot_task_id);
#ifdef REALM_USE_OPENMP
tvr.add_constraint(ProcessorConstraint(Processor::OMP_PROC));
#else
tvr.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
#endif
Runtime::preregister_task_variant<T, BlasTaskImplementations<T>::dot_task_cpu>(tvr, "dot (cpu)");
}
\end{lstlisting}
This code is parameterized on whether OpenMP is to be used or not; if it is used the processor constraint for the task is set to {\tt OMP\_PROC}, otherwise it is set to {\tt LOC\_PROC} (i.e., CPUs).
There are a few command-line flags that affect the execution of Legion programs using OpenMP:
\begin{itemize}
\item {\tt -ll:ocpus n} sets the number of CPUs to be reserved for OpenMP to $n$.
\item {\tt -ll:othr t} sets the number of threads per CPU to $t$.
\item {\tt -ll:okindhack} exposes the master OpenMP thread as a CPU processor. This flag is useful when running with {\tt -ll:cpus 0} to give an extra processor to the OpenMP runtime; if there are some remaining CPU tasks they can be sent to the
processor running the master OpenMP thread using {\tt -ll:okindhack}.
\item {\tt -ll:onuma} ensures that OpenMP cores are grouped by NUMA domain; a warning is printed if NUMA support is not found.
\item {\tt -ll:ostack m} sets the OpenMP stack size to $m$ bytes.
\end{itemize}
Finally, Legion is not compiled with OpenMP by default. To enable OpenMP, build Legion with {\tt USE\_OPENMP = 1}.
\section{HDF5}
\label{sec:hdf5}
HDF5 is a standard file format for storing very large datasets. HDF5 is hierarchical: a file can be made of {\em groups}, which can
be nested, with data sets stored at the leaves of the group structure. HDF5's hierarchical files map well onto Legion's
regions and subregions.
Legion provides support for reading and writing HDF5 files using logical regions. We have already seen all of the mechanisms in
Section~\ref{sec:simultaneous}: HDF5 files are treated as external resources that are attached to a region with simultaneous coherence. After acquiring
the region (to release the copy restriction) copies can be made of the data; when the coherence is released any updates are
flushed back to the original region instance.
A simple example of creating and writing checkpoints using an HDF5 file is {\tt legion/examples/attach\_file}. Most
of the calls that manipulate HDF5 files in this program are actually direct calls to the HDF5 library, such as calls to
{\tt H5Fcreate}, {\tt H5Gcreate2}, {\tt H5Gclose}, {\tt H5Sclose}, and {\tt H5Fclose} in {\tt generate\_hdf\_file}. See the
HDF5 documentation for the semantics of these calls.
The place where the API intersects with HDF5 is in the use of an attach launcher to bind a region to an HDF5 file.
The relevant excerpt from this example:
\begin{lstlisting}
#ifdef LEGION_USE_HDF5
if(*hdf5_file_name) {
// create the HDF5 file first - attach wants it to already exist
bool ok = generate_hdf_file(hdf5_file_name, hdf5_dataset_name, num_elements);
assert(ok);
std::map<FieldID,const char*> field_map;
field_map[FID_CP] = hdf5_dataset_name;
printf("Checkpointing data to HDF5 file '%s' (dataset='%s')\n",
hdf5_file_name, hdf5_dataset_name);
AttachLauncher al(LEGION_EXTERNAL_HDF5_FILE, cp_lr, cp_lr);
al.attach_hdf5(hdf5_file_name, field_map, LEGION_FILE_READ_WRITE);
cp_pr = runtime->attach_external_resource(ctx, al);
} else
#endif
\end{lstlisting}
The only differences in this code from the discussion of attach
launchers in Section~\ref{sec:simultaneous} are
the constant {\tt LEGION\_EXTERNAL\_HDF5\_FILE} for the external resource argument of the constructor on line 10 and the use of the
{\tt attach\_hdf5} method with the access mode {\tt LEGION\_FILE\_READ\_WRITE} on line 11.
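As in Section~\ref{sec:simultaneous}, the attached region carries a copy restriction until it is acquired, and updates are flushed back when the region is released and detached. A minimal sketch of that part of the lifecycle, reusing the names from the excerpt above but not taken verbatim from the example:
\begin{lstlisting}
// Acquire the attached region to lift the copy restriction so that
// downstream tasks may use their own copies of the data.
AcquireLauncher acquire(cp_lr, cp_lr, cp_pr);
acquire.add_field(FID_CP);
runtime->issue_acquire(ctx, acquire);
// ... launch tasks that read/write the checkpoint region ...
// Release the region; updates propagate back toward the attached file.
ReleaseLauncher release(cp_lr, cp_lr, cp_pr);
release.add_field(FID_CP);
runtime->issue_release(ctx, release);
// Detach the HDF5 file; the returned future completes once the flush
// has finished.
Future f = runtime->detach_external_resource(ctx, cp_pr);
f.get_void_result();
\end{lstlisting}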
HDF5 is not included in the Legion build by default. Set {\tt USE\_HDF5=1} to build with HDF5 support.
The variable {\tt HDF\_ROOT} can be set to the root directory of the HDF5 installation if needed.
\section{Kokkos}
\label{sec:kokkos}
Legion supports running Kokkos tasks as part of a Legion execution. An example program is in the directory
{\tt legion/examples/kokkos\_saxpy}. From the Legion point of view, Kokkos is most useful as a processor-agnostic
portability layer for kernels, allowing the same kernel to target both GPUs and CPUs, for example.
In Kokkos, the parameterization over the processor kind is handled by C++ templates over {\em execution spaces}.
Instantiating these templates with different execution spaces allows Kokkos to generate code for GPUs, CPUs, and OpenMP.
The need to parameterize Kokkos tasks on the execution space, combined with some limitations of C++ templates,
causes the implementation of Kokkos tasks in Legion to look quite different from other task implementations
we have seen. For example, the initialization task in {\tt kokkos\_saxpy} is:
\begin{lstlisting}
template <typename execution_space>
class InitTask {
public:
static void task_body(const Task *task,
const std::vector<PhysicalRegion> &regions,
Context ctx, Runtime *runtime)
{
printf("kokkos(%s) init task on processor " IDFMT ", kind %d\n",
typeid(execution_space).name(),
runtime->get_executing_processor(ctx).id,
runtime->get_executing_processor(ctx).kind());
const float offset = *reinterpret_cast<const float *>(task->args);
Rect<1> subspace = runtime->get_index_space_domain(ctx,
task->regions[0].region.get_index_space());
AccessorRW acc(regions[0], task->regions[0].instance_fields[0]);
// you can use relative indexing for your own kernels too - just make
// sure you do the right thing when operating on a partitioned
// subregion!
Kokkos::View<float *,
Kokkos::LayoutStride,
typename execution_space::memory_space> view = acc.accessor;
size_t n_elements = subspace.hi.x - subspace.lo.x + 1;
Kokkos::RangePolicy<execution_space> range(runtime->get_executing_processor(ctx).kokkos_work_space(),
0, n_elements);
Kokkos::parallel_for(range,
KOKKOS_LAMBDA (int i) {
// using a relative address, but value to store
// is based on global index
// have to use a relative address!
view(i) = (i + subspace.lo.x) + offset;
});
}
};
\end{lstlisting}
A Kokkos task is implemented as a C++ class with a method {\tt task\_body} that has the code for the task (and has the standard Legion signature for a task). The class
definition is wrapped in a template with an execution space parameter.
Another point where Legion and Kokkos interact is in the mapping of Legion's region accessors into Kokkos views---for code generation in Kokkos to work well, it is important
that Kokkos views be used to access data, though these views can often be obtained by casting a Legion accessor to a suitable Kokkos type (e.g., the {\tt Kokkos::View} initialized from {\tt acc.accessor} in the listing above).
Finally, because Kokkos tasks are templated classes, registering a Kokkos task requires additional logic to instantiate the template with the kind of processor to be used and then register the
{\tt task\_body} function of the resulting class. In {\tt kokkos\_saxpy}, this logic is encapsulated in the function {\tt preregister\_kokkos\_task}:
\begin{lstlisting}
template <template<typename> class PORTABLE_KOKKOS_TASK>
void preregister_kokkos_task(TaskID task_id, const char *name)
{
// register a serial version on the CPU
{
TaskVariantRegistrar registrar(task_id, name);
registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
Runtime::preregister_task_variant<
PORTABLE_KOKKOS_TASK<Kokkos::Serial>::task_body >(registrar, name);
}
...
}
...
preregister_kokkos_task<InitTask>(INIT_TASK_ID, "init");
\end{lstlisting}
Note that {\tt preregister\_kokkos\_task} is templated on the class representing the task to be registered.
Kokkos support is not included in the Legion build by default; enable it with {\tt USE\_KOKKOS=1}.
The compile-time flags {\tt KOKKOS\_ENABLE\_CUDA}, {\tt KOKKOS\_ENABLE\_OPENMP}, and {\tt KOKKOS\_ENABLE\_SERIAL} enable
generation of kernels for GPUs, OpenMP, and CPUs, respectively.
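For example, when {\tt KOKKOS\_ENABLE\_OPENMP} is defined, an OpenMP variant can be registered inside {\tt preregister\_kokkos\_task} following the same pattern as the serial variant above; the following is a sketch rather than the exact code from the example:
\begin{lstlisting}
#ifdef KOKKOS_ENABLE_OPENMP
  {
    // Sketch: an OpenMP variant constrained to OMP processors, using
    // the Kokkos::OpenMP execution space.
    TaskVariantRegistrar registrar(task_id, name);
    registrar.add_constraint(ProcessorConstraint(Processor::OMP_PROC));
    Runtime::preregister_task_variant<
      PORTABLE_KOKKOS_TASK<Kokkos::OpenMP>::task_body >(registrar, name);
  }
#endif
\end{lstlisting}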
\section{Python}
\label{sec:python}
Legion provides extensive interoperation support with Python through Pygion, which implements the Legion programming model
in Python. Pygion is essentially untyped Regent with Python syntax. All of the Legion programming features are available
through Pygion.
One of the additional features that Pygion provides is the ability to define a Legion task in Python. This task, when run,
will execute in a Python interpreter. Given the overheads of Python, tasks written in Python (as opposed to tasks simply
called from Python) will generally be slow compared to the same task written in C++ or CUDA, but for small amounts of compute
the convenience of having all of the Python facilities available outweighs what should be a negligible performance penalty.
Python tasks run on dedicated Python processors, for which Legion reserves a core to run
the Python interpreter.
There are numerous examples of Pygion code illustrating the use of all of the Legion features in
{\tt legion/bindings/python/examples}. As an example, a simple ``hello, world'' program is in
{\tt legion/bindings/python/examples/hello.py}:
\begin{lstlisting}
from pygion import task
@task
def main():
print("Hello, Legion!")
if __name__ == '__main__':
main()
\end{lstlisting}
More information on Pygion can be found in \cite{pygion19}.