﻿<!DOCTYPE html>

<html>
	<head>
	    <meta charset="utf-8">
		<link rel="stylesheet" href="../common-revealjs/css/reveal.css">
		<link rel="stylesheet" href="../common-revealjs/css/theme/white.css">
		<link rel="stylesheet" href="../common-revealjs/css/custom.css">
		<script>
			// This is needed when printing the slides to pdf
			var link = document.createElement( 'link' );
			link.rel = 'stylesheet';
			link.type = 'text/css';
			link.href = window.location.search.match( /print-pdf/gi ) ? '../common-revealjs/css/print/pdf.css' : '../common-revealjs/css/print/paper.css';
			document.getElementsByTagName( 'head' )[0].appendChild( link );
		</script>
		<script>
		    // This is used to display the static images on each slide,
			// See global-images in this html file and custom.css
			(function() {
				if(window.addEventListener) {
					window.addEventListener('load', () => {
						let slides = document.getElementsByClassName("slide-background");

						if (slides.length === 0) {
							slides = document.getElementsByClassName("pdf-page")
						}

						// Insert global images on each slide
						for(let i = 0, max = slides.length; i < max; i++) {
							let cln = document.getElementById("global-images").cloneNode(true);
							cln.removeAttribute("id");
							slides[i].appendChild(cln);
						}

						// Remove top level global images
						let elem = document.getElementById("global-images");
						elem.parentElement.removeChild(elem);
					}, false);
				}
			})();
		</script>
		
	</head>
	<body>
		<div class="reveal">
			<div class="slides">
				<div id="global-images" class="global-images">
					<img src="../common-revealjs/images/sycl_academy.png" />
					<img src="../common-revealjs/images/sycl_logo.png" />
					<div class="trademarks">SYCL and the SYCL logo are trademarks of the Khronos Group Inc.</div>
				</div>
				<!--Slide 1-->
				<section class="hbox" data-markdown>
					## Data Parallelism
				</section>
				<!--Slide 2-->
				<section class="hbox" data-markdown>
					## Learning Objectives
					* Learn about task parallelism and data parallelism
					* Learn about the SPMD model for describing data parallelism
					* Learn about SYCL execution and memory models
					* Learn about enqueuing kernel functions with `parallel_for`
				</section>
				<!--Slide 3-->
				<section>
					<div class="hbox" data-markdown>
						#### Task vs data parallelism
					</div>
					<div class="container" data-markdown>
						![Task vs Data](../common-revealjs/images/task_parallelism_data_parallelism.png "Task parallelism vs data parallelism")
					</div>
					<div class="container" data-markdown>
						* **Task parallelism** is where you have several,
						possibly distinct tasks executing in parallel.
						  * In task parallelism you optimize for latency.
						* **Data parallelism** is where you have the same
						task being performed on multiple elements of data.
						  * In data parallelism you optimize for throughput.
					</div>
				</section>
				<!--Slide 4-->
				<section>
					<div class="hbox" data-markdown>
						#### Vector processors
					</div>
					<div class="container" data-markdown>
						* Many processors are vector processors, which means
						they can naturally perform data parallelism.
							* GPUs are designed to be parallel.
							* CPUs have SIMD instructions, which apply the
							same operation to a number of elements of data.
					</div>
				</section>
				<!--Slide 5-->
				<section>
					<div class="hbox" data-markdown>
						#### SPMD model for describing data parallelism
					</div>
					<div class="container">
						<div class="col">
							Sequential CPU code
							<code><pre>
void calc(const int in[], int out[]) {
  // all iterations are run in the same
  // thread in a loop
  for (int i = 0; i < 1024; i++){
    out[i] = in[i] * in[i];
  }
}

// calc is invoked just once and all
// iterations are performed inline
calc(in, out);
							</code></pre>
						</div>
						<div class="col">
							Parallel SPMD code
							<code><pre>
void calc(const int in[], int out[], int id) {
  // function is described in terms of
  // a single iteration
  out[id] = in[id] * in[id];
}

// parallel_for invokes calc multiple
// times in parallel
parallel_for(calc, in, out, 1024);


							</code></pre>
						</div>
					</div>
				</section>
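				<section>
					<div class="hbox" data-markdown>
						#### SPMD model: a plain C++ sketch
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
The `parallel_for` call in the SPMD column is pseudocode. As a minimal sketch of what it stands for, the plain C++ below (no SYCL involved) uses a hypothetical helper, `parallel_for_sketch`, that invokes the kernel once per index. It runs the invocations one after another; a real runtime would be free to run them concurrently and in any order. The names `parallel_for_sketch` and `square_all` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel written in SPMD style: it describes the work for one index only.
void calc(const int in[], int out[], std::size_t id) {
  out[id] = in[id] * in[id];
}

// Hypothetical stand-in for a parallel_for: calls the kernel once per index.
// The calls happen sequentially here; because no invocation depends on
// another, a parallel runtime could execute them in any order.
template <typename Kernel>
void parallel_for_sketch(Kernel kernel, const int in[], int out[],
                         std::size_t count) {
  for (std::size_t id = 0; id < count; ++id) {
    kernel(in, out, id);
  }
}

// Illustration only: squares every element of `in` into `out`.
void square_all(const std::vector<int>& in, std::vector<int>& out) {
  assert(in.size() == out.size());  // one output slot per input element
  parallel_for_sketch(calc, in.data(), out.data(), in.size());
}
```

Calling `square_all` on `{1, 2, 3, 4}` fills the output with `{1, 4, 9, 16}`, one kernel invocation per element.
						</script>
					</div>
				</section>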
				<!--Slide 6-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col-left" data-markdown>
							* In SYCL, kernel functions are executed by
							**work-items**.
							* You can think of a work-item as a thread of 
							execution.
							* Each work-item will execute a SYCL kernel function from start to end.
							* A work-item can run on CPU threads, SIMD lanes,
							GPU threads, or any other kind of processing
							element.
						</div>
						<div class="col-right" data-markdown>
							![Work-Item](../common-revealjs/images/workitem.png "Work-Item")
						</div>
					</div>
				</section>
				<!--Slide 5-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* SYCL kernel functions are invoked within an **nd-range**
							* An nd-range is divided into a number of work-groups, each of which contains a number of work-items
							* Within an nd-range, every work-group has the same number of work-items
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/ndrange.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 6-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* The nd-range describes an **iteration space**: how it is composed in terms of work-groups and work-items
							* An nd-range can be 1, 2 or 3 dimensions
							* An nd-range has two components
							  * The **global-range** describes the total number of work-items in each dimension
							  * The **local-range** describes the number of work-items in a work-group in each dimension
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/ndrange-example.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 7-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* Each invocation in the iteration space of an nd-range is a work-item
							* Each invocation knows which work-item it is on and can query certain information about its position in the nd-range
							* Each work-item has the following, shown here with example values:
							  * **Global range**: {12, 12}
							  * **Global id**: {5, 6}
							  * **Group range**: {3, 3}
							  * **Group id**: {1, 1}
							  * **Local range**: {4, 4}
							  * **Local id**: {1, 2}
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/ndrange-example-work-item.png "ND-Range")
						</div>
					</div>
				</section>
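				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model: relating the ids
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
The example values on the previous slide are related by simple integer arithmetic. The plain C++ sketch below (not SYCL API; `Index2`, `WorkItemPosition` and `locate` are hypothetical names) derives the group range, group id and local id from a global id, assuming the local range divides the global range evenly and no offset is used.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Index2 = std::array<std::size_t, 2>;

// Position of one work-item, mirroring the values listed on the slide.
struct WorkItemPosition {
  Index2 group_range;  // number of work-groups in each dimension
  Index2 group_id;     // which work-group the work-item belongs to
  Index2 local_id;     // position of the work-item inside its work-group
};

// Derives group range, group id and local id from the global id,
// assuming the local range divides the global range in every dimension.
WorkItemPosition locate(Index2 global_range, Index2 local_range,
                        Index2 global_id) {
  WorkItemPosition pos{};
  for (std::size_t d = 0; d < 2; ++d) {
    assert(local_range[d] > 0 && global_range[d] % local_range[d] == 0);
    assert(global_id[d] < global_range[d]);
    pos.group_range[d] = global_range[d] / local_range[d];
    pos.group_id[d] = global_id[d] / local_range[d];
    pos.local_id[d] = global_id[d] % local_range[d];
  }
  return pos;
}
```

With a global range of {12, 12}, a local range of {4, 4} and a global id of {5, 6}, `locate` yields a group range of {3, 3}, a group id of {1, 1} and a local id of {1, 2}, matching the values on the slide.
						</script>
					</div>
				</section>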
				<!--Slide 8-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							Typically, an nd-range invocation will execute the SYCL kernel function on a very large number of work-items, often in the thousands
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/ndrange-invocation.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 9-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* Multiple work-items will generally execute concurrently
							* On vector hardware this is often done in lock-step, which means the work-items execute the same hardware instructions at the same time
							* The number of work-items that will execute concurrently can vary from one device to another
							* Work-items will be batched along with other work-items in the same work-group
							* The order work-items and work-groups are executed in is implementation defined
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/ndrange-lock-step.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 10-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* Work-items in a work-group can be synchronized using a work-group barrier
							  * All work-items within a work-group must reach the barrier before any can continue on
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/work-group-0.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 12-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* SYCL does not support synchronizing across all work-items in the nd-range
							* The only way to do this is to split the computation into separate SYCL kernel functions
						</div>
						<div class="col" data-markdown>
							![ND-Range](../common-revealjs/images/work-group-0-1.png "ND-Range")
						</div>
					</div>
				</section>
				<!--Slide 7-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL execution model
					</div>
					<div style="display: grid; grid-template-columns: 5fr 2fr;">
						<div class="container" data-markdown>
							* SYCL also provides a simplified execution model with `sycl::range` in place of `sycl::nd_range`
							* Caller only provides the global range
							* Local range is decided by the runtime and cannot be inspected
							* No synchronization is possible between work-items
							* Useful for simple problems which don't require synchronization, local memory, or fine-tuned performance
							   * Runtime may not always have enough information to choose the best-performing size
						</div>
						<div style="text-align: right;">
							<img src="../common-revealjs/images/ndrange.png" alt="ND-Range" style="width:90%" /><br />
							<img src="../common-revealjs/images/SYCL_range.png" alt="SYCL-Range" style="width:90%" />
						</div>
					</div>
				</section>

				<!--Slide 14-->
				<section>
					<div class="hbox" data-markdown>
						#### parallel_for
					</div>
					<div class="container">
						<div class="col">
							<code><pre>
cgh.<mark>parallel_for</mark>&lt;my_kernel&gt;(<mark>nd_range{{1024, 16}, {32, 4}}</mark>,
                          [=](<mark>nd_item&lt;2&gt; item</mark>){
  // SYCL kernel function is executed 
  // on a range of work-items
});
							</code></pre>
						</div>
					</div>
					<div class="container" data-markdown>
					  * In SYCL, kernel functions can be enqueued to execute
					  over a range of work-items using `parallel_for`
					  * The first argument to `parallel_for` is an `nd_range` or
					  a `range` which describes the iteration space over which
					  the kernel is to be executed
					  * The kernel function has to take an `nd_item` or `item`,
					  respectively, as the parameter (or any type they can be
					  implicitly converted to, commonly from `item` to `id`)
					</div>
				</section>
				<!--Slide 16-->
				<section>
					<div class="hbox" data-markdown>
						#### Expressing parallelism
					</div>
					<div class="container">
						<div style="font-size: 90%; display: grid; grid-template-columns: 45% 55%; grid-template-rows: 1fr 1fr 1fr;">
							<div style="margin: auto 0; vertical-align: middle;">
								<code><pre>
cgh.parallel_for&lt;kernel&gt;(<mark>nd_range&lt;1&gt;{1024,32}</mark>,
  [=](<mark>nd_item&lt;1&gt; ndItem</mark>){
    /* kernel function code */
    id globalId = ndItem.get_global_id();
    id localId = ndItem.get_local_id();
});
								</code></pre>
							</div>
							<div style="margin: auto 0; vertical-align: middle;" data-markdown>
								* Overload taking an `nd_range` object specifies the global and local range
								* An `nd_item` parameter represents the global and local range and index
							</div>
							<div style="margin: auto 0; vertical-align: middle;">
								<code><pre>
cgh.parallel_for&lt;kernel&gt;(<mark>range&lt;1&gt;{1024}</mark>,
  [=](<mark>item&lt;1&gt; item</mark>){
    /* kernel function code */
    id globalId = item.get_id();
});
								</code></pre>
							</div>
							<div style="margin: auto 0; vertical-align: middle;" data-markdown>
								* Overload taking a `range` object specifies the global range, runtime decides local range
								* An `item` parameter represents the global range and the index within the global range
							</div>
							<div style="margin: auto 0; vertical-align: middle;">
								<code><pre>
cgh.parallel_for&lt;kernel&gt;(<mark>range&lt;1&gt;{1024}</mark>,
  [=](<mark>id&lt;1&gt; globalId</mark>){
    /* kernel function code */
});
								</code></pre>
							</div>
							<div style="margin: auto 0; vertical-align: middle;" data-markdown>
								* Overload taking a `range` object specifies the global range, runtime decides local range
								* An `id` parameter represents the index within the global range
							</div>
						</div>
					</div>
				</section>
				<!--Slide 14-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL memory model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* Each work-item can access a dedicated region of **private memory**
							* A work-item cannot access the private memory of another work-item
						</div>
						<div class="col" data-markdown>
							![Private Memory](../common-revealjs/images/workitem-privatememory.png "Private Memory")
						</div>

					</div>
				</section>
				<!--Slide 15-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL memory model
					</div>
					<div class="container">
						<div class="col-left-3" data-markdown>
							![Local Memory](../common-revealjs/images/workitem-localmemory.png "Local Memory")
						</div>
						<div class="col-right-2" data-markdown>
							* Each work-item can access a region of **local memory** that is shared by all work-items in its work-group
							* A work-item cannot access the local memory of another work-group
						</div>
					</div>
				</section>
				<!--Slide 16-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL memory model
					</div>
					<div class="container">
						<div class="col-left-3" data-markdown>
							![Global Memory](../common-revealjs/images/workitem-constantmemory.png "Global Memory")
						</div>
						<div class="col-right-2" data-markdown>
							* Each work-item can access a single region of **global memory** that's accessible to all work-items in an nd-range
						</div>

					</div>
				</section>
				<!--Slide 17-->
				<section>
					<div class="hbox" data-markdown>
						#### SYCL memory model
					</div>
					<div class="container">
						<div class="col" data-markdown>
							* Each memory region has a different size and access latency
							* Global memory is larger than local memory and local memory is larger than private memory
							* Private memory is faster than local memory and local memory is faster than global memory
						</div>
						<div class="col" data-markdown>
							![Memory Regions](../common-revealjs/images/memory-regions.png "Memory Regions")
						</div>
					</div>
				</section>
				<!--Slide 22-->
				<section>
					<div class="hbox" data-markdown>
						#### Accessing Data With Accessors
					</div>
					<div class="container" data-markdown>
					* There are a few different ways to access the data represented by an accessor
					  * The subscript operator can take an **id**
					    * Must be the same dimensionality as the accessor
					    * For dimensions > 1, the linear address is calculated in row-major order
					  * Nested subscript operators can be called for each dimension, each taking a **size_t**
					    * E.g. a 3-dimensional accessor: acc[x][y][z] = …
					  * A pointer to memory can be retrieved by calling **get_pointer**
					    * This returns a raw pointer to the data
					</div>
				</section>
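				<section>
					<div class="hbox" data-markdown>
						#### Accessing Data With Accessors: row-major order
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
To make the row-major rule concrete, here is a minimal plain C++ sketch (not SYCL API; `linearize` is a hypothetical helper) of how a 3-dimensional index could be mapped to a linear address. It assumes every index component lies inside the corresponding range.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Row-major linearization of a 3-dimensional index: the last dimension
// varies fastest, so neighbouring z values are adjacent in memory.
std::size_t linearize(const std::array<std::size_t, 3>& id,
                      const std::array<std::size_t, 3>& range) {
  for (std::size_t d = 0; d < 3; ++d) {
    assert(id[d] < range[d]);  // the index must lie inside the range
  }
  return (id[0] * range[1] + id[1]) * range[2] + id[2];
}
```

For instance, index {1, 2, 3} in a range of {4, 5, 6} maps to (1 * 5 + 2) * 6 + 3 = 45.
						</script>
					</div>
				</section>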
				<!--Slide 23-->
				<section>
					<div class="hbox" data-markdown>
						#### Accessing Data With Accessors
					</div>
					<div class="container">
						<div class="col-left-3">
							<code><pre>
buffer&lt;float, 1&gt; bufA(dA.data(), range&lt;1&gt;(dA.size()));
buffer&lt;float, 1&gt; bufB(dB.data(), range&lt;1&gt;(dB.size()));
buffer&lt;float, 1&gt; bufO(dO.data(), range&lt;1&gt;(dO.size()));

gpuQueue.submit([&](handler &cgh){
  sycl::accessor inA{bufA, cgh, sycl::read_only};
  sycl::accessor inB{bufB, cgh, sycl::read_only};
  sycl::accessor out{bufO, cgh, sycl::write_only};
  cgh.parallel_for&lt;add&gt;(range&lt;1&gt;(dA.size()),
    [=](id&lt;1&gt; i){
    <mark>out[i] = inA[i] + inB[i];</mark>
  });
});
							</code></pre>
						</div>
						<div class="col-right-2" data-markdown>
							* Here we access the data of the `accessor` by
							passing in the `id` passed to the SYCL kernel
							function.
						</div>
					</div>
				</section>
				<!--Slide 24-->
				<section>
					<div class="hbox" data-markdown>
						#### Accessing Data With Accessors
					</div>
					<div class="container">
						<div class="col-left-3">
							<code><pre>
buffer&lt;float, 1&gt; bufA(dA.data(), range&lt;1&gt;(dA.size()));
buffer&lt;float, 1&gt; bufB(dB.data(), range&lt;1&gt;(dB.size()));
buffer&lt;float, 1&gt; bufO(dO.data(), range&lt;1&gt;(dO.size()));

gpuQueue.submit([&](handler &cgh){
  sycl::accessor inA{bufA, cgh, sycl::read_only};
  sycl::accessor inB{bufB, cgh, sycl::read_only};
  sycl::accessor out{bufO, cgh, sycl::write_only};
  cgh.parallel_for&lt;add&gt;(rng, [=](item&lt;3&gt; i){
    <mark>auto ptrA = inA.get_pointer();</mark>
    <mark>auto ptrB = inB.get_pointer();</mark>
    <mark>auto ptrO = out.get_pointer();</mark>
    <mark>auto linearId = i.get_linear_id();</mark>

    <mark>ptrO[linearId] = ptrA[linearId] + ptrB[linearId];</mark>
  });
});
							</code></pre>
						</div>
						<div class="col-right-2" data-markdown>
							* Here we retrieve the underlying pointer for each
							of the `accessor`s.
							* We then access the pointer using the linearized
							`id` by calling the `get_linear_id` member function
							on the `item`.
							* Again this linearization is calculated in
							row-major order.
						</div>
					</div>
				</section>
				<!--Slide 17-->
				<section class="hbox" data-markdown>
					## Questions
				</section>
				<!--Slide 18-->
				<section>
					<div class="hbox" data-markdown>
						#### Exercise
					</div>
					<div class="container" data-markdown>
						Code_Exercises/Data_Parallelism/source.cpp
					</div>
					<div class="container" data-markdown>
						Implement a SYCL application using `parallel_for` to add two arrays of values
					</div>
					<div class="container" data-markdown>
						* Use buffers and accessors to manage data
						* Try the `sycl::range` and `sycl::nd_range` variants
					</div>
				</section>
			</div>
		</div>
		<script src="../common-revealjs/js/reveal.js"></script>
		<script src="../common-revealjs/plugin/markdown/marked.js"></script>
		<script src="../common-revealjs/plugin/markdown/markdown.js"></script>
		<script src="../common-revealjs/plugin/notes/notes.js"></script>
		<script>
	Reveal.initialize();
	Reveal.configure({ slideNumber: true });
		</script>
	</body>
</html>