<!DOCTYPE html>

<html>
	<head>
	    <meta charset="utf-8">
		<link rel="stylesheet" href="../common-revealjs/css/reveal.css">
		<link rel="stylesheet" href="../common-revealjs/css/theme/white.css">
		<link rel="stylesheet" href="../common-revealjs/css/custom.css">
		<script>
			// This is needed when printing the slides to pdf
			var link = document.createElement( 'link' );
			link.rel = 'stylesheet';
			link.type = 'text/css';
			link.href = window.location.search.match( /print-pdf/gi ) ? '../common-revealjs/css/print/pdf.css' : '../common-revealjs/css/print/paper.css';
			document.getElementsByTagName( 'head' )[0].appendChild( link );
		</script>
		<script>
		    // This is used to display the static images on each slide,
			// See global-images in this html file and custom.css
			(function() {
				if(window.addEventListener) {
					window.addEventListener('load', () => {
						let slides = document.getElementsByClassName("slide-background");

						if (slides.length === 0) {
							slides = document.getElementsByClassName("pdf-page")
						}

						// Insert global images on each slide
						for(let i = 0, max = slides.length; i < max; i++) {
							let cln = document.getElementById("global-images").cloneNode(true);
							cln.removeAttribute("id");
							slides[i].appendChild(cln);
						}

						// Remove top level global images
						let elem = document.getElementById("global-images");
						elem.parentElement.removeChild(elem);
					}, false);
				}
			})();
		</script>
		
	</head>
	<body>
		<div class="reveal">
			<div class="slides">
				<div id="global-images" class="global-images">
					<img src="../common-revealjs/images/sycl_academy.png" />
					<img src="../common-revealjs/images/sycl_logo.png" />
					<div class="trademarks">SYCL and the SYCL logo are trademarks of the Khronos Group Inc.</div>
				</div>
				<!--Slide 1-->
				<section class="hbox">
					<div class="hbox" data-markdown>
						## Local memory
					</div>
				</section>
				<!--Slide 2-->
				<section class="hbox" data-markdown>
					## Learning Objectives
					* Learn about tiling using local memory
					* Learn how to synchronize work-items within a work-group
                </section>
				<!--Slide 3-->
				<section>
					<div class="hbox" data-markdown>
						#### Cost of accessing global memory
					</div>
					<div class="container" data-markdown>
						* As we covered earlier, global memory is very expensive to access.
						* Even with coalesced access, reading the same elements from global memory multiple times is expensive.
						* Instead, you want to cache those values in lower-latency memory.
					</div>
				</section>
				<!--Slide 4-->
				<section>
					<div class="hbox" data-markdown>
						#### Cost of global memory access in image convolutions
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/cost_of_global_memory_access.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* Consider the image convolution example.
							* For each output pixel we read up to NxM pixels from the input image, where N and M are the dimensions of the filter.
							* This means each input pixel is read up to NxM times:
								* 3x3 filter: up to 9 reads.
								* 5x5 filter: up to 25 reads.
								* 7x7 filter: up to 49 reads.
							* If each of these reads is a separate load from global memory, this becomes very expensive.
						</div>
					</div>
				</section>
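				<section>
					<div class="hbox" data-markdown>
						#### Counting reads per input pixel
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
To make the read counts for the convolution example concrete, here is a minimal host-side C++ sketch (no SYCL required) that tallies how often a naive convolution reads each input pixel. The name `count_reads` and the choice to skip filter taps that fall outside the image are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for a naive convolution with an odd
// filterSize x filterSize filter, returns how many times each
// input pixel is read (row-major, one counter per pixel).
// Assumption: taps that fall outside the image are skipped.
std::vector<int> count_reads(int width, int height, int filterSize) {
  assert(filterSize % 2 == 1);
  const int halo = filterSize / 2;
  std::vector<int> reads(static_cast<std::size_t>(width) * height, 0);
  for (int y = 0; y < height; ++y) {            // one pass per output pixel
    for (int x = 0; x < width; ++x) {
      for (int dy = -halo; dy <= halo; ++dy) {  // one pass per filter tap
        for (int dx = -halo; dx <= halo; ++dx) {
          const int sx = x + dx, sy = y + dy;
          if (sx < 0 || sx >= width || sy < 0 || sy >= height) continue;
          ++reads[static_cast<std::size_t>(sy) * width + sx];
        }
      }
    }
  }
  return reads;
}
```

For a 5x5 image and a 3x3 filter, the centre pixel is read 9 times while a corner pixel is read only 4 times, which is why the counts are upper bounds.
						</script>
					</div>
				</section>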
				<!--Slide 5-->
				<section>
					<div class="hbox" data-markdown>
						#### Using local memory
					</div>
					<div class="container" data-markdown>
						![SYCL](../common-revealjs/images/local_memory.png "SYCL")
					</div>
					<div class="container" data-markdown>
						* The solution is local memory.
						* Local memory is generally on-chip and managed manually rather than through a cache, so it has much lower latency than global memory.
						* Local memory is a smaller, dedicated region of memory per work-group.
						* Local memory can be used as a manually managed cache, often referred to as a scratchpad: read each value from global memory just once, then read it from local memory instead.
					</div>
				</section>
				<!--Slide 6-->
				<section>
					<div class="hbox" data-markdown>
						#### Tiling
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/tiling.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* The iteration space of the kernel function is mapped across multiple work-groups.
							* Each work-group has its own region of local memory.
							* You want to split the input image data into tiles, one for each work-group.
						</div>
					</div>
				</section>
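				<section>
					<div class="hbox" data-markdown>
						#### Mapping work-items to tiles
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
The mapping from a work-item's global position to its tile can be sketched in plain C++ along one dimension. `TilePosition`, `to_tile` and `to_global` are hypothetical names; in a kernel the same values come from `get_group`, `get_local_id` and `get_global_id` on an `nd_item`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type for illustration only.
struct TilePosition {
  std::size_t group;  // which work-group, and therefore which tile
  std::size_t local;  // position of the work-item inside that tile
};

// Splits a global index along one dimension into tile and in-tile offset.
// Assumes the global range is a multiple of tileSize, as nd_range requires.
TilePosition to_tile(std::size_t globalId, std::size_t tileSize) {
  assert(tileSize > 0);
  return {globalId / tileSize, globalId % tileSize};
}

// Inverse mapping: first global index covered by the tile plus the offset.
std::size_t to_global(TilePosition pos, std::size_t tileSize) {
  return pos.group * tileSize + pos.local;
}
```

With a tile size of 16, global index 37 belongs to tile 2 at offset 5.
						</script>
					</div>
				</section>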
				<!--Slide 7-->
				<section>
					<div class="hbox" data-markdown>
						#### Local accessors
					</div>
					<div class="container">
						<code class="code-100pc"><pre>
auto scratchpad = sycl::accessor&lt;int, 1, sycl::access::mode::read_write,
                                 sycl::access::target::local&gt;(sycl::range{workGroupSize}, cgh);
						</pre></code>
					</div>
					<div class="container" data-markdown>
						* Local memory is allocated via an `accessor` with the `access::target::local` access target; SYCL 2020 also provides the equivalent `local_accessor`.
						* Unlike regular `accessor`s, they are not created with a `buffer`. Instead, they allocate memory per work-group for the duration of the kernel function.
						* The `range` provided is the number of elements of the specified type to allocate per work-group.
					</div>
				</section>
				<!--Slide 8-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container" data-markdown>
						* Local memory can be used to share partial results between work-items.
						* When doing so, it's important to synchronize between the writes to and reads from memory, to ensure all work-items have reached the same point in the program.
					</div>
				</section>
				<!--Slide 9-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_1.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* Remember that work-items are not guaranteed to all execute at the same time (in parallel).
						</div>
					</div>
				</section>
				<!--Slide 10-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_2.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* A work-item can share results with other work-items via local (or global) memory.
						</div>
					</div>
				</section>
				<!--Slide 11-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_3.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* This means it's possible for a work-item to read a result that hasn't been written yet.
							* This creates a data race.
						</div>
					</div>
				</section>
				<!--Slide 12-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_4.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* This problem can be solved with a synchronization primitive called a work-group barrier.
						</div>
					</div>
				</section>
				<!--Slide 13-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_5.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* When a work-group barrier is inserted, work-items will wait until all work-items in the work-group have reached that point.
						</div>
					</div>
				</section>
				<!--Slide 14-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_6.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* Only then can any work-items in the work-group continue execution.
						</div>
					</div>
				</section>
				<!--Slide 15-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_7.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* So now you can be sure that all of the results that you want to read have been written.
						</div>
					</div>
				</section>
				<!--Slide 16-->
				<section>
					<div class="hbox" data-markdown>
						#### Synchronization
					</div>
					<div class="container">
						<div class="col" data-markdown>
							![SYCL](../common-revealjs/images/barrier_8.png "SYCL")
						</div>
						<div class="col" data-markdown>
							* However, note that this guarantee does not apply across work-group boundaries.
							* So if a work-item in one work-group writes a value and a work-item in another work-group then reads it, you again have a data race.
							* Furthermore, remember that work-items can only access the local memory of their own work-group, not that of any other work-group.
						</div>
					</div>
				</section>
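				<section>
					<div class="hbox" data-markdown>
						#### Modelling the barrier
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
The effect of the barrier can be modelled on the host in plain C++. This is only a sketch: the function names are hypothetical, and the sequential loops replay one possible execution order, whereas a real data race has no defined outcome.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of one work-group: work-item i writes input[i] to the scratchpad,
// then reads its right-hand neighbour's slot (wrapping around).

// No barrier: work-item i runs to completion before work-item i + 1 starts,
// so it may read a slot that has not been written yet.
std::vector<int> neighbours_without_barrier(const std::vector<int>& input) {
  const std::size_t n = input.size();
  std::vector<int> scratch(n, -1), out(n);  // -1 marks "not written yet"
  for (std::size_t i = 0; i < n; ++i) {
    scratch[i] = input[i];
    out[i] = scratch[(i + 1) % n];
  }
  return out;
}

// With a barrier: every write finishes before any read begins.
std::vector<int> neighbours_with_barrier(const std::vector<int>& input) {
  const std::size_t n = input.size();
  std::vector<int> scratch(n, -1), out(n);
  for (std::size_t i = 0; i < n; ++i) scratch[i] = input[i];
  // <- the work-group barrier sits between the two loops
  for (std::size_t i = 0; i < n; ++i) out[i] = scratch[(i + 1) % n];
  return out;
}
```

For the input `{10, 20, 30, 40}` the barrier version yields `{20, 30, 40, 10}`, while this particular barrier-free schedule yields `{-1, -1, -1, 10}`.
						</script>
					</div>
				</section>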
				<!--Slide 17-->
				<section>
					<div class="hbox" data-markdown>
						#### group_barrier
					</div>
					<div class="container">
						<code class="code-100pc"><pre>
sycl::group_barrier(item.get_group());
						</pre></code>
					</div>
					<div class="container" data-markdown>
						* Work-group barriers can be invoked by calling `group_barrier` and passing a `group` object.
						* You can retrieve a `group` object representing the current work-group by calling `get_group` on an `nd_item`.
						* Note this requires the `nd_range` variant of `parallel_for`.
					</div>
				</section>
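				<section>
					<div class="hbox" data-markdown>
						#### Load, barrier, compute
					</div>
					<div class="container" data-markdown>
						<script type="text/template">
Putting tiling and the barrier together, the usual pattern is load, barrier, compute. Below is a host-side C++ sketch of that pattern for a simplified 1D, 3-wide box filter rather than a full 2D convolution. `FilterResult`, `tiled_box_filter` and the clamped border handling are assumptions for illustration, not the exercise solution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FilterResult {
  std::vector<int> out;
  std::size_t globalReads;  // how many times the input was read
};

FilterResult tiled_box_filter(const std::vector<int>& in, std::size_t tileSize) {
  assert(tileSize > 0 && !in.empty() && in.size() % tileSize == 0);
  const std::size_t n = in.size();
  FilterResult result{std::vector<int>(n), 0};
  for (std::size_t group = 0; group < n / tileSize; ++group) {  // work-groups
    const std::size_t first = group * tileSize;
    // One scratchpad per work-group: the tile plus one halo element per side.
    std::vector<int> scratch(tileSize + 2);
    for (std::size_t s = 0; s < scratch.size(); ++s) {
      // Slot s holds global element first + s - 1, clamped at the borders.
      std::size_t g = (first + s == 0) ? 0 : first + s - 1;
      if (g >= n) g = n - 1;
      scratch[s] = in[g];
      ++result.globalReads;
    }
    // <- group_barrier: all loads are complete before any work-item reads
    for (std::size_t local = 0; local < tileSize; ++local)      // work-items
      result.out[first + local] =
          scratch[local] + scratch[local + 1] + scratch[local + 2];
  }
  return result;
}
```

For 8 elements and a tile size of 4 this performs 12 global reads, against 24 if every work-item loaded its three inputs directly.
						</script>
					</div>
				</section>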
				<!--Slide 18-->
				<section>
					<div class="hbox" data-markdown>
						#### Local memory image convolution performance
					</div>
					<div class="container" data-markdown>
						![SYCL](../common-revealjs/images/image_convolution_performance_local_mem.png "SYCL")
					</div>
				</section>
				<!--Slide 19-->
				<section>
					<div class="hbox" data-markdown>
						## Questions
					</div>
				</section>
				<!--Slide 20-->
				<section>
					<div class="hbox" data-markdown>
						#### Exercise
					</div>
					<div class="container" data-markdown>
						Code_Exercises/Local_Memory_Tiling/source
					</div>
					<div class="container" data-markdown>
						Use local memory to cache a tile of the input image data per work-group.
					</div>
				</section>
			</div>
		</div>
		<script src="../common-revealjs/js/reveal.js"></script>
		<script src="../common-revealjs/plugin/markdown/marked.js"></script>
		<script src="../common-revealjs/plugin/markdown/markdown.js"></script>
		<script src="../common-revealjs/plugin/notes/notes.js"></script>
		<script>
			Reveal.initialize({mouseWheel: true, defaultNotes: true});
			Reveal.configure({ slideNumber: true });
		</script>
	</body>
</html>