Commit 0fc1631

[Videos] Use local thumbnails
1 parent e4d2dcf commit 0fc1631

20 files changed

+27
-27
lines changed

GetStarted.md

+1-1
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@ Others are optional depend on your platform of choice. So far we support native
1212

1313
Watch the warmup video:
1414

15-
[<img src="https://drive.google.com/uc?export=view&id=1AbuZJdfc-BbpNLdxZukMILs2l5_HBH32" width="30%">](https://www.youtube.com/watch?v=jFRwAcIoLgQ&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
15+
[<img src="img/WarmupLabAssignment.png">](https://www.youtube.com/watch?v=jFRwAcIoLgQ&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1616

1717
Every lab assignment has the following:
1818
* Video that introduces a particular transformation.

README.md

+1-1
@@ -6,7 +6,7 @@
66

77
This is an online course where you can learn to find and fix low-level performance issues, for example CPU cache misses and branch mispredictions. It's all about practice. So we offer you this course in a form of lab assignments and youtube videos. You will spend at least 90% of the time analyzing performance of the code and trying to improve it.
88

9-
[<img src="https://drive.google.com/uc?export=view&id=1pYZEkSV3fiLo04b0UdJzHoEhLkhc6T09" width="30%">](https://www.youtube.com/watch?v=2tzdkC6IDbo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
9+
[<img src="img/WelcomeVideo.png">](https://www.youtube.com/watch?v=2tzdkC6IDbo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1010

1111
Each lab assignment focuses on a specific performance problem and can take anywhere from 30 mins up to 4 hours depending on your background and the complexity of the lab assignment itself. Once you're done improving the code, you can submit your solution to Github for automated benchmarking and verification.
1212

labs/bad_speculation/README.md

+1-1
@@ -1,6 +1,6 @@
11
# Bad Speculation
22

3-
[<img src="https://drive.google.com/uc?export=view&id=1seCqxloOfy5kx5-Hw6j-y1JEjEv0QdjA" width="30%">](https://www.youtube.com/watch?v=B8AsUSN3Xa4&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../img/BadSpecIntro.png">](https://www.youtube.com/watch?v=B8AsUSN3Xa4&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
This is a collection of labs, which experience large amount of branch mispredictions. We will be covering branchless algorithms. Here are some of the topics we plan to cover:
66

labs/bad_speculation/lookup_tables_1/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1NdSmpytK0QOpkLcJ_dpl6KNnN7XMxmOf" width="30%">](https://www.youtube.com/watch?v=bhz4t5QYApE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/LookupTables1.png">](https://www.youtube.com/watch?v=bhz4t5QYApE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
Welcome to the next lab assignment, where we will fight branch mispredictions by replacing them with lookup tables. The code in this lab assignment maps values from `[0;99]` into buckets, which involves a lot of comparisons, and so, branches. To solve this assignment you need to figure out a way how to replace branches.
44
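
As a rough illustration of the lookup-table idea described in the `lookup_tables_1` README text above, here is a minimal C++ sketch. The bucket thresholds and the names `bucketBranchy`, `makeTable`, `kBucketTable`, and `bucketLookup` are assumptions made for this example and are not taken from the lab's code.

```cpp
#include <cassert>
#include <cstdint>

// Branchy reference version. The thresholds are hypothetical.
constexpr std::uint8_t bucketBranchy(unsigned v) {
  if (v < 13) return 0;
  if (v < 29) return 1;
  if (v < 41) return 2;
  if (v < 53) return 3;
  if (v < 71) return 4;
  if (v < 83) return 5;
  return 6;
}

// One precomputed answer per possible input in [0;99].
struct BucketTable {
  std::uint8_t data[100];
};

// Fill the table at compile time by evaluating the branchy version once per input.
constexpr BucketTable makeTable() {
  BucketTable t{};
  for (unsigned v = 0; v < 100; ++v) t.data[v] = bucketBranchy(v);
  return t;
}

constexpr BucketTable kBucketTable = makeTable();

// Branch-free mapping: a single indexed load replaces the chain of comparisons.
inline std::uint8_t bucketLookup(unsigned v) {
  assert(v < 100 && "input must be in [0;99]");
  return kBucketTable.data[v];
}
```

The table costs 100 bytes, so it fits comfortably in the L1 data cache. Whether it beats the branchy version depends on how predictable the input values are.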

labs/core_bound/README.md

+1-1
@@ -1,6 +1,6 @@
11
# Core Bound
22

3-
[<img src="https://drive.google.com/uc?export=view&id=1CHZW8uWu0rhWLLlrx8AnehA_0-QFqDRp" width="30%">](https://www.youtube.com/watch?v=CcGhMusQFXA&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../img/CoreBoundIntro.png">](https://www.youtube.com/watch?v=CcGhMusQFXA&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
This is a collection of labs with performance bound by core execution unit. Here are some of the topics we plan to cover:
66

@@ -1,9 +1,9 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1yXIIkV3Z_K4xUY5jweNy6EHawrR2s74E" width="30%">](https://www.youtube.com/watch?v=mlXw_qYRi78&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
1+
[<img src="../../../img/CompilerIntrinsics1-Intro.png">](https://www.youtube.com/watch?v=mlXw_qYRi78&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
22

33
This is a lab about using [compiler intrinsics](https://en.wikipedia.org/wiki/Intrinsic_function) to speed up parts of the code, where compilers fail to generate optimal code.
44

55
The kernel in this lab assignment is a part of the Average ImageSmoothing algorithm, which is reduced to 1 dimension and lacks division part. The algorithm uses sliding window approach to compute a sum in the subrange [-radius .. +radius]. It is a very fast approach compared to a classical Gaussian blur.
66

7-
[<img src="https://drive.google.com/uc?export=view&id=1EK-lw82Qc054uOMJeym5OB9myb3gR0wo" width="30%">](https://www.youtube.com/watch?v=fP6Rhwf3rEs&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
7+
[<img src="../../../img/CompilerIntrinsics1-Summary.png">](https://www.youtube.com/watch?v=fP6Rhwf3rEs&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
88

99
Author: @adamf88.
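
The sliding-window sum mentioned in the README text above can be sketched in scalar C++ as follows. This is only an illustration of the approach, not the lab's kernel: the function name `slidingWindowSum`, the element types, and the border handling (the window is clipped at the array edges) are all assumptions.

```cpp
#include <cstdint>
#include <vector>

// Sum of in[pos - radius .. pos + radius] for every pos, clipped to the array bounds.
// Each step adds the element entering the window and subtracts the one leaving it,
// so the cost per output is O(1) instead of O(radius).
std::vector<std::uint32_t> slidingWindowSum(const std::vector<std::uint8_t>& in,
                                            int radius) {
  const int n = static_cast<int>(in.size());
  std::vector<std::uint32_t> out(in.size());
  if (n == 0) return out;

  // Initial window for position 0 covers indices [0, radius].
  std::uint32_t sum = 0;
  for (int i = 0; i <= radius && i < n; ++i) sum += in[i];
  out[0] = sum;

  for (int pos = 1; pos < n; ++pos) {
    const int enter = pos + radius;      // index that joins the window
    const int leave = pos - radius - 1;  // index that drops out of the window
    if (enter < n) sum += in[enter];
    if (leave >= 0) sum -= in[leave];
    out[pos] = sum;
  }
  return out;
}
```

Each output depends on the previous running sum, which is a plausible reason a compiler would struggle to vectorize a loop of this shape on its own.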

labs/core_bound/compiler_intrinsics_2/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1ZQG3JW5sQK4bQdLxgblAIouRnXVbse7k" width="30%">](https://www.youtube.com/watch?v=0WUihFxjzSE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/CompIntrin2.png">](https://www.youtube.com/watch?v=0WUihFxjzSE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
This is a second lab about using [compiler intrinsics](https://en.wikipedia.org/wiki/Intrinsic_function) to speed up parts of the code, where compilers fail to generate optimal code.
44

labs/core_bound/dep_chains_1/README.md

+1-1
@@ -1,6 +1,6 @@
11
# Speed up data dependency chains #1.
22

3-
[<img src="https://drive.google.com/uc?export=view&id=1S26wtb3hwSPpz85wF0tctnff8G4TSqT_" width="30%">](https://www.youtube.com/watch?v=nXf6MxNlXdg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../../img/DepChains1.png">](https://www.youtube.com/watch?v=nXf6MxNlXdg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
Critical data dependency chains are increasingly becoming the [only thing that matters](https://easyperf.net/blog/2022/05/11/Visualizing-Performance-Critical-Dependency-Chains) for performance of a general-purpose application. That is why it is very important to identify those and know possible ways to make them run faster. On a SW level, you can sometimes occasionally introduce an artificial data dependency, which should not exist in the first place. Those cases are usually easy to find. In a contrast, some data dependency chains are inherent to a particular type of data structure.
66

@@ -1,9 +1,9 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1CdVTip7DNJKmvouo2pfmog0OWl0km9p_" width="30%">](https://www.youtube.com/watch?v=fp1_e3rjZQs&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
1+
[<img src="../../../img/FunctionInlining1Intro.png">](https://www.youtube.com/watch?v=fp1_e3rjZQs&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
22

33
This is a lab about [function inlining](https://en.wikipedia.org/wiki/Inline_expansion) to speed up sorting.
44

55
Function inlining is a transformation that replaces a call to a function `F` with the body for `F` specialized with the actual arguments of the call. Inlining is one of the most important compiler optimizations, not only because it eliminates the overhead of calling a function (prologue and epilogue), but also it enables other optimizations.
66

77
Whenever you find in a performance profile a function with hot prologue and epilogue, consider such function as one of the potential candidates for being inlined. In this lab assignment you will practice fixing such performance issues.
88

9-
[<img src="https://drive.google.com/uc?export=view&id=1lOajRZmKvGDuU2bEm24Iu76zirPEdw1a" width="30%">](https://www.youtube.com/watch?v=qlFUV0FjpPQ&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
9+
[<img src="../../../img/FunctionInlining1Summary.png">](https://www.youtube.com/watch?v=qlFUV0FjpPQ&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
+2-2
@@ -1,10 +1,10 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1RMJ4F9sqnetaB4qjkx8DiCuii5MxuRao" width="30%">](https://www.youtube.com/watch?v=osfIC5uO0G8&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
1+
[<img src="../../../img/Vectorization1-Intro.png">](https://www.youtube.com/watch?v=osfIC5uO0G8&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
22

33
[Sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment) is an important algorithm in many bioinformatics applications and pipelines. The goal of the alignment is to gain insights about their biological relation. In particular, one is interested how the sequences diverged from a common ancestor by evolutionary events like point mutations or insertions and deletions in the respective sequences.
44
This problem, however, has quadratic complexity and optimizing it can have a great benefit in many applications.
55
Since many bioinformatic problems start with the alignment of millions of short sequence pieces of length 150 to 300 symbols, we can gain great performance improvements by using SIMD vectors. In this lab you will learn how the algorithm can be improved by transforming the data layout and exposing SIMD computations.
66

7-
[<img src="https://drive.google.com/uc?export=view&id=1PPrjFf9yN6-DmwHSlaCVJoWOX5Orj--F" width="30%">](https://www.youtube.com/watch?v=OvM6eAh8wBc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
7+
[<img src="../../../img/Vectorization1-Summary.png">](https://www.youtube.com/watch?v=OvM6eAh8wBc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d&index=12)
88

99

1010
Author: @rrahn.

labs/core_bound/vectorization_2/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=16Zg-PInJ8bQilDY1zBXiAIBIL3C7iStZ" width="30%">](https://www.youtube.com/watch?v=m4SWal8EAgM&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/Vectorization2_button.png">](https://www.youtube.com/watch?v=m4SWal8EAgM&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
This is a second lab about [auto vectorization](https://llvm.org/docs/Vectorizers.html). The subject of this lab assignment is a part of a checksum algorithm from the 80s, which has risen from the popularity of the Internet and [accompanying needs to validate transmitted packets](https://www.alpharithms.com/internet-checksum-calculation-steps-044921/). Even the problem is old, similar issues may exist nowadays in production code.
44

labs/memory_bound/README.md

+1-1
@@ -1,6 +1,6 @@
11
# Memory Bound
22

3-
[<img src="https://drive.google.com/uc?export=view&id=14ZWNVXxqsV_uPBYuVxXJUmQK_mNq0N6W" width="30%">](https://www.youtube.com/watch?v=jxK6GAyp8XE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../img/MemoryBoundIntro.png">](https://www.youtube.com/watch?v=jxK6GAyp8XE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
This is a collection of labs with performance bound by memory accesses. Here are some of the topics we plan to cover:
66

labs/memory_bound/data_packing/README.md

+2-2
@@ -4,7 +4,7 @@
44

55
This is a lab about data packing.
66

7-
[<img src="https://drive.google.com/uc?export=view&id=16uvUgz327TXrysAf2HXYRe_KRBALHw2j" width="30%">](https://www.youtube.com/watch?v=-V-oIXrqA2s&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
7+
[<img src="../../../img/DataPacking1Intro.png">](https://www.youtube.com/watch?v=-V-oIXrqA2s&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
88

99
You can decrease the memory traffic of the application if you pack the data more efficiently.
1010
Some of the ways to do that include:
@@ -13,4 +13,4 @@ Some of the ways to do that include:
1313
* Use types that require less memory or less precision e.g. (int -> short, double -> float).
1414
* Use bitfields to pack the data even further.
1515

16-
[<img src="https://drive.google.com/uc?export=view&id=12iavTVH9WUbb9BguLBLKe0QqdiPBMBiG" width="30%">](https://www.youtube.com/watch?v=ta096PQ6gTg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
16+
[<img src="../../../img/DataPacking1Summary.png">](https://www.youtube.com/watch?v=ta096PQ6gTg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
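
To make the data packing bullets above concrete, here is a small C++ sketch. The structs `Unpacked` and `Packed`, their members, and the value ranges are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record before packing: wide types plus padding between members.
struct Unpacked {
  int percent;   // only ever holds 0..100
  double ratio;  // float precision is assumed to be enough here
  bool flag;
  int level;     // only ever holds 0..15
};

// Same information using a narrower floating-point type and bitfields.
struct Packed {
  float ratio;
  std::uint8_t percent : 7;  // 0..127 covers 0..100
  std::uint8_t flag : 1;
  std::uint8_t level : 4;    // 0..15
};

// Convert one record, checking that the values fit the narrower fields.
inline Packed pack(const Unpacked& u) {
  assert(u.percent >= 0 && u.percent <= 100);
  assert(u.level >= 0 && u.level <= 15);
  Packed p{};
  p.ratio = static_cast<float>(u.ratio);
  p.percent = static_cast<std::uint8_t>(u.percent);
  p.flag = u.flag;
  p.level = static_cast<std::uint8_t>(u.level);
  return p;
}
```

Bitfield layout is implementation-defined. On common 64-bit ABIs `Unpacked` typically takes 24 bytes and `Packed` 8, so three times as many records fit in each cache line. The trade-off is extra shift and mask instructions on every bitfield access.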

labs/memory_bound/false_sharing_1/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1xYhEv7la96grMjUSyFNryQPOEONUl4vV" width="30%">](https://www.youtube.com/watch?v=uRmQSHsZoxE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/FalseSharing1.png">](https://www.youtube.com/watch?v=uRmQSHsZoxE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33

44
This lab assignment focuses on improving performance by eliminating false sharing. In this lab, we
@@ -1,8 +1,8 @@
11
This is a lab about [loop interchange](https://en.wikipedia.org/wiki/Loop_interchange).
22

3-
[<img src="https://drive.google.com/uc?export=view&id=19g9RQifLdObp2mUHcaCHXwk6WCXmupZV" width="30%">](https://www.youtube.com/watch?v=TLDR_nO9XVc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../../img/LoopInterchange1Intro.png">](https://www.youtube.com/watch?v=TLDR_nO9XVc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
[Matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) is an important building block for many numerical algorithms. In this lab assignment, we compute the integer power of a given real square matrix.
66
The binary representation of the power significantly reduces the number of matrix operations. Still, the code has a major performance flaw. Your job is to find it out.
77

8-
[<img src="https://drive.google.com/uc?export=view&id=1cOvE8kIF1CVAA3CGQTPaXq-1MSIe3l9q" width="30%">](https://www.youtube.com/watch?v=G6BbPB37sYg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
8+
[<img src="../../../img/LoopInterchane1Summary.png">](https://www.youtube.com/watch?v=G6BbPB37sYg&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
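
For readers unfamiliar with loop interchange, the following C++ sketch shows the transformation on a plain square matrix multiplication. It is not the lab's code: the `Matrix` alias, the row-major layout, and the function names `multiplyIJK` and `multiplyIKJ` are assumptions for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n, assumed layout

// i-j-k order: the inner loop reads b with stride n (column-wise), which is cache-unfriendly.
void multiplyIJK(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      double acc = 0.0;
      for (std::size_t k = 0; k < n; ++k) acc += a[i * n + k] * b[k * n + j];
      c[i * n + j] = acc;
    }
}

// i-k-j order: after interchanging the j and k loops, the inner loop walks
// both b and c row-wise with unit stride.
void multiplyIKJ(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
  std::fill(c.begin(), c.end(), 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k) {
      const double aik = a[i * n + k];
      for (std::size_t j = 0; j < n; ++j) c[i * n + j] += aik * b[k * n + j];
    }
}
```

Both versions perform the same multiplications and accumulate each output element in the same `k` order, so the results match. Only the memory access pattern differs.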
@@ -1,9 +1,9 @@
11
This is a lab about [loop interchange](https://en.wikipedia.org/wiki/Loop_interchange), which is more advanced than the previous one.
22

3-
[<img src="https://drive.google.com/uc?export=view&id=1pX20Lb2E11invOb9_0kqndoGykusxQTV" width="30%">](https://www.youtube.com/watch?v=vsvdtOgBHWo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../../img/LoopInterchange2-Intro.png">](https://www.youtube.com/watch?v=vsvdtOgBHWo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
In this lab assignment you will optimize [Gaussian blur](https://en.wikipedia.org/wiki/Gaussian_blur) algorithm applied to a grayscale image.
66
Modern cameras have good matrices and produce big files. How fast can modern CPU filter a camera shot?
77
Significant speedup has been already achieved by two passes of 1-dimensional digital filter instead of a plain 2D convolution.
88

9-
[<img src="https://drive.google.com/uc?export=view&id=1u0Go7Mp30Bs_nZUYyZcK7gTvKo8BG4Va" width="30%">](https://www.youtube.com/watch?v=uUPOKCT8lyo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
9+
[<img src="../../../img/LoopInterchange2-Summary.png">](https://www.youtube.com/watch?v=uUPOKCT8lyo&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)

labs/memory_bound/loop_tiling_1/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1pqmSJVPLCP8n-MIvSPKZqSZJ6RdMvWhP" width="30%">](https://www.youtube.com/watch?v=wPcDgju8VkI&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/LoopTiling1.png">](https://www.youtube.com/watch?v=wPcDgju8VkI&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
Loop tiling (blocking) is an important technique that you can use to speed up code that is working with multi-dimensional arrays. If one of the memory access patterns on your array is column-wise, or if in the code you are accessing the same data several times in the loop, this technique can be very beneficial for the performance. It is often seen in matrix multiplication and matrix rotation operations, to speed them up.
44
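
A minimal sketch of loop tiling, using a square matrix transpose as the example. This is an illustration under assumptions rather than the lab's code: the function name `transposeTiled`, the row-major layout, and the `tile` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose an n x n row-major matrix one tile at a time. Within a tile, the
// column-wise writes to `out` stay inside a small block of cache lines instead
// of striding across the whole matrix.
void transposeTiled(const std::vector<int>& in, std::vector<int>& out,
                    std::size_t n, std::size_t tile) {
  assert(tile > 0);
  assert(in.size() == n * n && out.size() == n * n);
  for (std::size_t ii = 0; ii < n; ii += tile)
    for (std::size_t jj = 0; jj < n; jj += tile)
      // std::min handles the partial tiles at the right and bottom edges.
      for (std::size_t i = ii; i < std::min(ii + tile, n); ++i)
        for (std::size_t j = jj; j < std::min(jj + tile, n); ++j)
          out[j * n + i] = in[i * n + j];
}
```

A good tile size depends on the cache sizes of the target machine and is usually found by measurement.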

@@ -1,6 +1,6 @@
11
# Software memory prefetching
22

3-
[<img src="https://drive.google.com/uc?export=view&id=1yio888BmVCz8T-PRKiTvCekqem5ltQ0M" width="30%">](https://www.youtube.com/watch?v=yTkaLNuUCXw&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
3+
[<img src="../../../img/SWMemPrefetch1-Intro.png">](https://www.youtube.com/watch?v=yTkaLNuUCXw&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
44

55
When the CPU data prefetcher cannot figure out the memory access pattern, software prefetching comes in handy. The idea is to use special instructions that tell the CPU: "Hey, I plan to use this memory location a bit later, could you fetch it for me while I do other stuff so it waits for me when I am back".
66

@@ -10,6 +10,6 @@ Prefetching can benefit the performance, but it can also hurt the performance. I
1010

1111
An additional prerequisite for the speedup with prefetching is that between the time you request prefetching, and the time you actually access your data, some time needs to pass (known as "prefetching window"). Immediately accessing data that you want to prefetch will not give the expected results.
1212

13-
[<img src="https://drive.google.com/uc?export=view&id=14m5Gm39Z9Ps1JjZNR9eneOiN9gPcgJjo" width="30%">](https://www.youtube.com/watch?v=XkzTTh-CEUc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
13+
[<img src="../../../img/SWMemPrefetch1-Summary.png">](https://www.youtube.com/watch?v=XkzTTh-CEUc&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1414

1515
Authored-by: @ibogosavljevic
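
The prefetching window described above can be sketched with the GCC/Clang builtin `__builtin_prefetch`. The function `gatherSum`, the `PREFETCH` macro, and the default distance of 16 iterations are assumptions for illustration only. On other compilers the macro expands to nothing.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)  // no-op fallback for other compilers
#endif

// Sum data[] through an index array. The indirect access pattern is hard for
// the hardware prefetcher, so we request the element needed `distance`
// iterations from now while working on the current one.
std::uint64_t gatherSum(const std::vector<std::uint32_t>& data,
                        const std::vector<std::size_t>& indices,
                        std::size_t distance = 16) {
  std::uint64_t sum = 0;
  const std::size_t n = indices.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (i + distance < n) PREFETCH(&data[indices[i + distance]]);
    sum += data[indices[i]];
  }
  return sum;
}
```

The `distance` parameter plays the role of the prefetching window. If it is too small, the data has not arrived when it is needed. If it is too large, the data may be evicted before use.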

labs/misc/lto/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=1QLyhFannnuqJw8heBnBsexMnE7tWNZIi" width="30%">](https://www.youtube.com/watch?v=j2sND8ATjsE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/LTO.png">](https://www.youtube.com/watch?v=j2sND8ATjsE&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
Link Time Optimization (LTO) is a collection of compiler transformations that are performed across multiple translation units. It is also frequently referred to as IPO ([Interprocedural optimizations](https://en.wikipedia.org/wiki/Interprocedural_optimization)). Traditionally, compilers perform optimization within one translation unit. LTO helps a lot in situations when many function calls cross translation unit boundaries. In this lab, you will see the effect of LTO in practice.
44

labs/misc/pgo/README.md

+1-1
@@ -1,4 +1,4 @@
1-
[<img src="https://drive.google.com/uc?export=view&id=183oi_PRQ_m29kMF-Tcsl7Ppc0gcmfDHO" width="30%">](https://www.youtube.com/watch?v=ERqFtOZ61AA&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
1+
[<img src="../../../img/PGO.png">](https://www.youtube.com/watch?v=ERqFtOZ61AA&list=PLRWO2AL1QAV6bJAU2kgB4xfodGID43Y5d)
22

33
Profile Guided Optimizations (PGO) are a set of transformations in most optimizing compilers that can adjust their algorithms based on the profiling data. Sometimes in literature, one can find the term Feedback Directed Optimizations (FDO), which essentially refers to the same thing. The quality and relevance of the profiling data has a critical impact on performance since compiler will be "trained" using that data and use it for generating machine code. Profiling data helps compiler improve its inlining decisions, code placement, register allocation, and more. It is not uncommon to see real workloads performance increase by up to 15% from using PGO.
44
