From 85de16db87c42e96d318ba720c143b30c30ca51c Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:07:12 +0100
Subject: [PATCH 1/6] Added the Llama blogs part one and two.

---
 ...and-oneapi-one-llama-at-a-time-part-one.md | 147 ++++++++++++++++++
 ...-to-sycl-and-oneapi-one-llama-at-a-time.md |  96 ++++++++++++
 2 files changed, 243 insertions(+)
 create mode 100644 _collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
 create mode 100644 _collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md

diff --git a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
new file mode 100644
index 0000000..bd22218
--- /dev/null
+++ b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
@@ -0,0 +1,147 @@
+---
+title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
+date: 2024-07-31
+layout: update
+tags:
+  - cuda
+  - sycl
+  - oneapi
+  - porting
+---
+
+## Introduction
+
+The rapid advancement of LLMs can be attributed to their ability to tackle complex problems effectively, such as those
+encountered in chatbots, virtual assistants, content generation, and language translation. Their performance on these
+tasks, often matching human capability, places LLMs at the forefront of AI models.
+
+Classical general-purpose graph frameworks such as PyTorch and TensorFlow cover a very wide range of machine learning
+domains, including image and video classification, semantic segmentation, object detection, and natural language
+processing for general-purpose language generation. They do this through several neural network (NN) architectures:
+convolutional neural networks, recurrent neural networks, and the various Transformer-based architectures used for
+generative AI.
+
+While such all-purpose frameworks cover almost every training and inference aspect of today's AI models, some scenarios
+call for an inference-only NN architecture aimed at specific devices, such as edge systems or machines without a
+network connection. These targets often come with hardware limitations, for example a single GPU or a single CPU with
+limited memory and cache sizes, plus restricted operating system support, so developers may struggle to use the large
+frameworks on them.
+
+With the popularity of large language models, several lightweight frameworks, such as Meta's Llama models, llama.cpp,
+and vLLM, have appeared that target only Transformer-based architectures for inference. Among them, llama.cpp is an
+open source library that can be used with the Llama models amongst others. It is written in pure C/C++, which enables
+LLM inference with minimal dependencies on third-party libraries while providing state-of-the-art performance on a
+wide variety of local and cloud-based hardware.
+
+[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on devices
+with limited resources, such as laptops or desktop PCs with GPUs. The C++-based implementation makes llama.cpp highly
+performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core of
+llama.cpp is quantization: llama.cpp uses custom quantization types that drastically reduce model sizes, which in turn
+enables models to run on devices with limited memory.
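+
+The exact formats llama.cpp implements (such as the Q4_K family) are considerably more sophisticated than this, but
+the sketch below, which is purely illustrative and not code taken from the llama.cpp sources (the BlockQ4 type and
+function names are made up for the example), shows the basic idea behind block-wise 4-bit quantization: each block of
+weights is stored as packed 4-bit integers plus a single scale factor, trading a little precision for a large saving
+in memory.
+
+```cpp
+#include <algorithm>
+#include <array>
+#include <cmath>
+#include <cstdint>
+
+// Illustrative sketch only; this is not llama.cpp's actual quantization code.
+// A block of 32 float weights shrinks to one float scale plus 16 bytes of
+// packed 4-bit levels, more than 6x smaller than the same block in fp32.
+struct BlockQ4 {
+    float scale;                        // per-block scale factor
+    std::array<std::uint8_t, 16> data;  // 32 values, two 4-bit levels per byte
+};
+
+BlockQ4 quantize_block(const float* x) {
+    float amax = 0.0f;
+    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
+    BlockQ4 b{};
+    b.scale = amax / 7.0f;  // map [-amax, amax] onto integer levels [-7, 7]
+    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
+    for (int i = 0; i < 16; ++i) {
+        // shift each level by +8 so it fits into an unsigned nibble
+        int lo = std::clamp(static_cast<int>(std::round(x[2 * i] * inv)) + 8, 1, 15);
+        int hi = std::clamp(static_cast<int>(std::round(x[2 * i + 1] * inv)) + 8, 1, 15);
+        b.data[i] = static_cast<std::uint8_t>(lo | (hi << 4));
+    }
+    return b;
+}
+
+float dequantize(const BlockQ4& b, int i) {
+    const int level = ((i % 2 == 0) ? (b.data[i / 2] & 0x0F) : (b.data[i / 2] >> 4)) - 8;
+    return level * b.scale;
+}
+```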
+
+The challenging part is finding a quantization scheme that limits precision loss without causing hallucinations in the
+output; a lot of model-tuning effort therefore goes into finding the right quantization parameters, and the code
+performs several custom matrix multiplication operations to reduce precision loss with these custom quantization
+schemes.
+
+## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)
+
+This article describes how to migrate the existing llama.cpp CUDA backend to
+SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can then be
+run on an NVIDIA system and on another system with Intel Data Center GPU Max devices, demonstrating truly portable,
+single-source code.
+
+Spoiler alert: we don't really need to do this migration. llama.cpp already has SYCL support upstream, thanks to the
+work of the Intel and Codeplay teams. That work started with a SYCLomatic conversion back in December 2023, and the
+feedback from that conversion led to a lot of improvements in SYCLomatic. The upstream SYCL support is now maintained
+by Codeplay and Intel on both NVIDIA and Intel GPUs.
+
+A key benefit of SYCLomatic is that it is a whole-project migration tool. It does not focus on migrating individual
+kernels or files; instead it migrates the entire project, which you can then use as a starting point for your SYCL
+multi-target application.
+
+## Preparation
+
+For this exercise, I am going to use two distinct machines: my local desktop PC with an NVIDIA GPU, and a remote
+system with an Intel Data Center GPU Max Series 1110.
+
+I have installed the latest CUDA toolkit on both systems, as well as version 2024.2 of the Intel oneAPI Base Toolkit.
+
+Remember to set your environment variables so that all the tools we are going to use are on your path (replace the
+first path below with the location of your Intel oneAPI Base Toolkit installation):
+
+```shell
+$ cd /path/to/intel/oneAPI/Toolkit
+$ . setvars.sh ~/intel/oneapi
+$ dpct --version
+Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
+```
+
+Before we can run our model, we have to download it. There are many models supported by llama.cpp, and the list keeps
+growing! In this example we are going to download the Llama 2 7B Chat model, already quantized in GGUF format to save
+some steps, so you can just wget it from your prompt. In this case, I have opted for creating a models directory in my
+home folder.
+
+```shell
+$ mkdir $HOME/models/ ; cd $HOME/models/
+$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
+```
+
+On your NVIDIA system you also need a local copy of oneMKL for NVIDIA GPUs. This is currently not available as a
+binary download, so you must build it as follows:
+
+```shell
+$ git clone https://github.com/oneapi-src/oneMKL.git
+$ cd oneMKL/; mkdir build; cd build
+$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
+$ ninja install
+```
+
+This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs them in the soft/mkl
+directory within your home folder.
+
+## Steps for the conversion
+
+The first step is to clone the llama.cpp repository, and configure cmake as usual for NVIDIA GPUs, as shown below.
+
+```shell
+$ git clone https://github.com/ggerganov/llama.cpp.git
+$ cd llama.cpp
+$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
+$ mkdir build && cd build
+$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
+```
+
+In this example we are using an earlier revision of the llama.cpp repository, close to the one we used for the initial
+porting. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight
+out of the box with SYCLomatic.
+
+Now, here is the first change: prepend `intercept-build` to the make command you would normally run, as below:
+
+```shell
+$ intercept-build make
+```
+
+intercept-build is a really useful tool, distributed with SYCLomatic, that collects all the compilation commands
+issued during the build into a YAML file. SYCLomatic can then use this file to generate new build system files to
+compile the SYCL version of your application.
+
+Now we are going to use the information collected by intercept-build to generate a SYCL
+build directory by running the dpct command itself:
+
+```shell
+$ cd ../.. && mkdir dpct_out
+```
+
+```shell
+$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
+```
+
+When using the `-p` option, dpct finds the compilation database and uses it to convert all the project files. In this
+case we have also enabled profiling (which adds profiling information to the generated SYCL code), opted in to all
+experimental features (more on this later), asked for the build script to be migrated to CMake, and told it to
+process all files.
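+
+At this point it is worth picturing what the migration actually does to each kernel. The snippet below is an
+illustrative sketch rather than code taken from the generated llama.cpp sources (the `scale` kernel is an invented
+example): it shows how a simple CUDA kernel and its launch typically map onto a SYCL queue submission. The real output
+from SYCLomatic is more involved and typically also pulls in its own helper headers, but each migrated kernel follows
+this general pattern.
+
+```cpp
+#include <sycl/sycl.hpp>
+
+// Illustrative sketch only; not code lifted from the migrated llama.cpp tree.
+//
+// Original CUDA kernel and launch:
+//   __global__ void scale(float* x, float a, int n) {
+//     int i = blockIdx.x * blockDim.x + threadIdx.x;
+//     if (i < n) x[i] *= a;
+//   }
+//   scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
+//
+// One way the same kernel looks in SYCL: the launch configuration becomes an
+// nd_range and the kernel body becomes a lambda submitted to a queue. Here x
+// is assumed to be a USM (unified shared memory) pointer usable on the device.
+void scale(sycl::queue& q, float* x, float a, int n) {
+    const size_t local = 256;
+    const size_t global = ((n + 255) / 256) * local;
+    q.parallel_for(sycl::nd_range<1>(sycl::range<1>(global), sycl::range<1>(local)),
+                   [=](sycl::nd_item<1> item) {
+                       const int i = static_cast<int>(item.get_global_id(0));
+                       if (i < n) x[i] *= a;
+                   }).wait();
+}
+```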
+
+## Next Part
+
+Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on
+NVIDIA and Intel GPUs.

diff --git a/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
new file mode 100644
index 0000000..0bda9cb
--- /dev/null
+++ b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
@@ -0,0 +1,96 @@
+---
+title: "Part Two - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
+date: 2024-08-13
+layout: update
+tags:
+- cuda
+- sycl
+- oneapi
+- porting
+---
+
+## Prelude
+
+[In our first part](https://codeplay.com/portal/blogs/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one)
+we looked at the conversion from CUDA to SYCL via the whole project migration tool, SYCLomatic. Now we are going to take
+this portable code, and run it across an NVIDIA and Intel GPU.
+
+## Building on the NVIDIA system
+
+Now we are going to build the converted code directly using the CMake file that SYCLomatic has created for us, and then
+build the main binary for llama.cpp.
+
+```shell
+$ cd dpct_out && mkdir syclbuild && cd syclbuild
+$ MKLROOT=/home/ruyman/soft/mkl CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
+$ make main
+```
+
+Note that we are no longer building with the CUDA compiler but with the Intel SYCL compiler, so we set the CC and CXX
+variables accordingly. We also manually pass the target triple (`-fsycl-targets=nvptx64-nvidia-cuda`), which tells the
+SYCL compiler to generate code for NVIDIA CUDA architectures (using PTX). We can now run our model using the following
+command:
+
+```shell
+$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main -m ../../models/llama-2-7b-chat.Q4_K_M.gguf -ngl 12899 --no-mmap
+```
+
+The environment variable `ONEAPI_DEVICE_SELECTOR` allows users to override the default selection mechanism of the SYCL
+queue in favour of a user-defined setting. The default selection in this case would use OpenCL on the CPU, which won't
+work because we explicitly built for NVIDIA GPUs.
+
+The conversion won't be fast out of the box, as it won't be using the most optimized path for NVIDIA. But it is a good
+starting point that lets you try your SYCL code in your existing environment before moving to a new machine with an
+Intel GPU, and you can also re-use your CI infrastructure to test the SYCL path.
+
+## Running on an Intel GPU system
+
+To prove that we now have a truly portable application, let's take this code, build it, and run it on an Intel GPU.
+
+Log onto your system with the Intel Data Center GPU Max and repeat the cloning and CUDA build steps so that you can
+run intercept-build on the new system, or simply copy over the DPCT-generated project. Now, let's configure and build
+for Intel GPUs, using the original CMake flags we used to convert the project.
+
+```shell
+$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80
+```
+
+Yes, you still use the CUBLAS and CUDA CMake flags: the user-visible CMake options do not change, but the internal
+logic of the CMake file generated by SYCLomatic handles finding the paths to the Intel oneAPI Base Toolkit
+dependencies. Once it is configured, you can run:
+
+```shell
+$ make main
+```
+
+This builds llama.cpp for the default target, Intel GPUs (using SPIR-V binaries).
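+
+Before running anything, it is worth checking which platforms and devices the SYCL runtime can actually see, either
+with the `sycl-ls` tool that ships with the oneAPI toolkit or with a small standalone program such as the sketch below
+(a hypothetical helper for illustration, not part of llama.cpp or of the migrated sources):
+
+```cpp
+#include <iostream>
+#include <sycl/sycl.hpp>
+
+// Hypothetical helper, not part of llama.cpp: list every platform and device
+// visible to the SYCL runtime so you can confirm your GPU shows up before
+// pointing ONEAPI_DEVICE_SELECTOR at it.
+int main() {
+    for (const auto& platform : sycl::platform::get_platforms()) {
+        std::cout << platform.get_info<sycl::info::platform::name>() << "\n";
+        for (const auto& device : platform.get_devices()) {
+            std::cout << "  " << device.get_info<sycl::info::device::name>() << "\n";
+        }
+    }
+    return 0;
+}
+```
+
+Compile it with `icpx -fsycl`; the Intel GPU should appear under the Level Zero backend, which is exactly the backend
+we select next.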
+
+To run llama.cpp on your Intel GPU, just use the Level Zero GPU backend, as shown below:
+
+```shell
+$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main -m ../../llama-2-7b-chat.Q4_K_M.gguf --no-mmap -ngl 128
+```
+
+Now this is the same application running on an Intel GPU with no user intervention! All the heavy lifting is done by
+the tool, so you can focus on optimization and refactoring of the generated code.
+
+## Conclusions
+
+In this article we have shown a practical use case of porting a CUDA C++ AI application to SYCL, and a popular one at
+that! The conversion works straight out of the box, with no code changes needed. The SYCLomatic tool is there to
+assist you with porting applications from CUDA to SYCL: it gives you useful warning messages and introduces code that
+you can later replace with code that better suits your application.
+
+We have also shown that the same code works on two completely different GPUs, NVIDIA and Intel, without any
+modification, with the potential to support others through the use of the open standard SYCL. Although llama.cpp
+already has a CUDA backend, having the SYCL backend run on both platforms means we can re-use CI infrastructure for
+testing and run the application on a wider set of platforms with fewer code changes.
+
+The SYCL backend now supported in upstream llama.cpp started as a DPCT conversion not too dissimilar to the one we
+have just done in this article. Developers have been working on the SYCL backend to improve performance on a wide
+variety of platforms (NVIDIA, AMD, and Intel GPUs in client and data centre, and others, including RISC-V), but we
+still re-use some of the original code that SYCLomatic generated for us. That original conversion saved several
+engineering months in getting something up and running, and allowed us to focus on the important parts of the
+project: performance and code quality.
+
+If you want help porting a CUDA application to SYCL, or have questions about anything in this article, reach out to us
+at [dev-rel@codeplay.com](mailto:dev-rel@codeplay.com).
+

From 15f3a47b048574dcb2c78791716bfa15c78a7068 Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:07:21 +0100
Subject: [PATCH 2/6] Added some code/pre styling.

---
 static/css/styled.scss | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/static/css/styled.scss b/static/css/styled.scss
index 5fa0bae..5854233 100644
--- a/static/css/styled.scss
+++ b/static/css/styled.scss
@@ -811,6 +811,17 @@ body {
         width: 100%;
         height: auto;
       }
+
+      pre code {
+        display: block;
+        max-width: 100%;
+        word-break: break-word;
+        white-space: break-spaces;
+        background-color: var(--hint-color);
+        color: white;
+        padding: 1rem;
+        border-radius: 12px;
+      }
     }
   }

From 8010196e4ce9d0c1ee842a3d90cdeec9e2ea9435 Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:09:52 +0100
Subject: [PATCH 3/6] Added some code/pre styling.
---
 static/css/styled.scss | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/static/css/styled.scss b/static/css/styled.scss
index 5854233..fc953f5 100644
--- a/static/css/styled.scss
+++ b/static/css/styled.scss
@@ -812,6 +812,13 @@ body {
         height: auto;
       }
 
+      code {
+        padding: .1rem .2rem;
+        background-color: #d0d0d0;
+        display: inline-block;
+        border-radius: 6px;
+      }
+
       pre code {
         display: block;
         max-width: 100%;

From a551f17c015bc965ae02cc1d3053504f02d3f771 Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:11:05 +0100
Subject: [PATCH 4/6] Added some more links to make it easier to follow the llama blogs.

---
 ...from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md | 2 ++
 ...ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
index bd22218..e27e4d6 100644
--- a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
+++ b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
@@ -145,3 +145,5 @@ process all files.
 
 Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on
 NVIDIA and Intel GPUs.
+
+[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)
diff --git a/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
index 0bda9cb..3f20be2 100644
--- a/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
+++ b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
@@ -11,7 +11,7 @@ tags:
 
 ## Prelude
 
-[In our first part](https://codeplay.com/portal/blogs/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one)
+[In our first part](/updates/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one)
 we looked at the conversion from CUDA to SYCL via the whole project migration tool, SYCLomatic. Now we are going to take
 this portable code, and run it across an NVIDIA and Intel GPU.

From c22ebd50087a5443f667cb817f6d4448318ffeb2 Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:11:27 +0100
Subject: [PATCH 5/6] Tweaks to code block.
---
 ...from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
index e27e4d6..6b015b8 100644
--- a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
+++ b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
@@ -129,7 +129,7 @@ Now we are going to use the information collected by intercept-build to generate
 build directory by running the dpct command itself:
 
 ```shell
-$ cd ../.. && mkdir dpct_out
+$ cd ../.. && mkdir dpct_out
 ```
 
 ```shell

From 855a40980ee80edba4029dbe605194bbd53363d6 Mon Sep 17 00:00:00 2001
From: Scott Straughan
Date: Fri, 16 Aug 2024 14:11:55 +0100
Subject: [PATCH 6/6] Tweaks to code block.

---
 ...from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md | 2 +-
 ...ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
index 6b015b8..f08d2cd 100644
--- a/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
+++ b/_collections/_updates/2024-07-31-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
@@ -106,7 +106,7 @@ The first step is to clone the llama.cpp repository, and configure cmake as usua
 $ git clone https://github.com/ggerganov/llama.cpp.git
 $ cd llama.cpp
 $ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
-$ mkdir build && cd build
+$ mkdir build && cd build
 $ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
 ```
diff --git a/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
index 3f20be2..3f370a3 100644
--- a/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
+++ b/_collections/_updates/2024-08-13-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
@@ -21,7 +21,7 @@ Now we are going to build the converted code directly using the CMake file that
 build the main binary for llama.cpp.
 
 ```shell
-$ cd dpct_out && mkdir syclbuild && cd syclbuild
+$ cd dpct_out && mkdir syclbuild && cd syclbuild
 $ MKLROOT=/home/ruyman/soft/mkl CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
 $ make main
 ```