Merge pull request #30 from codeplaysoftware/add-llama-updates
Add llama updates

Showing 3 changed files with 263 additions and 0 deletions.
149 changes: 149 additions & 0 deletions
...1-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one.md
@@ -0,0 +1,149 @@

---
title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-07-31
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---

## Introduction
The rapid advancement of LLMs can be attributed to their ability to tackle complex problems effectively, such as those encountered in chatbots, virtual assistants, content generation, and language translation. Their performance, which rivals human capabilities on many of these tasks, places LLMs at the forefront of AI models.

Classical general-purpose graph frameworks like PyTorch and TensorFlow cover a very wide range of machine learning domains, such as image and video classification, semantic segmentation, object detection, and natural language processing for general-purpose language generation, through several neural network (NN) architectures: convolutional neural networks, recurrent neural networks, and various Transformer-based architectures for generative AI.

While such general-purpose frameworks can cover almost all training and inference aspects of the AI models in use today, some scenarios call for an inference-only NN architecture targeting specific devices, such as edge computing systems or systems without a network connection. These targets may have hardware limitations, e.g. only a single GPU or a single CPU with limited memory and cache sizes, and restricted operating system support, so developers may struggle to use the larger frameworks on them.

With the popularity of large language models, several lightweight frameworks have appeared, such as Meta's llama models, llama.cpp, and vLLM, which target only transformer-based architectures for inference. Among them, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp is a C++-based open source library</a> that can be used with the llama model amongst others. It is written in pure C/C++, which enables LLM inference with minimal dependencies on third-party libraries while providing state-of-the-art performance on a wide variety of local and cloud-based hardware.

[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on devices with limited resources, such as laptops or desktop PCs with GPUs. The C++ implementation makes llama.cpp highly performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core of llama.cpp is quantization: llama.cpp uses custom quantization types that drastically reduce model sizes, which in turn enables models to run on devices with limited memory. The challenging part is finding a quantization scheme that prevents precision loss without causing hallucinations in the output; hence, much of the effort of tuning the models goes into finding the right quantization parameters, and the code performs several custom matrix multiplication operations to reduce precision loss with the custom quantization schemes.
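
As an aside, this is roughly how such a pre-quantized 'gguf' file is produced with llama.cpp's own quantization tool. It is shown purely for illustration: the tool name, paths and arguments here are assumptions that vary between llama.cpp versions, and we will simply download an already-quantized model later.

```shell
# Illustrative only: convert a higher-precision GGUF model to the 4-bit Q4_K_M scheme.
# Tool name, paths and arguments are assumptions and differ between llama.cpp versions.
$ ./quantize ./models/llama-2-7b-chat.f16.gguf ./models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M
```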
## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)

This article will now describe how to migrate the existing llama.cpp CUDA backend to SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can then be run on an NVIDIA system and on another system with Intel Data Center Max GPUs - demonstrating truly portable, single-source code.

Spoiler alert: we don't really need to do this migration; llama.cpp already has SYCL support upstream, thanks to the work of the Intel and Codeplay teams. That work started with a SYCLomatic conversion back in December 2023, and the feedback from that conversion led to a lot of improvements in SYCLomatic. The upstream SYCL support is now maintained by Codeplay and Intel on both NVIDIA and Intel GPUs.

A key benefit of SYCLomatic is that it is a whole-project migration tool. This means it does not focus on migrating individual kernels or files, but instead migrates the entire project, which you can then use as a starting point for your SYCL multi-target application.
## Preparation

For this exercise, I am going to use two distinct machines: my local desktop PC with an integrated NVIDIA GPU, and a remote system with an Intel Data Center GPU Max Series 1110.

I have installed the latest CUDA toolkit on both systems, as well as the Intel oneAPI Base Toolkit version 2024.2.
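
Before going further, it is worth a quick check that the CUDA toolkit and driver are visible on the NVIDIA system; the exact output of these commands depends entirely on your installation:

```shell
# Sanity check (illustrative): confirm the CUDA toolkit and NVIDIA driver are installed.
$ nvcc --version
$ nvidia-smi
```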
Remember to set your environment variables so that all the tools we are going to use are in your path (replace the first path with the location of your Intel oneAPI Base Toolkit installation):

```shell
$ cd /path/to/intel/oneAPI/Toolkit
$ . setvars.sh ~/intel/oneapi
$ dpct --version
Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
```
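
With the environment sourced, you can also list the devices that the SYCL runtime can see; `sycl-ls` ships with the oneAPI Base Toolkit and its output depends on your hardware and drivers:

```shell
# List the SYCL backends and devices visible to the oneAPI runtime (output is system-dependent).
$ sycl-ls
```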
Before we can run our model, we have to download it. There are many models supported by llama.cpp, and the list keeps growing! In this example we are going to download the Llama 2 7B model, already quantized in 'gguf' format to save some steps, so you can just wget it from your prompt. In this case, I have opted to create a models directory in my home folder.

```shell
$ mkdir $HOME/models/ ; cd $HOME/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```
On your NVIDIA system, you need a local copy of oneMKL for NVIDIA GPUs. This is currently not available as a download, so you must build it as follows:

```shell
$ git clone https://github.com/oneapi-src/oneMKL.git
$ cd oneMKL/; mkdir build; cd build
$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
$ ninja install
```
This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs them in the soft/mkl directory within your home folder.
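
A convenient extra step (my own suggestion, not required by the original flow) is to export the install location, since the build in part two passes `MKLROOT` and `-L${MKLROOT}/lib` to the compiler:

```shell
# Optional convenience (assumption): point later builds at the oneMKL interfaces install.
$ export MKLROOT=${HOME}/soft/mkl
$ export LD_LIBRARY_PATH=${MKLROOT}/lib:${LD_LIBRARY_PATH}
```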
## Steps for the conversion

The first step is to clone the llama.cpp repository and configure CMake as usual for NVIDIA GPUs, as shown below.

```shell
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
```
In this example we are using an earlier version of the llama.cpp repository, closer to the one we used for the initial port. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight out of the box with SYCLomatic.

Now, here is the first change: prepend `intercept-build` to the make command you would normally run, as below:
```shell
$ intercept-build make
```

intercept-build is a really useful tool, distributed with SYCLomatic, that collects all the compilation commands issued during the build into a compilation database that SYCLomatic can then use to generate new build system files to compile the SYCL version of your application.
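
If you are curious about what intercept-build captured, the database is written alongside the build artifacts; it is commonly a `compile_commands.json` file, although the exact name may vary between SYCLomatic releases:

```shell
# Illustrative: peek at the compilation database produced by intercept-build (run from llama.cpp/build).
$ ls ./compile_commands.json
$ head ./compile_commands.json
```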
Now we are going to use the information collected by intercept-build to generate a SYCL build directory by running the dpct command itself:

```shell
$ cd ../.. && mkdir dpct_out
```

```shell
$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
```
When using the `-p` option, dpct finds the compilation database and uses it to convert all project files. In this case, we have also enabled profiling (which adds profiling information to the generated SYCL code), and we have opted in to all experimental features (more on this later). We are also migrating the build script using CMake, and telling the tool to process all files.
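
Before moving on, a quick look at the output directory confirms the migration did what we asked. The exact contents depend on the llama.cpp revision and SYCLomatic version, but you should at least see the migrated sources and the CMake build script generated by `--migrate-build-script=CMake`:

```shell
# Illustrative: the migrated project, including a generated CMakeLists.txt, lives in dpct_out.
$ ls ./dpct_out
$ ls ./dpct_out/CMakeLists.txt
```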
## Next Part

Now we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run it on NVIDIA and Intel GPUs.

[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)

96 changes: 96 additions & 0 deletions
...3-part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time.md
@@ -0,0 +1,96 @@

---
title: "Part Two - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-08-13
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---
## Prelude

[In our first part](/updates/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one) we looked at the conversion from CUDA to SYCL using the whole-project migration tool, SYCLomatic. Now we are going to take this portable code and run it on both an NVIDIA GPU and an Intel GPU.
## Building on the NVIDIA system

Now we are going to build the converted code directly using the CMake file that SYCLomatic has created for us, and then build the main binary for llama.cpp.

```shell
$ cd dpct_out && mkdir syclbuild && cd syclbuild
$ MKLROOT=/home/ruyman/soft/mkl CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
$ make main
```
Note that we are no longer using the CUDA compiler to build, but the Intel SYCL compiler, so we set the CC and CXX variables accordingly. We also manually pass the target triple (`-fsycl-targets=nvptx64-nvidia-cuda`), which tells the SYCL compiler to generate code for NVIDIA CUDA architectures (using PTX). We can now run our model using the following command:
```shell
$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main -m ../../models/llama-2-7b-chat.Q4_K_M.gguf -ngl 128 --no-mmap
```
The environment variable `ONEAPI_DEVICE_SELECTOR` allows users to override the default selection mechanism of the SYCL queue in favour of a user-defined setting. The default selection in this case would use OpenCL for the CPU, which won't work because we explicitly built for NVIDIA GPUs.
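
The selector takes the form `backend:device_type`; the two values used in this series are shown below, with the rest of the command line elided:

```shell
# Select the NVIDIA GPU through the CUDA backend (used on this NVIDIA system).
$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main ...
# Select the Intel GPU through the Level Zero backend (used later on the Intel system).
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main ...
```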
The conversion out of the box won't be fast, as it won't be using the most optimized path for NVIDIA hardware. But it is a good starting point that allows you to try your SYCL code in your existing environment before moving to a new machine with an Intel GPU, and you can also re-use your CI infrastructure to test the SYCL path.
## Running on an Intel GPU system

To prove that we now have a truly portable application, let's take this code, build it, and run it on an Intel GPU.

Log onto your system with the Intel Data Center Max GPU and repeat the cloning and building-for-CUDA steps, so you can run intercept-build on the new system, or simply copy over the DPCT-generated project. Now let's configure and build for Intel GPUs, using the original CMake flags we used to convert the project.

```shell
$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80
```
Yes, you still use the CUBLAS and CUDA CMake flags. The user-visible CMake flags don't change, but the internal logic of the CMake file generated by SYCLomatic handles finding the Intel oneAPI Base Toolkit dependencies. Once it is configured, you can run:

```shell
$ make main
```

This builds llama.cpp for the default target, Intel GPUs (using SPIR-V binaries). To run llama.cpp on your Intel GPU, just use the Level Zero GPU backend, as shown below:
```shell
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main -m ../../llama-2-7b-chat.Q4_K_M.gguf --no-mmap -ngl 128
```

Now this is the same application running on an Intel GPU with no user intervention! That means all the heavy lifting is done by the tool, and you can focus on optimization and refactoring of the generated code.
## Conclusions

In this article we have shown a practical use case of porting a CUDA C++ AI application to SYCL, and a popular one at that! The conversion works straight out of the box, with no code changes needed. More typically, the SYCLomatic tool is there to assist you with porting applications from CUDA to SYCL: it gives you useful warning messages and introduces code that you can later replace with code that better suits your application.

We have also shown that the same code works on two completely different GPUs, NVIDIA and Intel, without any modification, with the potential to support others through the use of the open standard SYCL. Although llama.cpp already has a CUDA backend, having the SYCL backend run on both platforms means we can re-use CI infrastructure for testing and run the application on a wider set of platforms with fewer code changes.

The current SYCL backend in upstream llama.cpp started as a DPCT conversion, not too dissimilar to the one we have just done in this article. Developers have been working on the SYCL backend to improve performance on a wide variety of platforms (NVIDIA, AMD, and Intel GPUs on client and datacenter, and others including RISC-V), but we still re-use some of the original code that SYCLomatic generated for us. That original conversion saved several engineering months in getting something up and running, and allowed us to focus on the important parts of the project: performance and code quality.

If you want help porting a CUDA application to SYCL, or have questions about anything in this article, reach out to us at [[email protected]](mailto:[email protected]).