Merge pull request #30 from codeplaysoftware/add-llama-updates
Add llama updates
codeplaymax authored Aug 16, 2024
2 parents 2b97ae7 + 855a409 commit 921fbfb
Showing 3 changed files with 263 additions and 0 deletions.
@@ -0,0 +1,149 @@
---
title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-07-31
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---

## Introduction

The rapid advancement of large language models (LLMs) can be attributed to their ability to tackle complex problems effectively, such as those encountered in chatbots, virtual assistants, content generation, and language translation. Their performance, which approaches human capability on many of these tasks, places LLMs at the forefront of AI models.

Classical general-purpose graph frameworks such as PyTorch and TensorFlow cover a very wide range of machine learning domains, including image and video classification, semantic segmentation, object detection, and other natural language processing tasks for general-purpose language generation. They do so through several neural network (NN) architectures, such as convolutional neural networks, recurrent neural networks, and the various Transformer-based architectures used for generative AI.

While such all-purpose frameworks cover almost all training and inference aspects of the AI models in use today, some scenarios require a particular, inference-only NN architecture for specific devices, such as edge computing or systems without a network connection. These targets may come with hardware limitations, e.g. a single GPU or a single CPU with limited memory and cache sizes and restricted operating system support, so developers may struggle to use the large frameworks there.

With the popularity of large language models, several lightweight frameworks, such as Meta's Llama models, llama.cpp, and vLLM, target only transformer-based architectures for inference. Among
them, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp is a C++-based open source library</a> that can be used
with the Llama model, amongst others. It is written in pure C/C++, which enables LLM inference with minimal dependencies on third-party libraries while providing state-of-the-art performance on a wide variety of local and cloud-based hardware.

[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on
devices with limited resources, such as laptops or desktop PCs with GPUs. The C++-based implementation makes llama.cpp
highly performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core
of llama.cpp is quantization: llama.cpp uses custom quantization types that drastically reduce model sizes, which in
turn enables models to run on devices with limited memory. The challenging part is finding the right quantization
scheme that limits precision loss without causing hallucinations in the output; hence, a lot of model-tuning effort
goes into finding the right quantization parameters, and the code performs several custom matrix multiplication
operations to reduce precision loss for the custom quantization schemes.
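
To put rough numbers on this (an approximation, not an exact accounting of llama.cpp's quantization formats): a 7B-parameter model stored in 16-bit floating point needs about 7B × 2 bytes ≈ 14 GB of memory, while a 4-bit quantization needs roughly 7B × 0.5 bytes ≈ 3.5 GB plus a small overhead for per-block scale factors. That is the difference between a model that does not fit on a typical consumer GPU and one that does.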

## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)

This article will now describe how to migrate the existing llama.cpp CUDA backend to
SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can
then be run across an NVIDIA system, and another system with Intel Data Center Max GPUs - demonstrating truly portable,
single-source code.

Spoiler alert: we don't really need to do this migration. llama.cpp already has SYCL support upstream, thanks to the work of
the Intel and Codeplay teams. The work started with a SYCLomatic conversion back in December 2023, and the feedback from that
conversion led to a lot of improvements in SYCLomatic. The upstream SYCL support is now maintained by Codeplay and Intel
on both NVIDIA and Intel GPUs.

A key benefit of SYCLomatic is that it is a whole project migration tool. This means it does not focus on migrating
individual kernels or files, but instead provides a migration of the entire project that you can then use as a starting
point for your SYCL multi-target application.

## Preparation

For this exercise, I am going to use two distinct machines: my local desktop PC with an integrated NVIDIA GPU, and a
remote system with an Intel Data Center GPU Max series 1110.

I have installed the latest CUDA toolkit on both systems, as well as the Intel oneAPI base toolkit version 2024.2.

Remember to set your environment variables so that all the tools we are going to use are in your path (replace the path
in the first command with your Intel oneAPI Base Toolkit location):

```shell
$ cd /path/to/intel/oneAPI/Toolkit
$ . setvars.sh ~/intel/oneapi
$ dpct --version
Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
```
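
As a quick sanity check on the NVIDIA machine, you can also confirm that the CUDA toolkit and driver are visible before going any further (optional, and assuming a standard CUDA installation):

```shell
$ nvcc --version   # CUDA toolkit compiler is on the PATH
$ nvidia-smi       # driver can see the GPU
```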

Before we can run our model, we have to download it. There are many models supported
by llama.cpp, and the list keeps growing! In this example we are going to download the Llama 2 7B model, already
quantized in 'gguf' format to save some steps, so you can just wget it from your prompt. In this case, I have opted to
create a models directory in my home folder.

```shell
$ mkdir $HOME/models/ ; cd $HOME/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```
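
The download is a single file of roughly 4 GB (the exact size depends on the quantization variant); a quick listing confirms it landed where we expect:

```shell
$ ls -lh $HOME/models/llama-2-7b-chat.Q4_K_M.gguf
```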

On your NVIDIA system, you need a local copy of oneMKL for NVIDIA GPUs. This is currently not available as a
download, so you must build it as follows:

```shell
$ git clone https://github.com/oneapi-src/oneMKL.git
$ cd oneMKL/; mkdir build; cd build
$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
$ ninja install
```

This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs it in the soft/mkl
directory within your home folder.
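
Since the NVIDIA build step in part two reads the oneMKL location from `MKLROOT`, it is convenient to export it now, assuming you kept the install prefix used above:

```shell
$ export MKLROOT=$HOME/soft/mkl
```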

## Steps for the conversion

The first step is to clone the llama.cpp repository, and configure cmake as usual for NVIDIA GPUs, as shown below.

```shell
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

In this example we are using an earlier version of the llama.cpp repository, close to the one we used to do the initial
porting. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight
out of the box with SYCLomatic.

Now, here is the first change: prepend `intercept-build` to the make command you would normally run, as below:

```shell
$ intercept-build make
```

intercept-build is a really useful tool distributed with SYCLomatic. It collects all the compilation commands issued
during the build into a compilation database file, which SYCLomatic can then use to generate new build system files to
compile the SYCL version of the application.

Now we are going to use the information collected by intercept-build to generate a SYCL
build directory by running the dpct command itself:

```shell
$ cd ../.. && mkdir dpct_out
```

```shell
$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
```

When using the `-p` option, dpct finds the compilation database and uses it to convert all project files. In this
case, we have also enabled profiling (which adds profiling information to the generated SYCL code), and we have opted in
to all experimental features (more on this later). We are also migrating the build script to CMake and telling the tool to
process all files.
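
At this point the out-root should contain the migrated SYCL sources together with a CMake build script generated from the original one; a quick listing is a reasonable sanity check (the exact contents depend on the llama.cpp revision and the SYCLomatic version used):

```shell
$ ls ./dpct_out
```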

## Next Part

Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on
NVIDIA and Intel GPUs.

[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)
@@ -0,0 +1,96 @@
---
title: "Part Two - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-08-13
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---

## Prelude

[In our first part](/updates/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one)
we looked at the conversion from CUDA to SYCL via the whole-project migration tool, SYCLomatic. Now we are going to take
this portable code and run it on both an NVIDIA and an Intel GPU.

## Building on the NVIDIA system

Now we are going to build the converted code directly using the CMake file that SYCLomatic has created for us, and then
build the main binary for llama.cpp.

```shell
$ cd dpct_out && mkdir syclbuild && cd syclbuild
$ MKLROOT=/home/ruyman/soft/mkl CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
$ make main
```

Note that we are no longer using the CUDA compiler to build, but the Intel SYCL compiler, so we set the CC and CXX
variables accordingly. We also manually pass the target triple (`-fsycl-targets=nvptx64-nvidia-cuda`), which tells the
SYCL compiler to generate code for NVIDIA CUDA architectures (using PTX). We can now run our model using the following
command:

```shell
$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main -m ../../models/llama-2-7b-chat.Q4_K_M.gguf -ngl 128 --no-mmap
```

The environment variable `ONEAPI_DEVICE_SELECTOR` allows users to override the default selection mechanism of the SYCL
queue in favour of a user-defined setting. The default selection in this case would use OpenCL for the CPU, which won't
work because we explicitly built for NVIDIA GPUs.
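
If you are unsure which backends and devices are exposed on your system, the `sycl-ls` tool that ships with the oneAPI toolkit lists them, which makes it easier to pick a suitable `ONEAPI_DEVICE_SELECTOR` value:

```shell
$ sycl-ls
```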

The conversion out of the box won’t be fast, as it won’t be using the most optimized path for NVIDIA. But it is a good
starting point that allows you to try your SYCL code on the existing environment before moving to a new machine with an
Intel GPU, and you can also re-use your CI infrastructure to test the SYCL path.

## Running on an Intel GPU system

To prove we now have a truly portable application, let's take this code, build it, and run it on an Intel GPU.

Log onto your system with the Intel Data Center Max GPU and repeat the cloning and CUDA build steps so that you can
run intercept-build on the new system, or simply copy over the DPCT-generated project. Now, let's configure and build for Intel
GPUs, using the original CMake flags we used to convert the project.

```shell
$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

Yes, you still use the CUBLAS and CUDA CMake flags: the user-visible CMake flags don't change, but the internal logic in
the CMake file generated by SYCLomatic handles finding the paths to the Intel oneAPI Base Toolkit dependencies.
Once it is configured, you can build the main binary:

```shell
$ make main
```

This builds llama.cpp for the default target – Intel GPUs (using SPIR-V binaries). To run llama.cpp on your Intel GPU,
just use the Level Zero GPU backend, as shown below:

```shell
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main -m ../../llama-2-7b-chat.Q4_K_M.gguf --no-mmap -ngl 128
```

Now this is the same application running on an Intel GPU with no user intervention! That means all the heavy lifting is
done by the tool, and you can focus on optimization and refactoring of the generated code.

## Conclusions

In this article we have shown a practical use case of migrating a CUDA C++ AI application to SYCL, and a popular one at that!
The conversion works straight out of the box, with no code changes needed. The SYCLomatic tool is there to assist
you with porting applications from CUDA to SYCL: it gives you useful warning messages and introduces code that you can
later replace with code that better suits your application.

We have also shown that the same code works on two completely different GPUs, NVIDIA and Intel, without any modification,
with the potential for others through the use of the open standard SYCL. Although llama.cpp already has a CUDA backend,
having the SYCL backend run on both platforms means we can re-use CI infrastructure for testing and run the application
on a wider set of platforms with fewer code changes.

The current SYCL backend supported in upstream llama.cpp started as a DPCT conversion, not too dissimilar to the one we
just did in this article. Developers have been working on the SYCL backend to improve performance on a wide variety of
platforms (NVIDIA, AMD, and Intel GPUs on client and datacenter, and others including RISC-V), but we still re-use some of the
original code that SYCLomatic generated for us. That original conversion saved several engineering months in getting
something up and running, and allowed us to focus on the important parts of the project: performance and code quality.

If you want help porting a CUDA application to SYCL, or have questions about anything in this article, reach out to us
at [[email protected]](mailto:[email protected]).

18 changes: 18 additions & 0 deletions static/css/styled.scss
@@ -811,6 +811,24 @@ body {
width: 100%;
height: auto;
}

code {
padding: .1rem .2rem;
background-color: #d0d0d0;
display: inline-block;
border-radius: 6px;
}

pre code {
display: block;
max-width: 100%;
word-break: break-word;
white-space: break-spaces;
background-color: var(--hint-color);
color: white;
padding: 1rem;
border-radius: 12px;
}
}
}

