From cfe6765c9d35404e6bfee0133eb907a108b80e79 Mon Sep 17 00:00:00 2001
From: Rajashekar Kasturi <134040933+rskasturi@users.noreply.github.com>
Date: Thu, 1 Feb 2024 17:55:44 +0530
Subject: [PATCH] Update README-sycl.md

Refined existing instructions and verified the build on Max 1100 on Linux.

---
 README-sycl.md | 497 +++++++++++++++++++++++--------------------------
 1 file changed, 231 insertions(+), 266 deletions(-)

diff --git a/README-sycl.md b/README-sycl.md
index 2b2cfe03aac3a..fb59293ed413f 100644
--- a/README-sycl.md
+++ b/README-sycl.md
@@ -1,22 +1,13 @@
 # llama.cpp for SYCL

-[Background](#background)
-
-[OS](#os)
-
-[Intel GPU](#intel-gpu)
-
-[Linux](#linux)
-
-[Windows](#windows)
-
-[Environment Variable](#environment-variable)
-
-[Known Issue](#known-issue)
-
-[Q&A](#q&a)
-
-[Todo](#todo)
+* [Background](#background)
+* [Supported OS](#supported-os)
+* [Intel® GPU Portfolio](#intel-gpu)
+* [Linux](#linux)
+* [Windows](#windows)
+* [Environment Variable](#environment-variable)
+* [Known Issues and Steps to troubleshoot](#known-issues-and-steps-to-troubleshoot)
+* [Todo](#todo)

## Background

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.

-Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
+Intel® uses SYCL as the direct programming language to support CPUs, GPUs, and FPGAs.

To avoid re-inventing the wheel, this code refers to other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use the open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (commercial release: [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) to migrate to SYCL. 
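The migration step above can be sketched with SYCLomatic's command-line tool, which is installed as `c2s` (or `dpct` in the oneAPI distribution). This is an illustrative sketch only, not part of this repository's build: the file name `vector_add.cu` and the output folder `./migrated` are made-up examples.

```shell
# Hypothetical sketch: convert one CUDA source file to SYCL with SYCLomatic.
# `vector_add.cu` and `./migrated` are illustrative names, not repo files.
c2s --out-root=./migrated vector_add.cu

# The migrated source appears as ./migrated/vector_add.dp.cpp and can then be
# compiled with the oneAPI DPC++ compiler, for example:
#   icpx -fsycl ./migrated/vector_add.dp.cpp -o vector_add
```

For a whole project, SYCLomatic can also consume a compilation database instead of individual files; see its user guide for details.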
-The llama.cpp for SYCL is used to support Intel GPUs.
+The llama.cpp for SYCL is used to support Intel® GPUs.

-For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
+For Intel® CPUs, we recommend using llama.cpp for X86 with [Intel® MKL building](https://github.com/ggerganov/llama.cpp#intel-onemkl).

-## OS
+## Supported OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Support|Windows 11|

-
## Intel GPU

|Intel GPU| Status | Verified Model|
|-|-|-|
-|Intel Data Center Max Series| Support| Max 1550|
-|Intel Data Center Flex Series| Support| Flex 170|
-|Intel Arc Series| Support| Arc 770, 730M|
+|Intel® Data Center Max Series| Support| Max 1550, 1100|
+|Intel® Data Center Flex Series| Support| Flex 170|
+|Intel® Arc Series| Support| Arc 770, 730M|
|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
|Intel iGPU| Support| iGPU in i5-1250P, i7-1165G7|

-
## Linux

### Setup Environment

-1. Install Intel GPU driver.
-
-a. Please install Intel GPU driver by official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).
+* Install the Intel® GPU driver.
+  * You can install the driver by following the official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html)
+
+  * Note: for an iGPU, install the client GPU driver.
+
+  * Add the user to the video and render groups:
+
+  * ```bash
+    sudo usermod -aG render username
+    sudo usermod -aG video username
+    ```

-Note: for iGPU, please install the client GPU driver.
+  * Note: log out and log back in for the group change to take effect.

-b. Add user to group: video, render.
+  * Test the compute stack:

-```
-sudo usermod -aG render username
-sudo usermod -aG video username
-```
+  * ```bash
+    sudo apt install clinfo
+    sudo clinfo -l
+    ```

-Note: re-login to enable it.
+    Output (example):

-c. 
Check

-```
-sudo apt install clinfo
-sudo clinfo -l
-```
-Output (example):

+    ```bash
+    Platform #0: Intel(R) OpenCL Graphics
+     `-- Device #0: Intel(R) Arc(TM) A770 Graphics

-```
-Platform #0: Intel(R) OpenCL Graphics
- `-- Device #0: Intel(R) Arc(TM) A770 Graphics
+
+    Platform #0: Intel(R) OpenCL HD Graphics
+     `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
+    ```

-Platform #0: Intel(R) OpenCL HD Graphics
- `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
-```

-2. Install Intel® oneAPI Base toolkit.
+* Install the Intel® oneAPI Base Toolkit.
+  * Please follow the procedure in [Get the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

-a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
+  * We recommend installing to the default folder: **/opt/intel/oneapi**.

-Recommend to install to default folder: **/opt/intel/oneapi**.
+    This guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.
+  * Activate the oneAPI environment and list the available compute stack:

-Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.
+    ```bash
+    source /opt/intel/oneapi/setvars.sh
+    sycl-ls
+    ```

+    There should be one or more Level Zero devices listed, like **[ext_oneapi_level_zero:gpu:0]**.

+    Output (example):

-b. 
Check + ```bash + [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000] + [opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000] + [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50] + [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918] + ``` -``` -source /opt/intel/oneapi/setvars.sh +* Build locally step-by-step: -sycl-ls -``` + ```bash + mkdir -p build + cd build + source /opt/intel/oneapi/setvars.sh -There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**. + #for FP16 + #cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference -Output (example): -``` -[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000] -[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000] -[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50] -[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918] + #for FP32 + cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -``` + #build example/main only + #cmake --build . --config Release --target main -2. Build locally: + #build all binary + cmake --build . --config Release -v + ``` -``` -mkdir -p build -cd build -source /opt/intel/oneapi/setvars.sh + or -#for FP16 -#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON # faster for long-prompt inference + ```bash + ./examples/sycl/build.sh + ``` -#for FP32 -cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx - -#build example/main only -#cmake --build . 
--config Release --target main

-#build all binary
-cmake --build . --config Release -v
-
-cd ..
-```
+    #build all binaries
+    cmake --build . --config Release -v
+    ```

-or
+    or

-```
-./examples/sycl/build.sh
-```
+    ```bash
+    ./examples/sycl/build.sh
+    ```

-Note:
-
-- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
+  * Note: by default, all binaries are built, which takes more time. To reduce the build time, we recommend building **example/main** only.

### Run

1. Put model file to folder **models**

2. Enable oneAPI running environment

-```
-source /opt/intel/oneapi/setvars.sh
-```
+    ```bash
+    source /opt/intel/oneapi/setvars.sh
+    ```

-3. List device ID
+3. Display list of devices

-Run without parameter:
+    Run without parameters:

-```
-./build/bin/ls-sycl-device
+    ```bash
+    ./build/bin/ls-sycl-device
+    ```

-or
+    or

-./build/bin/main
-```
+    ```bash
+    ./build/bin/main
+    ```

-Check the ID in startup log, like:
+    Check the device ID in the startup log (example below):

-```
-found 4 SYCL devices:
-  Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
-    max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
-  Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
-    max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280
-  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,
-    max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280
-  Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
-    max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
-
-```
+    ```bash
+    found 4 SYCL devices:
+      Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3,
+        max compute_units 512, max work group size 
1024, max sub group size 32, global mem size 16225243136
+      Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
+        max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280
+      Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0,
+        max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280
+      Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
+        max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
+    ```

-|Attribute|Note|
-|-|-|
-|compute capability 1.3|Level-zero running time, recommended |
-|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
+    |Attribute|Note|
+    |-|-|
+    |compute capability 1.3|Level Zero runtime, recommended|
+    |compute capability 3.0|OpenCL runtime, slower than Level Zero in most cases|

-4. Set device ID and execute llama.cpp
+4. Set the device ID and execute llama.cpp.

-Set device ID = 0 by **GGML_SYCL_DEVICE=0**
+    You can set device ID = 0 with **GGML_SYCL_DEVICE=0**:

-```
-GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
-```
-or run by script:
+    ```bash
+    GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
+    ```

-```
-./examples/sycl/run-llama2.sh
-```
+    or run the script:

-Note:
+    ```bash
+    ./examples/sycl/run-llama2.sh
+    ```

-- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
+    Note: by default, mmap is used to read the model file. On some systems this leads to a hang. We recommend the **--no-mmap** parameter to disable mmap() and avoid this issue.

5. Check the device ID in output

-Like:
-```
-Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
-```
+    Example output:

-## Windows
+    ```bash
+    Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
+    ```

-### Setup Environment
-
-1. 
Install Intel GPU driver.
-
-Please install Intel GPU driver by official guide: [Install GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
-
-2. Install Intel® oneAPI Base toolkit.
-
-a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
-
-Recommend to install to default folder: **/opt/intel/oneapi**.
-
-Following guide uses the default folder as example. If you use other folder, please modify the following guide info with your folder.
-
-b. Enable oneAPI running environment:
-
-- In Search, input 'oneAPI'.
-
-Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"
+## Windows

-- In Run:
+### Setup Intel® oneAPI Environment (Prerequisite)

-In CMD:
-```
-"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
-```
+* Install Intel GPU drivers.

-c. Check GPU
+  * You can follow the instructions in the official guide: [Install GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).

-In oneAPI command line:
+* Install the Intel® oneAPI Base Toolkit.
+
+  * Please follow the procedure in [Get the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
+
+  * We recommend installing to the default folder: **C:\Program Files (x86)\Intel\oneAPI**.

-```
-sycl-ls
-```
+  * This guide uses the default folder as an example. If you installed to a different folder, adjust the paths below accordingly.

-There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
+* Enable the oneAPI running environment:
+  * In Search, input 'oneAPI'. 
+ * Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" + * In CMD (Activate oneAPI Environment): -Output (example): -``` -[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000] -[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000] -[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186] -[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044] + ```bash + "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 + ``` -``` + * Verify the compute stack in oneAPI command line: -3. Install cmake & make + ```bash + sycl-ls + ``` -a. Download & install cmake for windows: https://cmake.org/download/ + There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**. -b. Download & install make for windows provided by mingw-w64: https://www.mingw-w64.org/downloads/ + Output (example): + ```bash + [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000] + [opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000] + [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186] + [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044] + ``` -### Build locally: +* Install CMake & Make to build the project. 
  * Download & install CMake for Windows: https://cmake.org/download/
  * Download & install Make for Windows, provided by mingw-w64: https://www.mingw-w64.org/downloads/

### Build Instructions

-In oneAPI command line window:
+  In the oneAPI command line window:

-```
-mkdir -p build
-cd build
-@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
+  ```bash
+  mkdir -p build
+  cd build
+  @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

-:: for FP16
-:: faster for long-prompt inference
-:: cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
+  # for FP16 (faster for long-prompt inference)
+  # cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

-:: for FP32
-cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
+  # for FP32
+  cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release

-:: build example/main only
-:: make main
+  # build example/main only
+  # make main

-:: build all binary
-make -j
-cd ..
-```
+  # build all binaries
+  make -j
+  cd ..
+  ```

-or
+  or

-```
-.\examples\sycl\win-build-sycl.bat
-```
+  ```bash
+  .\examples\sycl\win-build-sycl.bat
+  ```

-Note:
+* Note:

-- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
+  By default, all binaries are built, which takes more time. To reduce the build time, we recommend building **example/main** only.

### Run

-1. Put model file to folder **models**
+* Put the model file into the folder **models**

-2. Enable oneAPI running environment
-
-- In Search, input 'oneAPI'. 
- -Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" - -- In Run: - -In CMD: -``` -"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 -``` +* Enable oneAPI environment + * In Search, input 'oneAPI'. + * Open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" + * In the Command Line: + + ```bash + "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 + ``` + + * Display list of devices + + Run without parameter: -3. List device ID + ```bash + build\bin\ls-sycl-device.exe + ``` -Run without parameter: + or -``` -build\bin\ls-sycl-device.exe + ```bash + build\bin\main.exe + ``` -or + * Check the ID in startup log, like: -build\bin\main.exe -``` + ```bash + found 4 SYCL devices: + Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3, + max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136 + Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2, + max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280 + Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0, + max compute_units 24, max work group size 8192, max sub group size 64, global mem size 67065057280 + Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0, + max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136 -Check the ID in startup log, like: + ``` -``` -found 4 SYCL devices: - Device 0: Intel(R) Arc(TM) A770 Graphics, compute capability 1.3, - max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136 - Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2, - max compute_units 24, max work group size 67108864, max sub group size 64, global mem size 67065057280 - Device 2: 13th Gen Intel(R) Core(TM) i7-13700K, compute capability 3.0, - max compute_units 24, max work group size 8192, max sub group size 64, global mem 
size 67065057280
- Device 3: Intel(R) Arc(TM) A770 Graphics, compute capability 3.0,
-   max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
-
-```
-
-|Attribute|Note|
-|-|-|
-|compute capability 1.3|Level-zero running time, recommended |
-|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
+    |Attribute|Note|
+    |-|-|
+    |compute capability 1.3|Level Zero runtime, recommended|
+    |compute capability 3.0|OpenCL runtime, slower than Level Zero in most cases|

-4. Set device ID and execute llama.cpp
+* Set the device ID and execute llama.cpp

-Set device ID = 0 by **set GGML_SYCL_DEVICE=0**
+  You can set device ID = 0 with **set GGML_SYCL_DEVICE=0**:

-```
-set GGML_SYCL_DEVICE=0
-build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
-```
-or run by script:
+  ```bash
+  set GGML_SYCL_DEVICE=0
+  build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
+  ```

-```
-.\examples\sycl\win-run-llama2.bat
-```
+  or run the script:

-Note:
+  ```bash
+  .\examples\sycl\win-run-llama2.bat
+  ```

-- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
+  Note: by default, mmap is used to read the model file. On some systems this leads to a hang. We recommend the **--no-mmap** parameter to disable mmap() and avoid this issue.

+* Check the device ID in output

+  Example output:
-
-5. 
Check the device ID in output

-Like:
-```
-Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
-```
+  ```bash
+  Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
+  ```

## Environment Variable

-#### Build
+### Build

|Name|Value|Function|
|-|-|-|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|Use icpx/icx for the SYCL code path|

-#### Running
-
+### Running
|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the device IDs in the default run output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable the log function via the macro GGML_SYCL_DEBUG|

-## Known Issue
+## Known Issues and Steps to troubleshoot

-- Hang during startup
+* Hang during startup

  llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy misbehaves and blocks.

  Solution: add **--no-mmap**.

-## Q&A
-
-- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
+* Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

  The oneAPI running environment is not enabled. Install the oneAPI Base Toolkit and enable it with `source /opt/intel/oneapi/setvars.sh`.

-- In Windows, no result, not error.
+* On Windows, there is no output and no error.

  The oneAPI running environment is not enabled.

## Todo

-- Support to build in Windows.
-
-- Support multiple cards.
+* Support multiple cards.
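As a footnote to the environment-variable tables above, the device-selection convention can be sketched in a few lines of shell. `GGML_SYCL_DEVICE` is the real variable documented above; the launcher script itself is only an illustration, not code from this repository.

```shell
#!/bin/sh
# Illustrative launcher sketch: resolve the SYCL device ID the way the
# docs describe, defaulting to device 0 when GGML_SYCL_DEVICE is unset.
unset GGML_SYCL_DEVICE            # demonstrate the default case
DEVICE_ID="${GGML_SYCL_DEVICE:-0}"
echo "using SYCL device ${DEVICE_ID}"

# A real invocation would then look like:
#   GGML_SYCL_DEVICE="${DEVICE_ID}" ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -ngl 33
```

Exporting `GGML_SYCL_DEVICE=1` before the same expansion yields `1`, matching the table's description of the variable.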