
Optimize LLaMA for inference #513

Merged: 4 commits merged into main on Nov 14, 2023

Conversation

@mryab (Member) commented Sep 18, 2023

Similar to #500, this PR aims to speed up Llama models by making the following optimizations relative to the original Transformers implementation:

  • Position indices are not generated, because rotary embeddings do not need them: the prefix length and the number of newly encoded tokens are sufficient (see the sketch after this list)
  • All operations before and after the attention layer (i.e., RMS normalization and the MLP) are fused into a single CUDA graph
  • Similarly, the rotary positional embedding function is fused into a CUDA graph
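
For illustration, here is a minimal sketch (not the petals code) of the first point, assuming a precomputed rotary cos/sin cache of shape [max_seq_len, head_dim]: the slice for the new tokens is fully determined by the prefix length and the number of tokens being encoded, so no position_ids tensor has to be built.

    import torch

    def rotary_cos_sin(cos_cache: torch.Tensor, sin_cache: torch.Tensor,
                       prefix_length: int, num_new_tokens: int):
        # New tokens occupy positions prefix_length .. prefix_length + num_new_tokens - 1,
        # so a contiguous slice of the cache replaces an index_select over explicit
        # position indices. Function name and cache layout are assumptions for this sketch.
        end = prefix_length + num_new_tokens
        return cos_cache[prefix_length:end], sin_cache[prefix_length:end]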

Additionally, this PR introduces a petals.utils.cuda_graphs.make_inference_graphed_callable function that converts any inference-mode callable into its CUDA-graph version. It is meant as an alternative to torch.cuda.make_graphed_callables that does not attempt to build a graph for the backward pass: inference runs under inference_mode, so the original function fails there (this is also why the Falcon PR used custom graph tracing).
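
As a rough illustration of how such an inference-only wrapper can be built, here is a minimal sketch following the standard CUDA-graph capture recipe (warm-up on a side stream, capture into static buffers, replay with copied-in inputs). It is not the actual petals implementation; the warm-up count and tensor-only argument handling are assumptions.

    import torch

    def make_inference_graphed_callable_sketch(fn, sample_args):
        static_inputs = tuple(arg.clone() for arg in sample_args)

        # Warm up on a side stream so lazy kernel/allocator work is not captured.
        side_stream = torch.cuda.Stream()
        side_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side_stream), torch.inference_mode():
            for _ in range(3):
                fn(*static_inputs)
        torch.cuda.current_stream().wait_stream(side_stream)

        # Capture a single invocation into a CUDA graph using the static buffers.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph), torch.inference_mode():
            static_outputs = fn(*static_inputs)

        def replay(*args):
            # Copy fresh inputs into the captured buffers and replay the graph;
            # results come back in the same static output buffers.
            for static_arg, arg in zip(static_inputs, args):
                static_arg.copy_(arg)
            graph.replay()
            return static_outputs

        return replay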

borzunov changed the title from "[WIP] Optimize LLaMa for inference" to "[WIP] Optimize Llama for inference" on Sep 19, 2023
@borzunov (Collaborator) commented Sep 20, 2023

Benchmarks: this PR improves inference speed by +44%

Model: Stable Beluga 2 (70B)
GPU: NVIDIA RTX 6000 Ada Generation

main @ a2484b3:

Sep 20 03:08:50.845 [INFO] Inference throughput: 750.6 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)                      
Sep 20 03:09:04.064 [INFO] Forward pass throughput: 48486.8 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

optimize_llama @ f332b0e:

Sep 20 03:10:13.415 [INFO] Inference throughput: 1078.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 20 03:10:26.583 [INFO] Forward pass throughput: 48003.5 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)

mryab requested a review from borzunov on Sep 20, 2023
mryab marked this pull request as ready for review on Sep 20, 2023
mryab changed the title from "[WIP] Optimize Llama for inference" to "Optimize Llama for inference" on Sep 20, 2023
mryab changed the title from "Optimize Llama for inference" to "Optimize LLaMA for inference" on Oct 8, 2023
justheuristic merged commit 03cbe90 into main on Nov 14, 2023 (11 checks passed)
justheuristic deleted the optimize_llama branch on Nov 14, 2023
@poedator (Collaborator) commented:
While testing with TinyLlama for an unrelated reason, I caught the error "too many indices for tensor of dimension 2".
It happened on this line: cos = cos[:, :, kv_seq_len - q_len :]
https://github.com/bigscience-workshop/petals/pull/513/files#diff-492af4f870c9613ff6b5fce973ddd1d75bf135b30f40a7cb83f455c4f0e72ea6R87
Env: Transformers 4.35.2
Test run for reference: https://github.com/bigscience-workshop/petals/actions/runs/6950529337/job/18910867509?pr=545 (see line 2755)
@mryab ?
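
For context, a hedged, self-contained illustration of what this error is consistent with (the exact Transformers 4.35 cos/sin layout is an assumption here, not something stated in this thread): older versions cached rotary cos/sin as 4-D [1, 1, seq_len, head_dim] tensors, while a 2-D [seq_len, head_dim] return value would make the slice above fail in exactly this way.

    import torch

    kv_seq_len, q_len, head_dim = 8, 1, 64
    cos = torch.ones(kv_seq_len, head_dim)      # assumed 2-D layout from newer Transformers
    try:
        cos[:, :, kv_seq_len - q_len :]         # the PR's slice, written for the 4-D layout
    except IndexError as e:
        print(e)                                # "too many indices for tensor of dimension 2"
    cos = cos[..., kv_seq_len - q_len :, :]     # rank-agnostic slice works for both layouts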
