This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 04 21 #198

Closed
wants to merge 65 commits into nm-vllm:main from upstream-sync-2024-04-21

Conversation

robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat commented Apr 22, 2024

Upstream sync 2024 04 21 (#198)

SUMMARY:
Merge commits from 7fd3949 to a37d815 into nm-vllm/main

Note that 7fd3949 is NOT included in this merge.

rkooo567 and others added 30 commits April 21, 2024 23:45
mmoskal and others added 28 commits April 21, 2024 23:47
…roject#4118)

Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
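The per-tensor scheme described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual Fp8LinearMethod code: real kernels cast to an FP8 dtype, while here we only divide by the scale and clip to simulate the dynamic range. The 448.0 bound is the maximum representable magnitude of the FP8 E4M3 format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3

def quantize_per_tensor(t: np.ndarray):
    """Compute a per-tensor scale and map the tensor into FP8 range.

    A real implementation would cast the result to an FP8 dtype;
    clipping here just stands in for that narrowing.
    """
    scale = np.abs(t).max() / FP8_E4M3_MAX
    q = np.clip(t / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

np.random.seed(0)

# Weights: quantized once after checkpoint load; the scale is stored for reuse.
w = np.random.randn(4, 4).astype(np.float32)
w_q, w_scale = quantize_per_tensor(w)

# Activations: the scale is recomputed on every forward pass.
x = np.random.randn(4, 4).astype(np.float32)
x_q, x_scale = quantize_per_tensor(x)

# A scaled matmul folds both per-tensor scales back into the output.
y = (x_q @ w_q) * (x_scale * w_scale)
```

Note that with a single per-tensor scale the quantization is symmetric: the largest-magnitude element maps exactly to ±448, and everything else is scaled proportionally.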

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
@robertgshaw2-redhat robertgshaw2-redhat deleted the upstream-sync-2024-04-21 branch April 30, 2024 00:43