This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 04 21 #198

Closed
wants to merge 65 commits into nm-vllm:main from upstream-sync-2024-04-21

Conversation

robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat commented Apr 22, 2024

Upstream sync 2024 04 21 (#198)

SUMMARY:
Merge commits from 7fd3949 to a37d815 into nm-vllm/main

Note that 7fd3949 is NOT included in this merge.

rkooo567 and others added 30 commits April 21, 2024 23:45
mmoskal and others added 28 commits April 21, 2024 23:47
…roject#4118)

Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
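The per-tensor scheme described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual Fp8LinearMethod code: real kernels cast to an FP8 dtype, while here we only divide by the scale and clip to simulate the dynamic range. The 448.0 bound is the maximum representable magnitude of the FP8 E4M3 format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3

def quantize_per_tensor(t: np.ndarray):
    """Compute a per-tensor scale and map the tensor into FP8 range.

    A real implementation would cast the result to an FP8 dtype;
    clipping here just stands in for that narrowing.
    """
    scale = np.abs(t).max() / FP8_E4M3_MAX
    q = np.clip(t / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

np.random.seed(0)

# Weights: quantized once after checkpoint load; the scale is stored for reuse.
w = np.random.randn(4, 4).astype(np.float32)
w_q, w_scale = quantize_per_tensor(w)

# Activations: the scale is recomputed on every forward pass.
x = np.random.randn(4, 4).astype(np.float32)
x_q, x_scale = quantize_per_tensor(x)

# A scaled matmul folds both per-tensor scales back into the output.
y = (x_q @ w_q) * (x_scale * w_scale)
```

Note that with a single per-tensor scale the quantization is symmetric: the largest-magnitude element maps exactly to ±448, and everything else is scaled proportionally.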

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
@robertgshaw2-redhat robertgshaw2-redhat deleted the upstream-sync-2024-04-21 branch April 30, 2024 00:43