Llama support #32
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Does "sharded llama models" imply that this could cover 70b as well?
For now I have only tested with an 8B model on a single-host TPU v5e 4x2. I reckon we would need something bigger to test the 70b model.
Do you need help with capacity?
Imported from transformers v4.40.1.
This essentially copies commit 8a4a98d2472b8e0180eb9bd4a1824f983e220811 from optimum-neuron, which fixed the same problem.
LGTM! 👏🏻
@pytest.mark.slow
def test_distributed_model_prefill_llama3_8b():
    _test_distributed_model_prefill("meta-llama/Meta-Llama-3-8B")
Should we start parametrizing the test rather than duplicating it?
I should definitely do a bit of refactoring on the tests to avoid duplication... will do later if you agree.
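For reference, a rough sketch of what a parametrized version might look like, assuming _test_distributed_model_prefill keeps its current signature; the list contents other than the Llama 3 8B id are placeholders, not taken from this PR:

import pytest

# Hypothetical refactor sketch: one parametrized test instead of duplicated
# per-model tests. Only the Llama 3 8B id comes from this PR; other ids from
# the existing tests would be appended to the list.
@pytest.mark.slow
@pytest.mark.parametrize(
    "model_id",
    [
        "meta-llama/Meta-Llama-3-8B",
        # model ids from the other duplicated tests would go here
    ],
)
def test_distributed_model_prefill(model_id):
    _test_distributed_model_prefill(model_id)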
pad_token_id = self.tokenizer.pad_token_id
if pad_token_id is None:
    if isinstance(self.tokenizer.eos_token_id, list):
        pad_token_id = self.tokenizer.eos_token_id[0]
    else:
        pad_token_id = self.tokenizer.eos_token_id
This block doesn't seem to depend on i; should we move it outside the for loop? A rough sketch of what that could look like is below.
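A minimal sketch of the hoisted version, assuming the block currently sits inside a per-request loop; the enclosing method name prefill and the requests argument are placeholders for illustration, not the actual signature:

def prefill(self, requests):  # hypothetical enclosing method, for illustration only
    # Resolve pad_token_id once, before the loop, since it does not depend on i.
    pad_token_id = self.tokenizer.pad_token_id
    if pad_token_id is None:
        # Some tokenizers expose a list of eos token ids; fall back to the first one.
        if isinstance(self.tokenizer.eos_token_id, list):
            pad_token_id = self.tokenizer.eos_token_id[0]
        else:
            pad_token_id = self.tokenizer.eos_token_id

    for i, request in enumerate(requests):
        # per-request logic can now use pad_token_id without recomputing it
        ...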
What does this PR do?
This adds support for sharded Llama models on TPU, tested on a TPU v5e litepod-8.
A test that shows inference with Llama3-8b on TGI has been added.
Before submitting