
[WIP] FP8 support. #1484

Closed · wants to merge 7 commits

Conversation

@Narsil (Collaborator) commented Jan 25, 2024

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@flozi00 (Contributor) commented Jan 29, 2024

Would it be worth discussing the Transformer Engine implementation, since they say it's compatible with Ada Lovelace and higher? The comment in your code says it needs Hopper+.

source: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html#:~:text=Support%20for%20FP8%20on%20NVIDIA%20Hopper%20and%20NVIDIA%20Ada%20GPUs

The implementation seems to be very easy, as it's already done in accelerate:
https://github.com/huggingface/accelerate/blob/7aafa25673f4324cbcaeef508a13e09576f12dc4/src/accelerate/utils/transformer_engine.py#L30-L41
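
For reference, a minimal sketch of the module-swap approach used there (paraphrased, not the accelerate code verbatim; `convert_linears_to_te` is an illustrative name):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def convert_linears_to_te(model: nn.Module) -> None:
    """Recursively swap nn.Linear for te.Linear so matmuls can run in FP8."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            # Only swap layers whose dimensions are multiples of 16 (an FP8
            # GEMM requirement); other layers are silently left as-is.
            if child.in_features % 16 == 0 and child.out_features % 16 == 0:
                te_linear = te.Linear(
                    child.in_features,
                    child.out_features,
                    bias=child.bias is not None,
                )
                with torch.no_grad():
                    te_linear.weight.copy_(child.weight)
                    if child.bias is not None:
                        te_linear.bias.copy_(child.bias)
                setattr(model, name, te_linear)
        else:
            convert_linears_to_te(child)
```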

@Narsil (Collaborator, Author) commented Jan 29, 2024

> Would it be worth discussing the Transformer Engine implementation, since they say it's compatible with Ada Lovelace and higher? The comment in your code says it needs Hopper+.

Well, it works wherever FP8 works; H100s are the most common cluster GPUs, I think.

> source: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html#:~:text=Support%20for%20FP8%20on%20NVIDIA%20Hopper%20and%20NVIDIA%20Ada%20GPUs

> The implementation seems to be very easy, as it's already done in accelerate: https://github.com/huggingface/accelerate/blob/7aafa25673f4324cbcaeef508a13e09576f12dc4/src/accelerate/utils/transformer_engine.py#L30-L41

The conversion is silently skipped when dimensions are not 16-aligned; that is not acceptable in TGI.

There are two issues with the current PR implementation: every call packs the fp16 buffer before calling the GEMM, which is a slowdown on every call, plus the padding to 16.

That's what makes it slow, and TransformerEngine is just as slow: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/linear.py#L91C13-L92

So AFAIK there's no win to using Transformer Engine over this implementation.
The real work will be removing those unnecessary ops (unpacking and padding).
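
To make that concrete, a rough sketch of the per-call pattern being criticized, assuming a hypothetical 16-aligned kernel (`fp8_gemm` is a placeholder, not a real TGI or TE function):

```python
import torch
import torch.nn.functional as F

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # x: (m, k) fp16 activations; w_fp8: pre-quantized FP8 weight.
    m = x.shape[0]
    pad = (-m) % 16
    if pad:
        # Per-call padding so the kernel sees 16-aligned rows.
        x = F.pad(x, (0, 0, 0, pad))
    # Per-call "packing": cast fp16 activations down to FP8.
    x_fp8 = x.to(torch.float8_e4m3fn)
    out = fp8_gemm(x_fp8, w_fp8, scale)  # hypothetical 16-aligned FP8 GEMM
    # Per-call "unpacking": slice the padding back off.
    return out[:m]
```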

And since LLMs mostly operate at seqlen < 16 (during decode this is essentially the batch size), we also most likely need a customized GEMV to make things faster (this size limitation is very surprising to me).
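
For illustration, the decode-time shape in question (sizes are made up):

```python
import torch

# During decode, each step feeds one token per sequence, so activations are
# (batch_size, hidden) with batch_size typically well below 16.
batch_size, hidden, out_features = 4, 4096, 4096
x = torch.randn(batch_size, hidden, dtype=torch.float16, device="cuda")
w = torch.randn(out_features, hidden, dtype=torch.float16, device="cuda")

y = x @ w.t()  # effectively a batch of GEMVs, one per sequence
# A 16-row-aligned FP8 GEMM would pad batch_size 4 -> 16, computing 4x the
# useful rows; a dedicated FP8 GEMV kernel avoids that padding entirely.
```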

@flozi00 (Contributor) commented Feb 1, 2024

The pytorch-labs implementation will have inference support later:

pytorch-labs/float8_experimental#187 (comment)

I am also checking the TensorRT-LLM implementation of FP8.

github-actions bot added the Stale label Mar 3, 2024
github-actions bot closed this Mar 9, 2024