This repository has been archived by the owner on Aug 10, 2024. It is now read-only.

Report on aphrodite-engine #5

Open
the-crypt-keeper opened this issue Jun 27, 2024 · 1 comment

Comments

@the-crypt-keeper

vLLM has a cousin named Aphrodite (aphrodite-engine), which adds EXL2, GGUF, and a bunch of other handy features on top of it.

The good news is that the wheels already work out of the box on both P100 and P40, so no patching is needed there. EXL2 with batching on P100 is a creative writing beast, and GGUF on P40 with batching is 2-3x faster than without.

The bad news is that if you try to run it naively, the "Triton doesn't care about Pascal" bugs will bite you. After a few calls, the engine would simply hang on one of my P100s, consuming 100% GPU and never completing the API call. It's difficult to even terminate the process.

Installing the patched Triton from this repo appears to both improve performance by almost 1.5x and fix the hangs; at least, I have not had any problems since the switch. I'm wondering if this is worth documenting in case others have similar aspirations of batching with SOTA quants on Pascal GPUs.
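
For anyone unsure whether this applies to their box, here is a rough sketch of a pre-flight check. It relies on the fact that Pascal cards report CUDA compute capability 6.x (P100 is 6.0, P40 is 6.1). The helper names `is_pascal`, `installed_version`, and `pascal_devices` are made up for illustration, and the version string alone may not tell you whether the patched Triton build is the one installed, so treat the output as a hint only.

```python
from importlib import metadata


def is_pascal(capability):
    """Return True if a (major, minor) compute capability is Pascal (6.x)."""
    major, _minor = capability
    return major == 6


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pascal_devices():
    """List (index, name) for visible Pascal GPUs.

    Returns an empty list if torch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i))
        for i in range(torch.cuda.device_count())
        if is_pascal(torch.cuda.get_device_capability(i))
    ]


if __name__ == "__main__":
    print("triton version:", installed_version("triton"))
    for index, name in pascal_devices():
        # Assumption based on this thread: stock Triton can hang on Pascal.
        print(f"GPU {index} ({name}) is Pascal; consider the patched Triton")
```

If the script lists any GPU, the hang described above is a possibility on that machine with an unpatched Triton.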

@sasha0552
Owner

Thanks, added to README.md. Maybe we should ask the aphrodite-engine developers to add this repository to their documentation?
