This repository has been archived by the owner on Aug 10, 2024. It is now read-only.
vLLM has a cousin named aphrodite, which adds EXL2, GGUF, and a bunch of other handy features to it.

The good news is that the wheels already work out of the box on both the P100 and P40, so no patching is needed there. EXL2 with batching on the P100 is a creative writing beast, and GGUF on the P40 with batching is 2-3x faster than without.

The bad news is that if you try to run it naively, the "Triton doesn't care about Pascal" bugs will bite you. After a few calls, the engine would simply hang on one of my P100s, consuming 100% GPU and never completing the API call. It's difficult to even terminate the process.

Installing the patched Triton from this repo appears to both improve performance by almost 1.5x and fix the hangs; at least, I have not had any problems since the switch. Wondering if this is worth documenting in case others have similar aspirations of batching with SOTA quants on Pascal GPUs.
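For anyone wanting to try the same setup, a rough sketch of the steps is below. The wheel filename, model path, and flag values are illustrative assumptions, not exact instructions; check the repo's releases for the actual patched Triton wheel.

```shell
# Sketch only: filenames, paths, and flag values below are assumptions.

# 1. Install aphrodite (brings in its own stock Triton as a dependency).
pip install aphrodite-engine

# 2. Overwrite the stock Triton with the Pascal-patched wheel from this repo.
#    (The wheel filename here is a placeholder.)
pip install --force-reinstall ./triton-*.whl

# 3. Launch the OpenAI-compatible API server with a GGUF model,
#    vLLM-style. Model path and context length are illustrative.
python -m aphrodite.endpoints.openai.api_server \
    --model /models/example-model.Q4_K_M.gguf \
    --max-model-len 4096
```

The key point is step 2: the patched Triton must replace the dependency-resolved one after aphrodite is installed, or pip may pull the stock build back in.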