This repository has been archived by the owner on Aug 10, 2024. It is now read-only.
vLLM has a cousin named aphrodite, which adds EXL2, GGUF, and a bunch of other handy features to it.

The good news is that the wheels already work out of the box on both the P100 and P40, so no patching is needed there. EXL2 with batching on the P100 is a creative writing beast, and GGUF on the P40 with batching is 2-3x faster than without.

The bad news is that if you try to run it naively, the "Triton doesn't care about Pascal" bugs will bite you. After a few calls, the engine would simply hang on one of my P100s, consuming 100% GPU and never completing the API call. It's difficult to even terminate the process.

Installing the patched Triton from this repo appears to both improve performance by almost 1.5x and fix the hangs; at least, I have not had any problems since the switch. Wondering if this is worth documenting in case others have similar aspirations of batching with SOTA quants on Pascal GPUs.
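For anyone wanting to try the same setup, a rough sketch of the steps is below. The wheel filename, model path, and flag values are illustrative assumptions, not exact instructions; check the repo's releases for the actual patched Triton wheel.

```shell
# Sketch only: filenames, paths, and flag values below are assumptions.

# 1. Install aphrodite (brings in its own stock Triton as a dependency).
pip install aphrodite-engine

# 2. Overwrite the stock Triton with the Pascal-patched wheel from this repo.
#    (The wheel filename here is a placeholder.)
pip install --force-reinstall ./triton-*.whl

# 3. Launch the OpenAI-compatible API server with a GGUF model,
#    vLLM-style. Model path and context length are illustrative.
python -m aphrodite.endpoints.openai.api_server \
    --model /models/example-model.Q4_K_M.gguf \
    --max-model-len 4096
```

The key point is step 2: the patched Triton must replace the dependency-resolved one after aphrodite is installed, or pip may pull the stock build back in.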