Replies: 1 comment
This would be really interesting - initially, like you suggest, it can be a separate engine, but it would be nice to also have part of it in the core. The downside I see is that it could push the image sizes even further. LocalAI already supports remote gRPC backends, which might come in handy exactly for this. In the long run I'm thinking of having engines that can be installed/uninstalled dynamically, which might tie into this. Compiling to TRT could even be done automatically when the model loads for the first time, which should reduce friction in usage (e.g. we would need to introduce a new "compile" action, which would be quite specific in this case).
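As a rough illustration of the compile-on-first-load idea, here is a minimal sketch of how a backend could cache a built engine per model/GPU pair and only trigger the expensive build when no cached engine exists. The cache layout and the `build_fn` hook are hypothetical, not existing LocalAI APIs.

```python
import hashlib
import os
from typing import Callable

# Hypothetical cache layout: one serialized engine per (model, GPU) pair,
# keyed by a hash so engines built for different GPUs never collide.
def cached_engine_path(model_path: str, gpu_name: str,
                       cache_dir: str = "/tmp/trt-cache") -> str:
    key = hashlib.sha256(f"{model_path}:{gpu_name}".encode()).hexdigest()[:16]
    return os.path.join(cache_dir, f"{key}.engine")

def load_or_compile(model_path: str, gpu_name: str,
                    build_fn: Callable[[str, str], None]) -> str:
    """Return the path to a compiled engine, building it only on first load."""
    engine_path = cached_engine_path(model_path, gpu_name)
    if not os.path.exists(engine_path):
        os.makedirs(os.path.dirname(engine_path), exist_ok=True)
        build_fn(model_path, engine_path)  # the slow, one-time TRT build step
    return engine_path
```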
Wanted to get some community feedback on interest in a TensorRT backend.
TensorRT should be quite a bit faster than GGML/GGUF for folks with NVIDIA hardware. However, it comes with the tradeoff that a model needs to be specifically compiled for the machine's GPU. This step isn't terribly difficult, especially for raw PyTorch and/or TRT models.
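For a sense of what that per-machine compile step involves, here is a minimal sketch using the standard `tensorrt` Python builder API with an ONNX export. TensorRT-LLM has its own build flow for LLMs, so treat this as illustrative only; the file paths are placeholders.

```python
import tensorrt as trt

# Compile an ONNX-exported model into a serialized TensorRT engine.
# The resulting engine is specific to the GPU (and TensorRT version)
# it was built on, which is the per-machine compile step mentioned above.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # optional: enable FP16 kernels

serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:        # placeholder path
    f.write(serialized_engine)
```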
I am not thinking, at first, that this should be built into LocalAI; rather, it could be a standalone engine for now, probably utilizing TensorRT and TensorRT-LLM. This project also shows how simple it could be to set up the gRPC for TensorRT.
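To give an idea of how small such a standalone engine could be, here is a bare-bones sketch of a Python gRPC backend. It assumes stubs generated from LocalAI's backend.proto with grpcio-tools; the message and method names shown are placeholders, not verified against the current proto.

```python
from concurrent import futures
import grpc

# Assumed to be generated from LocalAI's backend.proto (placeholder names).
import backend_pb2
import backend_pb2_grpc

class TensorRTBackend(backend_pb2_grpc.BackendServicer):
    def LoadModel(self, request, context):
        # Load (or compile-then-load) the TRT engine for the requested model here.
        return backend_pb2.Result(success=True)

    def Predict(self, request, context):
        # Run inference with the TensorRT / TensorRT-LLM runtime here.
        return backend_pb2.Reply(message=b"")

def serve(address: str = "127.0.0.1:50051") -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    backend_pb2_grpc.add_BackendServicer_to_server(TensorRTBackend(), server)
    server.add_insecure_port(address)
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
```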
It would be nice, though, for LocalAI to provide some UX/model-download hooks; that way you could check a "compile to TRT" option when downloading a model, which would make it automatic for most users.
Someone's benchmark of llama.cpp vs TensorRT-LLM: https://hackmd.io/@janhq/benchmarking-tensorrt-llm (spoiler: they found a 50-60% speedup over llama.cpp with Mistral 7B v0.2 GGUF Q4_K_M).

I'm interested in putting some work into this. Mostly curious how interested the maintainers of LocalAI are in this (longer-term), as well as how interested the community would be.