Idea: Support distributed computing inference serialized over layer-subset groups. #542
ghchris2021 started this conversation in Ideas
- I'd like to switch from llama.cpp, but I need this feature! There's significant room for improvement vs. their implementation too!
Idea: Support distributed computing inference serialized over layer-subset groups.
The primary use case would be enabling models larger than any single system's available RAM / VRAM to be inferenced at relatively high speed, by pooling the compute capacity of any number of networked (e.g. LAN) computers, each contributing some available memory, some CPU compute, and possibly some GPU/VRAM resources.
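For illustration only (the host names, memory figures, and the plan_layer_groups() helper below are hypothetical, not an existing API of this project), a coordinator could assign contiguous layer groups to hosts in proportion to the memory each host reports as available:

```python
# Sketch: split a model's layers into contiguous groups, sized in proportion
# to the memory each networked host reports as available.
# All names and figures here are illustrative assumptions, not an existing API.

def plan_layer_groups(n_layers, hosts):
    """hosts: list of (name, free_bytes). Returns {name: (first_layer, last_layer)}."""
    total_mem = sum(free for _, free in hosts)
    plan = {}
    next_layer = 0
    for i, (name, free) in enumerate(hosts):
        if i == len(hosts) - 1:
            count = n_layers - next_layer  # last host takes the remainder
        else:
            count = max(1, round(n_layers * free / total_mem))
            # leave at least one layer for each remaining host
            count = min(count, n_layers - next_layer - (len(hosts) - 1 - i))
        plan[name] = (next_layer, next_layer + count - 1)
        next_layer += count
    return plan

if __name__ == "__main__":
    # e.g. an 80-layer ~70B model spread over three LAN machines
    hosts = [("desktop-4090", 24e9), ("laptop-3080", 16e9), ("old-pc-cpu", 32e9)]
    for host, (lo, hi) in plan_layer_groups(80, hosts).items():
        print(f"{host}: layers {lo}-{hi}")
```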
There are now many compelling open-weights models in the 30B to well over 200B parameter range which, generally speaking, are too large for almost all users to inference effectively on any single commonly available laptop / desktop / modest workstation, because of limited VRAM capacity on the GPU side and limited RAM capacity and bandwidth on the CPU side.
It is well known that typical laptop / desktop / workstation GPUs have VRAM bandwidth roughly an order of magnitude (O(10x)) higher than the system's RAM bandwidth. It is also well known that most personal desktop systems have no more than 1-2 directly attached GPUs, with total VRAM capacity typically in the 8-48 GB range.
However, a given person, household, or small business commonly has access to two or more laptop / desktop computers, each with significant CPU/RAM resources and, in many cases, a modest but usefully capable GPU (e.g. 8-24 GB VRAM).
Other FOSS attempts at distributed LLM inference have demonstrated that this is a practically viable way to inference LLMs using much the same codebase / logic as single-computer inference with zero, one, or multiple GPUs; e.g. "petals", llama.cpp's "RPC server" mode, et al.
In the professional / enterprise / data center space, distributed computing is widely used by mainstream training and inference frameworks so that multiple servers, each with multiple GPUs, can be used as a group, gaining capacity scaling thereby.
I believe the extra "business logic" needed to support this could be small for a platform that already supports single-system inference on multi-core CPUs and on single systems with multiple locally attached GPUs.
Several low-level frameworks, e.g. MPI, PoCL, et al., already support / facilitate distributed computing as a primary or first-class optional capability.
And since we are discussing, at minimum, a coarsely sharded distribution of model inference among several servers, the actual logic to inference a single layer or a group of adjacent layers is not technically very different from the single-host or single-GPU case.
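As a minimal sketch of that point (mpi4py and NumPy are used purely as one possible transport; run_local_layers() and the embedding/logits stand-ins are hypothetical placeholders for whatever the engine already does on one host), each host runs only its own contiguous layer group and forwards the hidden-state activations to the next host in the chain:

```python
# A minimal sketch of one token's forward pass serialized over layer-subset groups,
# with one MPI rank per host. mpi4py/NumPy are an illustrative transport only;
# run_local_layers() stands in for the engine's existing per-host layer code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
assert size >= 2, "run with at least two ranks/hosts, e.g. mpirun -np 3 ..."

HIDDEN = 4096  # illustrative hidden size


def run_local_layers(hidden):
    # Placeholder: the real engine would run this rank's resident group of layers here.
    return hidden


def generate_token(token_ids):
    if rank == 0:
        # First host: embed the prompt tokens (stand-in), run its layer group, pass activations on.
        hidden = np.zeros((len(token_ids), HIDDEN), dtype=np.float32)
        comm.send(run_local_layers(hidden), dest=1)
        return comm.recv(source=size - 1)  # sampled token comes back from the last host
    elif rank < size - 1:
        # Middle hosts: receive activations, run local layers, forward the result.
        hidden = comm.recv(source=rank - 1)
        comm.send(run_local_layers(hidden), dest=rank + 1)
    else:
        # Last host: finish the layer stack, produce logits (stand-in), send a token id back.
        hidden = run_local_layers(comm.recv(source=rank - 1))
        comm.send(int(np.argmax(hidden[-1])), dest=0)


if __name__ == "__main__":
    tok = generate_token([1, 2, 3])
    if rank == 0:
        print("next token id:", tok)
```

At typical hidden sizes the per-hop payload is only tens of kilobytes per generated token, so ordinary gigabit LAN links are generally not the bottleneck for serialized decoding.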
Models of key interest for such scaling would include all typical current-generation 70B-range models, models in the 100-140B range, models in the 200B+ range (e.g. deepseek-v2, command-r+, dbrx, etc.), and MoE models requiring 70-200B+ of memory (e.g. Mixtral 8x22B, et al.).
Eliminating the use of swap on a single host can frequently improve effective performance by 1+ orders of magnitude.
Eliminating slow CPU/RAM-based inference by aggregating several "modest" GPUs + VRAM to serve a single model can also typically improve performance by O(10x) vs. even a common modern fast desktop PC with 20-50 GB/s RAM bandwidth, 6-16 CPU cores, etc.
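To put rough numbers on that claim (all figures below are illustrative assumptions; token-by-token decoding is approximately memory-bandwidth-bound, so tokens/s is roughly effective bandwidth divided by the bytes of weights read per token):

```python
# Back-of-the-envelope only: decode speed is roughly memory-bandwidth-bound, so
# tokens/s ~= bandwidth / bytes of weights touched per token. Figures are illustrative.
def tokens_per_sec(weight_bytes, bandwidth_bytes_per_sec):
    return bandwidth_bytes_per_sec / weight_bytes

model_bytes = 40e9  # e.g. a ~70B model quantized to ~4.5 bits/weight

# Single desktop reading weights from ~40 GB/s system RAM -> ~1 tok/s
print(tokens_per_sec(model_bytes, 40e9))

# Same weights split across a few GPUs, each reading its layer group from
# ~400 GB/s VRAM; stages run serially, so total time is still total bytes / VRAM BW
# (ignoring network latency) -> ~10 tok/s
print(tokens_per_sec(model_bytes, 400e9))
```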