Parrot is a distributed serving system for LLM-based applications. It can be divided into three layers:
- Application Layer:
  - Parrot's LLM programming frontend: PFunc.
  - Semantic Variable.
- Serve Layer:
  - ServeCore, a.k.a. Parrot Manager.
  - Global Scheduler.
  - Parrot's Graph Representation.
  - Parrot's Graph Executor, read how Parrot efficiently executes a DAG of requests.
  - Context, read the cluster-level memory management of Parrot.
  - Engines, read the management of engines.
  - Sessions, read the management of sessions.
  - Semantic Variable Manager, read the management of Semantic Variable.
- Engine Layer:
  - Internal APIs between ServeCore and Engine.
  - Builtin Engine.
  - OpenAI Engine.
  - Shared Attention Kernel.
The Parrot API with Semantic Variable is served by a centralized cluster manager called ServeCore, which manages many Engine instances.
ServeCore serves the Parrot APIs with Semantic Variable. It is also responsible for managing everything in the cluster and for scheduling requests (GlobalScheduler). Most optimizations and scheduling strategies in Parrot are implemented in ServeCore.
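As a rough illustration of the centralized-manager role described above, the sketch below models a manager that keeps a registry of engines and dispatches each request to the least-loaded engine serving the requested model. All names here (ToyServeCore, EngineInfo, register_engine, dispatch), the model name, and the least-loaded policy are assumptions made for this example. They are not Parrot's actual classes or its scheduling strategy.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EngineInfo:
    """Hypothetical record describing one registered engine."""
    engine_id: int
    model: str
    pending_requests: int = 0


class ToyServeCore:
    """Toy cluster manager: registers engines and dispatches requests."""

    def __init__(self) -> None:
        self._engines: Dict[int, EngineInfo] = {}
        self._next_id = 0

    def register_engine(self, model: str) -> int:
        """Add an engine serving `model` and return its assigned id."""
        engine_id = self._next_id
        self._next_id += 1
        self._engines[engine_id] = EngineInfo(engine_id, model)
        return engine_id

    def dispatch(self, model: str) -> int:
        """Pick the least-loaded engine for `model` (assumed policy)."""
        candidates = [e for e in self._engines.values() if e.model == model]
        if not candidates:
            raise LookupError(f"no engine serves model {model!r}")
        # Break ties by engine id so the choice is deterministic.
        chosen = min(candidates, key=lambda e: (e.pending_requests, e.engine_id))
        chosen.pending_requests += 1
        return chosen.engine_id


core = ToyServeCore()
core.register_engine("llama-7b")  # illustrative model name
core.register_engine("llama-7b")
placements = [core.dispatch("llama-7b") for _ in range(4)]
print(placements)  # [0, 1, 0, 1]
```

With two idle engines for the same model, requests alternate between them. A real global scheduler would also weigh application-level information, which this toy ignores.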
Each Parrot Engine runs a single LLM and communicates with ServeCore through contextual Fill/Gen APIs. Note that:
- Engine server is independent: Each Engine is capable of providing language model services independently. Each Engine has its own scheduler (we call it LocalScheduler) to perform common techniques like Continuous Batching. There are also many kernel-level optimizations (e.g. PagedAttention, Sharing Prompts) in our builtin engine implementation.
- Engine is an abstraction: Any server that implements our internal Engine APIs can be registered as an Engine in Parrot. The system is therefore horizontally scalable, and many types of Engines (e.g., vLLM, FasterTransformer, etc.) can be integrated into Parrot easily. For example, you can use a distributed serving mechanism (like tensor parallelism) on a single multi-GPU machine or across multiple machines, expose a single HTTP server with our Engine APIs, and register it as an Engine.
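The idea that an Engine is an abstraction can be sketched as a small interface that any backend implements. The version below is a minimal sketch under stated assumptions: the class names, the method signatures fill(context_id, text) and generate(context_id, max_tokens), and the whitespace "tokenizer" are all hypothetical and do not reproduce Parrot's internal Engine APIs. EchoEngine is a stand-in backend that "generates" by repeating the most recent words in the context.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Engine(ABC):
    """Hypothetical engine interface with context-based Fill/Gen calls."""

    @abstractmethod
    def fill(self, context_id: int, text: str) -> int:
        """Append text to a context and return the new context length."""

    @abstractmethod
    def generate(self, context_id: int, max_tokens: int) -> str:
        """Produce output conditioned on the context's current contents."""


class EchoEngine(Engine):
    """Stand-in backend that repeats the last words of the context."""

    def __init__(self) -> None:
        # One token list per context id (whitespace-split words).
        self._contexts: Dict[int, List[str]] = {}

    def fill(self, context_id: int, text: str) -> int:
        tokens = self._contexts.setdefault(context_id, [])
        tokens.extend(text.split())
        return len(tokens)

    def generate(self, context_id: int, max_tokens: int) -> str:
        if max_tokens <= 0:
            return ""
        tokens = self._contexts[context_id]
        output = tokens[-max_tokens:]
        tokens.extend(output)  # generated tokens remain in the context
        return " ".join(output)


engine: Engine = EchoEngine()
length_after_fill = engine.fill(0, "Tell me a joke")
generated = engine.generate(0, 2)
length_after_second_fill = engine.fill(0, "Explain it")
print(length_after_fill, generated, length_after_second_fill)
```

Because callers depend only on the Engine interface, a different backend could be swapped in without changing the calling code, which is the property the text attributes to registering different engine types.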
The following picture illustrates the overall architecture of Parrot. Please refer to our OSDI'24 paper, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, for more details.
The code of Parrot is organized largely along the three-layer architecture above.
    parrot/
        frontend/
            pfunc/     # PFunc frontend
        serve/         # Serve Layer
        engine/        # Engine Layer
        protocol/      # Common Protocols & APIs
        utils/         # Utilities (logging, async, recycle pool, ...)
        testing/       # Test related tools