| What is NEW! |
|---|
| First Release: Nov 15th, 2023. Kaito v0.1.0. |
| Latest Release: March 1st, 2024. Kaito v0.2.0. |
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. The target models are popular large open-source inference models such as Falcon and Llama 2. Kaito has the following key differentiations compared to most of the mainstream model deployment methodologies built on top of virtual machine infrastructures:
- Manage large model files using container images. An HTTP server is provided to perform inference calls using the model library.
- Avoid tuning deployment parameters to fit GPU hardware by providing preset configurations.
- Auto-provision GPU nodes based on model requirements.
- Host large model images in the public Microsoft Container Registry (MCR) if the license allows.
Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
Kaito follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern. The user manages a `workspace` custom resource which describes the GPU requirements and the inference specification. Kaito controllers automate the deployment by reconciling the `workspace` custom resource.
The figure above presents an overview of the Kaito architecture. Its major components are:
- **Workspace controller**: It reconciles the `workspace` custom resource, creates `machine` (explained below) custom resources to trigger node auto provisioning, and creates the inference workload (`deployment` or `statefulset`) based on the model preset configurations.
- **Node provisioner controller**: The controller's name is *gpu-provisioner* in the Kaito helm chart. It uses the `machine` CRD originating from Karpenter to interact with the workspace controller. It integrates with Azure Kubernetes Service (AKS) APIs to add new GPU nodes to the AKS cluster. Note that the *gpu-provisioner* is not an open-source component. It can be replaced by other controllers if they support Karpenter-core APIs. A quick inspection sketch follows this list.
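As a quick check after installation, one can list the custom resources these controllers act on. This is a minimal sketch, assuming the plural resource names and API groups are `workspaces.kaito.sh` and `machines.karpenter.sh`; they may differ across Kaito and Karpenter versions.
```sh
# List workspace custom resources reconciled by the workspace controller.
$ kubectl get workspaces.kaito.sh -A
# List machine custom resources created to trigger GPU node auto provisioning.
$ kubectl get machines.karpenter.sh -A
```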
Please check the installation guidance here.
After installing Kaito, one can try the following commands to start a falcon-7b inference service.
```sh
$ cat examples/kaito_workspace_falcon_7b.yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"

$ kubectl apply -f examples/kaito_workspace_falcon_7b.yaml
```
The workspace status can be tracked by running the following command. When the `WORKSPACEREADY` column becomes `True`, the model has been deployed successfully.
```sh
$ kubectl get workspace workspace-falcon-7b
NAME                  INSTANCE            RESOURCEREADY   INFERENCEREADY   WORKSPACEREADY   AGE
workspace-falcon-7b   Standard_NC12s_v3   True            True             True             10m
```
Next, one can find the inference service's cluster IP and use a temporary `curl` pod to test the service endpoint in the cluster.
```sh
$ kubectl get svc workspace-falcon-7b
NAME                  TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)            AGE
workspace-falcon-7b   ClusterIP   <CLUSTERIP>   <none>        80/TCP,29500/TCP   10m

$ export CLUSTERIP=$(kubectl get svc workspace-falcon-7b -o jsonpath="{.spec.clusterIPs[0]}")
$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"
```
The detailed usage for Kaito supported models can be found HERE. In case users want to deploy their own containerized models, they can provide the pod template in the `inference` field of the workspace custom resource (please see API definitions for details). The controller will create a deployment workload using all provisioned GPU nodes. Note that currently the controller does NOT handle automatic model upgrade. It only creates inference workloads based on the preset configurations if the workloads do not exist.
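For illustration, the sketch below shows a workspace wrapping a user-owned containerized model. The pod-template sub-field (written here as `template`), the label, and the image reference are assumptions/placeholders; the authoritative field names are in the API definitions referenced above.
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-custom-model            # hypothetical workspace name
resource:
  instanceType: "Standard_NC12s_v3"
  labelSelector:
    matchLabels:
      apps: custom-model
inference:
  template:                                # assumed pod-template field; verify against the API definitions
    spec:
      containers:
        - name: custom-model
          image: <YOUR_REGISTRY>/<YOUR_MODEL_IMAGE>:<TAG>   # placeholder image reference
```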
The number of supported models in Kaito is growing! Please check this document to see how to add a new supported model.
When using hosted public models, a user can delete the existing inference workload (`Deployment` or `StatefulSet`) manually, and the workspace controller will create a new one with the latest preset configuration (e.g., the image version) defined in the current release. For private models, it is recommended to create a new workspace with a new image version in the Spec.
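As a minimal sketch, assuming the workload carries the same name as its workspace (as in the falcon-7b example above):
```sh
# Delete the current workload; the workspace controller recreates it with the
# latest preset configuration from the installed release.
$ kubectl delete deployment workspace-falcon-7b
```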
To update model or inference parameters for a deployed service, perform a `kubectl edit` on the workload type, which could be either a `StatefulSet` or `Deployment`.
For example, to enable 4-bit quantization on a `falcon-7b-instruct` deployment, you would execute:
```sh
kubectl edit deployment workspace-falcon-7b-instruct
```
Within the deployment configuration, locate the command section and modify it as follows:
Original command:
```sh
accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all inference_api.py --pipeline text-generation --torch_dtype bfloat16
```
Modified command to enable 4-bit quantization:
```sh
accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all inference_api.py --pipeline text-generation --torch_dtype bfloat16 --load_in_4bit
```
For a comprehensive list of inference parameters for the text-generation models, refer to the following options:
- `pipeline`: The model pipeline for the pre-trained model. For text-generation models this can be either `text-generation` or `conversational`.
- `pretrained_model_name_or_path`: Path to the pretrained model or model identifier from huggingface.co/models.
- Additional parameters such as `state_dict`, `cache_dir`, `from_tf`, `force_download`, `resume_download`, `proxies`, `output_loading_info`, `allow_remote_files`, `revision`, `trust_remote_code`, `load_in_4bit`, `load_in_8bit`, `torch_dtype`, and `device_map` can also be customized as needed.
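As a hedged illustration, building on the `accelerate launch` command shown earlier, a couple of these parameters could be adjusted as follows; the values are examples only, and the flag syntax mirrors the `--load_in_4bit` example above:
```sh
accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all inference_api.py --pipeline text-generation --torch_dtype float16 --load_in_8bit
```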
Should you need an undocumented parameter, kindly file an issue for potential future inclusion.
The main distinction between instruct and non-instruct models lies in their intended use cases. Instruct models are fine-tuned versions optimized for interactive chat applications. They are typically the preferred choice for most implementations due to their enhanced performance in conversational contexts.
On the other hand, non-instruct, or raw models, are designed for further fine-tuning. Future developments in Kaito may include features that allow users to apply fine-tuned weights to these raw models.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
See LICENSE.
"Kaito devs" [email protected]