Kaito (Kubernetes AI Toolchain Operator)
Kaito automates AI model provisioning on Kubernetes: it handles GPU node allocation, model image selection, and inference workload deployment as a single Kubernetes-native operation.
What Kaito Solves
Deploying LLMs on Kubernetes typically requires:
- Provisioning GPU nodes (manually or via cluster autoscaler)
- Selecting the right container image with model weights
- Configuring inference runtime (vLLM, TGI, etc.)
- Setting up resource requests, limits, health checks
- Managing model updates and rollbacks
Kaito collapses all of this into a single Kubernetes custom resource (defined by a CRD).
Architecture
```
┌──────────────────────────────────┐
│ Kaito Workspace CRD              │
│ (model: llama-3-70b-instruct)    │
├──────────────────────────────────┤
│ Kaito Controller                 │
│ ├── GPU Node Provisioner         │  <- auto-provisions GPU nodes
│ ├── Model Image Selector         │  <- picks right container image
│ └── Inference Deployer           │  <- deploys serving workload
├──────────────────────────────────┤
│ Kubernetes Cluster               │
│ ├── GPU Node Pool (auto-scaled)  │
│ └── Inference Pod (vLLM/TGI)     │
└──────────────────────────────────┘
```
Key Features
- Workspace CRD: Declare what model you want, Kaito handles the rest
- Automatic GPU provisioning: Integrates with cloud provider node pools (Azure, AWS, GCP)
- Pre-built model images: Falcon, LLaMA, Mistral, Phi and more – pre-packaged with optimal configs
- Multiple runtimes: Supports vLLM and HuggingFace TGI as inference backends
- Fine-tuning support: QLoRA fine-tuning workflows built-in
- Model adapters: Hot-swap LoRA adapters without reloading base model
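The fine-tuning feature is driven from the same Workspace CRD. Below is a hedged sketch of a QLoRA tuning job: the `tuning` block and its `method`/`preset`/`input`/`output` fields follow upstream Kaito examples, but the model name, dataset URL, and registry path are placeholders — verify field names against the Kaito version you run.

```yaml
# Sketch of a QLoRA fine-tuning Workspace (illustrative values, not a tested manifest)
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: tune-phi-3
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  count: 1
tuning:
  method: qlora                        # built-in QLoRA workflow
  preset:
    name: phi-3-mini-4k-instruct       # base model to fine-tune (assumed preset name)
  input:
    urls:
      - "https://example.com/train.parquet"   # placeholder training dataset
  output:
    image: "myregistry.azurecr.io/adapters/phi-3-tuned:0.1"  # placeholder adapter image target
```

On completion, the tuned adapter is pushed as a container image, which fits naturally with the adapter hot-swap feature above.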
Example: Deploy Mistral 7B
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: mistral-7b
# Note: the Workspace CRD puts resource/inference at the top level, not under .spec
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  count: 1
inference:
  preset:
    name: "mistral-7b-instruct"
```
Apply this YAML and Kaito will:
- Provisions an A100 GPU node
- Pulls the Mistral 7B container image
- Deploys inference endpoint with health checks
- Exposes an OpenAI-compatible API
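Once a workspace like the Mistral one above is running, LoRA adapters can be attached declaratively via the adapter feature. A hedged sketch, assuming the `inference.adapters` field from upstream Kaito examples; the adapter name and image are hypothetical:

```yaml
# Sketch: attaching a LoRA adapter to a running preset (illustrative values)
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: mistral-7b
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  count: 1
inference:
  preset:
    name: "mistral-7b-instruct"
  adapters:
    - source:
        name: customer-support-lora    # hypothetical adapter name
        image: "myregistry.azurecr.io/adapters/mistral-support:0.1"  # placeholder adapter image
      strength: "1.0"                  # apply the adapter at full weight
```

Editing the `adapters` list and re-applying swaps adapters in without reloading the base model, per the hot-swap feature described earlier.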
When to Use
Use Kaito when:
- You run Kubernetes and want simplified LLM deployment
- You need to deploy open-weight models (not using managed APIs)
- You want GPU auto-provisioning integrated with model serving
- You’re doing fine-tuning workflows on K8s
Avoid Kaito when:
- You use managed AI APIs (OpenAI, Anthropic, Vertex AI) – no need for self-hosting
- You need maximum control over inference runtime config – use vLLM directly
- You’re not on Kubernetes
Relationship to Other Tools
| Tool | Relationship |
|---|---|
| vLLM | Kaito uses vLLM as one of its inference backends |
| llm-d | Complementary – llm-d focuses on request routing/scheduling, Kaito on provisioning |
| KAgent | Different scope – KAgent is about agent orchestration, Kaito is about model serving |
This post is licensed under CC BY 4.0 by the author.