Kaito (Kubernetes AI Toolchain Operator)

Kaito automates AI model provisioning on Kubernetes -- it handles GPU node allocation, model image selection, and inference workload deployment as a single Kubernetes-native operation.

What Kaito Solves

Deploying LLMs on Kubernetes typically requires:

  1. Provisioning GPU nodes (manually or via cluster autoscaler)
  2. Selecting the right container image with model weights
  3. Configuring inference runtime (vLLM, TGI, etc.)
  4. Setting up resource requests, limits, health checks
  5. Managing model updates and rollbacks

Kaito collapses all of this into a single Kubernetes custom resource, defined by its Workspace CRD.


Architecture

┌──────────────────────────────────┐
│  Kaito Workspace CRD             │
│  (model: llama-3-70b-instruct)   │
├──────────────────────────────────┤
│  Kaito Controller                │
│  ├── GPU Node Provisioner        │  <- auto-provisions GPU nodes
│  ├── Model Image Selector        │  <- picks right container image
│  └── Inference Deployer          │  <- deploys serving workload
├──────────────────────────────────┤
│  Kubernetes Cluster              │
│  ├── GPU Node Pool (auto-scaled) │
│  └── Inference Pod (vLLM/TGI)    │
└──────────────────────────────────┘

Key Features

  • Workspace CRD: Declare what model you want, Kaito handles the rest
  • Automatic GPU provisioning: Integrates with cloud provider node pools (Azure, AWS, GCP)
  • Pre-built model images: Falcon, LLaMA, Mistral, Phi and more – pre-packaged with optimal configs
  • Multiple runtimes: Supports vLLM and HuggingFace TGI as inference backends
  • Fine-tuning support: QLoRA fine-tuning workflows built-in
  • Model adapters: Hot-swap LoRA adapters without reloading base model
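
The fine-tuning feature above is driven by the same Workspace CRD. A sketch of what a QLoRA tuning workspace might look like under the v1alpha1 API shown later in this post – the preset name, dataset URL, and output registry here are placeholders, and the exact `tuning` field names should be checked against the Kaito docs for your version:

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: tune-phi-3
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"
    count: 1
  tuning:
    preset:
      name: "phi-3-mini-4k-instruct"        # base model preset (illustrative)
    method: qlora                            # QLoRA fine-tuning
    input:
      urls:
        - "https://example.com/train.parquet"  # training data (placeholder)
    output:
      image: "myregistry.example.com/adapters/phi-3-tuned:latest"  # where the resulting adapter image is pushed
```

The output is packaged as an adapter image, which fits the hot-swap adapter feature above: a serving workspace can reference it without reloading the base model.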

Example: Deploy Mistral 7B

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: mistral-7b
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"
    count: 1
  inference:
    preset:
      name: "mistral-7b-instruct"

Apply this YAML and Kaito will:

  1. Provisions an A100 GPU node
  2. Pulls the Mistral 7B container image
  3. Deploys inference endpoint with health checks
  4. Exposes an OpenAI-compatible API
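
The flow above can be exercised from the command line. A sketch, assuming the Workspace YAML is saved as `mistral-7b.yaml` and that Kaito fronts the inference pod with a service named after the workspace – the service name, port, and API path are assumptions to verify against your Kaito version:

```shell
# Apply the Workspace; Kaito provisions the GPU node and deploys the pod
kubectl apply -f mistral-7b.yaml

# Watch the workspace conditions until resource and inference are ready
kubectl get workspace mistral-7b

# Port-forward to the inference service (name/port assumed)
kubectl port-forward svc/mistral-7b 8080:80 &

# Call the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```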

When to Use

Use Kaito when:

  • You run Kubernetes and want simplified LLM deployment
  • You need to deploy open-weight models (not using managed APIs)
  • You want GPU auto-provisioning integrated with model serving
  • You’re doing fine-tuning workflows on K8s

Avoid Kaito when:

  • You use managed AI APIs (OpenAI, Anthropic, Vertex AI) – no need for self-hosting
  • You need maximum control over inference runtime config – use vLLM directly
  • You’re not on Kubernetes

Relationship to Other Tools

  • vLLM: Kaito uses vLLM as one of its inference backends
  • llm-d: Complementary – llm-d focuses on request routing/scheduling, Kaito on provisioning
  • KAgent: Different scope – KAgent is about agent orchestration, Kaito is about model serving

This post is licensed under CC BY 4.0 by the author.