Kaito (Kubernetes AI Toolchain Operator)

Kaito automates AI model provisioning on Kubernetes -- it handles GPU node allocation, model image selection, and inference workload deployment as a single Kubernetes-native operation.

What Kaito Solves

Deploying LLMs on Kubernetes typically requires:

  1. Provisioning GPU nodes (manually or via cluster autoscaler)
  2. Selecting the right container image with model weights
  3. Configuring inference runtime (vLLM, TGI, etc.)
  4. Setting up resource requests, limits, health checks
  5. Managing model updates and rollbacks

Kaito collapses all of this into a single Kubernetes custom resource, defined by its Workspace CRD.


Architecture

┌──────────────────────────────────┐
│  Kaito Workspace CRD             │
│  (model: llama-3-70b-instruct)   │
├──────────────────────────────────┤
│  Kaito Controller                │
│  ├── GPU Node Provisioner        │  <- auto-provisions GPU nodes
│  ├── Model Image Selector        │  <- picks right container image
│  └── Inference Deployer          │  <- deploys serving workload
├──────────────────────────────────┤
│  Kubernetes Cluster              │
│  ├── GPU Node Pool (auto-scaled) │
│  └── Inference Pod (vLLM/TGI)    │
└──────────────────────────────────┘

Key Features

  • Workspace CRD: Declare what model you want, Kaito handles the rest
  • Automatic GPU provisioning: Integrates with cloud provider node pools (Azure, AWS, GCP)
  • Pre-built model images: Falcon, LLaMA, Mistral, Phi and more – pre-packaged with optimal configs
  • Multiple runtimes: Supports vLLM and HuggingFace TGI as inference backends
  • Fine-tuning support: QLoRA fine-tuning workflows built-in
  • Model adapters: Hot-swap LoRA adapters without reloading base model
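
The fine-tuning feature above is driven by the same Workspace CRD. A sketch of what a QLoRA tuning workspace might look like under the v1alpha1 API shown later in this post – the preset name, dataset URL, and output registry here are placeholders, and the exact `tuning` field names should be checked against the Kaito docs for your version:

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: tune-phi-3
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"
    count: 1
  tuning:
    preset:
      name: "phi-3-mini-4k-instruct"        # base model preset (illustrative)
    method: qlora                            # QLoRA fine-tuning
    input:
      urls:
        - "https://example.com/train.parquet"  # training data (placeholder)
    output:
      image: "myregistry.example.com/adapters/phi-3-tuned:latest"  # where the resulting adapter image is pushed
```

The output is packaged as an adapter image, which fits the hot-swap adapter feature above: a serving workspace can reference it without reloading the base model.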

Example: Deploy Mistral 7B

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: mistral-7b
spec:
  resource:
    instanceType: "Standard_NC24ads_A100_v4"
    count: 1
  inference:
    preset:
      name: "mistral-7b-instruct"

Apply this YAML and Kaito will:

  1. Provisions an A100 GPU node
  2. Pulls the Mistral 7B container image
  3. Deploys inference endpoint with health checks
  4. Exposes an OpenAI-compatible API
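
The flow above can be exercised from the command line. A sketch, assuming the Workspace YAML is saved as `mistral-7b.yaml` and that Kaito fronts the inference pod with a service named after the workspace – the service name, port, and API path are assumptions to verify against your Kaito version:

```shell
# Apply the Workspace; Kaito provisions the GPU node and deploys the pod
kubectl apply -f mistral-7b.yaml

# Watch the workspace conditions until resource and inference are ready
kubectl get workspace mistral-7b

# Port-forward to the inference service (name/port assumed)
kubectl port-forward svc/mistral-7b 8080:80 &

# Call the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```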

When to Use

Use Kaito when:

  • You run Kubernetes and want simplified LLM deployment
  • You need to deploy open-weight models (not using managed APIs)
  • You want GPU auto-provisioning integrated with model serving
  • You’re doing fine-tuning workflows on K8s

Avoid Kaito when:

  • You use managed AI APIs (OpenAI, Anthropic, Vertex AI) – no need for self-hosting
  • You need maximum control over inference runtime config – use vLLM directly
  • You’re not on Kubernetes

Relationship to Other Tools

  • vLLM: Kaito uses vLLM as one of its inference backends
  • llm-d: Complementary – llm-d focuses on request routing/scheduling, Kaito on provisioning
  • KAgent: Different scope – KAgent is about agent orchestration, Kaito is about model serving

This post is licensed under CC BY 4.0 by the author.