llm-d
A Kubernetes-native, disaggregated LLM serving stack that separates prefill and decode phases across specialized node pools, using the Gateway API for intelligent request routing – designed for large-scale, multi-model inference with enterprise-grade operability.
What llm-d Is
llm-d (pronounced “LLM-dee,” the “d” stands for “disaggregated”) is an open-source project led by Red Hat that rethinks how LLM inference works at scale on Kubernetes. While vLLM optimizes inference within a single serving instance, llm-d optimizes the infrastructure layer around and between inference instances – routing, scheduling, scaling, and disaggregating the inference pipeline itself.
The core idea: LLM inference has two distinct computational phases (prefill and decode) with very different resource profiles. Running them on the same hardware wastes resources. llm-d splits them apart and schedules each phase on purpose-built infrastructure.
Why Disaggregated Inference
The Prefill/Decode Problem
LLM inference has two phases:
Prefill (prompt processing): Process all input tokens in parallel. This is compute-bound – it benefits from maximum GPU FLOPs. A long prompt (8K tokens) requires significant computation but completes in a single forward pass.
Decode (token generation): Generate output tokens one at a time, autoregressively. This is memory-bandwidth-bound – each step reads the full KV cache but does relatively little computation. The GPU is mostly idle, waiting for memory reads.
```
Prefill phase:  [========= HIGH GPU COMPUTE =========]   (one pass, all input tokens)
Decode phase:   [=].[=].[=].[=].[=].[=].[=].[=].[=].     (many passes, one token each)
                 ^ memory-bound, GPU underutilized

Combined on same GPU:
  GPU Compute:  ████░░░░░░░░░░░░░░░░  (20% average utilization during decode)
  GPU Memory:   ████████████████████  (100% allocated for KV cache)

Disaggregated:
  Prefill GPU:  ████████████████████  (high utilization, processes prompts)
  Decode GPU:   ████████████████████  (optimized for memory bandwidth)
```
By separating these phases, each GPU pool can be optimized independently. Prefill nodes use high-compute GPUs (H100 SXM); decode nodes can use memory-optimized or even cheaper GPUs.
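The imbalance can be made concrete with a back-of-envelope arithmetic-intensity calculation. The numbers below are illustrative (a 70B-parameter model in FP16 is assumed, and only weight reads are counted; KV-cache and activation traffic are ignored):

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of HBM traffic) for
# the two phases. Illustrative model: 70B params in FP16; only weight reads
# are counted, KV-cache and activation traffic are ignored.

def arithmetic_intensity(tokens_per_step: int, params: float, dtype_bytes: int = 2) -> float:
    flops = 2 * params * tokens_per_step   # ~2 FLOPs per parameter per token
    bytes_moved = params * dtype_bytes     # weights are read once per forward step
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens_per_step=8192, params=70e9)  # whole prompt, one pass
decode = arithmetic_intensity(tokens_per_step=1, params=70e9)      # one token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte  decode: {decode:.0f} FLOPs/byte")
# Prefill amortizes each weight read over thousands of tokens; decode cannot,
# so decode throughput is capped by memory bandwidth rather than compute.
```

Under this simplification the intensity is just tokens per step: thousands of FLOPs per byte during prefill versus one during decode, which is the gap disaggregation exploits.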
Architecture
```
+----------------------------------------------------------------+
|                       Kubernetes Cluster                        |
|                                                                 |
|  +------------------+                                           |
|  |   Gateway API    | <-- Ingress point, LLM-aware routing      |
|  |  (Envoy + llm-d  |                                           |
|  | routing plugin)  |                                           |
|  +--------+---------+                                           |
|           |                                                     |
|  +--------v---------+         +------------------+              |
|  |   Prefill Pool   |         |   Decode Pool    |              |
|  | +------+ +------+|   KV    |+------+ +------+ |              |
|  | |vLLM  | |vLLM  ||  cache  ||vLLM  | |vLLM  | |              |
|  | |inst-1| |inst-2|| transfer||inst-3| |inst-4| |              |
|  | +------+ +------+|         |+------+ +------+ |              |
|  +--------+---------+         +--------+---------+              |
|           |                            |                        |
|  +--------v----------------------------v---------+              |
|  |       Shared KV Cache / Transfer Layer        |              |
|  |      (RDMA, NVLink, or network transfer)      |              |
|  +-----------------------------------------------+              |
|                                                                 |
|  +-----------------------------+                                |
|  | Endpoint Picker / Scheduler | <-- Decides which instance     |
|  |  (prefix-aware, load-aware) |     handles each request       |
|  +-----------------------------+                                |
+----------------------------------------------------------------+
```
Key Components
Gateway API Integration: llm-d uses the Kubernetes Gateway API (not legacy Ingress) as the traffic entry point. The gateway includes an LLM-aware routing layer that understands request characteristics (prompt length, model requested, prefix matching) and routes accordingly.
Endpoint Picker: The intelligent scheduler that decides which vLLM instance handles each request. It considers:
- Prefix affinity – route requests with similar system prompts to the same instance to maximize KV cache reuse
- Load balancing – avoid overloading any single instance
- Phase routing – send prefill work to prefill-optimized nodes, decode work to decode-optimized nodes
- Model awareness – route to instances serving the requested model
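As a sketch, a picker combining these criteria might score candidates like this. The class names, weights, and scoring rule are hypothetical, not llm-d's actual implementation:

```python
# Hypothetical endpoint-picker scoring: model awareness filters, prefix
# affinity dominates, queue depth breaks ties. Not llm-d's real code.
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    model: str
    queue_depth: int        # pending requests (load signal)
    cached_prefixes: set    # prefix digests this instance holds in KV cache

def pick_endpoint(instances, model: str, prefix_digest: str) -> Instance:
    """Pick an instance serving `model`, preferring cache hits, then low load."""
    candidates = [i for i in instances if i.model == model]   # model awareness
    if not candidates:
        raise LookupError(f"no instance serves {model}")
    def score(inst: Instance) -> float:
        affinity = 100.0 if prefix_digest in inst.cached_prefixes else 0.0
        return affinity - inst.queue_depth   # cache hit outweighs moderate load
    return max(candidates, key=score)

pool = [
    Instance("inst-1", "llama3-70b", queue_depth=4, cached_prefixes={"sys-a"}),
    Instance("inst-2", "llama3-70b", queue_depth=1, cached_prefixes=set()),
]
print(pick_endpoint(pool, "llama3-70b", "sys-a").name)  # inst-1: cache hit wins despite load
```

The design point the sketch illustrates: a warm KV cache usually saves more time than routing to the least-loaded instance, so affinity is weighted above load.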
vLLM as the Inference Engine: llm-d does not replace vLLM – it orchestrates multiple vLLM instances. Each instance in the pool runs vLLM for actual inference. llm-d adds the scheduling, routing, and disaggregation layers on top.
KV Cache Transfer: When prefill and decode are on different nodes, the KV cache computed during prefill must be transferred to the decode node. This is the critical data path and requires fast interconnect (RDMA, NVLink across nodes, or high-bandwidth network).
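A rough size estimate shows why the interconnect matters. The shapes below are assumed, Llama-3-70B-like values (80 layers, 8 KV heads, head dim 128, FP16); check your model's config before relying on them:

```python
# Rough size of the KV cache that must move from prefill node to decode node.
# Shapes are assumed Llama-3-70B-like values, not read from a real config.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

def transfer_ms(nbytes, gbytes_per_sec):
    return nbytes / (gbytes_per_sec * 1e9) * 1e3

size = kv_cache_bytes(tokens=8192)
print(f"cache: {size / 1e6:.0f} MB")                       # ~2.7 GB for one 8K prompt
print(f"RDMA 100 GB/s:   {transfer_ms(size, 100):.1f} ms")
print(f"25 GbE (~3 GB/s): {transfer_ms(size, 3):.0f} ms")
```

Under these assumptions a single 8K-token prompt produces roughly 2.7 GB of cache: tens of milliseconds over RDMA, but closer to a second over a commodity network, which would erase the benefit of disaggregation.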
How llm-d Differs from vLLM
| Aspect | vLLM (standalone) | llm-d |
|---|---|---|
| Scope | Single inference instance | Fleet of instances + routing |
| Scaling unit | Replicas of one deployment | Heterogeneous node pools |
| Routing | Round-robin / random | Prefix-aware, load-aware, phase-aware |
| Prefill/Decode | Coupled on same GPU | Can be disaggregated |
| Multi-model | Separate deployments | Unified gateway, multi-model routing |
| K8s integration | Basic Deployment + Service | Gateway API, CRDs, HPA integration |
| Operated by | ML engineers | Platform / infra teams |
The relationship is complementary, not competitive: llm-d uses vLLM under the hood but adds the platform layer that makes operating a fleet of LLM instances manageable.
Key Features
Prefix-Aware Routing
The endpoint picker maintains a map of which vLLM instances have which prefixes cached. When a new request arrives with a system prompt that matches an existing cache, it routes to that instance, avoiding redundant prefill computation.
This matters enormously for enterprise use cases where many requests share the same system prompt (customer service bot, code assistant, etc.). Without prefix-aware routing, every instance independently caches the same prefix, wasting GPU memory across the fleet.
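A minimal sketch of the bookkeeping behind such a map might look like this. Hashing the whole system prompt is a deliberate simplification (real implementations track cache at block granularity and evict entries); the class and method names are hypothetical:

```python
# Minimal prefix-to-instance map, the data structure behind prefix-aware
# routing. Whole-prompt hashing is a simplification; names are hypothetical.
import hashlib
from collections import defaultdict

class PrefixMap:
    def __init__(self):
        self._cached_on = defaultdict(set)  # prefix digest -> instance names

    @staticmethod
    def digest(system_prompt: str) -> str:
        return hashlib.sha256(system_prompt.encode()).hexdigest()[:16]

    def record(self, system_prompt: str, instance: str) -> None:
        """Note that `instance` has prefilled (and so cached) this prefix."""
        self._cached_on[self.digest(system_prompt)].add(instance)

    def instances_with(self, system_prompt: str) -> set:
        """Instances that can skip prefill for this prefix."""
        return self._cached_on.get(self.digest(system_prompt), set())

pm = PrefixMap()
pm.record("You are a helpful support agent.", "inst-1")
print(pm.instances_with("You are a helpful support agent."))  # {'inst-1'}
print(pm.instances_with("You are a pirate."))                 # set()
```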
Multi-Model Serving
A single llm-d gateway can route to different model pools:
```
/v1/chat/completions (model: llama3-70b) --> Pool A (4x H100)
/v1/chat/completions (model: mistral-7b) --> Pool B (2x A100)
/v1/chat/completions (model: phi3-mini)  --> Pool C (1x L4)
```
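Conceptually, the gateway reads the model field from the OpenAI-compatible request body and maps it to a backend pool. A sketch of that dispatch, using illustrative pool names (this is not llm-d's actual API):

```python
# Sketch of model-name dispatch at the gateway. The OpenAI-compatible body
# carries a "model" field; pool names here are illustrative, not a real API.
import json

MODEL_POOLS = {
    "llama3-70b": "pool-a-h100",
    "mistral-7b": "pool-b-a100",
    "phi3-mini": "pool-c-l4",
}

def route(body: bytes) -> str:
    """Return the pool that serves the model named in the request body."""
    model = json.loads(body).get("model")
    try:
        return MODEL_POOLS[model]
    except KeyError:
        raise LookupError(f"no pool serves model {model!r}") from None

req = json.dumps({"model": "mistral-7b", "messages": []}).encode()
print(route(req))  # pool-b-a100
```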
Autoscaling Integration
llm-d exposes metrics that KEDA and HPA can use for intelligent scaling:
- Pending request queue depth
- Per-model utilization
- KV cache pressure
- Time to first token (TTFT) percentiles
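A toy version of the replica calculation an autoscaler performs on the queue-depth metric, mirroring the HPA's target-value formula (the targets and bounds here are illustrative):

```python
# Toy autoscaling math on the pending-queue-depth metric. Target and bounds
# are illustrative; HPA/KEDA compute the equivalent from configured targets.
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Replicas needed so each handles at most `target_per_replica` pending requests."""
    raw = math.ceil(queue_depth / target_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(queue_depth=37, target_per_replica=8))  # -> 5
```

Queue depth is a better scaling signal for LLM serving than GPU utilization, since decode keeps utilization misleadingly low even when requests are backing up.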
Red Hat and Enterprise Focus
Red Hat’s involvement signals enterprise priorities: integration with OpenShift, support for air-gapped registries, FIPS compliance for the control plane, and a path to commercial support. For enterprises already on OpenShift, llm-d is a natural candidate for LLM serving infrastructure.
Deployment on Kubernetes
Prerequisites
- Kubernetes 1.29+ with Gateway API CRDs installed
- GPU nodes with NVIDIA drivers and device plugin
- vLLM container images
- Sufficient inter-node bandwidth for KV cache transfer (if using disaggregation)
Basic Setup (Without Disaggregation)
Even without full prefill/decode disaggregation, llm-d provides value through intelligent routing:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: llm-d
  listeners:
    - name: http
      port: 8080
      protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
spec:
  parentRefs:
    - name: llm-gateway
  rules:
    - matches:
        - path:
            value: /v1/chat/completions
      backendRefs:
        - name: vllm-llama3-pool
          port: 8000
```
Scaling Patterns
| Pattern | When to Use | Config |
|---|---|---|
| Homogeneous pool | Single model, uniform traffic | N identical vLLM replicas behind llm-d |
| Heterogeneous pools | Multiple models, different GPU needs | Model-specific pools with per-pool HPA |
| Disaggregated | High throughput, long prompts + short outputs | Separate prefill/decode pools |
| Hybrid | Mix of batch and real-time | Batch queue + real-time pool |
When to Use llm-d
Use llm-d when:
- You are operating multiple LLM models at scale on Kubernetes and need unified routing
- You want prefix-aware request routing to maximize KV cache reuse across a fleet
- You are on OpenShift and want a Red Hat-supported LLM serving stack
- You need disaggregated prefill/decode for workloads with long prompts and short outputs
- Your platform team needs to manage LLM infrastructure with familiar K8s patterns (Gateway API, HPA)
Skip llm-d when:
- You are running a single model on a single GPU – vLLM alone is sufficient
- You are not on Kubernetes or prefer serverless inference (use cloud-managed endpoints)
- You need the simplest possible setup for dev/testing (use Ollama)
- The project’s maturity does not meet your stability requirements (evaluate carefully)
Maturity
llm-d is an emerging project as of 2025-2026, with active development led by Red Hat engineers. It has been presented at KubeCon and is gaining traction in the CNCF AI ecosystem, but it is not yet a CNCF project. The architecture is sound and addresses real operational pain points, but expect API changes and gaps in documentation. Good for forward-looking platform teams willing to contribute upstream; premature for regulated environments requiring stable, supported software.
References
- llm-d GitHub: https://github.com/llm-d/llm-d
- Red Hat blog posts on disaggregated LLM serving
- Kubernetes Gateway API: https://gateway-api.sigs.k8s.io
- vLLM documentation: https://docs.vllm.ai
- KubeCon 2025 presentations on LLM infrastructure