llm-d

A Kubernetes-native, disaggregated LLM serving stack that separates prefill and decode phases across specialized node pools, using the Gateway API for intelligent request routing – designed for large-scale, multi-model inference with enterprise-grade operability.


What llm-d Is

llm-d (pronounced “LLM-dee,” the “d” stands for “disaggregated”) is an open-source project led by Red Hat that rethinks how LLM inference works at scale on Kubernetes. While vLLM optimizes inference within a single serving instance, llm-d optimizes the infrastructure layer around and between inference instances – routing, scheduling, scaling, and disaggregating the inference pipeline itself.

The core idea: LLM inference has two distinct computational phases (prefill and decode) with very different resource profiles. Running them on the same hardware wastes resources. llm-d splits them apart and schedules each phase on purpose-built infrastructure.


Why Disaggregated Inference

The Prefill/Decode Problem

LLM inference has two phases:

Prefill (prompt processing): Process all input tokens in parallel. This is compute-bound – it benefits from maximum GPU FLOPs. A long prompt (8K tokens) requires significant computation but completes in a single forward pass.

Decode (token generation): Generate output tokens one at a time, autoregressively. This is memory-bandwidth-bound – each step reads the full KV cache but does relatively little computation. The GPU is mostly idle, waiting for memory reads.

Prefill phase:  [========= HIGH GPU COMPUTE =========] (one pass, all input tokens)
Decode phase:   [=].[=].[=].[=].[=].[=].[=].[=].[=].  (many passes, one token each)
                 ^memory-bound, GPU underutilized

Combined on same GPU:
  GPU Compute: ████░░░░░░░░░░░░░░░░  (20% average utilization during decode)
  GPU Memory:  ████████████████████  (100% allocated for KV cache)

Disaggregated:
  Prefill GPU:  ████████████████████  (high utilization, processes prompts)
  Decode GPU:   ████████████████████  (optimized for memory bandwidth)

By separating these phases, each GPU pool can be optimized independently. Prefill nodes use high-compute GPUs (H100 SXM); decode nodes can use memory-optimized or even cheaper GPUs.


Architecture

+------------------------------------------------------------------+
|  Kubernetes Cluster                                              |
|                                                                  |
|  +-------------------+                                           |
|  | Gateway API       |  <-- Ingress point, LLM-aware routing     |
|  | (Envoy + llm-d    |                                           |
|  |  routing plugin)  |                                           |
|  +--------+----------+                                           |
|           |                                                      |
|  +--------v----------+          +-------------------+            |
|  | Prefill Pool      |          | Decode Pool       |            |
|  | +------+ +------+ |    KV    | +------+ +------+ |            |
|  | |vLLM  | |vLLM  | |  cache   | |vLLM  | |vLLM  | |            |
|  | |inst-1| |inst-2| | transfer | |inst-3| |inst-4| |            |
|  | +------+ +------+ |          | +------+ +------+ |            |
|  +--------+----------+          +---------+---------+            |
|           |                               |                      |
|  +--------v-------------------------------v--------+             |
|  | Shared KV Cache / Transfer Layer               |              |
|  | (RDMA, NVLink, or network transfer)            |              |
|  +------------------------------------------------+              |
|                                                                  |
|  +-----------------------------+                                 |
|  | Endpoint Picker / Scheduler |  <-- Decides which instance     |
|  | (prefix-aware, load-aware)  |      handles each request       |
|  +-----------------------------+                                 |
+------------------------------------------------------------------+

Key Components

Gateway API Integration: llm-d uses the Kubernetes Gateway API (not legacy Ingress) as the traffic entry point. The gateway includes an LLM-aware routing layer that understands request characteristics (prompt length, model requested, prefix matching) and routes accordingly.

Endpoint Picker: The intelligent scheduler that decides which vLLM instance handles each request. It considers:

  • Prefix affinity – route requests with similar system prompts to the same instance to maximize KV cache reuse
  • Load balancing – avoid overloading any single instance
  • Phase routing – send prefill work to prefill-optimized nodes, decode work to decode-optimized nodes
  • Model awareness – route to instances serving the requested model

vLLM as the Inference Engine: llm-d does not replace vLLM – it orchestrates multiple vLLM instances. Each instance in the pool runs vLLM for actual inference. llm-d adds the scheduling, routing, and disaggregation layers on top.

KV Cache Transfer: When prefill and decode are on different nodes, the KV cache computed during prefill must be transferred to the decode node. This is the critical data path and requires fast interconnect (RDMA, NVLink across nodes, or high-bandwidth network).
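To see why the interconnect matters, it helps to size the payload. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 GQA key/value heads, head_dim 128, fp16 cache); the numbers are illustrative, not measured:

```python
# Rough sizing of the KV cache that must move from a prefill node to a
# decode node. Model shape is an assumed Llama-3-70B-like configuration.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # factor of 2: one K and one V vector per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

prompt_tokens = 8192
cache = kv_cache_bytes(prompt_tokens)
link_gbytes_per_s = 12.5  # ~100 Gb/s Ethernet; RDMA or NVLink is much faster
print(f"KV cache for {prompt_tokens} tokens: {cache / 2**30:.2f} GiB")
print(f"Transfer at {link_gbytes_per_s} GB/s: {cache / (link_gbytes_per_s * 1e9) * 1000:.0f} ms")
```

At roughly 320 KiB per token under these assumptions, an 8K-token prompt produces about 2.5 GiB of KV cache, which takes on the order of 200 ms to move over commodity 100 Gb/s networking. That latency lands directly on time to first token, which is why disaggregated deployments lean on RDMA-class interconnects.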


How llm-d Differs from vLLM

Aspect          | vLLM (standalone)          | llm-d
----------------|----------------------------|---------------------------------------
Scope           | Single inference instance  | Fleet of instances + routing
Scaling unit    | Replicas of one deployment | Heterogeneous node pools
Routing         | Round-robin / random       | Prefix-aware, load-aware, phase-aware
Prefill/Decode  | Coupled on same GPU        | Can be disaggregated
Multi-model     | Separate deployments       | Unified gateway, multi-model routing
K8s integration | Basic Deployment + Service | Gateway API, CRDs, HPA integration
Operated by     | ML engineers               | Platform / infra teams

The relationship is complementary, not competitive: llm-d uses vLLM under the hood but adds the platform layer that makes operating a fleet of LLM instances manageable.


Key Features

Prefix-Aware Routing

The endpoint picker maintains a map of which vLLM instances have which prefixes cached. When a new request arrives with a system prompt that matches an existing cache, it routes to that instance, avoiding redundant prefill computation.

This matters enormously for enterprise use cases where many requests share the same system prompt (customer service bot, code assistant, etc.). Without prefix-aware routing, every instance independently caches the same prefix, wasting GPU memory across the fleet.
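A toy version of this idea fits in a few lines. This is not llm-d's actual implementation (real pickers track token-block prefixes, cache occupancy, and evictions); it only illustrates the hash-the-prefix, remember-the-instance pattern:

```python
# Sketch of prefix-affinity endpoint picking (illustrative only): hash a
# fixed-length character prefix of the prompt and remember which instance
# served it; fall back to the least-loaded instance on a miss.

import hashlib

class PrefixAwarePicker:
    def __init__(self, instances, prefix_chars=512):
        self.load = {inst: 0 for inst in instances}  # in-flight request counts
        self.prefix_map = {}                         # prefix hash -> instance
        self.prefix_chars = prefix_chars

    def pick(self, prompt: str) -> str:
        key = hashlib.sha256(prompt[: self.prefix_chars].encode()).hexdigest()
        inst = self.prefix_map.get(key)
        if inst is None:                             # miss: choose least-loaded
            inst = min(self.load, key=self.load.get)
            self.prefix_map[key] = inst
        self.load[inst] += 1
        return inst

# prefix_chars=30 covers exactly the shared system prompt in this toy example
picker = PrefixAwarePicker(["vllm-0", "vllm-1"], prefix_chars=30)
a = picker.pick("SYSTEM: You are a support bot. USER: refund?")
b = picker.pick("SYSTEM: You are a support bot. USER: shipping?")
# both requests share the hashed prefix, so both land on the same instance
```

Production pickers hash tokenized block boundaries rather than raw characters, and must also handle instance failure and cache invalidation, but the routing decision reduces to the same lookup.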

Multi-Model Serving

A single llm-d gateway can route to different model pools:

/v1/chat/completions (model: llama3-70b)  --> Pool A (4x H100)
/v1/chat/completions (model: mistral-7b)  --> Pool B (2x A100)
/v1/chat/completions (model: phi3-mini)   --> Pool C (1x L4)
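One subtlety: an OpenAI-style API carries the model name in the JSON request body, not the URL path, so the gateway must parse the payload before choosing a backend pool. A minimal sketch of that dispatch (pool names mirror the example above and are assumptions, not llm-d configuration):

```python
# Body-based model routing sketch: map the "model" field of an
# OpenAI-style chat completion request to a backend pool.

import json

MODEL_POOLS = {
    "llama3-70b": "vllm-llama3-pool:8000",
    "mistral-7b": "vllm-mistral-pool:8000",
    "phi3-mini":  "vllm-phi3-pool:8000",
}

def route(body: bytes) -> str:
    """Return the backend for a request body, or raise if no pool serves it."""
    model = json.loads(body).get("model")
    try:
        return MODEL_POOLS[model]
    except KeyError:
        raise ValueError(f"no pool serves model {model!r}")

req = json.dumps({"model": "mistral-7b", "messages": []}).encode()
print(route(req))
```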

Autoscaling Integration

llm-d exposes metrics that KEDA and HPA can use for intelligent scaling:

  • Pending request queue depth
  • Per-model utilization
  • KV cache pressure
  • Time to first token (TTFT) percentiles
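Given such a metric, the scaling arithmetic itself is the standard Kubernetes HPA rule: desired = ceil(currentReplicas * currentMetricValue / targetValue). A small sketch applying it to an assumed per-replica queue-depth target:

```python
# The standard Kubernetes HPA scaling rule applied to a queue-depth metric.
# The metric choice and target value are assumptions; the formula is k8s' own.

import math

def desired_replicas(current_replicas, queue_depth_per_replica, target_queue_depth,
                     min_replicas=1, max_replicas=16):
    desired = math.ceil(current_replicas * queue_depth_per_replica / target_queue_depth)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas, each averaging 30 pending requests against a target of 10
print(desired_replicas(4, 30, 10))
```

Queue depth and TTFT percentiles tend to be better scaling signals than GPU utilization for LLM serving, since decode keeps GPUs memory-busy even when there is headroom for more requests.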

Red Hat and Enterprise Focus

Red Hat’s involvement signals enterprise priorities: integration with OpenShift, support for air-gapped registries, FIPS compliance for the control plane, and a path to commercial support. For enterprises already on OpenShift, llm-d is the natural choice for LLM serving infrastructure.


Deployment on Kubernetes

Prerequisites

  • Kubernetes 1.29+ with Gateway API CRDs installed
  • GPU nodes with NVIDIA drivers and device plugin
  • vLLM container images
  • Sufficient inter-node bandwidth for KV cache transfer (if using disaggregation)

Basic Setup (Without Disaggregation)

Even without full prefill/decode disaggregation, llm-d provides value through intelligent routing:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: llm-d
  listeners:
  - name: http
    port: 8080
    protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
spec:
  parentRefs:
  - name: llm-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - name: vllm-llama3-pool
      port: 8000

Scaling Patterns

Pattern             | When to Use                                   | Config
--------------------|-----------------------------------------------|----------------------------------------
Homogeneous pool    | Single model, uniform traffic                 | N identical vLLM replicas behind llm-d
Heterogeneous pools | Multiple models, different GPU needs          | Model-specific pools with per-pool HPA
Disaggregated       | High throughput, long prompts + short outputs | Separate prefill/decode pools
Hybrid              | Mix of batch and real-time                    | Batch queue + real-time pool

When to Use llm-d

Use llm-d when:

  • You are operating multiple LLM models at scale on Kubernetes and need unified routing
  • You want prefix-aware request routing to maximize KV cache reuse across a fleet
  • You are on OpenShift and want a Red Hat-supported LLM serving stack
  • You need disaggregated prefill/decode for workloads with long prompts and short outputs
  • Your platform team needs to manage LLM infrastructure with familiar K8s patterns (Gateway API, HPA)

Skip llm-d when:

  • You are running a single model on a single GPU – vLLM alone is sufficient
  • You are not on Kubernetes or prefer serverless inference (use cloud-managed endpoints)
  • You need the simplest possible setup for dev/testing (use Ollama)
  • The project’s maturity does not meet your stability requirements (evaluate carefully)

Maturity

llm-d is an emerging project as of 2025-2026, with active development led by Red Hat engineers. It has been presented at KubeCon and is gaining traction in the CNCF AI ecosystem, but it is not yet a CNCF project. The architecture is sound and addresses real operational pain points, but expect API changes and gaps in documentation. Good for forward-looking platform teams willing to contribute upstream; premature for regulated environments requiring stable, supported software.


References

  • llm-d GitHub: https://github.com/llm-d/llm-d
  • Red Hat blog posts on disaggregated LLM serving
  • Kubernetes Gateway API: https://gateway-api.sigs.k8s.io
  • vLLM documentation: https://docs.vllm.ai
  • KubeCon 2025 presentations on LLM infrastructure
This post is licensed under CC BY 4.0 by the author.