llm-d
A Kubernetes-native, disaggregated LLM serving stack that separates prefill and decode phases across specialized node pools, using the Gateway API for intelligent request routing – designed for large-scale, multi-model inference with enterprise-grade operability.
What llm-d Is
llm-d (pronounced “LLM-dee,” the “d” stands for “disaggregated”) is an open-source project led by Red Hat that rethinks how LLM inference works at scale on Kubernetes. While vLLM optimizes inference within a single serving instance, llm-d optimizes the infrastructure layer around and between inference instances – routing, scheduling, scaling, and disaggregating the inference pipeline itself.
The core idea: LLM inference has two distinct computational phases (prefill and decode) with very different resource profiles. Running them on the same hardware wastes resources. llm-d splits them apart and schedules each phase on purpose-built infrastructure.
Why Disaggregated Inference
The Prefill/Decode Problem
LLM inference has two phases:
Prefill (prompt processing): Process all input tokens in parallel. This is compute-bound – it benefits from maximum GPU FLOPs. A long prompt (8K tokens) requires significant computation but completes in a single forward pass.
Decode (token generation): Generate output tokens one at a time, autoregressively. This is memory-bandwidth-bound – each step reads the full KV cache but does relatively little computation. The GPU is mostly idle, waiting for memory reads.
```
Prefill phase:  [========= HIGH GPU COMPUTE =========]   (one pass, all input tokens)
Decode phase:   [=].[=].[=].[=].[=].[=].[=].[=].[=].     (many passes, one token each)
                 ^ memory-bound, GPU underutilized

Combined on same GPU:
  GPU Compute:  ████░░░░░░░░░░░░░░░░  (20% average utilization during decode)
  GPU Memory:   ████████████████████  (100% allocated for KV cache)

Disaggregated:
  Prefill GPU:  ████████████████████  (high utilization, processes prompts)
  Decode GPU:   ████████████████████  (optimized for memory bandwidth)
```
By separating these phases, each GPU pool can be optimized independently. Prefill nodes use high-compute GPUs (H100 SXM); decode nodes can use memory-optimized or even cheaper GPUs.
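The imbalance can be made concrete with a back-of-envelope arithmetic-intensity calculation. The numbers below are illustrative (a 70B-parameter model in FP16 is assumed, and only weight reads are counted; KV-cache and activation traffic are ignored):

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of HBM traffic) for
# the two phases. Illustrative model: 70B params in FP16; only weight reads
# are counted, KV-cache and activation traffic are ignored.

def arithmetic_intensity(tokens_per_step: int, params: float, dtype_bytes: int = 2) -> float:
    flops = 2 * params * tokens_per_step   # ~2 FLOPs per parameter per token
    bytes_moved = params * dtype_bytes     # weights are read once per forward step
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens_per_step=8192, params=70e9)  # whole prompt, one pass
decode = arithmetic_intensity(tokens_per_step=1, params=70e9)      # one token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte  decode: {decode:.0f} FLOPs/byte")
# Prefill amortizes each weight read over thousands of tokens; decode cannot,
# so decode throughput is capped by memory bandwidth rather than compute.
```

Under this simplification the intensity is just tokens per step: thousands of FLOPs per byte during prefill versus one during decode, which is the gap disaggregation exploits.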
Architecture
```
+----------------------------------------------------------------+
|                       Kubernetes Cluster                        |
|                                                                 |
|  +------------------+                                           |
|  |   Gateway API    | <-- Ingress point, LLM-aware routing      |
|  |  (Envoy + llm-d  |                                           |
|  | routing plugin)  |                                           |
|  +--------+---------+                                           |
|           |                                                     |
|  +--------v---------+         +------------------+              |
|  |   Prefill Pool   |         |   Decode Pool    |              |
|  | +------+ +------+|   KV    |+------+ +------+ |              |
|  | |vLLM  | |vLLM  ||  cache  ||vLLM  | |vLLM  | |              |
|  | |inst-1| |inst-2|| transfer||inst-3| |inst-4| |              |
|  | +------+ +------+|         |+------+ +------+ |              |
|  +--------+---------+         +--------+---------+              |
|           |                            |                        |
|  +--------v----------------------------v---------+              |
|  |       Shared KV Cache / Transfer Layer        |              |
|  |      (RDMA, NVLink, or network transfer)      |              |
|  +-----------------------------------------------+              |
|                                                                 |
|  +-----------------------------+                                |
|  | Endpoint Picker / Scheduler | <-- Decides which instance     |
|  |  (prefix-aware, load-aware) |     handles each request       |
|  +-----------------------------+                                |
+----------------------------------------------------------------+
```
Key Components
Gateway API Integration: llm-d uses the Kubernetes Gateway API (not legacy Ingress) as the traffic entry point. The gateway includes an LLM-aware routing layer that understands request characteristics (prompt length, model requested, prefix matching) and routes accordingly.
Endpoint Picker: The intelligent scheduler that decides which vLLM instance handles each request. It considers:
- Prefix affinity – route requests with similar system prompts to the same instance to maximize KV cache reuse
- Load balancing – avoid overloading any single instance
- Phase routing – send prefill work to prefill-optimized nodes, decode work to decode-optimized nodes
- Model awareness – route to instances serving the requested model
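As a sketch, a picker combining these criteria might score candidates like this. The class names, weights, and scoring rule are hypothetical, not llm-d's actual implementation:

```python
# Hypothetical endpoint-picker scoring: model awareness filters, prefix
# affinity dominates, queue depth breaks ties. Not llm-d's real code.
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    model: str
    queue_depth: int        # pending requests (load signal)
    cached_prefixes: set    # prefix digests this instance holds in KV cache

def pick_endpoint(instances, model: str, prefix_digest: str) -> Instance:
    """Pick an instance serving `model`, preferring cache hits, then low load."""
    candidates = [i for i in instances if i.model == model]   # model awareness
    if not candidates:
        raise LookupError(f"no instance serves {model}")
    def score(inst: Instance) -> float:
        affinity = 100.0 if prefix_digest in inst.cached_prefixes else 0.0
        return affinity - inst.queue_depth   # cache hit outweighs moderate load
    return max(candidates, key=score)

pool = [
    Instance("inst-1", "llama3-70b", queue_depth=4, cached_prefixes={"sys-a"}),
    Instance("inst-2", "llama3-70b", queue_depth=1, cached_prefixes=set()),
]
print(pick_endpoint(pool, "llama3-70b", "sys-a").name)  # inst-1: cache hit wins despite load
```

The design point the sketch illustrates: a warm KV cache usually saves more time than routing to the least-loaded instance, so affinity is weighted above load.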
vLLM as the Inference Engine: llm-d does not replace vLLM – it orchestrates multiple vLLM instances. Each instance in the pool runs vLLM for actual inference. llm-d adds the scheduling, routing, and disaggregation layers on top.
KV Cache Transfer: When prefill and decode are on different nodes, the KV cache computed during prefill must be transferred to the decode node. This is the critical data path and requires fast interconnect (RDMA, NVLink across nodes, or high-bandwidth network).
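A rough size estimate shows why the interconnect matters. The shapes below are assumed, Llama-3-70B-like values (80 layers, 8 KV heads, head dim 128, FP16); check your model's config before relying on them:

```python
# Rough size of the KV cache that must move from prefill node to decode node.
# Shapes are assumed Llama-3-70B-like values, not read from a real config.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

def transfer_ms(nbytes, gbytes_per_sec):
    return nbytes / (gbytes_per_sec * 1e9) * 1e3

size = kv_cache_bytes(tokens=8192)
print(f"cache: {size / 1e6:.0f} MB")                       # ~2.7 GB for one 8K prompt
print(f"RDMA 100 GB/s:   {transfer_ms(size, 100):.1f} ms")
print(f"25 GbE (~3 GB/s): {transfer_ms(size, 3):.0f} ms")
```

Under these assumptions a single 8K-token prompt produces roughly 2.7 GB of cache: tens of milliseconds over RDMA, but closer to a second over a commodity network, which would erase the benefit of disaggregation.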
How llm-d Differs from vLLM
| Aspect | vLLM (standalone) | llm-d |
|---|---|---|
| Scope | Single inference instance | Fleet of instances + routing |
| Scaling unit | Replicas of one deployment | Heterogeneous node pools |
| Routing | Round-robin / random | Prefix-aware, load-aware, phase-aware |
| Prefill/Decode | Coupled on same GPU | Can be disaggregated |
| Multi-model | Separate deployments | Unified gateway, multi-model routing |
| K8s integration | Basic Deployment + Service | Gateway API, CRDs, HPA integration |
| Operated by | ML engineers | Platform / infra teams |
The relationship is complementary, not competitive: llm-d uses vLLM under the hood but adds the platform layer that makes operating a fleet of LLM instances manageable.
Key Features
Prefix-Aware Routing
The endpoint picker maintains a map of which vLLM instances have which prefixes cached. When a new request arrives with a system prompt that matches an existing cache, it routes to that instance, avoiding redundant prefill computation.
This matters enormously for enterprise use cases where many requests share the same system prompt (customer service bot, code assistant, etc.). Without prefix-aware routing, every instance independently caches the same prefix, wasting GPU memory across the fleet.
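A minimal sketch of the bookkeeping behind such a map might look like this. Hashing the whole system prompt is a deliberate simplification (real implementations track cache at block granularity and evict entries); the class and method names are hypothetical:

```python
# Minimal prefix-to-instance map, the data structure behind prefix-aware
# routing. Whole-prompt hashing is a simplification; names are hypothetical.
import hashlib
from collections import defaultdict

class PrefixMap:
    def __init__(self):
        self._cached_on = defaultdict(set)  # prefix digest -> instance names

    @staticmethod
    def digest(system_prompt: str) -> str:
        return hashlib.sha256(system_prompt.encode()).hexdigest()[:16]

    def record(self, system_prompt: str, instance: str) -> None:
        """Note that `instance` has prefilled (and so cached) this prefix."""
        self._cached_on[self.digest(system_prompt)].add(instance)

    def instances_with(self, system_prompt: str) -> set:
        """Instances that can skip prefill for this prefix."""
        return self._cached_on.get(self.digest(system_prompt), set())

pm = PrefixMap()
pm.record("You are a helpful support agent.", "inst-1")
print(pm.instances_with("You are a helpful support agent."))  # {'inst-1'}
print(pm.instances_with("You are a pirate."))                 # set()
```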
Multi-Model Serving
A single llm-d gateway can route to different model pools:
```
/v1/chat/completions (model: llama3-70b) --> Pool A (4x H100)
/v1/chat/completions (model: mistral-7b) --> Pool B (2x A100)
/v1/chat/completions (model: phi3-mini)  --> Pool C (1x L4)
```
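Conceptually, the gateway reads the model field from the OpenAI-compatible request body and maps it to a backend pool. A sketch of that dispatch, using illustrative pool names (this is not llm-d's actual API):

```python
# Sketch of model-name dispatch at the gateway. The OpenAI-compatible body
# carries a "model" field; pool names here are illustrative, not a real API.
import json

MODEL_POOLS = {
    "llama3-70b": "pool-a-h100",
    "mistral-7b": "pool-b-a100",
    "phi3-mini": "pool-c-l4",
}

def route(body: bytes) -> str:
    """Return the pool that serves the model named in the request body."""
    model = json.loads(body).get("model")
    try:
        return MODEL_POOLS[model]
    except KeyError:
        raise LookupError(f"no pool serves model {model!r}") from None

req = json.dumps({"model": "mistral-7b", "messages": []}).encode()
print(route(req))  # pool-b-a100
```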
Autoscaling Integration
llm-d exposes metrics that KEDA and HPA can use for intelligent scaling:
- Pending request queue depth
- Per-model utilization
- KV cache pressure
- Time to first token (TTFT) percentiles
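A toy version of the replica calculation an autoscaler performs on the queue-depth metric, mirroring the HPA's target-value formula (the targets and bounds here are illustrative):

```python
# Toy autoscaling math on the pending-queue-depth metric. Target and bounds
# are illustrative; HPA/KEDA compute the equivalent from configured targets.
import math

def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Replicas needed so each handles at most `target_per_replica` pending requests."""
    raw = math.ceil(queue_depth / target_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(queue_depth=37, target_per_replica=8))  # -> 5
```

Queue depth is a better scaling signal for LLM serving than GPU utilization, since decode keeps utilization misleadingly low even when requests are backing up.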
Red Hat and Enterprise Focus
Red Hat’s involvement signals enterprise priorities: integration with OpenShift, support for air-gapped registries, FIPS compliance for the control plane, and a path to commercial support. For enterprises already on OpenShift, llm-d is a natural candidate for LLM serving infrastructure.
Deployment on Kubernetes
Prerequisites
- Kubernetes 1.29+ with Gateway API CRDs installed
- GPU nodes with NVIDIA drivers and device plugin
- vLLM container images
- Sufficient inter-node bandwidth for KV cache transfer (if using disaggregation)
Basic Setup (Without Disaggregation)
Even without full prefill/decode disaggregation, llm-d provides value through intelligent routing:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
spec:
  gatewayClassName: llm-d
  listeners:
    - name: http
      port: 8080
      protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama3-route
spec:
  parentRefs:
    - name: llm-gateway
  rules:
    - matches:
        - path:
            value: /v1/chat/completions
      backendRefs:
        - name: vllm-llama3-pool
          port: 8000
```
Scaling Patterns
| Pattern | When to Use | Config |
|---|---|---|
| Homogeneous pool | Single model, uniform traffic | N identical vLLM replicas behind llm-d |
| Heterogeneous pools | Multiple models, different GPU needs | Model-specific pools with per-pool HPA |
| Disaggregated | High throughput, long prompts + short outputs | Separate prefill/decode pools |
| Hybrid | Mix of batch and real-time | Batch queue + real-time pool |
When to Use llm-d
Use llm-d when:
- You are operating multiple LLM models at scale on Kubernetes and need unified routing
- You want prefix-aware request routing to maximize KV cache reuse across a fleet
- You are on OpenShift and want a Red Hat-supported LLM serving stack
- You need disaggregated prefill/decode for workloads with long prompts and short outputs
- Your platform team needs to manage LLM infrastructure with familiar K8s patterns (Gateway API, HPA)
Skip llm-d when:
- You are running a single model on a single GPU – vLLM alone is sufficient
- You are not on Kubernetes or prefer serverless inference (use cloud-managed endpoints)
- You need the simplest possible setup for dev/testing (use Ollama)
- The project’s maturity does not meet your stability requirements (evaluate carefully)
Maturity
llm-d is an emerging project as of 2025-2026, with active development led by Red Hat engineers. It has been presented at KubeCon and is gaining traction in the CNCF AI ecosystem, but it is not yet a CNCF project. The architecture is sound and addresses real operational pain points, but expect API changes and gaps in documentation. Good for forward-looking platform teams willing to contribute upstream; premature for regulated environments requiring stable, supported software.
References
- llm-d GitHub: https://github.com/llm-d/llm-d
- Red Hat blog posts on disaggregated LLM serving
- Kubernetes Gateway API: https://gateway-api.sigs.k8s.io
- vLLM documentation: https://docs.vllm.ai
- KubeCon 2025 presentations on LLM infrastructure