
Envoy AI Gateway

Envoy AI Gateway is the CNCF-backed open-source option for AI traffic management -- built on the battle-tested Envoy proxy, it brings LLM routing, credential management, and inference-aware load balancing to Kubernetes-native deployments.


What It Is

Envoy AI Gateway (EAIGW) is an open-source project built on top of Envoy Gateway (the CNCF Kubernetes Gateway API implementation). It provides unified access to LLM providers and self-hosted models while handling enterprise concerns: authentication, rate limiting, cost tracking, and intelligent inference routing.

Backed by Tetrate and Bloomberg, it was the first CNCF-backed AI gateway project (v0.1 released February 2025).


Key Features

Unified LLM Access

Single entry point for multiple LLM providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, etc.) with provider-agnostic routing.
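To make "provider-agnostic" concrete, here is a simplified sketch of the kind of schema translation a unified entry point performs: an OpenAI-style chat request rewritten for an Anthropic-style upstream. The field names follow the two public APIs, but the translation logic is a deliberately minimal illustration, not EAIGW's actual implementation.

```python
# Illustrative sketch: translate an OpenAI-style chat request into an
# Anthropic-style one behind a single entry point. Simplified on purpose.

def to_anthropic(openai_req: dict) -> dict:
    # Anthropic takes the system prompt as a top-level field,
    # not as a message with role "system".
    system = [m["content"] for m in openai_req["messages"] if m["role"] == "system"]
    msgs = [m for m in openai_req["messages"] if m["role"] != "system"]
    out = {
        "model": openai_req["model"],
        # max_tokens is required by the Anthropic Messages API;
        # the default here is an arbitrary choice for the sketch.
        "max_tokens": openai_req.get("max_tokens", 1024),
        "messages": msgs,
    }
    if system:
        out["system"] = "\n".join(system)
    return out
```

The client keeps speaking one schema; the gateway owns the per-provider differences.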

Two-Tier Gateway Architecture

Tier 1 Gateway (ingress)
  - Authentication, global rate limiting
  - Top-level routing to provider or self-hosted cluster
        |
        v
Tier 2 Gateway (model serving)
  - Fine-grained model routing
  - Inference-aware load balancing
  - KV-cache-aware endpoint selection

Tier 1 handles external traffic (auth, quotas, provider routing). Tier 2 handles internal traffic to self-hosted model serving clusters with inference-specific intelligence.
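The Tier 1/Tier 2 split can be sketched as a routing decision: requests bound for external providers terminate at Tier 1, while self-hosted traffic is handed off to the Tier 2 inference gateway. The model names and provider map below are hypothetical, stand-ins for what EAIGW configures via CRDs.

```python
# Toy sketch of the two-tier handoff. Everything here is illustrative,
# not EAIGW configuration syntax.

EXTERNAL_PROVIDERS = {"gpt-4o": "openai", "claude-sonnet-4": "anthropic"}
SELF_HOSTED_MODELS = {"llama-3-70b", "mistral-7b"}

def tier1_target(model: str) -> str:
    """Decide where Tier 1 sends a request after auth and quota checks."""
    if model in EXTERNAL_PROVIDERS:
        return EXTERNAL_PROVIDERS[model]      # terminated at Tier 1
    if model in SELF_HOSTED_MODELS:
        return "tier2-inference-gateway"      # handed off for EPP routing
    raise ValueError(f"no route configured for model {model!r}")
```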

Intelligent Inference Routing (EPP)

Endpoint Picker (EPP) integration enables routing decisions based on real-time inference metrics:

  • KV-cache utilization per GPU
  • Queued requests per endpoint
  • LoRA adapter availability
  • GPU memory pressure

This inference-aware routing, built on the Kubernetes Gateway API Inference Extension, is a distinguishing strength of Envoy AI Gateway and critical for organizations running self-hosted models (vLLM, llm-d, TGI).
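A rough sketch of the idea behind endpoint picking: score each replica on live load signals and prefer endpoints that already have the requested LoRA adapter loaded. The metric fields and the scoring weights below are assumptions made for the sketch; the real EPP consumes actual serving-engine metrics.

```python
# Illustrative endpoint scoring in the spirit of the Endpoint Picker (EPP).
# Fields and weights are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_util: float   # 0.0-1.0, fraction of KV cache in use
    queue_len: int         # requests queued on this replica
    has_lora: bool         # requested LoRA adapter already loaded

def pick_endpoint(endpoints: list[Endpoint], need_lora: bool) -> Endpoint:
    """Prefer replicas with the adapter loaded, then the least-loaded one."""
    candidates = [e for e in endpoints if e.has_lora] if need_lora else endpoints
    if not candidates:
        candidates = endpoints  # fall back: adapter loaded on demand
    # Lower score = better: combine cache pressure and queue depth.
    return min(candidates, key=lambda e: e.kv_cache_util + 0.1 * e.queue_len)

pool = [
    Endpoint("gpu-0", kv_cache_util=0.9, queue_len=4, has_lora=True),
    Endpoint("gpu-1", kv_cache_util=0.3, queue_len=1, has_lora=True),
    Endpoint("gpu-2", kv_cache_util=0.1, queue_len=0, has_lora=False),
]
print(pick_endpoint(pool, need_lora=True).name)  # gpu-1
```

A plain round-robin balancer would happily send the request to the saturated gpu-0; load signals plus adapter locality are what make the routing "inference-aware."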

MCP Support

As of 2025, EAIGW added Model Context Protocol support with full spec compliance, OAuth authentication, and zero-friction deployment for MCP server proxying.

Credential Management

Centralized credential storage and rotation for LLM provider API keys. Applications never handle raw credentials.
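The shape of credential injection can be sketched as follows: the application sends an unauthenticated request, and the gateway attaches the provider key on the way upstream. The in-memory store and key values below are placeholders; in EAIGW the credentials live server-side (e.g., in Kubernetes Secrets), never in the client.

```python
# Sketch of gateway-side credential injection. Store contents and header
# handling are illustrative, not EAIGW internals.

CREDENTIAL_STORE = {
    "openai": "sk-example-openai",      # placeholder values
    "anthropic": "sk-example-anthropic",
}

def inject_credentials(request: dict, provider: str) -> dict:
    """Attach the provider key server-side; the client never sees it."""
    out = {**request, "headers": dict(request.get("headers", {}))}
    out["headers"]["Authorization"] = f"Bearer {CREDENTIAL_STORE[provider]}"
    return out
```

Rotation then becomes a gateway-only operation: swap the stored key, and no application needs to be redeployed.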

Cost & Quota Enforcement

Token-based rate limiting and cost tracking per consumer, per model, per team.
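Token-based quota enforcement boils down to metering usage per (consumer, model) key and rejecting once a limit is exhausted. The sketch below uses an in-memory counter and a made-up limit; a real gateway would scope this to a time window and shared state.

```python
# Sketch of per-consumer, per-model token quota enforcement.
# Limits, keys, and accounting granularity are invented for illustration.
from collections import defaultdict

TOKEN_LIMITS = {("team-a", "gpt-4o"): 10_000}  # tokens per window
usage: defaultdict[tuple, int] = defaultdict(int)

def charge(consumer: str, model: str, tokens: int) -> bool:
    """Record token usage; return False once the quota would be exceeded."""
    key = (consumer, model)
    limit = TOKEN_LIMITS.get(key)
    if limit is not None and usage[key] + tokens > limit:
        return False  # request rejected, usage unchanged
    usage[key] += tokens
    return True
```

Metering tokens rather than requests matters for LLM traffic: one long-context request can cost more than a thousand short ones.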

Kubernetes-Native

Built on Kubernetes Gateway API standard. Configured via CRDs, integrates with Istio service mesh, and runs as a standard K8s deployment.


Architecture

External Clients / Agents
        |
        v
+---------------------------+
| Tier 1: Envoy AI Gateway  |
|  - Auth (JWT, API key)    |
|  - Global rate limiting   |
|  - Provider routing       |
|  - Cost tracking          |
+---------------------------+
        |
   +----+----+
   v         v
[OpenAI]  [Self-hosted cluster]
          +---------------------------+
          | Tier 2: Inference Gateway|
          |  - EPP (endpoint picker)  |
          |  - KV-cache-aware routing |
          |  - LoRA adapter routing   |
          |  - GPU-aware balancing    |
          +---------------------------+
                    |
              +-----+-----+
              v     v     v
           [vLLM] [llm-d] [TGI]

Self-Hosting & Pricing

Fully open-source (Apache 2.0). No enterprise license required for any feature. Self-hosted on Kubernetes.

Component                     License      Cost
Envoy AI Gateway              Apache 2.0   Free
Envoy Gateway (dependency)    Apache 2.0   Free
Envoy Proxy (foundation)      Apache 2.0   Free

Tetrate offers commercial support, but the open-source project is fully functional on its own.


Limitations

  • Kubernetes-only – no bare-metal or Docker Compose deployment. Requires K8s + Gateway API.
  • Younger project – v0.1 in Feb 2025, still rapidly evolving. Less battle-tested than Kong.
  • No built-in guardrails – no PII detection, content filtering. Relies on external guardrail services.
  • No virtual keys / budget management UI – more infrastructure-level than application-level.
  • Agent-to-agent routing is basic – primarily LLM and inference routing. Not A2A-protocol-aware like Kong or agentgateway.

When to Use

Strong fit:

  • Kubernetes-native organizations that want open-source AI gateway with zero licensing cost
  • Running self-hosted models (vLLM, llm-d) and need inference-aware routing (EPP)
  • Want to extend existing Envoy/Istio service mesh with AI traffic management
  • CNCF-aligned infrastructure strategy

Weak fit:

  • Not on Kubernetes – can’t use it
  • Need application-level features (virtual keys, budgets, guardrails) – use Portkey or Kong
  • Need mature A2A agent-to-agent routing – use agentgateway or Kong
  • Small team that wants simple setup – LiteLLM or Cloudflare is easier

This post is licensed under CC BY 4.0 by the author.