LiteLLM

LiteLLM is the default choice when you need a self-hosted, open-source LLM proxy with zero licensing cost -- it does one thing well (unified LLM API) and stays out of your way.


What It Is

LiteLLM is an open-source (MIT license) Python SDK and proxy server that provides an OpenAI-compatible API to 100+ LLM providers. Any application, SDK, or tool that speaks OpenAI format can use LiteLLM as a drop-in proxy to route requests to Anthropic, Google, AWS Bedrock, Azure OpenAI, Cohere, HuggingFace, vLLM, Ollama, and dozens more.

It is both a Python library (for direct SDK use) and a standalone proxy server (for team-wide deployment). The proxy is the more common production use case.


Key Features

OpenAI-Compatible Proxy

Drop-in replacement for OpenAI’s API. Supports:

  • /chat/completions, /completions, /embeddings
  • /images, /audio, /batches, /rerank
  • /responses (Responses API)
  • /messages (Anthropic native)
  • /a2a (Agent-to-Agent)

Applications point their OpenAI SDK at LiteLLM’s URL – no code changes needed.
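
To make the "no code changes" claim concrete, here is a sketch of the request a client sends the proxy, built with only the standard library (the URL, port 4000, and the virtual key are placeholders for your own deployment; the request is constructed but not sent):

```python
import json
import urllib.request

# Assumed values: 4000 is the proxy's default port; the virtual key is a
# placeholder issued by your LiteLLM deployment.
LITELLM_PROXY_URL = "http://localhost:4000"
VIRTUAL_KEY = "sk-my-virtual-key"

# A standard OpenAI chat-completions payload; "gpt-4" is whatever alias
# you defined under model_list in the proxy config.
payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
}

req = urllib.request.Request(
    f"{LITELLM_PROXY_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {VIRTUAL_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; an OpenAI SDK configured with
# base_url=LITELLM_PROXY_URL emits the same request shape.
print(req.get_method(), req.full_url)
```

Because the wire format is identical to OpenAI's, swapping the base URL in any OpenAI-compatible SDK is the only change an application needs.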

100+ Provider Support

Unified interface across OpenAI, Anthropic, Google Vertex/Gemini, AWS Bedrock, Azure OpenAI, Cohere, Mistral, HuggingFace, Ollama, vLLM, NVIDIA NIM, Sagemaker, and more.
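
Providers are selected by a "provider/model" prefix on the model string (e.g. anthropic/claude-sonnet-4-20250514, bedrock/...). The helper below is purely illustrative of that naming convention, not LiteLLM's internal router; the default-to-OpenAI behavior for bare names reflects my understanding of the convention:

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model-id' into (provider, model_id).

    A bare model name (no slash) is treated as an OpenAI model, matching
    the convention that unprefixed names default to the OpenAI provider.
    """
    provider, sep, model_id = model.partition("/")
    if not sep:
        return ("openai", model)
    return (provider, model_id)

print(split_model_string("anthropic/claude-sonnet-4-20250514"))
print(split_model_string("bedrock/anthropic.claude-3-sonnet"))
print(split_model_string("gpt-4o"))
```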

Cost Tracking & Budget Management

  • Track spend per virtual key, per user, per team
  • Set budget caps that block requests when exceeded
  • Cost alerts and monitoring for unusual patterns
  • Built-in cost tables for major providers (auto-calculated)

Virtual Keys

Generate proxy API keys with:

  • Budget limits
  • Model access restrictions
  • Rate limits
  • Team/user assignment
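
Keys are minted through the proxy's /key/generate endpoint, authenticated with the master key. A sketch of the request (URL, master key, and team_id are placeholder values; the request is constructed but not sent):

```python
import json
import urllib.request

# Assumed deployment values; the master key comes from general_settings
# in the proxy config.
LITELLM_PROXY_URL = "http://localhost:4000"
MASTER_KEY = "sk-my-master-key"

# Request a virtual key scoped to two model aliases, with a budget cap
# and a rate limit.
payload = {
    "models": ["gpt-4", "claude-sonnet"],
    "max_budget": 25.0,       # USD; requests are blocked once exceeded
    "rpm_limit": 100,         # requests per minute
    "team_id": "search-team", # hypothetical team identifier
}

req = urllib.request.Request(
    f"{LITELLM_PROXY_URL}/key/generate",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {MASTER_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would return the new virtual key, which
# clients then use in place of a raw provider API key.
print(req.full_url)
```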

Load Balancing & Fallbacks

Configure multiple deployments of the same model. LiteLLM load-balances across them and falls back on failure. Supports weighted routing, least-latency routing, and round-robin.
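
Routing behavior is set in the same YAML config the proxy already uses. A sketch (strategy names follow LiteLLM's router options as I understand them; the model aliases and values are illustrative):

```yaml
router_settings:
  routing_strategy: least-busy   # or: simple-shuffle (weighted), latency-based-routing
  num_retries: 2
  fallbacks:
    - gpt-4: ["claude-sonnet"]   # if every gpt-4 deployment fails, retry on claude-sonnet
```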

Guardrails

  • PII masking via Presidio plugin
  • Content moderation
  • Custom pre/post-processing hooks
  • Prompt injection detection (via plugins)

Admin UI

Web-based dashboard for managing models, virtual keys, users, budgets, and viewing usage analytics.


Architecture

Client App (any OpenAI-compatible SDK)
    |
    v
+----------------------------------+
| LiteLLM Proxy Server              |
|  (Python / Docker)                |
|                                   |
|  Authentication (API keys)        |
|       |                           |
|  Rate Limiting                    |
|       |                           |
|  Router (model mapping,           |
|    load balancing, fallbacks)     |
|       |                           |
|  Pre-processing (PII, guardrails) |
|       |                           |
|  Provider API Call                |
|       |                           |
|  Post-processing                  |
|       |                           |
|  Cost Tracking & Logging          |
|       |                           |
|  PostgreSQL (logs, budgets, keys) |
+----------------------------------+
    |
    v
LLM Providers

Configuration

LiteLLM is configured via YAML:

model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-turbo
      api_base: https://my-azure.openai.azure.com
      api_key: os.environ/AZURE_API_KEY
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4-turbo
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

general_settings:
  master_key: sk-my-master-key
  database_url: postgresql://user:pass@db:5432/litellm

Multiple entries under the same model_name enable automatic load balancing and fallback between them.


Self-Hosting

LiteLLM is fully self-hosted with no SaaS dependency. Typical deployment:

docker compose up -d  # LiteLLM proxy + PostgreSQL

Component          Purpose
LiteLLM Proxy      Python server (FastAPI)
PostgreSQL         Persistent storage for keys, budgets, logs
Redis (optional)   Caching, rate limiting

Runs on Kubernetes, Docker, bare metal, or any cloud. No phone-home, no license key, no vendor lock-in.
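
A minimal compose file for that stack might look like the following; the image tag, credentials, and mounted config path are illustrative placeholders, not a pinned recommendation:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest  # pin a release tag in production
    ports:
      - "4000:4000"
    environment:
      DATABASE_URL: postgresql://user:pass@db:5432/litellm
      LITELLM_MASTER_KEY: sk-my-master-key
    volumes:
      - ./config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: litellm
```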

Enterprise (LiteLLM Enterprise)

BerriAI offers an enterprise tier with:

  • SSO / SAML
  • Advanced analytics
  • Premium support
  • Additional enterprise features

But the core proxy is fully functional in the open-source version.


Pricing / Cost Model

Tier          Cost         Features
Open Source   Free (MIT)   Full proxy, 100+ providers, cost tracking, virtual keys, admin UI
Enterprise    Custom       SSO, advanced analytics, premium support

LiteLLM is notable among major LLM gateways in that the full feature set is available for free. No per-request fees, no log-based billing, no feature gating.


Limitations

  • Python-only – the proxy is a Python/FastAPI server, with higher resource consumption than gateways built in Rust or on NGINX (Helicone, Kong). At very high throughput, this matters.
  • 8ms P95 latency at 1K RPS – good but not best-in-class. Kong and Helicone are faster.
  • Observability is basic – built-in logging works but is not as deep as Portkey or Helicone. Most teams pair LiteLLM with Langfuse or similar.
  • No semantic caching out of the box (exact match only, or integrate Redis).
  • Community support – no SLA unless you buy enterprise.

When to Use

Strong fit:

  • Teams that want a free, self-hosted LLM proxy with no licensing cost
  • Startups and side projects where budget matters
  • Organizations that already have observability (Langfuse, Datadog) and just need routing
  • Dev teams that want OpenAI-compatible API across all providers
  • Air-gapped or regulated environments where SaaS is not an option

Weak fit:

  • High-throughput enterprise workloads where proxy latency is critical – Kong or Helicone are faster
  • Teams that want integrated guardrails + observability in one tool – Portkey is more complete
  • Organizations that need enterprise support and SLAs without paying for enterprise tier

This post is licensed under CC BY 4.0 by the author.