vLLM

A high-throughput, memory-efficient LLM serving engine that uses PagedAttention to achieve 2-4x higher throughput than naive implementations, with an OpenAI-compatible API server for drop-in replacement of commercial LLM endpoints.


Why vLLM Matters

Self-hosting LLMs is expensive because GPU memory is the bottleneck. A 70B parameter model needs ~140 GB of GPU memory in FP16 just for weights, plus additional memory for the KV cache during inference. Traditional serving frameworks waste 60-80% of KV cache memory due to fragmentation and over-allocation. vLLM’s PagedAttention algorithm solves this by managing KV cache memory like an operating system manages virtual memory – paging, sharing, and defragmenting on the fly.
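The numbers above are easy to sanity-check. A rough sketch of the arithmetic, assuming a Llama-3-70B-like architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128 — illustrative figures, not read from any config):

```python
# Back-of-envelope memory math for a Llama-3-70B-like model.
# Architecture numbers (80 layers, 8 KV heads, head_dim 128) are
# assumptions for illustration, not taken from a real config file.

PARAMS = 70e9
BYTES_FP16 = 2

weight_bytes = PARAMS * BYTES_FP16  # ~140 GB just for the weights

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
# Each token stores one K and one V vector per layer per KV head.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16

print(f"weights: {weight_bytes / 1e9:.0f} GB")
print(f"KV cache: {kv_bytes_per_token / 1024:.0f} KiB/token, "
      f"{kv_bytes_per_token * 8192 / 1e9:.2f} GB per 8192-token sequence")
```

Even with grouped-query attention, every concurrent sequence can claim gigabytes of KV cache — which is why wasting 60-80% of it is so costly.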

The practical impact: you serve 2-4x more concurrent requests on the same GPU hardware, which directly translates to lower cost per token in production.


Core Technical Innovations

PagedAttention

PagedAttention is the core idea behind vLLM. Traditional inference engines pre-allocate a contiguous block of GPU memory for each request’s KV cache, sized for the maximum possible sequence length. This wastes memory because most requests do not use the full context window.

PagedAttention divides the KV cache into fixed-size blocks (pages) and allocates them on demand, like virtual memory paging:

Traditional Approach:
Request 1: [########............]  <- 60% wasted (pre-allocated max length)
Request 2: [######..............]  <- 70% wasted
GPU Memory: [Req1-full-block][Req2-full-block][  UNUSABLE  ]

PagedAttention:
Request 1: [##][##][##]                 <- only allocated pages in use
Request 2: [##][##]                     <- grows as needed
GPU Memory: [R1][R1][R1][R2][R2][ FREE ][ FREE ][ FREE ]

Benefits:

  • Near-zero memory waste (internal fragmentation drops to <4%)
  • Memory sharing across requests with common prefixes (system prompts, few-shot examples)
  • Copy-on-write for parallel sampling (beam search, best-of-n)
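The bookkeeping behind these benefits can be sketched in a few lines: a free list of fixed-size blocks, reference counts for sharing, and copy-on-write on divergence. This is a toy illustration with invented names, not vLLM's actual internals:

```python
# Toy paged KV-cache allocator: fixed-size blocks handed out on demand,
# shared via reference counts, copied on write when sequences diverge.
# Illustrative only -- not vLLM's real data structures.

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block):
        # Share a block between sequences (e.g. a common prompt prefix).
        self.refcount[block] += 1
        return block

    def write(self, block):
        # Copy-on-write: a shared block is duplicated before mutation.
        if self.refcount[block] == 1:
            return block
        self.refcount[block] -= 1
        return self.alloc()

alloc = BlockAllocator(num_blocks=8)
prompt_block = alloc.alloc()
a = alloc.fork(prompt_block)  # two parallel samples share the prompt block
b = alloc.fork(prompt_block)
b = alloc.write(b)            # one sample diverges -> gets its own copy
print(prompt_block != b, len(alloc.free))
```

Beam search and best-of-n sampling map directly onto `fork` and `write`: candidates share everything until the first token where they differ.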

Continuous Batching

Traditional batching waits for a batch to fill up, then processes all requests together. Short requests wait for the longest request in the batch to finish. Continuous batching (also called iteration-level scheduling) inserts new requests into an active batch at each decode step and evicts completed requests immediately.

Static batching:     [Req1####] [Req2##########] [Req3###]
                     All wait for Req2 to finish --> high latency for Req1, Req3

Continuous batching: [Req1####]--done--[Req4##]--done--[Req5#####]
                     [Req2##########]--done--[Req6###]
                     [Req3###]--done--[Req4 continued...]
                     Requests start/finish independently --> lower latency, higher throughput
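The scheduling loop above can be sketched as a simulation: at each decode step, finished sequences leave the batch and queued ones join immediately. A minimal, purely illustrative model:

```python
# Toy iteration-level scheduler: slots free up per decode step, not per
# batch. requests: list of (name, tokens_to_generate). Illustrative only.
from collections import deque

def continuous_batch(requests, max_batch=2):
    queue = deque(requests)
    active, finished = {}, []
    while queue or active:
        # Admit new requests as soon as a slot frees up.
        while queue and len(active) < max_batch:
            name, length = queue.popleft()
            active[name] = length
        # One decode step for every active sequence.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                finished.append(name)
                del active[name]
    return finished

# Req3 finishes long before Req2, even though it queued behind it.
print(continuous_batch([("Req1", 4), ("Req2", 10), ("Req3", 3)]))
```

With static batching, Req1 and Req3 would both wait for Req2's ten steps; here the short requests complete as soon as their own tokens are done.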

Tensor Parallelism and Pipeline Parallelism

For models that do not fit on a single GPU:

  • Tensor Parallelism (TP): Splits individual layers across GPUs. Each GPU holds a slice of every layer. Requires fast interconnect (NVLink). Use TP within a single node.
  • Pipeline Parallelism (PP): Splits layers sequentially across GPUs. GPU 0 gets layers 0-15, GPU 1 gets layers 16-31. Works across nodes with slower interconnect.
  • Combined: TP within nodes, PP across nodes for very large models.
# Serve Llama 3 70B across 4 GPUs with tensor parallelism
vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 4
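For multi-node serving, the two strategies combine; both flags exist in vLLM, though the sizes below are an illustrative topology (2 nodes of 8 GPUs), not a tuned configuration:

```shell
# TP within each node (fast NVLink), PP across the two nodes
vllm serve meta-llama/Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```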

Key Features

OpenAI-Compatible API

vLLM exposes /v1/chat/completions, /v1/completions, and /v1/models endpoints. Any application using the OpenAI SDK can point to vLLM by changing the base URL.

from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-service:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    temperature=0.7,
    max_tokens=512
)

Supported Model Architectures

vLLM supports most major open-weight model families: Llama 2/3, Mistral, Mixtral, Qwen, Falcon, GPT-NeoX, Phi, StableLM, Yi, DeepSeek, Gemma, Command-R, and more. It also supports multi-modal models (LLaVA, Phi-3-Vision) for image+text inputs.

Quantization Support

  • GPTQ, AWQ, SqueezeLLM, FP8 (native on Hopper GPUs)
  • Reduces memory usage by 2-4x with minimal quality loss
  • Enables serving larger models on smaller GPU configurations
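Quantized checkpoints are selected with the `--quantization` flag; the model name below is one of the community AWQ builds and is only illustrative:

```shell
# Serve an AWQ-quantized 70B model; AWQ weights are ~4x smaller than FP16
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq
```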

Prefix Caching

Automatically caches and shares KV cache blocks for common prefixes across requests. If 100 requests share the same system prompt, the KV cache for that prefix is computed once and shared. Significant speedup for applications with long, repeated system prompts.
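One way to picture how shared prefixes are detected: token IDs are split into fixed-size blocks, and each block is keyed by a hash of the entire prefix up to and including it, so identical prefixes map to identical cache keys. A toy sketch (the block size and all names are illustrative, not vLLM internals):

```python
# Toy hash-based prefix caching: identical prompt prefixes produce
# identical block keys, so their KV cache entries can be shared.

BLOCK = 4  # tokens per block (chosen for the example; vLLM's default is 16)

def prefix_block_keys(token_ids):
    keys, prefix = [], ()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prefix = prefix + tuple(token_ids[i:i + BLOCK])
        keys.append(hash(prefix))  # key covers the whole prefix, not just this block
    return keys

system_prompt = [101, 7, 7, 42, 13, 9, 9, 5]   # shared across requests
req_a = system_prompt + [1, 2, 3, 4]
req_b = system_prompt + [9, 9, 9, 9]

a, b = prefix_block_keys(req_a), prefix_block_keys(req_b)
print(a[:2] == b[:2], a[2] == b[2])  # prompt blocks hit the cache, tails miss
```

Keying on the whole prefix (not just the block's own tokens) is what makes the cache safe: a block's KV values depend on everything before it.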

Speculative Decoding

Uses a small draft model to generate candidate tokens, then verifies them with the main model in a single forward pass. Can achieve 2x speedup for latency-sensitive applications without quality degradation.
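The draft-and-verify loop can be sketched with toy "models" (plain functions). Real systems accept draft tokens probabilistically against the target distribution; greedy agreement is used here only to keep the sketch deterministic, and every name is invented for illustration:

```python
# Toy speculative decoding step: a cheap draft proposes k tokens, the
# target checks them and keeps the longest agreeing run plus one token
# of its own. Illustrative only.

def speculative_step(draft_next, target_next, ctx, k=4):
    """draft_next/target_next: fn(token list) -> next token (greedy)."""
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):                      # draft runs k cheap steps
        tok = draft_next(d_ctx)
        proposal.append(tok)
        d_ctx.append(tok)
    accepted = []
    for tok in proposal:                    # target verifies position by position
        correct = target_next(ctx + accepted)
        if tok == correct:
            accepted.append(tok)
        else:
            accepted.append(correct)        # target's token replaces the miss
            break
    else:
        accepted.append(target_next(ctx + accepted))  # bonus token on full accept
    return accepted

# Toy models: the target emits len(context); the draft agrees until 3.
draft = lambda c: min(len(c), 3)
target = lambda c: len(c)
print(speculative_step(draft, target, [0]))  # -> [1, 2, 3, 4]
```

Four tokens come out of what is, for the expensive target model, effectively one verification pass — that batching of verification is where the speedup comes from.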


Performance Characteristics

Rough benchmarks (actual numbers depend heavily on hardware, model, and workload):

Metric                         vLLM               TGI (HuggingFace)   Naive HF
-----------------------------  -----------------  ------------------  ------------------
Throughput (tokens/s, batch)   2-4x baseline      1.5-2x baseline     1x baseline
Memory efficiency              95%+ utilization   70-80%              40-60%
Time to first token (TTFT)     Low                Low                 High (no batching)
Max concurrent requests        High (paged)       Medium              Low

The throughput advantage grows with higher concurrency and longer sequences, which is exactly the production scenario.


Kubernetes Deployment

Basic Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3-70B-Instruct"
        - "--tensor-parallel-size"
        - "4"
        - "--max-model-len"
        - "8192"
        - "--gpu-memory-utilization"
        - "0.9"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 4
          requests:
            memory: "64Gi"
            cpu: "8"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-llama3
  ports:
  - port: 8000
    targetPort: 8000

Production Considerations

  • Model loading time: Large models take 2-10 minutes to load. Use pre-warmed replicas and readiness probes with sufficient initialDelaySeconds.
  • GPU scheduling: Use nvidia.com/gpu resource limits. Consider the NVIDIA GPU Operator for automatic driver management.
  • Horizontal scaling: Use KEDA or custom HPA based on queue depth / pending requests rather than CPU utilization.
  • Model storage: Use a PVC backed by fast storage (NVMe, SSD) or pre-pull model weights into node-local storage. Downloading 140 GB from HuggingFace on every pod start is not viable.
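A probe fragment for the container in the Deployment above might look like this; the delays are illustrative and should be tuned to your model's observed load time:

```yaml
# vLLM's OpenAI-compatible server exposes a /health endpoint
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300   # allow minutes for 70B weights to load
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60       # 60 x 10s = up to 10 minutes to start
  periodSeconds: 10
```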

When to Use vLLM

Use vLLM when:

  • You need high-throughput LLM serving for production workloads on your own GPUs
  • You want an OpenAI-compatible API for self-hosted models (easy migration path)
  • You are serving models at scale where memory efficiency directly reduces hardware costs
  • You need advanced features like speculative decoding, prefix caching, or multi-modal support

Consider alternatives when:

  • You want managed serving without infrastructure management (use cloud provider endpoints)
  • You are doing development/experimentation only (Ollama is simpler)
  • You need the absolute lowest latency for single requests (TensorRT-LLM may edge out vLLM)
  • You are on CPU-only infrastructure (vLLM is GPU-focused; use llama.cpp or Ollama for CPU)

References

  • vLLM documentation: https://docs.vllm.ai
  • vLLM GitHub: https://github.com/vllm-project/vllm
  • PagedAttention paper: “Efficient Memory Management for Large Language Model Serving with PagedAttention” (Kwon et al., 2023)
  • vLLM blog: https://blog.vllm.ai
  • OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
This post is licensed under CC BY 4.0 by the author.