LLM Architecture & Training
Building and training the giants: From transformer foundations to production-scale language models.
Transformer Architecture
Core Equation:
```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```
Where:
Q = queries (what to pay attention to)
K = keys (what can be attended to)
V = values (what information to extract)
sqrt(d_k) = scaling that keeps dot products from saturating the softmax (which would otherwise shrink gradients)
Multi-Head Attention:
- 8–16 parallel attention heads
- Each head learns different semantic relationships
- Concatenate head outputs and apply an output projection: MultiHead = concat(head_1, head_2, …, head_n) W_O
Feed-Forward Network:
- Two linear layers with ReLU in between
- Per-position non-linearity
- Structure: [d → 4d → d] (expand then contract)
Layer Norm + Residual:
- Stabilizes training (enables deep stacking)
- Residual: x + sublayer(norm(x))
- Pre-norm (modern): norm(x) then sublayer
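The attention equation above can be sketched in plain Python. This is a toy single-head version using lists of floats as vectors; `softmax` and `attention` are illustrative names, not a framework API:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scale scores by sqrt(d_k) so softmax stays in a useful range
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value vectors, weighted by how well the query matches each key.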
Pre-Training Phase
Objective: Next-Token Prediction
Given token sequence [t1, t2, …, tn], predict tn+1.
```
loss = CrossEntropyLoss(model_output, target_token)
# Minimize over billions of training steps
```
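The cross-entropy loss above is just the negative log-probability the model assigns to the correct next token. A minimal stdlib version (function name is illustrative):

```python
import math

def cross_entropy(logits, target_idx):
    """-log softmax(logits)[target_idx], computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]
```

Uniform logits over a vocabulary of size V give loss ln(V); confidently correct predictions drive the loss toward zero.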
Data & Scale:
| Model | Parameters | Training Data | Compute | Time |
|---|---|---|---|---|
| GPT-2 | 1.5B | 40GB (WebText) | ~5 GPU-years | ~2 weeks |
| GPT-3 | 175B | 300B tokens | ~3000 GPU-years | ~1 month |
| GPT-4 | Unknown (>175B) | ~2T tokens estimated | ~25K GPU-years | ~3 months |
| Claude 3 | 100B–200B | ~1T tokens | ~15K GPU-years | ~2 months |
Key Insight: Training compute ≈ 6 × parameters × training tokens. Scaling laws dictate how to split a fixed budget between model size and data.
Post-Training Phase: RLHF
RLHF = Reinforcement Learning from Human Feedback
Process:
1. Collect Human Preferences:
   - Generate two outputs from the base model
   - A human rater chooses the better one
   - Collect 50K–500K preference pairs
2. Train Reward Model:
   - Binary classification: output_A > output_B?
   - Learns human preferences
   - Accuracy: 75–90%
3. RL Fine-tuning:
   - Use the reward model as the training signal
   - Optimize the policy (LLM) to maximize reward
   - Penalize KL divergence (don't drift too far from the base model)
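The KL-penalized objective in the RL fine-tuning step can be sketched per token. This is a simplified scalar version; the function name and default `beta` are illustrative:

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-token RLHF objective: reward minus a KL penalty that keeps
    the policy close to the reference (base) model."""
    kl = logp_policy - logp_ref  # sample estimate of the KL divergence
    return reward - beta * kl
```

When the policy assigns the same log-probability as the base model, the penalty vanishes; the more the policy drifts, the more reward it must earn to compensate.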
Alternative: Constitutional AI
- Anthropic’s approach: Train model to follow constitutional principles
- Less human annotation needed
- Effective for safety alignment
Training Optimization
Gradient Accumulation: Simulate large batch sizes
```python
optimizer.zero_grad()
for i in range(num_accumulation_steps):
    # Scale each micro-batch loss so the accumulated gradient is an average
    loss = model(batches[i]) / num_accumulation_steps
    loss.backward()   # accumulate gradients
optimizer.step()      # single update after accumulation
```
Gradient Checkpointing: Trade compute for memory
- Recompute activations during backward pass
- Enables larger batch sizes on same GPU memory
Mixed Precision: FP32 (full precision) + FP16 (half precision)
- Master weights: FP32 (kept by the optimizer)
- Forward/backward math (activations, gradients): FP16, with loss scaling
- Reduces memory by 2x, speeds up by 1.5–2x
- Minimal accuracy loss
Flash Attention: Optimized attention computation
- Reduces memory bandwidth
- 2–4x speedup vs standard attention
- Used in training & inference for large models
Scaling Laws (Chinchilla Scaling)
Finding: For a given compute budget, model size and training data should be scaled together, at roughly 20 tokens per parameter.
```
Chinchilla (2022): D ≈ 20 × N
Where:
  N = model parameters
  D = training tokens
  Compute C ≈ 6 × N × D

Compute-optimal: for a fixed budget C, choose N and D so that D ≈ 20 × N.
```
Implication: Previous models (GPT-3) were undertrained. Modern models (GPT-4, Claude) train longer.
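Under the assumptions above (C ≈ 6ND and D ≈ 20N), the compute-optimal model size follows from a square root. A small helper with a hypothetical name:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * (tokens_per_param * N) for the compute-optimal
    parameter count N, and return (N, D) with D = tokens_per_param * N."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n
```

For example, a budget of about 2 × 10^22 FLOPs lands near 13B parameters and 260B tokens.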
Inference Optimization
KV Cache: Store computed attention keys/values
- Avoids recomputation on each new token
- Reduces per-token generation cost from O(seq_len²) to O(seq_len)
- Trade-off: Memory increases with context length
Quantization: Reduce precision
- Int8: 75% memory reduction
- Int4: 87.5% reduction
- Accuracy loss: 1–3% on average
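The simplest scheme behind these numbers is symmetric round-to-nearest Int8 quantization, sketched here on plain Python lists (toy helpers, not a library API):

```python
def quantize_int8(xs):
    """Map floats to [-127, 127] ints with one shared scale per tensor."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # avoid div-by-zero on all-zeros
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by ~scale/2 per element."""
    return [qi * scale for qi in q]
```

Each value costs 1 byte instead of 4 (FP32), which is where the 75% memory reduction comes from.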
Speculative Decoding: Smaller model drafts, larger model verifies
- 2–3x speedup possible
- Used in production systems
Production Architecture
```
Training Infrastructure:
- 1000s of GPUs (A100/H100)
- Distributed training (data + tensor parallelism)
- Checkpointing every ~1000 steps

Inference Infrastructure:
- Batch inference (latency-throughput tradeoff)
- KV cache for single-token generation
- Quantization for memory efficiency
- Load balancing across replicas

Cost Breakdown (ChatGPT-scale):
- Training: $1M–$10M upfront
- Infrastructure: $1M–$5M/month (compute, storage)
- Human feedback (RLHF): $100K–$500K
- Post-training: $100K–$1M
```
Positional Encoding Techniques
Absolute Positional Encoding (Original Transformer):
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
Pros: Works for fixed-length sequences (up to training length)
Cons: Doesn't generalize to longer sequences than trained on
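The two formulas above, written out for one position (a toy stdlib version; the function name is illustrative):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal absolute positional encoding: each (2i, 2i+1) dimension
    pair is a sin/cos at a geometrically decreasing frequency."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```

Every position gets a unique, bounded vector; nearby positions get similar vectors because the low-frequency dimensions change slowly.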
Relative Positional Encoding (Modern):
- Encodes distance between tokens, not absolute positions
- Generalizes to longer sequences than training (key advantage)
- Used in Transformer-XL, T5, and many modern LLMs
- ALiBi (Attention with Linear Biases): Even simpler, just linear bias in attention
Rotary Position Embeddings (RoPE):
- Applies rotation matrices to queries/keys
- Combines benefits of both absolute and relative encoding
- Excellent generalization to 2x+ training length
- Used in Llama 2, GPT-NeoX, PaLM
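The core RoPE operation on a single feature pair, showing why attention scores depend only on relative position (a toy sketch; real implementations rotate every pair at a different frequency):

```python
import math

def rope_rotate(x0, x1, pos, theta=0.1):
    """Rotate an (even, odd) feature pair by angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return x0 * c - x1 * s, x0 * s + x1 * c

def dot2(a, b):
    """Dot product of two 2-vectors."""
    return a[0] * b[0] + a[1] * b[1]
```

Because rotations compose, dot(R(a)q, R(b)k) = dot(q, R(b - a)k): the score depends only on the distance b - a, not on the absolute positions.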
Attention Mechanism Variants
- Standard scaled dot-product attention: O(n²) complexity, exact
- Sparse attention: block-sparse, O(n√n) complexity, approximate
- Linear attention: O(n) complexity, kernel-based (no softmax)
- Multi-Query Attention (MQA): single K/V shared across all heads, ~3x speedup, minimal quality loss
- Grouped-Query Attention (GQA): middle ground, groups of query heads share K/V
Trade-off:
```
Standard: perfect quality, slow (O(n²))
MQA: fast (10-20x attention speedup), 1-3% quality loss
GQA: balanced (~5x speedup, <1% loss)
```
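The head-sharing pattern behind MQA and GQA reduces to one index mapping (illustrative helper; n_kv_heads=1 recovers MQA, n_kv_heads=n_query_heads recovers standard multi-head attention):

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Which K/V head a given query head reads under GQA:
    consecutive groups of query heads share one K/V head."""
    assert n_query_heads % n_kv_heads == 0
    return query_head // (n_query_heads // n_kv_heads)
```

Fewer K/V heads means a proportionally smaller KV cache, which is where the inference speedup comes from.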
How Real Systems Train & Deploy
OpenAI GPT-4 Training (Public Info):
- Trained on an estimated ~2T tokens (~7x GPT-3's 300B)
- ~3 months on a custom tensor-parallel cluster (estimated 25K-50K GPUs, probably H100s)
- Size: ~1.7T parameters (rumored, not confirmed)
- Training cost: ~$50-100M (compute only, not including infrastructure)
- Post-training: RLHF on 50K-100K human preference pairs, ~$1-5M
- Inference: deployed across Azure regions, estimated 100K+ GPUs for production serving
- Why this approach: large scale is necessary for emergent capabilities (reasoning, instruction following); RLHF is critical for alignment with human values; distributed training is essential to fit the model across GPU clusters
Anthropic Claude 3 Series Training:
- Trained on ~1-2T tokens with custom training code optimized for Constitutional AI (critique-based rather than traditional RLHF)
- Model sizes (estimated): Claude 3 Sonnet ~50B params (optimized for speed), Claude 3 Opus ~100B params (most capable)
- Infrastructure: 10K-20K GPUs, 2-3 months per model
- Cost per model: ~$10-50M (smaller than GPT-4 due to efficiency)
- Alignment: instead of labeling 100K preference pairs, the model critiques its own outputs against explicit principles (more scalable)
- Inference: deployed on custom infrastructure, 99.9% uptime SLA
- Why Constitutional AI: more transparent (rules are explicit), scales with fewer human labels, safety by design
Meta LLaMA 2 Training (Open-Source):
- Trained on 2T tokens; 70B-parameter flagship model
- Infrastructure: custom distributed training on commodity A100 GPUs (estimated 12-16K total)
- Training cost: ~$5-10M (much cheaper than GPT-4, but longer wall-clock time on weaker hardware)
- Post-training: ~27K human annotations for instruction tuning, plus a preference dataset for safety RLHF
- Inference: Int8 quantization reduces memory ~4x, enabling single-GPU inference
- Why open-source: research community, industry credibility, faster iteration via external contributions
Together AI Fine-tuning (Enterprise Service):
- Fine-tuning-as-a-service for custom models
- Typical flow for fine-tuning LLaMA-7B on 10K examples: (1) customer uploads dataset, (2) Together trains on 1-2 H100 GPUs for 4-8 hours, (3) fine-tuned model is deployed
- Cost: $500-2000 depending on model size; 24-48 hour turnaround
- Quality gain: 10-20% improvement on task-specific metrics
- Why enterprises use it: no infrastructure investment, fast iteration, pay-as-you-go, managed SLAs
Google PaLM 2 / Gemini Training (Multimodal):
- Gemini trained multimodally (text + image + audio) on an estimated ~10T tokens
- Scale: estimated 1-1.5T parameters
- Infrastructure: TPU SuperPods (thousands of TPU v5 chips), 4-6 months of training
- Innovations: multimodal alignment (image-text pairs), very long context (up to ~1M tokens in Gemini 1.5, vs 128K for GPT-4), in-context learning with thousands of examples
- Why multimodal: modern AI needs vision; long context enables new use cases (entire books or codebases as context)
Production Inference Optimization in Detail
Batch Processing for Latency-Throughput Trade-off:
```
Single-request inference: 100ms latency, ~100 req/sec throughput
Batch=32 inference: ~3 seconds latency, ~10K req/sec throughput
Suitable for: batch scoring jobs, offline analytics (not real-time chat)
```
KV Cache Memory Explosion Problem:
```
Context length 128K tokens:
- Without KV cache: recompute attention over the whole prefix for every new token
- With KV cache: store 128K × embedding_dim × num_layers × 2 (Keys + Values) elements
- Example: 128K × 1024 × 40 × 2 ≈ 10.7 GB per request (at 1 byte/element, Int8; double for FP16)
- With 1000 concurrent requests: ~10.7 TB of memory (infeasible)

Solutions:
1. PagedAttention: manage KV cache like virtual memory (vLLM)
2. Quantization: Int8 KV cache reduces memory 4x vs FP32
3. Sliding window: keep only the last 4K tokens in cache
4. Token merging: merge similar tokens before attention
```
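The cache-size arithmetic generalizes to a one-line estimator (hypothetical helper; assumes one key and one value vector of size d_model per layer per token):

```python
def kv_cache_bytes(seq_len, d_model, n_layers, bytes_per_element=2):
    """KV cache size: keys + values, per layer, per cached token
    (FP16, i.e. 2 bytes per element, by default)."""
    return seq_len * d_model * n_layers * 2 * bytes_per_element
```

Plugging in a 128K context with d_model=1024 and 40 layers gives roughly 10.7 GB at one byte per element, doubling under FP16.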
Speculative Decoding (Recent Innovation):
```
Standard decoding: one full forward pass per generated token
Speculative: small model (e.g. 7B) drafts 4 tokens; big model (e.g. 70B) verifies them in parallel
Result: 2-3x speedup with no quality loss

How it works:
1. Small model generates draft tokens quickly
2. Big model scores all drafted tokens + the next token in one parallel forward pass
3. Accept draft tokens where the big model agrees on top-1; sample from the big model otherwise
4. Empirically: ~80% accept rate → ~2.5x effective speedup
```
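The accept/reject rule in step 3 reduces to a prefix match against the verifier's choices. This is the greedy variant only; production systems use a probabilistic acceptance test that preserves the big model's sampling distribution:

```python
def accept_draft(draft_tokens, verifier_top1):
    """Keep draft tokens while they match the verifier's top-1 prediction;
    on the first mismatch, emit the verifier's token and stop."""
    out = []
    for d, v in zip(draft_tokens, verifier_top1):
        if d == v:
            out.append(d)
        else:
            out.append(v)  # fall back to the big model's token
            break
    return out
```

Every call emits at least one verifier-approved token, so quality matches the big model while most tokens come cheaply from the draft.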
Scaling Laws Deep Dive
Chinchilla & Compute-Optimal Training:
```
For compute budget C (FLOPs):
  Optimal allocation: D ≈ 20 × N
  Where N = model parameters, D = training tokens
  C ≈ 6 × N × D

Example: C = 2 × 10^22 FLOPs
  2 × 10^22 = 6 × N × (20 × N)  →  N ≈ 13B parameters, D ≈ 260B tokens

Rule of thumb: scale model size and data together (~20 tokens per parameter)
GPT-3 violated this (175B params on only 300B tokens, ~1.7 tokens/param) and was undertrained
```
Emergent Abilities (Scaling Phenomenon):
- Below 10B params: No reasoning, can’t do math, no in-context learning
- 10B-100B params: Basic arithmetic, few-shot learning emerges
- 100B+ params: Complex reasoning, code generation, instruction following
- Pattern: Abilities emerge suddenly at certain scales, not gradual
Loss Curves (Predicting Needed Compute):
```
Training loss typically follows a power law: L = α · N^(-β) + γ
Where:
  α, β, γ = empirically fitted constants
  N = scale variable (parameters, data, or compute)

Given a target loss (e.g. 2.5) for acceptable quality, the fitted
curve predicts the compute needed: C ≈ 6 × N × D (Chinchilla)

Illustrative numbers (GPT-3 family):
- 125M params: 2.5 loss achievable in ~100B tokens
- 13B params: 2.5 loss achievable in ~300B tokens
- 175B params: ~2.5 loss at 300B tokens (little better despite size: undertrained)
```
Fine-tuning vs Pre-training Comparison
| Aspect | Pre-training | Fine-tuning |
|---|---|---|
| Data | 2T tokens, diverse | 10K-100K examples, task-specific |
| Time | 3-6 months | 1-7 days |
| Cost | $10M-100M | $1K-100K |
| Hardware | 10K+ GPUs | 1-8 GPUs |
| Quality Gain | Foundation (100% baseline) | 5-20% improvement |
| Best For | General knowledge | Domain-specific tasks |
References
- Attention Is All You Need (Vaswani et al., 2017) — Transformer foundation
- Language Models are Unsupervised Multitask Learners (GPT-2, Radford et al., 2019)
- Training Compute-Optimal Large Language Models (Chinchilla, Hoffmann et al., 2022)
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — Anthropic
- Flash-Decoding for Long-Context Inference (Zhou et al., 2023)
- Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
- Hugging Face Course: Transformers
- Andrej Karpathy: Let’s build GPT from scratch
- LLM Pre-training (Chip Huyen)