LLM Architecture & Training

Building and training the giants: From transformer foundations to production-scale language models.

Transformer Architecture

Core Equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Where:
Q = queries (what to pay attention to)
K = keys (what can be attended to)
V = values (what information to extract)
sqrt(d_k) = scaling that keeps dot products small, so the softmax doesn't saturate and gradients don't vanish
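The core equation can be sketched in a few lines of numpy (single head, no masking or output projection; shapes and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each query's weights sum to 1
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))  # 6 key positions
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one d_k-dim output per query
```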

Multi-Head Attention:

  • 8–16 parallel attention heads
  • Each head learns different semantic relationships
  • Concatenate outputs and project: MultiHead = concat(head_1, …, head_h) · W_O

Feed-Forward Network:

  • Two linear layers with ReLU in between
  • Per-position non-linearity
  • Structure: [d → 4d → d] (expand then contract)

Layer Norm + Residual:

  • Stabilizes training (enables deep stacking)
  • Residual: x + sublayer(norm(x))
  • Pre-norm (modern): norm(x) then sublayer
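Putting the three pieces together, a pre-norm block looks roughly like this (numpy sketch with a placeholder attention function; weights and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2  # [d -> 4d -> d] with ReLU in between

def pre_norm_block(x, attn, W1, W2):
    x = x + attn(layer_norm(x))         # residual around the attention sublayer
    x = x + ffn(layer_norm(x), W1, W2)  # residual around the feed-forward sublayer
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))          # 5 tokens, d = 8
W1 = 0.1 * rng.standard_normal((8, 32))  # expand to 4d
W2 = 0.1 * rng.standard_normal((32, 8))  # contract back to d
y = pre_norm_block(x, lambda h: h, W1, W2)  # identity stands in for attention
print(y.shape)  # (5, 8): shape preserved, so blocks stack deeply
```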

Pre-Training Phase

Objective: Next-Token Prediction

Given token sequence [t1, t2, …, tn], predict tn+1.

loss = CrossEntropyLoss(model_output, target_token)
# Minimize over billions of training steps
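Concretely, for one position the loss is just the negative log-probability the model assigns to the true next token (numpy sketch; the logits are made up):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(target | context)."""
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[target_id]

logits = np.array([2.0, 0.5, -1.0, 0.0])  # model scores over a tiny 4-token vocab
loss = next_token_loss(logits, target_id=0)
print(round(float(loss), 3))  # ~0.342: the model already favors token 0
```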

Data & Scale:

Model     Parameters        Training Data        Compute            Time
GPT-2     1.5B              40GB (WebText)       ~5 GPU-years       ~2 weeks
GPT-3     175B              300B tokens          ~3,000 GPU-years   ~1 month
GPT-4     Unknown (>175B)   ~2T tokens (est.)    ~25K GPU-years     ~3 months
Claude 3  100B-200B (est.)  ~1T tokens (est.)    ~15K GPU-years     ~2 months

Key Insight: Compute ≈ 6 × model_size × training_tokens, so a fixed budget trades off model size against data. Scaling laws favor larger models trained on proportionally more data.


Post-Training Phase: RLHF

RLHF = Reinforcement Learning from Human Feedback

Process:

  1. Collect Human Preferences:
    • Generate two outputs from base model
    • Human rater chooses better one
    • Collect 50K–500K preference pairs
  2. Train Reward Model:
    • Binary classification: output_A > output_B?
    • Learns human preferences
    • Accuracy: 75–90%
  3. RL Fine-tuning:
    • Use reward model as signal
    • Optimize policy (LLM) to maximize reward
    • Penalize KL divergence (don’t drift too far from base model)
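Step 2 typically uses a pairwise (Bradley-Terry) loss on the preference pairs; a minimal sketch:

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): pushes the preferred
    output's reward above the rejected output's reward."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Correctly ranked pair -> small loss; inverted ranking -> large loss
print(pairwise_reward_loss(2.0, -1.0))  # ~0.049
print(pairwise_reward_loss(-1.0, 2.0))  # ~3.049
```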

Alternative: Constitutional AI

  • Anthropic’s approach: Train model to follow constitutional principles
  • Less human annotation needed
  • Effective for safety alignment

Training Optimization

Gradient Accumulation: Simulate large batch sizes

optimizer.zero_grad()
for micro_batch in micro_batches:  # num_accumulation_steps micro-batches
    loss = model(micro_batch) / len(micro_batches)  # scale so the sum matches one large batch
    loss.backward()   # gradients accumulate in .grad
optimizer.step()      # single weight update after accumulation

Gradient Checkpointing: Trade compute for memory

  • Recompute activations during backward pass
  • Enables larger batch sizes on same GPU memory

Mixed Precision: FP32 (full precision) + FP16 (half precision)

  • Master weights & optimizer state: FP32
  • Forward/backward compute (activations, gradients): FP16
  • Loss scaling keeps small gradients from underflowing in FP16
  • Reduces memory by ~2x, speeds up by 1.5–2x
  • Minimal accuracy loss
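The catch with FP16 is its limited range: very small gradient values silently become zero, which is why mixed-precision training scales the loss up before backprop and unscales in FP32 before the weight update. A toy numpy demonstration (values chosen to force the underflow):

```python
import numpy as np

# A tiny gradient contribution underflows to zero in FP16:
grad = np.float16(1e-5) * np.float16(1e-4)
print(grad)  # 0.0

# Scaling the loss (here by 1024) shifts values into FP16's range;
# unscale in FP32 before applying the update:
scaled = (np.float16(1e-5) * np.float16(1024.0)) * np.float16(1e-4)
unscaled = np.float32(scaled) / np.float32(1024.0)
print(unscaled)  # ~1e-9, recovered instead of lost
```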

Flash Attention: Optimized attention computation

  • Reduces memory bandwidth
  • 2–4x speedup vs standard attention
  • Used in training & inference for large models

Scaling Laws (Chinchilla Scaling)

Finding: For a given compute budget, the optimal allocation scales model size and training tokens together (roughly tokens ≈ 20 × parameters).

Chinchilla (2022): D ≈ 20 × N

Where:
N = model parameters
D = training tokens
Compute ≈ 6 × N × D

Balancing the budget: for a fixed compute C, choose N and D together so that D ≈ 20 × N

Implication: Previous models (GPT-3) were undertrained. Modern models (GPT-4, Claude) train longer.


Inference Optimization

KV Cache: Store computed attention keys/values

  • Avoids recomputation on each new token
  • Cuts per-token cost from O(seq_len²) (recomputing attention over the full prefix) to O(seq_len)
  • Trade-off: Memory increases with context length

Quantization: Reduce precision

  • Int8: 75% memory reduction (vs FP32)
  • Int4: 87.5% reduction (vs FP32)
  • Accuracy loss: 1–3% on average
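A minimal sketch of symmetric Int8 weight quantization (per-tensor scale; real deployments typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.8, -0.5, 0.1, -1.2], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, error below half a quantization step
```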

Speculative Decoding: Smaller model drafts, larger model verifies

  • 2–3x speedup possible
  • Used in production systems

Production Architecture

Training Infrastructure:
  - 1000s of GPUs (A100/H100)
  - Distributed training (data + tensor parallelism)
  - Checkpointing every 1000 steps

Inference Infrastructure:
  - Batch inference (latency-throughput tradeoff)
  - KV cache for single-token generation
  - Quantization for memory efficiency
  - Load balancing across replicas

Cost Breakdown (ChatGPT-scale):
  - Training: $1M–$10M upfront
  - Infrastructure: $1M–$5M/month (compute, storage)
  - Human feedback (RLHF): $100K–$500K
  - Post-training: $100K–$1M

Positional Encoding Techniques

Absolute Positional Encoding (Original Transformer):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Pros: Works for fixed-length sequences (up to training length)
Cons: Doesn't generalize to longer sequences than trained on
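The sinusoidal table above is easy to generate directly (numpy sketch):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
print(pe.shape)            # (50, 16): one vector per position, added to token embeddings
print(pe[0, 0], pe[0, 1])  # 0.0 1.0 (sin(0), cos(0) at position 0)
```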

Relative Positional Encoding (Modern):

  • Encodes distance between tokens, not absolute positions
  • Generalizes to longer sequences than training (key advantage)
  • Used in Transformer-XL, T5, and many modern LLMs (GPT-2/GPT-3 actually use learned absolute embeddings)
  • ALiBi (Attention with Linear Biases): Even simpler, just linear bias in attention

Rotary Position Embeddings (RoPE):

  • Applies rotation matrices to queries/keys
  • Combines benefits of both absolute and relative encoding
  • Excellent generalization to 2x+ training length
  • Used in Llama 2, GPT-NeoX, PaLM, and most recent open models
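A minimal numpy sketch of the rotation (the "rotate-half" layout used by several open implementations; applied to queries and keys before the dot product):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each position's (x1, x2) dimension pairs by a position-dependent
    angle; attention dot products then depend only on relative position."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)              # one frequency per pair
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # angle = position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((8, 4))  # 8 positions, head_dim = 4
y = rope(x)
# Rotations preserve norms: position is encoded purely in direction
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1)))  # True
```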

Attention Mechanism Variants

  • Standard scaled dot-product attention: O(n²) complexity, exact
  • Sparse attention: block-sparse, O(n√n) complexity, approximate
  • Linear attention: O(n) complexity, kernel-based (no softmax)
  • Multi-Query Attention (MQA): shares a single K/V pair across all heads; ~3x speedup, minimal quality loss
  • Grouped-Query Attention (GQA): middle ground, groups of heads share K/V

Trade-off:

Standard: Perfect quality, slow (O(n²))
MQA: Fast (10-20x speedup), 1-3% quality loss
GQA: Balanced (5x speedup, <1% loss)
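The speedups come mostly from shrinking the KV cache that inference must read on every decode step. A back-of-envelope comparison (hypothetical 32-head model; the numbers are illustrative, not from any specific system):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Per-request KV cache: keys + values, all layers, FP16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4K context
mha = kv_cache_bytes(4096, 32, n_kv_heads=32, head_dim=128)  # one K/V per head
gqa = kv_cache_bytes(4096, 32, n_kv_heads=8, head_dim=128)   # heads share in groups of 4
mqa = kv_cache_bytes(4096, 32, n_kv_heads=1, head_dim=128)   # single shared K/V
print(mha // gqa, mha // mqa)  # 4 32 -- GQA/MQA shrink the cache by the head ratio
```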

How Real Systems Train & Deploy

OpenAI GPT-4 Training (Public Info):

  • Data: estimated ~2T tokens (~7x GPT-3's 300B tokens)
  • Training: ~3 months on a custom tensor-parallel cluster (~25K-50K GPUs, probably H100s)
  • Size: ~1.7T parameters (estimated, not confirmed)
  • Training cost: ~$50-100M compute-only (not including infrastructure)
  • Post-training: RLHF on 50K-100K human preference pairs, ~$1-5M
  • Inference: deployed across Azure regions, estimated 100K+ GPUs for production serving
  • Why this approach: large scale is necessary for emergent capabilities (reasoning, instruction following); RLHF is critical for alignment with human values; distributed training is essential to fit the model on GPU clusters

Anthropic Claude 3 Series Training:

  • Data: ~1-2T tokens, with custom training code optimized for Constitutional AI (critique-based rather than traditional RLHF)
  • Model sizes: Claude 3 Sonnet (~50B params, optimized for speed), Claude 3 Opus (~100B params, most capable)
  • Infrastructure: 10K-20K GPUs, 2-3 months per model
  • Cost per model: ~$10-50M (smaller than GPT-4 due to efficiency)
  • Alignment: instead of labeling 100K preference pairs, a model critiques outputs against explicit principles (more scalable)
  • Inference: deployed on custom infrastructure with a 99.9% uptime SLA
  • Why Constitutional AI: more transparent (rules are explicit), scales with fewer human labels, safety-by-design

Meta LLaMA 2 Training (Open-Source):

  • Data: 2T tokens
  • Model: 70B parameters (largest variant)
  • Infrastructure: custom distributed training on commodity A100 GPUs (12-16K GPUs total)
  • Training cost: ~$5-10M (much cheaper than GPT-4, but longer wall-clock time on weaker hardware)
  • Post-training: 27K human annotations for instruction tuning + ~1K for safety RLHF
  • Inference: Int8 quantization cuts memory 4x vs FP32, enabling single-GPU inference
  • Why open-source: research community, industry credibility, faster iteration with external contributions

Together AI Fine-tuning (Enterprise Service):

  • Offering: fine-tuning-as-a-service for custom models
  • Workflow for fine-tuning LLaMA-7B on 10K examples: (1) customer uploads the dataset, (2) Together trains on 1-2 H100 GPUs for 4-8 hours, (3) the fine-tuned model is deployed
  • Cost: $500-2,000 depending on model size
  • Turnaround: 24-48 hours
  • Quality gain: 10-20% improvement on task-specific metrics
  • Why enterprises use it: no infrastructure investment, fast iteration, pay-as-you-go, managed SLAs

Google PaLM 2 / Gemini Training (Multimodal):

  • Data: ~10T multimodal tokens (text + image + audio)
  • Scale: estimated 1-1.5T parameters
  • Infrastructure: TPU SuperPods (thousands of TPU v5s), 4-6 months of training
  • Innovations: multimodal alignment via image-text pairs; very long context (~1M+ tokens vs 128K for GPT-4); in-context learning with 1000s of examples
  • Why multimodal: modern AI needs vision, and long context enables new use cases (entire books or codebases as context)


Production Inference Optimization Detailed

Batch Processing for Latency-Throughput Trade-off:

Single-request inference: 100ms latency, 100 req/sec throughput
Batch=32 inference: 3 seconds latency, 10K req/sec throughput
Suitable for: Batch scoring jobs, offline analytics (not real-time chat)

KV Cache Memory Explosion Problem:

Context length 128K tokens:
- Without KV cache: no extra storage, but attention over the whole prefix is recomputed for every new token
- With KV cache: store seq_len × d_model × num_layers × 2 (keys + values) per request
- Example: 128K × 1024 × 40 × 2 ≈ 10.7B values ≈ 21 GB at FP16 for a single request
- With 1000 concurrent requests: ~21 TB of memory (impossible)

Solutions:
1. PagedAttention: Manage KV cache like virtual memory (vLLM)
2. Quantization: Int8 KV cache halves memory vs FP16
3. Sliding window: Only keep the last 4K tokens in cache
4. Token merging: Merge similar tokens before attention
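The same arithmetic as a small helper (assumes FP16, i.e. 2 bytes per cached value):

```python
def kv_cache_gib(seq_len, d_model, n_layers, bytes_per_value=2, n_requests=1):
    """KV cache footprint in GiB: keys + values for every layer and position."""
    n_values = seq_len * d_model * n_layers * 2  # 2 = keys + values
    return n_values * bytes_per_value * n_requests / 2**30

print(kv_cache_gib(128 * 1024, 1024, 40))                     # 20.0 GiB per request (FP16)
print(kv_cache_gib(128 * 1024, 1024, 40, bytes_per_value=1))  # 10.0 GiB with an Int8 cache
```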

Speculative Decoding (Recent Innovation):

Standard decoding: For each token, run full forward pass
Speculative: Small model (7B) drafts 4 tokens, big model (70B) verifies in parallel
Result: 2-3x speedup with no quality loss

How it works:
1. Small model generates draft tokens fast
2. Big model scores all drafted + next token in parallel (1 forward pass)
3. Accept draft tokens if big model agrees on top-1, sample otherwise
4. Empirically: Accept rate ~80% → 2.5x effective speedup
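The accept/reject loop can be illustrated with toy models over integer "tokens" (greedy agreement only; real systems use a probabilistic acceptance rule and true parallel verification):

```python
def speculative_decode(draft_next, target_next, prefix, k, n_tokens):
    """Draft model proposes k tokens; the target verifies them (simulating one
    parallel pass); keep the agreeing prefix plus one guaranteed target token."""
    out = list(prefix)
    while len(out) < n_tokens:
        ctx, drafts = list(out), []
        for _ in range(k):                    # cheap model drafts k tokens
            drafts.append(draft_next(ctx))
            ctx.append(drafts[-1])
        accepted = 0
        for i, t in enumerate(drafts):        # target checks each draft in turn
            if target_next(out + drafts[:i]) == t:
                accepted += 1
            else:
                break                         # first disagreement stops acceptance
        out += drafts[:accepted]
        out.append(target_next(out))          # always gain one target-quality token
    return out[:n_tokens]

# Toy models: the target continues a 0,1,2 cycle; the draft agrees except
# when the context length is divisible by 5 (then it guesses badly).
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: (len(ctx) % 3) if len(ctx) % 5 else 99
print(speculative_decode(draft, target, prefix=[0], k=5, n_tokens=9))  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```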

Scaling Laws Deep Dive

Chinchilla & Compute-Optimal Training:

For compute budget C (FLOPs):
Optimal allocation: D ≈ 20 × N
Where N = model parameters, D = training tokens

Since C ≈ 6 × N × D and D = 20 × N:
C ≈ 120 × N²  →  N ≈ sqrt(C / 120)

Example: 2 × 10^22 FLOPs budget
N ≈ 13B parameters
D ≈ 260B tokens

Rule of thumb: scale parameters and tokens together
GPT-3 violated this: 175B parameters on only 300B tokens (Chinchilla-optimal would be ~3.5T tokens), so it was undertrained
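Under the C ≈ 6ND approximation, the compute-optimal split reduces to a single square root; a small helper:

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal split under C = 6*N*D with D = 20*N, i.e. C = 120*N^2."""
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(2e22)
print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # ~13B params, ~258B tokens
```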

Emergent Abilities (Scaling Phenomenon):

  • Below 10B params: No reasoning, can’t do math, no in-context learning
  • 10B-100B params: Basic arithmetic, few-shot learning emerges
  • 100B+ params: Complex reasoning, code generation, instruction following
  • Pattern: Abilities emerge suddenly at certain scales, not gradual

Loss Curves (Predicting Needed Compute):

Training loss typically follows a power law: L ≈ α × N^(-β) + γ

Where:
α, β, γ = empirically fitted constants (β > 0, so loss falls as scale grows)
N = scale variable (parameters, tokens, or compute, depending on the fit)

Given: loss should reach ~2.5 on the benchmark for acceptable quality
Compute: training cost ≈ 6 × N × D (Chinchilla)

Public numbers (GPT-3 family):
- 125M params: loss 2.5 reachable with ~100B tokens
- 13B params: loss 2.5 reachable with ~300B tokens
- 175B params: loss ~2.5 at 300B tokens (no better despite size: undertrained)

Fine-tuning vs Pre-training Comparison

Aspect        Pre-training                Fine-tuning
Data          ~2T tokens, diverse         10K-100K examples, task-specific
Time          3-6 months                  1-7 days
Cost          $10M-100M                   $1K-100K
Hardware      10K+ GPUs                   1-8 GPUs
Quality gain  Foundation (100% baseline)  5-20% improvement
Best for      General knowledge           Domain-specific tasks

This post is licensed under CC BY 4.0 by the author.