LLM Architecture & Training

Building and training the giants: From transformer foundations to production-scale language models.

Transformer Architecture

Core Equation:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Where:
Q = queries (what to pay attention to)
K = keys (what can be attended to)
V = values (what information to extract)
sqrt(d_k) = scaling that keeps dot products small, so the softmax doesn't saturate and gradients don't vanish
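The core equation can be sketched in a few lines of numpy (single head, no masking or output projection; shapes and names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each query's weights sum to 1
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))  # 6 key positions
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one d_k-dim output per query
```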

Multi-Head Attention:

  • 8–16 parallel attention heads
  • Each head learns different semantic relationships
  • Concatenate outputs and project: MultiHead = concat(head_1, …, head_h) · W_O

Feed-Forward Network:

  • Two linear layers with ReLU in between
  • Per-position non-linearity
  • Structure: [d → 4d → d] (expand then contract)

Layer Norm + Residual:

  • Stabilizes training (enables deep stacking)
  • Residual: x + sublayer(norm(x))
  • Pre-norm (modern): norm(x) then sublayer
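Putting the three pieces together, a pre-norm block looks roughly like this (numpy sketch with a placeholder attention function; weights and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2  # [d -> 4d -> d] with ReLU in between

def pre_norm_block(x, attn, W1, W2):
    x = x + attn(layer_norm(x))         # residual around the attention sublayer
    x = x + ffn(layer_norm(x), W1, W2)  # residual around the feed-forward sublayer
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))          # 5 tokens, d = 8
W1 = 0.1 * rng.standard_normal((8, 32))  # expand to 4d
W2 = 0.1 * rng.standard_normal((32, 8))  # contract back to d
y = pre_norm_block(x, lambda h: h, W1, W2)  # identity stands in for attention
print(y.shape)  # (5, 8): shape preserved, so blocks stack deeply
```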

Pre-Training Phase

Objective: Next-Token Prediction

Given token sequence [t1, t2, …, tn], predict tn+1.

loss = CrossEntropyLoss(model_output, target_token)
# Minimize over billions of training steps
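Concretely, for one position the loss is just the negative log-probability the model assigns to the true next token (numpy sketch; the logits are made up):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(target | context)."""
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[target_id]

logits = np.array([2.0, 0.5, -1.0, 0.0])  # model scores over a tiny 4-token vocab
loss = next_token_loss(logits, target_id=0)
print(round(float(loss), 3))  # ~0.342: the model already favors token 0
```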

Data & Scale:

Model     Parameters        Training Data        Compute            Time
GPT-2     1.5B              40GB (WebText)       ~5 GPU-years       ~2 weeks
GPT-3     175B              300B tokens          ~3,000 GPU-years   ~1 month
GPT-4     Unknown (>175B)   ~2T tokens (est.)    ~25K GPU-years     ~3 months
Claude 3  100B-200B (est.)  ~1T tokens (est.)    ~15K GPU-years     ~2 months

Key Insight: Compute ≈ 6 × model_size × training_tokens, so a fixed budget trades off model size against data. Scaling laws favor larger models trained on proportionally more data.


Post-Training Phase: RLHF

RLHF = Reinforcement Learning from Human Feedback

Process:

  1. Collect Human Preferences:
    • Generate two outputs from base model
    • Human rater chooses better one
    • Collect 50K–500K preference pairs
  2. Train Reward Model:
    • Binary classification: output_A > output_B?
    • Learns human preferences
    • Accuracy: 75–90%
  3. RL Fine-tuning:
    • Use reward model as signal
    • Optimize policy (LLM) to maximize reward
    • Penalize KL divergence (don’t drift too far from base model)
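Step 2 typically uses a pairwise (Bradley-Terry) loss on the preference pairs; a minimal sketch:

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): pushes the preferred
    output's reward above the rejected output's reward."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Correctly ranked pair -> small loss; inverted ranking -> large loss
print(pairwise_reward_loss(2.0, -1.0))  # ~0.049
print(pairwise_reward_loss(-1.0, 2.0))  # ~3.049
```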

Alternative: Constitutional AI

  • Anthropic’s approach: Train model to follow constitutional principles
  • Less human annotation needed
  • Effective for safety alignment

Training Optimization

Gradient Accumulation: Simulate large batch sizes

optimizer.zero_grad()
for micro_batch in micro_batches:  # num_accumulation_steps micro-batches
    loss = model(micro_batch) / len(micro_batches)  # scale so the sum matches one large batch
    loss.backward()   # gradients accumulate in .grad
optimizer.step()      # single weight update after accumulation

Gradient Checkpointing: Trade compute for memory

  • Recompute activations during backward pass
  • Enables larger batch sizes on same GPU memory

Mixed Precision: FP32 (full precision) + FP16 (half precision)

  • Master weights & optimizer state: FP32
  • Forward/backward compute (activations, gradients): FP16
  • Loss scaling keeps small gradients from underflowing in FP16
  • Reduces memory by ~2x, speeds up by 1.5–2x
  • Minimal accuracy loss
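The catch with FP16 is its limited range: very small gradient values silently become zero, which is why mixed-precision training scales the loss up before backprop and unscales in FP32 before the weight update. A toy numpy demonstration (values chosen to force the underflow):

```python
import numpy as np

# A tiny gradient contribution underflows to zero in FP16:
grad = np.float16(1e-5) * np.float16(1e-4)
print(grad)  # 0.0

# Scaling the loss (here by 1024) shifts values into FP16's range;
# unscale in FP32 before applying the update:
scaled = (np.float16(1e-5) * np.float16(1024.0)) * np.float16(1e-4)
unscaled = np.float32(scaled) / np.float32(1024.0)
print(unscaled)  # ~1e-9, recovered instead of lost
```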

Flash Attention: Optimized attention computation

  • Reduces memory bandwidth
  • 2–4x speedup vs standard attention
  • Used in training & inference for large models

Scaling Laws (Chinchilla Scaling)

Finding: For a given compute budget, the optimal allocation scales model size and training tokens together (roughly tokens ≈ 20 × parameters).

Chinchilla (2022): D ≈ 20 × N

Where:
N = model parameters
D = training tokens
Compute ≈ 6 × N × D

Balancing the budget: for a fixed compute C, choose N and D together so that D ≈ 20 × N

Implication: Previous models (GPT-3) were undertrained. Modern models (GPT-4, Claude) train longer.


Inference Optimization

KV Cache: Store computed attention keys/values

  • Avoids recomputation on each new token
  • Cuts per-token cost from O(seq_len²) (recomputing attention over the full prefix) to O(seq_len)
  • Trade-off: Memory increases with context length

Quantization: Reduce precision

  • Int8: 75% memory reduction (vs FP32)
  • Int4: 87.5% reduction (vs FP32)
  • Accuracy loss: 1–3% on average
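A minimal sketch of symmetric Int8 weight quantization (per-tensor scale; real deployments typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.8, -0.5, 0.1, -1.2], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, error below half a quantization step
```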

Speculative Decoding: Smaller model drafts, larger model verifies

  • 2–3x speedup possible
  • Used in production systems

Production Architecture

Training Infrastructure:
  - 1000s of GPUs (A100/H100)
  - Distributed training (data + tensor parallelism)
  - Checkpointing every 1000 steps

Inference Infrastructure:
  - Batch inference (latency-throughput tradeoff)
  - KV cache for single-token generation
  - Quantization for memory efficiency
  - Load balancing across replicas

Cost Breakdown (ChatGPT-scale):
  - Training: $1M–$10M upfront
  - Infrastructure: $1M–$5M/month (compute, storage)
  - Human feedback (RLHF): $100K–$500K
  - Post-training: $100K–$1M

Positional Encoding Techniques

Absolute Positional Encoding (Original Transformer):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Pros: Works for fixed-length sequences (up to training length)
Cons: Doesn't generalize to longer sequences than trained on
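The sinusoidal table above is easy to generate directly (numpy sketch):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
print(pe.shape)            # (50, 16): one vector per position, added to token embeddings
print(pe[0, 0], pe[0, 1])  # 0.0 1.0 (sin(0), cos(0) at position 0)
```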

Relative Positional Encoding (Modern):

  • Encodes distance between tokens, not absolute positions
  • Generalizes to longer sequences than training (key advantage)
  • Used in Transformer-XL, T5, and many modern LLMs (GPT-2/GPT-3 actually use learned absolute embeddings)
  • ALiBi (Attention with Linear Biases): Even simpler, just linear bias in attention

Rotary Position Embeddings (RoPE):

  • Applies rotation matrices to queries/keys
  • Combines benefits of both absolute and relative encoding
  • Excellent generalization to 2x+ training length
  • Used in Llama 2, GPT-NeoX, PaLM, and most recent open models
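A minimal numpy sketch of the rotation (the "rotate-half" layout used by several open implementations; applied to queries and keys before the dot product):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each position's (x1, x2) dimension pairs by a position-dependent
    angle; attention dot products then depend only on relative position."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)              # one frequency per pair
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # angle = position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.ones((8, 4))  # 8 positions, head_dim = 4
y = rope(x)
# Rotations preserve norms: position is encoded purely in direction
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1)))  # True
```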

Attention Mechanism Variants

  • Standard scaled dot-product attention: O(n²) complexity, exact
  • Sparse attention: block-sparse, O(n√n) complexity, approximate
  • Linear attention: O(n) complexity, kernel-based (no softmax)
  • Multi-Query Attention (MQA): shares a single K/V pair across all heads; ~3x speedup, minimal quality loss
  • Grouped-Query Attention (GQA): middle ground, groups of heads share K/V

Trade-off:

Standard: Perfect quality, slow (O(n²))
MQA: Fast (10-20x speedup), 1-3% quality loss
GQA: Balanced (5x speedup, <1% loss)
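The speedups come mostly from shrinking the KV cache that inference must read on every decode step. A back-of-envelope comparison (hypothetical 32-head model; the numbers are illustrative, not from any specific system):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Per-request KV cache: keys + values, all layers, FP16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4K context
mha = kv_cache_bytes(4096, 32, n_kv_heads=32, head_dim=128)  # one K/V per head
gqa = kv_cache_bytes(4096, 32, n_kv_heads=8, head_dim=128)   # heads share in groups of 4
mqa = kv_cache_bytes(4096, 32, n_kv_heads=1, head_dim=128)   # single shared K/V
print(mha // gqa, mha // mqa)  # 4 32 -- GQA/MQA shrink the cache by the head ratio
```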

How Real Systems Train & Deploy

OpenAI GPT-4 Training (Public Info):

  • Data: estimated ~2T tokens (~7x GPT-3's 300B tokens)
  • Training: ~3 months on a custom tensor-parallel cluster (~25K-50K GPUs, probably H100s)
  • Size: ~1.7T parameters (estimated, not confirmed)
  • Training cost: ~$50-100M compute-only (not including infrastructure)
  • Post-training: RLHF on 50K-100K human preference pairs, ~$1-5M
  • Inference: deployed across Azure regions, estimated 100K+ GPUs for production serving
  • Why this approach: large scale is necessary for emergent capabilities (reasoning, instruction following); RLHF is critical for alignment with human values; distributed training is essential to fit the model on GPU clusters

Anthropic Claude 3 Series Training:

  • Data: ~1-2T tokens, with custom training code optimized for Constitutional AI (critique-based rather than traditional RLHF)
  • Model sizes: Claude 3 Sonnet (~50B params, optimized for speed), Claude 3 Opus (~100B params, most capable)
  • Infrastructure: 10K-20K GPUs, 2-3 months per model
  • Cost per model: ~$10-50M (smaller than GPT-4 due to efficiency)
  • Alignment: instead of labeling 100K preference pairs, a model critiques outputs against explicit principles (more scalable)
  • Inference: deployed on custom infrastructure with a 99.9% uptime SLA
  • Why Constitutional AI: more transparent (rules are explicit), scales with fewer human labels, safety-by-design

Meta LLaMA 2 Training (Open-Source):

  • Data: 2T tokens
  • Model: 70B parameters (largest variant)
  • Infrastructure: custom distributed training on commodity A100 GPUs (12-16K GPUs total)
  • Training cost: ~$5-10M (much cheaper than GPT-4, but longer wall-clock time on weaker hardware)
  • Post-training: 27K human annotations for instruction tuning + ~1K for safety RLHF
  • Inference: Int8 quantization cuts memory 4x vs FP32, enabling single-GPU inference
  • Why open-source: research community, industry credibility, faster iteration with external contributions

Together AI Fine-tuning (Enterprise Service):

  • Offering: fine-tuning-as-a-service for custom models
  • Workflow for fine-tuning LLaMA-7B on 10K examples: (1) customer uploads the dataset, (2) Together trains on 1-2 H100 GPUs for 4-8 hours, (3) the fine-tuned model is deployed
  • Cost: $500-2,000 depending on model size
  • Turnaround: 24-48 hours
  • Quality gain: 10-20% improvement on task-specific metrics
  • Why enterprises use it: no infrastructure investment, fast iteration, pay-as-you-go, managed SLAs

Google PaLM 2 / Gemini Training (Multimodal):

  • Data: ~10T multimodal tokens (text + image + audio)
  • Scale: estimated 1-1.5T parameters
  • Infrastructure: TPU SuperPods (thousands of TPU v5s), 4-6 months of training
  • Innovations: multimodal alignment via image-text pairs; very long context (~1M+ tokens vs 128K for GPT-4); in-context learning with 1000s of examples
  • Why multimodal: modern AI needs vision, and long context enables new use cases (entire books or codebases as context)


Production Inference Optimization Detailed

Batch Processing for Latency-Throughput Trade-off:

Single-request inference: 100ms latency, 100 req/sec throughput
Batch=32 inference: 3 seconds latency, 10K req/sec throughput
Suitable for: Batch scoring jobs, offline analytics (not real-time chat)

KV Cache Memory Explosion Problem:

Context length 128K tokens:
- Without KV cache: no extra storage, but attention over the whole prefix is recomputed for every new token
- With KV cache: store seq_len × d_model × num_layers × 2 (keys + values) per request
- Example: 128K × 1024 × 40 × 2 ≈ 10.7B values ≈ 21 GB at FP16 for a single request
- With 1000 concurrent requests: ~21 TB of memory (impossible)

Solutions:
1. PagedAttention: Manage KV cache like virtual memory (vLLM)
2. Quantization: Int8 KV cache halves memory vs FP16
3. Sliding window: Only keep the last 4K tokens in cache
4. Token merging: Merge similar tokens before attention
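The same arithmetic as a small helper (assumes FP16, i.e. 2 bytes per cached value):

```python
def kv_cache_gib(seq_len, d_model, n_layers, bytes_per_value=2, n_requests=1):
    """KV cache footprint in GiB: keys + values for every layer and position."""
    n_values = seq_len * d_model * n_layers * 2  # 2 = keys + values
    return n_values * bytes_per_value * n_requests / 2**30

print(kv_cache_gib(128 * 1024, 1024, 40))                     # 20.0 GiB per request (FP16)
print(kv_cache_gib(128 * 1024, 1024, 40, bytes_per_value=1))  # 10.0 GiB with an Int8 cache
```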

Speculative Decoding (Recent Innovation):

Standard decoding: For each token, run full forward pass
Speculative: Small model (7B) drafts 4 tokens, big model (70B) verifies in parallel
Result: 2-3x speedup with no quality loss

How it works:
1. Small model generates draft tokens fast
2. Big model scores all drafted + next token in parallel (1 forward pass)
3. Accept draft tokens if big model agrees on top-1, sample otherwise
4. Empirically: Accept rate ~80% → 2.5x effective speedup
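The accept/reject loop can be illustrated with toy models over integer "tokens" (greedy agreement only; real systems use a probabilistic acceptance rule and true parallel verification):

```python
def speculative_decode(draft_next, target_next, prefix, k, n_tokens):
    """Draft model proposes k tokens; the target verifies them (simulating one
    parallel pass); keep the agreeing prefix plus one guaranteed target token."""
    out = list(prefix)
    while len(out) < n_tokens:
        ctx, drafts = list(out), []
        for _ in range(k):                    # cheap model drafts k tokens
            drafts.append(draft_next(ctx))
            ctx.append(drafts[-1])
        accepted = 0
        for i, t in enumerate(drafts):        # target checks each draft in turn
            if target_next(out + drafts[:i]) == t:
                accepted += 1
            else:
                break                         # first disagreement stops acceptance
        out += drafts[:accepted]
        out.append(target_next(out))          # always gain one target-quality token
    return out[:n_tokens]

# Toy models: the target continues a 0,1,2 cycle; the draft agrees except
# when the context length is divisible by 5 (then it guesses badly).
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: (len(ctx) % 3) if len(ctx) % 5 else 99
print(speculative_decode(draft, target, prefix=[0], k=5, n_tokens=9))  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```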

Scaling Laws Deep Dive

Chinchilla & Compute-Optimal Training:

For compute budget C (FLOPs):
Optimal allocation: D ≈ 20 × N
Where N = model parameters, D = training tokens

Since C ≈ 6 × N × D and D = 20 × N:
C ≈ 120 × N²  →  N ≈ sqrt(C / 120)

Example: 2 × 10^22 FLOPs budget
N ≈ 13B parameters
D ≈ 260B tokens

Rule of thumb: scale parameters and tokens together
GPT-3 violated this: 175B parameters on only 300B tokens (Chinchilla-optimal would be ~3.5T tokens), so it was undertrained
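Under the C ≈ 6ND approximation, the compute-optimal split reduces to a single square root; a small helper:

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal split under C = 6*N*D with D = 20*N, i.e. C = 120*N^2."""
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(2e22)
print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # ~13B params, ~258B tokens
```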

Emergent Abilities (Scaling Phenomenon):

  • Below 10B params: No reasoning, can’t do math, no in-context learning
  • 10B-100B params: Basic arithmetic, few-shot learning emerges
  • 100B+ params: Complex reasoning, code generation, instruction following
  • Pattern: Abilities emerge suddenly at certain scales, not gradual

Loss Curves (Predicting Needed Compute):

Training loss typically follows a power law: L ≈ α × N^(-β) + γ

Where:
α, β, γ = empirically fitted constants (β > 0, so loss falls as scale grows)
N = scale variable (parameters, tokens, or compute, depending on the fit)

Given: loss should reach ~2.5 on the benchmark for acceptable quality
Compute: training cost ≈ 6 × N × D (Chinchilla)

Public numbers (GPT-3 family):
- 125M params: loss 2.5 reachable with ~100B tokens
- 13B params: loss 2.5 reachable with ~300B tokens
- 175B params: loss ~2.5 at 300B tokens (no better despite size: undertrained)

Fine-tuning vs Pre-training Comparison

Aspect        Pre-training                Fine-tuning
Data          ~2T tokens, diverse         10K-100K examples, task-specific
Time          3-6 months                  1-7 days
Cost          $10M-100M                   $1K-100K
Hardware      10K+ GPUs                   1-8 GPUs
Quality gain  Foundation (100% baseline)  5-20% improvement
Best for      General knowledge           Domain-specific tasks

This post is licensed under CC BY 4.0 by the author.