Controlling Randomness — Temperature, Top-k, Top-p
How to control the creativity-coherence tradeoff: the sampling strategy decides whether your LLM produces dull, repetitive text, coherent output, or incoherent nonsense.
Quick Reference
| Method | Effect | Best For | Use Case |
|---|---|---|---|
| Greedy | Deterministic (always pick max probability) | Reproducibility, factual tasks | Search, translation |
| Temperature | Controls randomness (0=greedy, 1=baseline, >1=chaos) | Tuning diversity | Creative writing, chat |
| Top-k | Keep top-k tokens, discard tail | Reduce nonsense | General LLM sampling |
| Top-p (Nucleus) | Cumulative probability cutoff | Adaptive, quality | Most production LLMs |
| Beam Search | Keep top-b hypotheses, expand | Longer sequences | Machine translation |
Temperature: The Fundamental Control
How It Works:
The LLM outputs logits (raw scores) z_i for each vocabulary token. Temperature divides every logit by T before the softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
logits = [2.0, 0.5, 0.1]

Without temperature (T=1):
probs = softmax([2.0, 0.5, 0.1]) = [0.73, 0.16, 0.11]

With low temperature (T=0.5):
scaled = [2.0/0.5, 0.5/0.5, 0.1/0.5] = [4.0, 1.0, 0.2]
probs = softmax([4.0, 1.0, 0.2]) = [0.93, 0.05, 0.02]
→ Peaks sharper, more deterministic

With high temperature (T=2.0):
scaled = [2.0/2.0, 0.5/2.0, 0.1/2.0] = [1.0, 0.25, 0.05]
probs = softmax([1.0, 0.25, 0.05]) = [0.54, 0.25, 0.21]
→ Peaks flatter, more random
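The scaling above is a one-liner to verify; here is a minimal, dependency-free sketch using a numerically stable softmax:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.1]
print(softmax_with_temperature(logits, 1.0))  # ≈ [0.73, 0.16, 0.11]
print(softmax_with_temperature(logits, 0.5))  # sharper: ≈ [0.93, 0.05, 0.02]
print(softmax_with_temperature(logits, 2.0))  # flatter: ≈ [0.54, 0.25, 0.21]
```

Lowering T always increases the top token's share; raising T always shrinks it toward the uniform distribution.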
Interpretation:
- T=0 (Greedy): Always pick highest probability token. Deterministic, fast.
- T=1.0 (Baseline): Model’s original distribution. Balanced.
- T>1 (Hot): Flatten distribution. More randomness, less coherence. “Creative” but risky.
Production Settings:
- OpenAI API default: T=1.0
- Typical chat deployments: T≈0.7
- Code generation: T=0.0–0.2 (near-greedy, reproducible)
- Creative writing: T=1.2–1.5
- Factual QA: T=0.2–0.5
Effect on Output:
Prompt: "The capital of France is"
T=0.0: "Paris" (100% of the time)
T=0.7: "Paris" (almost always), occasionally an alternative phrasing
T=2.0: "Paris", "Paries", "Parsi", "Paul", "Pterodactyl"
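The effect is easy to simulate. The sketch below samples 1,000 next tokens from a hypothetical (invented for illustration) logit table for the prompt above, at three temperatures:

```python
import math
import random
from collections import Counter

# Hypothetical next-token logits after "The capital of France is"
# (values invented for illustration, not from a real model).
logits = {"Paris": 6.0, "a": 2.0, "the": 1.5, "located": 1.0}

def sample(logits, T, rng):
    """Draw one token from the temperature-scaled softmax distribution."""
    scaled = {tok: z / T for tok, z in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(z - m) for tok, z in scaled.items()}
    toks, w = zip(*weights.items())
    return rng.choices(toks, weights=w, k=1)[0]

rng = random.Random(0)
for T in (0.2, 0.7, 2.0):
    counts = Counter(sample(logits, T, rng) for _ in range(1000))
    print(T, counts.most_common())
```

At T=0.2 essentially every draw is "Paris"; at T=2.0 roughly a quarter of draws come from the lower-probability alternatives.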
Top-k Sampling: Filter Low-Probability Tokens
How It Works:
- Get logits, compute softmax → probabilities
- Sort, take top-k tokens
- Renormalize probabilities over k tokens
- Sample from renormalized distribution
Example (k=2):
Probabilities: [0.6, 0.3, 0.07, 0.02, 0.01]
Top-2: [0.6, 0.3]
Renormalize: [0.6/(0.6+0.3), 0.3/(0.6+0.3)] = [0.67, 0.33]
Sample from [0.67, 0.33]
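The four steps above can be sketched directly (a minimal illustration, not a production implementation):

```python
import random

def top_k_sample(probs, k, rng):
    """Keep the k highest-probability indices, renormalize, sample one."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    renorm = [probs[i] / total for i in ranked]
    return rng.choices(ranked, weights=renorm, k=1)[0]

probs = [0.6, 0.3, 0.07, 0.02, 0.01]
rng = random.Random(0)
draws = [top_k_sample(probs, k=2, rng=rng) for _ in range(1000)]
# Only indices 0 and 1 can ever be drawn, in a ≈ 0.67 / 0.33 ratio.
print(set(draws))
```

Note the tail tokens (indices 2–4) are impossible to sample, no matter how many draws you take.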
Effect:
- ✅ Prevents sampling from very low-probability tokens (nonsense)
- ✅ Simple to tune (just choose k)
- ❌ Fixed k may be too restrictive or too loose depending on entropy
Production: k=40 or k=50 is typical; cutting off the low-probability tail suppresses degenerate, nonsensical continuations.
Top-p (Nucleus) Sampling: Cumulative Probability
How It Works (Better than Top-k):
- Sort tokens by probability (descending)
- Compute cumulative sum
- Keep the smallest set of top tokens whose cumulative probability reaches p (the token that crosses the threshold is included)
- Renormalize, sample
Example (p=0.9):
Probs:  [0.5, 0.3, 0.15, 0.03, 0.02]
CumSum: [0.5, 0.8, 0.95, 0.98, 1.0]

Nucleus = smallest set whose cumulative probability ≥ 0.9:
Included: [0.5, 0.3, 0.15] (cumsum reaches 0.95 ≥ 0.9 at the third token)
Renormalize: [0.5/0.95, 0.3/0.95, 0.15/0.95] ≈ [0.53, 0.32, 0.16]
Sample from [0.53, 0.32, 0.16]
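A minimal sketch of the nucleus-selection step, using the "smallest set reaching p" convention (exact threshold handling varies slightly between implementations):

```python
def nucleus(probs, p):
    """Smallest prefix (by descending prob) whose cumulative mass reaches p,
    returned as {index: renormalized probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:          # include the token that crosses the threshold
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

print(nucleus([0.5, 0.3, 0.15, 0.03, 0.02], p=0.9))
# keeps indices 0, 1, 2 and renormalizes them to sum to 1
```

With a peaked distribution (e.g. [0.95, 0.03, 0.02]) the same p=0.9 keeps only one token, which is exactly the adaptivity that fixed-k truncation lacks.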
Advantage over Top-k:
- ✅ Adaptive: High-confidence predictions use fewer tokens, low-confidence use more
- ✅ Reduces incoherence from truncation
- ✅ Works across different probability distributions
Production: p=0.9 or p=0.95 most common. OpenAI uses top_p=1 (disabled) by default, preferring temperature.
Beam Search: Multiple Hypotheses
How It Works:
- Maintain top-b candidate sequences (beams)
- Expand each by one token
- Keep top-b overall (prune worst)
- Repeat until EOS token
Example (b=2, max 3 tokens):
Prompt: "Translate 'Hello' to Spanish:"

Step 1: score every possible first token, keep the top-2 as beams
  Beam 1: "Hola"   (p = 0.70)
  Beam 2: "Buenos" (p = 0.25)

Step 2: expand each beam by one token; score = product of token probabilities
  Beam 1a: "Hola amigo"   (0.70 × 0.40 = 0.28)
  Beam 1b: "Hola señor"   (0.70 × 0.05 = 0.035)
  Beam 2a: "Buenos días"  (0.25 × 0.90 = 0.225)
  Beam 2b: "Buenos amigos" (0.25 × 0.02 = 0.005)
  Keep top-2 overall: 1a (0.28), 2a (0.225)

Repeat until each surviving beam emits EOS, then return the highest-scoring one.
Benefit: Finds globally better sequence (not greedy locally).
Cost: O(b × seq_len × vocab_size) vs O(seq_len × vocab_size) for greedy.
Use: Machine translation, summarization where quality > speed.
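The loop above can be sketched against a hypothetical toy language model (all tokens and probabilities invented for illustration; real implementations sum log-probabilities, as here, to avoid underflow):

```python
import math

# Toy "model": maps a partial sequence (tuple of tokens) to a
# next-token distribution. Purely illustrative numbers.
TOY_LM = {
    (): {"Hola": 0.70, "Buenos": 0.25, "Que": 0.05},
    ("Hola",): {"<eos>": 0.60, "amigo": 0.40},
    ("Buenos",): {"días": 0.90, "<eos>": 0.10},
    ("Que",): {"tal": 1.0},
    ("Hola", "amigo"): {"<eos>": 1.0},
    ("Buenos", "días"): {"<eos>": 1.0},
    ("Que", "tal"): {"<eos>": 1.0},
}

def beam_search(lm, b=2, max_len=3):
    beams = [((), 0.0)]                  # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, p in lm[seq].items():
                cand = (seq + (tok,), lp + math.log(p))
                (finished if tok == "<eos>" else candidates).append(cand)
        # Prune: keep only the b best live hypotheses overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

best_seq, best_lp = beam_search(TOY_LM)
print(best_seq, best_lp)  # ('Hola', '<eos>') with log-prob log(0.7 * 0.6)
```

Production systems add length normalization to the score, since raw log-probability sums systematically favor shorter sequences.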
Production Strategies
OpenAI API (GPT-4):
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,  # default: 1.0
    max_tokens=100,
    # top_p defaults to 1 (disabled); tune temperature or top_p, not both
)
print(response.choices[0].message.content)
Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

output = model.generate(
    input_ids=tokenizer.encode("Hello", return_tensors="pt"),
    max_length=50,
    do_sample=True,   # enable sampling (vs greedy)
    temperature=0.7,
    top_k=40,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
When to Use What
| Task | Temperature | Top-k / Top-p | Beam Search |
|---|---|---|---|
| Factual QA | 0.2 | Disabled | Sometimes (b=3) |
| Code Gen | 0.0 | Disabled | No |
| Chat | 0.7 | Top-p=0.9 | No |
| Creative | 1.2–1.5 | Top-p=0.9 | No |
| Translation | 0.0–0.1 | Disabled | Yes (b=5) |
| Summarization | 0.5 | Top-p=0.95 | Yes (b=3) |
References
📄 The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — introduced nucleus (top-p) sampling
📄 Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models (Vijayakumar et al., 2016)
🔗 Hugging Face text generation documentation
🔗 OpenAI API documentation