Controlling Randomness — Temperature, Top-k, Top-p

How to control the creativity-coherence tradeoff: Sampling strategies determine whether your LLM generates repetitive text or hallucinated nonsense.


Quick Reference

| Method | Effect | Best For | Use Case |
|---|---|---|---|
| Greedy | Deterministic (always pick max probability) | Reproducibility, factual tasks | Search, translation |
| Temperature | Controls randomness (0=greedy, 1=baseline, >1=chaos) | Tuning diversity | Creative writing, chat |
| Top-k | Keep top-k tokens, discard tail | Reducing nonsense | General LLM sampling |
| Top-p (Nucleus) | Cumulative probability cutoff | Adaptive quality | Most production LLMs |
| Beam Search | Keep top-b hypotheses, expand | Longer sequences | Machine translation |

Temperature: The Fundamental Control

How It Works:

LLM outputs logits (raw scores). Apply softmax to convert to probabilities.

logits = [2.0, 0.5, 0.1]

Without temperature (T=1):
  probs = softmax([2.0, 0.5, 0.1]) = [0.73, 0.16, 0.11]

With low temperature (T=0.5):
  scaled = [2.0/0.5, 0.5/0.5, 0.1/0.5] = [4.0, 1.0, 0.2]
  probs = softmax([4.0, 1.0, 0.2]) = [0.93, 0.05, 0.02]
  → Peaks sharper, more deterministic

With high temperature (T=2.0):
  scaled = [2.0/2.0, 0.5/2.0, 0.1/2.0] = [1.0, 0.25, 0.05]
  probs = softmax([1.0, 0.25, 0.05]) = [0.54, 0.25, 0.21]
  → Peaks flatter, more random
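The scaling above can be sketched in plain Python (no framework assumed); `softmax_with_temperature` is a name chosen here for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then softmax. Lower T sharpens, higher T flattens."""
    if temperature <= 0:
        raise ValueError("T=0 is the greedy limit; use argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.1]
print(softmax_with_temperature(logits, 1.0))  # ≈ [0.73, 0.16, 0.11]
print(softmax_with_temperature(logits, 0.5))  # ≈ [0.93, 0.05, 0.02]
print(softmax_with_temperature(logits, 2.0))  # ≈ [0.54, 0.25, 0.21]
```

Note the max-subtraction trick: it changes nothing mathematically but prevents `exp` overflow on large logits.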

Interpretation:

  • T=0 (Greedy): Always pick the highest-probability token. Deterministic, fast.
  • T<1 (Cool): Sharpen the distribution. More deterministic, fewer surprises.
  • T=1.0 (Baseline): Model’s original distribution. Balanced.
  • T>1 (Hot): Flatten the distribution. More randomness, less coherence. “Creative” but risky.

Production Settings:

  • GPT-4 API default: T=1.0
  • OpenAI Chat: T=0.7
  • Code generation (Copilot): T=0.0 (greedy, reproducible)
  • Creative writing: T=1.2–1.5
  • Factual QA: T=0.2–0.5

Effect on Output:

Prompt: "The capital of France is"

T=0.0: "Paris" (100% of the time)
T=0.7: "Paris" (almost every time), occasionally a longer phrasing
T=2.0: "Paris", "Paries", "Parsi", "Paul", "Pterodactyl"

Top-k Sampling: Filter Low-Probability Tokens

How It Works:

  1. Get logits, compute softmax → probabilities
  2. Sort, take top-k tokens
  3. Renormalize probabilities over k tokens
  4. Sample from renormalized distribution

Example (k=2):

Probabilities: [0.6, 0.3, 0.07, 0.02, 0.01]
Top-2: [0.6, 0.3]
Renormalize: [0.6/(0.6+0.3), 0.3/(0.6+0.3)] = [0.67, 0.33]
Sample from [0.67, 0.33]

Effect:

  • ✅ Prevents sampling from very low-probability tokens (nonsense)
  • ✅ Simple to tune (just choose k)
  • ❌ Fixed k may be too restrictive or too loose depending on entropy

Production: k=40 or k=50 is typical. Reduces hallucinations.
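The four steps above can be sketched in plain Python; this assumes the probabilities are already computed, and `top_k_sample` is a name invented here:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most probable tokens, renormalize, sample one token index."""
    # Pair each probability with its token index, sort descending, truncate to k
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    # Inverse-CDF sampling over the renormalized top-k distribution
    r = rng.random() * total
    cum = 0.0
    for i, p in ranked:
        cum += p
        if r < cum:
            return i
    return ranked[-1][0]  # guard against float rounding

probs = [0.6, 0.3, 0.07, 0.02, 0.01]
# With k=2 only token 0 or token 1 can ever be drawn
print(top_k_sample(probs, k=2))
```

Sampling against the unnormalized top-k mass (`r = rng.random() * total`) avoids an explicit renormalization pass; the result is identical.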


Top-p (Nucleus) Sampling: Cumulative Probability

How It Works (Better than Top-k):

  1. Sort tokens by probability (descending)
  2. Compute the cumulative sum
  3. Keep the smallest set of top tokens whose cumulative sum ≥ p (the token that crosses the threshold is included)
  4. Renormalize over that set, sample

Example (p=0.9):

Probs:  [0.5, 0.3, 0.15, 0.03, 0.02]
CumSum: [0.5, 0.8, 0.95, 0.98, 1.0]

Keep the smallest set with cumsum ≥ 0.9:
Included: [0.5, 0.3, 0.15] (cumsum reaches 0.95 ≥ 0.9 at the third token)

Renormalize: [0.5/0.95, 0.3/0.95, 0.15/0.95] = [0.526, 0.316, 0.158]
Sample from [0.526, 0.316, 0.158]

Advantage over Top-k:

  • ✅ Adaptive: High-confidence predictions use fewer tokens, low-confidence use more
  • ✅ Reduces incoherence from truncation
  • ✅ Works across different probability distributions

Production: p=0.9 or p=0.95 most common. OpenAI uses top_p=1 (disabled) by default, preferring temperature.
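A sketch of nucleus filtering under the standard definition (smallest prefix of the sorted distribution whose cumulative probability reaches p); `top_p_sample` is a name invented here:

```python
import random

def top_p_sample(probs, p, rng=random):
    """Nucleus sampling: keep the smallest set of top tokens whose
    cumulative probability >= p, renormalize, sample one token index."""
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)
    nucleus, cum = [], 0.0
    for i, prob in ranked:
        nucleus.append((i, prob))
        cum += prob
        if cum >= p:
            break  # the token that crosses the threshold is kept
    # Inverse-CDF sampling over the renormalized nucleus
    total = sum(q for _, q in nucleus)
    r = rng.random() * total
    running = 0.0
    for i, q in nucleus:
        running += q
        if r < running:
            return i
    return nucleus[-1][0]  # guard against float rounding

probs = [0.5, 0.3, 0.15, 0.03, 0.02]
# With p=0.9 the nucleus is tokens 0-2 (cumsum 0.95); tokens 3-4 never appear
print(top_p_sample(probs, p=0.9))
```

The adaptive behavior falls out naturally: a confident distribution like [0.95, 0.03, ...] yields a one-token nucleus, while a flat one keeps many tokens.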


Beam Search: Multiple Hypotheses

How It Works:

  1. Maintain top-b candidate sequences (beams)
  2. Expand each by one token
  3. Keep top-b overall (prune worst)
  4. Repeat until EOS token

Example (b=2, max 3 tokens):

Prompt: "Translate 'Hello' to Spanish:"

Step 1: Expand the prompt, keep the top-2 first tokens
  Beam 1: "Hola"   (0.85)
  Beam 2: "Buenos" (0.10)

Step 2: Expand each beam by one token, keep top-2 overall
  "Hola"   → "Hola!"       (0.85 × 0.60 = 0.51)
  "Hola"   → "Hola,"       (0.85 × 0.30 ≈ 0.26)
  "Buenos" → "Buenos días" (0.10 × 0.90 = 0.09)
  Keep top-2: "Hola!" (0.51), "Hola," (0.26)

Step 3: Continue until EOS or max length

Benefit: Finds globally better sequence (not greedy locally).

Cost: O(b × seq_len × vocab_size) vs O(seq_len × vocab_size) for greedy.

Use: Machine translation, summarization where quality > speed.
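The loop above can be demonstrated with a toy beam search; the `next_probs` table below is a made-up bigram-style "model" for illustration, not a real LM:

```python
def beam_search(start, next_probs, beam_width=2, max_len=3):
    """Keep the beam_width highest-scoring sequences, expanding each by one
    token per step; sequences are scored by the product of token probabilities."""
    beams = [([start], 1.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Unknown last tokens fall back to emitting <eos> (toy simplification)
            for tok, p in next_probs.get(seq[-1], {"<eos>": 1.0}).items():
                candidates.append((seq + [tok], score * p))
        # Prune: keep only the beam_width best hypotheses overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Hypothetical next-token distributions for "Translate 'Hello' to Spanish"
next_probs = {
    "<bos>": {"Hola": 0.85, "Buenos": 0.10, "Adiós": 0.05},
    "Hola": {"!": 0.60, ",": 0.30, "<eos>": 0.10},
    "Buenos": {"días": 0.90, "<eos>": 0.10},
}
for seq, score in beam_search("<bos>", next_probs, beam_width=2, max_len=2):
    print(" ".join(seq), round(score, 3))
```

Real implementations add length normalization (dividing the log-score by sequence length), since raw probability products systematically favor shorter sequences.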


Production Strategies

OpenAI API (GPT-4):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,          # Default: 1.0
    max_tokens=100,
    # Note: top_p defaults to 1 (disabled)
)

Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

output = model.generate(
    input_ids=tokenizer.encode("Hello", return_tensors="pt"),
    max_length=50,
    temperature=0.7,
    top_k=40,
    top_p=0.95,
    do_sample=True,  # Enable sampling (vs greedy decoding)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

When to Use What

| Task | Temperature | Top-k / Top-p | Beam Search |
|---|---|---|---|
| Factual QA | 0.2 | Disabled | Sometimes (b=3) |
| Code Gen | 0.0 | Disabled | No |
| Chat | 0.7 | Top-p=0.9 | No |
| Creative | 1.2–1.5 | Top-p=0.9 | No |
| Translation | 0.0–0.1 | Disabled | Yes (b=5) |
| Summarization | 0.5 | Top-p=0.95 | Yes (b=3) |

References

📄 The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — introduced nucleus sampling
📄 Diverse Beam Search (Vijayakumar et al., 2016)
🔗 Hugging Face Generation Documentation
🔗 OpenAI API Documentation

This post is licensed under CC BY 4.0 by the author.