Controlling Randomness — Temperature, Top-k, Top-p

How to control the creativity-coherence tradeoff: Sampling strategies determine whether your LLM generates repetitive text or hallucinated nonsense.


Quick Reference

| Method | Effect | Best For | Use Case |
|---|---|---|---|
| Greedy | Deterministic (always pick max probability) | Reproducibility, factual tasks | Search, translation |
| Temperature | Controls randomness (0=greedy, 1=baseline, >1=chaos) | Tuning diversity | Creative writing, chat |
| Top-k | Keep top-k tokens, discard tail | Reducing nonsense | General LLM sampling |
| Top-p (Nucleus) | Cumulative probability cutoff | Adaptive quality | Most production LLMs |
| Beam Search | Keep top-b hypotheses, expand | Longer sequences | Machine translation |

Temperature: The Fundamental Control

How It Works:

LLM outputs logits (raw scores). Apply softmax to convert to probabilities.

logits = [2.0, 0.5, 0.1]

Without temperature (T=1):
  probs = softmax([2.0, 0.5, 0.1]) = [0.73, 0.16, 0.11]

With low temperature (T=0.5):
  scaled = [2.0/0.5, 0.5/0.5, 0.1/0.5] = [4.0, 1.0, 0.2]
  probs = softmax([4.0, 1.0, 0.2]) = [0.93, 0.05, 0.02]
  → Peaks sharper, more deterministic

With high temperature (T=2.0):
  scaled = [2.0/2.0, 0.5/2.0, 0.1/2.0] = [1.0, 0.25, 0.05]
  probs = softmax([1.0, 0.25, 0.05]) = [0.54, 0.25, 0.21]
  → Peaks flatter, more random
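The scaling above can be sketched in plain Python (no framework assumed); `softmax_with_temperature` is a name chosen here for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then softmax. Lower T sharpens, higher T flattens."""
    if temperature <= 0:
        raise ValueError("T=0 is the greedy limit; use argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.1]
print(softmax_with_temperature(logits, 1.0))  # ≈ [0.73, 0.16, 0.11]
print(softmax_with_temperature(logits, 0.5))  # ≈ [0.93, 0.05, 0.02]
print(softmax_with_temperature(logits, 2.0))  # ≈ [0.54, 0.25, 0.21]
```

Note the max-subtraction trick: it changes nothing mathematically but prevents `exp` overflow on large logits.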

Interpretation:

  • T=0 (Greedy): Always pick the highest-probability token. Deterministic, fast.
  • T<1 (Cool): Sharpen the distribution. More deterministic, fewer surprises.
  • T=1.0 (Baseline): Model’s original distribution. Balanced.
  • T>1 (Hot): Flatten the distribution. More randomness, less coherence. “Creative” but risky.

Production Settings:

  • GPT-4 API default: T=1.0
  • OpenAI Chat: T=0.7
  • Code generation (Copilot): T=0.0 (greedy, reproducible)
  • Creative writing: T=1.2–1.5
  • Factual QA: T=0.2–0.5

Effect on Output:

Prompt: "The capital of France is"

T=0.0: "Paris" (100% of the time)
T=0.7: "Paris" (almost every time), occasionally a longer phrasing
T=2.0: "Paris", "Paries", "Parsi", "Paul", "Pterodactyl"

Top-k Sampling: Filter Low-Probability Tokens

How It Works:

  1. Get logits, compute softmax → probabilities
  2. Sort, take top-k tokens
  3. Renormalize probabilities over k tokens
  4. Sample from renormalized distribution

Example (k=2):

Probabilities: [0.6, 0.3, 0.07, 0.02, 0.01]
Top-2: [0.6, 0.3]
Renormalize: [0.6/(0.6+0.3), 0.3/(0.6+0.3)] = [0.67, 0.33]
Sample from [0.67, 0.33]

Effect:

  • ✅ Prevents sampling from very low-probability tokens (nonsense)
  • ✅ Simple to tune (just choose k)
  • ❌ Fixed k may be too restrictive or too loose depending on entropy

Production: k=40 or k=50 is typical. Reduces hallucinations.
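The four steps above can be sketched in plain Python; this assumes the probabilities are already computed, and `top_k_sample` is a name invented here:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most probable tokens, renormalize, sample one token index."""
    # Pair each probability with its token index, sort descending, truncate to k
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    # Inverse-CDF sampling over the renormalized top-k distribution
    r = rng.random() * total
    cum = 0.0
    for i, p in ranked:
        cum += p
        if r < cum:
            return i
    return ranked[-1][0]  # guard against float rounding

probs = [0.6, 0.3, 0.07, 0.02, 0.01]
# With k=2 only token 0 or token 1 can ever be drawn
print(top_k_sample(probs, k=2))
```

Sampling against the unnormalized top-k mass (`r = rng.random() * total`) avoids an explicit renormalization pass; the result is identical.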


Top-p (Nucleus) Sampling: Cumulative Probability

How It Works (Better than Top-k):

  1. Sort tokens by probability (descending)
  2. Compute the cumulative sum
  3. Keep the smallest set of top tokens whose cumulative sum ≥ p (the token that crosses the threshold is included)
  4. Renormalize over that set, sample

Example (p=0.9):

Probs:  [0.5, 0.3, 0.15, 0.03, 0.02]
CumSum: [0.5, 0.8, 0.95, 0.98, 1.0]

Keep the smallest set with cumsum ≥ 0.9:
Included: [0.5, 0.3, 0.15] (cumsum reaches 0.95 ≥ 0.9 at the third token)

Renormalize: [0.5/0.95, 0.3/0.95, 0.15/0.95] = [0.526, 0.316, 0.158]
Sample from [0.526, 0.316, 0.158]

Advantage over Top-k:

  • ✅ Adaptive: High-confidence predictions use fewer tokens, low-confidence use more
  • ✅ Reduces incoherence from truncation
  • ✅ Works across different probability distributions

Production: p=0.9 or p=0.95 most common. OpenAI uses top_p=1 (disabled) by default, preferring temperature.
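A sketch of nucleus filtering under the standard definition (smallest prefix of the sorted distribution whose cumulative probability reaches p); `top_p_sample` is a name invented here:

```python
import random

def top_p_sample(probs, p, rng=random):
    """Nucleus sampling: keep the smallest set of top tokens whose
    cumulative probability >= p, renormalize, sample one token index."""
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)
    nucleus, cum = [], 0.0
    for i, prob in ranked:
        nucleus.append((i, prob))
        cum += prob
        if cum >= p:
            break  # the token that crosses the threshold is kept
    # Inverse-CDF sampling over the renormalized nucleus
    total = sum(q for _, q in nucleus)
    r = rng.random() * total
    running = 0.0
    for i, q in nucleus:
        running += q
        if r < running:
            return i
    return nucleus[-1][0]  # guard against float rounding

probs = [0.5, 0.3, 0.15, 0.03, 0.02]
# With p=0.9 the nucleus is tokens 0-2 (cumsum 0.95); tokens 3-4 never appear
print(top_p_sample(probs, p=0.9))
```

The adaptive behavior falls out naturally: a confident distribution like [0.95, 0.03, ...] yields a one-token nucleus, while a flat one keeps many tokens.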


Beam Search: Multiple Hypotheses

How It Works:

  1. Maintain top-b candidate sequences (beams)
  2. Expand each by one token
  3. Keep top-b overall (prune worst)
  4. Repeat until EOS token

Example (b=2, max 3 tokens):

Prompt: "Translate 'Hello' to Spanish:"

Step 1: Expand the prompt, keep the top-2 first tokens
  Beam 1: "Hola"   (0.85)
  Beam 2: "Buenos" (0.10)

Step 2: Expand each beam by one token, keep top-2 overall
  "Hola"   → "Hola!"       (0.85 × 0.60 = 0.51)
  "Hola"   → "Hola,"       (0.85 × 0.30 ≈ 0.26)
  "Buenos" → "Buenos días" (0.10 × 0.90 = 0.09)
  Keep top-2: "Hola!" (0.51), "Hola," (0.26)

Step 3: Continue until EOS or max length

Benefit: Finds globally better sequence (not greedy locally).

Cost: O(b × seq_len × vocab_size) vs O(seq_len × vocab_size) for greedy.

Use: Machine translation, summarization where quality > speed.
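The loop above can be demonstrated with a toy beam search; the `next_probs` table below is a made-up bigram-style "model" for illustration, not a real LM:

```python
def beam_search(start, next_probs, beam_width=2, max_len=3):
    """Keep the beam_width highest-scoring sequences, expanding each by one
    token per step; sequences are scored by the product of token probabilities."""
    beams = [([start], 1.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Unknown last tokens fall back to emitting <eos> (toy simplification)
            for tok, p in next_probs.get(seq[-1], {"<eos>": 1.0}).items():
                candidates.append((seq + [tok], score * p))
        # Prune: keep only the beam_width best hypotheses overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Hypothetical next-token distributions for "Translate 'Hello' to Spanish"
next_probs = {
    "<bos>": {"Hola": 0.85, "Buenos": 0.10, "Adiós": 0.05},
    "Hola": {"!": 0.60, ",": 0.30, "<eos>": 0.10},
    "Buenos": {"días": 0.90, "<eos>": 0.10},
}
for seq, score in beam_search("<bos>", next_probs, beam_width=2, max_len=2):
    print(" ".join(seq), round(score, 3))
```

Real implementations add length normalization (dividing the log-score by sequence length), since raw probability products systematically favor shorter sequences.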


Production Strategies

OpenAI API (GPT-4):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.7,          # Default: 1.0
    max_tokens=100,
    # Note: top_p defaults to 1 (disabled)
)

Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

output = model.generate(
    input_ids=tokenizer.encode("Hello", return_tensors="pt"),
    max_length=50,
    temperature=0.7,
    top_k=40,
    top_p=0.95,
    do_sample=True,  # Enable sampling (vs greedy decoding)
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

When to Use What

| Task | Temperature | Top-k / Top-p | Beam Search |
|---|---|---|---|
| Factual QA | 0.2 | Disabled | Sometimes (b=3) |
| Code Gen | 0.0 | Disabled | No |
| Chat | 0.7 | Top-p=0.9 | No |
| Creative | 1.2–1.5 | Top-p=0.9 | No |
| Translation | 0.0–0.1 | Disabled | Yes (b=5) |
| Summarization | 0.5 | Top-p=0.95 | Yes (b=3) |

References

📄 The Curious Case of Neural Text Degeneration (Holtzman et al., 2019) — introduced nucleus sampling
📄 Diverse Beam Search (Vijayakumar et al., 2016)
🔗 Hugging Face Generation Documentation
🔗 OpenAI API Documentation

This post is licensed under CC BY 4.0 by the author.