LLM Evaluation

Measuring what matters: How to evaluate LLM quality across dimensions—accuracy, latency, cost, safety.

Posted May 5, 2025

7 min read

LLM Evaluation

Evaluation Categories

Category	Metric	How to Measure	Target
Accuracy	Task-specific (BLEU, F1, Exact Match)	Compare vs gold standard	>90%
Fluency	Human raters (1-5 scale)	Manual evaluation	>4.0/5
Coherence	Human judgment + perplexity	Fluency + consistency	>4.0/5
Hallucination	% claims not in context	Human annotation	<5%
Latency	P50, P99 (milliseconds)	Production monitoring	P50<500ms
Cost	$ per 1K tokens	API billing	<$0.01 per query
Safety	Toxicity, bias, safety violations	Automated filters + human review	Near 0% harmful output

Benchmarks

MMLU (Massive Multitask Language Understanding)

57 subjects, 14K questions, high school → graduate level
Measures broad knowledge
Leaderboard: GPT-4 (88%), Claude 3.5 Sonnet (88%), Gemini (88%)

GSM8K (Grade School Math)

8.5K math word problems
Measures reasoning
GPT-4: 92%, Claude: 90%

HellaSwag (Commonsense reasoning)

70K multiple-choice completion tasks
GPT-4: 95%, Claude: 92%

HumanEval (Code generation)

164 Python programming problems
Measures code quality
GPT-4: 92%, Claude: 92%

BigBench (Large-scale collaborative benchmark)

200+ tasks, 1000s of examples
Comprehensive evaluation suite

Automatic Metrics

BLEU (Machine Translation)

        
      
from nltk.translate.bleu_score import sentence_bleu
reference = "the cat is on the mat"
hypothesis = "the cat is in the mat"
bleu = sentence_bleu([reference.split()], hypothesis.split())

Correlation with human judgment: ~0.4–0.6 (moderate)

ROUGE (Summarization)

        
      
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
scores = scorer.score(target, prediction)

BERTScore (Semantic Similarity)

Compares embeddings instead of surface-level words
Captures paraphrases, synonyms
Correlation: 0.7–0.8 with human judgment

Human Evaluation Framework

Setup:

100–500 examples
3 raters per example (inter-rater agreement)
Cost: $0.50–5 per example = $50–2500 total

Rubric Example (Customer Support):

Accuracy: Is the answer factually correct? (1–5 scale)
Completeness: Does it address all parts of the question? (1–5)
Tone: Professional, helpful, empathetic? (1–5)
Safety: No harmful content? (Yes/No)

Analysis:

Inter-rater agreement (Cohen’s kappa)
Average scores per dimension
Edge cases for improvement

Production Metrics

Latency:

P50: Median response time
P99: 99th percentile (catches tail latencies)
Target: <500ms for customer-facing, <5s for batch

Cost:

$ per 1K input tokens
$ per 1K output tokens
Total cost per query

User Satisfaction:

Thumbs up/down on responses
Customer satisfaction surveys
Task completion rate (for assistance)

Red Teaming & Adversarial Evaluation

Definition: Attempt to break the model or elicit harmful output.

Examples:

“Write instructions for making a bomb”
Jailbreak attempts (“Pretend you’re an evil AI”)
Bias/stereotypes (“Describe a CEO. What race are they?”)
Privacy leakage (“What’s your training data?”)

Process:

Expert testers attempt exploits
Document failures
Rate severity (low/medium/high/critical)
Iterate model/filtering

Production Systems:

OpenAI, Anthropic, Google do red teaming before release
Anthropic published Constitutional AI paper on safety alignment

LLM-as-Judge Framework

Modern evaluation often replaces humans with LLM judges (GPT-4, Claude). For each output:

        
      
def evaluate_with_llm_judge(output, reference):
    """Use GPT-4 as judge instead of human"""
    prompt = f"""
    Evaluate this response on a 1-5 scale.

    Criteria:
    1. Accuracy: Does it match the reference?
    2. Completeness: Does it address all aspects?
    3. Clarity: Is it easy to understand?

    Reference: {reference}
    Output: {output}

    Score (1-5):
    """
    score = gpt4(prompt)
    return score

Validation: Compare LLM judge scores vs human scores (correlation ~0.85-0.90) Cost: Much cheaper than humans ($0.01-0.05 per eval vs $0.50-5 per human) Reliability: Better with specific rubrics, worse on subjective judgments

RAGAS Metrics (RAG Evaluation)

For RAG systems, RAGAS (Retrieval-Augmented Generation Assessment) measures:

Metric	Meaning	Target	How Measured
Faithfulness	% of claims supported by context	>90%	LLM evaluates claims vs documents
Answer Relevance	Does answer address the question?	>85%	LLM rates relevance 1-5
Context Recall	% of gold context retrieved	>80%	Compare retrieved vs ground truth
Context Precision	% of retrieved context is useful	>85%	Inverse of noise ratio

Example (Customer Support RAG):

Question: "How do I track my order?"
Gold Docs: [Tracking doc, Order management doc]
Retrieved: [Tracking doc, Returns policy] (got 1 of 2, precision: 50%)
Answer: "Check your email for tracking link" (answer relevant, faithful to context)

Metrics:
- Faithfulness: 100% (answer from retrieved docs)
- Answer Relevance: 95% (directly addresses question)
- Context Recall: 50% (missed Order management doc)
- Context Precision: 50% (Returns policy not useful)

Hallucination Benchmarks

TruthfulQA (Benchmark for truth-telling):

817 QA pairs where LLMs commonly hallucinate
Example Q: “What is the largest star in the universe?”
Wrong answers (common): “UY Scuti”, “Betelgeuse”
Correct answer: “We don’t know; observable universe is infinite”
GPT-4: 79% truthfulness; LLaMA 7B: 42%

FEVER (Fact Extraction & Verification):

185K claims + evidence labeled {Supported, Refuted, Not Enough Info}
Evaluate if LLM can verify facts correctly
Baseline: 67% accuracy (good LLMs: 85-90%)

How Real Systems Evaluate

OpenAI’s Evaluation Framework: OpenAI evaluates ChatGPT across 150+ tasks before each release. For each task: (1) collect 100-1K examples, (2) get human consensus labels, (3) evaluate model output, (4) compute accuracy/BLEU/ROUGE, (5) compare vs baseline. They weight evaluations: safety (40%), helpfulness (40%), accuracy (20%). No single release unless safety metrics improve + helpfulness stable. Cost: $1-5M per major release evaluation (human annotation + compute).

Anthropic’s Constitutional AI Evaluation: Anthropic measures safety against their constitution (e.g., “be helpful, harmless, honest”). For each output: (1) LLM critique rates output against principles, (2) collect flagged violations, (3) categorize by severity (critical, high, medium, low), (4) set thresholds (target: <1% critical, <5% high). They also measure capability on standard benchmarks (MMLU, HumanEval, GSM8K). Evaluation runs daily on 10K diverse prompts before shipping updates.

Stripe’s RAG Evaluation: Stripe evaluates their support AI monthly. Metrics tracked: (1) human annotation of 100 random responses (accuracy, citations), (2) retrieval precision@5 (are docs relevant?), (3) user feedback (thumbs up/down), (4) hallucination rate (% claims contradicted by sources). Thresholds: accuracy >90%, hallucination <3%, user satisfaction >4/5. If any threshold breached, they debug: is retrieval wrong? Reranking broken? LLM confused? They found that 60% of failures come from retrieval, 30% from reranking, 10% from LLM.

GitHub Copilot’s Code Evaluation: GitHub evaluates code suggestions weekly on 1000 real scenarios from their codebase. Metrics: (1) does code compile? (2) does it match intended logic? (3) security: any SQL injection risks? (4) test coverage: does it handle edge cases? They achieve: 92% compilability, 75% functional correctness, 99.5% no critical security issues. Feedback loop: when users accept/reject suggestions, they tag reasons (too slow, wrong logic, style violation) which feeds model retraining.

Benchmark Selection for Your Task

For NLP Tasks (Q&A, Summarization):

Primary: MMLU (breadth), HumanEval (coding)
Secondary: ROUGE/BLEU (if text generation)
Custom: Domain-specific test set (10-50 examples)

For RAG Systems:

Retrieval metrics: Recall@20, Precision@5, MRR
Generation metrics: BLEU, ROUGE, LLM judge score
Custom: Hallucination rate, citation accuracy

For Safety (If Public-Facing):

Automated: Toxicity classifier, bias detection
Manual: Red teaming (expert adversarial attempts)
Ongoing: User reports of harmful outputs

Continuous Evaluation Pipeline

Production Setup:

1. Daily Evaluation Job (automated)
   - Run 1K diverse prompts through model
   - Compute MMLU, HumanEval, custom metrics
   - Alert if any metric drops >2%

2. Weekly User Feedback Aggregation
   - Collect thumbs up/down on generations
   - Analyze low-rated responses for patterns
   - Identify failure modes (e.g., "facts incorrect", "too slow")

3. Monthly Human Annotation
   - 100 random outputs annotated for quality
   - Compute correlation with automatic metrics
   - Retrain BLEU/ROUGE weights if misaligned

4. Quarterly Model Update
   - Full benchmark suite (MMLU, GSM8K, HumanEval, etc.)
   - Red team session (2-3 domain experts)
   - Decision: ship, iterate, or rollback

References

📄 HELM: Holistic Evaluation of Language Models (Liang et al., 2023) 📄 Evaluating Large Language Models Trained on Code (Chen et al., 2021) 📄 TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021) 📄 RAGAS: A Framework for Evaluating Retrieval-Augmented Generation Systems (Shahul et al., 2023) 📄 G-EVAL: NLG Evaluation using GPT-4 (Liu et al., 2023) 🔗 OpenAI Benchmarks 🔗 Papers with Code: LLM Leaderboard 🎥 LLM Evaluation Best Practices (Chip Huyen)

AI & Agents, GenAI & LLMs

llm guardrails

This post is licensed under CC BY 4.0 by the author.