Building AI Evaluations

After this lesson, you will be able to measure AI output quality, detect hallucinations, build a simple eval set, and decide when to trust the model.

Evals are the difference between an AI demo that wows once and an AI product you can ship. This lesson covers the categories of eval, how to build a 20-example set today, and the rubric you'll iterate on for months.

Prerequisites: Prompt Engineering

Why evals are the most under-invested AI skill

Without evals, you can't tell if a prompt change made the model better or just different. Most teams ship by vibes for the first six months, then hit a wall when they try to swap models or update prompts. Evals turn 'feels better' into 'is 12% better on the metric we care about'. They're how serious AI products operate.

Three categories of eval

Reference-based: compare model output to a known-correct answer. Used for classification, extraction, math. Easy to automate.
Rubric-based: judge the output against a written rubric, often using another LLM as the grader (LLM-as-judge). Used for open-ended generation, summarisation.
User-feedback-based: collect thumbs up/down, A/B tests, conversion lift. Used for shipped features. Slowest signal but most honest.
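
As a minimal sketch of the reference-based category, the snippet below scores outputs by exact match against known-correct labels. The function name `exact_match_accuracy`, the normalisation rule (lowercase, collapsed whitespace), and the sample sentiment labels are illustrative assumptions, not taken from any particular eval tool.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions that equal their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and known-correct labels for a sentiment classifier.
preds = ["Positive", "negative ", "neutral", "positive"]
refs = ["positive", "negative", "positive", "positive"]
print(f"Exact-match accuracy: {exact_match_accuracy(preds, refs):.2f}")  # 3 of 4 match -> 0.75
```

Exact match suits closed answers such as labels or extracted fields. Open-ended text usually needs the rubric-based approach instead.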

Build a simple eval set in 30 minutes

Use this exact pattern for every feature you ship.

python

# evals/my-feature.json
[
  { "input": "Summarise this email: ...", "expected_contains": ["action items", "deadline"] },
  { "input": "Summarise this email: ...", "expected_excludes": ["opinion"] },
  ...
]

# evals/run.py
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("evals/my-feature.json") as f:
    cases = json.load(f)

score = 0
for case in cases:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": case["input"]}],
    )
    text = resp.content[0].text.lower()
    # Pass only if every required phrase appears and no forbidden phrase does.
    ok = all(s.lower() in text for s in case.get("expected_contains", []))
    ok = ok and not any(s.lower() in text for s in case.get("expected_excludes", []))
    score += int(ok)

# Guard against an empty eval file to avoid dividing by zero.
total = len(cases)
rate = 100 * score / total if total else 0.0
print(f"Pass rate: {score}/{total} ({rate:.1f}%)")

LLM-as-judge: when to use it, when not to

LLM-as-judge is cheap, scalable rubric grading. Use Claude or GPT-4 with a prompt like 'rate this summary 1-5 on faithfulness, conciseness, accuracy.'

Pitfalls:
Judges prefer their own outputs, so use a different model as judge than the model under test.
Judges are less reliable on the harder edge cases, so use humans for those.
Judges can be prompted into being too lenient, so calibrate against human ratings on a small subset.

Promptfoo, Braintrust, LangSmith, and Anthropic Evals are commonly used tools.
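
One way to wire up rubric grading is sketched below, under stated assumptions. The prompt wording, the reply format (`criterion: score`), and the names `JUDGE_PROMPT`, `parse_scores`, `grade`, and `agreement_rate` are all hypothetical. The judge is passed in as a plain function (`call_judge`) so the sketch runs without an API key; in practice that function would call a model other than the one under test.

```python
import re

# Hypothetical rubric prompt; the three criteria come from the lesson text.
JUDGE_PROMPT = """You are grading a summary against its source.
Rate the summary from 1 to 5 on each of: faithfulness, conciseness, accuracy.
Reply with one line per criterion, for example "faithfulness: 4".

Source:
{source}

Summary:
{summary}
"""

CRITERIA = ("faithfulness", "conciseness", "accuracy")


def parse_scores(judge_reply):
    """Extract {criterion: score} from the judge's reply; raise if a score is missing."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*:\s*([1-5])\b", judge_reply, re.IGNORECASE)
        if match is None:
            raise ValueError(f"judge reply is missing a score for {name!r}")
        scores[name] = int(match.group(1))
    return scores


def grade(source, summary, call_judge):
    """call_judge: any function that takes a prompt string and returns the judge's text."""
    reply = call_judge(JUDGE_PROMPT.format(source=source, summary=summary))
    return parse_scores(reply)


def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Share of items where judge and human scores differ by at most `tolerance` points."""
    pairs = list(zip(judge_scores, human_scores))
    if not pairs:
        return 0.0
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)


def fake_judge(prompt):
    """Stand-in for a real model call, used only to demonstrate the flow."""
    return "faithfulness: 4\nconciseness: 5\naccuracy: 3"


print(grade("Long email text", "Short summary", fake_judge))
print(f"Judge/human agreement: {agreement_rate([4, 5, 2], [4, 3, 2]):.2f}")
```

Checking `agreement_rate` on a small human-rated subset is one simple way to do the calibration step. A low rate suggests revising the judge prompt before trusting its scores at scale.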

Operate evals as part of every change

This is the engineering loop that turns a demo into a product.

1. Before any prompt or model change: capture the current pass rate on the eval set
2. Make the change
3. Re-run evals
4. If pass rate drops > 5 points, revert and investigate
5. For any new failure mode you find in production, add it to the eval set so it never silently regresses again
6. Quarterly: review the eval set against current behaviour. Retire stale cases, add new failure modes
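
The before/after comparison in this loop can be sketched as a small gate. It assumes each run is stored as a mapping from case ID to pass/fail. The names `pass_rate` and `compare_runs`, the case IDs, and the result format are hypothetical; only the 5-point threshold comes from the steps above.

```python
def pass_rate(results):
    """Pass rate as a percentage; `results` maps case ID -> True/False."""
    if not results:
        raise ValueError("results must contain at least one case")
    return 100 * sum(results.values()) / len(results)


def compare_runs(baseline, candidate, max_drop_points=5.0):
    """Compare two eval runs and decide whether the change should be reverted."""
    drop = pass_rate(baseline) - pass_rate(candidate)
    # Cases that passed before the change and fail (or are missing) after it.
    newly_failing = sorted(
        case_id for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )
    return {
        "drop_points": drop,
        "newly_failing": newly_failing,
        "revert": drop > max_drop_points,
    }


# Hypothetical runs over ten cases: 9/10 pass before the change, 8/10 after.
baseline = {f"case-{i}": True for i in range(10)}
baseline["case-9"] = False
candidate = dict(baseline)
candidate["case-0"] = False
candidate["case-1"] = False
candidate["case-9"] = True

report = compare_runs(baseline, candidate)
print(report)
```

Listing `newly_failing` alongside the aggregate drop helps because an unchanged pass rate can still hide cases that flipped in opposite directions.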

Common mistakes only experienced AI engineers avoid

Treating one eval pass as 'shipped'. Run on every code change.
Not separating golden test set from prompt-tuning set. If you tune to the same set you score against, the score lies.
Forgetting cost in the eval. Quality 1% better at 3x latency / 10x cost is rarely a win.
Hand-grading forever. Use LLM-as-judge for scale, anchor with human spot-checks.
Skipping the 'refuses to answer when it shouldn't' check. Most factual hallucinations are caught here.
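
To illustrate keeping the golden set apart from the prompt-tuning set, here is a minimal sketch that assigns each case to a split by hashing its input, so a case stays in the same split across runs. The function names, the 50/50 default, and the choice of SHA-256 are assumptions made for illustration; a random split fixed once and stored on disk would serve the same purpose.

```python
import hashlib


def assign_split(case_input, golden_fraction=0.5):
    """Deterministically map an input string to 'golden' or 'tuning'."""
    digest = hashlib.sha256(case_input.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "golden" if bucket < golden_fraction else "tuning"


def split_cases(cases, golden_fraction=0.5):
    """Partition eval cases; tune prompts on 'tuning', report scores on 'golden' only."""
    splits = {"golden": [], "tuning": []}
    for case in cases:
        splits[assign_split(case["input"], golden_fraction)].append(case)
    return splits


# Hypothetical cases in the same shape as the eval file in this lesson.
cases = [{"input": f"Summarise this email: example {i}"} for i in range(20)]
splits = split_cases(cases)
print(len(splits["golden"]), "golden cases,", len(splits["tuning"]), "tuning cases")
```

Because assignment depends only on the input text, adding new cases later does not move existing ones between splits.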

Quick Check

Which is the single best first step when you suspect a regression?

