After this lesson, you will be able to measure AI output quality, detect hallucinations, build a simple eval set, and decide when to trust the model.
Evals are the difference between an AI demo that wows once and an AI product you can ship. This lesson covers the categories of eval, how to build a 20-example set today, and the rubric you'll iterate on for months.
Without evals, you can't tell if a prompt change made the model better or just different. Most teams ship by vibes for the first six months, then hit a wall when they try to swap models or update prompts. Evals turn 'feels better' into 'is 12% better on the metric we care about'. They're how serious AI products operate.
There are three categories of eval:
Reference-based: compare model output to a known-correct answer. Used for classification, extraction, and math. Easy to automate.
Rubric-based: judge the output against a written rubric, often using another LLM as the grader (LLM-as-judge). Used for open-ended generation and summarisation.
User-feedback-based: collect thumbs up/down, A/B tests, and conversion lift. Used for shipped features. The slowest signal, but the most honest.
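As an illustration of the reference-based category, here is a minimal sketch of exact-match scoring for a classification task. The function names and the ticket labels are hypothetical, chosen only for this example; the normalisation step (trim and lower-case) is an assumption about what counts as "the same answer" and should be adapted to your task.

```python
def normalise(label: str) -> str:
    """Lower-case and trim so trivial formatting differences do not count as errors."""
    return label.strip().lower()


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical ticket-classification outputs versus known-correct labels.
predictions = ["Billing", "bug ", "feature request", "billing"]
references = ["billing", "bug", "bug", "billing"]

accuracy = exact_match_accuracy(predictions, references)
print(f"Exact-match accuracy: {accuracy:.2f}")
```

Three of the four predictions match their reference here, so the score is 0.75. Exact match suits closed label sets; extraction and math tasks usually need a task-specific comparison instead.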
Use this exact pattern for every feature you ship.
# evals/my-feature.json
[
  { "input": "Summarise this email: ...", "expected_contains": ["action items", "deadline"] },
  { "input": "Summarise this email: ...", "expected_excludes": ["opinion"] },
  ...
]

# evals/run.py
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("evals/my-feature.json") as f:
    cases = json.load(f)

score = 0
for case in cases:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": case["input"]}],
    )
    text = resp.content[0].text.lower()
    # A case passes if every required phrase appears and no forbidden phrase does.
    ok = all(s.lower() in text for s in case.get("expected_contains", []))
    ok = ok and not any(s.lower() in text for s in case.get("expected_excludes", []))
    score += int(ok)

print(f"Pass rate: {score}/{len(cases)} ({100*score/len(cases):.1f}%)")
LLM-as-judge is cheap, scalable rubric grading. Use Claude or GPT-4 with a prompt like 'Rate this summary 1-5 on faithfulness, conciseness, accuracy.' Pitfalls: judges tend to prefer their own outputs, so use a different model as judge than the model under test; judges are less reliable on the harder edge cases, so use humans for those; judges can be prompted into being too lenient, so calibrate against human ratings on a small subset. Promptfoo, Braintrust, LangSmith, and Anthropic Evals are commonly used tools.
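The judge setup described above can be sketched without committing to any particular provider. In this minimal example the model call is passed in as a plain function (`call_judge`), so you can plug in whichever client you use; `build_judge_prompt`, `parse_scores`, `mean_abs_gap`, the reply format, and the stubbed judge are all assumptions made for illustration, not part of any library.

```python
import re
from statistics import mean
from typing import Callable

RUBRIC = ("faithfulness", "conciseness", "accuracy")


def build_judge_prompt(source: str, summary: str) -> str:
    """Build the grading prompt; the reply format is an assumption of this sketch."""
    criteria = ", ".join(RUBRIC)
    return (
        f"Rate this summary 1-5 on {criteria}.\n"
        "Reply with one line per criterion, formatted as 'criterion: score'.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}"
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per rubric criterion, failing loudly if any is missing."""
    scores = {}
    for name in RUBRIC:
        match = re.search(rf"{name}\s*:\s*([1-5])\b", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"judge reply is missing a score for {name!r}")
        scores[name] = int(match.group(1))
    return scores


def judge(source: str, summary: str, call_judge: Callable[[str], str]) -> dict[str, int]:
    """Grade one summary; call_judge should wrap a model other than the one under test."""
    return parse_scores(call_judge(build_judge_prompt(source, summary)))


def mean_abs_gap(judge_scores: list[int], human_scores: list[int]) -> float:
    """Average absolute difference between judge and human ratings on the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(abs(j - h) for j, h in zip(judge_scores, human_scores))


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call so the sketch runs offline."""
    return "Faithfulness: 4\nConciseness: 5\nAccuracy: 3"


scores = judge("Please send the report by Friday.", "Report due Friday.", fake_judge)
print(scores)

# Calibration on a small, hypothetical subset rated by both the judge and a human.
gap = mean_abs_gap([4, 5, 3, 2], [4, 4, 3, 4])
print(f"Mean absolute gap vs. human ratings: {gap:.2f}")
```

A large gap on the calibration subset suggests the judge prompt needs tightening before its scores are trusted at scale.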
This is the engineering loop that turns a demo into a product.
Before any prompt or model change: capture the current pass rate on the eval set
Make the change
Re-run evals
If pass rate drops > 5 points, revert and investigate
For any new failure mode you find in production, add it to the eval set so it never silently regresses again
Quarterly: review the eval set against current behaviour. Retire stale cases, add new failure modes
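The loop above can be partly automated. The following is a minimal sketch, assuming pass rates are measured in percentage points and eval cases live in a JSON list like the one shown earlier; `should_revert` and `add_regression_case` are hypothetical helper names, and the demo writes to a temporary directory rather than a real eval file.

```python
import json
import tempfile
from pathlib import Path

MAX_DROP_POINTS = 5.0  # revert threshold taken from the loop above


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of all cases."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def should_revert(baseline_rate: float, new_rate: float,
                  max_drop: float = MAX_DROP_POINTS) -> bool:
    """True when the pass rate fell by more than max_drop percentage points."""
    return (baseline_rate - new_rate) > max_drop


def add_regression_case(path: Path, case: dict) -> int:
    """Append a production failure to the eval set (skipping duplicates); return the new size."""
    cases = json.loads(path.read_text()) if path.exists() else []
    if case not in cases:
        cases.append(case)
        path.write_text(json.dumps(cases, indent=2))
    return len(cases)


baseline = pass_rate(18, 20)   # captured before the change
candidate = pass_rate(16, 20)  # measured after the change
revert = should_revert(baseline, candidate)
print(f"{baseline:.1f}% -> {candidate:.1f}%, revert: {revert}")

# Demo in a temporary directory so no real eval file is touched.
with tempfile.TemporaryDirectory() as tmp:
    eval_path = Path(tmp) / "my-feature.json"
    new_case = {
        "input": "Summarise this email: (hypothetical production failure)",
        "expected_contains": ["deadline"],
    }
    count_after_first = add_regression_case(eval_path, new_case)
    count_after_repeat = add_regression_case(eval_path, new_case)
print(f"Cases after adding once: {count_after_first}, after re-adding: {count_after_repeat}")
```

Running a check like this in CI makes the "capture, change, re-run" steps hard to skip, though the 5-point threshold is only meaningful once the eval set is large enough that a single case does not move the rate by that much.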
Common mistakes to avoid:
Treating one eval pass as 'shipped'. Run the evals on every code change.
Not separating the golden test set from the prompt-tuning set. If you tune to the same set you score against, the score lies.
Forgetting cost in the eval. Quality that is 1% better at 3x latency or 10x cost is rarely a win.
Hand-grading forever. Use LLM-as-judge for scale, and anchor it with human spot-checks.
Skipping the check that the model refuses to answer when it shouldn't answer, for example when the input does not contain the information asked for. Many factual hallucinations are caught here.
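Keeping the golden test set apart from the prompt-tuning set is easier when the split is deterministic. One way to do it, sketched here as an assumption rather than a prescribed method, is to hash each case's input so that a case always lands in the same split, even as the eval file grows. `assign_split` and `split_cases` are hypothetical names, and the example inputs are invented.

```python
import hashlib


def assign_split(case_input: str, golden_fraction: float = 0.5) -> str:
    """Map an input to 'golden' or 'tuning' using a stable hash of its text."""
    digest = hashlib.sha256(case_input.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number that is roughly uniform between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "golden" if bucket < golden_fraction else "tuning"


def split_cases(cases: list[dict], golden_fraction: float = 0.5) -> tuple[list[dict], list[dict]]:
    """Return (tuning, golden); tune prompts on the first, report scores on the second."""
    tuning, golden = [], []
    for case in cases:
        if assign_split(case["input"], golden_fraction) == "golden":
            golden.append(case)
        else:
            tuning.append(case)
    return tuning, golden


# Hypothetical eval cases; only the "input" field matters for the split.
cases = [{"input": f"Summarise this email: example {i}"} for i in range(20)]
tuning, golden = split_cases(cases)
print(f"{len(tuning)} tuning cases, {len(golden)} golden cases")
```

Because the assignment depends only on the input text, re-running the split never moves a case from the tuning set into the golden set, which is what keeps the reported score honest.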