9  Evaluation Beyond the Benchmark

9.1 Learning objectives

By the end of this chapter you should be able to:

  • Apply the job-interview metaphor to AI evaluation, favouring organisation-specific task suites and judgement assessment over generic public leaderboards.
  • Implement LLM-as-judge evaluation with appropriate calibration, including pairwise comparison and rubric scoring against gold-standard references.
  • Use biomedical benchmarks (MedQA, MedAlign, MMLU-Med) appropriately, with awareness of contamination and the distinction between benchmark and deployment performance.
  • Design rubrics for code-generation correctness on research tasks (model selection, EDA, results interpretation).

9.2 Orientation

The first generation of LLM evaluation imitated academic-benchmark culture: standardised test sets, public leaderboards, single-number scores. It produced a useful sorting of models on common knowledge tasks and generated headline numbers (MedQA scores, MMLU scores, HumanEval scores) that everyone watched. It also produced a generation of users who confused benchmark performance with deployment performance, and a generation of model providers whose internal training optimised for the published benchmarks rather than for the actual work users care about.

The framing that has emerged for serious applied use is organisation-specific evaluation. Mollick (Mollick, 2025) articulates the analogy: ‘you shouldn’t pick the AI that will advise thousands of decisions for your organization based on whether it knows that the mean cranial capacity of Homo erectus is just under 1,000 cubic centimeters’. Evaluate the model the way you would evaluate a job candidate: expert-designed real-world tasks, rubric-based scoring against the work the model will actually do.

The chapter develops three threads. The job-interview metaphor: how to design organisation-specific evaluation suites for applied analytic work. LLM-as-judge with calibration: how to scale evaluation when human review of every output is not feasible. Biomedical benchmarks: their uses, abuses, and the contamination problem that limits their reliability.

The thread that runs through all three: a single number on a public benchmark tells you almost nothing about whether a model will work for your specific task. The work of evaluation is ongoing, contextual, and analytic (in the sense that it involves study design and inference under uncertainty).

A pointed illustration of the gap. In a 2026 randomised study of 1,298 UK adults (Bean et al., 2026), participants who used an LLM (GPT-4o, Llama 3, or Command R+) to self-assess common medical conditions performed no better, and sometimes worse, than participants who used their normal home resources, even though the same models scored well on standard medical-knowledge benchmarks. In one instance, two users described nearly identical symptoms of subarachnoid haemorrhage and received opposite advice: one was told to lie down in a dark room; the other was told to seek emergency care. The benchmarks did not predict this failure. The deployment did. The chapter’s standing question, and the question the rest of the chapter develops, is: what evaluation regime would have caught that?

9.3 The researcher’s contribution

Three judgements are not delegable.

(Judgement 1.) The evaluation task suite is part of the analysis. A team adopting an LLM for routine applied analytic work needs an evaluation suite that reflects the work: boilerplate code generation, analytical planning, methods drafting, risk-of-bias (RoB) assessment, whatever the team does. The suite is not a one-time benchmark; it is an asset built up over time, refined as failure modes are observed. The researcher designs the suite and maintains it.

(Judgement 2.) LLM-as-judge requires calibration of the judge. Using one model to evaluate another scales evaluation but introduces a new uncertainty: the judge model has biases too. The researcher calibrates the judge against human expert ratings on a sample, documents the agreement, and maintains the calibration as either model changes.

(Judgement 3.) The deployment context determines the acceptance threshold. A model that scores 0.85 on the team’s task suite may be acceptable for exploratory work but not for clinical-grade analyses. The researcher sets the acceptance threshold by use case and articulates what the threshold means in practice (e.g., ‘85% accuracy on routine code generation, with 100% manual review for clinical-grade work’).

These judgements are what distinguish organisation-specific evaluation from imitating public-benchmark culture in a context where it does not transfer.

9.4 The job-interview metaphor

Mollick’s framing (Mollick, 2025) is operational. To evaluate an AI for use in your team’s work, design the equivalent of a job interview: a small number of representative tasks, expert-designed, rubric-scored, conducted on the actual model under realistic conditions.

Three components.

Vibes-based testing. Open-ended prompts that probe the model’s worldview, judgement, and personality. ‘How would you approach a sensitivity analysis for this trial?’ ‘What would you do if you couldn’t get the cohort to converge?’ The point is not the specific answer but the texture of the response: does the model reason like a careful researcher, or does it produce confident-sounding non-answers?

Real-world task benchmarking. A small set (5–20) of expert-designed tasks that reflect the team’s actual work. Examples for a research-analytics team:

  • Generate the SAS code for a stratified Cox model given a data dictionary and pre-specified strata.
  • Draft a methods section for a published study (the team has the published version for comparison).
  • Identify the RoB issues in a study description.
  • Critique a proposed analysis plan.

Each task has a rubric: criteria for what ‘correct’ looks like, weighted by importance. The model’s output is rated against the rubric; the score is the headline metric.

Systematic judgement assessment. Tasks where the ‘right’ answer is debatable but quality of reasoning is observable. ‘You see X in the data. Is this likely a measurement issue, a real signal, or a confounder?’ The point is not to grade right vs wrong but to observe whether the model’s reasoning matches what a senior researcher would do.

A practical pattern: maintain a Google sheet or similar with one row per task. Columns: task description, expected behaviour, rubric (with weights), latest score per evaluated model, notes. Update when models are released, when failures are observed in production, when team needs evolve.
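
The same pattern can be kept in a structured file once the suite feeds an automated harness (Section 9.9). A minimal sketch of one task record, written in Python for concreteness; the field names, weights, and file paths are illustrative, not a standard format.

# One row of the evaluation suite as a structured record (illustrative fields only).
task = {
    "id": "cox-stratified-01",
    "description": "Generate SAS code for a stratified Cox model from the "
                   "attached data dictionary and pre-specified strata.",
    "inputs": ["data_dictionary.csv", "analysis_plan.txt"],   # hypothetical paths
    "expected_behaviour": "PROC PHREG with a STRATA statement; correct event coding.",
    "rubric_weights": {
        "statistical_correctness": 0.4,
        "code_correctness": 0.3,
        "clarity": 0.2,
        "completeness": 0.1,
    },
    "latest_scores": {},   # filled in per evaluated model by the runner
    "notes": "Refine after each observed production failure.",
}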

Question. Two models score identically on a public biomedical benchmark (e.g., MedQA at 88% accuracy). On the team’s organisation-specific evaluation suite, Model A scores 0.82, Model B scores 0.78. Cost is similar; latency is similar. Which should the team adopt for routine use?

Answer. Model A. The organisation-specific suite is the relevant signal; the public-benchmark equivalence is roughly informative but is dominated by the organisation-specific gap. The 4-point spread (0.82 vs 0.78) is meaningful given the suite was designed to reflect the team’s actual work. Caveat: confirm the suite is large enough that the difference is not noise. For suites with 10–20 tasks, a 4-point difference is usually within sampling noise; for suites with 50+ tasks, it is more stable. If the suite is small, extend it before deciding.
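
The sampling-noise caveat can be made concrete. Treating each task, for simplicity, as an independent pass/fail trial, the standard error of the gap between two suite-level scores follows from the binomial formula; a short sketch:

import math

def score_se(p: float, n_tasks: int) -> float:
    """Approximate standard error of a suite-level pass rate over n_tasks tasks."""
    return math.sqrt(p * (1 - p) / n_tasks)

# How the uncertainty in a 0.82 vs 0.78 gap shrinks as the suite grows.
for n in (20, 50, 200):
    se_gap = math.sqrt(score_se(0.82, n) ** 2 + score_se(0.78, n) ** 2)
    print(f"{n} tasks: observed gap 0.04, SE of the gap ~ {se_gap:.3f}")

At 20 tasks the standard error of the gap is roughly three times the observed gap; rubric-weighted continuous scores behave somewhat better than this pass/fail approximation, but the order of magnitude is the point.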

9.5 LLM-as-judge with calibration

Manual scoring of evaluation outputs scales poorly. For a 20-task evaluation suite run against five models, that is 100 outputs to grade, which is feasible. For a continuous evaluation that runs nightly on a larger suite and tracks degradation over time, manual grading becomes impossible. The pattern that has emerged is LLM-as-judge: a separate model grades each output against the rubric.

The technique (Zheng et al., 2023) works surprisingly well when configured properly:

Pairwise comparison (which output is better, A or B?) is more reliable than absolute scoring (rate this output 1–10). The judge has a clearer task, and a relative judgement sidesteps the problem of anchoring an absolute scale.

Rubric in the prompt is essential. The judge needs the explicit rubric and the gold standard (if available) in context to produce defensible judgements. ‘Rate this output’ without a rubric produces noisy, hard-to-interpret scores.

Calibration against human raters is non-optional. The judge model is itself an imperfect rater; the agreement between judge and human expert determines how much trust to place in the automated score. Run the judge against 20–50 outputs that humans have rated; compute the agreement (kappa for categorical, ICC for continuous). Below 0.6 kappa, the judge is too noisy to use.
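
A minimal calibration check, assuming judge and human assign the same categorical grades (pass / partial / fail) to a shared sample of outputs; the sample data below are hypothetical:

from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical calibration sample: 20 outputs graded by a human expert and by the judge model.
human = ["pass"] * 11 + ["partial"] * 6 + ["fail"] * 3
judge = ["pass"] * 12 + ["partial"] * 4 + ["fail"] * 4
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f} ({'usable' if kappa >= 0.6 else 'too noisy to rely on'})")

For continuous rubric scores, the analogue is an intraclass correlation rather than kappa; the workflow is the same.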

A working LLM-as-judge prompt:

You are evaluating an AI-generated response to a
research question. The question, the
reference answer, and the AI response are below.

Question: [question]

Reference answer (gold standard from senior
researcher): [gold answer]

AI response: [AI output]

Rubric:
- Statistical correctness (40%): does the response
  correctly apply statistical principles? Errors in
  this category are most serious.
- Code correctness (30%): if code is included, does
  it run correctly and produce the right output?
- Clarity and explanation (20%): is the response
  clear and well-structured?
- Completeness (10%): does the response address all
  parts of the question?

Return JSON:
{
  "statistical_correctness": <0-100>,
  "code_correctness": <0-100>,
  "clarity": <0-100>,
  "completeness": <0-100>,
  "weighted_total": <0-100>,
  "rationale": "specific issues identified"
}

The structured output makes aggregation and tracking straightforward; the rationale field is what makes human spot-checks feasible (read the rationale, judge whether the score is right).
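
One practical guard when consuming that JSON, sketched below under the rubric weights above: recompute the weighted total from the component scores rather than trusting the judge’s arithmetic, and treat unparseable output as a failed judging run rather than a zero score.

import json

RUBRIC_WEIGHTS = {"statistical_correctness": 0.4, "code_correctness": 0.3,
                  "clarity": 0.2, "completeness": 0.1}

def parse_judge_response(raw: str) -> dict | None:
    """Parse the judge's JSON; return None (and re-run the judge) if it is malformed."""
    try:
        scores = json.loads(raw)
        recomputed = sum(w * float(scores[k]) for k, w in RUBRIC_WEIGHTS.items())
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    scores["weighted_total"] = round(recomputed, 1)   # overwrite the judge's own sum
    return scores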

Position bias. When comparing two outputs (A vs B), LLM judges tend to prefer whichever appears first or whichever is longer. Mitigate by running both orderings and averaging, or by randomising order.
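
A sketch of the both-orderings mitigation; judge_pair stands in for whatever call returns the judge’s preferred label (‘A’ or ‘B’) and is hypothetical here:

def judge_both_orders(output_1: str, output_2: str, judge_pair) -> str:
    """Run a pairwise judge in both presentation orders; count a win only when they agree."""
    first = judge_pair(a=output_1, b=output_2)    # output_1 shown in position A
    second = judge_pair(a=output_2, b=output_1)   # order swapped
    if first == "A" and second == "B":
        return "output_1"
    if first == "B" and second == "A":
        return "output_2"
    return "tie"   # the judge followed position, not content; treat as a tie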

Self-preference. A model used as judge may rate its own family’s outputs more favourably. Use a different model family as judge when possible; if using a same-family judge, calibrate against human raters carefully.

9.6 Biomedical benchmarks: uses and abuses

Public biomedical benchmarks have a useful role and a characteristic abuse pattern.

Useful role. They provide rough sorting across models on known tasks. MedQA (Jin et al., 2021) (USMLE-style questions), MMLU-Med (Hendrycks et al., 2021) (medical-knowledge subset of MMLU), and benchmarks from the BIG-bench and HELM families inform ‘is this model in the ballpark?’ for medical tasks. A model scoring 30% on MedQA is unlikely to be useful for medical work; a model scoring 90% is plausible.

Abuse pattern. Treating benchmark performance as deployment performance. Three failure modes:

  • Contamination. Benchmarks leak into training data. Many widely-used benchmarks exist in training corpora as natural text; models score high partly through memorisation. The MedQA contamination problem has been documented; recent leaderboard scores are partially inflated by this.
  • Distribution mismatch. Benchmarks reflect a specific test distribution; deployment tasks often differ. A USMLE-trained model can score high on MedQA while struggling on real clinical notes that differ in style and content.
  • Optimisation pressure. Model providers optimise (consciously or via reinforcement learning) for benchmark performance. The result is models that perform well on the benchmark metric but worse on adjacent tasks the benchmark does not measure.

Med-PaLM (Singhal et al., 2023; Singhal et al., 2025) achieves expert-level performance on MedQA. The practical implication is narrower than the headline: Med-PaLM-class models are good at USMLE-style medical-knowledge questions; whether they are good at the team’s specific work remains a separate question answered by the team’s evaluation suite.

A useful disposition: read public benchmarks for filtering, not for selection. They tell you which models are in the ballpark for your domain; your evaluation suite does the selection.

9.7 Benchmark saturation

A benchmark stops being a useful signal when the frontier models have crowded against its ceiling. SWE-Bench Verified, a software-engineering benchmark that was load-bearing for coding-model comparison through 2024 and most of 2025, illustrates the arc clearly. Successive frontier model releases (Claude Opus 4.5, Opus 4.6, Opus 4.7) registered near-identical scores on the benchmark while practitioners reported clear differences in everyday use (Raschka, 2026). OpenAI’s own 2026 retrospective (OpenAI, 2026), Why SWE-bench Verified no longer measures frontier coding capabilities, documented that a substantial share of the remaining unsolved tasks are unsolvable for reasons that have nothing to do with model capability (broken tests, ambiguous specifications, missing dependencies). A benchmark that began as a useful filter ended its life cycle with a leaderboard whose ordering no longer tracks the capability it was meant to measure.

The pattern generalises:

  • Benchmarks are time-limited assets. They have a useful filtering range, then they saturate. The saturation point is when frontier scores cluster within sampling noise. Past that point, the benchmark can still show that frontier models are comparable, but it can no longer rank them.
  • The right response to saturation is retirement, not interpretation. Reading nuance into a saturated benchmark to distinguish frontier models is a category error. The benchmark has done its job; the discrimination must move to a harder benchmark or to a deployment-grade evaluation suite.
  • Medical benchmarks are not exempt. MedQA performance has approached or crossed the contamination-and-saturation threshold for several frontier models. A 2026 leaderboard score of 92% vs 91% on MedQA is not a meaningful capability signal even when both numbers are computed faithfully.

The implication for the researcher is operational: maintain a list of which public benchmarks you trust for filtering, and keep it current. A benchmark on which all the candidate models score within two points of each other has stopped being useful for that selection.
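
The ‘92% vs 91%’ point is easy to make concrete. Assuming a benchmark of roughly 1,300 questions (about the size of the MedQA test set) and treating each question as an independent trial, the gap is around one standard error, well short of any conventional significance threshold:

import math

def z_for_accuracy_gap(p1: float, p2: float, n_questions: int) -> float:
    """Approximate z-statistic for the gap between two accuracy scores on an n-question benchmark."""
    se = math.sqrt(p1 * (1 - p1) / n_questions + p2 * (1 - p2) / n_questions)
    return (p1 - p2) / se

# 0.92 vs 0.91 on ~1,300 questions: z comes out around 0.9, before any adjustment for
# contamination. (A paired analysis on the shared questions would be more precise; this is
# the rough unpaired calculation.)
print(f"z = {z_for_accuracy_gap(0.92, 0.91, 1300):.2f}")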

9.8 From benchmarks to evaluation environments

The deeper limitation of static benchmarks is that they evaluate isolated decisions. A clinician’s work is not a sequence of isolated puzzles. It is a stream of decisions made under evolving constraints, with finite resources, where each choice reshapes the available options for the next choice. A benchmark’s scoring of ‘did the model produce the right answer to this question?’ tells the researcher little about ‘will the model behave usefully across a 12-hour shift in a busy emergency department?’

The 2026 Clinical Environment Simulator (CES) framework of Luo and colleagues (Luo et al., 2026) is a working articulation of the alternative. Borrowing from aviation, the authors note: ‘Just as aviation replaced written examinations with comprehensive flight simulators to better capture the complexity of real flying, clinical AI evaluation should similarly move toward simulated hospital environments that reflect the true demands of medical practice.’ The CES framework couples a hospital engine (real-time bed availability, staff workload, equipment status) with a patient engine (disease progression that responds to AI interventions). The two engines create feedback loops where every clinical decision reshapes future options.

Three capabilities surface under this paradigm that no static benchmark measures:

  • Temporal reasoning under evolving constraints. Diagnostic delays directly cause patient deterioration; a model that produces a correct diagnosis 30 minutes too late is not equivalent to one that produces it on time.
  • Resource-aware decision-making. Aggressive workups for one patient may exhaust capacity needed by others. A useful clinical AI must reason about its own demand on shared resources.
  • Operational resilience. Performance under simultaneous emergencies and system failures is a separate axis from average-case accuracy.

Two design implications follow.

First, the researcher’s evaluation suite should include at least some tasks that are temporally extended: not ‘answer this question’ but ‘manage this case across multiple decision points where each choice affects the next state’. For research analytics, the analogue is end-to-end tasks (analyse this dataset from data dictionary to final report, with intermediate decision points) rather than single-step tasks (write the regression code).

Second, the researcher should expect the benchmark landscape to shift in this direction. Medical AI evaluation in 2026 is mid-transition; the publication of CES, of similar simulator-based proposals across radiology, oncology, and emergency medicine, and of the Bean and colleagues (Bean et al., 2026) result that benchmarks and simulations failed to predict real-world failures all point to a paradigm shift. Teams that build their evaluation infrastructure only for static benchmarks will find that infrastructure increasingly inadequate.

9.9 Continuous evaluation

A model is not evaluated once. It is evaluated continuously, because:

  • The model itself can change (provider-side updates).
  • The team’s tasks change as projects evolve.
  • New failure modes are discovered in production.

A continuous-evaluation harness has roughly five components:

  1. Task suite stored in a structured format (YAML, JSON) with tasks, rubrics, and gold answers.
  2. Runner that executes each task against the current model and stores outputs.
  3. Judge (LLM-as-judge with calibration) that scores each output against the rubric.
  4. Tracker that stores scores over time, computes trends, and alerts on degradation.
  5. Spot-check workflow that surfaces a sample of outputs for human review periodically.

For a research-analytics team, the harness might be a weekly cron that runs the team’s 30-task suite against the team’s primary model, posts the scores to a dashboard, and alerts if a domain (statistical correctness, code correctness) drops by more than 5 percentage points week over week.
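
A minimal sketch of the tracker’s alerting step (component 4), assuming weekly domain-level scores are stored as a list of dicts; the field names and the 5-point threshold are illustrative:

def degradation_alerts(history: list[dict], threshold: float = 0.05) -> list[str]:
    """Flag rubric domains whose score dropped by more than `threshold` since the previous run."""
    if len(history) < 2:
        return []
    previous, latest = history[-2], history[-1]
    return [
        f"{domain}: {previous[domain]:.2f} -> {latest[domain]:.2f}"
        for domain in latest
        if domain in previous and previous[domain] - latest[domain] > threshold
    ]

# Hypothetical weekly runs of the team's suite against its primary model.
history = [
    {"statistical_correctness": 0.91, "code_correctness": 0.94, "clarity": 0.88},
    {"statistical_correctness": 0.90, "code_correctness": 0.87, "clarity": 0.89},
]
print(degradation_alerts(history))   # ['code_correctness: 0.94 -> 0.87']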

The investment is real (the suite is a sustained asset, not a one-time build) but the payoff is specific and directly informative. Teams that have built such harnesses can answer ‘is the new model release better for our work’ in a day; teams that have not are stuck with anecdotes.

9.10 Worked example: building an evaluation suite

A research-analytics team wants to standardise on an LLM for routine code-generation, analytical drafting, and RoB assessment. The team’s options are several frontier models; the question is which to adopt and when to reconsider.

Step 1: catalogue the team’s actual work. The team lists what they currently use AI assistance for, or want to use it for:

  • Generate dplyr and data.table code from data-dictionary descriptions.
  • Generate SAS PROC FREQ / PROC MEANS code for baseline tables.
  • Draft methods sections from analysis plans.
  • Generate forest-plot code from extracted hazard ratios.
  • Score papers against systematic-review inclusion criteria.
  • Critique a proposed analysis plan for missing-data handling.
  • Generate PROC LIFETEST code from survival data specifications.

Step 2: build the task suite. For each kind of work, the team creates 2–4 representative tasks. Total: 22 tasks. For each, a senior researcher writes:

  • The task input (the data dictionary, the analysis plan, etc.).
  • The expected output or rubric for evaluation.
  • The acceptance criteria.

Step 3: build the LLM-as-judge. The team uses a different model family as judge; the rubric is the team-specified rubric for each task. The judge is calibrated against the senior researcher’s ratings on 20 sample outputs (tasks rated by both model and human). Kappa is 0.71, acceptable.

Step 4: run the suite against candidate models. The team evaluates four candidate models. Total cost: ~$15 per model per run. Scores:

Model            Statistical correctness   Code correctness   Clarity   Completeness   Total
Frontier A       0.91                       0.94               0.88      0.86           0.91
Frontier B       0.88                       0.90               0.92      0.90           0.89
Mid-tier C       0.79                       0.84               0.83      0.80           0.81
Open weights D   0.73                       0.78               0.80      0.78           0.76

Step 5: decision. Frontier A is the headline winner on statistical and code correctness; B is more readable and complete. The team picks A as the default and B as the alternative for cases where readability matters. Total decision time: 1 day, including evaluation overhead.

Step 6: ongoing tracking. The harness runs weekly. Three months later, a model update produces a 4-point drop in code correctness on a subset of tasks involving SAS macros; the team flags the issue, files a support ticket, and switches to model B for SAS-specific work until the issue is resolved. Without the harness, the team would have noticed the issue only through accumulated production failures.

9.11 Collaborating with an LLM on AI evaluation

Three prompt patterns illustrate working with LLMs on evaluation tasks.

Prompt 1: ‘Write a rubric for evaluating this kind of research task.’ Provide the task description.

What to watch for. The LLM produces a competent rubric. It tends to over-weight clarity and under-weight correctness: readable wrong code scores higher than it should. Push back on the weights.

Verification. Test the rubric on a few sample outputs yourself. If the rubric scores match your gut judgement, it is calibrated; if not, refine.

Prompt 2: ‘Audit my evaluation suite for blind spots.’ Provide the task list and ask the LLM to identify common applied analytic work that is not represented.

What to watch for. The LLM is reasonably good at identifying gaps but tends to suggest generic ones. Ask follow-up: ‘what specific research task might my team encounter that this suite would underweight?’

Verification. Add tasks that pass your own gut check of relevance. The suite is your asset; you decide what is in it.

Prompt 3: ‘Score this output against this rubric.’ Provide the output and the rubric.

What to watch for. LLM-as-judge is good at this when the rubric is well-specified. Watch for position bias (in pairwise comparison) and self-preference (when judging same-family outputs). Calibrate.

Verification. Periodic human spot-check. The correlation between LLM-judge and human rater is the useful number; track it over time.

The meta-pattern: evaluation is a discipline, not a benchmark. The team that maintains a custom evaluation suite has a measurable basis for AI-related decisions. The team that relies on public benchmarks has anecdotes.

9.12 Principle in use

Three habits define defensible work in this area:

  1. Build a custom suite for your team’s work. Don’t rely on public benchmarks for selection. Public benchmarks filter; the custom suite selects.

  2. Calibrate LLM-as-judge against human ratings. The agreement between judge and human is the confidence you can place in the automated scores. Below 0.6 kappa, the judge is too noisy to substitute for human review.

  3. Track evaluation scores over time, not just at selection. Models change; tasks change; failure modes evolve. Continuous evaluation catches degradation before users do.

9.13 Exercises

  1. Catalogue the routine AI-assisted work in your team. Write 10 representative tasks with rubrics. Run them against your team’s primary model and document the scores.

  2. Implement an LLM-as-judge for one of your tasks. Calibrate against your own ratings on 20 outputs. Compute kappa. Document.

  3. Take a public biomedical benchmark and a sample of tasks from your custom suite. Run a model on both; compare the scores. Reflect on what the benchmark does and does not predict about your suite.

  4. Set up a weekly evaluation harness for your team. Track scores for 4 weeks. Reflect on whether the variance is signal or noise.

  5. Take an evaluation result you previously accepted and audit it more deeply: read 10 random outputs alongside your own ratings and the LLM judge’s ratings. Identify any systematic biases in the judge.

9.14 Further reading

  • Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. The reference paper for the LLM-as-judge methodology.
  • Jin et al. (2021), What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. The MedQA reference.
  • Hendrycks et al. (2021), Measuring Massive Multitask Language Understanding. The MMLU reference.
  • Singhal et al. (2023), Large Language Models Encode Clinical Knowledge. The Med-PaLM reference, including discussion of biomedical-benchmark performance.
  • Mollick (2025), Giving your AI a Job Interview. The framing for organisation-specific evaluation.