---
title: Designing effective model-as-judge evaluators
framework: evaluations
role: article
role_heading: Article
path: evaluations/designing-effective-model-judges
---

# Designing effective model-as-judge evaluators

Configure model-as-judge evaluators to produce scores that correlate with human review.

## Overview

When code can’t capture a quality criterion, for example whether an explanation is clear or a tone is appropriate, a `ModelJudgeEvaluator` lets you use a language model to score the output of another model. How well the resulting scores reflect real quality depends on how you configure the evaluator: the scoring levels, the criteria, and the evaluation steps. Careful configuration produces scores that correlate closely with human judgment; rushed or vague configuration produces scores that look precise but measure the wrong things.

This article covers the strategic decisions behind model-as-judge configuration: choosing the scoring scale, writing observable scoring levels, defining independent criteria, using evaluation steps, accounting for systematic biases, calibrating with human judgment, debugging through rationales, and reference-guided judging.

For guidance on when to use a model as judge or a code-based evaluator, see Designing specific, measurable criteria in an evaluation suite. For implementation details, see [Scoring with model-as-judge evaluators](evaluations/scoring-with-model-as-judge-evaluators.md).

### Choose the right scoring scale

The scale you choose determines both the granularity of your signal (the pattern of results) and the reliability of your scores. Finer scales provide more detail but are harder for the model as judge to use consistently.

- **Start with binary scales for binary judgments.** If a dimension is truly pass or fail, for example, safe or unsafe, on-topic or off-topic, constraint met or violated, use a binary scale. Forcing a multi-point judgment on a binary dimension adds noise without adding signal, because the judge clusters around the middle.
- **Use a small, even number of levels for subjective quality.** Fewer levels balance granularity with reliability, and an even number removes the noncommittal middle the model as judge can otherwise default to. A 1–4 scale gives the judge four distinguishable options and forces it to commit to either side of the midpoint.
Wider scales suffer from score compression: the model as judge clusters scores in a narrow range regardless of actual quality differences.

> Tip: Prefer an even number of levels. With no middle value, the model as judge cannot fall back on a noncommittal “average” score, and each response gets a verdict on which side of the midpoint it sits. See Leniency bias below for more on score compression.

In Evaluations, `ScoringScale` provides factory methods that match these recommendations: use `.passFail()` for binary judgments and `.numeric()` for scored scales where each level maps to a description of observable features.

```swift
// A binary scale for a safety check.
let safetyScale = ScoringScale.passFail(
    passDescription: "Output is safe and on-topic.",
    failDescription: "Output contains unsafe content or goes off-topic."
)

// A four-point scale for haiku quality.
let haikuScale = ScoringScale.numeric([
    4: "Perfect 5-7-5 form, strongly relevant to the topic, uses vivid imagery.",
    3: "Correct or near-correct form, clearly relevant, some evocative language.",
    2: "Incorrect syllable count, weak topic connection, or lacks poetic quality.",
    1: "Not recognizable as a haiku, off-topic, or incoherent.",
])
```

### Write scoring levels that describe observable features

A common weakness in model-as-judge evaluators is creating scoring levels that differ only in degree, for example, “good” versus “very good”, rather than in observable characteristics. The model as judge cannot reliably distinguish between “excellent quality” and “good quality.” It can, however, distinguish between “all key points addressed with supporting evidence” and “most key points addressed but missing supporting detail.”

Each of your scoring levels needs to describe what a response at that level looks like, not how good it is:

- **Describe features, not feelings.** “Perfect AABBA rhyme, strong meter, and a surprising punchline” is actionable. “Excellent limerick” is not.
- **Make adjacent levels distinguishable.** The difference between a 3 and a 4 needs to be specific enough that two independent reviewers would agree. If two levels blur together, either merge them or sharpen the descriptions.
- **Anchor every level equally.** Invest as much effort describing the middle of the scale as the extremes. Vague middle levels cause the judge to default to them.
- **Describe boundary cases.** Explain what pushes a response from one level to the next: “A response with correct structure but rough execution is a 3, not a 4.”

### Define independent criteria

Each criterion you ask the model as judge to evaluate needs to measure one distinct aspect of quality.
Bundling multiple dimensions into a single criterion, for example, “the response is accurate, well-written, and helpful”, makes it impossible for the model as judge to score a response that excels on one dimension but fails on another.

To check independence, consider whether a response can score high on criterion A but low on criterion B. If not, they likely overlap and you need to merge or sharpen them. In Evaluations, `ScoreDimension` captures this principle: each dimension has its own name, description, and `ScoringScale`, keeping criteria separate.

Keep the set of criteria manageable. Typically, scoring three or four dimensions in a single model-as-judge call maintains quality comparable to single-dimension calls. Beyond four or five dimensions, attention decay causes the judge to give less consideration to later criteria. Group related dimensions together, but keep independent concerns, for example, safety versus quality, in separate evaluator calls.

### Use evaluation steps to improve scoring

Asking the model as judge to reason through each criterion step by step before it assigns a score is a key improvement you can make to your scoring accuracy. Chain-of-thought evaluation, where the model explains its reasoning as it scores, improves correlation with human judgment because it forces the model as judge to engage with the content of the response rather than defaulting to shallow heuristics.

Structure your evaluation steps to mirror how an expert annotator typically approaches evaluating the same content:

1. Examine each criterion independently, with specific guidance on what to look for.
2. End with a synthesis step that weighs all criteria together before assigning a score.

In Evaluations, include these steps in the instructions you provide to `ModelJudgeEvaluator`. Without evaluation steps, the model as judge can form an immediate assessment based on surface features like length or formatting, then assign a score that matches that first impression.
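As a rough illustration of the two-part structure above, the following sketch assembles evaluation steps into a plain instruction string. The `Criterion` type and `makeJudgeInstructions(for:)` function are hypothetical helpers invented for this example, not part of the Evaluations framework, and how you pass the resulting string to `ModelJudgeEvaluator` depends on the framework's actual API.

```swift
// Hypothetical helper type for this sketch; not an Evaluations framework type.
struct Criterion {
    let name: String
    let guidance: String  // What the judge should look for when examining this criterion.
}

// Builds one step per criterion, followed by a final synthesis step.
func makeJudgeInstructions(for criteria: [Criterion]) -> String {
    var lines = ["Evaluate the response by working through these steps in order."]
    for (index, criterion) in criteria.enumerated() {
        lines.append("Step \(index + 1): Examine \(criterion.name) on its own. \(criterion.guidance)")
    }
    lines.append(
        "Step \(criteria.count + 1): Weigh your findings for all criteria together, "
            + "explain your reasoning, and only then assign a score."
    )
    return lines.joined(separator: "\n")
}

// Example criteria for the haiku scale; the wording is illustrative.
let instructions = makeJudgeInstructions(for: [
    Criterion(name: "form", guidance: "Count the syllables in each line and compare them against 5-7-5."),
    Criterion(name: "relevance", guidance: "Identify the topic and check that each line connects to it."),
])
print(instructions)
```

Generating the steps from the criteria keeps the two in sync: adding a criterion adds a step, and the synthesis step always comes last.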
### Account for model-as-judge biases

A model as judge can have systematic biases that you need to account for in your evaluation design. Awareness of these biases helps you interpret results accurately and design mitigations where needed.

### Calibrate with human judgment

How you configure your model as judge affects the validity of its scores. The most effective way to validate your configuration is to compare the scores from the model as judge against human scores on a shared calibration set:

1. Have two or three human annotators score 20–50 responses using your scoring criteria and levels.
2. Run the model as judge on the same set and compare.
3. Where the model as judge systematically disagrees with humans, refine your criteria, scoring levels, or instructions. You can also add evaluation steps or few-shot examples (a small number of worked examples that demonstrate the desired scoring) to the instructions for the model as judge.
4. Measure agreement with a metric like Cohen’s kappa or another inter-rater reliability measure, rather than raw agreement rate, which can be misleading when score distributions are imbalanced.
5. Repeat until agreement between the model as judge and humans is comparable to human-human agreement.

This workflow requires no code changes, only configuration changes to criteria, scoring levels, evaluation steps, and examples. After calibration, the model as judge can run at scale and you can be more confident that its scores reflect the same quality standard your human reviewers applied.

### Debug scores with rationales

When you use a `ModelJudgeEvaluator`, the model as judge produces a written rationale alongside every score. A single rationale tells you about one sample. Reviewing multiple rationales reveals patterns in how the model as judge interprets your instructions, applies your scale, and weighs your criteria.
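One place to start reading rationales is the calibration set, where you already know which scores disagree with human review. The sketch below assumes a hypothetical `CalibrationSample` record that you assemble yourself from human scores and evaluator output; it is not an Evaluations framework type, and the sample data is made up. It computes unweighted Cohen’s kappa, defined as (observed agreement − chance agreement) / (1 − chance agreement), and lists the disagreeing samples so you can read their rationales first.

```swift
// Hypothetical record pairing a human score with the judge's score and rationale.
struct CalibrationSample {
    let id: String
    let humanScore: Int
    let judgeScore: Int
    let rationale: String
}

// Unweighted Cohen's kappa for two raters over the same samples.
// Returns nil when there are no samples or chance agreement is already 1.
func cohensKappa(_ samples: [CalibrationSample]) -> Double? {
    guard !samples.isEmpty else { return nil }
    let n = Double(samples.count)

    // Observed agreement: fraction of samples with identical scores.
    let observed = Double(samples.filter { $0.humanScore == $0.judgeScore }.count) / n

    // Chance agreement: for each level, the product of the two raters' usage rates.
    var humanCounts: [Int: Double] = [:]
    var judgeCounts: [Int: Double] = [:]
    for sample in samples {
        humanCounts[sample.humanScore, default: 0] += 1
        judgeCounts[sample.judgeScore, default: 0] += 1
    }
    var expected = 0.0
    for (level, humanCount) in humanCounts {
        expected += (humanCount / n) * ((judgeCounts[level] ?? 0) / n)
    }

    guard expected < 1 else { return nil }
    return (observed - expected) / (1 - expected)
}

// Samples where the scores differ, largest gap first.
func disagreements(in samples: [CalibrationSample]) -> [CalibrationSample] {
    samples
        .filter { $0.humanScore != $0.judgeScore }
        .sorted { abs($0.humanScore - $0.judgeScore) > abs($1.humanScore - $1.judgeScore) }
}

// Toy data for illustration only.
let samples = [
    CalibrationSample(id: "a", humanScore: 1, judgeScore: 1, rationale: "Off-topic and incoherent."),
    CalibrationSample(id: "b", humanScore: 1, judgeScore: 2, rationale: "Some topic connection."),
    CalibrationSample(id: "c", humanScore: 2, judgeScore: 2, rationale: "Wrong syllable count."),
    CalibrationSample(id: "d", humanScore: 2, judgeScore: 2, rationale: "Weak imagery."),
]
let kappa = cohensKappa(samples)  // 0.5 for this toy data
for sample in disagreements(in: samples) {
    print(sample.id, sample.humanScore, sample.judgeScore, sample.rationale)
}
```

This variant treats every disagreement the same. For ordinal scales such as 1–4, a weighted kappa that penalizes larger gaps more heavily may be a better fit.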
Look for recurring patterns across rationales when scores don’t match your expectations.

When rationales reveal a problem, adjust the corresponding part of the evaluator configuration: rewrite ambiguous criteria, sharpen the boundaries between score levels, reorder evaluation steps, or add scored examples at the levels where the model as judge is inconsistent. Then rerun the evaluation and read the rationales again. This cycle of reading rationales, adjusting the configuration, and rerunning is how you bring the scoring of the model as judge in line with your expectations.

### Use reference-guided judging for verifiable tasks

When evaluating tasks with objectively correct answers, for example, math, factual recall, code correctness, or tool selection, provide the model as judge with the expected answer, creating a reference-guided approach. Model-as-judge evaluators are susceptible to answer contamination, where an incorrect answer in the evaluation context causes the model as judge to copy the wrong reasoning into its own chain of thought, even when the model can solve the problem independently. Reference-guided judging reduces this failure rate substantially.
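A reference-guided prompt might be assembled along the following lines. This is a minimal sketch: `VerifiableSample` and `referenceGuidedPrompt(for:)` are hypothetical names invented for illustration, and the prompt wording is one possible phrasing rather than a format the Evaluations framework prescribes.

```swift
// Hypothetical container for one verifiable task; not an Evaluations framework type.
struct VerifiableSample {
    let question: String
    let expectedAnswer: String
    let modelAnswer: String
}

// Places the reference answer ahead of the response so the judge compares
// against it instead of following the response's own reasoning.
func referenceGuidedPrompt(for sample: VerifiableSample) -> String {
    return """
    Question:
    \(sample.question)

    Reference answer (treat this as correct):
    \(sample.expectedAnswer)

    Response to evaluate:
    \(sample.modelAnswer)

    Compare the response to the reference answer.
    Explain any differences, then answer pass or fail.
    """
}

let prompt = referenceGuidedPrompt(
    for: VerifiableSample(question: "What is 6 * 7?", expectedAnswer: "42", modelAnswer: "41")
)
print(prompt)
```

Labeling the reference as correct and asking for a comparison, rather than a fresh solution, is intended to keep an incorrect response from steering the judge's reasoning.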

## See Also

### Model-as-judge evaluations

- [Scoring with model-as-judge evaluators](evaluations/scoring-with-model-as-judge-evaluators.md)
- [ModelJudgeEvaluator](evaluations/modeljudgeevaluator.md)
- [ModelJudgePrompt](evaluations/modeljudgeprompt.md)
- [ScoreDimension](evaluations/scoredimension.md)