Designing specific, measurable criteria in an evaluation suite

Define quality for your feature by choosing measurable criteria, scoring approaches, and ground-truth strategies.

Overview

Deciding what to measure means defining success for your feature: criteria are your measurable standards of quality. Designing effective evaluations explains why evaluation matters and how its life cycle works; once you understand that process, your next step is defining your criteria. Vague criteria produce vague signals, but precise criteria tell you specifically where your feature succeeds and where it fails.

This article details the strategic decisions behind evaluation design: defining measurable targets, identifying quality dimensions, choosing between scoring approaches, and selecting the right ground-truth strategy, where ground truth means verified, accurate results. Most features combine several criteria into an evaluation suite, mixing rule-based checks, scored quality dimensions, and model-as-judge assessments to cover what a single approach can’t.

Define success as measurable targets

Before writing any evaluation code, identify what success looks like for your feature. Every quality goal needs to become a specific, measurable target that code can verify.

Consider what your feature needs to accomplish and what might go wrong. Each goal becomes an individual criterion with a clear measurement:

Goal	Criterion	Measurement
Responses respect constraints.	Output stays within stated limits.	Pass or fail per constraint
Output is factually grounded.	Claims are supported by provided context.	Score 1–4 on factual scale
Feature handles diverse inputs.	Quality is consistent across input categories.	Standard deviation of category mean scores
Includes no harmful content.	Output avoids unsafe or inappropriate material.	Pass if no safety violations detected
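
The consistency measurement in the table above can be sketched in plain Swift. This is a minimal illustration, not part of any framework: the `ScoredSample` type and both function names are hypothetical, and it assumes you have already collected one numeric score per sample.

```swift
// Hypothetical record type for illustration; not part of any framework.
struct ScoredSample {
    let category: String
    let score: Double
}

/// Returns the mean score for each input category.
func categoryMeans(_ samples: [ScoredSample]) -> [String: Double] {
    let grouped = Dictionary(grouping: samples, by: { $0.category })
    return grouped.mapValues { group in
        group.map(\.score).reduce(0, +) / Double(group.count)
    }
}

/// Population standard deviation of the category means.
/// A lower value means quality is more consistent across categories.
func consistencySpread(_ samples: [ScoredSample]) -> Double {
    let means = Array(categoryMeans(samples).values)
    guard !means.isEmpty else { return 0 }
    let overall = means.reduce(0, +) / Double(means.count)
    let variance = means
        .map { ($0 - overall) * ($0 - overall) }
        .reduce(0, +) / Double(means.count)
    return variance.squareRoot()
}

// Example: memos average 4.0 and reports average 2.0.
let samples = [
    ScoredSample(category: "memo", score: 4),
    ScoredSample(category: "memo", score: 4),
    ScoredSample(category: "report", score: 2),
    ScoredSample(category: "report", score: 2),
]
let spread = consistencySpread(samples)
```

Taking the deviation over category means, rather than over all samples, keeps a large category from masking a small one that performs poorly.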

Start with the criteria that matter most to your users. You can always add more dimensions later as you begin to understand the points of failure in your feature.

Identify quality dimensions

A single feature typically needs to satisfy several independent quality dimensions. Consider each of these, with at least one criterion per dimension that matters to your feature:

Task fidelity: The response correctly accomplishes what it’s supposed to. For example, a summarizer captures the key points or a planner stays within budget.
Consistency: The model produces similar quality across varied inputs. For example, a summarizer produces equally useful output for a two-sentence memo and a ten-page report.
Tone and style: The output matches your app’s voice. For example, a professional email is actually professional or instructions are at the right level for the audience.
Safety: The model avoids harmful or inappropriate content. For example, it refuses adversarial prompts gracefully.
Privacy preservation: The model avoids revealing private data or information outside its intended scope. For example, a customer-support assistant answers questions about the current user’s orders without surfacing details from other customers.
Latency and cost: The model responds within acceptable time and token budgets for your use case. For example, responses don’t take so long or use so many tokens that they slow down your feature.

Not every dimension applies to every feature. Choose the dimensions that represent real risk or real value for people using your feature, and assign at least one evaluator to each.
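
As one illustration, the latency and cost dimension reduces to a simple threshold check. The sketch below uses plain Swift and assumes your own harness records timing and token counts; `ResponseMetrics`, `Budget`, and the limits shown are hypothetical names and values.

```swift
// Hypothetical measurements captured by your own test harness.
struct ResponseMetrics {
    let latencySeconds: Double
    let outputTokens: Int
}

// Hypothetical budget; choose limits that fit your use case.
struct Budget {
    let maxLatencySeconds: Double
    let maxOutputTokens: Int

    /// Returns a description of each budget violation.
    /// An empty array means the response is within budget.
    func violations(for metrics: ResponseMetrics) -> [String] {
        var problems: [String] = []
        if metrics.latencySeconds > maxLatencySeconds {
            problems.append("latency \(metrics.latencySeconds)s exceeds \(maxLatencySeconds)s")
        }
        if metrics.outputTokens > maxOutputTokens {
            problems.append("\(metrics.outputTokens) tokens exceeds \(maxOutputTokens)")
        }
        return problems
    }
}

let budget = Budget(maxLatencySeconds: 2.0, maxOutputTokens: 300)
let fast = budget.violations(for: ResponseMetrics(latencySeconds: 0.8, outputTokens: 120))
let slow = budget.violations(for: ResponseMetrics(latencySeconds: 3.5, outputTokens: 450))
```

Returning the list of violations instead of a single Boolean shows which budget a failing response exceeded.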

Choose the right scoring approach

Evaluations give you two main automated mechanisms: code-based evaluators and model-as-judge evaluators. Human review complements both. The right approach depends on how you define correctness for each criterion. Start with the simplest approach that gives reliable signals, and move to more sophisticated methods only when needed:

Approach	Use when	Speed	Cost	Reproducibility
Code-based (Evaluator)	Correctness has a computable definition, for example, exact match, schema validation, or range check.	Instant	Free	Perfect
Model as judge (ModelJudgeEvaluator)	Quality is subjective or requires reasoning, for example, helpfulness, tone, or accuracy of classification.	Seconds	Inference cost	High with well-defined scoring levels
Human review	High stakes, calibrating model-as-judge evaluators, or discovering gaps in your scoring dimensions.	Minutes or more	Expensive	Variable

Use code when you can. If the criterion is computable, a code-based evaluator is faster, cheaper, and perfectly reproducible. Checking whether a response stays within a word limit is a code check. Validating that structured output conforms to a schema is a code check. Verifying that a numeric answer falls within an expected range is a code check. These never need a model as judge. For example, verifying a word limit takes just a few lines with Evaluator:

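// `wordLimit` is the criterion this evaluator reports on; it's assumed to be defined elsewhere.
Evaluator { input, subject in
    // Count words separated by any whitespace (spaces, tabs, or newlines)
    // and pass only when the response is 200 words or fewer.
    subject.value.split(whereSeparator: \.isWhitespace).count <= 200
        ? wordLimit.passing() : wordLimit.failing()
}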
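
The other two code checks mentioned above, schema validation and range checking, are just as short. The following sketch uses only Foundation’s `JSONDecoder` and doesn’t depend on the evaluation framework; the `TripPlan` type, the `isValidPlan` function, and the budget range are hypothetical.

```swift
import Foundation

// Hypothetical structured output for a trip planner feature.
struct TripPlan: Decodable {
    let destination: String
    let totalCost: Double
}

/// Returns true when the output decodes into the expected schema
/// and the total cost falls within the allowed range.
func isValidPlan(_ output: String, budget: ClosedRange<Double>) -> Bool {
    guard let data = output.data(using: .utf8),
          let plan = try? JSONDecoder().decode(TripPlan.self, from: data) else {
        return false  // Schema check failed: malformed JSON or a missing field.
    }
    return budget.contains(plan.totalCost)  // Range check.
}

let withinBudget = isValidPlan(#"{"destination": "Lisbon", "totalCost": 850}"#, budget: 0...1000)
let overBudget = isValidPlan(#"{"destination": "Lisbon", "totalCost": 1500}"#, budget: 0...1000)
let missingField = isValidPlan(#"{"destination": "Lisbon"}"#, budget: 0...1000)
```

To use a check like this in an evaluator, you would map its Boolean result to a passing or failing outcome the same way the word-limit example does.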

Use a model as judge when code can’t capture the criterion. Determining whether an explanation is clear, whether a tone is appropriate, or whether a summary captures the important points requires reasoning about language and context. This is where ModelJudgeEvaluator provides value. The quality of a model-as-judge evaluation depends on your scoring levels: make each level specific enough that two independent reviewers are likely to assign the same score.

Use humans to calibrate, not to score at scale. Human review is too slow and expensive for routine evaluation. Its value is in calibrating your automated evaluators: run a small set of human-scored samples, compare them against your model-as-judge scores, and refine your scoring levels where they disagree.
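
The calibration comparison described above can be as simple as an agreement rate plus a list of samples to review. This is a rough sketch in plain Swift: `ScorePair` and both functions are hypothetical, and it assumes humans and the judge score on the same integer scale.

```swift
// Hypothetical paired scores for the same sample on a shared scale (for example, 1–4).
struct ScorePair {
    let sampleID: String
    let human: Int
    let judge: Int
}

/// Fraction of samples where the judge's score exactly matches the human's score.
func agreementRate(_ pairs: [ScorePair]) -> Double {
    guard !pairs.isEmpty else { return 0 }
    let matches = pairs.filter { $0.human == $0.judge }.count
    return Double(matches) / Double(pairs.count)
}

/// Samples where the two scores differ by at least `threshold` levels.
/// These are the ones to read when refining your scoring levels.
func disagreements(_ pairs: [ScorePair], threshold: Int = 1) -> [ScorePair] {
    pairs.filter { abs($0.human - $0.judge) >= threshold }
}

let pairs = [
    ScorePair(sampleID: "a", human: 4, judge: 4),
    ScorePair(sampleID: "b", human: 3, judge: 3),
    ScorePair(sampleID: "c", human: 2, judge: 4),
    ScorePair(sampleID: "d", human: 1, judge: 1),
]
let rate = agreementRate(pairs)
let toReview = disagreements(pairs, threshold: 2)
```

Exact-match agreement is the simplest possible measure. If your scale has many levels, a statistic that accounts for chance agreement or near misses may be a better fit.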

For implementation details, see Evaluating language model responses for a complete walkthrough, and Scoring with model-as-judge evaluators for model-as-judge configurations.

For best practices on configuring model-as-judge evaluators, including scoring scales, bias mitigation, and calibration, see Designing effective model-as-judge evaluators.

Choose the right ground-truth strategy

Different scoring strategies relate to ground truth in different ways. Understanding this distinction helps you choose the right approach for each criterion:

Explicit ground truth: You have a predefined correct answer for each input, and you compare the model’s output against it. A math problem has one right answer. A classification task has a known label. The evaluator measures how closely the model’s output matches the expected result.
Rule-based (implicit ground truth): The rule itself is the truth. A word count check doesn’t need a golden answer for each sample because the rule “200 words or fewer” applies universally. Format checks, schema validation, and constraint verification all fall into this category.
No ground truth: For open-ended output where multiple answers are equally valid, you use a model as judge or a pre-trained classifier to assess quality. Whether an explanation is clear or a recommendation is useful has no single right answer, but a model as judge can reason about quality criteria you define.

Most evaluation suites combine all three approaches. A feature might check format compliance (rule-based), compare answers against known-correct values for a core set (explicit ground truth), and use a model as judge for subjective quality on everything else (no ground truth).
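
A combined suite like this can be sketched in plain Swift without the evaluation framework. All names below are hypothetical. The sketch applies the rule-based check to every sample, applies the exact-match check only where an expected answer exists, and leaves the rest for a model as judge.

```swift
import Foundation

// Hypothetical sample type: `expected` is present only when a known-correct answer exists.
struct Sample {
    let output: String
    let expected: String?
}

struct SampleResult {
    let formatOK: Bool
    let exactMatch: Bool?  // nil when there's no expected answer; route these to a judge.
}

/// Rule-based check: the rule itself is the truth, so no expected answer is needed.
func withinWordLimit(_ text: String, limit: Int) -> Bool {
    text.split(whereSeparator: \.isWhitespace).count <= limit
}

/// Explicit ground truth: compare against the expected answer after light normalization.
func matchesExpected(_ output: String, _ expected: String) -> Bool {
    func normalize(_ s: String) -> String {
        s.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    }
    return normalize(output) == normalize(expected)
}

func evaluate(_ sample: Sample, wordLimit: Int) -> SampleResult {
    SampleResult(
        formatOK: withinWordLimit(sample.output, limit: wordLimit),
        exactMatch: sample.expected.map { matchesExpected(sample.output, $0) }
    )
}

let known = evaluate(Sample(output: " Paris ", expected: "paris"), wordLimit: 5)
let openEnded = evaluate(Sample(output: "A clear explanation of the idea", expected: nil), wordLimit: 5)
```

The optional `exactMatch` keeps "no ground truth available" distinct from "compared and failed", so the two cases never mix in your pass rate.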