---
title: "Designing specific, measurable criteria in an evaluation suite"
framework: evaluations
role: article
role_heading: Article
path: evaluations/designing-evaluation-criteria
---

# Designing specific, measurable criteria in an evaluation suite

Define quality for your feature by choosing measurable criteria, scoring approaches, and ground-truth strategies.

## Overview

Deciding what to measure in your evaluations means defining success for your feature: criteria specify your measurable standards of quality. Designing effective evaluations explains why evaluation matters and how its life cycle works; once you understand this evaluation process, your next step is defining your criteria.

Well-defined criteria help you build strong evaluations. Vague criteria produce vague signals, but precise criteria tell you specifically where your feature succeeds and where it fails. This article details the strategic decisions behind evaluation design: defining measurable targets, identifying quality dimensions, choosing between scoring approaches, and selecting the right ground truth, or verified accurate results.

Most features combine several criteria into an evaluation suite, mixing rule-based checks, scored quality dimensions, and model-as-judge assessments to cover what a single approach can’t.

### Define success as measurable targets

Before writing any evaluation code, identify what success looks like for your feature. Every quality goal needs to become a specific, measurable target that code can verify.

Consider what your feature needs to accomplish and what might go wrong. Each goal becomes an individual criterion with a clear measurement.

Start with the criteria that matter most to your users. You can always add more dimensions later as you begin to understand the points of failure in your feature.

### Identify quality dimensions

A single feature typically needs to satisfy several independent quality dimensions. Consider each dimension in turn, with at least one criterion per dimension that matters to your feature.

Not every dimension applies to every feature. Choose the dimensions that represent real risk or real value for people using your feature, and assign at least one evaluator to each.

Tip: Start with two or three evaluators that cover your most important quality dimensions.
Add more as you learn where your feature tends to fail.

### Choose the right scoring approach

Evaluations gives you two main evaluation mechanisms: code-based evaluators and model-as-judge. The right approach depends on how you define correctness for each criterion. Start with the simplest approach that gives reliable signals, and move to more sophisticated methods only when needed.

Use code when you can. If the criterion is computable, a code-based evaluator is faster, cheaper, and perfectly reproducible. Checking whether a response stays within a word limit is a code check. Validating that structured output conforms to a schema is a code check. Verifying that a numeric answer falls within an expected range is a code check. These never need a model as judge. For example, verifying a word limit takes just a few lines with `Evaluator`:

```swift
Evaluator { input, subject in
    // Pass when the response has at most 200 space-separated words.
    // `wordLimit` is assumed to be defined elsewhere.
    subject.value.split(separator: " ").count <= 200
        ? wordLimit.passing() : wordLimit.failing()
}
```

Use a model as judge when code can’t capture the criterion. Determining whether an explanation is clear, whether a tone is appropriate, or whether a summary captures the important points requires reasoning about language and context. This is where `ModelJudgeEvaluator` provides value. The quality of a model-as-judge evaluation depends on your scoring levels: make each level specific enough that two independent reviewers are likely to assign the same score.

Use humans to calibrate, not to score at scale. Human review is too slow and expensive for routine evaluation. Its value is in calibrating your automated evaluators: run a small set of human-scored samples, compare them against your model-as-judge scores, and refine your scoring levels where they disagree.

For implementation details, see Evaluating language model responses for a complete walkthrough, and Scoring with model-as-judge evaluators for model-as-judge configurations.
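
The calibration step described above (comparing human scores against model-as-judge scores) can be sketched in plain Swift, independent of the Evaluations framework. This is a minimal sketch under stated assumptions: `ScoredSample`, `CalibrationReport`, and `calibrate` are hypothetical names introduced for illustration, and exact-match agreement is just one simple way to compare the two sets of scores.

```swift
// Hypothetical types for illustration; not part of any framework API.
struct ScoredSample {
    let id: String
    let humanScore: Int   // score assigned by a human reviewer
    let judgeScore: Int   // score assigned by the model-as-judge evaluator
}

struct CalibrationReport {
    let agreementRate: Double          // fraction of samples with identical scores
    let disagreements: [ScoredSample]  // samples to inspect when refining scoring levels
}

/// Compares human and judge scores on a small calibration set.
func calibrate(_ samples: [ScoredSample]) -> CalibrationReport {
    // An empty set carries no signal; report zero agreement rather than dividing by zero.
    guard !samples.isEmpty else {
        return CalibrationReport(agreementRate: 0, disagreements: [])
    }
    let disagreements = samples.filter { $0.humanScore != $0.judgeScore }
    let agreed = samples.count - disagreements.count
    return CalibrationReport(
        agreementRate: Double(agreed) / Double(samples.count),
        disagreements: disagreements
    )
}

let report = calibrate([
    ScoredSample(id: "a", humanScore: 4, judgeScore: 4),
    ScoredSample(id: "b", humanScore: 2, judgeScore: 4),
    ScoredSample(id: "c", humanScore: 5, judgeScore: 5),
    ScoredSample(id: "d", humanScore: 3, judgeScore: 3),
])
print(report.agreementRate)            // 0.75
print(report.disagreements.map(\.id))  // ["b"]
```

The samples listed in `disagreements` are the ones worth reading closely: they point at scoring levels that may be too loosely worded for a judge and a human to apply the same way.
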
For best practices on configuring model-as-judge evaluators, including scoring scales, bias mitigation, and calibration, see Designing effective model-as-judge evaluators.

### Choose the right ground-truth strategy

Different scoring strategies relate to ground truth in different ways. Understanding this distinction helps you choose the right approach for each criterion: rule-based checks, explicit ground truth, or no ground truth.

Most evaluation suites combine all three approaches. A feature might check format compliance (rule-based), compare answers against known-correct values for a core set (explicit ground truth), and use a model as judge for subjective quality on everything else (no ground truth).

Tip: Use a judge model that is more capable than the model being evaluated. This reduces the chance of the judge sharing the same gaps as the model under test.
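
One way to picture how the three ground-truth strategies combine is the plain-Swift sketch below. It does not use the Evaluations framework: `TestCase`, `Verdict`, `withinWordLimit`, `matchesExpected`, and `evaluate` are hypothetical names, and the 200-word limit and case-insensitive comparison are assumptions chosen for illustration.

```swift
import Foundation

// Hypothetical types for illustration; not part of any framework API.
struct TestCase {
    let response: String
    let expected: String?   // known-correct value, when one exists
}

enum Verdict: Equatable { case pass, fail, needsJudge }

/// Rule-based: needs no ground truth, only a computable rule.
func withinWordLimit(_ text: String, limit: Int) -> Bool {
    text.split(separator: " ").count <= limit
}

/// Explicit ground truth: compare against a known-correct value,
/// ignoring surrounding whitespace and letter case.
func matchesExpected(_ response: String, _ expected: String) -> Bool {
    let normalized = response.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    return normalized == expected.lowercased()
}

/// Applies the cheapest check first and defers to a judge only when
/// no rule or known-correct value can decide the case.
func evaluate(_ testCase: TestCase, wordLimit: Int = 200) -> Verdict {
    guard withinWordLimit(testCase.response, limit: wordLimit) else { return .fail }
    if let expected = testCase.expected {
        return matchesExpected(testCase.response, expected) ? .pass : .fail
    }
    // No ground truth available: subjective quality is left to a model as judge.
    return .needsJudge
}

print(evaluate(TestCase(response: " Paris ", expected: "paris")))       // pass
print(evaluate(TestCase(response: "A clear summary.", expected: nil)))  // needsJudge
```

Ordering the checks this way keeps the expensive model-as-judge step for the cases that code cannot settle.
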

## See Also

### Metrics and evaluators

- [Metric](evaluations/metric.md)
- [Evaluator](evaluations/evaluator.md)
- [MetricsAggregator](evaluations/metricsaggregator.md)