Prompt evaluation is the practice of measuring how effectively a specific prompt elicits the desired behavior from a language model. While LLM evaluation broadly assesses model capabilities, prompt evaluation zooms in on the prompt as the variable under test — holding the model constant and asking whether the instructions, examples, and constraints in the prompt produce outputs that meet quality criteria.
Evaluation criteria are defined relative to the prompt's purpose. A classification prompt might be evaluated on accuracy and label consistency. A generation prompt might be evaluated on relevance, tone, completeness, and factual grounding. A summarization prompt might be evaluated on compression ratio, information retention, and readability. The first step in any evaluation effort is defining what "good" means for the specific use case.
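One lightweight way to make that definition concrete is a criteria registry keyed by prompt type. The following is a minimal sketch under assumed names: the task-type keys and criteria strings are illustrative, not a standard schema.

```python
# Hypothetical registry mapping a prompt's purpose to the quality
# dimensions it should be scored on. Keys and values are examples only.
CRITERIA = {
    "classification": ["accuracy", "label_consistency"],
    "generation": ["relevance", "tone", "completeness", "factual_grounding"],
    "summarization": ["compression_ratio", "information_retention", "readability"],
}

def criteria_for(task_type: str) -> list[str]:
    """Look up the quality dimensions to evaluate for a given prompt type."""
    return CRITERIA.get(task_type, [])
```

Writing the criteria down in one place forces the "what does good mean here?" conversation to happen before any scoring code is written.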
Quantitative evaluation uses scoring functions to produce numerical quality measures. These can be rule-based (does the output match the expected JSON schema?), reference-based (how similar is the output to a gold-standard answer?), or model-based (does a judge model rate this output as helpful and accurate?). Scores are aggregated across test cases to produce overall quality metrics that can be tracked over time and compared across prompt versions.
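The rule-based and reference-based scorers above can be sketched with the standard library alone. This is an assumed implementation, not a specific tool's API: the schema check only tests for required JSON keys, and the reference score uses a crude string-similarity ratio as a stand-in for whatever similarity metric a real suite would use.

```python
import json
from difflib import SequenceMatcher

def rule_score(output: str, required_keys: set[str]) -> float:
    """Rule-based: 1.0 if the output parses as JSON containing every
    required key, else 0.0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and required_keys <= set(data) else 0.0

def reference_score(output: str, gold: str) -> float:
    """Reference-based: similarity to a gold-standard answer in [0, 1].
    SequenceMatcher is a crude proxy for a real similarity metric."""
    return SequenceMatcher(None, output, gold).ratio()

def aggregate(scores: list[float]) -> float:
    """Mean score across test cases: one number to track over time
    and compare across prompt versions."""
    return sum(scores) / len(scores) if scores else 0.0
```

A model-based judge would slot in as a third scorer with the same `(output) -> float` shape, which is what lets all three feed the same `aggregate` step.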
Qualitative evaluation complements the numbers with human judgment. Prompt engineers review a sample of outputs to identify patterns that automated metrics miss — subtle tone issues, technically correct but unhelpful responses, or edge cases that expose gaps in the instructions. These observations feed back into prompt refinement and test case creation.
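The human-review step is manual by nature, but drawing the sample can still be scripted. A small sketch, assuming outputs are held as a plain list; the fixed seed makes the sample reproducible so two reviewers see the same cases.

```python
import random

def sample_for_review(outputs: list[str], k: int = 5, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of outputs for human review."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))
```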
Comparative evaluation is particularly valuable. Rather than evaluating a prompt in isolation, teams compare two or more variants on the same test suite. Side-by-side comparison makes differences concrete: where does variant A outperform variant B, and vice versa? This approach drives data-informed decisions about which prompt version to deploy.
Effective prompt evaluation is iterative and continuous. Initial evaluation during development establishes a quality baseline. Pre-deployment evaluation gates catch regressions before they reach users. Production evaluation monitors for drift as user inputs evolve and models are updated. Each evaluation cycle produces insights that inform the next round of prompt improvement.
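The pre-deployment gate in that cycle can be as simple as a thresholded comparison against the current baseline. A sketch under assumed semantics: the tolerance value is illustrative, and a real gate would likely also require a minimum number of test cases before trusting the aggregate.

```python
def passes_gate(candidate_score: float, baseline_score: float,
                tolerance: float = 0.02) -> bool:
    """Pre-deployment gate: allow the candidate prompt only if its
    aggregate score does not regress more than `tolerance` below the
    baseline. The small tolerance absorbs run-to-run scoring noise."""
    return candidate_score >= baseline_score - tolerance
```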