When I first started learning about LLM and AI evaluation, I struggled to fit all the different evaluation methods into a single mental model. Every blog post, paper, or tool seemed to focus on one technique, without clearly explaining when to use which, what each method actually measures, or how they fit together.

This post is my attempt to connect the dots.

Rather than treating evaluation methods as competing ideas, it’s much more useful to see them as layers—each answering a different question about your system.


The Core Question: What Are You Trying to Measure?

Before choosing an evaluation method, ask:

What does “good” mean for this task?

Is it:

  • A single correct answer?
  • Being close enough to a reference?
  • Being useful or helpful to a human?
  • Following instructions and reasoning correctly?

Different evaluation methods exist because no single metric can answer all of these questions.


Functional Correctness

Question it answers:

Did the system do the right thing?

Functional correctness is about task success, not wording.

Examples:

  • Did the SQL query return the correct rows?
  • Did the code compile and pass tests?
  • Did the agent successfully book the meeting?

This is the strongest form of evaluation when it’s available.
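
To make this concrete, here's a minimal sketch of a functional check for generated SQL, assuming a small SQLite test fixture and a trusted reference query (both hypothetical). The generated query passes only if it runs and returns the same rows as the reference.

```python
import sqlite3

def sql_results_match(generated_sql: str, reference_sql: str, db_path: str) -> bool:
    """Functional check: does the generated query return the same rows as a trusted reference query?

    Comparing result sets (ignoring row order) is one practical way to define
    "did the right thing" for text-to-SQL; the fixture database is assumed.
    """
    with sqlite3.connect(db_path) as conn:
        try:
            generated_rows = set(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            return False  # the query didn't even run: automatic fail
        reference_rows = set(conn.execute(reference_sql).fetchall())
    return generated_rows == reference_rows

# Hypothetical usage against a fixture with an `orders` table:
# sql_results_match(
#     generated_sql="SELECT id FROM orders WHERE total > 100",
#     reference_sql="SELECT id FROM orders WHERE total > 100.0",
#     db_path="test_fixture.db",
# )
```

The same pattern applies to code (run the tests) or agents (check the calendar): the check lives in the task's own domain, not in the wording of the output.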

Characteristics

  • Binary or near-binary (pass/fail)
  • Often automated
  • Domain-specific

Strengths

  • High signal
  • Hard to game
  • Directly tied to real-world outcomes

Limitations

  • Expensive or complex to implement
  • Not always possible for open-ended tasks

Reference-Based Evaluation

Many tasks don’t have a single “correct” output, but they do have examples of good answers. This leads to reference-based evaluation.

All methods below compare a model’s output to one or more reference answers.


Exact Match

Question it answers:

Did the output exactly match the reference?

This is the strictest reference-based method.
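
In practice this is just a string comparison, usually after light normalization. A tiny sketch:

```python
def exact_match(prediction: str, reference: str, normalize: bool = True) -> bool:
    """Strict reference check: the output must equal the reference.

    Light normalization (case, surrounding whitespace) is a common variant;
    turn it off when formatting is itself part of the task.
    """
    if normalize:
        prediction, reference = prediction.strip().lower(), reference.strip().lower()
    return prediction == reference

assert exact_match("  Yes ", "yes")               # passes after normalization
assert not exact_match("Yes, I think so", "yes")  # a valid paraphrase still fails
```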

Examples

  • Classification labels
  • Yes/No answers
  • Canonical IDs or tokens

Strengths

  • Simple and deterministic
  • Easy to automate

Limitations

  • Extremely brittle
  • Penalizes valid paraphrases

Exact match works best when:

  • Output space is small
  • Formatting matters
  • There truly is only one correct answer

Lexical Similarity

Question it answers:

How similar is the wording to the reference?

This includes metrics like:

  • BLEU
  • ROUGE
  • Edit distance

These metrics operate at the token or string level.
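
For intuition, here's a hand-rolled sketch of two scores in this family: ROUGE-1 recall (unigram overlap with the reference) and a normalized edit-distance-style similarity from Python's standard library. Real pipelines typically use established implementations (e.g. the rouge-score or sacrebleu packages), so treat this purely as an illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def rouge1_recall(prediction: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    return overlap / max(sum(ref_counts.values()), 1)

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, prediction, reference).ratio()

rouge1_recall("the cat sat on the mat", "the cat is on the mat")      # high (~0.83)
rouge1_recall("a feline rested on the rug", "the cat is on the mat")  # low, despite similar meaning
```

The second example is the core limitation in miniature: the meaning is close, but the words are not.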

Strengths

  • Cheap and fast
  • Works reasonably well for structured or templated text

Limitations

  • Blind to meaning
  • Over-penalizes paraphrasing
  • Can be gamed by copying phrasing

Lexical similarity is useful when wording consistency matters, but it should rarely be your only signal.


Semantic Similarity

Question it answers:

Does the output mean the same thing as the reference?

Semantic similarity moves beyond surface-level text and compares meaning instead.

Typically implemented using (see the sketch after this list):

  • Embeddings
  • Cosine similarity
  • Learned semantic models
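
A minimal sketch of the embedding-plus-cosine approach, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (any embedding model works, and the 0.8 threshold below is an arbitrary placeholder you'd want to tune against human judgments):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedder

def semantically_similar(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Embed both texts and compare cosine similarity against a tunable threshold."""
    embeddings = model.encode([prediction, reference])
    score = float(cos_sim(embeddings[0], embeddings[1]))
    return score >= threshold

# Paraphrases that lexical metrics would punish can still score high here:
semantically_similar(
    "The meeting was moved to Friday afternoon.",
    "They rescheduled the meeting for Friday afternoon.",
)
```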

Strengths

  • Robust to paraphrasing
  • Better aligned with human judgment

Limitations

  • Sensitive to embedding quality
  • Threshold selection is non-trivial
  • Can miss subtle factual errors

Semantic similarity is especially useful for:

  • Summarization
  • QA with multiple valid phrasings
  • Explanation-style outputs

Human Evaluation

Question it answers:

Is this actually good?

Human evaluation is the gold standard—when you can afford it.

Humans can judge:

  • Helpfulness
  • Clarity
  • Tone
  • Safety
  • Factual accuracy

Strengths

  • Highest alignment with real users
  • Captures nuance no metric can

Limitations

  • Expensive
  • Slow
  • Subjective without strong rubrics

To scale human evaluation, teams often:

  • Use rubrics
  • Sample strategically (see the sketch after this list)
  • Combine with automated pre-filters
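
One practical way to sample strategically is to send humans only the items where cheaper automated signals disagree, since that's where their judgment adds the most. A rough sketch, assuming each item already carries scores from earlier layers (the field names are made up):

```python
def needs_human_review(item: dict, semantic_threshold: float = 0.8) -> bool:
    """Route an item to human review when automated signals disagree.

    `item` is assumed to look like {"exact_match": False, "semantic_score": 0.91};
    clear agreement in either direction is left to automation.
    """
    passed_exact = item["exact_match"]
    passed_semantic = item["semantic_score"] >= semantic_threshold
    return passed_exact != passed_semantic

eval_items = [
    {"id": 1, "exact_match": True,  "semantic_score": 0.95},  # clear pass: skip humans
    {"id": 2, "exact_match": False, "semantic_score": 0.90},  # signals disagree: review
    {"id": 3, "exact_match": False, "semantic_score": 0.15},  # clear fail: skip humans
]
human_queue = [item for item in eval_items if needs_human_review(item)]  # -> item 2
```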

LLM as a Judge

Question it answers:

Can a model evaluate another model’s output?

LLM-as-a-Judge sits between automated metrics and human review.

Common patterns:

  • Scoring answers on a rubric (0–10, pass/fail), sketched after this list
  • Pairwise comparison (A vs B)
  • Reasoned critique with justification
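
Here's a minimal sketch of the rubric-scoring pattern using the OpenAI Python client; the model name, rubric wording, and 0–10 scale are illustrative assumptions rather than a standard, and any provider's API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a user question.

Rubric:
- 0-3: incorrect or unhelpful
- 4-7: partially correct, but missing key details
- 8-10: correct, complete, and clearly written

Question: {question}
Answer: {answer}

Reply with only an integer score from 0 to 10."""

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score an answer against a fixed rubric."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return int(response.choices[0].message.content.strip())

# judge_answer("What does HTTP 404 mean?", "The requested resource was not found.")
```

Pairwise comparison works the same way: show the judge both answers and ask which one better satisfies the rubric.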

Strengths

  • Scales far better than humans
  • More semantic than lexical metrics
  • Can be task-specific

Limitations

  • Judge bias
  • Sensitivity to prompt design
  • Risk of self-preference (judges tend to favor outputs from their own model family)

LLM judges work best when:

  • Anchored with clear rubrics
  • Validated against human judgments
  • Used as one signal, not the only one

Putting It All Together

Instead of asking “Which evaluation method is best?”, ask:

Which combination of methods gives me confidence in this system?

A typical stack might look like:

  • Functional correctness → Can it actually do the task?
  • Exact match / lexical checks → Catch regressions
  • Semantic similarity → Measure meaning preservation
  • LLM-as-a-judge → Scalable qualitative assessment
  • Human evaluation → Ground truth and calibration

Each layer compensates for the weaknesses of the others.
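
As a rough sketch of what that layering can look like in code, here's a hypothetical harness that runs cheap checks first and escalates only when needed, reusing the illustrative helpers from the sketches above (exact_match, semantically_similar, judge_answer); none of this is a standard API.

```python
def evaluate(prediction: str, reference: str, question: str) -> dict:
    """Layered evaluation: cheap deterministic checks first, expensive judges later.

    When a functional check exists (like the SQL comparison earlier), run it first;
    the layers below are for outputs without a programmatic ground truth.
    """
    if exact_match(prediction, reference):
        return {"verdict": "pass", "layer": "exact_match"}

    if semantically_similar(prediction, reference):
        return {"verdict": "pass", "layer": "semantic_similarity"}

    score = judge_answer(question, prediction)
    if score >= 8:
        return {"verdict": "pass", "layer": "llm_judge", "score": score}
    if score <= 3:
        return {"verdict": "fail", "layer": "llm_judge", "score": score}

    # Middling judge scores are exactly where human review adds the most value.
    return {"verdict": "needs_human_review", "layer": "llm_judge", "score": score}
```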


Final Thought

Evaluation isn’t about finding a perfect metric—it’s about building trust in your system.

Once I started viewing evaluation methods as complementary tools rather than competing ideas, everything started to make sense.

If you’re early in your LLM evaluation journey and feeling overwhelmed: that’s normal. Start simple, be explicit about what you’re measuring, and add complexity only when you need it.