When I first started learning about LLM and AI evaluation, I struggled to fit all the different evaluation methods into a single mental model. Every blog post, paper, or tool seemed to focus on one technique, without clearly explaining when to use which, what each method actually measures, or how they fit together.

This post is my attempt to connect the dots.

Rather than treating evaluation methods as competing ideas, it’s much more useful to see them as layers—each answering a different question about your system.


The Core Question: What Are You Trying to Measure?

Before choosing an evaluation method, ask:

What does “good” mean for this task?

Is it:

  • A single correct answer?
  • Being close enough to a reference?
  • Being useful or helpful to a human?
  • Following instructions and reasoning correctly?

Different evaluation methods exist because no single metric can answer all of these questions.


Functional Correctness

Question it answers:

Did the system do the right thing?

Functional correctness is about task success, not wording.

Examples:

  • Did the SQL query return the correct rows?
  • Did the code compile and pass tests?
  • Did the agent successfully book the meeting?

This is the strongest form of evaluation when it’s available.
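
To make this concrete, here's a minimal sketch of a functional check for generated SQL, assuming a small SQLite test fixture and a trusted reference query (both hypothetical). The generated query passes only if it runs and returns the same rows as the reference.

```python
import sqlite3

def sql_results_match(generated_sql: str, reference_sql: str, db_path: str) -> bool:
    """Functional check: does the generated query return the same rows as a trusted reference query?

    Comparing result sets (ignoring row order) is one practical way to define
    "did the right thing" for text-to-SQL; the fixture database is assumed.
    """
    with sqlite3.connect(db_path) as conn:
        try:
            generated_rows = set(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            return False  # the query didn't even run: automatic fail
        reference_rows = set(conn.execute(reference_sql).fetchall())
    return generated_rows == reference_rows

# Hypothetical usage against a fixture with an `orders` table:
# sql_results_match(
#     generated_sql="SELECT id FROM orders WHERE total > 100",
#     reference_sql="SELECT id FROM orders WHERE total > 100.0",
#     db_path="test_fixture.db",
# )
```

The same pattern applies to code (run the tests) or agents (check the calendar): the check lives in the task's own domain, not in the wording of the output.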

Characteristics

  • Binary or near-binary (pass/fail)
  • Often automated
  • Domain-specific

Strengths

  • High signal
  • Hard to game
  • Directly tied to real-world outcomes

Limitations

  • Expensive or complex to implement
  • Not always possible for open-ended tasks

Reference-Based Evaluation

Many tasks don’t have a single “correct” output, but they do have examples of good answers. This leads to reference-based evaluation.

All methods below compare a model’s output to one or more reference answers.


Exact Match

Question it answers:

Did the output exactly match the reference?

This is the strictest reference-based method.
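
In practice this is just a string comparison, usually after light normalization. A tiny sketch:

```python
def exact_match(prediction: str, reference: str, normalize: bool = True) -> bool:
    """Strict reference check: the output must equal the reference.

    Light normalization (case, surrounding whitespace) is a common variant;
    turn it off when formatting is itself part of the task.
    """
    if normalize:
        prediction, reference = prediction.strip().lower(), reference.strip().lower()
    return prediction == reference

assert exact_match("  Yes ", "yes")               # passes after normalization
assert not exact_match("Yes, I think so", "yes")  # a valid paraphrase still fails
```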

Examples

  • Classification labels
  • Yes/No answers
  • Canonical IDs or tokens

Strengths

  • Simple and deterministic
  • Easy to automate

Limitations

  • Extremely brittle
  • Penalizes valid paraphrases

Exact match works best when:

  • Output space is small
  • Formatting matters
  • There truly is only one correct answer

Lexical Similarity

Question it answers:

How similar is the wording to the reference?

This includes metrics like:

  • BLEU
  • ROUGE
  • Edit distance

These metrics operate at the token or string level.
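
For intuition, here's a hand-rolled sketch of two scores in this family: ROUGE-1 recall (unigram overlap with the reference) and a normalized edit-distance-style similarity from Python's standard library. Real pipelines typically use established implementations (e.g. the rouge-score or sacrebleu packages), so treat this purely as an illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def rouge1_recall(prediction: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    return overlap / max(sum(ref_counts.values()), 1)

def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, prediction, reference).ratio()

rouge1_recall("the cat sat on the mat", "the cat is on the mat")      # high (~0.83)
rouge1_recall("a feline rested on the rug", "the cat is on the mat")  # low, despite similar meaning
```

The second example is the core limitation in miniature: the meaning is close, but the words are not.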

Strengths

  • Cheap and fast
  • Works reasonably well for structured or templated text

Limitations

  • Blind to meaning
  • Over-penalizes paraphrasing
  • Can be gamed by copying phrasing

Lexical similarity is useful when wording consistency matters, but it should rarely be your only signal.


Semantic Similarity

Question it answers:

Does the output mean the same thing as the reference?

Semantic similarity moves beyond surface-level text and compares meaning instead.

Typically implemented using (see the sketch after this list):

  • Embeddings
  • Cosine similarity
  • Learned semantic models
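
A minimal sketch of the embedding-plus-cosine approach, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model (any embedding model works, and the 0.8 threshold below is an arbitrary placeholder you'd want to tune against human judgments):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedder

def semantically_similar(prediction: str, reference: str, threshold: float = 0.8) -> bool:
    """Embed both texts and compare cosine similarity against a tunable threshold."""
    embeddings = model.encode([prediction, reference])
    score = float(cos_sim(embeddings[0], embeddings[1]))
    return score >= threshold

# Paraphrases that lexical metrics would punish can still score high here:
semantically_similar(
    "The meeting was moved to Friday afternoon.",
    "They rescheduled the meeting for Friday afternoon.",
)
```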

Strengths

  • Robust to paraphrasing
  • Better aligned with human judgment

Limitations

  • Sensitive to embedding quality
  • Threshold selection is non-trivial
  • Can miss subtle factual errors

Semantic similarity is especially useful for:

  • Summarization
  • QA with multiple valid phrasings
  • Explanation-style outputs

Human Evaluation

Question it answers:

Is this actually good?

Human evaluation is the gold standard—when you can afford it.

Humans can judge:

  • Helpfulness
  • Clarity
  • Tone
  • Safety
  • Factual accuracy

Strengths

  • Highest alignment with real users
  • Captures nuance no metric can

Limitations

  • Expensive
  • Slow
  • Subjective without strong rubrics

To scale human evaluation, teams often:

  • Use rubrics
  • Sample strategically (see the sketch after this list)
  • Combine with automated pre-filters
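
One practical way to sample strategically is to send humans only the items where cheaper automated signals disagree, since that's where their judgment adds the most. A rough sketch, assuming each item already carries scores from earlier layers (the field names are made up):

```python
def needs_human_review(item: dict, semantic_threshold: float = 0.8) -> bool:
    """Route an item to human review when automated signals disagree.

    `item` is assumed to look like {"exact_match": False, "semantic_score": 0.91};
    clear agreement in either direction is left to automation.
    """
    passed_exact = item["exact_match"]
    passed_semantic = item["semantic_score"] >= semantic_threshold
    return passed_exact != passed_semantic

eval_items = [
    {"id": 1, "exact_match": True,  "semantic_score": 0.95},  # clear pass: skip humans
    {"id": 2, "exact_match": False, "semantic_score": 0.90},  # signals disagree: review
    {"id": 3, "exact_match": False, "semantic_score": 0.15},  # clear fail: skip humans
]
human_queue = [item for item in eval_items if needs_human_review(item)]  # -> item 2
```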

LLM as a Judge

Question it answers:

Can a model evaluate another model’s output?

LLM-as-a-Judge sits between automated metrics and human review.

Common patterns:

  • Scoring answers on a rubric (0–10, pass/fail), sketched after this list
  • Pairwise comparison (A vs B)
  • Reasoned critique with justification
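
Here's a minimal sketch of the rubric-scoring pattern using the OpenAI Python client; the model name, rubric wording, and 0–10 scale are illustrative assumptions rather than a standard, and any provider's API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a user question.

Rubric:
- 0-3: incorrect or unhelpful
- 4-7: partially correct, but missing key details
- 8-10: correct, complete, and clearly written

Question: {question}
Answer: {answer}

Reply with only an integer score from 0 to 10."""

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to score an answer against a fixed rubric."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return int(response.choices[0].message.content.strip())

# judge_answer("What does HTTP 404 mean?", "The requested resource was not found.")
```

Pairwise comparison works the same way: show the judge both answers and ask which one better satisfies the rubric.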

Strengths

  • Scales far better than humans
  • More semantic than lexical metrics
  • Can be task-specific

Limitations

  • Judge bias
  • Sensitivity to prompt design
  • Risk of self-preference (judges tend to favor outputs from their own model family)

LLM judges work best when:

  • Anchored with clear rubrics
  • Validated against human judgments
  • Used as one signal, not the only one

Putting It All Together

Instead of asking “Which evaluation method is best?”, ask:

Which combination of methods gives me confidence in this system?

A typical stack might look like:

  • Functional correctness → Can it actually do the task?
  • Exact match / lexical checks → Catch regressions
  • Semantic similarity → Measure meaning preservation
  • LLM-as-a-judge → Scalable qualitative assessment
  • Human evaluation → Ground truth and calibration

Each layer compensates for the weaknesses of the others.
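
As a rough sketch of what that layering can look like in code, here's a hypothetical harness that runs cheap checks first and escalates only when needed, reusing the illustrative helpers from the sketches above (exact_match, semantically_similar, judge_answer); none of this is a standard API.

```python
def evaluate(prediction: str, reference: str, question: str) -> dict:
    """Layered evaluation: cheap deterministic checks first, expensive judges later.

    When a functional check exists (like the SQL comparison earlier), run it first;
    the layers below are for outputs without a programmatic ground truth.
    """
    if exact_match(prediction, reference):
        return {"verdict": "pass", "layer": "exact_match"}

    if semantically_similar(prediction, reference):
        return {"verdict": "pass", "layer": "semantic_similarity"}

    score = judge_answer(question, prediction)
    if score >= 8:
        return {"verdict": "pass", "layer": "llm_judge", "score": score}
    if score <= 3:
        return {"verdict": "fail", "layer": "llm_judge", "score": score}

    # Middling judge scores are exactly where human review adds the most value.
    return {"verdict": "needs_human_review", "layer": "llm_judge", "score": score}
```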


Final Thought

Evaluation isn’t about finding a perfect metric—it’s about building trust in your system.

Once I started viewing evaluation methods as complementary tools rather than competing ideas, everything started to make sense.

If you’re early in your LLM evaluation journey and feeling overwhelmed: that’s normal. Start simple, be explicit about what you’re measuring, and add complexity only when you need it.