Everyone wants to know which LLM is “better at coding.”

We point to leaderboards, single scores, colorful charts, and declare a winner. But if you’ve ever shipped software to production, you already know the uncomfortable truth:

Benchmarks don’t decide what’s safe to use — failure modes do.

After years in QA and test automation, and now building evaluation tooling for LLMs, I’ve come to a strong conclusion:

You should never use the same benchmark for public credibility and internal decision-making.

This post explains why — and how to design both without fooling yourself.


Coding Ability Is Not One Thing

Most coding benchmarks try to answer a vague question: “Can this model write code?”

That question is useless. In practice, coding ability breaks down into distinct capabilities:

  • Functional correctness
  • Handling edge cases
  • Following constraints
  • Debugging broken code
  • Code quality and maintainability
  • Security awareness
  • Consistency across runs

A model that gets 90% on a leaderboard but occasionally:

  • ignores constraints,
  • silently returns wrong results, or
  • introduces security issues

…is not “good at coding” in any production sense.

Just like in traditional QA, passing tests doesn’t mean a system is safe to ship.


Internal Benchmarks: Where Real Decisions Come From

Internal benchmarks exist for one purpose only: To surface uncomfortable truths before users do. They are not pretty. They are not stable. And they should never be optimized for optics.

What internal benchmarks look like

Internal benchmarks should be:

  • Failure-mode driven, not task-driven
  • Ruthless, not representative
  • Specific to your risks, not generic

Examples of internal coding evals:

  • Debugging tasks based on real regressions
  • Security-sensitive code patterns you’ve actually seen fail
  • Constraint-heavy prompts (“no recursion”, “O(n)”, “no external libs”)
  • Refactoring tasks that reveal code quality under pressure
  • Agent tasks that measure recovery, not just success
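
To make the constraint-heavy case concrete, here is a minimal sketch of the kind of automated check an internal task might run against model output. The allowlist and function names are assumptions for illustration, not part of any real harness:

```python
# Minimal sketch: check model-generated Python against "no recursion"
# and "no external libs" constraints. Names and limits are illustrative.
import ast

ALLOWED_MODULES = {"math", "itertools"}  # assumed stdlib allowlist for the task

def constraint_violations(source: str) -> list[str]:
    """Return human-readable constraint violations found in generated code."""
    tree = ast.parse(source)  # assumes the output at least parses
    violations = []

    for node in ast.walk(tree):
        # Constraint: no external libraries beyond the allowlist.
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_MODULES:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] not in ALLOWED_MODULES:
                violations.append(f"forbidden import: {node.module}")

        # Constraint: no recursion (a function that calls itself by name).
        elif isinstance(node, ast.FunctionDef):
            called = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            if node.name in called:
                violations.append(f"recursion in function: {node.name}")

    return violations
```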

These benchmarks change often. They get stricter over time. They break models that “look good” on public leaderboards. That’s a feature, not a bug.

What internal benchmarks optimize for

Internal evals answer questions like:

  • Which model introduces fewer silent failures?
  • Which model has higher variance across runs?
  • Which model fails safely vs confidently wrong?
  • Which model recovers faster when debugging?

None of this belongs on a leaderboard — but all of it matters in production.
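
Variance, for example, doesn’t need heavy infrastructure to measure: re-run the same task and look at the spread, not just the average. A rough sketch, where `generate` and `passes` stand in for your model call and your task’s pass/fail checker:

```python
# Sketch of run-to-run stability tracking. `generate` and `passes` are
# placeholders for the model call and the task's pass/fail checker.
from statistics import mean, pstdev

def stability_report(prompt: str, generate, passes, n_runs: int = 10) -> dict:
    """Re-run one task and report pass rate plus spread across runs."""
    outcomes = [1.0 if passes(generate(prompt)) else 0.0 for _ in range(n_runs)]
    return {
        "pass_rate": mean(outcomes),
        "spread": pstdev(outcomes),   # high spread = inconsistent behaviour
        "ever_failed": min(outcomes) == 0.0,
    }
```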


Public Benchmarks: What They’re Actually For

Public benchmarks serve a completely different purpose. They are not about picking winners. They are about building trust and shared understanding.

A good public benchmark should:

  • Be stable and reproducible
  • Be easy to understand
  • Clearly define what it measures — and what it doesn’t
  • Expose dimensions of capability, not a single score
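
On that last point, “exposing dimensions” can be as simple as publishing a per-dimension breakdown with no headline number at all. A sketch of what such a report might contain (the dimensions and values are placeholders, not real measurements):

```python
# Sketch of a published result: per-dimension scores, no single aggregate.
# Dimension names and values are placeholders, not real measurements.
public_report = {
    "model": "model-x",
    "benchmark_version": "1.2.0",  # pinned so results stay reproducible
    "dimensions": {
        "functional_correctness": 0.91,
        "constraint_following": 0.74,
        "debugging": 0.68,
        "security_awareness": 0.55,
    },
    # No aggregate on purpose: readers apply their own weights.
}
```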

Public benchmarks are educational artifacts. They show:

  • how evaluation is done,
  • what tradeoffs exist,
  • and why “coding ability” isn’t one-dimensional.

If someone disagrees with your weights or conclusions — that’s fine. If they understand your methodology, you’ve succeeded.

What should never be public

You should not publish:

  • Your hardest internal tasks
  • Your golden regression tests
  • Real incident-derived prompts
  • Security failure cases from production

Publishing those doesn’t increase transparency — it destroys signal. Publishing your internal benchmarks is like open-sourcing your incident playbook.


Why One Benchmark Can’t Do Both

Trying to use a single benchmark for internal decisions and public credibility leads to predictable failure modes:

  • Internal teams optimize for leaderboard scores
  • Benchmarks become static and gameable
  • Real risks get deprioritized
  • Evaluation becomes marketing

This mirrors what we’ve seen for decades in testing:

  • Test suites written to “go green”
  • Metrics optimized for reporting
  • Bugs escaping anyway

Separation is not secrecy — it’s discipline.


What a Responsible Benchmark System Looks Like

Whether you’re evaluating LLMs for coding, agents, or copilots, a responsible system has:

  • Versioned tasks and tests
  • Explicit scoring rubrics
  • Multiple capability dimensions
  • Pairwise comparisons (model vs model)
  • Variance tracking across runs
  • Clear separation between public and internal evals

Most importantly, it treats evaluation as a living system, not a static artifact.
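
Here is a rough sketch of what versioned tasks, explicit rubrics, and pairwise comparison might look like as data structures. Every field name is an assumption; the point is the shape, not the schema:

```python
# Sketch: versioned tasks, explicit rubrics, per-dimension results,
# and a pairwise comparison. Field names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTask:
    task_id: str
    version: str                  # bumped whenever the prompt or checks change
    prompt: str
    rubric: dict[str, float]      # capability dimension -> weight, made explicit
    visibility: str = "internal"  # internal tasks never go into the public set

@dataclass
class RunResult:
    task_id: str
    task_version: str             # results only comparable at the same version
    model: str
    scores: dict[str, float]      # one entry per rubric dimension, per run

def pairwise_delta(a: RunResult, b: RunResult) -> dict[str, float]:
    """Dimension-by-dimension difference between two models on the same task."""
    assert (a.task_id, a.task_version) == (b.task_id, b.task_version)
    return {dim: a.scores[dim] - b.scores[dim] for dim in a.scores}
```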


How This Shapes the Eval Tool I’m Building

This philosophy directly shapes how I think about LLM evaluation tooling:

  • Public benchmarks demonstrate how evaluation works
  • Internal benchmarks drive what decisions get made
  • Golden tests protect against regressions
  • Pairwise evals surface real tradeoffs
  • Agents are evaluated on recovery, not just success

In other words: The goal isn’t to prove a model is good. It’s to understand how it fails.
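
To make one of those bullets concrete: a golden test freezes a prompt that once caused a real failure and asserts the failure never comes back. A pytest-style sketch, with every path and field name hypothetical (`generate` is assumed to be a fixture wrapping the model call):

```python
# Sketch of a golden regression test. Paths, fields, and the `generate`
# fixture are assumptions about the harness, not a real API.
import json
import pathlib

GOLDEN_DIR = pathlib.Path("goldens")  # one JSON file per past incident

def test_goldens_still_pass(generate):
    """Every golden case (a prompt that once failed) must keep passing."""
    for case_file in sorted(GOLDEN_DIR.glob("*.json")):
        case = json.loads(case_file.read_text())
        output = generate(case["prompt"])
        # The stored checks encode exactly what went wrong last time.
        for forbidden in case.get("must_not_contain", []):
            assert forbidden not in output, f"{case_file.name}: reintroduced {forbidden!r}"
        for required in case.get("must_contain", []):
            assert required in output, f"{case_file.name}: lost {required!r}"
```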


Final Thought

Benchmarks shouldn’t make models look impressive. They should make risks obvious — early. If your evaluation setup can’t tell you:

  • where a model is brittle,
  • how it fails under pressure,
  • and whether it’s safe to trust autonomously,

then it’s not an evaluation system. It’s a scoreboard. And scoreboards don’t ship software.