Using one LLM to evaluate another LLM’s output is one of the most powerful eval techniques. It is also one of the most dangerous if done carelessly.

How to Write a Judge Prompt

A good judge prompt has three parts:

1. Role and context. Tell the judge what it is evaluating and what expertise to apply.

You are an expert evaluator of release risk assessments.

2. Rubric with explicit score levels. Do not just say “rate the quality from 1-5.” Define what each score level looks like with concrete descriptions.

### Reasoning Quality (1-5)
1: No reasoning, just states a conclusion
2: Vague reasoning, doesn't reference specific data
3: Adequate reasoning, references some specific data
4: Strong reasoning, references specific files/changes/risks
5: Excellent reasoning, considers combinations of risks and interactions

3. Structured output format. Ask for JSON so you can parse the scores programmatically.

Respond with a JSON object:
{
    "reasoning_quality": <1-5>,
    "specificity": <1-5>,
    "decision_accuracy": <1-5>,
    "actionability": <1-5>,
    "overall_score": <1-5>,
    "explanation": "<why you gave these scores>"
}

The full judge prompt is in src/release_agent/evals/judge.py in the JUDGE_SYSTEM_PROMPT constant.
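
As a minimal sketch of that flow, the snippet below calls the judge and parses its JSON verdict. It assumes the OpenAI Python SDK; the import path mirrors src/release_agent/evals/judge.py, and the helper name and prompt layout are illustrative rather than the project's actual code.

import json

from openai import OpenAI

# Assumed import path based on src/release_agent/evals/judge.py.
from release_agent.evals.judge import JUDGE_SYSTEM_PROMPT

client = OpenAI()

def judge_assessment(agent_output: str, gold_example: str) -> dict:
    """Score one agent assessment against the gold reference; return the parsed JSON verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        response_format={"type": "json_object"},  # ask the API for parseable JSON
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"## Agent output\n{agent_output}\n\n"
                    f"## Gold reference\n{gold_example}"
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)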

Avoiding Bias

LLM judges have known biases that can corrupt your eval results:

Position bias. The judge may favor whichever output is presented first. Mitigation: randomize the order of the agent’s output and the gold example between runs. Or present them in a clearly labeled structure (as this project does).
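
If you go the randomization route, a minimal sketch looks like this (the helper name and A/B prompt layout are hypothetical, not the project's code):

import random

def build_pairwise_prompt(agent_output: str, gold_example: str) -> tuple[str, bool]:
    """Randomize which output appears first; return the prompt and whether the agent came first."""
    agent_first = random.random() < 0.5
    first, second = (
        (agent_output, gold_example) if agent_first else (gold_example, agent_output)
    )
    prompt = f"## Output A\n{first}\n\n## Output B\n{second}"
    return prompt, agent_first

# The caller records agent_first so the judge's scores for A and B can be
# mapped back to agent vs. gold after the verdict comes in.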

Verbosity bias. Longer, more verbose outputs tend to get higher scores even when they are not better. Mitigation: include rubric criteria that penalize filler and reward conciseness.
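
For example, a conciseness criterion in the same style as the rubric above (wording illustrative):

### Conciseness (1-5)
1: Padded with filler and repetition; key points are buried
3: Mostly to the point, with some unnecessary restatement
5: Every sentence carries information; no filler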

Self-preference bias. If the judge is the same model as the agent, it tends to rate the agent’s output more favorably. Mitigation: use a different model for the judge than for the agent. In this project, the agent uses gpt-4o with temperature=0.2 and the judge uses gpt-4o with temperature=0.0, but ideally you would use a different model entirely (e.g., Claude as judge if GPT-4o is the agent, or vice versa).
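
One way to keep that swap cheap is to make the agent and judge models separately configurable, as in this minimal sketch (the config class and field names are illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConfig:
    """Keeps agent and judge models independent so the judge can be swapped out later."""
    agent_model: str = "gpt-4o"       # model under test
    agent_temperature: float = 0.2
    judge_model: str = "gpt-4o"       # ideally a different model family than the agent
    judge_temperature: float = 0.0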

Anchoring bias. Showing the gold example to the judge can anchor its expectations. If the agent produces a valid but different assessment, the judge may penalize it for not matching the gold. Mitigation: explicitly instruct the judge that multiple valid assessments exist, and the gold is a reference, not the only correct answer.
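
In the judge prompt, that instruction can be as simple as the following (phrasing illustrative, not the project's exact wording):

The gold example is a reference, not the only correct answer. Multiple valid assessments exist for the same release. Score the agent's output on its own merits and do not penalize it merely for differing from the gold.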

Consistency

Because the judge is itself an LLM, it can give different scores on the same input across runs.

Strategies for consistency:

  • Temperature 0.0: Eliminates most randomness (but not all, due to GPU non-determinism).
  • Multiple runs: Run the judge 3 times and average the scores (see the sketch after this list). This is more expensive but gives you confidence intervals.
  • Explicit rubric: The more specific your rubric, the less room the judge has for subjective variation.
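
A minimal sketch of the multiple-runs strategy, reusing the judge_assessment helper sketched earlier (the aggregation shape is illustrative):

import statistics

def judge_with_repeats(agent_output: str, gold_example: str, runs: int = 3) -> dict:
    """Run the judge several times and report the mean and spread of the overall score."""
    scores = [
        judge_assessment(agent_output, gold_example)["overall_score"]
        for _ in range(runs)
    ]
    return {
        "mean_overall": statistics.mean(scores),
        "stdev_overall": statistics.stdev(scores) if runs > 1 else 0.0,
        "scores": scores,
    }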

Sycophancy Risk

LLMs tend to be agreeable. A judge model might give inflated scores because it does not want to “criticize” the agent’s output. This is sycophancy, and it can make your evals meaningless — everything gets a 4 or 5.

Mitigation strategies:

  • Include low-score examples in your rubric. Show the judge what a 1 or 2 looks like so it has permission to give low scores.
  • Ask for the explanation first. Some research suggests asking the model to explain its reasoning before giving a score reduces sycophancy.
  • Calibrate with known-bad outputs. Feed the judge deliberately poor outputs and verify it gives them low scores (a sketch follows this list). If it does not, your rubric needs work.
  • Track score distributions. If your judge never gives scores below 3, something is wrong.
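
A minimal calibration sketch, again reusing the judge_assessment helper; the known-bad set, threshold, and reporting are illustrative:

from collections import Counter

def calibrate_judge(known_bad: list[tuple[str, str]], max_acceptable: int = 2) -> None:
    """known_bad holds (deliberately_poor_output, gold_example) pairs."""
    distribution = Counter()
    for bad_output, gold in known_bad:
        score = judge_assessment(bad_output, gold)["overall_score"]
        distribution[score] += 1
        if score > max_acceptable:
            print(f"WARNING: known-bad output scored {score}; the rubric needs work")
    print("Score distribution on known-bad set:", dict(distribution))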