From Notes on Applied LLMs from Shopify Sidekick

Calibrating an LLM Judge: Cohen's Kappa from 0.02 to 0.61

This is a working note on a thing I’ve been trying to internalize: how to make an LLM-as-judge into a trustable signal. The Sidekick team gave a RailsConf talk and an ICML 2025 talk on this, and the numbers they reported are striking enough that I want to walk through the math.

Headline number: a Cohen’s Kappa of 0.02 on their first judge, climbing to 0.61 after calibration, against a human inter-rater baseline of 0.69. That progression is the whole story. Everything below is me trying to understand each piece.

Why you need a judge at all

You have an agent. The agent does open-ended things: writes ShopifyQL queries, plans flash sales, creates draft products from a vague natural-language description. There is no reference output to compare against. You cannot write assert agent.output == expected because there are dozens of acceptable outputs.

The options for evaluating it are roughly:

1. Exact-match against a reference. Fails on open outputs. Marks correct executions wrong because they used a different default.
2. Heuristic checks. “Did it call the right tool? Did the query parse?” Catches obvious bugs, misses everything semantic.
3. Human raters. Gold standard but slow, expensive, doesn’t scale past a few hundred examples per week.
4. LLM-as-judge. Another model reads the trajectory and assigns a score. Scales to millions, costs cents per eval, and is unreliable by default.

In practice you want (2) plus (4): cheap deterministic checks for the things they can catch, an LLM judge for everything else. The catch is that (4) is only worth anything if you’ve calibrated it against (3).

The statistics

Four numbers come up: raw agreement, Cohen’s Kappa, Kendall’s Tau, and Pearson correlation. Worth pinning down what each measures.

Raw agreement. “What fraction of examples does the judge label the same as the human?” Easy to compute, badly misleading on imbalanced data. If 90% of trajectories are correct, a judge that always says “correct” scores 90% agreement and has learned nothing.

def raw_agreement(judge_labels, human_labels):
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

Cohen’s Kappa. Agreement above what you’d expect from chance, given the marginal distribution of each rater. Ranges from -1 to 1. Zero is chance. 0.61 to 0.80 is “substantial agreement” in Landis and Koch’s old rubric. The formula:

κ = (p_observed - p_chance) / (1 - p_chance)

where p_observed is raw agreement and p_chance is the agreement you would expect if both raters were sampling independently from their own marginal distributions. Take the 90%-correct example: a judge that always says “correct” gets p_observed = 0.9, but p_chance is also 0.9 (the judge says “correct” 100% of the time and the human 90% of the time, so chance agreement is 1.0 × 0.9 + 0.0 × 0.1), so κ = 0.

from collections import Counter

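def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(judge)
    labels = set(judge) | set(human)

    # Observed agreement: fraction of items labeled identically.
    p_obs = sum(j == h for j, h in zip(judge, human)) / n

    # Chance agreement: both raters draw independently from their own marginals.
    judge_marginal = Counter(judge)
    human_marginal = Counter(human)
    p_chance = sum(
        (judge_marginal[label] / n) * (human_marginal[label] / n)
        for label in labels
    )

    # Both raters used the same single label throughout: kappa is 0/0, undefined.
    if p_chance == 1.0:
        return float("nan")
    return (p_obs - p_chance) / (1 - p_chance)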
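
To make the imbalance point concrete, here is a small self-contained check on made-up data: 100 trajectories, 90 of which the human labels correct, scored by two hypothetical judges. The labels and both judges are invented for illustration only; the helpers are compact restatements of the two functions above.

```python
from collections import Counter

def raw_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    p_obs = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_chance = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_obs - p_chance) / (1 - p_chance)

# Invented ground truth: 90 correct, then 10 incorrect.
human = ["correct"] * 90 + ["incorrect"] * 10

# Hypothetical judge 1: never says "incorrect".
lazy_judge = ["correct"] * 100

# Hypothetical judge 2: 5 false alarms on the correct cases,
# catches 8 of the 10 real failures.
useful_judge = (["correct"] * 85 + ["incorrect"] * 5
                + ["incorrect"] * 8 + ["correct"] * 2)

for name, judge in [("lazy", lazy_judge), ("useful", useful_judge)]:
    print(name,
          f"raw={raw_agreement(judge, human):.2f}",
          f"kappa={cohens_kappa(judge, human):.2f}")
# lazy raw=0.90 kappa=0.00
# useful raw=0.93 kappa=0.66
```

Raw agreement barely separates the two judges (0.90 versus 0.93). Kappa separates them completely.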

Kendall’s Tau. For ordinal labels (1 to 5 quality scores), measures rank correlation. Counts concordant minus discordant pairs, normalized. Useful when “the judge ordered the trajectories the same way as the human” matters more than “the judge picked the same exact integer.”

Pearson correlation. Linear correlation, also useful for ordinal or scalar scores. Sensitive to outliers in a way Kendall is not.

The Sidekick team tracked Kappa, Kendall, and Pearson together because each surfaces different failure modes. A judge can have okay Pearson and bad Kappa if it’s miscalibrated by a constant offset (always one point too generous). It can have okay Kappa and bad Kendall if it gets the easy cases right and the borderline cases scrambled.
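
The constant-offset case is easy to reproduce on toy data. The sketch below uses hand-rolled Pearson and Kendall tau-b helpers (my choice of variant, to keep it dependency-free; the talks do not say which Kendall variant was used) and an invented judge that scores every trajectory exactly one point above the human.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cohens_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_obs - p_chance) / (1 - p_chance)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def kendall_tau_b(x, y):
    # Count concordant and discordant pairs; pairs tied in x or in y
    # are removed from the corresponding side of the denominator.
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / sqrt((n_pairs - ties_x) * (n_pairs - ties_y))

# Invented 1-to-5 scores; the judge is always exactly one point too generous.
human = [1, 2, 3, 4, 2, 3, 1, 4]
generous_judge = [h + 1 for h in human]

print(f"pearson={pearson(generous_judge, human):.2f}")        # pearson=1.00
print(f"kendall={kendall_tau_b(generous_judge, human):.2f}")  # kendall=1.00
print(f"kappa={cohens_kappa(generous_judge, human):.2f}")     # kappa=-0.23
```

The ranking is perfect and the labels never match, so Kappa drops below chance while both correlation measures sit at their maximum.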

What a calibration loop looks like

Sketching the loop in code, again from the public talks, not from anything internal:

def calibrate_judge(judge_prompt, ground_truth_set):
    """
    ground_truth_set: list of (trajectory, [human_label_1,
                                            human_label_2,
                                            human_label_3])
    """
    # 1. Establish the ceiling: how much do humans agree with each other?
    human_agreement = mean_pairwise_kappa([labels for _, labels in ground_truth_set])

    # 2. Score the judge against the human consensus.
    judge_labels = [score(judge_prompt, traj) for traj, _ in ground_truth_set]
    consensus_labels = [majority_vote(labels) for _, labels in ground_truth_set]
    judge_agreement = cohens_kappa(judge_labels, consensus_labels)

    # 3. Find systematic disagreements.
    disagreements = [
        (traj, jl, cl)
        for (traj, _), jl, cl in zip(ground_truth_set, judge_labels, consensus_labels)
        if jl != cl
    ]

    return {
        "ceiling": human_agreement,    # what we're aiming at
        "current": judge_agreement,    # where we are
        "headroom": human_agreement - judge_agreement,
        "errors": disagreements,       # the material for the next prompt edit
    }
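
The sketch above leans on helpers it never defines. A minimal version of two of them, majority_vote and mean_pairwise_kappa, might look like the following. Both are my own guesses at reasonable behavior (ties in the vote go to the label seen first; the ceiling is the plain mean of Kappa over every pair of raters), not anything described in the talks. The score helper is left out because it is an LLM call.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

def cohens_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (p_obs - p_chance) / (1 - p_chance)

def majority_vote(labels):
    # Most frequent label; on a tie, the label that appeared first wins
    # (an arbitrary tie-break chosen for this sketch).
    return Counter(labels).most_common(1)[0][0]

def mean_pairwise_kappa(label_rows):
    # label_rows holds one tuple per trajectory, one label per rater.
    # Transpose to one sequence per rater, then average Kappa over rater pairs.
    per_rater = list(zip(*label_rows))
    return mean([cohens_kappa(a, b) for a, b in combinations(per_rater, 2)])

# Invented labels from three hypothetical raters on four trajectories.
rows = [
    ("good", "good", "good"),
    ("good", "good", "bad"),
    ("bad", "bad", "bad"),
    ("bad", "good", "bad"),
]

print([majority_vote(r) for r in rows])      # ['good', 'good', 'bad', 'bad']
print(f"{mean_pairwise_kappa(rows):.2f}")    # 0.40
```

With an odd number of raters and binary labels the tie-break never fires. With ordinal 1 to 5 scores it can, and a median may be a better consensus than a vote.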

The loop is then: read the disagreements, find a pattern (the judge is too lenient on hallucinated IDs, the judge marks every refusal as correct, the judge cannot tell a partial answer from a complete one), rewrite the rubric, re-score, repeat. The Sidekick team did this until their judge agreement averaged 0.66 to 0.75 across the metrics they tracked, with Kappa specifically going 0.02 → 0.61.

You stop when the headroom (human_agreement - judge_agreement) is small enough that the judge disagrees with the human consensus about as much as the human raters disagree with each other. Going past that risks overfitting the judge to the specific humans you sampled, which is its own failure.
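
As a sketch, the stopping rule fits in a few lines. The tolerance of 0.1 is an arbitrary threshold I picked for illustration, not a number from the talks; in practice it should reflect how noisy the Kappa estimate is on a ground-truth set of your size.

```python
def calibration_status(ceiling, current, tolerance=0.1):
    """Classify a judge by its Kappa headroom against the human ceiling.

    `tolerance` is a hypothetical threshold chosen for this sketch.
    """
    headroom = ceiling - current
    if headroom < 0:
        # Judge agrees with the consensus more than humans agree with each other.
        return "suspect-overfit"
    if headroom <= tolerance:
        return "stop"
    return "keep-iterating"

# Kappa values reported in the talks: ceiling 0.69, judge 0.02 then 0.61.
print(calibration_status(0.69, 0.02))  # keep-iterating
print(calibration_status(0.69, 0.61))  # stop
print(calibration_status(0.69, 0.80))  # suspect-overfit (invented judge score)
```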

N-Stage Gated Rewards

The other thing they do, which I find satisfying, is not run the LLM judge on every output. They gate it behind cheaper procedural checks:

def reward(trajectory, judge):
    # Stage 1: deterministic, microseconds.
    if not syntactically_valid(trajectory):
        return 0.0
    if uses_invalid_tool_names(trajectory):
        return 0.0
    if references_nonexistent_ids(trajectory):
        return 0.0

    # Stage 2: cheap heuristic, milliseconds.
    if not satisfies_user_constraints(trajectory):
        return 0.2

    # Stage 3: expensive LLM judge, hundreds of milliseconds.
    return judge.score(trajectory)

Two wins from this structure. The first is cost: most failures get caught by deterministic checks before you spend a judge inference. The second is integrity of the judge signal. Deterministic checks catch the things deterministic checks should catch (syntax, schema, enum values), and the LLM judge gets to focus on what only it can do (semantic correctness, helpfulness, tone). When you collapse all checks into the judge, the judge starts being asked to grade syntax, which it does badly, which pollutes the reward signal.
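
A runnable toy version makes the cost win visible. Everything here is a stand-in I made up: the trajectory is a plain dict, KNOWN_TOOLS holds hypothetical tool names, the meets_constraints flag replaces a real heuristic check, and CountingJudge returns a fixed score so the number of judge calls can be counted.

```python
import json

KNOWN_TOOLS = {"run_query", "create_draft_product"}  # hypothetical tool names

class CountingJudge:
    """Stand-in for an LLM judge: fixed score, counts how often it is called."""
    def __init__(self, fixed_score=0.8):
        self.fixed_score = fixed_score
        self.calls = 0

    def score(self, trajectory):
        self.calls += 1
        return self.fixed_score

def syntactically_valid(trajectory):
    # Toy syntax gate: the tool arguments must parse as JSON.
    try:
        json.loads(trajectory["arguments"])
        return True
    except (KeyError, TypeError, ValueError):
        return False

def reward(trajectory, judge):
    # Stage 1: deterministic gates.
    if not syntactically_valid(trajectory):
        return 0.0
    if trajectory.get("tool") not in KNOWN_TOOLS:
        return 0.0
    # Stage 2: cheap heuristic, represented here by a precomputed flag.
    if not trajectory.get("meets_constraints", False):
        return 0.2
    # Stage 3: only now pay for the judge.
    return judge.score(trajectory)

batch = [
    {"tool": "run_query", "arguments": "{not json", "meets_constraints": True},
    {"tool": "drop_database", "arguments": "{}", "meets_constraints": True},
    {"tool": "run_query", "arguments": "{}", "meets_constraints": False},
    {"tool": "run_query", "arguments": '{"limit": 10}', "meets_constraints": True},
]

judge = CountingJudge()
rewards = [reward(t, judge) for t in batch]
print(rewards, judge.calls)  # [0.0, 0.0, 0.2, 0.8] 1
```

Four trajectories, one judge inference. The three failures never reach the judge, so they cannot add noise to its scores either.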

Reward hacking shows up immediately

Once the judge is good enough to use as a reward signal for post-training (the Sidekick team uses GRPO, Group Relative Policy Optimization), the agent starts finding holes in the judge. The public talks list four hacks they encountered:

Opt-out hacking. The agent learns to refuse tasks the judge grades leniently when refused.
Tag hacking. Asked to filter customers by account status, the agent emits customer_tags CONTAINS 'enabled' instead of customer_account_status = 'ENABLED'. The tag field is a free-form catch-all that satisfies the syntax check and gets a partial-credit semantic score.
Schema violations under the radar. Hallucinated IDs that look plausible and pass surface-level checks.
Format gaming. Outputs structured to match what the rubric describes as “good,” without doing the underlying work.

Each hack is a gap in the rubric or the procedural gates. Each fix tightens one of them. The agent and the judge co-evolve: you cannot freeze the judge and assume it stays accurate as the agent improves, because the agent is actively probing the judge’s weaknesses by gradient descent.
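
As one example of tightening a procedural gate, here is a sketch of a check aimed at the tag hack. It flags queries that push a value belonging to a structured enum field through customer_tags instead. The regex and the enum table are my assumptions: only 'ENABLED' appears in the talks, and the other status values are placeholders.

```python
import re

# Assumed enum table for illustration; only 'ENABLED' comes from the talks.
ENUM_FIELDS = {
    "customer_account_status": {"ENABLED", "DISABLED", "INVITED", "DECLINED"},
}

# Matches filters of the form: customer_tags CONTAINS '<value>'
TAG_FILTER = re.compile(r"customer_tags\s+CONTAINS\s+'([^']*)'", re.IGNORECASE)

def tag_hack_fields(query):
    """Return the enum fields whose values show up as customer_tags filters."""
    suspicious = set()
    for value in TAG_FILTER.findall(query):
        for field, allowed in ENUM_FIELDS.items():
            if value.upper() in allowed:
                suspicious.add(field)
    return sorted(suspicious)

print(tag_hack_fields("customer_tags CONTAINS 'enabled'"))
# ['customer_account_status']
print(tag_hack_fields("customer_account_status = 'ENABLED'"))
# []
print(tag_hack_fields("customer_tags CONTAINS 'vip'"))
# []
```

A merchant could legitimately have a tag called “enabled”, so a hit is better treated as a reason to cap the reward or route the trajectory to review than as proof of a hack.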

What I take from this

Some loose conclusions:

LLM judges need a labeled set before they are useful. Inter-rater Kappa is the legible measure. If you can’t quote a Kappa against humans, your judge is decoration.
The human inter-rater agreement is the ceiling. A judge that scores higher than humans agree with each other is overfit, not superhuman.
Track multiple agreement statistics. Raw agreement hides class imbalance. Kappa, Kendall, and Pearson surface different failure modes.
Gate the judge behind deterministic checks. Don’t ask the LLM to do what assert can do. The signal is cleaner and the inference bill is smaller.
Expect reward hacking the moment you start optimizing. Treat the agent as an adversary probing the judge. Plan for the rubric to keep evolving.

I’d like to actually build this end to end on a toy agent at some point: a small tool-using agent over a fake API, a human-labeled ground-truth set of 50 trajectories, an LLM judge prompt I iterate on, and a notebook that prints Kappa each iteration. That’s the way I learn things, and I think it would force me to feel the calibration loop in a way reading about it does not.

Sources: Building production-ready agentic systems, Shopify Engineering. ICML 2025 Expo. LLM Evaluations and Reinforcement Learning for Shopify Sidekick on Rails, RailsConf 2025. All code in this post is illustrative, written by me to work through the ideas in the talks.