From Search Infrastructure and Software Engineering at Shopify

Applied LLM Engineering: Index

First created May 19, 2026
Last edited Jun 7, 2026

This index covers the model side of the team’s work: getting the Sidekick assistant to behave for the help-tooling job through prompting, context design, and evaluation. The two posts are patterns I worked through from the team’s public talks. They are the counterpart to the search-infrastructure side, where I spend most of my time.

Posts

Just-in-Time Context: Moving Tool Instructions Out of the System Prompt

Calibrating an LLM Judge: Cohen’s Kappa from 0.02 to 0.61

Index

Just-in-Time Context: Moving Tool Instructions Out of the System Prompt. A working note on “Death by a Thousand Instructions” and the pattern of returning tool guidance inline with tool results.
Calibrating an LLM Judge: Cohen’s Kappa from 0.02 to 0.61. A working note on how the Sidekick team turned an unreliable LLM-as-judge into a usable training signal, with the statistics that made the calibration legible.
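
For readers who want to place the 0.02 and 0.61 figures before opening the second post: Cohen's kappa measures agreement between two raters after subtracting the agreement expected by chance, so 0 means chance-level agreement and 1 means perfect agreement. The sketch below is a minimal illustration of the standard formula, not the Sidekick team's code; the function name and the toy labels are my own assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; labels_a could be human ratings
    and labels_b the LLM judge's ratings on the same items.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined by the formula, so report full agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Toy labels (assumed, not from the talks): 1 = pass, 0 = fail.
human = [1, 1, 0, 0]
judge_random = [1, 0, 0, 1]   # agrees on half the items, as chance would
judge_perfect = [1, 1, 0, 0]  # agrees on every item

kappa_random = cohens_kappa(human, judge_random)
kappa_perfect = cohens_kappa(human, judge_perfect)
```

With these toy labels the first judge agrees with the human half the time, which is exactly what chance predicts for balanced labels, so its kappa is 0. The second judge matches every label and gets 1.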