From Notes on Applied LLMs from Shopify Sidekick

Just-in-Time Context: Moving Tool Instructions Out of the System Prompt

I’ve been reading the Sidekick team’s ICML 2025 talk and the Shopify Engineering writeup on building production agentic systems, and I want to write down what I learned from one of the patterns they describe, because I find it really clarifying for how I think about prompt design. The pattern is just-in-time context delivery. The shorthand for the failure mode it replaces is Death by a Thousand Instructions.

This post is mostly me sketching out the idea in code so I understand it. Nothing here is internal Shopify work, everything is from the public talk and blog post.

The setup

You have an agent. It calls tools. Sidekick is a function-calling LLM that operates over Shopify’s admin APIs, with tools for products, orders, themes, analytics, discounts, and so on. The team grew it from roughly 20 tools to over 50.

The naive system prompt for an agent like this looks like this:

SYSTEM_PROMPT = """
You are Sidekick. You help merchants run their store.

Tools:
- create_product: creates a new product. Always set status='draft'
  unless the merchant explicitly says publish. Never set inventory
  unless variants are provided first.
- delete_product: deletes a product. Do not call this on products
  with active orders. Prefer archive_product when the merchant
  says 'remove' or 'hide'.
- archive_product: archives a product. Available on Basic plan
  and above. Reversible.
- update_product: updates fields. The 'handle' field cannot be
  changed once orders exist. The 'vendor' field is case-sensitive
  for analytics grouping.
- run_query: runs a ShopifyQL query. customer_account_status uses
  enum values ('ENABLED', 'DISABLED', 'INVITED'), not the
  customer_tags field.
... (47 more tools)
"""

Every conversation pays the token cost of every rule. The model has to weigh the delete_product warning when the merchant is asking about themes. The customer_account_status enum note fires when the user wants a discount code. The system prompt becomes an unranked flat list, and the model’s attention gets diluted across rules that have nothing to do with the current turn.

The Sidekick team measured the consequences. Going from 20 tools with clear boundaries to 50 tools with overlapping functionality, evaluation scores went down even though each new tool was, individually, useful. They named the pattern after the symptom.

The fix: ship the rules with the tool result

The reframe is that an instruction’s value is not constant. It’s conditional on whether the model is about to make a decision the instruction affects. So instead of putting the delete_product warning in the system prompt where it has to compete with 49 other warnings on every turn, you put it in the return value of delete_product, where the model sees it in the exact turn it matters.

The system prompt collapses to the agent’s identity:

SYSTEM_PROMPT = """
You are Sidekick, an assistant for Shopify merchants.
Be concise. Confirm before destructive actions.
Use the tools available to you.
"""

The tools carry their own context:

def delete_product(product_id: str) -> dict:
    product = db.get_product(product_id)
    active_orders = db.count_active_orders(product_id)

    return {
        "ok": active_orders == 0,
        "product": product,
        "active_orders": active_orders,
        "instructions": [
            "Do not delete products with active_orders > 0.",
            "If the merchant said 'remove' or 'hide', prefer "
            "archive_product instead of delete_product.",
            "Confirm with the merchant before retrying.",
        ] if active_orders > 0 else [
            "Deletion is permanent. Confirm with the merchant "
            "before any retry.",
        ],
    }

The model now sees the rule in the same context window slot as the data the rule applies to. It does not need to remember the rule across a 4000-token system prompt. It reads the rule, reads the data, decides.

What this looks like in a loop

Pseudocode for the agent loop, simplified to show where the instructions live:

def run_turn(messages: list[dict]) -> str:
    while True:
        response = llm.chat(
            system=SYSTEM_PROMPT,    # short, stable
            messages=messages,
            tools=TOOL_SCHEMAS,      # name + signature only,
                                     # no behavioural rules
        )

        if response.stop_reason == "end_turn":
            return response.text

        for call in response.tool_calls:
            result = TOOLS[call.name](**call.args)
            # result["instructions"] travels with result["data"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })

Two things to notice. First, TOOL_SCHEMAS only contains the function signature: name, args, return type. It does not contain the behavioural notes. Those have moved into the runtime. Second, the instructions field is not a separate channel. It is a key inside the same JSON blob as the data, because the model already attends well to JSON it just consumed.

Why this is more than a refactor

A few things fall out of this once you commit to it.

The system prompt becomes a stable artifact. It changes when the agent changes (new identity, new high-level policy), not when a tool changes. Tool teams can ship new rules without touching the prompt that every other team depends on. This is the same reason you put validation logic next to the schema instead of in the controller.

Instructions become testable in isolation. A rule that lives inside delete_product can be unit-tested by calling delete_product and asserting on the instructions field. A rule buried in a 4000-token system prompt can only be tested by running the whole agent and hoping the failure mode shows up.

The context window stays scoped to the actual decision. If the conversation never invokes delete_product, the model never spends attention on deletion rules. This is the same insight as retrieval-augmented generation, applied to behavioural rules rather than to factual knowledge. Both are forms of late-binding context to the moment it matters.

Tool authors own tool semantics. The person writing update_product knows that handle can’t change after first order. Putting the rule in the tool means the rule lives next to the code that enforces it. The alternative is a system prompt edited by a different team that has to be kept in sync with backend behaviour by hand.

Where I think this generalizes

The deeper claim, if I have it right, is that a system prompt is the wrong abstraction for conditional instructions. A system prompt is good for unconditional rules (identity, tone, refusal policy). It is bad for rules that fire only in specific tool contexts, because the prompt has no notion of context other than “always.”

The right place for a conditional rule is wherever the condition is evaluated. For tool-specific rules, that is the tool. For data-shape-specific rules (“this product has no variants, so do not set inventory”), that is the data payload. For user-specific rules (“this merchant is on Basic plan, archive_product not available”), that is the user context object the agent receives at session start.

I think the pattern I will use when I prototype my own agents is something like this:

System prompt describes who the agent is. Identity, tone, refusal policy, nothing else.
Tool schemas describe what the agent can call. Signatures only.
Tool implementations return {data, instructions, errors} triples. Instructions are scoped to the decision the model is about to make next.
Anything that needs to change based on user, plan, or environment goes into the first user-turn payload, not the system prompt.

I’m sure there are failure modes I haven’t seen yet. Two I can already imagine: token cost grows with the number of tool calls, because instructions repeat on every invocation; and the model can ignore the inline instructions just as it can ignore system-prompt rules, so the gain is in attention allocation, not in compliance. But the attention-allocation argument is the strong one. Putting the right rule next to the right data, at the right moment, is a better use of a 200K context than putting every rule everywhere just in case.

Sources: Building production-ready agentic systems, Shopify Engineering. ICML 2025 Expo: Building Production-Ready Agentic Systems. All code in this post is illustrative, written by me to work through the idea, not copied from any internal source.