On-Device Inference

On-device inference is the part of privacy ML I actually built, so it is the part I trust myself on. These are notes from that, not a tutorial.

The pitch is simple and mostly true: if the model runs on the user’s machine, the raw data never leaves it, and a whole category of privacy problems disappears. There is no server log to leak, no retention policy to trust, no breach to worry about for data you never held. When I started I thought this was the hard-won prize. It turned out to be the easy part, and the interesting questions live in what “on-device” does and does not actually buy you.

What it buys, exactly

It buys you the input side. The data the model reads stays where it is. That is real and it is a lot.

What it does not buy you, and what took me a while to see clearly, is the rest of the chain. The model still has to come from somewhere, and if you trained it on user data you have only moved the privacy problem to training, not solved it. The output still exists once computed, and what you do with it is its own question. And the model file itself can leak, which I did not think about until I read the attack literature on membership inference and training-data extraction: a model is a lossy compression of its training set, and the compression is sometimes not lossy enough.

For the thing I worked on, the chain stayed clean almost by luck. The model was trained on a public HTTP Archive crawl, not on anyone’s browsing, so there was no training-data privacy problem to begin with. The output was an aggregate the user saw on their own new-tab page, going nowhere. So “on-device” really was the whole privacy story in that one case. The moment any of those conditions breaks, you are in the genuinely hard part of the field, and most real systems break at least one of them.

The Firefox setup made the “data never leaves” property structural rather than promised, which is the part I found elegant. Enhanced Tracking Protection cancels a tracker request before it goes out, based on a deterministic Disconnect-list check. The response never arrives, so the thing you want to report (how many bytes the user was saved) is unobservable by construction. There is nothing to collect even if you wanted to. So the number is not measured. It is predicted, on the device, from the only things visible at block time: the URL, the resource type, and the request metadata Gecko already holds on nsIChannel and nsILoadInfo. The privacy posture is not a policy decision someone could quietly reverse. It falls out of where in the request lifecycle the computation happens.
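To make “the only things visible at block time” concrete, here is a minimal Python sketch of turning a URL and a resource type into model inputs. It is not the Gecko code, which reads these off nsIChannel and nsILoadInfo. The function name and the particular feature names are hypothetical, chosen only to show that nothing here needs the response.

```python
from urllib.parse import urlsplit
import posixpath


def block_time_features(url: str, resource_type: str) -> dict:
    """Derive inputs from what is known when the request is cancelled.

    Hypothetical feature set for illustration; nothing here touches a response.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    # Extension of the last path segment, e.g. ".js", or "" when there is none.
    _, ext = posixpath.splitext(path)
    return {
        "host": (parts.hostname or "").lower(),
        "path_depth": len([seg for seg in path.split("/") if seg]),
        "extension": ext.lower(),
        "query_param_count": len([p for p in parts.query.split("&") if p]),
        "url_length": len(url),
        "resource_type": resource_type,
    }


features = block_time_features(
    "https://tracker.example/js/v2/analytics.js?id=1&cb=2", "script"
)
```

Every field comes from the request alone, which is the structural point: the estimator can only ever see what the blocker already saw.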

Why efficiency is the actual problem

A model that runs on the user’s device lives inside budgets a server never imposes. It ships in the application binary or an update channel, so its size is a cost paid by every user on every update. Its inference competes with the user-visible work the device is already doing. Its memory comes from the same pool as the application. On a server you scale sideways and the user is waiting on the network anyway; none of these bind. On-device, they are the design, full stop.

This is where my whole project actually lived. The accurate estimator was a lookup table keyed on the full URL path, and it was the right answer in every way except size: keyed that finely, it extrapolated to roughly 187 MB at deployment scale, which is a non-starter to ship in a browser binary that downloads to everyone. Key the table more coarsely, on domain alone, and it shrinks but goes badly inaccurate, because within a single tracker domain the response size swings across three orders of magnitude. The cost is in the path, not the domain, and the accurate version of “the path” was too big to ship. That gap was the entire problem.
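The granularity trade-off is easy to see in a toy version of the two tables. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the median is just one plausible per-key statistic, and the byte sizes are made up, not taken from the crawl.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Coarse key: the host only."""
    return urlsplit(url).hostname or ""


def path_key(url: str) -> str:
    """Fine key: host plus full path (query string dropped)."""
    parts = urlsplit(url)
    return (parts.hostname or "") + parts.path


def build_table(samples, key_fn):
    """Map each key to the median observed response size in bytes."""
    buckets = defaultdict(list)
    for url, size in samples:
        buckets[key_fn(url)].append(size)
    return {key: median(sizes) for key, sizes in buckets.items()}


# Made-up sizes, for illustration only.
samples = [
    ("https://tracker.example/pixel.gif", 0),
    ("https://tracker.example/pixel.gif", 0),
    ("https://tracker.example/sdk/full.js", 90_000),
    ("https://tracker.example/sdk/full.js", 92_000),
]

by_domain = build_table(samples, domain_key)  # one entry, wrong for both URLs
by_path = build_table(samples, path_key)      # one entry per distinct path
```

The domain table stays tiny but answers 45,000 bytes for a pixel that returns nothing. The path table is right for both, and its entry count grows with every distinct path seen.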

A gradient-boosted model closes it by generalizing across similar URLs instead of storing each one, and it lands at the path table’s accuracy as a roughly 500 KB ONNX artifact, small enough to ride the browser’s existing update channels. Exporting to ONNX mattered for a specific reason: it runs through Firefox’s existing inference path in Gecko, on CPU, single-threaded, no GPU assumption, which is the only thing you can assume across a quarter-billion machines of wildly different capability. Inference takes microseconds and runs asynchronously after the block decision, so it never sits on the critical path of a page load. The block is deterministic and instant; the cost prediction is a background task that updates a weekly tally and gates nothing the user waits on.
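The ordering described above (decide first, estimate later, gate nothing) can be sketched with a queue and a single background worker. This is a Python illustration of the pattern only, not how Gecko schedules the work. The class name, its methods, and the `predict` callable are hypothetical stand-ins for the real model call.

```python
import queue
import threading


class SavingsEstimator:
    """Accumulates predicted bytes saved, off the caller's thread."""

    def __init__(self, predict):
        self._predict = predict          # hypothetical stand-in for the model
        self._queue = queue.Queue()
        self._lock = threading.Lock()
        self.weekly_bytes = 0.0
        # One worker thread: single-threaded inference, in the background.
        threading.Thread(target=self._run, daemon=True).start()

    def on_blocked(self, url: str) -> None:
        """Called after the block decision; only enqueues, never predicts inline."""
        self._queue.put(url)

    def _run(self) -> None:
        while True:
            url = self._queue.get()
            try:
                estimate = self._predict(url)
                with self._lock:
                    self.weekly_bytes += estimate
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Wait until every queued URL has been processed (useful in tests)."""
        self._queue.join()


estimator = SavingsEstimator(predict=lambda url: 100.0)
estimator.on_blocked("https://tracker.example/a.js")
estimator.on_blocked("https://tracker.example/b.js")
estimator.flush()
```

The caller of `on_blocked` pays for one enqueue and returns, so a slow prediction delays the tally and nothing else.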

One detail I am still a little proud of, because it is where the modeling met the constraint. The distribution of tracker response sizes is brutal for a model: something like 40% of blocked requests return exactly zero bytes (beacons, tracking pixels, cookie-sync endpoints with empty bodies), and the rest run a heavy tail out to hundreds of KB. Ordinary squared-error loss spends all its capacity getting the zeros and the small stuff right and badly underweights the rare large scripts, which are exactly the ones that dominate the number the user sees. Borrowing the Tweedie loss that actuaries use for the same zero-inflated, heavy-tailed shape (it is how insurance claims are distributed too) was worth more accuracy than any architecture or feature change. Loss function over model family, on a distribution shaped like that.
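For readers who have not met it, here is a small sketch of the Tweedie deviance and the gradient a boosting library would need for it. Two things are my assumptions for illustration, not details from the project: the variance power p = 1.5 and the log link (the model predicts a score and the mean is exp(score)). The formulas are the standard ones for 1 < p < 2.

```python
import math


def tweedie_deviance(y: float, mu: float, p: float = 1.5) -> float:
    """Unit Tweedie deviance for 1 < p < 2, with y >= 0 and mu > 0."""
    if not 1.0 < p < 2.0:
        raise ValueError("this sketch covers only 1 < p < 2")
    if y < 0 or mu <= 0:
        raise ValueError("need y >= 0 and mu > 0")
    # The first term vanishes at y == 0, so exact zeros are in the support.
    a = y ** (2 - p) / ((1 - p) * (2 - p)) if y > 0 else 0.0
    b = y * mu ** (1 - p) / (1 - p)
    c = mu ** (2 - p) / (2 - p)
    return 2.0 * (a - b + c)


def tweedie_grad_hess(y: float, score: float, p: float = 1.5):
    """Gradient and Hessian of the Tweedie negative log-likelihood
    with respect to the raw score, where mu = exp(score)."""
    e1 = math.exp((1 - p) * score)
    e2 = math.exp((2 - p) * score)
    grad = -y * e1 + e2
    hess = -y * (1 - p) * e1 + (2 - p) * e2
    return grad, hess


zero_case = tweedie_deviance(0.0, 100.0)     # finite: a zero target is legal
exact_case = tweedie_deviance(100.0, 100.0)  # zero when prediction equals target
grad_at_optimum, hess_at_optimum = tweedie_grad_hess(100.0, math.log(100.0))
```

The useful property is that a target of exactly zero has a finite, well-defined loss while the mean stays strictly positive, which is the shape of the problem: many empty responses and a long tail of large ones.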

And the assumption the whole thing rested on, which I tested rather than asserted: the model trains on a Chrome-based crawl and runs in Firefox, so it only works if the same tracker URL returns the same bytes regardless of which browser asked. That is a covariate-shift bet. A paired Firefox/Chrome fetch agreed on byte size the overwhelming majority of the time, and an in-page Firefox crawl on the real deployment distribution held the advantage, so the bet was sound. The broader point is that on-device deployment forces you to validate train-versus-deploy drift directly, because you cannot watch it from a server once it ships. The model updates quarterly through Remote Settings, the same channel Firefox uses to push the tracker list itself, on a cadence set by how fast the measured advantage decays.
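The paired check reduces to a simple statistic: for each URL fetched by both browsers, do the two byte counts agree? A minimal sketch follows. How agreement was actually defined is not stated above, so the relative tolerance is an assumption, the function name is hypothetical, and the sample pairs are made up.

```python
def agreement_rate(pairs, rel_tol: float = 0.0) -> float:
    """Fraction of (size_a, size_b) pairs that match within rel_tol
    of the larger size. rel_tol=0.0 means exact byte equality."""
    if not pairs:
        raise ValueError("need at least one pair")
    agree = 0
    for size_a, size_b in pairs:
        larger = max(size_a, size_b)
        # Two empty responses agree trivially.
        if larger == 0 or abs(size_a - size_b) <= rel_tol * larger:
            agree += 1
    return agree / len(pairs)


# Made-up (firefox_bytes, chrome_bytes) pairs, for illustration only.
pairs = [(0, 0), (1200, 1200), (50_000, 50_500), (300, 900)]

exact = agreement_rate(pairs)            # exact match only
loose = agreement_rate(pairs, 0.02)      # within 2% of the larger size
```

Reporting the rate at more than one tolerance helps separate small encoding differences from URLs that genuinely serve different content per browser.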

The privacy guarantee was downstream of the byte budget the whole way through. I did not expect that going in, and it is now the first thing I look for in any on-device privacy claim: not “is it private in principle” but “is the private version the one that actually shipped.” The full systems write-up of the build, with the multi-process Gecko pipeline that feeds the model and the numbers behind all of the above, is on the Firefox page. This page is the lesson I took from it.

The seam to private training

Keeping inference on-device is the solved, shippable end of this. Keeping training on-device is the open, hard end, and it is where I am pointing next. That is the federated learning question: train locally, send only updates, and the same efficiency tax shows up immediately, because now the expensive privacy machinery has to run on the constrained device too. The lesson transfers exactly. A private training scheme protects people only if it is cheap enough to run where the data is, and the cost of the privacy, not the model, is usually what decides whether it ever ships.
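“Train locally, send only updates” looks like this in its barest form: plain federated averaging on a two-parameter linear model. This is a toy sketch with hypothetical function names and made-up data. It deliberately omits the expensive parts (secure aggregation, differential privacy, compression), which are exactly the costs the paragraph above is about.

```python
def local_update(weights, data, lr=0.1):
    """Runs on the device: one SGD pass over local (x, y) pairs for the
    model y ~ w * x + b. Returns only the weight delta, never the data."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return (w - weights[0], b - weights[1])


def federated_round(weights, client_datasets, lr=0.1):
    """Runs on the server: average client deltas, weighted by sample count."""
    total = sum(len(data) for data in client_datasets)
    dw = db = 0.0
    for data in client_datasets:
        delta_w, delta_b = local_update(weights, data, lr)
        share = len(data) / total
        dw += share * delta_w
        db += share * delta_b
    return (weights[0] + dw, weights[1] + db)


# Made-up local datasets, all drawn from y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]

weights = (0.0, 0.0)
for _ in range(1000):
    weights = federated_round(weights, clients)
```

Even this stripped version makes the device do the training work. Every privacy mechanism layered on top adds compute and bandwidth to the same constrained side.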