Federated Learning
This is the part I have not built yet, so these notes are more reading-and-reasoning than scar tissue. I am writing them down because federated learning is the natural next question after the on-device inference work, and the place I want to go deeper.
The setup: instead of collecting everyone’s data onto a server and training there, you ship the model to the devices, train locally on each one, and send back only the updates (the gradients, or the changed weights). The server averages the updates into a new global model and pushes it back out. Repeat. The raw data never moves. The canonical version is FedAvg, where each device takes several local training steps before reporting in and the server weights each device’s contribution by how much data it trained on. The extra local work cuts the number of communication rounds you need.
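To make the loop concrete for myself, here is a minimal sketch of it in plain Python. Everything in it is an assumption for illustration: the model is a one-variable linear fit, the "devices" are in-memory lists, and the names (`local_train`, `fed_avg`, `run_rounds`, `make_clients`) are mine, not from any library. It also uses noise-free data that every device agrees on, which is the easy case.

```python
import random

def local_train(weights, data, lr=0.1, local_steps=5):
    """One device: a few full-batch gradient steps on least squares for y = w*x + b."""
    w, b = weights
    for _ in range(local_steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def fed_avg(client_weights, client_sizes):
    """Server: average the returned weights, weighted by each device's example count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def run_rounds(clients, rounds=100):
    """Repeat: broadcast the global model, train locally, average the results."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_train(global_weights, data) for data in clients]
        global_weights = fed_avg(updates, [len(data) for data in clients])
    return global_weights

def make_clients(seed=0):
    """Three hypothetical devices, each seeing a different range of x for y = 3x + 1."""
    rng = random.Random(seed)
    clients = []
    for low, high, n in [(-1.0, 0.0, 20), (0.0, 1.0, 40), (-1.0, 1.0, 10)]:
        xs = [rng.uniform(low, high) for _ in range(n)]
        clients.append([(x, 3.0 * x + 1.0) for x in xs])
    return clients

if __name__ == "__main__":
    print(run_rounds(make_clients()))
```

The server only ever touches `(w, b)` pairs and example counts, never the `(x, y)` data. Raising `local_steps` is the FedAvg knob: more work per device per round, fewer rounds overall.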
It is the obvious extension of what I worked on. On-device inference keeps the data home at prediction time, when the model already exists. Federated learning tries to keep it home at training time, which is the harder half, because training is where the model actually absorbs information about individuals. If you can train without centralizing, you have closed the loop. The Firefox model dodged this entirely by training on a public crawl, so I have never had to solve it, and that is exactly why it interests me.
Where it gets hard, as far as I can tell
Three things, and I am sure the list is longer once you are actually in it.
The gradients leak. This is the one that genuinely surprised me reading into it. “We only send gradient updates, not data” sounds like a privacy guarantee, and it is not. There is a line of attacks (gradient leakage, gradient inversion) that reconstruct the original training examples from the updates alone, sometimes pixel-for-pixel. So federated learning by itself is not private. It is a system that enables privacy if you add the real protection on top, which is usually differential privacy on the updates plus secure aggregation so the server only ever sees the sum, never any one device’s contribution. Federated learning is the plumbing; the privacy is the stuff you bolt onto it, and people conflate the two constantly.
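The smallest case I could construct where the leak is exact: a linear model with a bias term, squared loss, and an update computed from a single example. The weight gradient is the input scaled by the residual, and the bias gradient is that same scale factor, so dividing one by the other returns the input. This is a toy under those stated assumptions, and the function names are hypothetical. The published attacks on deep networks recover inputs by optimization rather than a closed form, but the sketch shows why "only gradients" is not the same as "no data".

```python
def gradient_of_single_example(w, b, x, y):
    """Gradient of (w.x + b - y)^2 with respect to w and b, for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
    grad_w = [2 * err * xi for xi in x]  # input scaled by 2 * residual
    grad_b = 2 * err                     # the scale factor on its own
    return grad_w, grad_b

def reconstruct_input(grad_w, grad_b):
    """What a curious server can do with that update: divide out the scale."""
    if grad_b == 0:
        raise ValueError("zero residual: this gradient carries no signal")
    return [g / grad_b for g in grad_w]

if __name__ == "__main__":
    private_x = [0.2, -1.5, 3.0]
    grad_w, grad_b = gradient_of_single_example([0.1, 0.1, 0.1], 0.0, private_x, 1.0)
    print(reconstruct_input(grad_w, grad_b))
```

Averaging over a batch blurs this, and secure aggregation hides any one device's update from the server, which is why those protections are layered on top.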
The efficiency tax, which is my actual interest. This is the through-line from the on-device work. Now the expensive privacy machinery has to run on the constrained device, not a server. The costs stack up: DP-SGD’s per-example gradient clipping and noise, the communication cost of shipping updates from a phone on a metered connection, and the fact that the weakest devices are also the ones that drop out of training rounds. The clean math assumes resources the actual hardware does not have. This is exactly the systems-meets-privacy-meets-ML gap, and it is the thing I find most worth working on: not a new guarantee, but making an existing guarantee cheap enough that it survives contact with a real device.
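For reference, the clip-and-noise step at the core of DP-SGD, sketched in plain Python. The names (`clip_to_norm`, `private_mean_gradient`, `max_norm`, `noise_multiplier`) are my own, and the privacy accounting that turns the noise level into an actual epsilon is deliberately left out, so this shows the mechanics and the cost, not a guarantee.

```python
import math
import random

def clip_to_norm(grad, max_norm):
    """Scale one example's gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip every example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_to_norm(grad, max_norm)):
            total[i] += g
    # Noise is scaled to the clipping bound, the most any one example can contribute.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]

if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The expensive part is the input: it needs a separate gradient for every example rather than one gradient for the batch. Done naively, that multiplies gradient memory by the batch size, on the device least able to afford it.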
The data is not IID and the devices are not equal. Every device sees a different, skewed slice of the world (your phone’s keyboard data is not a random sample of everyone’s), and FedAvg’s convergence analysis leans on more uniformity than reality provides: with skewed data, each device’s local steps pull the model toward that device’s own optimum, and the average of those pulls need not be a good global model. The heterogeneity is statistical and physical at once: different distributions, different compute, different availability. Most of the open-problems literature seems to circle this.
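When I get to experiments, I will need a way to fake this skew on a single machine. One common approach in federated learning experiments is to split a labeled dataset across simulated devices using Dirichlet-distributed proportions per class. The sketch below is my own plain-Python version under that assumption. `dirichlet_label_split` and its parameters are hypothetical names, and it only models label skew, not the differences in compute or availability.

```python
import random
from collections import defaultdict

def dirichlet_label_split(labels, num_clients, alpha, seed=0):
    """Assign example indices to clients; smaller alpha gives more label skew."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Normalized gamma draws are a sample from a symmetric Dirichlet(alpha).
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws) or 1.0  # guard against every draw underflowing to zero
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += draws[c] / total
            # The last client takes the remainder so no index is dropped.
            end = len(indices) if c == num_clients - 1 else round(cumulative * len(indices))
            clients[c].extend(indices[start:end])
            start = end
    return clients

if __name__ == "__main__":
    labels = [i % 4 for i in range(200)]  # four balanced classes
    for client in dirichlet_label_split(labels, num_clients=5, alpha=0.3):
        counts = [sum(1 for i in client if labels[i] == k) for k in range(4)]
        print(len(client), counts)
```

With a large `alpha` every simulated device gets roughly the global label mix. With a small one most devices see only a class or two, which is closer to the keyboard-data situation.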
Why I want to work here
It is the same lesson as on-device inference, one level up and unsolved. There, the privacy was real only because the efficient version was the one that shipped. Federated learning with differential privacy is where that lesson is currently failing: the private, federated version is often too expensive to deploy, so what ships is the less-private shortcut. Closing that gap, on the device, under the budget, is the cutting-edge intersection of systems, privacy, and ML, and it is the direction I am taking out of the Firefox work.
Reading list for this is in the section notes.