Privacy Attacks on ML Models
I came at this backwards. I built a privacy system first and read the attack literature second, and reading it reframed everything I had done, because the attacks are the reason any of the defenses exist. Differential privacy, federated learning, secure aggregation: they are all answers to specific things people figured out how to extract from models. Without the attacks the defenses look like paranoia. With them they look overdue.
The core uncomfortable fact, which I did not fully appreciate until I sat with it: a trained model is a lossy compression of its training data, and the compression is sometimes not lossy enough. You ship the model and think you are shipping a function. You may also be shipping a leaky archive of the people in the training set.
The attacks, roughly in order of how much they unsettled me
Membership inference. The starting point. Given a trained model and a data record, decide whether that record was in the training set. It works because models behave measurably differently on examples they saw during training (more confident, lower loss) than on ones they did not. That sounds mild until you make it concrete: “was this person’s record in the hospital’s training set” can itself be the sensitive fact, regardless of what the model predicts. The early versions trained shadow models to detect the signal. The sharper recent versions frame it as a per-record likelihood-ratio test and are judged by how many members they catch at a very low false-positive rate. That matters because a privacy attack that only works on average is not the same threat as one that works confidently on a specific person.
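The loss gap described above can be turned into an attack in a few lines. What follows is a minimal sketch, not any paper's reference implementation: the function names are hypothetical, the shadow-model losses are made-up toy numbers, and the Gaussian fit is a simplifying assumption about how the "trained with this record" and "trained without it" losses are distributed.

```python
import math

def cross_entropy(p_true):
    """Loss of a classifier that gives probability p_true to the correct label."""
    return -math.log(p_true)

def threshold_attack(loss, threshold):
    """Simplest attack: guess 'member' when the loss falls below one global threshold."""
    return loss < threshold

def fit_gaussian(samples):
    """Sample mean and (unbiased) standard deviation."""
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / (len(samples) - 1)
    return mu, math.sqrt(var)

def gaussian_logpdf(x, mu, sigma):
    """Log density computed directly, so far-out values do not underflow to log(0)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def likelihood_ratio_score(observed_loss, in_losses, out_losses):
    """Per-record score: log p(loss | record was trained on) - log p(loss | it was not).

    in_losses / out_losses are this record's losses under shadow models trained
    with and without it. Positive means 'member' is the better explanation.
    """
    mu_in, sd_in = fit_gaussian(in_losses)
    mu_out, sd_out = fit_gaussian(out_losses)
    return (gaussian_logpdf(observed_loss, mu_in, sd_in)
            - gaussian_logpdf(observed_loss, mu_out, sd_out))

# Toy numbers, invented for illustration only.
shadow_in = [0.02, 0.05, 0.03, 0.04]    # losses when the record was in training
shadow_out = [0.90, 1.10, 0.80, 1.20]   # losses when it was held out

confident_loss = cross_entropy(0.97)    # target model is very sure about this record
unsure_loss = cross_entropy(0.40)       # target model is not

member_guess = threshold_attack(confident_loss, threshold=0.2)
member_score = likelihood_ratio_score(confident_loss, shadow_in, shadow_out)
nonmember_score = likelihood_ratio_score(unsure_loss, shadow_in, shadow_out)
```

The difference between the two attacks is calibration. The global threshold treats every record the same, while the likelihood ratio asks what is normal for this particular record, which is what lets it make confident calls about individuals.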
Memorization and extraction. Worse, and more visceral. Large models do not just leak whether you were in the set, they sometimes regurgitate the set. People have prompted language models into emitting verbatim phone numbers, emails, and addresses from training data, and pulled near-verbatim training images back out of diffusion models. This is the attack that makes “the data never leaves the building” insufficient as a privacy story, because the model is the thing that leaves the building, and the data can ride along inside it.
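One crude way to check for regurgitation, if you hold the training text yourself, is to look for long token runs in model output that also appear verbatim in the training set. This is a sketch of that check under simple assumptions (whitespace tokenization, exact match only); `verbatim_overlaps` is a hypothetical helper, and the strings are fabricated placeholders, not real model output or real contact details.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlaps(generated, training_docs, n=5):
    """Return every n-token window of `generated` that appears verbatim in training."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc.split(), n)
    out_tokens = generated.split()
    hits = []
    for i in range(len(out_tokens) - n + 1):
        window = tuple(out_tokens[i:i + n])
        if window in train_grams:
            hits.append(" ".join(window))
    return hits

# Placeholder data, invented for illustration.
training_docs = [
    "for support call Jane Placeholder at 555-0100 or write to jane@example.com",
    "the quarterly report is due on the first Monday of the month",
]
leaky_output = "you can reach her: call Jane Placeholder at 555-0100 or try again later"
clean_output = "please contact the support desk during business hours for help"

leaky_hits = verbatim_overlaps(leaky_output, training_docs)
clean_hits = verbatim_overlaps(clean_output, training_docs)
```

Exact matching undercounts: it misses near-verbatim leaks such as reformatted phone numbers or lightly paraphrased text, so an empty result is weak evidence of safety.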
Outlier exposure. The cruel structural detail: models overfit unusual individuals more than common ones. The person with the rare condition, the unusual name, the strange feature combination is the easiest to extract and the most harmed by extraction. Privacy failures concentrate on exactly the people with the most to lose. Any honest defense has to be measured on the tail, not the average.
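Measuring on the tail can be as simple as refusing to report one number. The sketch below assumes you already have a membership-attack score per record and some grouping of records into common and rare; the field names, the threshold, and the scores are all invented for illustration.

```python
def true_positive_rate(records, threshold):
    """Fraction of actual training members the attack flags at this threshold."""
    members = [r for r in records if r["member"]]
    if not members:
        return 0.0
    hits = sum(1 for r in members if r["score"] > threshold)
    return hits / len(members)

def tpr_by_group(records, threshold):
    """Same metric, broken out per group so the tail cannot hide in the average."""
    groups = sorted({r["group"] for r in records})
    return {
        g: true_positive_rate([r for r in records if r["group"] == g], threshold)
        for g in groups
    }

def make(group, member, scores):
    return [{"group": group, "member": member, "score": s} for s in scores]

# Toy attack scores, invented for illustration.
records = (
    make("common", True, [0.1, 0.3, 0.2, 0.4, 0.2, 0.1, 0.3, 2.5])
    + make("rare", True, [3.1, 4.0])
    + make("common", False, [0.0, 0.2, -0.1, 0.3])
    + make("rare", False, [0.1, 0.4])
)

threshold = 2.0  # chosen so that no non-member in this toy set is flagged
overall = true_positive_rate(records, threshold)
per_group = tpr_by_group(records, threshold)
```

In this toy set the attack catches 30% of members overall, which sounds tolerable, and every member of the rare group, which does not.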
Gradient leakage. The one most relevant to where I am headed. In federated learning the whole pitch is “we send gradient updates, not data.” Then it turns out you can reconstruct the original training examples from the gradients alone, sometimes with startling fidelity. This is the attack that kills the naive reading of federated learning as automatically private and forces the differential-privacy-plus-secure-aggregation machinery on top.
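The simplest case makes the leak easy to see. For a single linear layer with a bias, trained on one example, the weight gradient is the input scaled by the bias gradient, so dividing one by the other returns the input exactly. The sketch below shows only that special case, with hypothetical names and toy values; attacks on deep networks and batched updates generally have to search for an input whose gradients match the observed ones, and recover less.

```python
def forward(w, b, x):
    """Linear model: y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def gradients(w, b, x, target):
    """Gradients of the squared loss 0.5 * (y - target)^2 for one example.

    dL/dw_i = (y - target) * x_i
    dL/db   = (y - target)
    This is the 'update' a federated client would send instead of x.
    """
    residual = forward(w, b, x) - target
    grad_w = [residual * xi for xi in x]
    grad_b = residual
    return grad_w, grad_b

def reconstruct_input(grad_w, grad_b):
    """Recover x from the gradients alone: x_i = (dL/dw_i) / (dL/db)."""
    if grad_b == 0:
        raise ValueError("zero bias gradient: the example left no signal to invert")
    return [g / grad_b for g in grad_w]

# Toy values, invented for illustration.
w, b = [0.5, -1.0, 2.0], 0.1
private_x = [3.0, 7.0, -2.0]   # never sent to the server
target = 1.0

grad_w, grad_b = gradients(w, b, private_x, target)   # this is what gets sent
recovered_x = reconstruct_input(grad_w, grad_b)
```

Note what `reconstruct_input` does not need: the weights, the label, or the loss value. The gradients alone are enough, which is why "we only send gradients" is not a privacy guarantee by itself.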
How this reframed my own work
The Firefox model I built was almost untouchable by all of this, and reading the attacks is what let me see why, precisely. It trained on a public web crawl, so membership inference is meaningless (everything was already public) and there is no private training set to extract. Its output is an aggregate the user sees, not a per-person prediction served to anyone. The threat model the attacks describe simply does not have a foothold.
That is not a brag, it is the opposite: it taught me that my project was on the easy side of privacy ML, the side where the data was never private to begin with. The hard side is exactly where these attacks bite: training on real user data, serving a model to the world, defending the tail. That is the side differential privacy and federated learning are built for, and the side I have not worked on yet. Knowing which side I have actually done is the most useful thing the attack literature gave me.
Reading list, including the specific attack papers, is in the section notes.