Deployment and Delivery
The entire ML pipeline (BigQuery extraction, feature engineering, XGBoost training, quantile regression, conformal prediction) produces one thing: a 125KB JSON file.
Output Artifact
Not an ONNX model running in the browser. Not a TensorFlow Lite inference engine. Not a model versioning system with A/B testing. A static JSON file. 4,592 entries, 960KB raw, 125KB gzipped. Firefox loads it into a Map and does key lookups. That’s the product.
This is a deliberate design decision, not a limitation.
Offline vs Runtime Inference
The obvious approach for “ML in the browser” is shipping a trained model and running inference at request time. ONNX Runtime for Web exists. XGBoost models convert to ONNX cleanly. The three models total maybe 70-100KB gzipped. So why not?
The model has nothing new to learn at runtime. The input features (transfer size, script ratio, waterfall position) come from HTTP Archive crawl data, not from the user’s browsing session. When Firefox blocks a request to googletagmanager.com, it doesn’t know the request’s transfer size or content type, because the request was blocked. All the model’s features are pre-computable from public crawl data. There’s no new information at block time that would change the prediction.
The Disconnect list is finite and known. Firefox blocks domains on the Disconnect list. The list has ~4,600 tracker domains. The lookup table covers all of them. There’s no “new domain at runtime” problem; when a new domain gets added to the Disconnect list, it gets scored in the next pipeline run.
A lookup table is simpler, faster, and more debuggable. A Map lookup is O(1). There’s no inference latency, no ONNX runtime dependency, no model loading on startup. If a score looks wrong, you look it up in the JSON file. If you want to change a score, you edit the JSON. The entire system is transparent.
The ML is the pipeline, not the product. The models exist to generalize from the 2,185 domains with Lighthouse CPU data to the 2,407 without. Once that generalization is done and the scores are written to JSON, the models have served their purpose. Shipping them to users would add complexity for zero benefit.
Lookup Table Structure
4,592 domains. Each entry has point estimates, prediction intervals, data provenance, and a confidence flag:
```json
{
  "www.googletagmanager.com": {
    "cpu": 0.828, "cpu_lo": 0.67, "cpu_hi": 0.95,
    "network": 0.901, "network_lo": 0.87, "network_hi": 0.94,
    "source": "measured", "confident": true
  },
  "bat.bing.com": {
    "cpu": 0.524, "cpu_lo": 0.28, "cpu_hi": 0.69,
    "network": 0.422, "network_lo": 0.39, "network_hi": 0.45,
    "source": "measured", "confident": true
  }
}
```
Composition
| Category | Count | Percentage | Meaning |
|---|---|---|---|
| Total domains | 4,592 | 100% | Every tracker on the Disconnect list with sufficient HTTP Archive data |
| Measured (source: “measured”) | 2,185 | 48% | CPU score computed directly from Lighthouse data |
| Predicted (source: “predicted”) | 2,407 | 52% | CPU score predicted by the XGBoost model |
| Confident | 4,110 | 89% | Conformal interval narrow enough to trust |
| Uncertain | 482 | 11% | Wide interval, prediction less reliable |
The 89%/11% confident/uncertain split comes from conformal prediction. A domain is marked “confident” if it has real Lighthouse CPU data, or if the model’s conformal interval for it is narrower than the width threshold. The 482 uncertain domains are ones where request features don’t clearly indicate CPU cost; the model is honest about not knowing.
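The flag assignment can be sketched as below. The 0.30 width cutoff is an illustrative assumption, not the pipeline's actual threshold:

```javascript
// Sketch of how the "confident" flag could be assigned per lookup-table entry.
// WIDTH_THRESHOLD is an assumed value; the real pipeline's cutoff may differ.
const WIDTH_THRESHOLD = 0.3;

function isConfident(entry) {
  // Measured domains have real Lighthouse CPU data, so they are trusted outright.
  if (entry.source === "measured") return true;
  // Predicted domains are trusted only if the conformal interval is narrow.
  return entry.cpu_hi - entry.cpu_lo <= WIDTH_THRESHOLD;
}

isConfident({ source: "measured", cpu_lo: 0.28, cpu_hi: 0.69 }); // true
isConfident({ source: "predicted", cpu_lo: 0.28, cpu_hi: 0.69 }); // false: width 0.41
```

The key property is that the flag is computed once in the pipeline and shipped as data, so the browser never needs to reason about intervals at runtime.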
Representative Entries
The scores pass the intuition check. Domains I have strong priors about land where they should.
| Domain | CPU | Network | Story |
|---|---|---|---|
| www.googletagmanager.com | 0.83 | 0.90 | Heaviest tracker on both axes. 305ms scripting, 92-183KB bundles. |
| cdn.cookielaw.org | 0.79 | 0.84 | Consent manager. Heavy despite not render-blocking. |
| connect.facebook.net | 0.64 | 0.83 | Facebook SDK. 100% scripts, large transfers. |
| www.google-analytics.com | 0.53 | 0.46 | Not lightweight. ~130ms scripting from session management and GA4. |
| bat.bing.com | 0.52 | 0.42 | 258-byte transfer, 105ms CPU. Transfer size lies. |
The bat.bing.com entry is the one I’d highlight in a conversation about this project. It looks harmless in every request-level metric: tiny transfer, single request, no scripts in the resource type field. But Lighthouse measured 105ms of scripting time. The model learned this pattern from the labeled data and can now flag similar deceptive domains.
cdn.cookielaw.org is another good example. It’s the consent manager that was the only tracker in the render-blocking audit’s top 20. I dropped the render_delay target because of how sparse it was, but the CPU and network scores still capture that this domain is expensive. The performance cost didn’t disappear; it just shows up on the CPU axis instead of a dedicated render-delay axis.
Firefox Integration
The integration is a thin module that loads the JSON into a Map and exposes a lookup API. No parsing logic, no inference, no dependencies beyond the JSON file itself.
```javascript
// TrackerRiskScorer.sys.mjs
const SCORES_URL =
  "chrome://browser/content/tracker_risk_scores.json";

let scoresMap = null;

async function ensureLoaded() {
  if (scoresMap) return;
  const response = await fetch(SCORES_URL);
  const data = await response.json();
  scoresMap = new Map(Object.entries(data));
}

export async function getTrackerScore(domain) {
  await ensureLoaded();
  // Try exact match first, then strip subdomains
  let entry = scoresMap.get(domain);
  if (!entry) {
    const parts = domain.split(".");
    for (let i = 1; i < parts.length - 1; i++) {
      const suffix = parts.slice(i).join(".");
      entry = scoresMap.get(suffix);
      if (entry) break;
    }
  }
  if (!entry) return null;
  return {
    cpu: entry.cpu,
    cpuInterval: [entry.cpu_lo, entry.cpu_hi],
    network: entry.network,
    networkInterval: [entry.network_lo, entry.network_hi],
    source: entry.source,
    confident: entry.confident,
  };
}

export async function summarizeBlockedTrackers(domains) {
  await ensureLoaded();
  let totalCpu = 0;
  let totalNetwork = 0;
  let scored = 0;
  for (const domain of domains) {
    const score = await getTrackerScore(domain);
    if (score) {
      totalCpu += score.cpu;
      totalNetwork += score.network;
      scored++;
    }
  }
  return { totalCpu, totalNetwork, scored, total: domains.length };
}
```
The summarizeBlockedTrackers function is what the privacy metrics card would call. Pass it the list of domains Firefox blocked this week, get back aggregate CPU and network cost. The card could display something like: “Firefox blocked 47 trackers this week, preventing an estimated 890ms of background CPU usage and 2.3MB of network transfers.”
The suffix-stripping in getTrackerScore handles the same subdomain matching problem from the original Disconnect list join. HTTP Archive records stats.g.doubleclick.net; the lookup table might have the entry under doubleclick.net or stats.g.doubleclick.net. Try the exact match first, then progressively strip subdomains.
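The fallback order is easiest to see in isolation. This helper mirrors the loop inside getTrackerScore and is illustrative only; it stops before the bare TLD, which is why "net" is never tried:

```javascript
// Generate the lookup candidates getTrackerScore tries, in order:
// the exact host first, then each parent domain down to (but not
// including) the top-level domain.
function lookupCandidates(domain) {
  const parts = domain.split(".");
  const candidates = [domain];
  for (let i = 1; i < parts.length - 1; i++) {
    candidates.push(parts.slice(i).join("."));
  }
  return candidates;
}

lookupCandidates("stats.g.doubleclick.net");
// ["stats.g.doubleclick.net", "g.doubleclick.net", "doubleclick.net"]
```

One known limitation of naive suffix-stripping: it treats "co.uk"-style public suffixes like any other label, so a public-suffix-aware match (eTLD+1) would be more robust if multi-label TLDs ever matter here.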
Privacy Metrics Card Integration
The current about:protections page shows a count: “47 trackers blocked this week.” With the lookup table, the card could show per-axis summaries:
What the user sees (hypothetically):
- “Firefox blocked 47 trackers this week”
- “Prevented ~890ms of background CPU usage”
- “Saved ~2.3MB of network bandwidth”
I’m still not sure how much to trust these aggregate numbers. The per-domain error is ~6 percentile points on average, but when you sum across 47 trackers the errors could compound or cancel out. Probably needs some simulation work before these go into production UI copy.
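A first pass at that simulation could look like the sketch below. Modeling the ~6-point per-domain error as independent uniform noise is an assumption; correlated errors (e.g., the model systematically over-scoring one tracker family) would widen the aggregate:

```javascript
// Monte Carlo sketch: sum scores for N blocked trackers, each perturbed by
// independent noise of up to ±0.06 (the ~6-percentile-point error), and
// track the largest deviation of the noisy sum from the true sum.
function simulateAggregateError({ nTrackers = 47, perDomainError = 0.06, trials = 10000 } = {}) {
  const trueScores = Array.from({ length: nTrackers }, () => Math.random());
  const trueTotal = trueScores.reduce((a, b) => a + b, 0);
  let maxAbsDeviation = 0;
  for (let t = 0; t < trials; t++) {
    let total = 0;
    for (const s of trueScores) {
      total += s + (Math.random() * 2 - 1) * perDomainError;
    }
    maxAbsDeviation = Math.max(maxAbsDeviation, Math.abs(total - trueTotal));
  }
  // worstCase is the bound if every error pointed the same direction.
  return { trueTotal, maxAbsDeviation, worstCase: nTrackers * perDomainError };
}

const result = simulateAggregateError();
// Under the independence assumption, errors mostly cancel: the observed
// deviation stays well below the worst case of 47 * 0.06 = 2.82.
```

This only probes the independent-noise case; the correlated case is the one that would actually need investigating before production copy.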
What happens behind the scenes:
- TrackingDBService provides the list of blocked domains for the time period.
- For each domain, look up the score in the Map.
- Convert the [0, 1] scores back to approximate real units using reference values (e.g., a CPU score of 0.83 maps to roughly 305ms based on the percentile rank calibration).
- Sum across all blocked domains per axis.
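The unit-conversion step could be sketched as linear interpolation over a small reference table. Only the 0.83 → ~305ms point comes from the calibration described above; the other anchors here are hypothetical placeholders:

```javascript
// Sketch: convert a [0, 1] CPU score back to approximate milliseconds via
// linear interpolation over reference anchors. Only the 0.83 -> 305ms pair
// is from the calibration; the 0.5 and 1.0 anchors are invented for illustration.
const CPU_MS_ANCHORS = [
  [0.0, 0],
  [0.5, 100],
  [0.83, 305],
  [1.0, 600],
];

function cpuScoreToMs(score) {
  for (let i = 1; i < CPU_MS_ANCHORS.length; i++) {
    const [s0, ms0] = CPU_MS_ANCHORS[i - 1];
    const [s1, ms1] = CPU_MS_ANCHORS[i];
    if (score <= s1) {
      // Linear interpolation between the two surrounding anchors.
      return ms0 + ((score - s0) / (s1 - s0)) * (ms1 - ms0);
    }
  }
  return CPU_MS_ANCHORS[CPU_MS_ANCHORS.length - 1][1];
}

cpuScoreToMs(0.83); // 305
```

In practice the real conversion would carry the full percentile-rank tables from the pipeline rather than a handful of anchors, but the shape of the computation is the same.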
The per-axis breakdown lets the card emphasize different things on different devices. On mobile (where bandwidth matters more), highlight the network savings. On desktop (where CPU matters more), highlight the CPU savings. The two-axis design makes this possible without any changes to the scoring pipeline.
The confidence flag could drive the messaging precision. If most blocked trackers are “confident,” show exact numbers. If many are “uncertain,” soften the language: “approximately” or “at least.”
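One hypothetical way to wire that up; the 0.8 confident-ratio cutoff and the copy strings are invented for illustration, not shipped heuristics:

```javascript
// Hypothetical sketch: pick card copy based on how many of the blocked
// trackers carry confident scores. The 0.8 cutoff is an invented threshold.
function cpuSavingsCopy(scores, estimatedMs) {
  const confident = scores.filter((s) => s.confident).length;
  const ratio = scores.length ? confident / scores.length : 0;
  // Mostly-confident: state the number plainly. Otherwise, hedge it.
  const qualifier = ratio >= 0.8 ? "" : "approximately ";
  return `Prevented ${qualifier}${estimatedMs}ms of background CPU usage`;
}

cpuSavingsCopy(
  [{ confident: true }, { confident: true }, { confident: false }],
  890
); // "Prevented approximately 890ms of background CPU usage"
```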
Maintenance and Update Cycle
The lookup table should be regenerated periodically as web behavior changes. The pipeline reruns on fresh HTTP Archive data:
- Monthly: Pull new crawl data from BigQuery. HTTP Archive publishes monthly crawls.
- On Disconnect list updates: When new domains are added to the Disconnect list, run the pipeline to score them.
- Ship via Remote Settings: Firefox’s existing Remote Settings infrastructure delivers the updated JSON. No binary update needed.
The pipeline is fully automated: BigQuery queries, feature extraction, model prediction, lookup table generation. The only manual step is triggering it.
Between updates, the lookup table is static. A domain’s score doesn’t change based on the user’s browsing. This is fine; tracker behavior is stable over weeks. googletagmanager.com doesn’t suddenly become lightweight. The monthly refresh catches gradual shifts.