ML Systems Design
ML systems design concerns the architecture decisions that connect a trained model to a production system serving real users. This article covers the end-to-end ML lifecycle, inference serving patterns, feature management, monitoring, and the organizational patterns that determine whether ML projects succeed in production.
The ML Lifecycle
A production ML system involves far more than model training. Sculley et al. (2015), in "Hidden Technical Debt in Machine Learning Systems," famously illustrated that the ML code in a production system is a small fraction of the total infrastructure:
| Component | Examples |
|---|---|
| Data collection | Logging pipelines, crawl infrastructure, labeling |
| Data validation | Schema checks, distribution monitoring, freshness |
| Feature engineering | Transforms, aggregations, embeddings |
| Training | Model selection, hyperparameter tuning, distributed training |
| Evaluation | Offline metrics, A/B testing, shadow mode |
| Serving | Inference API, batching, caching, fallback |
| Monitoring | Prediction quality, data drift, latency, errors |
| Retraining | Scheduled retraining, triggered retraining, versioning |
The training loop (feature engineering → training → evaluation) is iterated many times during development. Once deployed, the monitoring → retraining loop runs continuously.
Training vs Serving Skew
The most common failure mode in production ML is training-serving skew: the model sees different features at serving time than it saw at training time.
Sources of skew:
- Feature computation differences. Training features computed in batch (e.g., Spark) and serving features computed in real-time (e.g., application code) may implement the same logic differently.
- Temporal leakage. Training features computed with access to future data (e.g., using the full dataset’s statistics for normalization) that won’t be available at serving time.
- Data freshness. Training uses a snapshot; serving uses live data that may have drifted.
Mitigation: Use a feature store that serves the same feature computation logic for both training and inference. Log serving-time features and compare against training distributions.
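The logging-and-compare step can be sketched in a few lines. This is a minimal illustration, not any particular feature store's API; the feature names and `TRAINING_STATS` values are hypothetical:

```python
import math

# Per-feature summary statistics captured at training time
# (hypothetical values for illustration).
TRAINING_STATS = {
    "req_size_bytes": {"mean": 1250.0, "std": 840.0},
    "num_redirects": {"mean": 0.4, "std": 0.9},
}

def check_serving_batch(feature_rows, max_z=3.0):
    """Flag features whose serving-time mean drifts beyond
    max_z standard errors of the training-time mean."""
    alerts = []
    for name, stats in TRAINING_STATS.items():
        values = [row[name] for row in feature_rows if name in row]
        if not values:
            alerts.append((name, "missing at serving time"))
            continue
        mean = sum(values) / len(values)
        stderr = stats["std"] / math.sqrt(len(values))
        if abs(mean - stats["mean"]) > max_z * stderr:
            alerts.append((name, f"mean shifted to {mean:.1f}"))
    return alerts
```

In practice the serving-time rows come from sampled inference logs, and the comparison runs as a scheduled job rather than inline with requests.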
The tracker cost model addresses training-serving skew explicitly: the model trains on HTTP Archive data (Chrome, crawl traffic) and serves in Firefox (organic traffic). The paper argues that P(Y|X) is invariant (server-determined) while only P(X) differs, making this a covariate shift rather than concept drift. This is a textbook example of principled skew analysis.
Inference Patterns
Batch Inference
Compute predictions for all inputs in a batch job, store results, serve from cache.
| Property | Value |
|---|---|
| Latency | Not real-time (minutes to hours) |
| Throughput | High (GPU saturation) |
| Freshness | Stale (until next batch run) |
| Complexity | Low (standard ETL pipeline) |
| Use cases | Recommendations, risk scoring, content ranking |
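In outline, batch inference is an ordinary ETL job. This sketch uses a stand-in `score` function and a plain dict as the cache; a real pipeline would write to Redis or a warehouse table:

```python
def score(features):
    # Stand-in for a real model; here, a trivial linear rule.
    return 0.5 * features["clicks"] + 0.1 * features["age_days"]

def run_batch_job(feature_table):
    """Score every entity and materialize results into a cache."""
    return {eid: score(f) for eid, f in feature_table.items()}

def serve(cache, entity_id, default=0.0):
    # Serving is a cache lookup; fall back to a default for
    # entities unseen by the last batch run.
    return cache.get(entity_id, default)
```

The fallback default matters: between batch runs, new entities have no precomputed prediction, which is exactly the freshness cost noted in the table above.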
Real-Time Inference
Compute predictions on-demand in response to user requests.
| Property | Value |
|---|---|
| Latency | p99 < 100ms typical |
| Throughput | Variable (must handle traffic spikes) |
| Freshness | Current (features computed at request time) |
| Complexity | High (model serving infra, autoscaling) |
| Use cases | Search ranking, fraud detection, content moderation |
Near-Real-Time (Streaming)
Predictions triggered by events in a stream (Kafka, Kinesis). Intermediate between batch and real-time: fresher than batch, less latency-sensitive than synchronous serving.
Edge Inference
Model runs on the client device (browser, mobile, IoT). The tracker cost model uses this pattern: the ONNX model runs inside Firefox’s process, performing microsecond inference at request-block time with zero network latency. Edge inference eliminates server round-trips and works offline, but constrains model size and update frequency.
Feature Stores
A feature store provides a unified abstraction for feature management:
Offline store. Historical feature values for training. Typically backed by a data warehouse (BigQuery, Snowflake, S3/Parquet). Supports point-in-time queries to prevent temporal leakage: “what were the features for this entity at this timestamp?”
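The point-in-time lookup can be sketched as follows. The storage layout here is hypothetical; real offline stores implement this as a point-in-time join over partitioned files, but the invariant is the same:

```python
from bisect import bisect_right

def feature_as_of(history, entity_id, ts):
    """Return the latest feature value recorded at or before ts,
    so training never sees values from the future.

    history: {entity_id: [(timestamp, value), ...]} sorted by timestamp.
    """
    rows = history.get(entity_id, [])
    timestamps = [t for t, _ in rows]
    i = bisect_right(timestamps, ts)
    return rows[i - 1][1] if i > 0 else None
```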
Online store. Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or a purpose-built store. Provides sub-millisecond reads for precomputed features.
Feature registry. Metadata catalog describing each feature: computation logic, owner, freshness SLA, data type, and lineage (which raw data sources it depends on).
Key benefit: Training and serving read from the same feature definitions, eliminating training-serving skew by construction.
Systems: Feast (open-source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.
Model Serving
Model Formats
| Format | Framework | Deployment |
|---|---|---|
| ONNX | Framework-agnostic | ONNX Runtime (C++, optimized) |
| TorchScript | PyTorch | torch.jit.trace or torch.jit.script |
| SavedModel | TensorFlow | TF Serving |
| Core ML | Apple | On-device iOS/macOS |
ONNX is the de facto standard for cross-framework deployment. The tracker model exports XGBoost to ONNX (~500KB for 500 trees), enabling inference in Firefox’s C++ runtime without a Python dependency.
Serving Infrastructure
Model server. Triton Inference Server (NVIDIA), TorchServe (PyTorch), TF Serving (TensorFlow). These handle batching, GPU scheduling, model versioning, and health checks.
Dynamic batching. Collect individual inference requests into batches for GPU efficiency. This trades latency for throughput: waiting 5ms to accumulate a batch of 32 requests yields far higher total throughput than running 32 single-request forward passes.
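The accumulate-then-flush logic can be sketched without any server machinery (the batch size and timeout values are illustrative; production servers like Triton implement this with queues and worker threads):

```python
import time

class DynamicBatcher:
    """Accumulate requests until the batch is full or a deadline
    passes, then flush the whole batch to the model at once."""

    def __init__(self, max_batch=32, max_wait_s=0.005, clock=time.monotonic):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.pending = []
        self.deadline = None

    def submit(self, request):
        # First request in an empty batch starts the wait timer.
        if not self.pending:
            self.deadline = self.clock() + self.max_wait_s
        self.pending.append(request)
        if len(self.pending) >= self.max_batch or self.clock() >= self.deadline:
            return self.flush()
        return None  # caller polls flush() when the deadline passes

    def flush(self):
        batch, self.pending = self.pending, []
        return batch  # in a real server: model.predict(batch)
```

Injecting the clock makes the timeout path testable without real sleeps.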
Model versioning. Serve multiple model versions simultaneously for canary deployments. Route a fraction of traffic to the new model, monitor metrics, and gradually increase the fraction if metrics are healthy.
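Canary routing is commonly done with a stable hash of a user or request ID, so each user consistently hits the same model version across requests. A sketch (the function name and split scheme are illustrative):

```python
import hashlib

def pick_version(user_id, canary_fraction):
    """Deterministically route a fraction of users to the canary
    model, keeping each user's assignment stable across requests."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "champion"
```

Ramping the rollout is then just raising `canary_fraction`; users already assigned to the canary stay on it, which keeps per-user metrics clean.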
Monitoring and Observability
What to Monitor
Data quality. Feature distributions (mean, variance, percentiles), missing value rates, schema violations. Detect upstream data pipeline failures before they corrupt predictions.
Prediction quality. When ground truth is available (possibly delayed), compute online metrics and compare against offline baselines. When ground truth is unavailable, monitor proxy metrics and prediction distribution stability.
Model performance. Latency percentiles (p50, p95, p99), throughput, error rates, GPU/CPU utilization.
Data drift. Compare the distribution of serving-time features against the training distribution. Statistical tests (KS test, chi-squared, the Population Stability Index (PSI)) quantify drift magnitude. The tracker model includes drift analysis: training on June 2024 data and testing on September 2024 shows 30.3% MAE degradation, driven by URL path churn (only 14.9% of September paths were seen in June training).
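PSI compares the binned training and serving distributions. A minimal implementation, assuming bin edges (typically training-set quantiles) are precomputed:

```python
import math
from bisect import bisect_right

def psi(expected, actual, edges, eps=1e-4):
    """Population Stability Index between two samples over shared
    bin edges. Common rule of thumb: PSI < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 major shift."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps to avoid log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```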
Alerting
Threshold-based alerts for acute failures: latency spike, error rate increase, null prediction rate.
Drift-based alerts for gradual degradation: trigger retraining when feature distributions shift beyond a threshold. The tracker model recommends quarterly retraining based on the observed 30% three-month degradation.
Retraining Strategies
| Strategy | Trigger | Frequency | Complexity |
|---|---|---|---|
| Scheduled | Calendar | Daily/weekly/monthly | Low |
| Performance-triggered | Metric degradation | Variable | Medium |
| Continuous | New data arrives | Continuous | High |
The tracker model uses scheduled monthly retraining on fresh HTTP Archive crawls, delivered via Firefox Remote Settings (the same mechanism used for the Disconnect tracking protection list). This balances accuracy maintenance against operational complexity.
Champion-challenger deployment. Train the new model, evaluate against the current production model on a holdout, deploy only if the new model improves. This prevents regressions from data quality issues or training instabilities.
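The deployment gate itself is a small piece of logic. A sketch using a hypothetical MAE comparison (lower is better) with a relative-improvement margin to guard against noisy ties:

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def should_promote(champion_preds, challenger_preds, y_true, min_gain=0.01):
    """Promote the challenger only if it beats the champion on the
    holdout by at least a min_gain relative margin."""
    champ_mae = mean_absolute_error(y_true, champion_preds)
    chall_mae = mean_absolute_error(y_true, challenger_preds)
    return chall_mae < champ_mae * (1 - min_gain)
```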
The Data Flywheel
The most powerful pattern in production ML: model predictions generate user interactions, which generate training data, which improves the model.
Examples:
- Search ranking: better results → more clicks → more click-through data → better ranking
- Recommendations: better suggestions → more engagement → more preference signal → better recommendations
- The tracker model has a weaker flywheel: predictions surface cost estimates to users via the privacy dashboard, but user behavior doesn’t directly generate training labels (labels come from HTTP Archive crawls, not Firefox telemetry)
Cold start. The flywheel requires initial data to start. Strategies: rule-based systems, manual labeling, synthetic data, transfer learning from related domains.
Organizational Patterns
ML platform team. Provides shared infrastructure (feature store, model serving, experiment framework) that product ML teams build on. Amortizes infrastructure investment across teams.
Embedded ML engineers. ML engineers sit within product teams, owning the full stack from data to deployment. Reduces organizational boundaries but risks infrastructure fragmentation.
The hybrid model. Platform team owns infrastructure; product teams own models and features. This is the most common pattern at scale (Google, Meta, Spotify). The platform provides guardrails (monitoring, deployment, evaluation) while product teams retain modeling autonomy.
Summary
| Concern | Key Decision | Tradeoff |
|---|---|---|
| Inference | Batch vs real-time vs edge | Latency vs throughput vs freshness |
| Features | Feature store vs ad-hoc | Consistency vs setup cost |
| Serving | Model server vs custom | Flexibility vs operational burden |
| Monitoring | What metrics, what thresholds | Coverage vs alert fatigue |
| Retraining | Scheduled vs triggered | Freshness vs complexity |
| Organization | Platform vs embedded | Leverage vs autonomy |
The gap between a model that works in a notebook and a model that works in production is primarily an engineering gap, not a modeling gap. Systems design determines whether a model reaches users reliably, stays accurate over time, and improves as more data accumulates.