ML Systems Design
ML systems design concerns the architecture decisions that connect a trained model to a production system serving real users. This article covers the end-to-end ML lifecycle, inference serving patterns, feature management, monitoring, and the organizational patterns that determine whether ML projects succeed in production.
The ML Lifecycle
A production ML system involves far more than model training. Sculley et al. (2015), in "Hidden Technical Debt in Machine Learning Systems," famously illustrated that the ML code in a production system is a small fraction of the total infrastructure:
| Component | Examples |
|---|---|
| Data collection | Logging pipelines, crawl infrastructure, labeling |
| Data validation | Schema checks, distribution monitoring, freshness |
| Feature engineering | Transforms, aggregations, embeddings |
| Training | Model selection, hyperparameter tuning, distributed training |
| Evaluation | Offline metrics, A/B testing, shadow mode |
| Serving | Inference API, batching, caching, fallback |
| Monitoring | Prediction quality, data drift, latency, errors |
| Retraining | Scheduled retraining, triggered retraining, versioning |
The training loop (feature engineering → training → evaluation) is iterated many times during development. Once deployed, the monitoring → retraining loop runs continuously.
Training vs Serving Skew
The most common failure mode in production ML is training-serving skew: the model sees different features at serving time than it saw at training time.
Sources of skew:
- Feature computation differences. Training features computed in batch (e.g., Spark) and serving features computed in real-time (e.g., application code) may implement the same logic differently.
- Temporal leakage. Training features computed with access to future data (e.g., using the full dataset’s statistics for normalization) that won’t be available at serving time.
- Data freshness. Training uses a snapshot; serving uses live data that may have drifted.
Mitigation: Use a feature store that serves the same feature computation logic for both training and inference. Log serving-time features and compare against training distributions.
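The logging-and-compare step can be sketched in a few lines. This is a minimal illustration, not any particular feature store's API; the feature names and `TRAINING_STATS` values are hypothetical:

```python
import math

# Per-feature summary statistics captured at training time
# (hypothetical values for illustration).
TRAINING_STATS = {
    "req_size_bytes": {"mean": 1250.0, "std": 840.0},
    "num_redirects": {"mean": 0.4, "std": 0.9},
}

def check_serving_batch(feature_rows, max_z=3.0):
    """Flag features whose serving-time mean drifts beyond
    max_z standard errors of the training-time mean."""
    alerts = []
    for name, stats in TRAINING_STATS.items():
        values = [row[name] for row in feature_rows if name in row]
        if not values:
            alerts.append((name, "missing at serving time"))
            continue
        mean = sum(values) / len(values)
        stderr = stats["std"] / math.sqrt(len(values))
        if abs(mean - stats["mean"]) > max_z * stderr:
            alerts.append((name, f"mean shifted to {mean:.1f}"))
    return alerts
```

In practice the serving-time rows come from sampled inference logs, and the comparison runs as a scheduled job rather than inline with requests.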
The tracker cost model addresses training-serving skew explicitly: the model trains on HTTP Archive data (Chrome, crawl traffic) and serves in Firefox (organic traffic). The paper argues that P(Y|X) is invariant (server-determined) while only P(X) differs, making this a covariate shift rather than concept drift. This is a textbook example of principled skew analysis.
Inference Patterns
Batch Inference
Compute predictions for all inputs in a batch job, store results, serve from cache.
| Property | Value |
|---|---|
| Latency | Not real-time (minutes to hours) |
| Throughput | High (GPU saturation) |
| Freshness | Stale (until next batch run) |
| Complexity | Low (standard ETL pipeline) |
| Use cases | Recommendations, risk scoring, content ranking |
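In outline, batch inference is an ordinary ETL job. This sketch uses a stand-in `score` function and a plain dict as the cache; a real pipeline would write to Redis or a warehouse table:

```python
def score(features):
    # Stand-in for a real model; here, a trivial linear rule.
    return 0.5 * features["clicks"] + 0.1 * features["age_days"]

def run_batch_job(feature_table):
    """Score every entity and materialize results into a cache."""
    return {eid: score(f) for eid, f in feature_table.items()}

def serve(cache, entity_id, default=0.0):
    # Serving is a cache lookup; fall back to a default for
    # entities unseen by the last batch run.
    return cache.get(entity_id, default)
```

The fallback default matters: between batch runs, new entities have no precomputed prediction, which is exactly the freshness cost noted in the table above.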
Real-Time Inference
Compute predictions on-demand in response to user requests.
| Property | Value |
|---|---|
| Latency | p99 < 100ms typical |
| Throughput | Variable (must handle traffic spikes) |
| Freshness | Current (features computed at request time) |
| Complexity | High (model serving infra, autoscaling) |
| Use cases | Search ranking, fraud detection, content moderation |
Near-Real-Time (Streaming)
Predictions triggered by events in a stream (Kafka, Kinesis). Intermediate between batch and real-time: fresher than batch, less latency-sensitive than synchronous serving.
Edge Inference
Model runs on the client device (browser, mobile, IoT). The tracker cost model uses this pattern: the ONNX model runs inside Firefox’s process, performing microsecond inference at request-block time with zero network latency. Edge inference eliminates server round-trips and works offline, but constrains model size and update frequency.
Feature Stores
A feature store provides a unified abstraction for feature management:
Offline store. Historical feature values for training. Typically backed by a data warehouse (BigQuery, Snowflake, S3/Parquet). Supports point-in-time queries to prevent temporal leakage: “what were the features for this entity at this timestamp?”
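The point-in-time lookup can be sketched as follows. The storage layout here is hypothetical; real offline stores implement this as a point-in-time join over partitioned files, but the invariant is the same:

```python
from bisect import bisect_right

def feature_as_of(history, entity_id, ts):
    """Return the latest feature value recorded at or before ts,
    so training never sees values from the future.

    history: {entity_id: [(timestamp, value), ...]} sorted by timestamp.
    """
    rows = history.get(entity_id, [])
    timestamps = [t for t, _ in rows]
    i = bisect_right(timestamps, ts)
    return rows[i - 1][1] if i > 0 else None
```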
Online store. Low-latency feature serving for real-time inference. Backed by Redis, DynamoDB, or a purpose-built store. Provides sub-millisecond reads for precomputed features.
Feature registry. Metadata catalog describing each feature: computation logic, owner, freshness SLA, data type, and lineage (which raw data sources it depends on).
Key benefit: Training and serving read from the same feature definitions, eliminating training-serving skew by construction.
Systems: Feast (open-source), Tecton, Vertex AI Feature Store, SageMaker Feature Store.
Model Serving
Model Formats
| Format | Framework | Deployment |
|---|---|---|
| ONNX | Framework-agnostic | ONNX Runtime (C++, optimized) |
| TorchScript | PyTorch | torch.jit.trace or torch.jit.script |
| SavedModel | TensorFlow | TF Serving |
| Core ML | Apple | On-device iOS/macOS |
ONNX is the de facto standard for cross-framework deployment. The tracker model exports XGBoost to ONNX (~500KB for 500 trees), enabling inference in Firefox’s C++ runtime without a Python dependency.
Serving Infrastructure
Model server. Triton Inference Server (NVIDIA), TorchServe (PyTorch), TF Serving (TensorFlow). These handle batching, GPU scheduling, model versioning, and health checks.
Dynamic batching. Collect individual inference requests into batches for GPU efficiency. This trades latency for throughput: waiting 5ms to accumulate a batch of 32 requests yields far higher total throughput than running 32 single-request forward passes.
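The accumulate-then-flush logic can be sketched without any server machinery (the batch size and timeout values are illustrative; production servers like Triton implement this with queues and worker threads):

```python
import time

class DynamicBatcher:
    """Accumulate requests until the batch is full or a deadline
    passes, then flush the whole batch to the model at once."""

    def __init__(self, max_batch=32, max_wait_s=0.005, clock=time.monotonic):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.pending = []
        self.deadline = None

    def submit(self, request):
        # First request in an empty batch starts the wait timer.
        if not self.pending:
            self.deadline = self.clock() + self.max_wait_s
        self.pending.append(request)
        if len(self.pending) >= self.max_batch or self.clock() >= self.deadline:
            return self.flush()
        return None  # caller polls flush() when the deadline passes

    def flush(self):
        batch, self.pending = self.pending, []
        return batch  # in a real server: model.predict(batch)
```

Injecting the clock makes the timeout path testable without real sleeps.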
Model versioning. Serve multiple model versions simultaneously for canary deployments. Route a fraction of traffic to the new model, monitor metrics, and gradually increase the fraction if metrics are healthy.
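Canary routing is commonly done with a stable hash of a user or request ID, so each user consistently hits the same model version across requests. A sketch (the function name and split scheme are illustrative):

```python
import hashlib

def pick_version(user_id, canary_fraction):
    """Deterministically route a fraction of users to the canary
    model, keeping each user's assignment stable across requests."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "champion"
```

Ramping the rollout is then just raising `canary_fraction`; users already assigned to the canary stay on it, which keeps per-user metrics clean.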
Monitoring and Observability
What to Monitor
Data quality. Feature distributions (mean, variance, percentiles), missing value rates, schema violations. Detect upstream data pipeline failures before they corrupt predictions.
Prediction quality. When ground truth is available (possibly delayed), compute online metrics and compare against offline baselines. When ground truth is unavailable, monitor proxy metrics and prediction distribution stability.
Model performance. Latency percentiles (p50, p95, p99), throughput, error rates, GPU/CPU utilization.
Data drift. Compare the distribution of serving-time features against the training distribution. Statistical tests (KS test, chi-squared, the Population Stability Index (PSI)) quantify drift magnitude. The tracker model includes drift analysis: training on June 2024 data and testing on September 2024 shows 30.3% MAE degradation, driven by URL path churn (only 14.9% of September paths were seen in June training).
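PSI compares the binned training and serving distributions. A minimal implementation, assuming bin edges (typically training-set quantiles) are precomputed:

```python
import math
from bisect import bisect_right

def psi(expected, actual, edges, eps=1e-4):
    """Population Stability Index between two samples over shared
    bin edges. Common rule of thumb: PSI < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 major shift."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps to avoid log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```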
Alerting
Threshold-based alerts for acute failures: latency spike, error rate increase, null prediction rate.
Drift-based alerts for gradual degradation: trigger retraining when feature distributions shift beyond a threshold. The tracker model recommends quarterly retraining based on the observed 30% three-month degradation.
Retraining Strategies
| Strategy | Trigger | Frequency | Complexity |
|---|---|---|---|
| Scheduled | Calendar | Daily/weekly/monthly | Low |
| Performance-triggered | Metric degradation | Variable | Medium |
| Continuous | New data arrives | Continuous | High |
The tracker model uses scheduled monthly retraining on fresh HTTP Archive crawls, delivered via Firefox Remote Settings (the same mechanism used for the Disconnect tracking protection list). This balances accuracy maintenance against operational complexity.
Champion-challenger deployment. Train the new model, evaluate against the current production model on a holdout, deploy only if the new model improves. This prevents regressions from data quality issues or training instabilities.
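The deployment gate itself is a small piece of logic. A sketch using a hypothetical MAE comparison (lower is better) with a relative-improvement margin to guard against noisy ties:

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def should_promote(champion_preds, challenger_preds, y_true, min_gain=0.01):
    """Promote the challenger only if it beats the champion on the
    holdout by at least a min_gain relative margin."""
    champ_mae = mean_absolute_error(y_true, champion_preds)
    chall_mae = mean_absolute_error(y_true, challenger_preds)
    return chall_mae < champ_mae * (1 - min_gain)
```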
The Data Flywheel
The most powerful pattern in production ML: model predictions generate user interactions, which generate training data, which improves the model.
Examples:
- Search ranking: better results → more clicks → more click-through data → better ranking
- Recommendations: better suggestions → more engagement → more preference signal → better recommendations
- The tracker model has a weaker flywheel: predictions surface cost estimates to users via the privacy dashboard, but user behavior doesn’t directly generate training labels (labels come from HTTP Archive crawls, not Firefox telemetry)
Cold start. The flywheel requires initial data to start. Strategies: rule-based systems, manual labeling, synthetic data, transfer learning from related domains.
Organizational Patterns
ML platform team. Provides shared infrastructure (feature store, model serving, experiment framework) that product ML teams build on. Amortizes infrastructure investment across teams.
Embedded ML engineers. ML engineers sit within product teams, owning the full stack from data to deployment. Reduces organizational boundaries but risks infrastructure fragmentation.
The hybrid model. Platform team owns infrastructure; product teams own models and features. This is the most common pattern at scale (Google, Meta, Spotify). The platform provides guardrails (monitoring, deployment, evaluation) while product teams retain modeling autonomy.
Summary
| Concern | Key Decision | Tradeoff |
|---|---|---|
| Inference | Batch vs real-time vs edge | Latency vs throughput vs freshness |
| Features | Feature store vs ad-hoc | Consistency vs setup cost |
| Serving | Model server vs custom | Flexibility vs operational burden |
| Monitoring | What metrics, what thresholds | Coverage vs alert fatigue |
| Retraining | Scheduled vs triggered | Freshness vs complexity |
| Organization | Platform vs embedded | Leverage vs autonomy |
The gap between a model that works in a notebook and a model that works in production is primarily an engineering gap, not a modeling gap. Systems design determines whether a model reaches users reliably, stays accurate over time, and improves as more data accumulates.