Loss Functions and Optimization

The loss function is the single most consequential modeling decision you make. It encodes your assumptions about the error structure of the data — what kinds of mistakes are costly, what the noise looks like, and what “good” means. In the tracker cost model, switching from squared error to Tweedie loss on identical XGBoost features and identical architecture reduced MAE from 4,527 to 3,466 — a 23% improvement that exceeds the gap between any two architectures tested. The loss function is not an optimization detail. It is the model’s inductive bias.


Loss Functions as Inductive Bias

Every loss function corresponds to a probabilistic assumption about the data-generating process. Minimizing a loss is equivalent to maximum likelihood estimation under a specific noise model:

  • MSE assumes the targets are corrupted by Gaussian noise with constant variance: $y = f(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Minimizing MSE is equivalent to maximizing $\prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)$.

  • Cross-entropy assumes the targets follow a Bernoulli (binary) or Categorical distribution. Minimizing cross-entropy is equivalent to maximizing the likelihood of the observed labels under the model’s predicted probabilities.

  • Tweedie assumes the targets follow a compound Poisson-Gamma distribution from the exponential dispersion family. This is the natural choice for data that is zero-inflated and right-skewed — insurance claims, ad costs, resource consumption.

The loss function determines the estimand. MSE targets the conditional mean $\mathbb{E}[Y|\mathbf{x}]$. MAE targets the conditional median. Quantile loss targets arbitrary quantiles. These are different statistical quantities, and the choice between them is a modeling decision.
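This distinction can be checked numerically: fitting a single constant to a skewed sample under each loss recovers different statistics. A minimal sketch (the sample values are hypothetical):

```python
import numpy as np

# Hypothetical right-skewed sample: one large value pulls the mean
# far above the median.
y = np.array([0.0, 0.0, 1.0, 2.0, 100.0])

# Grid-search the constant prediction c that minimizes each loss.
c = np.linspace(0.0, 100.0, 100001)
mse = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - c[None, :]).mean(axis=0)

best_mse = c[mse.argmin()]  # recovers the mean of y (20.6)
best_mae = c[mae.argmin()]  # recovers the median of y (1.0)
```

The MSE-optimal constant is the sample mean; the MAE-optimal constant is the sample median, which here differ by a factor of twenty.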


Regression Losses

Mean Squared Error (MSE)

$$L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The gradient with respect to the prediction:

$$\frac{\partial L}{\partial \hat{y}_i} = 2(\hat{y}_i - y_i)$$

MSE penalizes errors quadratically. An error of 10 costs 100x more than an error of 1. This makes MSE highly sensitive to outliers — a single large residual dominates the loss. The quadratic penalty is desirable when large errors are genuinely worse (e.g., financial risk) and problematic when outliers are noise.

MSE is convex, smooth, and has a unique global minimum. Its gradient is proportional to the residual, giving large updates for large errors and small updates for small errors. This is numerically stable and well-conditioned for optimization.

Mean Absolute Error (MAE / L1 Loss)

$$L_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

The gradient:

$$\frac{\partial L}{\partial \hat{y}_i} = \text{sign}(\hat{y}_i - y_i)$$

MAE penalizes errors linearly. An error of 10 costs exactly 10x more than an error of 1. This makes MAE robust to outliers — the influence of any single observation is bounded.

The tradeoff: MAE’s gradient is constant in magnitude ($\pm 1$), regardless of whether the error is 0.001 or 1000. Near convergence, this causes oscillation. MAE is also non-differentiable at zero, which requires subgradient methods or smoothing.

MAE targets the conditional median rather than the conditional mean. For symmetric distributions, these coincide. For skewed distributions, the median is more robust and the distinction matters.

Huber Loss

$$L_\delta(\hat{y}, y) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Huber loss is quadratic for small errors and linear for large errors. The transition point $\delta$ is a hyperparameter that controls when the loss switches from MSE-like to MAE-like behavior. This combines the smoothness and numerical stability of MSE near zero with the outlier robustness of MAE in the tails.

The gradient:

$$\frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} \hat{y} - y & \text{if } |y - \hat{y}| \leq \delta \\ \delta \cdot \text{sign}(\hat{y} - y) & \text{otherwise} \end{cases}$$

$\delta$ is typically set via cross-validation. A large $\delta$ makes Huber behave like MSE; a small $\delta$ makes it behave like MAE. In practice, $\delta = 1.0$ or $\delta = 1.35$ (close to the classical robust-statistics constant 1.345, which gives 95% asymptotic efficiency under Gaussian noise) are common starting points.
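A minimal numpy sketch of the piecewise definition; writing the gradient as a clipped residual makes the bounded-influence property explicit (helper names are illustrative):

```python
import numpy as np

def huber(y_hat, y, delta=1.35):
    """Huber loss: quadratic inside |r| <= delta, linear outside."""
    r = np.abs(y - y_hat)
    return np.where(r <= delta, 0.5 * r**2, delta * r - 0.5 * delta**2)

def huber_grad(y_hat, y, delta=1.35):
    """Gradient w.r.t. y_hat: the residual, clipped to [-delta, delta]."""
    return np.clip(y_hat - y, -delta, delta)

small = huber(np.array(0.5), np.array(0.0))        # 0.125, MSE-like region
capped = huber_grad(np.array(100.0), np.array(0.0))  # 1.35, influence is bounded
```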

Log-Cosh Loss

$$L_{\text{log-cosh}} = \sum_{i=1}^{n} \log(\cosh(\hat{y}_i - y_i))$$

Log-cosh is a smooth approximation to Huber loss. For small errors, $\log(\cosh(x)) \approx \frac{x^2}{2}$ (quadratic, like MSE). For large errors, $\log(\cosh(x)) \approx |x| - \log(2)$ (linear, like MAE). The key advantage over Huber: log-cosh is twice-differentiable everywhere, which matters for second-order optimizers (Newton’s method, XGBoost’s approximation).
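Both limits can be verified numerically. The sketch below uses the algebraically equivalent form $|x| + \log(1 + e^{-2|x|}) - \log 2$, a standard stability trick (not from the text above) that avoids overflow in $\cosh$ for large $|x|$:

```python
import numpy as np

def log_cosh(r):
    # log(cosh(r)) rewritten to avoid overflow for large |r|
    a = np.abs(r)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

quad_regime = log_cosh(0.01)   # ~ 0.01**2 / 2
lin_regime = log_cosh(20.0)    # ~ 20 - log(2)
```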

Tweedie Loss

Tweedie distributions belong to the exponential dispersion family with variance function $\text{Var}(Y) = \phi \cdot \mu^p$, where $p$ is the power parameter and $\phi$ is the dispersion. For $1 < p < 2$, the Tweedie distribution is a compound Poisson-Gamma — it places a point mass at zero and has a continuous right-skewed distribution for positive values. This is the natural model for zero-inflated, positive, right-skewed data.

The Tweedie deviance loss:

$$L_{\text{Tweedie}} = \sum_{i=1}^{n} \left[ -\frac{y_i \cdot \hat{y}_i^{1-p}}{1-p} + \frac{\hat{y}_i^{2-p}}{2-p} \right]$$

The gradient with respect to $\hat{y}_i$:

$$\frac{\partial L}{\partial \hat{y}_i} = -y_i \cdot \hat{y}_i^{-p} + \hat{y}_i^{1-p}$$

The power parameter $p$ controls how the penalty scales with the prediction magnitude. When $p = 1.5$, the penalty for overestimation scales as $\hat{y}^{0.5}$. This down-weights the contribution of near-zero predictions relative to large ones — the model does not waste capacity trying to distinguish “very small” from “zero.”
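For gradient boosting, the model typically predicts $z = \log \hat{y}$ so that $\hat{y} = e^z$ stays positive (to the best of my knowledge this is how XGBoost's built-in Tweedie objective is parameterized). Differentiating the deviance through that substitution gives the following sketch:

```python
import numpy as np

def tweedie_grad_hess(z, y, p=1.5):
    """Gradient and Hessian of the Tweedie deviance w.r.t. the log-prediction z.

    L = -y * mu^(1-p) / (1-p) + mu^(2-p) / (2-p), with mu = exp(z).
    """
    grad = -y * np.exp((1.0 - p) * z) + np.exp((2.0 - p) * z)
    hess = -(1.0 - p) * y * np.exp((1.0 - p) * z) + (2.0 - p) * np.exp((2.0 - p) * z)
    return grad, hess

# Gradient vanishes when the prediction matches the target...
g_match, h_match = tweedie_grad_hess(np.log(3.0), 3.0)
# ...and near-zero predictions contribute near-zero gradient signal.
g_small, _ = tweedie_grad_hess(np.log(1e-3), 0.0)
```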

The tracker model result. The ad tracker cost estimation model predicts daily cost per tracker-domain pair. The data has 39.5% exact zeros (beacons, pixels) and a heavy right tail (script-based trackers costing thousands). Squared error spends gradient signal trying to minimize error on thousands of near-zero observations, pulling predictions toward zero across the board. Tweedie at $p = 1.5$ effectively tells the model: errors on the zero-heavy portion matter less than errors on the high-cost scripts.

| Loss Function | MAE |
| --- | --- |
| Tweedie $p = 1.5$ | 3,466 |
| Tweedie $p = 1.2$ | 3,486 |
| Tweedie $p = 1.8$ | 3,597 |
| Squared error | 4,527 |

The 23% gap between Tweedie $p = 1.5$ and squared error — on identical architecture, identical features — is larger than the gap between any two model architectures evaluated. The loss function dominated the architecture choice.

Quantile Loss

$$L_\tau(\hat{y}, y) = \begin{cases} \tau \cdot (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (1 - \tau) \cdot (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$

Equivalently: $L_\tau = \max(\tau(y - \hat{y}), (\tau - 1)(y - \hat{y}))$.

Quantile loss applies an asymmetric penalty. Setting $\tau = 0.5$ gives equal weight to over- and under-prediction, recovering MAE up to a constant factor (median regression). Setting $\tau = 0.9$ penalizes under-prediction 9x more than over-prediction, estimating the 90th percentile of $Y|\mathbf{x}$.

Quantile regression is the standard tool for constructing prediction intervals. Training two models at τ=0.05\tau = 0.05 and τ=0.95\tau = 0.95 yields a 90% prediction interval. Unlike parametric intervals from Gaussian assumptions, quantile regression intervals adapt to heteroscedastic and asymmetric error distributions.
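A quick numerical check that the pinball loss recovers empirical quantiles; the sample and search grid here are illustrative:

```python
import numpy as np

def pinball(y_hat, y, tau):
    """Quantile (pinball) loss averaged over the sample."""
    r = y - y_hat
    return np.maximum(tau * r, (tau - 1.0) * r).mean()

y = np.arange(1.0, 101.0)                      # 1, 2, ..., 100
grid = np.linspace(0.0, 101.0, 10101)
losses = np.array([pinball(c, y, tau=0.9) for c in grid])
best = grid[losses.argmin()]                   # lands in [90, 91], the 0.9 quantile
```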


Classification Losses

Binary Cross-Entropy (Log Loss)

$$L_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

where $p_i = \sigma(z_i)$ is the sigmoid of the logit $z_i$.

The gradient with respect to the logit:

$$\frac{\partial L}{\partial z_i} = p_i - y_i$$

This gradient is elegant: it is simply the difference between the predicted probability and the true label. When $y = 1$ and $p = 0.9$, the gradient is $-0.1$ (small push toward 1). When $y = 1$ and $p = 0.01$, the gradient is $-0.99$ (large push toward 1). The update magnitude is proportional to the model’s error in probability space.

Binary cross-entropy is the negative log-likelihood of the Bernoulli distribution. Minimizing BCE is equivalent to maximum likelihood estimation for logistic regression. The loss is convex in the logits $z$, guaranteeing a unique global minimum for linear models.
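The $p - y$ form of the gradient can be confirmed against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy for a single example, as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)  # central difference
analytic = sigmoid(z) - y                                  # the p - y form
```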

Categorical Cross-Entropy

$$L_{\text{CCE}} = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log(p_{ik})$$

where $p_{ik} = \text{softmax}(z_i)_k = \frac{e^{z_{ik}}}{\sum_{j=1}^K e^{z_{ij}}}$.

This is the multi-class generalization. The one-hot target $\mathbf{y}_i$ selects a single term from the sum, so the loss simplifies to $-\log(p_{i,c})$ where $c$ is the true class. The gradient through softmax has the same form as the binary case: $\frac{\partial L}{\partial z_{ik}} = p_{ik} - y_{ik}$.

Focal Loss

Lin et al. (2017) introduced focal loss to address extreme class imbalance in object detection, where background examples outnumber foreground by roughly 100,000 to 1.

$$L_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where $p_t = p$ if $y = 1$ and $p_t = 1 - p$ if $y = 0$, and $\alpha_t$ is a class-balancing weight.

The $(1 - p_t)^\gamma$ term is the modulating factor. When the model is confident and correct ($p_t \to 1$), this factor approaches zero, down-weighting the contribution of easy examples. When the model is wrong ($p_t \to 0$), the factor approaches 1, preserving the full loss.

With $\gamma = 0$, focal loss reduces to standard cross-entropy. With $\gamma = 2$ (the standard choice), an example classified with $p_t = 0.9$ contributes 100x less to the loss than with standard cross-entropy. This focuses training on the hard, misclassified examples rather than the easy, well-classified majority.
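The 100x figure follows directly from the modulating factor and can be checked in a few lines (the function name is illustrative):

```python
import numpy as np

def focal(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for a single example with true-class probability p_t."""
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

ce = -np.log(0.9)       # standard cross-entropy contribution at p_t = 0.9
fl = focal(0.9)         # focal contribution with gamma = 2
ratio = ce / fl         # (1 - 0.9)^(-2) = 100: easy example down-weighted 100x
```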

Label Smoothing

Replace hard targets $(0, 1)$ with soft targets:

$$y_k^{\text{smooth}} = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } k = c \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}$$

where $\epsilon$ is the smoothing parameter (typically 0.1) and $K$ is the number of classes.

Label smoothing prevents the model from becoming overconfident. With hard targets, the optimal logit for the correct class is ++\infty, encouraging unbounded weights. Smoothed targets cap the optimal logit at a finite value, acting as implicit regularization. Empirically, label smoothing improves calibration (predicted probabilities match observed frequencies) and often improves generalization, particularly in knowledge distillation settings.
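Constructing the smoothed target is a two-liner; note that the entries still sum to 1:

```python
import numpy as np

def smooth_labels(true_class, K, eps=0.1):
    """Soft target vector: eps/K everywhere, plus 1 - eps on the true class."""
    y = np.full(K, eps / K)
    y[true_class] += 1.0 - eps
    return y

target = smooth_labels(2, 4)   # [0.025, 0.025, 0.925, 0.025]
```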

Contrastive Loss and InfoNCE

Contrastive learning trains embeddings by pulling positive pairs together and pushing negative pairs apart.

InfoNCE (Oord et al., 2018), used in CLIP, SimCLR, and embedding models for RAG:

$$L_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_j^+) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$$

where $\text{sim}$ is cosine similarity, $\mathbf{z}_j^+$ is the positive pair, and $\tau$ is the temperature parameter. This is a softmax over similarities — it maximizes the log probability that the positive pair is selected among all candidates.

The temperature $\tau$ controls the sharpness of the distribution over negatives. Low $\tau$ creates a harder contrastive task (the model must sharply distinguish the positive from hard negatives). High $\tau$ smooths the distribution, making the task easier. Typical values range from 0.05 to 0.5.

InfoNCE is a lower bound on the mutual information between the anchor and positive views. The quality of the learned representation depends critically on the number and difficulty of the negative examples.
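A minimal numpy version of the loss for one anchor against a candidate set (the log-sum-exp stabilization is an implementation detail, not part of the formula above):

```python
import numpy as np

def info_nce(anchor, candidates, pos_idx, tau=0.1):
    """-log softmax over cosine similarities; candidates include the positive."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = (c @ a) / tau
    m = sims.max()
    log_z = m + np.log(np.exp(sims - m).sum())   # stable log-partition
    return log_z - sims[pos_idx]                 # = -log p(positive)

anchor = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
easy = info_nce(anchor, cands, pos_idx=0, tau=0.5)  # aligned positive: low loss
hard = info_nce(anchor, cands, pos_idx=2, tau=0.5)  # opposed "positive": high loss
```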


The Loss-Architecture Interaction

The tracker model provides a clean ablation of loss function impact. The experimental setup: XGBoost with 80 engineered features (domain embeddings, temporal aggregates, content-type indicators). The only variable changed is the loss function.

| Configuration | MAE | Relative to Best |
| --- | --- | --- |
| Tweedie $p = 1.5$ | 3,466 | baseline |
| Tweedie $p = 1.2$ | 3,486 | +0.6% |
| Tweedie $p = 1.8$ | 3,597 | +3.8% |
| Squared error | 4,527 | +30.6% |

The mechanism is specific and interpretable. With 39.5% exact zeros in the target, squared error’s gradient $2(\hat{y} - y)$ assigns the same per-unit penalty to a $1 error on a beacon (true cost: $0) as to a $1 error on a $10,000 script. The model allocates capacity to minimizing residuals on the zero-heavy portion of the distribution, where marginal improvement is cheap but useless.

Tweedie’s gradient $-y \cdot \hat{y}^{-p} + \hat{y}^{1-p}$ naturally scales with the prediction magnitude. Near-zero predictions produce near-zero gradients. High-value predictions produce large gradients. The model is free to focus capacity on the heavy right tail where errors are consequential.

This finding generalizes. Whenever the data has specific distributional structure — zero inflation, heavy tails, heteroscedastic variance, asymmetric costs — matching the loss function to that structure is the highest-leverage intervention. It is cheaper to evaluate than architecture search (one hyperparameter vs. many) and often yields larger improvements.

The power parameter $p$ within the Tweedie family provides additional fine-tuning. The narrow range $p \in [1.2, 1.8]$ spans a 3.8% performance gap. Values closer to 1 (Poisson-like) emphasize the count nature of events; values closer to 2 (Gamma-like) emphasize the continuous positive magnitude. For the tracker data, $p = 1.5$, the midpoint of the valid range, performed best, suggesting balanced sensitivity to both the zero-inflation and the continuous tail.


Optimization Landscape

Convexity and Its Limits

A function is convex if for all $\mathbf{x}_1, \mathbf{x}_2$ and $\lambda \in [0, 1]$:

$$f(\lambda \mathbf{x}_1 + (1-\lambda)\mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$$

Convex losses (MSE, cross-entropy) on linear models have a single global minimum, and gradient descent is guaranteed to find it. For neural networks, the composition of the loss with nonlinear layers creates a non-convex optimization problem.

Saddle Points and Local Minima

In high-dimensional loss landscapes (modern neural networks have millions to billions of parameters), the geometry is dominated by saddle points, not local minima. At a critical point, the Hessian has some positive and some negative eigenvalues. The probability that all eigenvalues are positive (a true local minimum) decreases exponentially with dimension.

Empirical findings (Dauphin et al., 2014; Choromanska et al., 2015) show that for sufficiently overparameterized networks, local minima are approximately global — they have loss values close to the global minimum. The practical obstacle is not getting trapped in bad local minima, but navigating through saddle points where the gradient is near zero.

Momentum-based optimizers (SGD with momentum, Adam) help escape saddle points because accumulated velocity carries the iterate through flat regions. Stochastic gradient noise also helps — the mini-batch gradient is an unbiased but noisy estimate, and this noise can push the optimizer off saddle points.
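The escape mechanism is visible on the toy surface $L(x, y) = x^2 - y^2$: gradient descent started exactly on the ridge $y = 0$ converges to the saddle, while an arbitrarily small perturbation, standing in for mini-batch noise, grows along the negative-curvature direction.

```python
import numpy as np

def grad(w):
    # gradient of the saddle surface L(x, y) = x^2 - y^2
    return np.array([2.0 * w[0], -2.0 * w[1]])

def run(w0, steps=100, lr=0.1):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

stuck = run([1.0, 0.0])      # on the ridge: converges to the saddle at the origin
escaped = run([1.0, 1e-6])   # tiny y-perturbation grows by 1.2x per step
```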

Loss Landscape Flatness and Generalization

Not all minima generalize equally. Sharp minima (high curvature of the loss surface) tend to generalize poorly; flat minima (low curvature) tend to generalize well. The intuition: a flat minimum is insensitive to small perturbations in the weights, and test data represents a perturbation of the training distribution.

Sharpness-Aware Minimization (SAM; Foret et al., 2021) explicitly optimizes for flat minima by solving:

$$\min_{\mathbf{w}} \max_{\|\boldsymbol{\epsilon}\| \leq \rho} L(\mathbf{w} + \boldsymbol{\epsilon})$$

This finds parameters where the loss is low even after worst-case perturbation of magnitude $\rho$. SAM consistently improves generalization across vision and language tasks.
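The practical algorithm approximates the inner max to first order: ascend a distance $\rho$ along the gradient direction, then descend using the gradient at that perturbed point. A one-step numpy sketch on a toy quadratic:

```python
import numpy as np

def loss_grad(w):
    return 2.0 * w                # gradient of the toy loss L(w) = ||w||^2

def sam_step(w, lr=0.1, rho=0.05):
    g = loss_grad(w)
    # inner max: first-order worst-case perturbation of norm rho
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # outer min: descend using the gradient evaluated at w + eps
    return w - lr * loss_grad(w + eps)

w = np.array([1.0, 1.0])
w_next = sam_step(w)              # moves toward the (flat, here unique) minimum
```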

Large batch sizes tend to converge to sharp minima; small batch sizes, through gradient noise, tend to find flat minima. This partly explains why large-batch training requires learning rate warmup and careful tuning to match small-batch generalization.

The Lottery Ticket Hypothesis

Frankle and Carbin (2019) showed that dense networks contain sparse subnetworks (“winning tickets”) that, when trained in isolation from the same initialization, match the full network’s performance. These subnetworks correspond to favorable regions of the loss landscape that are reachable from the initial point.

The connection to loss functions: the loss landscape’s structure depends on both the architecture and the loss. Different losses create different landscapes over the same parameter space. A loss function that better matches the data’s structure may create a landscape with more accessible, flatter minima — partly explaining why loss function selection can dominate architecture selection.


Practical Guidelines

Default choices. MSE for regression, cross-entropy for classification. These are well-understood, numerically stable, and correct when the implicit distributional assumptions hold.

When to deviate:

  • Outliers in regression targets: Use Huber loss. Start with $\delta = 1.35$ and tune via cross-validation. Log-cosh is a smooth alternative if you need second-order derivatives.

  • Zero-inflated or right-skewed targets: Use Tweedie loss. Start with $p = 1.5$ and search $[1.1, 1.9]$. This is the highest-ROI intervention for data with this structure.

  • Class imbalance in detection/classification: Use focal loss with $\gamma = 2$. This is more principled than resampling or class weights for extreme imbalance ratios.

  • Prediction intervals needed: Use quantile loss at the desired quantile levels. Train separate models or a multi-output model for each quantile.

  • Embedding learning: Use InfoNCE with temperature tuning. More negatives and harder negatives improve representation quality.

  • Overconfident predictions: Apply label smoothing with $\epsilon = 0.1$. Especially beneficial when using model outputs for downstream calibration-sensitive tasks.

Always ablate the loss function. It is the cheapest experiment with the highest potential return. One hyperparameter (the loss function or its parameters) vs. the combinatorial space of architecture search. The tracker model’s 23% improvement from a single loss function change is not an anomaly — it reflects a general principle. When the data has distributional structure, the loss function that respects that structure will outperform one that ignores it, regardless of how much compute you spend on architecture optimization.