3.2: Gradient Descent
Gradient descent is an iterative optimization algorithm used to find the minimum of a function. While linear regression has a closed-form solution, gradient descent provides a more general approach that scales better to high dimensions and generalizes to other models.
Why Use Gradient Descent?
| Aspect | Direct Solution | Gradient Descent |
|---|---|---|
| Complexity per step | $O(d^3)$ for matrix inversion | $O(nd)$ per update |
| Implementation | Requires matrix inversion | Simple iterative updates |
| Generalization | Only works for specific problems | Works for any differentiable loss |
| Memory | Must store and invert $X^\top X$ (a $d \times d$ matrix) | Only needs gradients |
Key insight: Each gradient descent update costs $O(nd)$ rather than $O(d^3)$ for matrix inversion, making it much cheaper when $d$ is large (high-dimensional data).
What is Gradient Descent?
An iterative method for finding a minimum of a function.
Procedure:
- Start with a random point
- Apply an update rule iteratively until a stopping condition is met
The Update Rule
Direction: Which Way to Move?
The gradient points in the direction of steepest increase.
To minimize, we move in the opposite direction:
- If the derivative is negative at some point, we want to increase $w$ (move right toward the minimum)
- If the derivative is positive at some point, we want to decrease $w$ (move left toward the minimum)
Conclusion: The sign of the update should be opposite to the sign of the gradient.
Magnitude: How Far to Move?
Each update’s size should be proportional to the gradient’s magnitude:
- When the curve is steep, gradient magnitude is large → take large steps
- When the curve is flat, gradient magnitude is small → take small steps
This is natural: far from the minimum (steep region), take big steps. Near the minimum (flat region), take small steps to avoid overshooting.
The Update Formula
Combining direction and magnitude:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla J(\mathbf{w})$$

Or for each weight individually:

$$w_j \leftarrow w_j - \alpha \, \frac{\partial J}{\partial w_j}$$

Where:
- $\alpha$ is the learning rate (a hyperparameter)
- $\nabla J(\mathbf{w})$ is the gradient of the cost function
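To make the update rule concrete, here is a minimal sketch applying it to a toy one-dimensional function $J(w) = (w - 3)^2$; the function, starting point, and learning rate are illustrative choices, not anything prescribed above.

```python
# Gradient descent on J(w) = (w - 3)^2, whose derivative is dJ/dw = 2 * (w - 3).
alpha = 0.1   # learning rate (hyperparameter)
w = 0.0       # arbitrary starting point

for _ in range(50):
    grad = 2 * (w - 3)     # gradient at the current w
    w = w - alpha * grad   # move opposite to the gradient, scaled by alpha

print(w)  # approaches the minimizer w = 3
```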
Gradient Descent for Linear Regression
Recall the gradient for MSE loss:

$$\nabla J(\mathbf{w}) = \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y})$$

Substituting into the update rule:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y})$$

Or equivalently, summing over individual examples:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{2}{n} \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right) \mathbf{x}_i$$
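As a sketch of how these updates look in code, the snippet below runs batch gradient descent with the MSE gradient above on synthetic data; the data, learning rate, and iteration count are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                  # synthetic features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)    # noisy targets

w = np.zeros(d)   # arbitrary starting point
alpha = 0.05      # learning rate

for _ in range(1000):
    grad = (2 / n) * X.T @ (X @ w - y)   # MSE gradient
    w -= alpha * grad                    # gradient descent update

print(w)  # should end up close to true_w
```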
Choosing the Learning Rate
The learning rate is a critical hyperparameter:
| $\alpha$ too small | $\alpha$ too large |
|---|---|
| Takes too long to converge | May oscillate around minimum |
| Many iterations needed | May diverge completely |
| Wastes computation | Training becomes unstable |
Visualizing the problem:
- Too small: tiny steps, barely makes progress after many iterations
- Too large: steps so big you overshoot the minimum, error explodes
In practice:
- Start with a reasonable value (e.g., 0.01 or 0.001)
- Monitor the cost function; it should decrease over iterations
- Use learning rate schedules that decrease over time
- Consider adaptive methods like Adam
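The effect of the learning rate is easy to see empirically. The sketch below runs the same update loop with three different learning rates on synthetic data; the specific $\alpha$ values are arbitrary examples of "too small", "reasonable", and "too large".

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)
n = len(y)

def cost(w):
    return np.mean((X @ w - y) ** 2)

for alpha in (1e-4, 0.05, 1.5):   # too small, reasonable, too large
    w = np.zeros(2)
    for _ in range(100):
        w -= alpha * (2 / n) * X.T @ (X @ w - y)
    print(f"alpha={alpha}: final cost {cost(w):.3g}")
# Expect: barely any progress, good convergence, and divergence, respectively.
```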
Termination Criteria
When to Stop Iterating?
In theory: Stop when $\mathbf{w}$ stops changing (convergence)
In practice:
- Cost threshold: Stop when $|J(\mathbf{w}^{(t)}) - J(\mathbf{w}^{(t-1)})| < \epsilon$ for some small $\epsilon$
- Gradient threshold: Stop when $\|\nabla J(\mathbf{w})\| < \epsilon$
- Maximum iterations: Stop after a fixed number of iterations (“tired of waiting”)
- Early stopping: Stop when validation performance stops improving (prevents overfitting)
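The first three criteria are typically combined in a single loop, as in the sketch below; the threshold $\epsilon$, the iteration cap, and the data are illustrative assumptions (early stopping on a validation set is covered next).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 0.5, -1.5]) + 0.1 * rng.normal(size=150)
n = len(y)

w = np.zeros(3)
alpha, eps, max_iters = 0.05, 1e-10, 10_000
prev_cost = np.inf

for it in range(max_iters):                  # maximum-iterations cap
    grad = (2 / n) * X.T @ (X @ w - y)
    w -= alpha * grad
    c = np.mean((X @ w - y) ** 2)
    if abs(prev_cost - c) < eps:             # cost threshold
        break
    if np.linalg.norm(grad) < eps:           # gradient threshold
        break
    prev_cost = c
```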
Early Stopping
A powerful technique that serves two purposes:
- Removes the need to guess the number of iterations
- Acts as a form of regularization to prevent overfitting
Algorithm:
```
best_val_loss = infinity
best_weights = None
patience_counter = 0

for each iteration:
    update weights
    compute validation loss

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = current_weights
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break

return best_weights   # Not the final weights!
```

Key insight: Return the best weights (lowest validation loss), not the final weights.
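A runnable version of this loop, applied to the linear-regression updates from earlier, might look like the sketch below; the synthetic data, the train/validation split, and the patience value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=300)
X_tr, y_tr = X[:200], y[:200]     # training split
X_val, y_val = X[200:], y[200:]   # validation split

alpha, patience = 0.05, 20
w = np.zeros(X.shape[1])
best_val, best_w, counter = np.inf, w.copy(), 0

for _ in range(5000):
    grad = (2 / len(y_tr)) * X_tr.T @ (X_tr @ w - y_tr)
    w -= alpha * grad                                # update weights
    val_loss = np.mean((X_val @ w - y_val) ** 2)     # validation loss
    if val_loss < best_val:
        best_val, best_w, counter = val_loss, w.copy(), 0
    else:
        counter += 1
        if counter >= patience:
            break

w = best_w   # keep the best weights, not the final ones
```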
Gradient Descent vs. Direct Solution
For linear regression specifically, sklearn’s LinearRegression uses the closed-form solution:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$

This finds the exact optimal weights in one step.
Why gradient descent can’t match it exactly:
- Fixed step size: you’re always jumping by $\alpha \, \nabla J(\mathbf{w})$, so you oscillate around the minimum
- Early stopping: we typically stop based on validation performance, not training convergence
When to use which:
- For linear regression: use the closed-form solution (sklearn)
- For models without closed-form solutions (neural networks, etc.): gradient descent is essential
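To see the two approaches side by side, the sketch below computes closed-form weights and gradient-descent weights on the same synthetic data; NumPy’s linear solver stands in for the normal-equations solution here, and the data and settings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -0.5, 2.0, 0.0]) + 0.1 * rng.normal(size=200)
n = len(y)

# Closed-form: solve (X^T X) w = X^T y rather than inverting explicitly
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: approaches the same answer iteratively
w_gd = np.zeros(X.shape[1])
for _ in range(2000):
    w_gd -= 0.05 * (2 / n) * X.T @ (X @ w_gd - y)

print(np.max(np.abs(w_closed - w_gd)))   # small, but not exactly zero
```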
A Modular Approach to ML
Machine learning can be decomposed into three interchangeable components:
| Component | Role | Examples |
|---|---|---|
| Model | Describes relationships between variables | Linear model, neural network |
| Loss/Cost function | Quantifies how badly the model fits the data | Squared error, cross-entropy |
| Optimization algorithm | Fits a model that minimizes the loss | Direct solution, gradient descent, Adam |
This modularity is powerful: you can mix and match components. Gradient descent works with any differentiable loss function and model, making it the workhorse of modern machine learning.
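A sketch of what this modularity looks like in code: the same gradient-descent loop is reused with two different loss gradients, squared error and a Huber-style robust loss. The helper names and data here are illustrative assumptions, not a standard API.

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.05, iters=1000):
    """Generic loop: the optimizer only needs a gradient function."""
    w = w0.copy()
    for _ in range(iters):
        w -= alpha * grad_fn(w)
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=200)
n = len(y)

def mse_grad(w):
    return (2 / n) * X.T @ (X @ w - y)

def huber_grad(w, delta=1.0):
    r = X @ w - y
    return X.T @ np.clip(r, -delta, delta) / n   # quadratic near 0, linear in the tails

print(gradient_descent(mse_grad, np.zeros(3)))    # same loop,
print(gradient_descent(huber_grad, np.zeros(3)))  # different loss
```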
Summary
| Component | Description |
|---|---|
| Update rule | $\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla J(\mathbf{w})$ |
| Direction | Negative of gradient (steepest descent) |
| Step size | Proportional to gradient magnitude, scaled by $\alpha$ |
| Termination | Small change in cost, gradient threshold, or max iterations |
| Learning rate | Too small → slow; too large → diverge/oscillate |
| Early stopping | Stop when validation loss stops improving |
Key Formulas
Gradient descent update:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla J(\mathbf{w})$$

For linear regression:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y})$$

Closed-form solution (for comparison):

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$