3.4: Practice Problems
Practice problems for linear regression and gradient descent.
Problem 1: Nonlinear Relationships
Question: True or False: Linear regression can only model linear relationships between input features and output.
Answer
False. While the model is linear in the weights, we can capture nonlinear relationships through feature engineering. For example:
- Adding polynomial features: $x^2, x^3, \dots$
- Including interaction terms: $x_1 x_2$
- Creating transformed features: $\log(x)$, $\sqrt{x}$
The key insight: with a quadratic model such as $\hat{y} = w_0 + w_1 x + w_2 x^2$, the model is still linear in the weights $w_0, w_1, w_2$, even though it represents a nonlinear (quadratic) relationship in the original feature $x$.
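As a quick illustration (not part of the original problem), the sketch below uses NumPy to fit a quadratic relationship with ordinary least squares on the engineered features $[1, x, x^2]$; the data-generating coefficients are made up for the example.

```python
import numpy as np

# Sketch: fit y ≈ w0 + w1*x + w2*x^2 by ordinary least squares.
# The model is nonlinear in x but linear in the weights w0, w1, w2.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=100)

# Design matrix with engineered features [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to [1.0, 2.0, -0.5]
```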
Problem 2: Alternative Cost Functions
Question: True or False: We could use $J(\mathbf{w}) = \sum_{i=1}^{n} \frac{1}{(\hat{y}_i - y_i)^2}$ as a valid cost function.
Answer
False. This cost function is problematic because:
- Minimizing $J$ actually maximizes the squared residual $(\hat{y}_i - y_i)^2$ (the denominator)
- As the error gets larger, the loss gets smaller, which is the opposite of what we want
- The loss is undefined (division by zero) when a prediction is perfect, i.e. $\hat{y}_i = y_i$
A valid cost function should decrease as predictions improve, not increase.
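A minimal numerical sketch, assuming the cost in question is the reciprocal of the squared residual: the "loss" value shrinks as the error grows, which is exactly backwards.

```python
import numpy as np

# Sketch: the reciprocal-of-squared-residual "loss" rewards large errors.
residuals = np.array([0.1, 1.0, 10.0, 100.0])
bad_loss = 1.0 / residuals**2
print(bad_loss)  # [1.e+02 1.e+00 1.e-02 1.e-04] -- loss shrinks as the error grows
```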
Problem 3: Exponential Loss
Question: True or False: We could use $J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} e^{(\hat{y}_i - y_i)^2}$ as a cost function.
Answer
True. This is a valid cost function because:
- It’s always positive: $e^z > 0$ for every $z$
- It increases with larger errors
- It’s differentiable everywhere
- Minimum occurs at zero error
However, this loss is very aggressive: it exponentially penalizes outliers, making the model extremely sensitive to large residuals. In practice, squared error is preferred for its mathematical convenience and balanced behavior.
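A small comparison sketch, assuming the exponential loss has the form $e^{(\hat{y}_i - y_i)^2}$: for the same residuals, the exponential loss blows up far faster than squared error.

```python
import numpy as np

# Sketch: compare squared error with an exponential loss exp(residual^2)
# on the same residuals. The exponential penalty explodes for outliers.
residuals = np.array([0.5, 1.0, 2.0, 3.0])
print(residuals**2)          # [0.25 1.   4.   9.  ]
print(np.exp(residuals**2))  # [~1.28 ~2.72 ~54.6 ~8103.1]
```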
Problem 4: Pros and Cons of Linear Regression
Question: List key advantages and disadvantages of linear regression.
Answer
Advantages:
- Interpretability: Weights directly show feature importance and direction of influence
- Computational efficiency: Fast to train on moderate datasets
- Closed-form solution: Direct solution via normal equations (no iteration needed)
- Good baseline: Establishes performance floor for complex models
- Well-understood theory: Statistical properties thoroughly studied
Disadvantages:
- Assumes linearity: Poor fit for inherently nonlinear relationships (without feature engineering)
- Sensitive to outliers: Squared error heavily penalizes large residuals
- Multicollinearity: Performance degrades when features are highly correlated
- Requires numerical features: Categorical variables need encoding
- High bias: May underfit complex data (simple model hypothesis class)
- Computational cost of direct solution: $O(d^3)$ for matrix inversion when the number of features $d$ is large
Problem 5: Number of Parameters
Question: True or False: The number of parameters in a linear regression model equals the number of training examples $n$.
Answer
False. The number of parameters equals $d + 1$ (or just $d$ weights if we count the bias separately):
- $d$ weights for the features: $w_1, w_2, \dots, w_d$
- $1$ bias term: $b$ (or $w_0$ when absorbed into the weight vector)
The number of training examples $n$ determines how much data we have to estimate these parameters, but it doesn’t affect how many parameters exist. In fact, we typically want $n$ to be much larger than the number of parameters to avoid overfitting.
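A quick sketch confirming the point: with $d = 5$ features and the bias absorbed as a column of ones, the fitted parameter vector has $d + 1 = 6$ entries no matter how large $n$ is (the data here is random and purely illustrative).

```python
import numpy as np

# Sketch: the fitted parameter vector has d + 1 entries (bias + d weights),
# regardless of the number of training examples n.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias column + d features
y = rng.normal(size=n)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w.shape)  # (6,) == (d + 1,), independent of n
```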
Problem 6: Gradient Vector Shape
Question: Makayla computed the gradient as:
$$\nabla_{\mathbf{w}} J = \frac{2}{n} (X \mathbf{w} - \mathbf{y})^T X$$
Is this correct? What is the shape of her result?
Answer
Incorrect. Makayla’s result has the wrong shape:
Her computation:
- $(X \mathbf{w} - \mathbf{y})^T$ has shape $1 \times n$ (row vector)
- $X$ has shape $n \times d$
- Product: $1 \times d$, a row vector
Problem: The gradient must be a column vector with shape $d \times 1$ to match the shape of $\mathbf{w}$ for gradient descent updates: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} J$.
Correct gradient:
$$\nabla_{\mathbf{w}} J = \frac{2}{n} X^T (X \mathbf{w} - \mathbf{y})$$
This gives:
- $X^T$ has shape $d \times n$
- $(X \mathbf{w} - \mathbf{y})$ has shape $n \times 1$ (column vector)
- Product: $d \times 1$, the correct shape
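The shape mismatch is easy to verify numerically; the sketch below assumes the cost $J(\mathbf{w}) = \frac{1}{n}\|X\mathbf{w} - \mathbf{y}\|^2$ used elsewhere in this section.

```python
import numpy as np

# Shape check for the gradient of J(w) = (1/n) * ||Xw - y||^2.
rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=(n, 1))
w = rng.normal(size=(d, 1))

wrong = (2 / n) * (X @ w - y).T @ X   # (1, n) @ (n, d) -> (1, d): row vector
right = (2 / n) * X.T @ (X @ w - y)   # (d, n) @ (n, 1) -> (d, 1): matches w
print(wrong.shape, right.shape)       # (1, 3) (3, 1)
```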
Problem 7: Absorbing the Bias
Question: Why do we absorb the bias term into the weight vector? Show how this simplifies the model equation.
Answer
Original formulation:
$$\hat{y} = \mathbf{w}^T \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b$$
This requires treating $b$ separately in all computations.
After absorbing bias:
Define augmented vectors with a dummy feature $x_0 = 1$:
$$\tilde{\mathbf{x}} = [1, x_1, x_2, \dots, x_d]^T, \qquad \tilde{\mathbf{w}} = [b, w_1, w_2, \dots, w_d]^T$$
Now the model becomes simply:
$$\hat{y} = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}$$
Benefits:
- Unified notation: No special case for bias in equations
- Simpler code: Vectorized operations handle bias automatically
- Cleaner gradient: Single formula for all parameters
- Standard convention: Matches most ML libraries
The cost: each data point and the weight vector increase in dimension by 1 (from $d$ to $d + 1$).
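A minimal sketch of the augmentation in NumPy: prepending a column of ones to $X$ and the bias to $\mathbf{w}$ makes a single matrix product reproduce $\mathbf{w}^T\mathbf{x} + b$ for every example (the shapes and values are illustrative).

```python
import numpy as np

# Sketch: absorb the bias by prepending a constant feature x0 = 1.
rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
b = 0.7

X_aug = np.hstack([np.ones((n, 1)), X])  # shape (n, d + 1)
w_aug = np.concatenate([[b], w])         # shape (d + 1,)

# A single matrix product now computes w^T x + b for every example.
assert np.allclose(X_aug @ w_aug, X @ w + b)
```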
Worked Example: Single-Feature Linear Regression
Problem: Consider the linear regression model with one feature:
$$\hat{y}_i = w x_i + b$$
The cost function is:
$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2$$
(a) Derive the partial derivatives $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$.
(b) Solve for $w$ and $b$ that minimize the cost function.
Solution
Part (a): Deriving the Gradients
For each training example, the residual is:
$$r_i = w x_i + b - y_i$$
Derivative with respect to $w$:
Using the chain rule on $r_i^2$:
$$\frac{\partial}{\partial w} r_i^2 = 2 r_i \cdot \frac{\partial r_i}{\partial w} = 2 (w x_i + b - y_i) x_i$$
Summing over all examples and dividing by $n$:
$$\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i) x_i$$
Derivative with respect to $b$:
Similarly, $\frac{\partial}{\partial b} r_i^2 = 2 r_i \cdot \frac{\partial r_i}{\partial b} = 2 (w x_i + b - y_i)$.
Summing over all examples:
$$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i)$$
Part (b): Solving for Optimal Parameters
Set both partial derivatives to zero.
From $\frac{\partial J}{\partial b} = 0$:
$$\frac{2}{n} \sum_{i=1}^{n} (w x_i + b - y_i) = 0$$
Multiply both sides by $\frac{n}{2}$:
$$\sum_{i=1}^{n} (w x_i + b - y_i) = 0$$
Factor out constants:
$$w \sum_{i=1}^{n} x_i + n b - \sum_{i=1}^{n} y_i = 0$$
Note that $\sum_i x_i = n \bar{x}$ and $\sum_i y_i = n \bar{y}$, where $\bar{x}$ and $\bar{y}$ are the means. Dividing by $n$:
$$w \bar{x} + b - \bar{y} = 0$$
Solving for $b$:
$$b = \bar{y} - w \bar{x}$$
Geometric interpretation: The regression line passes through the centroid $(\bar{x}, \bar{y})$ of the data.
From $\frac{\partial J}{\partial w} = 0$:
$$\sum_{i=1}^{n} (w x_i + b - y_i) x_i = 0$$
Expand:
$$w \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i y_i = 0$$
Substitute $b = \bar{y} - w \bar{x}$:
$$w \sum_{i=1}^{n} x_i^2 + (\bar{y} - w \bar{x}) \, n \bar{x} - \sum_{i=1}^{n} x_i y_i = 0$$
Group terms with $w$:
$$w \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$$
Therefore:
$$w = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$
Alternative form using covariance and variance:
$$w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$
This is the classic formula for the slope of the best-fit line.
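As a sanity check (not part of the original worked example), the snippet below computes $w = \operatorname{Cov}(x, y)/\operatorname{Var}(x)$ and $b = \bar{y} - w\bar{x}$ on synthetic data and compares against NumPy's least-squares fit.

```python
import numpy as np

# Check the closed-form solution w = Cov(x, y) / Var(x), b = y_bar - w * x_bar
# against NumPy's least-squares polynomial fit on synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)

x_bar, y_bar = x.mean(), y.mean()
w = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - w * x_bar

w_ref, b_ref = np.polyfit(x, y, deg=1)  # slope, intercept
print(w, b)          # close to 3.0 and 2.0
print(w_ref, b_ref)  # matches the closed-form values
```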
Problem 8: Step-by-Step Gradient Derivation
Question: Derive the gradient $\nabla_{\mathbf{w}} J$ for the cost function $J(\mathbf{w}) = \frac{1}{n} \| X \mathbf{w} - \mathbf{y} \|^2$ using the chain rule.
Solution
We’ll work through this systematically using intermediate variables and the chain rule.
(a) Define the prediction vector:
$$\hat{\mathbf{y}} = X \mathbf{w}$$
Taking the derivative:
$$\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{w}} = X$$
This makes sense dimensionally: $\hat{\mathbf{y}}$ is $n \times 1$ and $\mathbf{w}$ is $d \times 1$, so the Jacobian must be $n \times d$, which is exactly the shape of $X$.
(b) Define the residual vector:
$$\mathbf{r} = \hat{\mathbf{y}} - \mathbf{y} = X \mathbf{w} - \mathbf{y}$$
Since $\mathbf{y}$ is constant with respect to $\mathbf{w}$:
$$\frac{\partial \mathbf{r}}{\partial \mathbf{w}} = \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{w}} = X$$
(c) Define the squared norm:
$$s = \| \mathbf{r} \|^2 = \mathbf{r}^T \mathbf{r} = \sum_{i=1}^{n} r_i^2$$
This is a scalar. Taking the derivative with respect to the vector $\mathbf{r}$:
$$\frac{\partial s}{\partial \mathbf{r}} = 2 \mathbf{r}^T$$
Explanation: For each component $r_i$, we have $\frac{\partial s}{\partial r_i} = 2 r_i$ (from the chain rule on $r_i^2$). Stacking these gives $2 \mathbf{r}^T$ (a row vector).
(d) Define the cost function:
$$J = \frac{1}{n} s$$
The derivative with respect to $s$:
$$\frac{\partial J}{\partial s} = \frac{1}{n}$$
(e) Apply the chain rule:
Now we chain everything together:
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{\partial J}{\partial s} \cdot \frac{\partial s}{\partial \mathbf{r}} \cdot \frac{\partial \mathbf{r}}{\partial \mathbf{w}}$$
Substituting:
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{n} \cdot 2 \mathbf{r}^T \cdot X$$
Simplifying:
$$\frac{\partial J}{\partial \mathbf{w}} = \frac{2}{n} (X \mathbf{w} - \mathbf{y})^T X$$
Wait, wrong shape! This gives a row vector of shape $1 \times d$, but we need a column vector of shape $d \times 1$.
Correct form: Take the transpose:
$$\nabla_{\mathbf{w}} J = \frac{2}{n} X^T (X \mathbf{w} - \mathbf{y})$$
Dimension check:
- $X^T$ is $d \times n$
- $(X \mathbf{w} - \mathbf{y})$ is $n \times 1$
- Product: $d \times 1$ ✓
Key Insight: The gradient must be a column vector to match the shape of $\mathbf{w}$ for gradient descent updates. The chain rule naturally gives us the row-vector form $\frac{2}{n} (X \mathbf{w} - \mathbf{y})^T X$, but we transpose the entire expression to get the conventional column form $\frac{2}{n} X^T (X \mathbf{w} - \mathbf{y})$.
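A finite-difference check of the derived formula, assuming $J(\mathbf{w}) = \frac{1}{n}\|X\mathbf{w} - \mathbf{y}\|^2$: the analytic gradient $\frac{2}{n}X^T(X\mathbf{w} - \mathbf{y})$ should match the numerical one to several decimal places.

```python
import numpy as np

# Finite-difference check of grad J(w) = (2/n) X^T (Xw - y)
# for J(w) = (1/n) ||Xw - y||^2.
rng = np.random.default_rng(0)
n, d = 20, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def cost(w):
    r = X @ w - y
    return (r @ r) / n

analytic = (2 / n) * X.T @ (X @ w - y)

eps = 1e-6
numeric = np.array([
    (cost(w + eps * e) - cost(w - eps * e)) / (2 * eps)  # central difference
    for e in np.eye(d)
])
print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-9): formulas agree
```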
Problem 9: Learning Rate Analysis
Question: You’re training a linear regression model with gradient descent. After 100 iterations, you observe:
- Training loss at iteration 1: 50.0
- Training loss at iteration 100: 49.8
What might be wrong, and how would you fix it?
Answer
The learning rate is too small. The loss has barely decreased (only 0.2 over 100 iterations), indicating the steps are too tiny.
Fixes:
- Increase the learning rate: try 10x or 100x larger
- Use adaptive learning rates: methods like Adam automatically adjust
- Run for more iterations, though this is inefficient
How to diagnose: Plot the training curve. With a proper learning rate, you should see rapid initial decrease that levels off. A nearly flat line suggests the learning rate is too small.
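A small illustrative sketch (synthetic data, made-up learning rates): the same gradient descent loop run with a tiny learning rate barely reduces the loss, while a larger one converges quickly.

```python
import numpy as np

# Gradient descent on J(w) = (1/n) ||Xw - y||^2 with two learning rates.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def final_loss(lr, iters=100):
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

print(final_loss(lr=1e-5))  # barely below the starting loss: steps are too tiny
print(final_loss(lr=1e-1))  # drops to roughly the noise floor (~0.01)
```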
Problem 10: Gradient Descent Direction
Question: At a point where $\frac{dJ}{dw} = -3$, should we increase or decrease $w$ to reduce the cost?
Answer
We should increase $w$.
Reasoning:
- The gradient is negative ($-3$), meaning the cost decreases as $w$ increases
- Gradient descent update: $w \leftarrow w - \alpha \frac{dJ}{dw} = w - \alpha(-3) = w + 3\alpha$
- The negative gradient causes $w$ to increase
Intuition: The gradient points toward the direction of steepest increase. A negative gradient means “increasing $w$ decreases the cost.” We move opposite to the gradient, so we move in the direction that decreases the cost.
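One line of arithmetic makes the same point; the starting value of $w$ and the learning rate below are arbitrary.

```python
# One gradient descent step with dJ/dw = -3 at the current point.
w = 2.0        # arbitrary current value
grad = -3.0    # given gradient
lr = 0.1       # arbitrary learning rate
w_new = w - lr * grad
print(w_new)   # 2.3 > 2.0: w increased, which reduces the cost
```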