The key insight: the model y = w_1 x_1 + w_2 x_1^2 + b is still linear in the weights w, even though it represents a nonlinear (quadratic) relationship in the original feature x_1.
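As a minimal sketch (the data and coefficient values here are synthetic, assumed purely for illustration), the quadratic relationship can be fit with ordinary linear least squares once the design matrix contains the columns 1, x_1, and x_1^2:

```python
import numpy as np

# Synthetic quadratic data: t = 3*x1 + 2*x1^2 + 1 (coefficients assumed for illustration)
x1 = np.linspace(-2.0, 2.0, 50)
t = 3.0 * x1 + 2.0 * x1**2 + 1.0

# Design matrix with columns [1, x1, x1^2]: fixed transforms of x1,
# so the model stays linear in the parameters (b, w1, w2)
X = np.column_stack([np.ones_like(x1), x1, x1**2])

# Ordinary least squares recovers (b, w1, w2)
params, *_ = np.linalg.lstsq(X, t, rcond=None)
print(params)  # approximately [1. 3. 2.]
```

The nonlinearity lives entirely in the feature columns; the solver only ever sees a linear problem in the parameters.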
Problem 2: Alternative Cost Functions
Question: True or False: We could use L(y^(i), t^(i)) = 1 / (y^(i) - t^(i))^2 as a valid cost function.
Answer
False. This cost function is problematic because:
Minimizing 1/residual^2 actually maximizes the residual (the denominator)
As the error gets larger, the loss gets smaller, which is the opposite of what we want
Division by zero when prediction is perfect
A valid cost function should decrease as predictions improve, not increase.
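A tiny numeric check (error values chosen arbitrarily) makes the failure concrete: under 1/(y - t)^2, the loss shrinks as the error grows.

```python
# The (invalid) loss 1/(y - t)^2 evaluated at increasing absolute errors
errors = [0.5, 1.0, 2.0, 4.0]
losses = [1.0 / e**2 for e in errors]
print(losses)  # [4.0, 1.0, 0.25, 0.0625]: bigger error, SMALLER loss
```

A gradient-based optimizer minimizing this quantity would be rewarded for making predictions worse.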
Problem 3: Exponential Loss
Question: True or False: We could use L(y^(i), t^(i)) = e^((y^(i) - t^(i))^2) as a cost function.
Answer
True. This is a valid cost function because:
It's always positive: e^((·)^2) > 0
It increases with larger errors
It's differentiable everywhere
Its minimum occurs at zero error
However, this loss is very aggressive: it exponentially penalizes outliers, making the model extremely sensitive to large residuals. In practice, squared error is preferred for its mathematical convenience and more balanced behavior.
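A quick comparison (error values picked arbitrarily) shows how much harder the exponential loss punishes the same residuals than squared error does:

```python
import math

errors = [0.0, 1.0, 2.0, 3.0]
squared = [e**2 for e in errors]                # grows polynomially
exponential = [math.exp(e**2) for e in errors]  # explodes on outliers
for e, s, x in zip(errors, squared, exponential):
    print(f"error={e}: squared={s}, exp={x:.1f}")
# an error of 3.0 costs 9.0 under squared error but about 8103 under the exp loss
```

A single outlier can therefore dominate the total cost and drag the fit toward it.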
Problem 4: Pros and Cons of Linear Regression
Question: List key advantages and disadvantages of linear regression.
Answer
Advantages:
Interpretability: weights directly show feature importance and direction of influence
Computational efficiency: fast to train on moderate datasets
Closed-form solution: direct solution via the normal equations (no iteration needed)
Good baseline: establishes a performance floor for more complex models
Disadvantages:
Assumes linearity: poor fit for inherently nonlinear relationships (without feature engineering)
Sensitive to outliers: squared error heavily penalizes large residuals
Multicollinearity: performance degrades when features are highly correlated
Requires numerical features: categorical variables need encoding
High bias: may underfit complex data (simple hypothesis class)
Computational cost of the direct solution: O(D^3) for matrix inversion when D is large
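The closed-form solution mentioned above can be sketched with the normal equations; the data here is synthetic (weights assumed for illustration), and in practice one solves the linear system rather than explicitly inverting X^T X:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # bias column + D features
w_true = np.array([1.0, 2.0, -1.0, 0.5])                    # assumed for illustration
t = X @ w_true                                              # noiseless targets

# Normal equations: (X^T X) w = X^T t; solving this system is O(D^3) in the feature dimension
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(w_hat)  # recovers w_true
```

Using `solve` instead of `inv` is both faster and numerically safer, though the O(D^3) scaling remains.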
Problem 5: Number of Parameters
Question: True or False: The number of parameters in a linear regression model equals the number of training examples N.
Answer
False. The number of parameters equals D + 1 (or D weights plus a separate bias):
D weights for the features: w_1, w_2, …, w_D
1 bias term: b (or w_0 when absorbed into the weight vector)
The number of training examples N determines how much data we have to estimate these parameters, but it doesn't affect how many parameters exist. In fact, we typically want N ≫ D to avoid overfitting.
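A quick shape check (random data; dimensions assumed for illustration) confirms that the parameter count tracks D, not N:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
for N in (10, 1000):
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # bias column + D features
    t = rng.normal(size=N)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    print(N, w.shape)  # w.shape is (5,) == (D+1,) for both values of N
```

Growing the dataset a hundredfold changes the estimates, not the number of parameters being estimated.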
Problem 6: Gradient Vector Shape
Question: Makayla computed the gradient as:
∇_w E(w) = (Xw - t)^T X
Is this correct? What is the shape of her result?
Answer
Incorrect. Makayla's result has the wrong shape:
Her computation:
(Xw - t)^T has shape 1 × N (row vector)
X has shape N × (D+1)
Product: (1 × N)(N × (D+1)) = 1 × (D+1), a row vector
Problem: this gives a row vector (1 × (D+1)), but the gradient must be a column vector with shape (D+1) × 1 to match the shape of w in the gradient descent update w ← w - α ∇_w E(w).
Correct form: Take the transpose:
∇_w E(w) = (1/N) X^T r = (1/N) X^T (Xw - t)
Dimension check:
X^T is (D+1) × N
r = (Xw - t) is N × 1
Product: ((D+1) × N)(N × 1) = (D+1) × 1 ✓
Key Insight: The gradient must be a column vector to match the shape of w for gradient descent updates. The chain rule naturally gives us r^T X, but we transpose the entire expression to get the conventional form X^T r.
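The dimension check above can be verified directly (random matrices; the sizes N and D are assumed for illustration):

```python
import numpy as np

N, D = 6, 2
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # N x (D+1)
w = rng.normal(size=(D + 1, 1))                             # (D+1) x 1 column vector
t = rng.normal(size=(N, 1))

r = X @ w - t              # residual, N x 1
wrong = r.T @ X            # 1 x (D+1): Makayla's row vector
grad = (X.T @ r) / N       # (D+1) x 1: matches the shape of w
print(wrong.shape, grad.shape)  # (1, 3) (3, 1)
```

Note that `grad` equals `wrong.T / N`: the same numbers, just transposed into the conventional column shape (and scaled by the 1/N in the cost).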
Problem 9: Learning Rate Analysis
Question: Youโre training a linear regression model with gradient descent. After 100 iterations, you observe:
Training loss at iteration 1: 50.0
Training loss at iteration 100: 49.8
What might be wrong, and how would you fix it?
Answer
The learning rate is too small. The loss has barely decreased (only 0.2 over 100 iterations), indicating the steps are too tiny.
Fixes:
Increase the learning rate: try 10× or 100× larger
Use adaptive learning rates: methods like Adam adjust the step size automatically
Run for more iterations: though this is inefficient if the learning rate is the real problem
How to diagnose: Plot the training curve. With a proper learning rate, you should see rapid initial decrease that levels off. A nearly flat line suggests the learning rate is too small.
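A minimal sketch (synthetic one-feature data; the two learning rates are chosen purely for illustration) reproduces both the symptom and the fix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 1))])
t = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)

def final_loss(alpha, iters=100):
    """Run gradient descent for `iters` steps and return the final loss."""
    w = np.zeros(2)
    for _ in range(iters):
        grad = X.T @ (X @ w - t) / len(t)  # gradient of (1/2N) * ||Xw - t||^2
        w -= alpha * grad
    return np.mean((X @ w - t) ** 2) / 2

print(final_loss(1e-4))  # barely moved: loss still near its starting value
print(final_loss(1e-1))  # converged: loss near the noise floor
```

With the tiny learning rate the loss after 100 iterations is almost unchanged, which is exactly the flat training curve described above.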
Problem 10: Gradient Descent Direction
Question: At a point where ∂E/∂w = -3, should we increase or decrease w to reduce the cost?
Answer
We should increase w.
Reasoning:
The gradient is negative (-3), meaning the cost decreases as w increases
Intuition: the gradient points in the direction of steepest increase, so a negative gradient means increasing w decreases the cost. Gradient descent moves opposite to the gradient: w ← w - α(∂E/∂w) = w + 3α, which increases w.
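The update rule makes the direction explicit (the values of w and alpha here are assumed for illustration):

```python
# One gradient-descent step at a point where dE/dw = -3
alpha = 0.1   # assumed learning rate
w = 2.0       # assumed current parameter value
grad = -3.0   # given gradient
w_new = w - alpha * grad   # 2.0 - 0.1 * (-3.0)
print(w_new)               # about 2.3: subtracting a negative gradient increases w
```

Subtracting a negative number adds to w, so the update moves w upward, exactly as the sign analysis predicts.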