Lesson 5 fit a linear model by “minimizing squared error” without saying how the minimizing weights get found. For most models (anything whose best weights can't be solved for directly with algebra), the answer is gradient descent.
Picture the loss surface: plot the loss (how wrong the model is) against its weight(s). For a
single weight w, that’s a curve; a classic one is a smooth bowl, like L(w) = (w − 3)² — its
minimum sits at w = 3, where the loss is zero. Gradient descent finds that minimum without ever
“seeing” the whole bowl: it computes the gradient (the slope, dL/dw) at the current point, and
steps in the opposite direction — downhill — by an amount controlled by the learning rate η:
w_new = w_old − η · (dL/dw)
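As a rough illustration of a single update, here is a minimal Python sketch. The names `loss`, `slope_at`, and `gradient_step` are made up for this example, and the learning rate of 0.1 is an arbitrary choice. The slope is estimated from two loss values very close to the current point, so the code never looks at the rest of the bowl.

```python
def loss(w):
    """The bowl-shaped example loss L(w) = (w - 3)^2."""
    return (w - 3) ** 2


def slope_at(f, w, h=1e-6):
    """Estimate df/dw at w from two nearby evaluations (central difference).

    h is a small illustrative step size, not a tuned value.
    """
    return (f(w + h) - f(w - h)) / (2 * h)


def gradient_step(w_old, learning_rate, f=loss):
    """Apply one update: w_new = w_old - learning_rate * slope."""
    return w_old - learning_rate * slope_at(f, w_old)


# One step from w = 0 with an assumed learning rate of 0.1.
w_new = gradient_step(w_old=0.0, learning_rate=0.1)
print(w_new)  # roughly 0.6, up to floating-point error
```

In practice the slope usually comes from an exact derivative rather than a numerical estimate; the estimate is used here only to show that one step needs nothing beyond local information.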
For L(w) = (w − 3)², the derivative is dL/dw = 2(w − 3). Starting at w₀ = 0 with
η = 1.5, trace it by hand:
- w₁ = 0 − 1.5 × 2×(0 − 3) = 0 − 1.5×(−6) = 9
- w₂ = 9 − 1.5 × 2×(9 − 3) = 9 − 1.5×12 = 9 − 18 = −9
- w₃ = −9 − 1.5 × 2×(−9 − 3) = −9 − 1.5×(−24) = −9 + 36 = 27
The true minimum is at w = 3. These values are 0, 9, −9, 27: each step jumps to the other side of 3 and lands twice as far from it as before (distances 3, 6, 12, 24), not closer. What does this trace tell you about the learning rate?
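To experiment with that question, the hand trace can be reproduced in a few lines of Python. This is a sketch, not a general optimizer: the function names `loss_gradient` and `trace_descent` are invented for this example, and the derivative 2(w − 3) is hard-coded for this one loss.

```python
def loss_gradient(w):
    """Derivative of L(w) = (w - 3)^2, i.e. dL/dw = 2(w - 3)."""
    return 2 * (w - 3)


def trace_descent(w0, learning_rate, steps):
    """Return [w0, w1, ..., w_steps] from repeated gradient descent updates."""
    weights = [w0]
    for _ in range(steps):
        w = weights[-1]
        # The update rule: w_new = w_old - learning_rate * dL/dw
        weights.append(w - learning_rate * loss_gradient(w))
    return weights


trace = trace_descent(w0=0.0, learning_rate=1.5, steps=3)
print(trace)                         # [0.0, 9.0, -9.0, 27.0]
print([abs(w - 3) for w in trace])   # [3.0, 6.0, 12.0, 24.0]
```

Rerunning `trace_descent` with a different `learning_rate` (for instance 0.1 or 0.5) and comparing the distance lists is one way to test your answer.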