Lesson 8 · Solution · Regularization

Solution: The Penalty That Shrinks the Fit

w = 1. Setting the derivative to zero:

2(w − 5) + 2(4)w = 0
2w − 10 + 8w = 0
10w = 10
w = 1

Compare to the unregularized optimum (λ = 0): plain 2(w − 5) = 0 gives w = 5. The penalty pulled the fitted weight from 5 all the way down to 1, not because the data changed, but because the loss now charges for coefficient size, and giving up some data-fit to shrink w was worth it once that charge was added. In general, ridge’s closed form here is w = w_unregularized / (1 + λ). Plug in λ = 4 and 5/(1 + 4) = 1 falls straight out, and it shows the whole mechanism in one glance: bigger λ divides harder, shrinking the fit further toward zero. As λ→∞ every weight is crushed to 0, leaving a model that predicts the same constant everywhere and ignores the data entirely: lesson 4’s mean-baseline reborn as a regularization limit.
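The arithmetic above can be checked numerically. This is a minimal sketch, assuming the exercise's loss is (w − 5)² + λw², which is the form the derivative implies; the function names `loss`, `ridge_weight`, and `grid_minimizer` are hypothetical and chosen only for illustration.

```python
def loss(w: float, target: float, lam: float) -> float:
    """Assumed loss: squared distance to `target` plus an L2 charge on w."""
    return (w - target) ** 2 + lam * w ** 2


def ridge_weight(target: float, lam: float) -> float:
    """Closed-form minimizer of loss(): the unregularized optimum over (1 + lam)."""
    return target / (1.0 + lam)


def grid_minimizer(target: float, lam: float, steps: int = 100_000) -> float:
    """Brute-force check: the best w on an evenly spaced grid over [0, target]."""
    candidates = [target * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: loss(w, target, lam))


if __name__ == "__main__":
    # Larger lam should pull the fitted weight further from 5 toward 0.
    for lam in (0.0, 1.0, 4.0, 100.0):
        print(lam, ridge_weight(5.0, lam), grid_minimizer(5.0, lam))
```

The grid search knows nothing about the closed form, so agreement between the two columns is an independent check that dividing by (1 + λ) really does locate the minimum of this loss.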

Why shrinking toward zero fights overfitting. A model with large coefficients can swing wildly with small input changes: exactly lesson 3’s degree-9 polynomial, whose oscillations came from large, precisely-tuned coefficients threading noisy points. Penalizing size directly discourages that swinginess without having to hand-pick a simpler function family; you keep the flexible family but tax it for using more flexibility than the data can justify. In lesson 3’s bias/variance vocabulary, which applies exactly here too, regularization trades a little bias (the fit is no longer the absolute best on training data) for a lot less variance (it swings far less on a fresh batch).
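The bias-for-variance trade can be seen in a small simulation. This is a sketch under stated assumptions, not the lesson's own setup: a single weight is fitted to noisy observations of a true value by minimizing mean squared error plus λw², so the fit is the batch mean divided by (1 + λ). The true weight, noise level, batch size, and the names `shrunk_mean` and `bias_and_variance` are all made up for illustration.

```python
import random
import statistics


def shrunk_mean(ys, lam):
    """Minimizer of mean((w - y)^2) + lam * w^2: the sample mean over (1 + lam)."""
    return statistics.fmean(ys) / (1.0 + lam)


def bias_and_variance(lam, w_true=1.0, noise_sd=3.0, n=5, trials=2000, seed=0):
    """Refit on many fresh batches and summarize how the fitted w behaves."""
    rng = random.Random(seed)  # same seed, so every lam sees the same batches
    fits = []
    for _ in range(trials):
        batch = [w_true + rng.gauss(0.0, noise_sd) for _ in range(n)]
        fits.append(shrunk_mean(batch, lam))
    bias = statistics.fmean(fits) - w_true   # systematic offset from the truth
    variance = statistics.pvariance(fits)    # spread of the fit across batches
    return bias, variance


if __name__ == "__main__":
    for lam in (0.0, 0.5):
        bias, variance = bias_and_variance(lam)
        print(f"lam={lam}: bias={bias:+.3f} variance={variance:.3f} "
              f"mse={bias ** 2 + variance:.3f}")
```

With these illustrative settings (noisy data, a small true weight) the shrunk fit should show more bias but less variance and a lower total error. The trade is not free in general: when the true weight is large relative to the noise, the same λ can cost more in bias than it saves in variance.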

The Bayesian reading — the same idea from an entirely different direction. If you place a prior belief on w that it’s probably close to zero unless the data strongly argues otherwise (a specific shape of prior — Gaussian, centered at 0 — gives you exactly this L2 penalty as the math falls out), then finding the most probable w given the data (a Bayesian’s natural objective) produces precisely this same shrunk-toward-zero answer. Regularization strength λ and “how confident/narrow the prior is” turn out to be the same knob, described in two vocabularies: a large λ is a strong prior that coefficients are near zero; λ=0 is “no prior opinion, let the data fully decide.” This isn’t an analogy — it’s the identical optimization problem, arrived at by two historically separate traditions (frequentist penalized regression; Bayesian maximum-a-posteriori estimation) that turn out to compute the same number.
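The equivalence can be written out in a few lines. This is a sketch for a single weight, assuming each observation y_i equals w·x_i plus Gaussian noise of variance σ², and a Gaussian prior on w centered at 0 with variance τ². The symbols σ, τ, x_i, and y_i are introduced here for illustration and do not appear in the lesson.

```latex
\begin{align*}
p(w \mid y) &\propto \prod_{i=1}^{n} \exp\!\left(-\frac{(y_i - w x_i)^2}{2\sigma^2}\right)
               \cdot \exp\!\left(-\frac{w^2}{2\tau^2}\right) \\
-\log p(w \mid y) &= \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w x_i)^2
               + \frac{1}{2\tau^2}\,w^2 + \text{const} \\
\hat{w}_{\text{MAP}} &= \arg\min_w \; \sum_{i=1}^{n}(y_i - w x_i)^2 + \lambda w^2,
               \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{align*}
```

Multiplying the negative log-posterior by 2σ² does not move its minimum, which is how the last line follows. Under these assumptions a narrow prior (small τ) corresponds to a large λ, and an ever-wider prior (τ→∞) sends λ to 0.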

Where this goes: stage 1 built linear models end to end — fit, search, calibrate, tame. Stage 2 moves to a completely different function family that makes no linearity assumption at all: decision trees, which carve up feature space with a sequence of if/else cuts instead of a weighted sum.
