The Penalty That Shrinks the Fit

Lessons 5-7 fit weights by minimizing a loss built purely from data-fit: squared error, or cross-entropy for probabilities. Left alone, that loss is happy to grow a coefficient as large as the data can justify, and lesson 3’s discussion of overfitting showed exactly where unchecked flexibility leads: a curve threading every point, memorizing noise.

Regularization adds a second term to the loss, a penalty on the coefficients’ size, so the fit has to earn every unit of coefficient magnitude rather than grab whatever minimizes data error alone. The common form (L2, or “ridge”, regularization) adds λ · w² for each weight w, where λ (lambda) controls how hard the penalty bites. For a single weight:

total loss = (data-fit loss) + λ · w²

λ = 0 recovers the unregularized fit exactly. As λ grows, the optimizer trades away data-fit accuracy for a smaller w, because the penalty term now costs real loss-points too, and a very large λ pushes w toward zero. For some intermediate λ, the fitted coefficient is smaller than pure data-fitting would choose but often generalizes better, precisely because it hasn’t chased every quirk of the training set.
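The shrinking effect can be sketched in a few lines of Python. This is a minimal illustration, not a full ridge implementation: it assumes a single weight, no intercept, and a squared-error data-fit loss, so the total loss is Σ(y − w·x)² + λ·w² and its minimizer is w = Σxy / (Σx² + λ). The function name ridge_weight and the data points are made up for the example.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)**2) + lam * w**2 for one weight w (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w * (sxx + lam) = sxy.
    return sxy / (sxx + lam)


# Made-up illustration data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Larger lambda means a larger denominator, so the fitted weight shrinks.
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:6.1f}  w={ridge_weight(xs, ys, lam):.4f}")
```

With λ = 0 the function returns the ordinary least-squares weight, and each larger λ returns a smaller one, which is the trade-off described above.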

Take a toy one-weight loss where data-fit alone wants w = 5 (the unregularized optimum): L(w) = (w − 5)² + λw². With λ = 4, find the w that minimizes L(w).

(Calculus shortcut: set the derivative to zero. dL/dw = 2(w−5) + 2λw = 0, solve for w.)
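One way to check a hand-derived answer is to minimize the same toy loss numerically. The sketch below uses plain gradient descent with the derivative given above. The function names, starting point, step size, and step count are arbitrary choices for illustration, not part of the lesson.

```python
def toy_loss_gradient(w, lam):
    """dL/dw for L(w) = (w - 5)**2 + lam * w**2."""
    return 2 * (w - 5) + 2 * lam * w


def minimize_toy_loss(lam, lr=0.05, steps=200):
    """Gradient descent from w = 0; stable only while lr < 1 / (1 + lam)."""
    w = 0.0
    for _ in range(steps):
        w -= lr * toy_loss_gradient(w, lam)
    return w


# Compare this with the value you solved for by hand.
print(minimize_toy_loss(lam=4.0))
```

Calling minimize_toy_loss(lam=0.0) should land close to 5, the unregularized optimum, which is a quick sanity check that the descent loop is doing what it claims.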