Lesson 7 · Solution · Logistic regression

Solution: Squashing a Line Into a Probability

σ(2) = 1 / (1 + e⁻²) ≈ 1 / (1 + 0.1353) ≈ 1 / 1.1353 ≈ 0.881.

The model predicts an 88.1% chance of churn for this customer. Notice what each piece contributed: the linear part (z = 2.0) came entirely from lesson 5’s machinery (coefficients times features, plus an intercept, exactly as before). Sigmoid’s only job was translating “2.0, a number with no natural upper bound” into “0.881, a number that can legitimately be called a probability.” At z = 0, sigmoid gives exactly 0.5 (maximally uncertain); as z grows past roughly +3 or falls below −3, sigmoid saturates near 1 or 0 respectively (σ(3) ≈ 0.953, σ(−3) ≈ 0.047). Extreme linear scores translate into near-certainty, which is the intended behavior, not a bug. It also means that, in exact arithmetic, a logistic model never outputs a literal 0 or 1: the curve approaches but never touches either end.
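To check the arithmetic by machine, here is a minimal sketch of the sigmoid step on its own. Only z = 2.0 comes from the exercise; the other z values are illustrative picks to show the midpoint and the two tails, and the function name `sigmoid` is just a convenient label.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z = 2.0 is the linear score from the exercise; the rest are illustrative
# points chosen to show the midpoint and the saturating tails.
for z in (-3.0, 0.0, 2.0, 3.0):
    print(f"z = {z:+.1f}  ->  sigmoid(z) = {sigmoid(z):.3f}")
```

Run as written, this should print values close to 0.047, 0.500, 0.881, and 0.953, matching the hand calculation for z = 2.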

Why “bounded in [0,1]” isn’t the same claim as “calibrated.” You could bound any linear output into [0,1] with a cheap trick (clip anything below 0 to 0 and anything above 1 to 1) and get numbers that are technically valid probabilities but garbage as beliefs: among cases clipped to “1.0,” some large fraction might not actually be positive at all. Sigmoid isn’t just a clipping function. Fitting the model is what pushes the 0.881 toward meaning “roughly 88% of customers who look this way actually churn.” The fit minimizes a loss called cross-entropy, gradient descent’s other classic target, which is built specifically to reward well-calibrated probabilities rather than merely correct classifications. The shape (sigmoid) makes calibration possible; the training is what makes it true, and even then only approximately, and only as well as the training data and loss allow (stage 3 has a whole lesson on when this promise breaks).
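One way to see why cross-entropy rewards calibration is a small sketch, under a made-up assumption: take a group of look-alike customers of whom 88% actually churn, and ask which single predicted probability gives the lowest average loss. The 0.88 rate, the candidate grid, and the helper names `cross_entropy` and `expected_loss` are all hypothetical, chosen only for illustration.

```python
import math

def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for one example: prediction p, label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def expected_loss(p: float, true_rate: float) -> float:
    """Average loss from predicting p for a group whose real positive rate is true_rate."""
    return true_rate * cross_entropy(p, 1) + (1.0 - true_rate) * cross_entropy(p, 0)

# Hypothetical group: 88% of these customers actually churn.
true_rate = 0.88

# Candidate predictions 0.01, 0.02, ..., 0.99 (0 and 1 excluded: log(0) is undefined).
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: expected_loss(p, true_rate))

print(best_p)  # the candidate with the lowest average cross-entropy
```

On this grid the minimum lands on 0.88: the loss is lowest when the stated probability matches the real frequency. A prediction clipped to exactly 1.0, by contrast, would incur unbounded loss on every customer in the group who does not churn.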

A framing worth keeping: logistic regression is often introduced as “classification,” but the sharper way to say it is “a linear model for probabilities.” The classification (churn / no churn) is a decision layered on top (usually: call it “churn” if the probability clears 0.5), and that threshold is a separate choice you make afterward, not something the model itself dictates. Two deployments of the identical model can reasonably use different thresholds: a churn-prevention team might act on anything above 0.3 if false alarms are cheap.
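The split between model and decision can be sketched in a few lines. The probabilities below are invented stand-ins for one fitted model’s outputs (0.881 is the customer from this exercise), and `decide` is a hypothetical helper, not part of any library.

```python
# Hypothetical predicted churn probabilities from a single fitted model.
probs = [0.12, 0.35, 0.48, 0.62, 0.881]

def decide(probabilities, threshold):
    """Label each case "churn" if its probability clears the threshold."""
    return ["churn" if p > threshold else "no churn" for p in probabilities]

default_labels = decide(probs, 0.5)    # the usual cutoff
retention_labels = decide(probs, 0.3)  # act earlier when false alarms are cheap

print(default_labels)
print(retention_labels)
```

The probabilities never change between the two calls; only the decision rule does. With these made-up numbers, the 0.5 cutoff flags two customers and the 0.3 cutoff flags four.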

Where this goes: both lines and their squashed-through-sigmoid cousins can overfit exactly like lesson 3’s polynomial did (more features, more capacity, the same U-shaped test-error risk). Next: regularization, the standard fix, which penalizes complexity directly in the loss, with a reading (previewed here, developed properly in the Bayes track) as encoding a prior belief that coefficients should stay small unless the data insists otherwise.
