Solution: The Dumbest Model in the Room

MSE = 1.6.

Mean of {2, 4, 4, 4, 6} = 20/5 = 4. Squared errors: (2−4)²=4, (4−4)²=0, (4−4)²=0, (4−4)²=0, (6−4)²=4. Sum = 8. Average over 5 examples: 8/5 = 1.6.
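The same arithmetic can be reproduced in a few lines of plain Python. This is only a sketch to check the hand calculation; the variable names are chosen here for illustration and do not come from the lesson.

```python
# Targets from the exercise.
targets = [2, 4, 4, 4, 6]

# The baseline predicts the mean of the targets for every example.
mean_prediction = sum(targets) / len(targets)

# Squared error of that constant prediction on each example.
squared_errors = [(y - mean_prediction) ** 2 for y in targets]

# MSE is the average of the squared errors.
mse = sum(squared_errors) / len(targets)

print(mean_prediction)  # 4.0
print(squared_errors)   # [4.0, 0.0, 0.0, 0.0, 4.0]
print(mse)              # 1.6
```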

Notice what that number actually is: the average squared deviation from the mean, which is just the textbook definition of (population) variance, the version that divides by n rather than n − 1. When the mean is computed on the same data it is scored on, the predict-the-mean baseline’s MSE is always exactly the target’s variance, by construction, for any dataset. That’s not a coincidence worth memorizing so much as a fact worth using: it means “how much better does my model do than the mean baseline” and “what fraction of the target’s variance does my model explain” are the same question asked two ways. That is exactly what R² measures (1 − model’s MSE / baseline’s MSE), a metric you’ll meet properly once linear regression is on the table.
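A quick way to convince yourself of both claims is to check them numerically. The sketch below uses Python's standard statistics module; the helper names mse and r_squared and the model_predictions values are made up for illustration, not taken from the lesson. Note that the match is with the population variance (divide by n); the sample variance (divide by n − 1) comes out differently.

```python
import statistics

def mse(targets, predictions):
    """Mean squared error between two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)

def r_squared(targets, predictions):
    """R^2 written as 1 - (model MSE / mean-baseline MSE)."""
    baseline = [statistics.fmean(targets)] * len(targets)
    return 1 - mse(targets, predictions) / mse(targets, baseline)

targets = [2, 4, 4, 4, 6]

# MSE of the predict-the-mean baseline, scored on the same data.
baseline_mse = mse(targets, [statistics.fmean(targets)] * len(targets))

# Population variance divides by n; sample variance divides by n - 1.
population_variance = statistics.pvariance(targets)
sample_variance = statistics.variance(targets)

# Hypothetical model predictions, invented purely for this example.
model_predictions = [3, 4, 4, 4, 5]
model_r2 = r_squared(targets, model_predictions)

print(baseline_mse, population_variance)  # both 1.6
print(sample_variance)                    # 2.0, so not the same quantity
print(model_r2)                           # 0.75
```

Here the invented model has an MSE of 0.4 against the baseline's 1.6, so it removes three quarters of the baseline's error, which is what an R² of 0.75 says.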

Why baselines are the unglamorous hero of the field:

They’re a floor, not a target. A model beating the mean-baseline by a hair isn’t “state of the art minus a bit” — it might be capturing almost nothing. The size of the gap between model and baseline is the real signal, not the model’s raw score in isolation.
The right baseline takes real thought. “Predict the mean” is trivial to compute but surprisingly easy to beat if there’s any structure at all; “predict yesterday” for a time series is trivial to compute and shockingly hard to beat, because most of what looks like “the model learned to forecast” is often just the series’ own slow-moving autocorrelation, which the persistence baseline captures for free. Comparing a fancy forecasting model against the mean instead of against yesterday is one of the most common ways forecasting results get overstated in practice.
A model that barely beats its baseline may still not be worth deploying. Complexity, latency, and maintenance cost are real; “beats predict-yesterday by 2%” has to be weighed against those costs, not treated as automatic proof the complexity was worth it.
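To make the choice-of-baseline point concrete, here is a minimal sketch that scores the same forecast against both baselines. Everything in it is an assumption made for illustration: the series is a made-up, slowly rising sequence, the model's MSE is an invented number rather than the output of a real model, and the helper names mse and skill are not from the lesson.

```python
# Made-up, slowly rising series used only for illustration.
series = [10, 11, 11, 12, 13, 13, 14, 15, 15, 16]

def mse(targets, predictions):
    """Mean squared error between two equal-length sequences."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)

def skill(model_mse, baseline_mse):
    """Fractional MSE reduction relative to a baseline: 1 - model / baseline."""
    return 1 - model_mse / baseline_mse

# Score every point that has a "yesterday" to predict from.
targets = series[1:]

# Persistence baseline: predict the previous value.
persistence_mse = mse(targets, series[:-1])

# Mean baseline: predict the average of the scored targets
# (computed in-sample, the most generous version of this baseline).
mean_value = sum(targets) / len(targets)
mean_mse = mse(targets, [mean_value] * len(targets))

# Hypothetical MSE for some "fancy" forecasting model, invented for this sketch.
hypothetical_model_mse = 0.6

skill_vs_mean = skill(hypothetical_model_mse, mean_mse)
skill_vs_persistence = skill(hypothetical_model_mse, persistence_mse)

print(f"mean baseline MSE:        {mean_mse:.3f}")
print(f"persistence baseline MSE: {persistence_mse:.3f}")
print(f"improvement over mean:        {skill_vs_mean:.0%}")
print(f"improvement over persistence: {skill_vs_persistence:.0%}")
```

On this toy series the same invented model looks like a large improvement when compared with the mean and only a modest one when compared with persistence. Whether the modest gain justifies the extra complexity is the deployment question, and the code cannot answer it.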

Where this goes: every baseline here ignored the input features entirely. Stage 1 opens with the simplest model that actually uses them — linear regression — and the first question about it is exactly this lesson’s question, asked one level up: what do the fitted coefficients mean, and what do they not mean?