Lesson 2 · Solution · Train/test split

Solution: The Exam You've Already Seen

0.7 — exactly the majority-class base rate. Every test input is unseen, so the lookup table’s encyclopedic knowledge contributes nothing; each test row gets the fallback guess “A”, which is right 70% of the time. All that perfect recall bought precisely zero points over the dumbest possible strategy.
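The result above can be reproduced with a minimal sketch of a lookup-table memorizer. Everything here is an assumption made for illustration: the class name `LookupMemorizer`, the integer inputs, and the ten-row train and test sets are hypothetical stand-ins, not the lesson's actual data. The only properties carried over from the text are a 70% share of “A” labels and a fallback guess for unseen inputs.

```python
from collections import Counter


class LookupMemorizer:
    """Hypothetical memorizer: stores exact input -> label pairs and
    falls back to the majority training label for unseen inputs."""

    def fit(self, inputs, labels):
        self.table = dict(zip(inputs, labels))
        # Majority class of the training labels is the fallback guess.
        self.fallback = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, inputs):
        return [self.table.get(x, self.fallback) for x in inputs]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical data: 70% "A", 30% "B", with no input shared between splits.
train_x = list(range(10))
train_y = ["A"] * 7 + ["B"] * 3
test_x = list(range(100, 110))
test_y = ["A"] * 7 + ["B"] * 3

model = LookupMemorizer().fit(train_x, train_y)
train_acc = accuracy(model.predict(train_x), train_y)
test_acc = accuracy(model.predict(test_x), test_y)

print(f"train accuracy: {train_acc:.0%}")
print(f"test accuracy:  {test_acc:.0%}")
```

Because no test input appears in the table, every test prediction is the fallback label, so the test accuracy is just the share of that label in the test set.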

The two numbers side by side are the lesson:

  • Train accuracy: 100% — an exam where the model had the answer key.
  • Test accuracy: 70% — the same exam with fresh questions. This is the only number that predicts deployment.

Their difference, the generalization gap (here a chasm: 30 points), measures how much of the training performance was memory rather than pattern. The held-out split works for one simple reason: since the model provably received no information about those rows, whatever it scores on them can only have come from structure that carries over — i.e., from actual learning. It is the cheapest honest experiment in the field, and skipping or contaminating it is how confident nonsense ships to production.
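As a rough sketch of the mechanics, the split and the gap can each be written in a few lines. The helper names `holdout_split` and `generalization_gap`, the 30% test fraction, and the twenty placeholder row ids are all assumptions chosen for illustration. The accuracies passed in are the two numbers from this lesson.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and set aside a fraction for testing.

    The held-out rows must not be touched again until final evaluation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def generalization_gap(train_accuracy, test_accuracy):
    """Train accuracy minus test accuracy, as a fraction."""
    return train_accuracy - test_accuracy


rows = list(range(20))  # hypothetical row ids
train_rows, test_rows = holdout_split(rows)

# No row may appear on both sides of the split.
assert not set(train_rows) & set(test_rows)

gap = generalization_gap(1.00, 0.70)
print(f"{len(train_rows)} train rows, {len(test_rows)} test rows")
print(f"generalization gap: {gap * 100:.0f} points")
```

Fixing the seed keeps the split reproducible, so the same rows stay held out across runs.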

Three things worth engraving:

  1. Train error is not evidence. For any reasonably flexible family (lesson 1’s lookup table is the limiting case), low train error is guaranteed, and a guaranteed outcome carries no information. Only held-out performance discriminates learning from memorization.
  2. The base rate is the floor, and you must know where the floor is. “70% accurate” sounded respectable until you noticed always-say-A scores the same. Every evaluation needs its dumb baseline stated alongside it. The Bayes track would call this habit remembering the prior, and it gets a full lesson soon (baselines are embarrassingly hard to beat).
  3. The split’s honesty is fragile. The whole argument rested on “no information about the test rows reached the model.” Real pipelines violate this in sneaky ways — normalizing with statistics computed over all rows, deduplicating after splitting, tuning hyperparameters against the test set until it stops being held-out. The violations have a name, leakage, and an entire lesson in stage 3; it is the most expensive bug class in applied ML precisely because it makes the honest number quietly dishonest.
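Points 2 and 3 translate into a few habits that can be sketched in code: state the majority-class baseline, deduplicate before splitting, and compute normalization statistics from the training rows only. This is a minimal sketch under stated assumptions, not a complete pipeline. All function names and the eight `(feature, label)` rows are hypothetical, and exact-match deduplication is a simplification that would miss near-duplicates.

```python
import random
import statistics
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority_label for y in test_labels) / len(test_labels)


def dedupe(rows):
    """Drop exact duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy and hold out a fraction of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(train_values):
    """Mean and standard deviation computed from training rows only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


# Hypothetical (feature, label) rows containing two exact duplicates.
rows = [(1.0, "A"), (2.0, "A"), (2.0, "A"), (3.0, "B"),
        (4.0, "A"), (5.0, "B"), (5.0, "B"), (6.0, "A")]

unique_rows = dedupe(rows)                  # deduplicate BEFORE splitting
train_rows, test_rows = split(unique_rows)
assert not set(train_rows) & set(test_rows)

# Fit on train, then apply the same statistics to both splits.
mean, std = fit_standardizer([x for x, _ in train_rows])
train_scaled = standardize([x for x, _ in train_rows], mean, std)
test_scaled = standardize([x for x, _ in test_rows], mean, std)

baseline = majority_baseline_accuracy(
    [y for _, y in train_rows], [y for _, y in test_rows]
)
print(f"majority-class baseline on the test rows: {baseline:.0%}")
```

The order of operations is the point: anything that looks at all rows (deduplication) happens before the split, and anything fitted (the standardizer) sees only the training side.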

Where this goes: the memorizer sits at one end of a dial, with maximum flexibility and zero generalization beyond the base rate. The next lesson turns that dial continuously (polynomial degree) and watches the test error trace the most important curve in machine learning: down, and then treacherously back up.
