The Tree That Memorized the Forest

A decision tree is a completely different function family from lessons 5-8’s weighted sums: it carves up feature space with a sequence of axis-aligned cuts — “is square footage > 1,500? then is age < 10? then…” — each internal node a yes/no question on one feature, each leaf a final prediction. Built via greedy splitting: at each node, try every candidate cut on every feature, pick whichever single split most improves purity (how separated the classes/values become) right now, and recurse into the two resulting groups. “Greedy” because each choice is locally best — the algorithm never looks ahead to check whether a worse-looking split now would have enabled a better structure two levels down.

Nothing stops you from growing the tree until every leaf contains exactly one training example — keep splitting as long as a split is possible, no depth limit, no minimum leaf size. Each leaf then predicts that one training point’s label, exactly.

You’ve reasoned about a model shaped like this before, in a different costume. What does this fully-grown tree’s train and test performance look like, and which earlier lesson is it secretly the same thing as?