Part 13 of 58

The Holdout

By Madhav Kaushish · Ages 12+

The twenty-velociraptor network had memorized two thousand one hundred patients and learned nothing general. Trviksha needed a way to detect this problem before deploying a model — not after.

The Idea

Blortz: You trained the network on all two thousand one hundred patients and then tested it on the same two thousand one hundred patients. Of course it scored perfectly — it had seen every one of them.

Trviksha: I should have tested it on patients it had not seen during training. Like the three hundred and twelve that Grothvik sent later.

Blortz: Why wait for Grothvik to send new data? Set some aside before you start training. Train on most of the data. Test on the rest. If the network has truly learned the pattern, it should work nearly as well on the patients it never saw.

The idea was simple and, in retrospect, obvious. Before training, Trviksha divided the two thousand one hundred patients into two groups: a training set of sixteen hundred and eighty patients, and a test set of four hundred and twenty. She locked the test set away on a separate shelf. The velociraptors were forbidden from accessing it during training.
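Trviksha's split can be sketched in a few lines of Python. This is an illustration, not anything from the story itself: the patient records are stand-in strings, and the shuffle-then-slice approach is one common way to make a holdout split.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records, then set aside a fraction as a locked test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not ordered by arrival
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

patients = [f"patient_{i}" for i in range(2100)]
train, test = train_test_split(patients)
print(len(train), len(test))  # 1680 420
```

Shuffling before slicing matters: if the records arrived in some order (say, sickest first), slicing without shuffling would give the test set a different mix of patients than the training set.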

The Experiment

She re-trained the twenty-velociraptor network on the training set only. After a hundred passes, the training accuracy reached 99.8%. Near-perfect, as before.

Then she unlocked the test shelf and ran the four hundred and twenty withheld patients through the network.

Test accuracy: 64%.

Trviksha: Training accuracy 99.8%. Test accuracy 64%. The gap tells me everything. The network memorized the training patients and learned almost nothing that transfers to new ones.
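The gap Trviksha describes can be demonstrated with a toy "memorizer" in Python. This is a sketch, not the velociraptor network: the patients and their sickness labels are random, so there is no general pattern to learn, and a perfect training score can only mean memorization.

```python
import random

rng = random.Random(0)
# Toy patients: (id, sick?) pairs with random labels.
patients = [(i, rng.choice([True, False])) for i in range(2100)]
train, test = patients[:1680], patients[1680:]

memory = {pid: sick for pid, sick in train}   # "memorize" every training patient
predict = lambda pid: memory.get(pid, False)  # unseen patient: just guess "healthy"

train_acc = sum(predict(pid) == sick for pid, sick in train) / len(train)
test_acc = sum(predict(pid) == sick for pid, sick in test) / len(test)
print(train_acc)  # 1.0 -- perfect on patients it has seen
print(test_acc)   # roughly 0.5 -- no better than guessing on new patients
```

The lookup table scores perfectly on the training set by construction, yet on the held-out patients it can do no better than chance, because there was never a pattern to transfer.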

She rebuilt the network with eight hidden velociraptors instead of twenty and re-trained on the same sixteen hundred and eighty patients. Training accuracy after a hundred passes: 88%. Test accuracy: 83%.

Trviksha: The smaller network cannot memorize as easily — it does not have enough capacity. So it is forced to find patterns that generalize. Training accuracy is lower, but test accuracy is much higher. The gap between them is small.

Blortz: The eight-velociraptor network learned less about the specific patients and more about sickness in general.

The Rule

Trviksha established a rule for herself and for any future model builders at GlagalCloud:

Never evaluate a model on the data it was trained on. Always hold out a portion of data that the model never sees during training. The test accuracy — not the training accuracy — is the true measure of performance.

If training accuracy is high and test accuracy is low, the model has overfit. If both are high, the model has likely learned a genuine pattern. If both are low, the model is simply bad and needs a different approach.
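Trviksha's three-way rule can be written as a small diagnostic function. The thresholds here are illustrative choices, not from the story; only the two accuracy pairs come from her experiments.

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, good_threshold=0.80):
    """Classify a model by the gap between training and test accuracy.
    Thresholds are illustrative; tune them to the problem at hand."""
    if train_acc - test_acc > gap_threshold:
        return "overfit: memorized the training data"
    if train_acc >= good_threshold and test_acc >= good_threshold:
        return "likely learned a genuine pattern"
    return "underperforming: try a different approach"

print(diagnose(0.998, 0.64))  # the twenty-velociraptor network: overfit
print(diagnose(0.88, 0.83))   # the eight-velociraptor network: likely learned
```

Note that the rule looks at both the gap and the absolute level: a model with a tiny gap but low accuracy everywhere has not overfit, it has simply failed.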

Grothvik: You are saying that perfect training scores are actually a warning sign?

Trviksha: For a complex model, yes. If a model with many velociraptors achieves perfect accuracy on training data, it has probably memorized rather than learned. A model that achieves 88% on training and 83% on new data is far more trustworthy than one that achieves 100% on training and 64% on new data.

Grothvik: That is counterintuitive. I would have assumed perfect is better.

Trviksha: Perfect on a practice test is suspicious. High marks on a surprise test are trustworthy.

[Illustration: A stone shelf divided in two: a large section labelled "Training" with many patient records, and a smaller locked section labelled "Test", with a stone barrier between them. A velociraptor reaches toward the test section, and Trviksha blocks it with a flat hand.]

Grothvik's Question

Grothvik: How much data should I hold out? If I hold out too much, the model trains on too little. If I hold out too little, the test is unreliable.

Trviksha: I have been using one in five — twenty percent. Enough for a meaningful test, while leaving eighty percent for training. But if the data is small, even twenty percent might not be enough for a reliable test. And if the data is large, you can afford to hold out less.
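One way to make "reliable" concrete is the standard error of an accuracy estimate, roughly sqrt(acc × (1 − acc) / n) for a test set of n examples. This is a standard statistical rule of thumb, not something stated in the story; the dataset sizes and the 20% fraction are illustrative.

```python
import math

# Rough uncertainty of a measured test accuracy near 80%,
# for a 20% holdout from datasets of different sizes.
for total in (100, 2100, 100_000):
    n_test = int(total * 0.2)
    se = math.sqrt(0.8 * 0.2 / n_test)  # standard error, assuming true accuracy ~80%
    print(f"{total} patients -> test set of {n_test}, accuracy known to about +/-{2 * se:.1%}")
```

With only twenty test patients the measured accuracy could easily be off by fifteen points or more, which is why a small dataset may need a larger holdout fraction, while a huge dataset can afford a smaller one.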

Grothvik: And you should never look at the test data while adjusting the model?

Trviksha: Never. If you adjust the model based on test performance, the test set becomes part of the training process, and you lose the ability to detect memorization. The test set must remain a genuine surprise.

This discipline — separating the data into training and testing before any model building began — became a standard practice at GlagalCloud. Every new model was evaluated on data it had never seen. Training accuracy was tracked but not trusted. Test accuracy was the measure that mattered.