Part 14 of 58

The Heavy Pebble

By Madhav Kaushish · Ages 12+

The train/test split told Trviksha when overfitting was happening. But it did not prevent it. With twenty hidden velociraptors and sixteen hundred training patients, the network still memorized. She needed a way to push the network toward general patterns even when it had the capacity to memorize.

The Extreme Weights

She examined the weight arrangements of the overfitted twenty-velociraptor network and compared them to the well-behaved eight-velociraptor network.

The difference was striking. In the overfitted network, the weight pebble arrangements were extreme. Several velociraptors had weights encoded as very large numbers — 47, 82, -63 — while others had weights near zero. The network had concentrated its "knowledge" into a few intensely tuned velociraptors, each encoding a narrow, specific pattern.

In the eight-velociraptor network, the weights were modest. The largest was 8. Most were between -5 and 5. The knowledge was spread more evenly, each velociraptor contributing a moderate amount to the prediction. No single weight dominated.

Blortz: The overfitted network has extreme opinions. Each velociraptor is either shouting or silent. The well-behaved network has moderate opinions distributed across all velociraptors.

Trviksha: Extreme weights allow the network to carve very sharp, specific boundaries — boundaries tailored to individual training patients. Moderate weights force smoother, more general boundaries.

The Penalty

Trviksha's idea was to make large weights costly. During training, the loss function measured how wrong the predictions were. She added a second term: a penalty proportional to the size of the weights. Large weights increased the total loss, regardless of how accurate the predictions were.

The network now had to balance two competing goals: make accurate predictions (minimize the error) and keep the weights small (minimize the penalty). If a set of extreme weights improved accuracy by a small amount but increased the penalty by a larger amount, the network would prefer the moderate weights.
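Trviksha's two-term loss matches what practitioners call L2 regularization, or weight decay. A minimal sketch, assuming a squared-error loss and a hand-picked penalty coefficient `lam` (both are illustrative choices, not details from the story):

```python
import numpy as np

def penalized_loss(y_true, y_pred, weights, lam):
    # First goal: make accurate predictions (here, mean squared error).
    error = np.mean((y_true - y_pred) ** 2)
    # Second goal: keep the weights small. The "heaviness" of the
    # pebbles is the sum of squared weights, scaled by lam.
    penalty = lam * np.sum(weights ** 2)
    # Large weights raise the total loss even when predictions are accurate.
    return error + penalty
```

With identical predictions, an arrangement of extreme weights like 47 and 82 pays a far larger penalty than a moderate one, so the network prefers the lighter arrangement unless the extreme weights buy real accuracy.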

Blortz: You are adding a tax on extreme weights. The network can still use large weights if they are truly necessary — if the accuracy gain outweighs the penalty. But it cannot use large weights for free.

Trviksha: Exactly. Think of each weight pebble as having physical weight. A small arrangement is light — easy to carry. A large arrangement is heavy — the velociraptor must work harder to hold it. Given a choice between a heavy arrangement that is slightly more accurate and a light arrangement that is nearly as accurate, the velociraptor chooses the lighter one.

Illustration: A velociraptor struggles to hold an enormous pebble arrangement (an extreme weight) in one arm while a smaller, lighter arrangement sits easily in the other. Trviksha points at the smaller one approvingly.

The Result

She re-trained the twenty-velociraptor network with the weight penalty added to the loss function. The same architecture, the same data, the same backward propagation — but now the loss included the penalty.
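During backward propagation, the penalty contributes its own gradient, which pulls every weight toward zero on each pass. A toy illustration with a single layer of weights and invented numbers (the sixteen "patients", features, and coefficients here are made up for the sketch, not taken from the story):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))          # toy "patients", 3 features each
y = (X[:, 0] > 0).astype(float)       # toy labels
w = rng.normal(0, 5, size=3)          # starting weights, some fairly large
lam, lr = 0.05, 0.1                   # penalty strength and learning rate

for _ in range(100):                  # a hundred passes over the data
    pred = X @ w
    grad_error = 2 * X.T @ (pred - y) / len(y)   # gradient of the error term
    grad_penalty = 2 * lam * w                    # gradient of lam * sum(w**2)
    w -= lr * (grad_error + grad_penalty)         # both terms steer the update
```

The `grad_penalty` term is proportional to each weight itself, so the biggest weights are pulled hardest: exactly the "tax on extreme weights" Blortz described.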

Training accuracy after a hundred passes: 91%. Not the 99.8% of the unpenalized network. The penalty had prevented the network from memorizing perfectly — the cost of extreme weights kept it from carving those patient-specific boundaries.

Test accuracy: 85%.

The gap had closed dramatically. Training 91%, test 85% — compared to training 99.8%, test 64% without the penalty. The network had traded a small amount of training accuracy for a large gain in test accuracy. It had found a simpler, more general pattern because the penalty made the specific, complex one too expensive.

Trviksha: The penalty pushed the network toward simplicity. Not because simplicity is inherently good, but because simplicity tends to generalize. A model that explains the data with moderate weights has probably found a real pattern. A model that needs extreme weights to explain the data has probably memorized it.

Glagalbagal: You are telling the pebbles to be humble?

Trviksha: I am telling the pebbles that confidence is expensive. If a velociraptor wants to have a strong opinion — a large weight — it must pay for it with accuracy elsewhere. Only genuine patterns justify the cost.

The Dial

The strength of the penalty was adjustable. A strong penalty forced very small weights, producing simpler models that might miss real patterns because the weights were too constrained to capture the data's complexity. A weak penalty allowed larger weights, producing more complex models that risked memorization.
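One way to see the dial's effect is with a linear model, where the penalized solution can be computed directly (ridge regression). This is a stand-in for the network, not the story's actual model; the data and coefficients are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                    # toy inputs
true_w = np.array([3.0, -2.0, 0.5, 0.0])
y = X @ true_w + rng.normal(0, 0.1, size=30)    # noisy targets

def ridge_weights(lam):
    # Closed-form minimizer of squared error + lam * sum(w**2).
    return np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

for lam in [0.0, 0.1, 1.0, 10.0]:
    w = ridge_weights(lam)
    print(f"lam={lam:<5} largest weight={np.abs(w).max():.3f}")
```

Turning the dial up shrinks every weight toward zero; turning it to zero recovers the unpenalized fit, extreme weights and all.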

Blortz: Another dial. Like the threshold and the loss function, the penalty strength is a choice.

Trviksha: Everything is a choice. The architecture is a choice. The loss function is a choice. The penalty strength is a choice. The data split is a choice. The network finds the best weights given those choices. But the choices themselves are mine.

She settled on a penalty strength that produced 85% test accuracy. It was not the highest possible — a slightly weaker penalty gave 86% but was more sensitive to changes in the data. The penalty strength was one more human decision in a system that was supposed to learn on its own.