Part 19 of 58
The Same Eyes Everywhere
By Madhav Kaushish · Ages 12+
The sliding-window approach worked well. But Trviksha had not yet appreciated how dramatically it changed the economics of the network.
Counting the Pebbles
Blortz, ever the bookkeeper, did the arithmetic.
Blortz: Your old fully connected network took four hundred inputs and connected each one to each hidden velociraptor. With sixteen hidden velociraptors, that was four hundred times sixteen weights — six thousand four hundred weight arrangements. Plus sixteen weights from the hidden layer to the output. Six thousand four hundred and sixteen weights in total.
Trviksha: That sounds right.
Blortz: Your new sliding-window approach has nine weights.
Trviksha: Nine.
Blortz: Nine weights, applied at every position on a 20×20 grid. The same nine weights, reused three hundred and twenty-four times — eighteen by eighteen places the window can sit without hanging off the edge. Nine pebble arrangements on one shelf versus six thousand four hundred on sixty shelves. The storage difference is enormous.
Nine weights versus six thousand four hundred and sixteen. The convolutional approach used less than one-seventh of one percent of the fully connected approach's parameters. And it worked better.
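Blortz's bookkeeping can be checked in a few lines. This is a sketch of the arithmetic only; the variable names are illustrative, and the numbers come straight from the story.

```python
# Fully connected: every one of the 400 inputs connects to each of the
# 16 hidden units, plus one weight from each hidden unit to the output.
inputs = 20 * 20                              # 400 plots in the field
hidden = 16                                   # hidden velociraptors
fully_connected = inputs * hidden + hidden    # 400 * 16 + 16 = 6,416

# Sliding window: one shared 3x3 patch of weights, reused everywhere.
window = 3 * 3                                # 9

# Positions a 3x3 window can occupy inside a 20x20 grid.
positions = (20 - 3 + 1) ** 2                 # 18 * 18 = 324

print(fully_connected)                        # 6416
print(window)                                 # 9
print(positions)                              # 324
print(round(100 * window / fully_connected, 2))  # 0.14 percent
```

Nine weights out of 6,416 is about 0.14% — just under one-seventh of one percent, as the story says.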
Why Fewer Is Better
Trviksha: It is not just that fewer weights are cheaper to store and faster to adjust. Fewer weights actually produce a better model.
Glagalbagal: How can knowing less lead to better predictions?
Trviksha: The fully connected network had six thousand four hundred chances to memorize. Each weight could encode a position-specific quirk — "plot 17 tends to be blighted in the training fields." With that many free parameters and only forty training fields, the network had more than enough capacity to memorize every field individually.
The convolutional network had nine chances to encode knowledge. Nine weights had to capture everything the network knew about what blight looked like. There was no room to memorize specific fields or specific positions. The nine weights could only encode the general pattern — and the general pattern was what Kvrothja needed.
Blortz: Fewer weights force generalization. The network cannot memorize because it does not have enough capacity. It must learn the rule.
Trviksha: And the rule is the same everywhere in the field, which is exactly what the shared weights assume. The assumption matches the reality. A disease cluster genuinely does look the same regardless of position. So the constraint — "use the same weights everywhere" — is not a limitation. It is correct.
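The "same weights everywhere" idea Trviksha describes can be sketched directly: one shared 3×3 patch of weights is applied at every position of the grid. This is a minimal illustration, not the story's actual network — the field, the weights, and the function name are all made up for the example.

```python
def slide(field, weights):
    """Apply one shared 3x3 weight patch at every position of a 2D grid."""
    rows, cols = len(field), len(field[0])
    out = []
    for r in range(rows - 2):              # window's top-left row
        out_row = []
        for c in range(cols - 2):          # window's top-left column
            total = 0.0
            for i in range(3):
                for j in range(3):
                    # The SAME nine weights are reused at every position.
                    total += weights[i][j] * field[r + i][c + j]
            out_row.append(total)
        out.append(out_row)
    return out

# A 5x5 toy field with one bright 3x3 cluster in the top-left corner.
field = [[0] * 5 for _ in range(5)]
for i in range(3):
    for j in range(3):
        field[i][j] = 1

# Nine equal weights: the window simply sums whatever patch it covers.
weights = [[1] * 3 for _ in range(3)]

response = slide(field, weights)
print(response[0][0])   # 9 — the window fully overlaps the cluster
```

However large the field, the network still stores only those nine numbers; only the count of positions grows.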
The Comparison
To confirm, Trviksha trained both architectures on the same forty fields with the same train/test split and the same regularization:
| Architecture | Weights | Training accuracy | Test accuracy |
|---|---|---|---|
| Fully connected (16 hidden) | 6,416 | 94% | 76% |
| Sliding window (3×3) | 9 | 86% | 88% |
The fully connected network had higher training accuracy but lower test accuracy — classic overfitting. The sliding window had lower training accuracy (it could not memorize) but higher test accuracy (it had learned the right thing).

Kvrothja: The simpler system is better?
Trviksha: The simpler system is better because its simplicity matches the problem. The data has spatial structure. The sliding window respects that structure — same weights everywhere, local patches only. The fully connected network ignores the structure and wastes its capacity learning things that are not true in general.
Blortz: If the data did not have spatial structure — if the pattern were different at every position — the sliding window would fail, because its assumption would be wrong.
Trviksha: Correct. The sliding window works because the assumption of translation invariance holds for this data. For data where it does not hold — patient records, for instance, where each column means something different — you would not use this approach.
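Translation invariance — the assumption Trviksha says must hold for the data — can be seen in a toy check: move the same cluster to a different spot, and the window's best match does not change. Again a sketch with made-up helper names, not the story's network.

```python
def max_response(field, weights):
    """Strongest match of a shared 3x3 weight patch anywhere in the grid."""
    best = float("-inf")
    for r in range(len(field) - 2):
        for c in range(len(field[0]) - 2):
            score = sum(weights[i][j] * field[r + i][c + j]
                        for i in range(3) for j in range(3))
            best = max(best, score)
    return best

def field_with_cluster(top, left, size=6):
    """A size x size field of zeros with a 3x3 cluster of ones at (top, left)."""
    f = [[0] * size for _ in range(size)]
    for i in range(3):
        for j in range(3):
            f[top + i][left + j] = 1
    return f

weights = [[1] * 3 for _ in range(3)]

# The same cluster, placed in two different corners, scores identically:
# position does not matter, only the pattern does.
print(max_response(field_with_cluster(0, 0), weights))  # 9
print(max_response(field_with_cluster(3, 3), weights))  # 9
```

For data like patient records, where each column has its own meaning, shifting a pattern sideways changes what it means — and this check would be the wrong thing to want.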
The lesson was not that fewer weights are always better. It was that encoding the right assumptions into the architecture can dramatically reduce the number of weights needed, while simultaneously improving performance. Knowing the structure of the problem allowed Trviksha to build a network that was both smaller and smarter.