Part 35 of 58

More Is More

By Madhav Kaushish · Ages 12+

The six-block transformer predicted the next word reasonably well on Hjentova's archive. But "reasonably well" was not enough for the archivist's purposes.

The First Scaling Experiment

Trviksha tried the obvious: make the network bigger. She doubled the embedding size from sixty-four to one hundred and twenty-eight, increased the number of attention heads from eight to sixteen, and added more blocks — twelve instead of six.

The larger network's predictions improved. It made fewer errors on the held-out test tablets. It predicted rarer words more accurately. It handled longer-range dependencies — references to events mentioned paragraphs earlier — with greater precision.

She doubled again. Twenty-four blocks. Thirty-two heads. Two hundred and fifty-six-dimensional embeddings.

Better still. The improvement was consistent: every time she made the network larger, the predictions improved. There was no sign of the improvement tapering off.

Blortz: Normally, bigger networks overfit. We learned this with Grothvik's patients — too many velociraptors memorized the training data instead of learning the pattern. Why is that not happening here?

Trviksha: Because the training data is enormous. Fifty million tokens. The network would need to be astronomically large to memorize fifty million tokens. At the sizes I am using — even the largest — the network has far fewer parameters than training examples. It cannot memorize. It must generalise.
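Trviksha's claim that even her largest network has far fewer parameters than training tokens can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming a common rule of thumb of roughly 12·d² parameters per transformer block (the story gives no exact layer shapes, and embedding tables are ignored; the head count does not change the total, since the heads split the same width between them):

```python
def approx_params(d_model, n_blocks):
    # Rough rule of thumb: ~4*d^2 parameters for attention and
    # ~8*d^2 for a feed-forward layer with 4x expansion, per block.
    # Embedding tables and layer norms are ignored for simplicity.
    return 12 * d_model**2 * n_blocks

archive_tokens = 50_000_000  # Hjentova's archive
for d, blocks in [(64, 6), (128, 12), (256, 24)]:
    p = approx_params(d, blocks)
    print(f"d={d}, blocks={blocks}: ~{p:,} parameters "
          f"({archive_tokens // p} tokens per parameter)")
```

Even the twenty-four-block network comes out under twenty million parameters against fifty million tokens, which is why memorisation is not an option.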

The Data Experiment

She then fixed the network size and varied the amount of training data. Same architecture — twelve blocks — but trained on different fractions of the archive.

Training data      Test error
10% of archive     4.2
25% of archive     3.6
50% of archive     3.1
100% of archive    2.7

More data, lower error. Again, no sign of tapering off. If she had twice as much text, the error would likely drop further.
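The pattern in the table can be checked by fitting a straight line to error against the logarithm of the data. A small sketch, using only the four points above (the fit itself is an illustration, not part of Trviksha's experiment):

```python
import math

# Test errors from Trviksha's table, by fraction of the archive used.
fractions = [0.10, 0.25, 0.50, 1.00]
errors = [4.2, 3.6, 3.1, 2.7]

# Least-squares fit of: error ≈ a + b * log2(fraction).
xs = [math.log2(f) for f in fractions]
mx = sum(xs) / len(xs)
my = sum(errors) / len(errors)
b = sum((x - mx) * (y - my) for x, y in zip(xs, errors)) \
    / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(f"each doubling of the data cuts the error by about {-b:.2f}")
print(f"extrapolated error with twice the full archive: {a + b:.1f}")
```

The fit suggests a drop of just under half an error point per doubling, and extrapolating one more doubling predicts a further drop, matching Trviksha's expectation.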

Trviksha: Two things improve the model: more parameters and more data. And the improvements are predictable — roughly proportional to the logarithm of the increase. Double the data, and the error drops by a consistent amount. Double the network size, and the error drops by a consistent amount.

Glagalbagal: A law of diminishing returns?

Trviksha: Not diminishing, exactly. Diminishing would mean the returns get smaller. These returns are consistent on a logarithmic scale — each doubling buys the same improvement. But since each doubling also costs twice as much, the cost of each fixed improvement doubles.
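Trviksha's distinction can be made concrete with a toy tally. A sketch, assuming a drop of roughly half an error point per doubling (in line with the table) and measuring cost in arbitrary units:

```python
# Each doubling buys about the same error reduction,
# but the resources spent grow exponentially.
error = 4.2       # starting error at 10% of the archive (from the table)
total_cost = 1.0  # cost of that first run, in arbitrary units
drop = 0.5        # assumed error reduction per doubling

for doubling in range(1, 4):
    total_cost *= 2
    error -= drop
    print(f"doubling {doubling}: total cost x{total_cost:.0f}, "
          f"error {error:.1f}")
```

The error falls by the same amount each step, while the bill doubles each step: consistent returns, exponentially rising price.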

A stone graph with two axes. The horizontal axis shows "resources" (both velociraptors and training tablets) increasing from left to right. The vertical axis shows "prediction error" decreasing from top to bottom. A smooth curve descends from upper left to lower right, showing consistent improvement that never flattens. Trviksha marks points on the curve where the network doubled in size, each point dropping by roughly the same amount. Blortz calculates costs on an abacus nearby.

Pre-training

Hjentova's archive was large but not unlimited. Trviksha realised she could train on far more text if she included sources beyond the archive — trade records from the port authority, religious texts from the temples, agricultural reports from every province, even popular stories and songs.

Trviksha: The prediction task does not require specialised data. Any text in Sonhlagoti teaches the network about the language — its grammar, its vocabulary, its patterns of reasoning. I can train on everything, then specialise later.

She collected text from every available source in Sonhlagot. The total: roughly five hundred million tokens — ten times Hjentova's archive. She trained the largest network she could afford on this combined corpus.

The result was a general-purpose language model. It knew legal language from the laws. Agricultural language from the farming reports. Historical language from the chronicles. Religious language from the temple texts. It did not specialise in any domain — it knew all of them to varying degrees.

Hjentova: But I need it to work on my archive specifically. The general model knows legal language, but does it know the specific conventions of my historical tablets?

Trviksha: I will train it on your archive after the general training. The general training — I am calling it pre-training — gives the network a broad foundation. The specific training on your archive — fine-tuning — sharpens it for your domain. The pre-trained model already knows what language is. Fine-tuning teaches it what your language is.

She fine-tuned the pre-trained model on Hjentova's two hundred thousand tablets. The fine-tuned model outperformed every previous version — including the model trained only on the archive. The broad pre-training had given the network a foundation of language understanding that the archive alone could not provide.

Blortz: The model learns more from five hundred million tokens of general text than from fifty million tokens of specialised text. Even for the specialised task.

Trviksha: Because language is language. The grammar, the patterns of reasoning, the associations between concepts — these are shared across domains. Pre-training captures the shared structure. Fine-tuning captures the domain-specific details.
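The two-stage recipe can be demonstrated in miniature. A sketch using a toy word-bigram model in place of Trviksha's transformer, with invented stand-in corpora (none of this text is from the story); the point is only that pre-training on general text plus fine-tuning beats training on the specialised text alone:

```python
import math
from collections import Counter, defaultdict

class BigramModel:
    """Toy word-bigram language model with add-one smoothing
    over a fixed, shared vocabulary."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(Counter)

    def train(self, text):
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            self.counts[prev][nxt] += 1

    def avg_loss(self, text):
        # Average negative log-probability per predicted word.
        words = text.split()
        total = 0.0
        for prev, nxt in zip(words, words[1:]):
            seen = sum(self.counts[prev].values())
            p = (self.counts[prev][nxt] + 1) / (seen + self.vocab_size)
            total -= math.log(p)
        return total / (len(words) - 1)

# Invented stand-ins for the general corpus and the archive.
general = "the farmer sold grain . the priest read texts . the judge read laws ."
archive = "the scribe recorded laws . the scribe recorded battles ."
held_out = "the scribe recorded laws . the judge read laws ."

vocab = len(set((general + " " + archive + " " + held_out).split()))

# Pre-train on the general corpus, then fine-tune on the archive.
pretrained_then_tuned = BigramModel(vocab)
pretrained_then_tuned.train(general)
pretrained_then_tuned.train(archive)

# Baseline: the archive alone.
archive_only = BigramModel(vocab)
archive_only.train(archive)

tuned_loss = pretrained_then_tuned.avg_loss(held_out)
baseline_loss = archive_only.avg_loss(held_out)
print(f"pre-trained + fine-tuned: {tuned_loss:.2f}")
print(f"archive only:             {baseline_loss:.2f}")
```

The pre-trained model has already seen patterns like "the judge read" that the archive alone never shows, so its loss on the held-out archive text is lower, mirroring Trviksha's result on the tablets.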