Part 35 of 58

More Is More

By Madhav Kaushish · Ages 12+

The six-block transformer predicted the next word reasonably well on Hjentova's archive. But "reasonably well" was not enough for the archivist's purposes.

The First Scaling Experiment

Trviksha tried the obvious: make the network bigger. She doubled the embedding size from sixty-four to one hundred and twenty-eight, increased the number of attention heads from eight to sixteen, and added more blocks — twelve instead of six.

The larger network's predictions improved. It made fewer errors on the held-out test tablets. It predicted rarer words more accurately. It handled longer-range dependencies — references to events mentioned paragraphs earlier — with greater precision.

She doubled again. Twenty-four blocks. Thirty-two heads. Two hundred and fifty-six-dimensional embeddings.

Better still. The improvement was consistent: every time she made the network larger, the predictions improved. There was no sign of the improvement tapering off.

Blortz: Normally, bigger networks overfit. We learned this with Grothvik's patients — too many velociraptors memorized the training data instead of learning the pattern. Why is that not happening here?

Trviksha: Because the training data is enormous. Fifty million tokens. The network would need to be astronomically large to memorize fifty million tokens. At the sizes I am using — even the largest — the network has far fewer parameters than training examples. It cannot memorize. It must generalise.
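Trviksha's claim that even her largest network has far fewer parameters than training tokens can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming a common rule of thumb of roughly 12·d² parameters per transformer block (the story gives no exact layer shapes, and embedding tables are ignored; the head count does not change the total, since the heads split the same width between them):

```python
def approx_params(d_model, n_blocks):
    # Rough rule of thumb: ~4*d^2 parameters for attention and
    # ~8*d^2 for a feed-forward layer with 4x expansion, per block.
    # Embedding tables and layer norms are ignored for simplicity.
    return 12 * d_model**2 * n_blocks

archive_tokens = 50_000_000  # Hjentova's archive
for d, blocks in [(64, 6), (128, 12), (256, 24)]:
    p = approx_params(d, blocks)
    print(f"d={d}, blocks={blocks}: ~{p:,} parameters "
          f"({archive_tokens // p} tokens per parameter)")
```

Even the twenty-four-block network comes out under twenty million parameters against fifty million tokens, which is why memorisation is not an option.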

The Data Experiment

She then fixed the network size and varied the amount of training data. Same architecture — twelve blocks — but trained on different fractions of the archive.

Training data      Test error
10% of archive     4.2
25% of archive     3.6
50% of archive     3.1
100% of archive    2.7

More data, lower error. Again, no sign of tapering off. If she had twice as much text, the error would likely drop further.
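The pattern in the table can be checked by fitting a straight line to error against the logarithm of the data. A small sketch, using only the four points above (the fit itself is an illustration, not part of Trviksha's experiment):

```python
import math

# Test errors from Trviksha's table, by fraction of the archive used.
fractions = [0.10, 0.25, 0.50, 1.00]
errors = [4.2, 3.6, 3.1, 2.7]

# Least-squares fit of: error ≈ a + b * log2(fraction).
xs = [math.log2(f) for f in fractions]
mx = sum(xs) / len(xs)
my = sum(errors) / len(errors)
b = sum((x - mx) * (y - my) for x, y in zip(xs, errors)) \
    / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(f"each doubling of the data cuts the error by about {-b:.2f}")
print(f"extrapolated error with twice the full archive: {a + b:.1f}")
```

The fit suggests a drop of just under half an error point per doubling, and extrapolating one more doubling predicts a further drop, matching Trviksha's expectation.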

Trviksha: Two things improve the model: more parameters and more data. And the improvements are predictable — roughly proportional to the logarithm of the increase. Double the data, and the error drops by a consistent amount. Double the network size, and the error drops by a consistent amount.

Glagalbagal: A law of diminishing returns?

Trviksha: Not diminishing, exactly. Diminishing would mean the returns get smaller. These returns are consistent on a logarithmic scale — each doubling buys the same improvement. But since each doubling also costs twice as much, the cost of each fixed improvement doubles.
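Trviksha's distinction can be made concrete with a toy tally. A sketch, assuming a drop of roughly half an error point per doubling (in line with the table) and measuring cost in arbitrary units:

```python
# Each doubling buys about the same error reduction,
# but the resources spent grow exponentially.
error = 4.2       # starting error at 10% of the archive (from the table)
total_cost = 1.0  # cost of that first run, in arbitrary units
drop = 0.5        # assumed error reduction per doubling

for doubling in range(1, 4):
    total_cost *= 2
    error -= drop
    print(f"doubling {doubling}: total cost x{total_cost:.0f}, "
          f"error {error:.1f}")
```

The error falls by the same amount each step, while the bill doubles each step: consistent returns, exponentially rising price.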

A stone graph with two axes. The horizontal axis shows "resources" (both velociraptors and training tablets) increasing from left to right. The vertical axis shows "prediction error" decreasing from top to bottom. A smooth curve descends from upper left to lower right, showing consistent improvement that never flattens. Trviksha marks points on the curve where the network doubled in size, each point dropping by roughly the same amount. Blortz calculates costs on an abacus nearby.

Pre-training

Hjentova's archive was large but not unlimited. Trviksha realised she could train on far more text if she included sources beyond the archive — trade records from the port authority, religious texts from the temples, agricultural reports from every province, even popular stories and songs.

Trviksha: The prediction task does not require specialised data. Any text in Sonhlagoti teaches the network about the language — its grammar, its vocabulary, its patterns of reasoning. I can train on everything, then specialise later.

She collected text from every available source in Sonhlagot. The total: roughly five hundred million tokens — ten times Hjentova's archive. She trained the largest network she could afford on this combined corpus.

The result was a general-purpose language model. It knew legal language from the laws. Agricultural language from the farming reports. Historical language from the chronicles. Religious language from the temple texts. It did not specialise in any domain — it knew all of them to varying degrees.

Hjentova: But I need it to work on my archive specifically. The general model knows legal language, but does it know the specific conventions of my historical tablets?

Trviksha: I will train it on your archive after the general training. The general training — I am calling it pre-training — gives the network a broad foundation. The specific training on your archive — fine-tuning — sharpens it for your domain. The pre-trained model already knows what language is. Fine-tuning teaches it what your language is.

She fine-tuned the pre-trained model on Hjentova's two hundred thousand tablets. The fine-tuned model outperformed every previous version — including the model trained only on the archive. The broad pre-training had given the network a foundation of language understanding that the archive alone could not provide.

Blortz: The model learns more from five hundred million tokens of general text than from fifty million tokens of specialised text. Even for the specialised task.

Trviksha: Because language is language. The grammar, the patterns of reasoning, the associations between concepts — these are shared across domains. Pre-training captures the shared structure. Fine-tuning captures the domain-specific details.
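The two-stage recipe can be demonstrated in miniature. A sketch using a toy word-bigram model in place of Trviksha's transformer, with invented stand-in corpora (none of this text is from the story); the point is only that pre-training on general text plus fine-tuning beats training on the specialised text alone:

```python
import math
from collections import Counter, defaultdict

class BigramModel:
    """Toy word-bigram language model with add-one smoothing
    over a fixed, shared vocabulary."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(Counter)

    def train(self, text):
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            self.counts[prev][nxt] += 1

    def avg_loss(self, text):
        # Average negative log-probability per predicted word.
        words = text.split()
        total = 0.0
        for prev, nxt in zip(words, words[1:]):
            seen = sum(self.counts[prev].values())
            p = (self.counts[prev][nxt] + 1) / (seen + self.vocab_size)
            total -= math.log(p)
        return total / (len(words) - 1)

# Invented stand-ins for the general corpus and the archive.
general = "the farmer sold grain . the priest read texts . the judge read laws ."
archive = "the scribe recorded laws . the scribe recorded battles ."
held_out = "the scribe recorded laws . the judge read laws ."

vocab = len(set((general + " " + archive + " " + held_out).split()))

# Pre-train on the general corpus, then fine-tune on the archive.
pretrained_then_tuned = BigramModel(vocab)
pretrained_then_tuned.train(general)
pretrained_then_tuned.train(archive)

# Baseline: the archive alone.
archive_only = BigramModel(vocab)
archive_only.train(archive)

tuned_loss = pretrained_then_tuned.avg_loss(held_out)
baseline_loss = archive_only.avg_loss(held_out)
print(f"pre-trained + fine-tuned: {tuned_loss:.2f}")
print(f"archive only:             {baseline_loss:.2f}")
```

The pre-trained model has already seen patterns like "the judge read" that the archive alone never shows, so its loss on the held-out archive text is lower, mirroring Trviksha's result on the tablets.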