Part 34 of 58
The Next Word
By Madhav Kaushish · Ages 12+
Trviksha had tokenised Hjentova's archive. Eight thousand sub-word tokens, each with a learned embedding. Two hundred thousand tablets, converted into sequences of tokens. The transformer could process these sequences. But what should it learn?
The Task
With Grothvik's patients, the task was clear: predict whether a patient would get sick. With Kvrothja's fields, the task was clear: classify plots as healthy or blighted. With Phlontjek's contracts, the task was clear: answer specific questions.
Hjentova's request was different. She did not want classification or prediction of a single output. She wanted the system to "understand language." But what did that mean in terms of inputs and outputs?
Trviksha: I cannot train the network to "understand." I need a specific, measurable task. Something with a right answer that I can compute error on.
Glagalbagal: What is the simplest task that requires understanding language?
Trviksha thought about it for a long time. Then she had it.
Trviksha: Predict the next word. Given a sequence of tokens, predict which token comes next.
The Game
The training data was the archive itself. Every sequence of tokens in every tablet was a training example. Given "The harvest in the western" — predict the next token. The answer was in the text: "provinces." Given "The penalty for late delivery shall" — predict the next token: "be."
No labels were needed. No human annotator had to mark each example as "sick" or "healthy." The text itself provided the answers. Every token in the archive was simultaneously a training input (as context) and a training target (as the token to predict).
Blortz: Two hundred thousand tablets. Roughly fifty million tokens. Each token is a prediction target, using everything before it as context. That is fifty million training examples — from a single archive — with no labelling effort at all.
Trviksha: Free supervision. The data labels itself.
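The "free supervision" trick can be sketched in a few lines: every token after the first becomes a prediction target, with everything before it as context. The tokens below are illustrative words rather than the archive's real sub-word tokens.

```python
def make_training_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # everything seen so far
        target = tokens[i]     # the token to predict
        pairs.append((context, target))
    return pairs

tablet = ["The", "harvest", "in", "the", "western", "provinces"]
pairs = make_training_pairs(tablet)
# One pair per token after the first:
#   (["The"], "harvest"), (["The", "harvest"], "in"), and so on.
```

A tablet of n tokens yields n − 1 training examples, which is why fifty million tokens become roughly fifty million examples with no labelling at all.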
She built a transformer — six blocks, eight attention heads per block, embeddings of sixty-four dimensions — and trained it on the prediction task. For each position in each tablet, the network saw all preceding tokens and predicted the next one. The error measured how much confidence the network had placed on the wrong tokens rather than on the actual next one. The error flowed backward through the transformer blocks, adjusting the attention weights, the feedforward weights, and the embeddings.
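The shape of that training loop can be sketched with a drastic simplification: in place of the six-block transformer, a single table of scores conditioned on just the previous token. The loop structure is the same — predict the next token, measure the error, flow it backward to adjust the weights. The corpus and learning rate here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the harvest in the western provinces the harvest in the east".split()
vocab = sorted(set(corpus))
tok = {w: i for i, w in enumerate(vocab)}
ids = [tok[w] for w in corpus]
V = len(vocab)

# One row of scores per previous token -- a stand-in for the transformer.
W = rng.normal(scale=0.1, size=(V, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for epoch in range(200):
    for prev, nxt in zip(ids, ids[1:]):
        p = softmax(W[prev])   # predicted distribution over the next token
        grad = p.copy()
        grad[nxt] -= 1.0       # error: confidence placed on the wrong tokens
        W[prev] -= lr * grad   # the error flows backward into the weights

# After training, "the" is most often followed by "harvest" in this corpus.
pred = vocab[int(np.argmax(softmax(W[tok["the"]])))]
```

The real model conditions on the whole preceding context through attention, not just the previous token — but the predict-measure-adjust cycle is exactly this.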

What It Learned
After training on the full archive, Trviksha tested the model by giving it the beginning of sentences and examining its predictions.
Given "The grain shipment arrived at the port of" — the model predicted "Xvelsk" with high confidence. Xvelsk was the most commonly mentioned port in trade records. Given a context that pointed to a different port, the model adjusted and predicted that port's name instead.
Given "In the third year of the drought, the" — the model predicted "harvest" or "famine" with roughly equal confidence, both plausible continuations. It had learned that droughts led to both concepts.
Given "The law states that any citizen who" — the model predicted legal verbs: "violates," "fails," "refuses." It had absorbed the patterns of legal language from the legal tablets.
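What "learning the patterns" means can be shown at its crudest: count which tokens follow each two-token context, then report the most frequent continuations. The sentences below are invented for illustration; only the port name Xvelsk comes from the story, and a real model generalises far beyond raw counts.

```python
from collections import Counter, defaultdict

corpus = [
    "the grain shipment arrived at the port of Xvelsk",
    "the fleet departed the port of Xvelsk at dawn",
    "the wine was unloaded at the port of Tvernsk",
    "the law states that any citizen who violates the decree",
    "the law states that any citizen who refuses to pay",
]

# For each two-token context, count what comes next.
follows = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for a, b, c in zip(toks, toks[1:], toks[2:]):
        follows[(a, b)][c] += 1

def predict(context, k=2):
    """Return the k most frequent continuations of a two-token context."""
    return [w for w, _ in follows[tuple(context)].most_common(k)]

predict(["port", "of"])       # port names follow "port of"
predict(["citizen", "who"])   # legal verbs follow legal contexts
```

Counts like these capture only exact repetitions; the transformer's embeddings and attention let it apply the same patterns to contexts it has never seen verbatim.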
Hjentova: It has not memorized the tablets?
Trviksha: Not exactly. For very common phrases — standard legal formulas, ritual openings — it has effectively memorized them. For less common passages, it has learned the patterns: what kinds of words follow what kinds of contexts. It knows that port names follow "port of" and that legal verbs follow "any citizen who." It knows the grammar, the style, and the common associations.
Hjentova: Does it know what the words mean?
Trviksha: It knows how they are used. Whether that constitutes "meaning" is a question I am not qualified to answer.
Glagalbagal: In my experience, knowing how a word is used is most of what meaning is.
Blortz: In my experience, velociraptor bookkeepers who predict the next word in a sentence are not thereby understanding the sentence.
The question hung in the air, unresolved. The network predicted language with impressive accuracy. It had absorbed grammar, common knowledge, stylistic patterns, and factual associations — all from the single task of predicting what came next. Whether this constituted understanding or merely sophisticated pattern matching was not a question the pebbles could answer.