Part 34 of 58
The Next Word
By Madhav Kaushish · Ages 12+
Trviksha had tokenised Hjentova's archive. Eight thousand sub-word tokens, each with a learned embedding. Two hundred thousand tablets, converted into sequences of tokens. The transformer could process these sequences. But what should it learn?
The Task
With Grothvik's patients, the task was clear: predict whether a patient would get sick. With Kvrothja's fields, the task was clear: classify plots as healthy or blighted. With Phlontjek's contracts, the task was clear: answer specific questions.
Hjentova's request was different. She did not want classification or prediction of a single output. She wanted the system to "understand language." But what did that mean in terms of inputs and outputs?
Trviksha: I cannot train the network to "understand." I need a specific, measurable task. Something with a right answer that I can compute error on.
Glagalbagal: What is the simplest task that requires understanding language?
Trviksha thought about it for a long time. Then she had it.
Trviksha: Predict the next word. Given a sequence of tokens, predict which token comes next.
The Game
The training data was the archive itself. Every sequence of tokens in every tablet was a training example. Given "The harvest in the western" — predict the next token. The answer was in the text: "provinces." Given "The penalty for late delivery shall" — predict the next token: "be."
No labels were needed. No human annotator had to mark each example as "sick" or "healthy." The text itself provided the answers. Every token in the archive was simultaneously a training input (as context) and a training target (as the token to predict).
Blortz: Two hundred thousand tablets. Roughly fifty million tokens. Each token is a prediction target, using everything before it as context. That is fifty million training examples — from a single archive — with no labelling effort at all.
Trviksha: Free supervision. The data labels itself.
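The "free supervision" trick can be sketched in a few lines: every token after the first becomes a prediction target, with everything before it as context. The tokens below are illustrative words rather than the archive's real sub-word tokens.

```python
def make_training_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # everything seen so far
        target = tokens[i]     # the token to predict
        pairs.append((context, target))
    return pairs

tablet = ["The", "harvest", "in", "the", "western", "provinces"]
pairs = make_training_pairs(tablet)
# One pair per token after the first:
#   (["The"], "harvest"), (["The", "harvest"], "in"), and so on.
```

A tablet of n tokens yields n − 1 training examples, which is why fifty million tokens become roughly fifty million examples with no labelling at all.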
She built a transformer — six blocks, eight attention heads per block, embeddings of sixty-four dimensions — and trained it on the prediction task. For each position in each tablet, the network saw all preceding tokens and predicted the next one. The error measured how much confidence the network had placed on the wrong tokens rather than on the actual next one. The error flowed backward through the transformer blocks, adjusting the attention weights, the feedforward weights, and the embeddings.
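The shape of that training loop can be sketched with a drastic simplification: in place of the six-block transformer, a single table of scores conditioned on just the previous token. The loop structure is the same — predict the next token, measure the error, flow it backward to adjust the weights. The corpus and learning rate here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the harvest in the western provinces the harvest in the east".split()
vocab = sorted(set(corpus))
tok = {w: i for i, w in enumerate(vocab)}
ids = [tok[w] for w in corpus]
V = len(vocab)

# One row of scores per previous token -- a stand-in for the transformer.
W = rng.normal(scale=0.1, size=(V, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for epoch in range(200):
    for prev, nxt in zip(ids, ids[1:]):
        p = softmax(W[prev])   # predicted distribution over the next token
        grad = p.copy()
        grad[nxt] -= 1.0       # error: confidence placed on the wrong tokens
        W[prev] -= lr * grad   # the error flows backward into the weights

# After training, "the" is most often followed by "harvest" in this corpus.
pred = vocab[int(np.argmax(softmax(W[tok["the"]])))]
```

The real model conditions on the whole preceding context through attention, not just the previous token — but the predict-measure-adjust cycle is exactly this.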

What It Learned
After training on the full archive, Trviksha tested the model by giving it the beginning of sentences and examining its predictions.
Given "The grain shipment arrived at the port of" — the model predicted "Xvelsk" with high confidence. Xvelsk was the most commonly mentioned port in trade records. Given a context that pointed to a different port, the model adjusted and predicted that port's name instead.
Given "In the third year of the drought, the" — the model predicted "harvest" or "famine" with roughly equal confidence, both plausible continuations. It had learned that droughts led to both concepts.
Given "The law states that any citizen who" — the model predicted legal verbs: "violates," "fails," "refuses." It had absorbed the patterns of legal language from the legal tablets.
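What "learning the patterns" means can be shown at its crudest: count which tokens follow each two-token context, then report the most frequent continuations. The sentences below are invented for illustration; only the port name Xvelsk comes from the story, and a real model generalises far beyond raw counts.

```python
from collections import Counter, defaultdict

corpus = [
    "the grain shipment arrived at the port of Xvelsk",
    "the fleet departed the port of Xvelsk at dawn",
    "the wine was unloaded at the port of Tvernsk",
    "the law states that any citizen who violates the decree",
    "the law states that any citizen who refuses to pay",
]

# For each two-token context, count what comes next.
follows = defaultdict(Counter)
for sentence in corpus:
    toks = sentence.split()
    for a, b, c in zip(toks, toks[1:], toks[2:]):
        follows[(a, b)][c] += 1

def predict(context, k=2):
    """Return the k most frequent continuations of a two-token context."""
    return [w for w, _ in follows[tuple(context)].most_common(k)]

predict(["port", "of"])       # port names follow "port of"
predict(["citizen", "who"])   # legal verbs follow legal contexts
```

Counts like these capture only exact repetitions; the transformer's embeddings and attention let it apply the same patterns to contexts it has never seen verbatim.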
Hjentova: It has not memorized the tablets?
Trviksha: Not exactly. For very common phrases — standard legal formulas, ritual openings — it has effectively memorized them. For less common passages, it has learned the patterns: what kinds of words follow what kinds of contexts. It knows that port names follow "port of" and that legal verbs follow "any citizen who." It knows the grammar, the style, and the common associations.
Hjentova: Does it know what the words mean?
Trviksha: It knows how they are used. Whether that constitutes "meaning" is a question I am not qualified to answer.
Glagalbagal: In my experience, knowing how a word is used is most of what meaning is.
Blortz: In my experience, velociraptor bookkeepers who predict the next word in a sentence are not thereby understanding the sentence.
The question hung in the air, unresolved. The network predicted language with impressive accuracy. It had absorbed grammar, common knowledge, stylistic patterns, and factual associations — all from the single task of predicting what came next. Whether this constituted understanding or merely sophisticated pattern matching was not a question the pebbles could answer.