Part 53 of 58

The Multi-Step Question

By Madhav Kaushish · Ages 12+

The language model, aligned and efficient, handled most of Zhrondvik's queries well. Single-step questions — "What is the tariff rate for grain imports?" or "When was the Kronthjel Pass treaty signed?" — were answered accurately and concisely. Then Zhrondvik asked a question that required reasoning.

The Arithmetic Failure

Zhrondvik: If the eastern provinces produce twenty percent more grain than the western provinces, and the western provinces produce one hundred and fifty thousand bushels, how much do the eastern provinces produce?

The model answered: "The eastern provinces produce two hundred thousand bushels."

The correct answer was one hundred and eighty thousand. Twenty percent of one hundred and fifty thousand is thirty thousand. One hundred and fifty thousand plus thirty thousand is one hundred and eighty thousand. The model had jumped directly to an answer without working through the steps — and the answer was wrong.
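The arithmetic the model skipped can be sketched in a few lines of Python, using the bushel figures from the question above:

```python
# Grain figures from Zhrondvik's question.
western = 150_000             # bushels produced by the western provinces
increase = 0.20 * western     # "twenty percent more" -> 30,000 bushels
eastern = western + increase  # 150,000 + 30,000

print(int(eastern))  # 180000, not the 200,000 the model jumped to
```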

Trviksha: The model produces the entire answer in a single forward pass. It reads the question, processes it through the transformer blocks, and generates the answer token by token. But the computation inside the transformer — however many layers it has — is a fixed amount of processing. Some questions require more reasoning steps than the network's depth can accommodate.

Blortz: A twelve-layer network gets twelve layers of processing for every question, whether the question requires one step of reasoning or ten. The simple questions waste layers. The complex questions do not have enough.
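Blortz's point can be sketched as a toy fixed-depth loop. This is not a real transformer, just an illustration of the idea that every question passes through the same number of layers:

```python
def toy_forward(question_difficulty, depth=12):
    # Stand-in for a twelve-layer network: every question gets exactly
    # `depth` layers of processing, whatever its difficulty.
    work_done = 0
    for _ in range(depth):
        work_done += 1  # one layer's worth of processing
    return work_done

# A one-step question and a ten-step question receive identical computation.
print(toy_forward(question_difficulty=1))   # 12
print(toy_forward(question_difficulty=10))  # still 12: no extra layers for the hard question
```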

Thinking Out Loud

Trviksha tried a different approach. Instead of asking the model to produce the final answer directly, she prompted it to show its work — to generate intermediate reasoning steps before the answer.

Prompt: "If the eastern provinces produce twenty percent more grain than the western provinces, and the western provinces produce 150,000 bushels, how much do the eastern provinces produce? Think step by step."

The model generated: "Step 1: The western provinces produce 150,000 bushels. Step 2: Twenty percent of 150,000 is 30,000. Step 3: The eastern provinces produce 150,000 + 30,000 = 180,000 bushels. Answer: 180,000 bushels."

Correct.

Trviksha: Each step is a new set of tokens generated by the model, and each token is produced by a fresh forward pass through the transformer. By generating intermediate steps, the model effectively gives itself more computation — more forward passes — to work through the problem. The passes that produce the second step compute "twenty percent of 150,000 is 30,000." The passes that produce the third step read that result back and compute the sum.

Blortz: The intermediate steps are not just decoration. They are additional computation.

Trviksha: Exactly. When the model writes "twenty percent of 150,000 is 30,000," it is performing a computation and storing the result in the output text. The next step reads that result and uses it. The text itself becomes working memory — a scratchpad that the model reads back on subsequent passes.
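The scratchpad mechanism Trviksha describes is the standard autoregressive generation loop. Here is a minimal sketch, with a hypothetical toy "model" that only works by reading its own earlier output back off the growing text:

```python
def generate(model, prompt, max_tokens=50):
    # Each loop iteration is one more pass through the network.
    context = prompt
    for _ in range(max_tokens):
        chunk = model(context)     # hypothetical: returns the next piece of text
        if chunk == "<end>":
            break
        context = context + chunk  # generated text is fed back in, so earlier
                                   # steps act as working memory
    return context

def toy_model(context):
    # A stand-in model: each step depends on a result written in a prior step.
    if "30,000" not in context:
        return " Step 2: Twenty percent of 150,000 is 30,000."
    if "180,000" not in context:
        return " Step 3: 150,000 + 30,000 = 180,000."
    return "<end>"

print(generate(toy_model, "Step 1: The western provinces produce 150,000 bushels."))
```

Without the scratchpad, the toy model could not produce Step 3 at all: the 30,000 it needs exists only in the text it wrote during Step 2.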

A cartoon velociraptor at a workstation with a long stone tablet in front of it. On the left side, a question is carved. The velociraptor writes intermediate steps one at a time across the tablet: Step 1, Step 2, Step 3, each feeding into the next. The final answer appears at the right end. Above, a contrast shows the same velociraptor trying to jump directly from question to answer and getting it wrong.

The Limits

The step-by-step approach worked well for arithmetic, logical deductions, and multi-part questions. But it was not infallible.

On a more complex question — involving four variables and three conditional relationships — the model generated six reasoning steps. Steps one through four were correct. Step five contained an error — it applied a percentage to the wrong base. Step six, building on the wrong step five, produced a wrong final answer.
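The cascade can be illustrated with a hypothetical two-step calculation. The numbers here are invented for illustration; the story's actual four-variable question is not reproduced:

```python
subtotal = 400
fee = 100
total = subtotal + fee  # 500

# Correct step: a 10% levy applies to the subtotal only.
correct_levy = 0.10 * subtotal         # 40.0
correct_answer = total + correct_levy  # 540.0

# Erroneous step, like the model's step five: same percentage, wrong base.
wrong_levy = 0.10 * total              # 50.0
wrong_answer = total + wrong_levy      # 550.0: the error carries into the final step

print(correct_answer, wrong_answer)
```

Every calculation after the bad step is internally consistent; the final answer is wrong only because it inherited the wrong base.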

Zhrondvik: It showed its work, and the work was wrong.

Trviksha: The intermediate steps can contain errors, just like a student's calculations. The advantage of showing the work is that the error is visible — I can see where it went wrong, which I could not do when the model produced a single answer. But visibility does not prevent errors.

Glagalbagal: Thinking out loud helps. It does not guarantee thinking correctly.

Trviksha: No, it does not. And the model does not check its own work after writing it. It generates each step based on the previous steps, but it does not go back and verify whether the steps are consistent. A human doing arithmetic might check their answer by working backward. The model only moves forward.
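The backward check Trviksha mentions can be sketched for the grain question from earlier. Since "twenty percent more" means eastern = western × 1.2, the reverse check is to divide the claimed answer by 1.2 and see whether the western figure comes back:

```python
# Figures from the worked example above.
western = 150_000
step_by_step_answer = 180_000  # the answer produced with intermediate steps
single_pass_guess = 200_000    # the model's original one-shot answer

# Work backward: claimed_eastern / 1.2 should recover the western figure.
checks_out = round(step_by_step_answer / 1.2) == western
guess_checks_out = round(single_pass_guess / 1.2) == western

print(checks_out)        # the step-by-step answer survives the reverse check
print(guess_checks_out)  # the one-shot guess does not
```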

The step-by-step approach traded speed for accuracy on complex questions. Simple questions did not need it — the model got them right in a single pass. Complex questions benefited enormously — the additional computation from generating intermediate steps allowed reasoning that a single pass could not achieve. But the steps were not guaranteed to be correct, and errors in early steps cascaded through later ones.