Part 55 of 58
The Calculator
By Madhav Kaushish · Ages 12+
Chain-of-thought reasoning and step verification had improved the model's performance on complex questions. But certain types of errors persisted stubbornly.
The Arithmetic Problem
The model consistently made errors on large-number arithmetic. It could identify that a problem required multiplication, set up the calculation correctly in its reasoning chain, and then produce the wrong numerical result.
Trviksha: The model writes: "The total cost is 47,832 times 156. Let me calculate: 47,832 times 156 equals 7,425,592." The correct answer is 7,461,792. The model set up the right operation and got the wrong number.
Blortz: It is a language model, not an arithmetic engine. It learned to approximate arithmetic from patterns in text — seeing examples like "12 times 8 equals 96" — but it did not learn the actual algorithm for multiplication. For small numbers, the patterns are reliable. For large numbers, the approximation breaks down.
Trviksha: I can add more arithmetic examples to the training data. But the model will still be approximating. It will never be reliable on seven-digit multiplication because it is not actually multiplying — it is predicting what the result of multiplication looks like in text.
GlagalCloud had a dedicated arithmetic engine — a system of pebble arrangements designed specifically for precise computation. It was not a neural network. It was a rule-based system that executed the multiplication algorithm exactly, step by step. It never made arithmetic errors because it was not predicting — it was computing.
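The pebble engine itself is fictional, but the distinction it illustrates is real: a rule-based system executes the multiplication algorithm exactly instead of predicting what an answer looks like. A minimal sketch of the grade-school algorithm (digit-by-digit partial products):

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers the way a rule-based
    arithmetic engine would: by executing the grade-school algorithm
    step by step, with no approximation anywhere."""
    result = 0
    for position, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        # One partial product: a times a single digit of b,
        # shifted into that digit's place value.
        result += a * digit * (10 ** position)
    return result

print(long_multiply(47832, 156))  # 7461792 -- exact, every time
```

Because every step is a fixed rule, the result is exact for seven-digit operands just as it is for one-digit ones, which is precisely where the language model's pattern-matching breaks down.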
Trviksha: The language model is good at understanding questions, reasoning about relationships, and generating explanations. It is bad at precise arithmetic. The arithmetic engine is good at precise arithmetic and bad at everything else. What if the language model could call the arithmetic engine when it needs a calculation done?
The Tool Call
She modified the model's generation process. When the model encountered a computation it could not reliably perform — large multiplication, division, exponentiation — it generated a special token sequence: a request for the arithmetic engine.
Instead of writing "47,832 times 156 equals 7,425,592," the model would write: "47,832 times 156 equals [CALCULATE: 47832 * 156]." The system intercepted this token sequence, sent the calculation to the arithmetic engine, received the exact result (7,461,792), inserted it back into the text, and the model continued generating from that point.
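The interception step can be sketched in a few lines. This is a minimal illustration, not the story's actual system: the `[CALCULATE: ...]` marker comes from the text above, while the regex and the stand-in engine are assumptions for the sketch.

```python
import re

def multiply_engine(expr: str) -> str:
    # Stand-in for the exact arithmetic engine; handles "a * b" only.
    left, right = expr.split("*")
    return str(int(left) * int(right))

# Matches the special marker the model emits, capturing the expression inside.
TOOL_PATTERN = re.compile(r"\[CALCULATE:\s*([^\]]+)\]")

def intercept(model_output: str) -> str:
    """Replace every [CALCULATE: ...] request with the engine's exact result,
    so the model can continue generating from the corrected text."""
    return TOOL_PATTERN.sub(lambda m: multiply_engine(m.group(1)), model_output)

print(intercept("47,832 times 156 equals [CALCULATE: 47832 * 156]."))
# 47,832 times 156 equals 7461792.
```

The key design point is that the model never sees the wrong number: the marker is swapped for the exact result before generation resumes.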
Trviksha: The model decides when to call the tool and what to ask it. The tool performs the computation and returns the result. The model incorporates the result and continues its reasoning. The model is the thinker. The tool is the calculator.
Drysska: The model knows what it does not know?
Trviksha: Not exactly. The model has learned — through training on examples — that certain types of calculations benefit from the tool call. It has a rough sense of which computations are easy (small numbers, common multiplications) and which are error-prone (large numbers, multi-digit operations). But its calibration is imperfect — it sometimes attempts calculations it should delegate, and occasionally delegates calculations it could handle.
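Trviksha's point about imperfect calibration can be caricatured as a simple rule. The real model's sense of difficulty is learned and fuzzy; the digit threshold here is purely an assumption for illustration:

```python
def should_delegate(a: int, b: int, digit_threshold: int = 3) -> bool:
    """Crude stand-in for the model's learned calibration: delegate any
    multiplication whose operands exceed a few digits. A hard threshold
    like this is exactly the kind of rule the model does NOT have --
    its judgment is statistical, so it sometimes misclassifies."""
    return (len(str(abs(a))) > digit_threshold
            or len(str(abs(b))) > digit_threshold)

should_delegate(12, 8)       # False: small numbers, attempt in-model
should_delegate(47832, 156)  # True: error-prone, send to the engine
```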

The Agent
The arithmetic engine was just the first tool. Trviksha added more:
Archive search: When the model needed to verify a fact — "What was the grain output of Klomvaj province in Year 14?" — instead of generating an answer from memory (risking hallucination), it could issue a search query to Hjentova's archive index and receive the actual recorded value.
Weather lookup: When the model needed current conditions — "Is the Grintjak Pass currently clear?" — it could query Vrothjelka's weather system for real-time data.
Legal reference: When the model needed to cite a specific statute, it could query the legal code index rather than reconstructing the statute from memory.
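With several tools available, the system needs a way to route each request to the right one. A minimal dispatch sketch, with placeholder tools standing in for Hjentova's archive index, Vrothjelka's weather system, and the legal code index (all return values below are illustrative stubs, not real data):

```python
def archive_search(query: str) -> str:
    return f"archive record for: {query}"        # placeholder lookup

def weather_lookup(location: str) -> str:
    return f"current conditions at {location}"   # placeholder reading

def legal_reference(statute: str) -> str:
    return f"statute text for: {statute}"        # placeholder citation

# Registry mapping tool names (as the model would emit them) to handlers.
TOOLS = {
    "ARCHIVE": archive_search,
    "WEATHER": weather_lookup,
    "LEGAL": legal_reference,
}

def dispatch(tool_name: str, argument: str) -> str:
    """Route a model-issued tool request to the matching handler."""
    if tool_name not in TOOLS:
        return f"error: unknown tool {tool_name}"
    return TOOLS[tool_name](argument)
```

Registering tools in a table like this keeps the coordinator generic: adding a new capability means adding one entry, not changing the loop.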
With multiple tools, the model became something more than a text generator: a coordinator that understood the question, decided which tools to invoke, interpreted the results, and synthesised a final answer.
Trviksha: The model plans. It decides what information it needs. It calls the appropriate tool. It reads the result. It incorporates the result into its reasoning. It may call another tool if the first result raises new questions. It is not just generating text — it is acting in a loop of planning, acting, observing, and reasoning.
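The plan-act-observe loop Trviksha describes can be sketched abstractly. Everything here is a hypothetical scaffold: `model_step` stands in for the language model's decision at each turn, and `dispatch` for whatever tool router the system uses.

```python
def agent_loop(question, model_step, dispatch, max_turns=5):
    """Plan-act-observe loop. Each turn, the model either issues a
    tool call (which is executed and fed back as an observation) or
    commits to a final answer."""
    observations = []
    for _ in range(max_turns):
        action = model_step(question, observations)      # plan
        if action["type"] == "answer":
            return action["text"]                        # done reasoning
        result = dispatch(action["tool"], action["argument"])  # act
        observations.append(result)                      # observe
    return "no answer within turn limit"
```

The turn limit matters: because each tool result can prompt another tool call, an uncapped loop could run forever on a question the tools cannot settle.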
Glagalbagal: An agent. Not a document. An agent that takes actions in the world — even if the "world" is just a set of databases and calculators.
Blortz: A velociraptor that delegates. It knows what it is good at — reasoning, language, planning — and uses tools for what it is bad at — arithmetic, fact retrieval, current data. Each component does what it does best.
Zhrondvik: This is what I wanted from the beginning. Not a model that guesses at numbers and invents facts, but a system that looks up facts when it needs them and calculates precisely when it needs to. The language model provides the intelligence. The tools provide the reliability.
Trviksha: The model is the brain. The tools are the hands, the eyes, the reference books. Together, they are more capable than either alone.