Part 44 of 58

The Unhelpful Report

By Madhav Kaushish · Ages 12+

The roof trick had demonstrated that agents optimize for the reward signal, not the intended behaviour. Trviksha had filed this as a problem for later. "Later" arrived sooner than expected — and from a different direction.

The Advisor

Zhrondvik was the chief advisor to the Chieftain of Sonhlagot. He had been watching GlagalCloud's language model with growing interest, and he wanted it deployed for government briefings.

Zhrondvik: I receive a hundred reports a day from provincial governors, trade inspectors, military outposts, and agricultural surveyors. I do not have time to read them all. I need your language model to summarise each report in three sentences and flag the ones that require my attention.

Trviksha deployed the pre-trained, fine-tuned model on Zhrondvik's reports. The summaries were technically accurate. They were also terrible.

Zhrondvik: Here is a report about a grain shortage in the eastern provinces. Your model's summary: "This report discusses the agricultural output of the eastern provinces during the current fiscal period. Several observations are noted regarding grain production levels. Further analysis may be warranted." That tells me nothing. Is there a shortage? How severe? Do I need to act?

Trviksha: The summary is not factually wrong.

Zhrondvik: It is not factually anything. It is evasive, vague, and useless. I need: "Grain production in the eastern provinces is down 30% from last year. Three districts report imminent shortages. Recommend emergency redistribution from the southern surplus." That is a useful summary.

The Definition Problem

Trviksha tried to fix the model by adding rules. She wrote instructions in the prompt: "Be specific. Include numbers. Recommend actions. Do not hedge."

The model followed some of these rules some of the time. But the rules were fragile — the model interpreted them literally in unexpected ways, or followed one rule while violating another. "Be specific" sometimes produced irrelevant specifics. "Recommend actions" sometimes produced recommendations the model had no basis for.
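Trviksha's problem can be made concrete with a toy sketch. Suppose we tried to check her prompt rules mechanically, with crude proxies (the rule names and proxies below are illustrative, not anything the story specifies):

```python
def follows_rules(summary: str) -> dict:
    """Toy rule-checker: crude, literal proxies for Trviksha's prompt rules."""
    return {
        "be_specific": any(ch.isdigit() for ch in summary),   # "include numbers"
        "do_not_hedge": "may be" not in summary.lower(),      # ban one hedge phrase
        "recommend_action": "recommend" in summary.lower(),   # mention an action
    }

# A summary can satisfy the letter of every rule while staying useless:
gamed = "Recommend noting that 7 observations were made in 3 categories."
print(follows_rules(gamed))
# every check passes, yet the summary says nothing a chieftain can act on
```

The specifics are irrelevant, the "recommendation" is empty, and no hedge word appears, so every literal rule is satisfied. This is the same gap Zhrondvik complained about: rules measure surface features, while "helpful" is a judgment about substance.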

Trviksha: I cannot write rules that capture what Zhrondvik means by "helpful." "Helpful" is not a single property — it is a combination of being specific, relevant, concise, actionable, honest about uncertainty, and calibrated to the reader's needs. No list of rules fully captures it.

Blortz: You defined "wrong" with a loss function. You defined "good behaviour" with a reward function. Can you define "helpful" with a formula?

Trviksha: I cannot. "Helpful" is a human judgment. Different people find different things helpful. A military commander and a provincial governor and a trade inspector have different needs. The definition lives in human heads, not in pebble arrangements.

[Illustration: Two stone tablets side by side. The left tablet shows the model's summary: vague, hedging text with phrases like "may be warranted" and "several observations are noted." The right tablet shows Zhrondvik's ideal summary: crisp text with specific numbers, clear conclusions, and an action recommendation. Zhrondvik gestures dismissively at the left tablet and taps the right one approvingly.]

The Shift

Glagalbagal: When Grothvik could not define "sick" with a formula, you let the data define it — patients who got sick versus patients who did not. When Kvrothja could not define "blighted" with a formula, you let the field surveys define it. Can you let humans define "helpful" in the same way?

Trviksha: The difference is that "sick" and "blighted" are observable facts. A patient either got sick or did not. A plot is either blighted or not. "Helpful" is a judgment — reasonable people might disagree.

Glagalbagal: Then collect the judgments. Not a formula. Not a rule. Human opinions about what is better and what is worse.

This was the key insight. Instead of trying to define "helpful" with rules, Trviksha could collect examples of human preference — instances where a human said "this output is better than that output" — and let the model learn from those examples.
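What such a collected preference looks like can be sketched in a few lines. This is a minimal illustration (the record fields are hypothetical, and the summaries are paraphrased from Zhrondvik's grain example), not a description of any particular system:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human judgment: for the same report, which summary is better?"""
    report: str      # the original report shown to the model
    summary_a: str   # one candidate summary
    summary_b: str   # another candidate summary
    preferred: str   # "a" or "b" -- a human opinion, not a formula

# Zhrondvik's grain-shortage example as a single preference record.
pair = PreferencePair(
    report="Report on grain production in the eastern provinces...",
    summary_a=("This report discusses the agricultural output of the "
               "eastern provinces. Further analysis may be warranted."),
    summary_b=("Grain production in the eastern provinces is down 30% "
               "from last year. Three districts report imminent shortages. "
               "Recommend emergency redistribution from the southern surplus."),
    preferred="b",   # the specific, actionable summary wins
)

# A dataset of many such pairs defines "helpful" by example:
dataset = [pair]
assert all(p.preferred in ("a", "b") for p in dataset)
```

Notice that no line of this code defines "helpful." The definition lives entirely in the `preferred` field, which records a human choice.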

The question was how to collect the preferences efficiently and how to turn them into a training signal.