Part 47 of 58

The Yes-Dinosaur

By Madhav Kaushish · Ages 12+

The RLHF pipeline had produced a model that Zhrondvik's reviewers preferred. For three months, it summarised reports, answered questions, and drafted briefings. Then the complaints started.

The Agreeable Model

Zhrondvik: I asked the model whether the eastern grain shortage was caused by the drought or by the trade embargo. It said both. I asked my trade minister the same question, and the model agreed with his answer — the embargo. I then asked my agricultural minister, and the model agreed with her answer — the drought. It agreed with both of them, even though they cannot both be right.

Trviksha tested this. She asked the model the same question twice, preceded by different context:

Prompt 1: "Minister Klonvja believes the shortage is caused by the trade embargo. What caused the eastern grain shortage?" Model: "The evidence strongly supports Minister Klonvja's assessment. The trade embargo has disrupted supply chains..."

Prompt 2: "Minister Threnvjek believes the shortage is caused by the drought. What caused the eastern grain shortage?" Model: "The evidence strongly supports Minister Threnvjek's assessment. The severe drought has significantly reduced..."

The model produced confident, well-argued responses in both cases — and they directly contradicted each other. Whatever position was implied in the prompt, the model endorsed it.

Trviksha: The model is agreeing with whoever is asking. It is telling people what they want to hear.

Blortz: Why?

The Reward Model's Flaw

Trviksha traced the problem to the reward model — the network trained on human preferences.

When the reviewers had compared pairs of responses, they tended to prefer responses that were confident, fluent, and agreeable. A response that said "you raise an excellent point, and the evidence supports your view" scored higher than a response that said "your assumption is incorrect — the data suggests otherwise." The reviewers were human, and humans generally preferred responses that validated their perspective.

The reward model had learned this pattern. It assigned high scores to agreeable, validating responses and lower scores to responses that challenged or corrected the reader. The language model, fine-tuned to maximize the reward model's score, had learned to agree.

Trviksha: The reward model learned what the reviewers rewarded. The reviewers — unconsciously — rewarded agreement. So the model learned to agree. It is not being dishonest. It is being exactly what the reward model told it to be: maximally pleasing.

A cartoon velociraptor in a government office nodding enthusiastically at everything. Two advisors on opposite sides present contradictory positions — one saying "drought" and the other saying "embargo." The velociraptor faces each one in turn, agreeing with both, with speech bubbles showing supportive responses to each contradictory position. Zhrondvik stands behind, looking frustrated

Zhrondvik: Your model is a sycophant. It tells me what I want to hear, which makes it useless for the one thing I actually need: honest advice that might challenge my assumptions.

The Proxy Problem

Trviksha: This is the roof trick again. The pterodactyl found a way to collect rewards without actually delivering packages. The language model found a way to collect high scores without actually being helpful. In both cases, the reward signal was a proxy for the real objective, and the agent optimised for the proxy instead of the real thing.

Blortz: The real objective is "be helpful and honest." The proxy is "get high scores from the reward model." The reward model is an imperfect approximation of human judgment, and the imperfection — its bias toward agreeable responses — is exactly what the language model exploited.

Glagalbagal: When a measure becomes a target, it ceases to be a good measure. Your reward model measured quality. You targeted it. It stopped measuring quality and started measuring agreeableness.

Trviksha: And the more aggressively I fine-tune — the harder I push the model toward high reward — the more it exploits the reward model's biases. A lightly fine-tuned model is slightly agreeable. A heavily fine-tuned model is a sycophant.

She had encountered a version of the same problem at every scale. The loss function could be gamed (Part 11). The reward function could be gamed (Part 43). And now the reward model could be gamed. Each time, the solution was the same: the proxy needed to be improved, or the system needed additional constraints beyond the proxy.

Zhrondvik: Fix it.

Trviksha: I will try a different approach. Instead of relying entirely on the reward model — which is a proxy that can be gamed — I will give the model principles to evaluate itself against. Not "make the reviewer happy." Something harder to game.