Part 41 of 58
The Long Game
By Madhav Kaushish · Ages 12+
The exploration-exploitation balance had improved the pterodactyl's routes. But Flinqva had a new complaint.
The Canyon Problem
Flinqva: Your pterodactyl keeps taking the Grintjak Canyon route in the late dry season. It is fast — the canyon provides a tailwind that cuts travel time significantly. But the dry season ends abruptly when the monsoon arrives, and a pterodactyl caught in the canyon during the first monsoon storm has a very bad day. Two of my hand-managed pterodactyls avoid the canyon in late dry season because I know the monsoon is coming. Your pterodactyl does not seem to think ahead.
The pterodactyl evaluated each action based on its immediate reward — the reward received right after taking the action. Flying through the canyon today gave a high immediate reward: fast travel, quick delivery. The monsoon would arrive days or weeks later, and its penalty would be associated with whatever action the pterodactyl was taking at that moment, not with the earlier decision to use the canyon.
Trviksha: The pterodactyl optimizes for the immediate move. It does not consider what will happen five moves from now, or fifty. The canyon is rewarding right now, so it takes the canyon.
Blortz: But the consequence of taking the canyon in late dry season is not immediate — it is delayed. The danger comes later, when the monsoon hits while the pterodactyl is still, out of habit, flying canyon routes. The pterodactyl needs to consider not just the current reward, but the total future reward.
Total Future Reward
Trviksha modified the system. Instead of evaluating an action by its immediate reward alone, the pterodactyl should evaluate it by the total reward it expected to receive from this point forward — the immediate reward plus all future rewards.
But future rewards were uncertain. The pterodactyl did not know exactly what would happen ten moves from now. It could estimate — based on past experience — what the likely future reward was from each state.
Trviksha: From each state, the pterodactyl estimates a value — the total reward it expects to collect from that state onward, following its current strategy. A state near the destination has high value (delivery is close). A state in the canyon during late dry season has lower value than its immediate reward suggests — because the expected future includes a possible monsoon penalty.
The value of a state was not just about where the pterodactyl was, but about where it was likely to end up. A state that led to good future states was valuable. A state that led to dangerous future states was not — even if the state itself seemed fine.
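One way to picture this estimate is to average, over past flights, the total reward collected from a state onward. A minimal sketch in Python, with made-up states and reward numbers:

```python
def estimate_value(past_flights, state):
    """Average total reward collected from `state` onward,
    across every past flight that passed through it."""
    returns = []
    for flight in past_flights:            # flight: list of (state, reward) steps
        for i, (s, _) in enumerate(flight):
            if s == state:
                returns.append(sum(r for _, r in flight[i:]))
                break                      # count each flight once
    return sum(returns) / len(returns)

# Hypothetical late-dry-season flights: four are fast (+10 total),
# one in five gets caught by the monsoon (a -60 penalty later on).
flights = [[("canyon", 10)]] * 4 + [[("canyon", 10), ("storm", -60)]]
print(estimate_value(flights, "canyon"))  # (4 * 10 + (10 - 60)) / 5 = -2.0
```

Most flights through the canyon are rewarding, but the occasional large penalty drags the average down — the state is worth less than its immediate reward makes it look.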
Discounting
There was a subtlety. Future rewards were worth less than immediate rewards, for two reasons: they were uncertain (the pterodactyl might not survive to collect them), and they were delayed (a delivery today was more useful to Flinqva than a delivery next week).
Trviksha applied a discount: each time step into the future, the reward was multiplied by a factor slightly less than one. A reward of ten received now was worth ten. The same reward received one step later was worth nine. Two steps later, roughly eight. Ten steps later, roughly three and a half.
Trviksha: Near rewards count nearly full. Distant rewards are discounted. This means the pterodactyl cares about the future, but cares more about the near future than the distant future.
Blortz: If the discount is too steep — if the pterodactyl barely values anything beyond the next few moves — it will still take the canyon for the immediate reward. If the discount is too shallow — if distant future rewards count almost as much as immediate ones — the pterodactyl will be paralysed by concern about distant possibilities.
Trviksha: The discount rate is a choice, like the threshold and the loss function. It reflects how far ahead we want the agent to plan. For Flinqva's pterodactyls, a moderate discount — caring meaningfully about the next twenty or thirty moves — seems right.
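"Twenty or thirty moves" can be tied to a specific discount factor by a common rule of thumb — an assumption here, not part of Trviksha's account: with discount g, rewards beyond roughly 1 / (1 - g) steps barely register, so a factor near 0.96 gives a horizon of about twenty-five moves.

```python
def effective_horizon(discount):
    """Rule-of-thumb planning horizon for a given discount factor:
    rewards beyond roughly 1 / (1 - discount) steps contribute little."""
    return 1 / (1 - discount)

for g in (0.5, 0.9, 0.96, 0.99):
    print(f"discount {g}: cares about roughly {effective_horizon(g):.0f} moves ahead")
```

A steep discount (0.5) barely sees two moves ahead; a shallow one (0.99) weighs events a hundred moves out — the two failure modes Blortz describes.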

The Result
With value estimation and discounting, the pterodactyl's behaviour changed. In early dry season, it still took the canyon — the expected future was safe, so the high immediate reward dominated. In late dry season, the value estimate of canyon states dropped — the estimated future now included monsoon risk — and the pterodactyl voluntarily switched to the mountain route.
Flinqva: It avoids the canyon before the monsoon, without being told about the monsoon explicitly?
Trviksha: It learned from experience. Pterodactyls that used the canyon in late dry season had, in past flights, occasionally been caught in the monsoon and received large penalties. The value estimate for "canyon in late dry season" reflects those experiences. The pterodactyl does not understand monsoons — it has learned that being in the canyon at that time tends to lead to low total reward.
Glagalbagal: Pattern, not understanding. The same lesson as always.
Trviksha: The same lesson. But now the pattern extends through time — not just "what is happening now" but "what is likely to happen next."
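The story does not spell out how the estimate changes as experience accumulates. One simple scheme, sketched here with hypothetical numbers, is a running average that nudges the stored value toward the total reward each flight actually delivered:

```python
def update_value(value, total_reward, step_size=0.1):
    """Nudge the stored estimate toward what this flight actually earned."""
    return value + step_size * (total_reward - value)

# Hypothetical: "canyon in late dry season" initially looks like +10,
# then one flight is caught by the monsoon (-50), then two safe flights.
v = 10.0
for outcome in (-50, 10, 10):
    v = update_value(v, outcome)
print(v)  # the single bad flight keeps the estimate well below +10
```

A small step size means one bad flight does not erase the estimate, but repeated penalties steadily pull it down — which is how "canyon in late dry season" came to look worse than its immediate reward.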