Part 41 of 58
The Long Game
By Madhav Kaushish · Ages 12+
The exploration-exploitation balance had improved the pterodactyl's routes. But Flinqva had a new complaint.
The Canyon Problem
Flinqva: Your pterodactyl keeps taking the Grintjak Canyon route in the late dry season. It is fast — the canyon provides a tailwind that cuts travel time significantly. But the dry season ends abruptly when the monsoon arrives, and a pterodactyl caught in the canyon during the first monsoon storm has a very bad day. Two of my hand-managed pterodactyls avoid the canyon in late dry season because I know the monsoon is coming. Your pterodactyl does not seem to think ahead.
The pterodactyl evaluated each action based on its immediate reward — the reward received right after taking the action. Flying through the canyon today gave a high immediate reward: fast travel, quick delivery. The monsoon would arrive days or weeks later, and its penalty would be associated with whatever action the pterodactyl was taking at that moment, not with the earlier decision to use the canyon.
Trviksha: The pterodactyl optimizes for the immediate move. It does not consider what will happen five moves from now, or fifty. The canyon is rewarding right now, so it takes the canyon.
Blortz: But the consequence of taking the canyon in late dry season is not immediate — it is delayed. The danger comes later, when the monsoon hits while the pterodactyl is still, out of habit, flying canyon routes. The pterodactyl needs to consider not just the current reward, but the total future reward.
Total Future Reward
Trviksha modified the system. Instead of evaluating an action by its immediate reward alone, the pterodactyl should evaluate it by the total reward it expected to receive from this point forward — the immediate reward plus all future rewards.
But future rewards were uncertain. The pterodactyl did not know exactly what would happen ten moves from now. It could estimate — based on past experience — what the likely future reward was from each state.
Trviksha: From each state, the pterodactyl estimates a value — the total reward it expects to collect from that state onward, following its current strategy. A state near the destination has high value (delivery is close). A state in the canyon during late dry season has lower value than its immediate reward suggests — because the expected future includes a possible monsoon penalty.
The value of a state was not just about where the pterodactyl was, but about where it was likely to end up. A state that led to good future states was valuable. A state that led to dangerous future states was not — even if the state itself seemed fine.
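One way to picture this estimate is to average, over past flights, the total reward collected from a state onward. A minimal sketch in Python, with made-up states and reward numbers:

```python
def estimate_value(past_flights, state):
    """Average total reward collected from `state` onward,
    across every past flight that passed through it."""
    returns = []
    for flight in past_flights:            # flight: list of (state, reward) steps
        for i, (s, _) in enumerate(flight):
            if s == state:
                returns.append(sum(r for _, r in flight[i:]))
                break                      # count each flight once
    return sum(returns) / len(returns)

# Hypothetical late-dry-season flights: four are fast (+10 total),
# one in five gets caught by the monsoon (a -60 penalty later on).
flights = [[("canyon", 10)]] * 4 + [[("canyon", 10), ("storm", -60)]]
print(estimate_value(flights, "canyon"))  # (4 * 10 + (10 - 60)) / 5 = -2.0
```

Most flights through the canyon are rewarding, but the occasional large penalty drags the average down — the state is worth less than its immediate reward makes it look.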
Discounting
There was a subtlety. Future rewards were worth less than immediate rewards, for two reasons: they were uncertain (the pterodactyl might not survive to collect them), and they were delayed (a delivery today was more useful to Flinqva than a delivery next week).
Trviksha applied a discount: each time step into the future, the reward was multiplied by a factor slightly less than one. A reward of ten received now was worth ten. The same reward received one step later was worth nine. Two steps later, roughly eight. Ten steps later, roughly three and a half.
Trviksha: Near rewards count nearly full. Distant rewards are discounted. This means the pterodactyl cares about the future, but cares more about the near future than the distant future.
Blortz: If the discount is too steep — if the pterodactyl barely values anything beyond the next few moves — it will still take the canyon for the immediate reward. If the discount is too shallow — if distant future rewards count almost as much as immediate ones — the pterodactyl will be paralysed by concern about distant possibilities.
Trviksha: The discount rate is a choice, like the threshold and the loss function. It reflects how far ahead we want the agent to plan. For Flinqva's pterodactyls, a moderate discount — caring meaningfully about the next twenty or thirty moves — seems right.
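"Twenty or thirty moves" can be tied to a specific discount factor by a common rule of thumb — an assumption here, not part of Trviksha's account: with discount g, rewards beyond roughly 1 / (1 - g) steps barely register, so a factor near 0.96 gives a horizon of about twenty-five moves.

```python
def effective_horizon(discount):
    """Rule-of-thumb planning horizon for a given discount factor:
    rewards beyond roughly 1 / (1 - discount) steps contribute little."""
    return 1 / (1 - discount)

for g in (0.5, 0.9, 0.96, 0.99):
    print(f"discount {g}: cares about roughly {effective_horizon(g):.0f} moves ahead")
```

A steep discount (0.5) barely sees two moves ahead; a shallow one (0.99) weighs events a hundred moves out — the two failure modes Blortz describes.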

The Result
With value estimation and discounting, the pterodactyl's behaviour changed. In early dry season, it still took the canyon — the expected future was safe, so the high immediate reward dominated. In late dry season, the value estimate of canyon states dropped — the estimated future now included monsoon risk — and the pterodactyl voluntarily switched to the mountain route.
Flinqva: It avoids the canyon before the monsoon, without being told about the monsoon explicitly?
Trviksha: It learned from experience. Pterodactyls that used the canyon in late dry season had, in past flights, occasionally been caught in the monsoon and received large penalties. The value estimate for "canyon in late dry season" reflects those experiences. The pterodactyl does not understand monsoons — it has learned that being in the canyon at that time tends to lead to low total reward.
Glagalbagal: Pattern, not understanding. The same lesson as always.
Trviksha: The same lesson. But now the pattern extends through time — not just "what is happening now" but "what is likely to happen next."
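The story does not spell out how the estimate changes as experience accumulates. One simple scheme, sketched here with hypothetical numbers, is a running average that nudges the stored value toward the total reward each flight actually delivered:

```python
def update_value(value, total_reward, step_size=0.1):
    """Nudge the stored estimate toward what this flight actually earned."""
    return value + step_size * (total_reward - value)

# Hypothetical: "canyon in late dry season" initially looks like +10,
# then one flight is caught by the monsoon (-50), then two safe flights.
v = 10.0
for outcome in (-50, 10, 10):
    v = update_value(v, outcome)
print(v)  # the single bad flight keeps the estimate well below +10
```

A small step size means one bad flight does not erase the estimate, but repeated penalties steadily pull it down — which is how "canyon in late dry season" came to look worse than its immediate reward.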