Part 42 of 58

The Shortcut

By Madhav Kaushish · Ages 12+

The pterodactyl could now estimate the value of states and plan ahead. But how did those value estimates actually improve? The pterodactyl needed a mechanism to refine its understanding of which actions were good in which situations — from experience, flight by flight.

The Table

Trviksha set up a large stone tablet — a table with rows for every state the pterodactyl might encounter and columns for every action it could take. Each cell in the table held a number: the estimated value of taking that action in that state.

At the start, every cell was zero. The pterodactyl knew nothing — every action in every state was equally unknown.

Trviksha: This table is the pterodactyl's knowledge. Cell (mountain pass, fly east) holds the estimated total future reward for flying east when at the mountain pass. Cell (canyon entrance, fly south) holds the estimated total future reward for flying south when at the canyon entrance. Every possible combination of state and action has its own estimate.

Blortz: How many cells is that?

Trviksha: For the simplified grid, roughly one thousand states times six actions — six thousand cells. For the real terrain with weather conditions and time of day, considerably more.

Drysska: And these all start at zero?

Trviksha: They start at zero and improve with every flight.
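Trviksha's tablet is what reinforcement-learning texts call a Q-table. A minimal sketch in Python, using a few of the story's locations (the full grid would hold the thousand-odd states she mentions):

```python
# A Q-table: one row per state, one column per action.
# Every estimate starts at zero -- the pterodactyl knows nothing yet.
states = ["mountain pass", "canyon entrance", "ridge", "marsh edge"]
actions = ["fly N", "fly S", "fly E", "fly W", "climb", "descend"]

q_table = {s: {a: 0.0 for a in actions} for s in states}

# Cell (mountain pass, fly east): the estimated total future reward
# for flying east when at the mountain pass.
print(q_table["mountain pass"]["fly E"])  # 0.0 before any flights
```

With four states and six actions this toy table has twenty-four cells; the story's simplified grid has roughly six thousand.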

Learning from Each Flight

After each flight, the pterodactyl updated its table based on what happened. The update rule was simple:

When the pterodactyl was in state S, took action A, received reward R, and ended up in state S', it nudged the cell (S, A) toward a target: the immediate reward R plus the discounted value of the best action available in state S' (the estimated future reward from the new position).

Trviksha: The pterodactyl arrives at the mountain pass. It flies east. It receives a small negative reward (one time step spent) and ends up at the ridge. It looks up the best action from the ridge — the highest value in the ridge's row — and adds the discounted future value to the immediate reward. It nudges the mountain-pass-fly-east cell toward this total.

Blortz: It is not replacing the old estimate — it is nudging it?

Trviksha: Nudging, because the old estimate is based on many past flights, and one new flight should influence the estimate, not overwrite it. The nudge is small: ten percent of the way toward the new information. Over many flights, the estimate converges to the true value.
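The nudge Trviksha describes is the standard Q-learning update. A sketch, with a learning rate of 0.1 for her "ten percent" nudge and a discount factor of 0.9 (an illustrative value; the story does not name one):

```python
ALPHA = 0.1   # learning rate: nudge 10% of the way toward the new information
GAMMA = 0.9   # discount factor on future reward (illustrative value)

def q_update(q_table, state, action, reward, next_state):
    """Nudge the estimate for (state, action) toward
    reward + discounted value of the best action from next_state."""
    best_next = max(q_table[next_state].values())
    target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (target - q_table[state][action])

# One flight step: mountain pass -> fly east -> ridge, reward -1 (one time step).
# The ridge values below are made-up numbers for illustration.
q_table = {
    "mountain pass": {"fly E": 0.0, "fly W": 0.0},
    "ridge": {"fly E": 5.0, "fly W": 2.0},
}
q_update(q_table, "mountain pass", "fly E", -1.0, "ridge")
# New estimate: 0.0 + 0.1 * (-1 + 0.9 * 5 - 0.0) = 0.35
print(q_table["mountain pass"]["fly E"])
```

Because the nudge is only ten percent of the gap, one lucky or unlucky flight shifts the estimate a little instead of overwriting the evidence of all previous flights.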

A large stone tablet with a grid: rows are labelled with locations (mountain pass, canyon, ridge, marsh edge) and columns with actions (fly N, S, E, W, climb, descend). Each cell contains a pebble arrangement representing the estimated value. A cartoon pterodactyl has just landed at "ridge" and a velociraptor updates the cell for "mountain pass / fly east" by nudging its pebble arrangement slightly upward.

Convergence

She ran the pterodactyl for three thousand flights. After each flight, the table was updated. Trviksha tracked the values in the table over time.

Initially, the values fluctuated wildly — each new flight changed the estimates substantially. By flight five hundred, the estimates had begun to stabilise. By flight two thousand, the table was nearly converged — the values changed by less than one percent per flight.

The converged table showed clear patterns. States near the destination had high values. States in storm zones had low values. The mountain pass had moderate values in clear weather and very low values during storms. The canyon had high values in early dry season and low values in late dry season.
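The training run can be reproduced in miniature. This sketch uses a toy five-state corridor with the destination at one end; the corridor, exploration rate, and discount factor are illustrative assumptions, not the story's terrain:

```python
import random

random.seed(0)

# A five-state corridor: the destination is state 4, reward is -1 per
# move (one time step spent), as in the story.
N, GOAL = 5, 4
ACTIONS = [-1, 1]                  # step west / step east
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1  # learning rate, discount, exploration

q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

for flight in range(3000):         # three thousand training flights
    s = 0
    while s != GOAL:
        # Mostly take the best-known action; occasionally explore.
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(q[s], key=q[s].get)
        s2 = min(max(s + a, 0), GOAL)
        future = 0.0 if s2 == GOAL else GAMMA * max(q[s2].values())
        q[s][a] += ALPHA * (-1.0 + future - q[s][a])
        s = s2

# After training, the greedy action from every state points east (+1),
# toward the destination.
print(max(q[0], key=q[0].get))
```

Early flights swing the estimates wildly; by the later flights each update barely moves them, which is the convergence Trviksha tracked.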

Flinqva: Now I can read the table directly. Before each flight, I look up the pterodactyl's current state, find the action with the highest value, and that is the action to take.

Trviksha: Exactly. The table is the learned strategy. The pterodactyl does not need to think — it looks up the answer. All the "thinking" happened during the three thousand training flights.
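Flinqva's read-off is what reinforcement-learning texts call a greedy policy: find the highest value in the current state's row. A sketch, with made-up values:

```python
def best_action(q_table, state):
    """Look up the action with the highest estimated value in this state."""
    row = q_table[state]
    return max(row, key=row.get)

# Hypothetical converged values for one state.
q_table = {"mountain pass": {"fly E": 0.35, "fly W": -0.2, "climb": 0.1}}
print(best_action(q_table, "mountain pass"))  # fly E
```

No planning happens at flight time; the lookup is the whole decision.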

Blortz: But the table requires enumerating every possible state. For the simplified grid, that is feasible. For the real terrain, with weather, time of day, cargo weight, and the remaining fuel of the pterodactyl...

Trviksha: The table becomes impossibly large. Yes. For real problems, I would replace the table with a network — a neural network that takes the state as input and outputs the estimated value for each action. The network generalises across similar states instead of storing each state individually.
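The swap Trviksha proposes can be sketched as a tiny network that maps state features to one value per action. The weights below are random and untrained, and the feature encoding is a made-up illustration; a real agent would train the weights toward the same targets the table update used:

```python
import math
import random

random.seed(1)

N_FEATURES, N_HIDDEN, N_ACTIONS = 4, 8, 6

# Randomly initialised weights; training would nudge these, not table cells.
w1 = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_ACTIONS)]

def q_values(features):
    """Map a state's features to one estimated value per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# A state described by features (position, weather, time, cargo -- illustrative),
# rather than by its own dedicated row. Similar states get similar estimates.
state = [0.3, 0.8, 0.1, 0.0]
print(len(q_values(state)))  # 6 -- one estimate per action, no table needed
```

The payoff is generalisation: two nearby states share features, so the network can estimate values for states it has never visited, which no finite table can do.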

Blortz: So the pterodactyl problem eventually leads back to neural networks.

Trviksha: Everything leads back to neural networks, it seems. The reinforcement learning framework — states, actions, rewards, value estimation — is the structure. The neural network is the function that makes it practical for large problems.