Part 42 of 58

The Shortcut

By Madhav Kaushish · Ages 12+

The pterodactyl could now estimate the value of states and plan ahead. But how did those value estimates actually improve? The pterodactyl needed a mechanism to refine its understanding of which actions were good in which situations — from experience, flight by flight.

The Table

Trviksha set up a large stone tablet — a table with rows for every state the pterodactyl might encounter and columns for every action it could take. Each cell in the table held a number: the estimated value of taking that action in that state.

At the start, every cell was zero. The pterodactyl knew nothing — every action in every state was equally unknown.

Trviksha: This table is the pterodactyl's knowledge. Cell (mountain pass, fly east) holds the estimated total future reward for flying east when at the mountain pass. Cell (canyon entrance, fly south) holds the estimated total future reward for flying south when at the canyon entrance. Every possible combination of state and action has its own estimate.

Blortz: How many cells is that?

Trviksha: For the simplified grid, roughly one thousand states times six actions — six thousand cells. For the real terrain with weather conditions and time of day, considerably more.

Drysska: And these all start at zero?

Trviksha: They start at zero and improve with every flight.
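Trviksha's tablet is what reinforcement-learning texts call a Q-table. A minimal sketch in Python, using a few of the story's locations (the full grid would hold the thousand-odd states she mentions):

```python
# A Q-table: one row per state, one column per action.
# Every estimate starts at zero -- the pterodactyl knows nothing yet.
states = ["mountain pass", "canyon entrance", "ridge", "marsh edge"]
actions = ["fly N", "fly S", "fly E", "fly W", "climb", "descend"]

q_table = {s: {a: 0.0 for a in actions} for s in states}

# Cell (mountain pass, fly east): the estimated total future reward
# for flying east when at the mountain pass.
print(q_table["mountain pass"]["fly E"])  # 0.0 before any flights
```

With four states and six actions this toy table has twenty-four cells; the story's simplified grid has roughly six thousand.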

Learning from Each Flight

After each flight, the pterodactyl updated its table based on what happened. The update rule was simple:

When the pterodactyl was in state S, took action A, received reward R, and ended up in state S', it nudged the cell (S, A) toward a target: the immediate reward R plus the discounted value of the best action available in state S' (the estimated future reward from the new position).

Trviksha: The pterodactyl arrives at the mountain pass. It flies east. It receives a small negative reward (one time step spent) and ends up at the ridge. It looks up the best action from the ridge — the highest value in the ridge's row — and adds the discounted future value to the immediate reward. It nudges the mountain-pass-fly-east cell toward this total.

Blortz: It is not replacing the old estimate — it is nudging it?

Trviksha: Nudging, because the old estimate is based on many past flights, and one new flight should influence the estimate, not overwrite it. The nudge is small: ten percent of the way toward the new information. Over many flights, the estimate converges to the true value.
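The nudge Trviksha describes is the standard Q-learning update. A sketch, with a learning rate of 0.1 for her "ten percent" nudge and a discount factor of 0.9 (an illustrative value; the story does not name one):

```python
ALPHA = 0.1   # learning rate: nudge 10% of the way toward the new information
GAMMA = 0.9   # discount factor on future reward (illustrative value)

def q_update(q_table, state, action, reward, next_state):
    """Nudge the estimate for (state, action) toward
    reward + discounted value of the best action from next_state."""
    best_next = max(q_table[next_state].values())
    target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (target - q_table[state][action])

# One flight step: mountain pass -> fly east -> ridge, reward -1 (one time step).
# The ridge values below are made-up numbers for illustration.
q_table = {
    "mountain pass": {"fly E": 0.0, "fly W": 0.0},
    "ridge": {"fly E": 5.0, "fly W": 2.0},
}
q_update(q_table, "mountain pass", "fly E", -1.0, "ridge")
# New estimate: 0.0 + 0.1 * (-1 + 0.9 * 5 - 0.0) = 0.35
print(q_table["mountain pass"]["fly E"])
```

Because the nudge is only ten percent of the gap, one lucky or unlucky flight shifts the estimate a little instead of overwriting the evidence of all previous flights.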

A large stone tablet with a grid: rows are labelled with locations (mountain pass, canyon, ridge, marsh edge) and columns with actions (fly N, S, E, W, climb, descend). Each cell contains a pebble arrangement representing the estimated value. A cartoon pterodactyl has just landed at "ridge" and a velociraptor updates the cell for "mountain pass / fly east" by nudging its pebble arrangement slightly upward.

Convergence

She ran the pterodactyl for three thousand flights. After each flight, the table was updated. Trviksha tracked the values in the table over time.

Initially, the values fluctuated wildly — each new flight changed the estimates substantially. By flight five hundred, the estimates had begun to stabilise. By flight two thousand, the table was nearly converged — the values changed by less than one percent per flight.

The converged table showed clear patterns. States near the destination had high values. States in storm zones had low values. The mountain pass had moderate values in clear weather and very low values during storms. The canyon had high values in early dry season and low values in late dry season.
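The training run can be reproduced in miniature. This sketch uses a toy five-state corridor with the destination at one end; the corridor, exploration rate, and discount factor are illustrative assumptions, not the story's terrain:

```python
import random

random.seed(0)

# A five-state corridor: the destination is state 4, reward is -1 per
# move (one time step spent), as in the story.
N, GOAL = 5, 4
ACTIONS = [-1, 1]                  # step west / step east
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1  # learning rate, discount, exploration

q = {s: {a: 0.0 for a in ACTIONS} for s in range(N)}

for flight in range(3000):         # three thousand training flights
    s = 0
    while s != GOAL:
        # Mostly take the best-known action; occasionally explore.
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(q[s], key=q[s].get)
        s2 = min(max(s + a, 0), GOAL)
        future = 0.0 if s2 == GOAL else GAMMA * max(q[s2].values())
        q[s][a] += ALPHA * (-1.0 + future - q[s][a])
        s = s2

# After training, the greedy action from every state points east (+1),
# toward the destination.
print(max(q[0], key=q[0].get))
```

Early flights swing the estimates wildly; by the later flights each update barely moves them, which is the convergence Trviksha tracked.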

Flinqva: Now I can read the table directly. Before each flight, I look up the pterodactyl's current state, find the action with the highest value, and that is the action to take.

Trviksha: Exactly. The table is the learned strategy. The pterodactyl does not need to think — it looks up the answer. All the "thinking" happened during the three thousand training flights.
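Flinqva's read-off is what reinforcement-learning texts call a greedy policy: find the highest value in the current state's row. A sketch, with made-up values:

```python
def best_action(q_table, state):
    """Look up the action with the highest estimated value in this state."""
    row = q_table[state]
    return max(row, key=row.get)

# Hypothetical converged values for one state.
q_table = {"mountain pass": {"fly E": 0.35, "fly W": -0.2, "climb": 0.1}}
print(best_action(q_table, "mountain pass"))  # fly E
```

No planning happens at flight time; the lookup is the whole decision.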

Blortz: But the table requires enumerating every possible state. For the simplified grid, that is feasible. For the real terrain, with weather, time of day, cargo weight, and the remaining fuel of the pterodactyl...

Trviksha: The table becomes impossibly large. Yes. For real problems, I would replace the table with a network — a neural network that takes the state as input and outputs the estimated value for each action. The network generalises across similar states instead of storing each state individually.
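The swap Trviksha proposes can be sketched as a tiny network that maps state features to one value per action. The weights below are random and untrained, and the feature encoding is a made-up illustration; a real agent would train the weights toward the same targets the table update used:

```python
import math
import random

random.seed(1)

N_FEATURES, N_HIDDEN, N_ACTIONS = 4, 8, 6

# Randomly initialised weights; training would nudge these, not table cells.
w1 = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_ACTIONS)]

def q_values(features):
    """Map a state's features to one estimated value per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# A state described by features (position, weather, time, cargo -- illustrative),
# rather than by its own dedicated row. Similar states get similar estimates.
state = [0.3, 0.8, 0.1, 0.0]
print(len(q_values(state)))  # 6 -- one estimate per action, no table needed
```

The payoff is generalisation: two nearby states share features, so the network can estimate values for states it has never visited, which no finite table can do.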

Blortz: So the pterodactyl problem eventually leads back to neural networks.

Trviksha: Everything leads back to neural networks, it seems. The reinforcement learning framework — states, actions, rewards, value estimation — is the structure. The neural network is the function that makes it practical for large problems.