Part 39 of 58
The Maze
By Madhav Kaushish · Ages 12+
GlagalCloud had expanded. Trviksha's networks now handled patient records, grain stores, field surveys, weather forecasts, contracts, and archives. The company operated from seven locations across Sonhlagot, connected by a fleet of delivery pterodactyls that carried tablets and pebble shipments between offices.
The pterodactyl routes were managed by a logistics coordinator named Flinqva. She was losing her mind.
The Route Problem
Flinqva: I have sixteen pterodactyls, seven offices, and three hundred daily deliveries. The terrain between offices changes constantly — storms close mountain passes, seasonal flooding blocks valley routes, and the pterodactyls refuse to fly over the Grintjak marshes after one of them was eaten by something last year.
Trviksha: I can build a model that predicts the best route for each delivery.
Flinqva: Based on what training data? Every day the conditions are different. A route that worked yesterday may be impassable today. And there is no "correct answer" — I do not have a tablet that says "for these conditions, take this route." I only know whether the delivery arrived on time or not.
This was a fundamentally different problem from everything Trviksha had built so far. With patients, she had labels — sick or healthy. With grain stores, she had labels — contaminated or clean. With text, she had the next word. In every case, there was a right answer to compare against.
Here, there was no right answer. There was only an outcome — the pterodactyl arrived or it did not, quickly or slowly — and it had to work out, from its own experience, which actions led to good outcomes.
States, Actions, Rewards
Trviksha thought about the problem differently. Instead of a network that took inputs and produced an answer, she needed a system where an agent — the pterodactyl — interacted with an environment — the terrain and weather — over time.
At each moment, the pterodactyl was in a state: its current location, the weather conditions, the time of day, the load it was carrying. From each state, it could choose an action: fly north, fly south, fly east, fly west, climb higher, descend lower, or wait.
After taking an action, the pterodactyl arrived in a new state — a new location with possibly different conditions. And it received a reward signal: a small negative reward for each time step (to encourage speed), a large positive reward for delivering the cargo, and a large negative reward for crashing or getting lost.
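The pieces Trviksha describes can be sketched in a few lines of Python. Everything here is illustrative: the story never fixes exact reward numbers, so the +100/-100/-1 values and the field names are assumptions chosen to match the description (small step cost, big delivery bonus, big crash penalty).

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical action set for the pterodactyl (names are illustrative).
class Action(Enum):
    NORTH = 0
    SOUTH = 1
    EAST = 2
    WEST = 3
    CLIMB = 4
    DESCEND = 5
    WAIT = 6

# A state bundles everything the pterodactyl observes at one moment.
@dataclass(frozen=True)
class State:
    location: tuple    # (row, col) on the terrain grid
    weather: str       # e.g. "clear", "storm"
    time_of_day: str   # e.g. "morning", "dusk"
    load: str          # the cargo being carried

# Reward signal as described in the story: a small negative reward per
# time step, a large bonus for delivery, a large penalty for crashing.
def reward(delivered: bool, crashed: bool) -> float:
    if delivered:
        return +100.0
    if crashed:
        return -100.0
    return -1.0  # each step costs a little, to encourage speed
```

The `frozen=True` makes states hashable, so they can later serve as keys in a table of learned values.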
Trviksha: The pterodactyl tries things. It receives rewards or penalties. Over many flights, it should learn which actions in which states lead to high total reward.
Blortz: "Should" learn. How?
Trviksha: By adjusting its behaviour based on outcomes. If flying north from position A during a storm leads to a penalty, it should learn to avoid flying north from position A during a storm. If climbing higher at the mountain pass leads to a successful delivery, it should learn to climb higher at the mountain pass.

The First Flights
She started with a simplified version: a grid-based terrain map between two offices. The pterodactyl started at one office and had to reach the other. Each cell on the grid had a terrain type (clear, mountain, marsh, storm) that affected travel time and risk. The pterodactyl could move to any adjacent cell.
For the first hundred flights, the pterodactyl chose actions randomly. It wandered aimlessly, occasionally stumbling onto the destination by chance. When it arrived, it received a reward. When it wandered into a storm or a marsh, it received a penalty.
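The first hundred random flights can be sketched as a loop over a toy grid. The layout, penalties, and step limit below are invented for illustration; only the overall shape — random moves, step costs, terrain penalties, a delivery bonus — comes from the story:

```python
import random

# Toy terrain grid between two offices (layout is made up).
GRID = [
    ["clear", "clear",    "storm"],
    ["clear", "mountain", "clear"],
    ["marsh", "clear",    "clear"],
]
START, GOAL = (0, 0), (2, 2)
MOVES = {"north": (-1, 0), "south": (1, 0),
         "west": (0, -1), "east": (0, 1)}

def random_flight(max_steps=200, seed=None):
    """One purely random flight: wander until the goal or the step limit."""
    rng = random.Random(seed)
    r, c = START
    total_reward = 0.0
    for _ in range(max_steps):
        dr, dc = MOVES[rng.choice(list(MOVES))]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
            continue                     # cannot fly off the map
        r, c = nr, nc
        total_reward -= 1.0              # each step costs a little
        if GRID[r][c] in ("storm", "marsh"):
            total_reward -= 10.0         # penalty for bad terrain
        if (r, c) == GOAL:
            return total_reward + 100.0  # delivery bonus
    return total_reward                  # wandered without arriving
```

Each such flight is one recorded experience — states visited, actions taken, reward collected — that the learner can later draw on instead of choosing randomly.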
Flinqva: This is worse than my current system. The pterodactyl has no idea what it is doing.
Trviksha: It is learning. After a hundred random flights, it has experienced many states and many outcomes. The next step is to use those experiences to choose better actions.
Flinqva: How long before it stops being useless?
Trviksha: That depends on how large the terrain is and how varied the conditions are. For this simplified grid, a few hundred flights should suffice. For the real terrain — with weather patterns, seasonal changes, and that thing in the marshes — considerably longer.
The fundamental shift was complete. Trviksha had moved from supervised learning — where a teacher provided the right answer for every example — to a system where the agent learned from the consequences of its own actions. No teacher. No labels. Just experience and outcomes.