Part 39 of 58
The Maze
By Madhav Kaushish · Ages 12+
GlagalCloud had expanded. Trviksha's networks now handled patient records, grain stores, field surveys, weather forecasts, contracts, and archives. The company operated from seven locations across Sonhlagot, connected by a fleet of delivery pterodactyls that carried tablets and pebble shipments between offices.
The pterodactyl routes were managed by a logistics coordinator named Flinqva. She was losing her mind.
The Route Problem
Flinqva: I have sixteen pterodactyls, seven offices, and three hundred daily deliveries. The terrain between offices changes constantly — storms close mountain passes, seasonal flooding blocks valley routes, and the pterodactyls refuse to fly over the Grintjak marshes after one of them was eaten by something last year.
Trviksha: I can build a model that predicts the best route for each delivery.
Flinqva: Based on what training data? Every day the conditions are different. A route that worked yesterday may be impassable today. And there is no "correct answer" — I do not have a tablet that says "for these conditions, take this route." I only know whether the delivery arrived on time or not.
This was a fundamentally different problem from everything Trviksha had built so far. With patients, she had labels — sick or healthy. With grain stores, she had labels — contaminated or clean. With text, she had the next word. In every case, there was a right answer to compare against.
Here, there was no right answer. There was only an outcome — the pterodactyl arrived or it did not, quickly or slowly — and it had to work out, from its own experience, which actions led to good outcomes.
States, Actions, Rewards
Trviksha thought about the problem differently. Instead of a network that took inputs and produced an answer, she needed a system where an agent — the pterodactyl — interacted with an environment — the terrain and weather — over time.
At each moment, the pterodactyl was in a state: its current location, the weather conditions, the time of day, the load it was carrying. From each state, it could choose an action: fly north, fly south, fly east, fly west, climb higher, descend lower, or wait.
After taking an action, the pterodactyl arrived in a new state — a new location with possibly different conditions. And it received a reward signal: a small negative reward for each time step (to encourage speed), a large positive reward for delivering the cargo, and a large negative reward for crashing or getting lost.
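The pieces Trviksha describes can be sketched in a few lines of Python. Everything here is illustrative: the story never fixes exact reward numbers, so the +100/-100/-1 values and the field names are assumptions chosen to match the description (small step cost, big delivery bonus, big crash penalty).

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical action set for the pterodactyl (names are illustrative).
class Action(Enum):
    NORTH = 0
    SOUTH = 1
    EAST = 2
    WEST = 3
    CLIMB = 4
    DESCEND = 5
    WAIT = 6

# A state bundles everything the pterodactyl observes at one moment.
@dataclass(frozen=True)
class State:
    location: tuple    # (row, col) on the terrain grid
    weather: str       # e.g. "clear", "storm"
    time_of_day: str   # e.g. "morning", "dusk"
    load: str          # the cargo being carried

# Reward signal as described in the story: a small negative reward per
# time step, a large bonus for delivery, a large penalty for crashing.
def reward(delivered: bool, crashed: bool) -> float:
    if delivered:
        return +100.0
    if crashed:
        return -100.0
    return -1.0  # each step costs a little, to encourage speed
```

The `frozen=True` makes states hashable, so they can later serve as keys in a table of learned values.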
Trviksha: The pterodactyl tries things. It receives rewards or penalties. Over many flights, it should learn which actions in which states lead to high total reward.
Blortz: "Should" learn. How?
Trviksha: By adjusting its behaviour based on outcomes. If flying north from position A during a storm leads to a penalty, it should learn to avoid flying north from position A during a storm. If climbing higher at the mountain pass leads to a successful delivery, it should learn to climb higher at the mountain pass.

The First Flights
She started with a simplified version: a grid-based terrain map between two offices. The pterodactyl started at one office and had to reach the other. Each cell on the grid had a terrain type (clear, mountain, marsh, storm) that affected travel time and risk. The pterodactyl could move to any adjacent cell.
For the first hundred flights, the pterodactyl chose actions randomly. It wandered aimlessly, occasionally stumbling onto the destination by chance. When it arrived, it received a reward. When it wandered into a storm or a marsh, it received a penalty.
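The first hundred random flights can be sketched as a loop over a toy grid. The layout, penalties, and step limit below are invented for illustration; only the overall shape — random moves, step costs, terrain penalties, a delivery bonus — comes from the story:

```python
import random

# Toy terrain grid between two offices (layout is made up).
GRID = [
    ["clear", "clear",    "storm"],
    ["clear", "mountain", "clear"],
    ["marsh", "clear",    "clear"],
]
START, GOAL = (0, 0), (2, 2)
MOVES = {"north": (-1, 0), "south": (1, 0),
         "west": (0, -1), "east": (0, 1)}

def random_flight(max_steps=200, seed=None):
    """One purely random flight: wander until the goal or the step limit."""
    rng = random.Random(seed)
    r, c = START
    total_reward = 0.0
    for _ in range(max_steps):
        dr, dc = MOVES[rng.choice(list(MOVES))]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])):
            continue                     # cannot fly off the map
        r, c = nr, nc
        total_reward -= 1.0              # each step costs a little
        if GRID[r][c] in ("storm", "marsh"):
            total_reward -= 10.0         # penalty for bad terrain
        if (r, c) == GOAL:
            return total_reward + 100.0  # delivery bonus
    return total_reward                  # wandered without arriving
```

Each such flight is one recorded experience — states visited, actions taken, reward collected — that the learner can later draw on instead of choosing randomly.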
Flinqva: This is worse than my current system. The pterodactyl has no idea what it is doing.
Trviksha: It is learning. After a hundred random flights, it has experienced many states and many outcomes. The next step is to use those experiences to choose better actions.
Flinqva: How long before it stops being useless?
Trviksha: That depends on how large the terrain is and how varied the conditions are. For this simplified grid, a few hundred flights should suffice. For the real terrain — with weather patterns, seasonal changes, and that thing in the marshes — considerably longer.
The fundamental shift was complete. Trviksha had moved from supervised learning — where a teacher provided the right answer for every example — to a system where the agent learned from the consequences of its own actions. No teacher. No labels. Just experience and outcomes.