The Second Row

Trviksha stared at the stone grid for three days. The fungus data sat there, stubbornly non-linear, two clusters of black pebbles in opposite corners that no single line could separate.

On the fourth day, Blortz made his suggestion concrete.

Blortz: One line cannot separate them. But two lines can isolate them. Draw a line from the upper-left to the middle — everything above and to the left is "cold and wet." Draw another line from the lower-right to the middle — everything below and to the right is "hot and dry." Each line captures one contaminated cluster. A store is at risk if it falls on the flagged side of either line.

Trviksha: Two lines means two perceptrons. Each one draws its own line. Then I need a third decision that combines their outputs: "if either one flags the store, predict contamination."

Blortz: Three velociraptors instead of one.

Building It

She assembled the team.

Velociraptor A (Plonkva): Takes the two inputs — humidity and temperature. Computes a weighted sum with her own weights. Applies a threshold. Outputs a pebble: black if the store falls in the cold-wet danger zone, white otherwise.

Velociraptor B (Grinjka): Takes the same two inputs. Computes a different weighted sum with different weights. Applies a threshold. Outputs a pebble: black if the store falls in the hot-dry danger zone, white otherwise.

Velociraptor C (Drysska): Takes the two outputs from Plonkva and Grinjka as her inputs. Computes a weighted sum of those. Applies a threshold. Outputs the final prediction: black for contaminated, white for clean.

The velociraptors were arranged in two rows. The first row — Plonkva and Grinjka — processed the raw inputs. The second row — Drysska — processed the first row's outputs. The raw data never reached Drysska directly. She only saw what the first row told her.

Two rows of velociraptors at stone workstations. The back row has two velociraptors (Plonkva and Grinjka) each receiving the same two input baskets (humidity and temperature). Each produces one output pebble. These two output pebbles feed into the front row, where Drysska sits at a single workstation, combining them into a final output. Lines connect the stations showing data flow

Training

Trviksha trained all the weights simultaneously using the same adjustment method: compute a prediction, measure the error, adjust each weight in the direction that reduces the error. But now there were nine adjustable values — each of the three velociraptors had two input weights and a threshold.

After training, the two-layer network achieved 94% accuracy on the fungus data — up from the 51% that a single perceptron had managed.

Jvelthra: It works. Finally.

What the First Row Invented

Trviksha examined the trained weights. She wanted to understand what Plonkva and Grinjka had learned.

Plonkva's weights were large and positive for humidity, large and negative for temperature. Plonkva was computing something like "is it wetter than it is hot?" — a detector for the cold-wet zone.

Grinjka's weights were the reverse: large and positive for temperature, large and negative for humidity. Grinjka was detecting the hot-dry zone.

Neither "wetter than it is hot" nor "hotter than it is wet" was an input that Trviksha had provided. These were features the network had invented — intermediate representations that existed only inside the network, never visible to the outside world.

Trviksha: I gave the network humidity and temperature. It turned those into "cold-wet risk" and "hot-dry risk." I did not ask it to. The adjustment process found these intermediate features because they made the final prediction possible.

Blortz: The first row is a translator. It takes the original language of the data and rewrites it in a language that the second row can understand.

Trviksha: And the translation is learned, not designed. I did not tell Plonkva to compute "cold-wet risk." The training process discovered that this was a useful thing to compute.

The Hidden Layer

In the original input space — humidity and temperature — the problem was not linearly separable. But in the new space created by Plonkva and Grinjka — "cold-wet risk" and "hot-dry risk" — the problem was linearly separable. Stores with high values on either new feature were contaminated. Drysska could separate them with a single line in this new space.

The first row's job was to find a transformation of the inputs that made the second row's job possible. The first row was hidden — its outputs were internal to the network, never seen by the outside world, never specified by Trviksha. It was a hidden layer.

Glagalbagal: The velociraptors are thinking?

Trviksha: The velociraptors are computing. Whether that constitutes thinking is a question I will not answer today.

With enough velociraptors in the hidden layer, any pattern could in principle be captured — any boundary, no matter how convoluted, could be approximated by enough lines combined together. Two layers could solve problems one layer never could.

Jvelthra: My real data has twelve measurements per grain store, not two.

Trviksha: Then I need more velociraptors in the first row. Possibly many more. But the principle is the same: the first row transforms the raw inputs into useful features, and the second row makes a decision based on those features.

Jvelthra: How many is "many more"?

Trviksha: That is a question I do not yet know how to answer. But I have a more immediate problem first: training. With two inputs and three velociraptors, the adjustment was manageable. With twelve inputs and a larger hidden layer, the question of how to adjust the hidden layer's weights becomes much harder. The hidden layer never sees the right answer. How does it know which way to adjust?

Trviksha has built a multi-layer perceptron — the foundation of modern deep learning. The key insight is that hidden layers transform data into new representations. A problem that is impossible to solve in its original form may become easy after the right transformation. The hidden layer discovers this transformation automatically during training — nobody tells it what features to compute. This idea is sometimes stated as the universal approximation theorem: a network with one hidden layer containing enough neurons can approximate any continuous function. The practical question — how many neurons is "enough" — depends on the complexity of the pattern. Think about recognising a friend in a crowd. Your brain doesn't compare raw visual input to a stored image. It extracts features — the shape of a nose, the way someone walks, the colour of a jacket — through multiple layers of processing. What "intermediate features" might be useful for recognising someone, that aren't directly present in the raw image?