Part 25 of 58
The Memory Stone
By Madhav Kaushish · Ages 12+
The hidden state forgot. It compressed and overwrote at every time step, and information from early in the sequence faded to nothing. Trviksha needed a memory that could persist.
The Stone
She gave each velociraptor a second object: a flat stone, separate from its regular hidden state. The stone sat on the workstation alongside the velociraptor's baskets of pebbles, but it served a different purpose.
Trviksha: Your hidden state changes at every time step. It is your working memory — constantly updated, constantly compressed. The stone is your long-term memory. Information written on the stone persists until you explicitly erase it.
Drysska: What is written on it?
Trviksha: A number. The stone carries a single number that represents whatever the network has decided is worth remembering long-term. But here is the crucial part: the stone does not change unless you choose to change it.
By default, the stone passed from one time step to the next unchanged. Unlike the hidden state, which was recomputed from scratch at every step, the stone simply carried forward. Information on the stone did not degrade through multiplication — it persisted.
The Forget Gate
But some information should be erased. A monsoon signal from three weeks ago had been relevant for predicting flooding last week; once the flooding passed, the signal was no longer useful. If the stone held everything forever, it would fill with outdated information.
Trviksha added a mechanism she called the forget gate. At each time step, a small sub-network examined the current input and the current hidden state and produced a number between zero and one. This number determined how much of the stone's current information to keep.
Trviksha: At each step, the forget gate looks at the current situation and decides: is the information on the stone still relevant? If the gate outputs one, everything on the stone is kept. If it outputs zero, everything is erased. If it outputs 0.7, seventy percent is kept and thirty percent fades.
Blortz: Who decides what the gate outputs?
Trviksha: The gate has its own weights — a small set of pebble arrangements, trained by the same backward error signals as everything else. The network learns when to forget. During training, if erasing old information at the right moment leads to better predictions, the forget gate's weights adjust to trigger that erasure.
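Outside the story, the forget gate Trviksha describes is a small learned function: squash a weighted combination of the current input and hidden state into the range zero to one. A minimal sketch, with made-up weights standing in for the trained pebble arrangements:

```python
import math

def sigmoid(z):
    # Squashes any number into the range (0, 1) -- a gate's output.
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(x, h, w_x=0.8, w_h=0.5, b=-0.1):
    # w_x, w_h, b are the gate's own weights. In the story they are
    # trained by backward error signals; here they are illustrative.
    return sigmoid(w_x * x + w_h * h + b)

f = forget_gate(x=2.0, h=1.0)    # near 1: keep most of the stone
g = forget_gate(x=-3.0, h=-1.0)  # near 0: erase most of the stone
```

Because the output is always strictly between zero and one, the gate can only scale the stone's contents, never flip their sign or amplify them.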
The Input Gate
Erasing was only half the problem. The other half: when to write new information onto the stone.
Not every time step contained information worth storing long-term. On a routine sunny day with no unusual readings, there was nothing to add to the stone. On a day when a rare pressure drop signalled an approaching monsoon, that signal should be carved into the stone and preserved.
Trviksha added a second mechanism: the input gate. At each time step, it examined the current input and hidden state and decided how much of the current information to write onto the stone.
Trviksha: The input gate decides what to store. It also has its own weights, trained by the same backward signals. The network learns what is worth remembering.
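The input gate has the same shape as the forget gate, just different trained weights. A sketch of the monsoon scenario above, with invented weights chosen so a rare pressure drop opens the gate and a routine reading leaves it shut:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gate(pressure_drop, h, w_x=1.5, w_h=0.3, b=-2.0):
    # Illustrative weights: a large pressure drop pushes the gate
    # toward 1 (write this onto the stone); a routine reading
    # leaves it near 0 (store nothing new).
    return sigmoid(w_x * pressure_drop + w_h * h + b)

routine = input_gate(pressure_drop=0.0, h=0.0)  # near 0: nothing stored
monsoon = input_gate(pressure_drop=3.0, h=0.5)  # near 1: carve it in
```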
Drysska: Let me make sure I understand. At each step, I have my stone from the previous step. The forget gate tells me how much to erase. The input gate tells me how much new information to write. The result is an updated stone that carries forward to the next step.
Trviksha: Exactly. The stone changes only through these two controlled operations: selective forgetting and selective writing. Between these operations, the stone persists unchanged.
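Drysska's summary is the whole update rule: keep a gated fraction of the old stone, add a gated fraction of the new information. A sketch with the gate outputs hard-coded for illustration (in the network they come from the trained sub-networks above):

```python
def update_stone(stone, new_info, forget, write):
    # forget and write are gate outputs between 0 and 1.
    # Keep a fraction of the old stone, add a fraction of the new info.
    return forget * stone + write * new_info

stone = 5.0  # carried over from the previous step
stone = update_stone(stone, new_info=2.0, forget=0.7, write=1.0)
# 0.7 * 5.0 + 1.0 * 2.0 = 5.5
```

Note the boundary case: with forget at one and write at zero, the stone passes through the step completely unchanged, which is the default persistence the story describes.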

Why It Works
The key was the gradient flow. During backward propagation, the error signal at a late time step needed to reach an early time step. In the basic recurrent network, the signal passed through the hidden state computation at every step — multiplied by weights each time, shrinking or exploding.
With the stone, the error signal had an alternative path. It could flow backward along the stone itself, which simply carried forward from step to step without multiplication. The forget gate modulated this flow — but as long as the forget gate stayed close to one (keeping information), the error signal passed through almost unchanged.
Blortz: The stone is a highway for the error signal. In the old network, the error had to pass through the weights at every step — thirty multiplications, exponential decay. With the stone, the error flows along the stone's path, which involves only the forget gate. If the forget gate is near one, the signal arrives almost intact.
Trviksha: The network can now learn from events thirty steps in the past, because the error signal can reach them.
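Blortz's highway argument can be checked with arithmetic. Compare a signal repeatedly multiplied by a recurrent weight (the old network's path) against one multiplied only by a forget gate sitting near one (the stone's path), over thirty steps; the weight and gate values below are illustrative:

```python
signal = 1.0

# Old network: the error passes through a recurrent weight at every
# step -- thirty multiplications, exponential decay.
w = 0.8
through_weights = signal * (w ** 30)  # shrinks to roughly a thousandth

# Stone path: the error is scaled only by the forget gate each step.
f = 0.98
along_stone = signal * (f ** 30)      # arrives more than half intact
```

The same arithmetic cuts the other way: if the forget gate learns to sit near zero at some step, the signal is cut off there, which is exactly the controlled erasure the gate is for.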
She retrained the weather prediction network with the stone mechanism — eight velociraptors, each with its own stone, its own forget gate, and its own input gate.
Seven-day forecasts improved from 51% accuracy to 73%. Thirty-day monsoon predictions improved dramatically — the network now retained early monsoon signals across the full month.
Vrothjelka: Now we are getting somewhere.