Part 25 of 58
The Memory Stone
By Madhav Kaushish · Ages 12+
The hidden state forgot. It compressed and overwrote at every time step, and information from early in the sequence faded to nothing. Trviksha needed a memory that could persist.
The Stone
She gave each velociraptor a second object: a flat stone, separate from its regular hidden state. The stone sat on the workstation alongside the velociraptor's baskets of pebbles, but it served a different purpose.
Trviksha: Your hidden state changes at every time step. It is your working memory — constantly updated, constantly compressed. The stone is your long-term memory. Information written on the stone persists until you explicitly erase it.
Drysska: What is written on it?
Trviksha: A number. The stone carries a single number that represents whatever the network has decided is worth remembering long-term. But here is the crucial part: the stone does not change unless you choose to change it.
By default, the stone passed from one time step to the next unchanged. Unlike the hidden state, which was recomputed from scratch at every step, the stone simply carried forward. Information on the stone did not degrade through multiplication — it persisted.
The Forget Gate
But some information should be erased. A monsoon signal from three weeks ago had been relevant for predicting flooding last week; once the flooding passed, the signal was no longer useful. If the stone held everything forever, it would fill with outdated information.
Trviksha added a mechanism she called the forget gate. At each time step, a small sub-network examined the current input and the current hidden state and produced a number between zero and one. This number determined how much of the stone's current information to keep.
Trviksha: At each step, the forget gate looks at the current situation and decides: is the information on the stone still relevant? If the gate outputs one, everything on the stone is kept. If it outputs zero, everything is erased. If it outputs 0.7, seventy percent is kept and thirty percent fades.
Blortz: Who decides what the gate outputs?
Trviksha: The gate has its own weights — a small set of pebble arrangements, trained by the same backward error signals as everything else. The network learns when to forget. During training, if erasing old information at the right moment leads to better predictions, the forget gate's weights adjust to trigger that erasure.
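Outside the story, the forget gate Trviksha describes is a small learned function: squash a weighted combination of the current input and hidden state into the range zero to one. A minimal sketch, with made-up weights standing in for the trained pebble arrangements:

```python
import math

def sigmoid(z):
    # Squashes any number into the range (0, 1) -- a gate's output.
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(x, h, w_x=0.8, w_h=0.5, b=-0.1):
    # w_x, w_h, b are the gate's own weights. In the story they are
    # trained by backward error signals; here they are illustrative.
    return sigmoid(w_x * x + w_h * h + b)

f = forget_gate(x=2.0, h=1.0)    # near 1: keep most of the stone
g = forget_gate(x=-3.0, h=-1.0)  # near 0: erase most of the stone
```

Because the output is always strictly between zero and one, the gate can only scale the stone's contents, never flip their sign or amplify them.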
The Input Gate
Erasing was only half the problem. The other half: when to write new information onto the stone.
Not every time step contained information worth storing long-term. On a routine sunny day with no unusual readings, there was nothing to add to the stone. On a day when a rare pressure drop signalled an approaching monsoon, that signal should be carved into the stone and preserved.
Trviksha added a second mechanism: the input gate. At each time step, it examined the current input and hidden state and decided how much of the current information to write onto the stone.
Trviksha: The input gate decides what to store. It also has its own weights, trained by the same backward signals. The network learns what is worth remembering.
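The input gate has the same shape as the forget gate, just different trained weights. A sketch of the monsoon scenario above, with invented weights chosen so a rare pressure drop opens the gate and a routine reading leaves it shut:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gate(pressure_drop, h, w_x=1.5, w_h=0.3, b=-2.0):
    # Illustrative weights: a large pressure drop pushes the gate
    # toward 1 (write this onto the stone); a routine reading
    # leaves it near 0 (store nothing new).
    return sigmoid(w_x * pressure_drop + w_h * h + b)

routine = input_gate(pressure_drop=0.0, h=0.0)  # near 0: nothing stored
monsoon = input_gate(pressure_drop=3.0, h=0.5)  # near 1: carve it in
```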
Drysska: Let me make sure I understand. At each step, I have my stone from the previous step. The forget gate tells me how much to erase. The input gate tells me how much new information to write. The result is an updated stone that carries forward to the next step.
Trviksha: Exactly. The stone changes only through these two controlled operations: selective forgetting and selective writing. Between these operations, the stone persists unchanged.
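Drysska's summary is the whole update rule: keep a gated fraction of the old stone, add a gated fraction of the new information. A sketch with the gate outputs hard-coded for illustration (in the network they come from the trained sub-networks above):

```python
def update_stone(stone, new_info, forget, write):
    # forget and write are gate outputs between 0 and 1.
    # Keep a fraction of the old stone, add a fraction of the new info.
    return forget * stone + write * new_info

stone = 5.0  # carried over from the previous step
stone = update_stone(stone, new_info=2.0, forget=0.7, write=1.0)
# 0.7 * 5.0 + 1.0 * 2.0 = 5.5
```

Note the boundary case: with forget at one and write at zero, the stone passes through the step completely unchanged, which is the default persistence the story describes.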

Why It Works
The key was the gradient flow. During backward propagation, the error signal at a late time step needed to reach an early time step. In the basic recurrent network, the signal passed through the hidden state computation at every step — multiplied by weights each time, shrinking or exploding.
With the stone, the error signal had an alternative path. It could flow backward along the stone itself, which simply carried forward from step to step without multiplication. The forget gate modulated this flow — but as long as the forget gate stayed close to one (keeping information), the error signal passed through almost unchanged.
Blortz: The stone is a highway for the error signal. In the old network, the error had to pass through the weights at every step — thirty multiplications, exponential decay. With the stone, the error flows along the stone's path, which involves only the forget gate. If the forget gate is near one, the signal arrives almost intact.
Trviksha: The network can now learn from events thirty steps in the past, because the error signal can reach them.
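Blortz's highway argument can be checked with arithmetic. Compare a signal repeatedly multiplied by a recurrent weight (the old network's path) against one multiplied only by a forget gate sitting near one (the stone's path), over thirty steps; the weight and gate values below are illustrative:

```python
signal = 1.0

# Old network: the error passes through a recurrent weight at every
# step -- thirty multiplications, exponential decay.
w = 0.8
through_weights = signal * (w ** 30)  # shrinks to roughly a thousandth

# Stone path: the error is scaled only by the forget gate each step.
f = 0.98
along_stone = signal * (f ** 30)      # arrives more than half intact
```

The same arithmetic cuts the other way: if the forget gate learns to sit near zero at some step, the signal is cut off there, which is exactly the controlled erasure the gate is for.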
She retrained the weather prediction network with the stone mechanism — eight velociraptors, each with its own stone, its own forget gate, and its own input gate.
Seven-day forecasts improved from 51% accuracy to 73%. Thirty-day monsoon predictions improved dramatically — the network now retained early monsoon signals across the full month.
Vrothjelka: Now we are getting somewhere.