Part 32 of 58
The Parallel Advantage
By Madhav Kaushish · Ages 12+
The transformer blocks had proven themselves on Phlontjek's contracts. But Trviksha had been running them alongside the recurrent network — using the recurrent network to process the sequence and the transformer blocks to add long-range connections. She wondered: did she still need the recurrent part at all?
The Experiment
She removed the recurrent network entirely. The input tokens — with their position encodings — fed directly into a stack of transformer blocks. No sequential processing. No hidden state carried from step to step. No loop.
Every token in the contract was processed simultaneously by the first transformer block. The results fed simultaneously into the second block. Then the third. At no point did the network process tokens one at a time.
Blortz: How long does it take to process a two-thousand-token contract?
Trviksha measured both approaches:
- Recurrent network: Processed two thousand tokens one at a time. Each token had to wait for the previous one to finish. Total time: proportional to the length of the sequence. Two thousand steps.
- Transformer: Processed all two thousand tokens simultaneously. The attention mechanism computed all relevance scores at once. Total time: proportional to the number of blocks (three), not the sequence length. Three steps, regardless of whether the contract was twenty tokens or two thousand.
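Trviksha's measurement can be sketched in a few lines of Python. This is a toy model of the timing argument, not her actual pebble machinery: it counts sequential steps only, assuming there are enough parallel workers (velociraptors) that everything within one transformer block finishes at once.

```python
def recurrent_steps(seq_len: int) -> int:
    # A recurrent network visits tokens one at a time: each step
    # must wait for the previous hidden state before it can begin.
    steps = 0
    for _ in range(seq_len):
        steps += 1  # one sequential step per token
    return steps

def transformer_steps(num_blocks: int) -> int:
    # With enough parallel workers, each block handles every token
    # at once, so the sequential depth is just the number of blocks.
    return num_blocks

print(recurrent_steps(2000))   # 2000 sequential steps for the contract
print(transformer_steps(3))    # 3 sequential steps, at any length
```

The transformer still does more arithmetic in total; it just does almost none of it in sequence.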
Blortz: The transformer is faster because it processes everything in parallel? Even though it does more total computation?
Trviksha: Exactly. The recurrent network does less total computation — but it does it sequentially, one step at a time. The transformer does more total computation — four million relevance scores per head, in every block — but it does all of it at once. With enough velociraptors working in parallel, the transformer finishes far faster.
Drysska: I always did prefer working alongside other velociraptors rather than waiting in line.
The Full Architecture
With the recurrent network removed, the complete architecture was:
- Input encoding: Each token is converted to a pebble arrangement (embedding) and combined with a position encoding.
- Transformer blocks: Multiple blocks, each consisting of multi-head self-attention followed by a feedforward network, with residual connections.
- Output layer: The final representations are used for whatever task is required — answering questions, classifying clauses, predicting the next token.
No recurrence. No sequential processing. No hidden state carrying forward through time. All the "memory" came from attention — each position's ability to look directly at any other position.
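The three-part architecture can be sketched in Python with NumPy. This is a simplified toy, not Trviksha's trained system: it uses a single attention head instead of four, random made-up weights instead of learned pebble arrangements, and tiny dimensions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_blocks = 8, 5, 3   # toy sizes for illustration

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Every position scores every other position at once: no loop
    # over time, no hidden state carried from step to step.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v

def block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    x = x + self_attention(x, Wq, Wk, Wv)   # attention + residual
    x = x + np.maximum(0, x @ W1) @ W2      # feedforward + residual
    return x

# Input encoding: token embeddings combined with position encodings.
tokens = rng.normal(size=(n_tokens, d))
positions = rng.normal(size=(n_tokens, d))
x = tokens + positions

# Stack of transformer blocks; the final representations feed
# whatever output layer the task requires.
for _ in range(n_blocks):
    params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
    x = block(x, params)

print(x.shape)   # one final representation per token
```

All the "memory" lives in `self_attention`, where each position looks directly at every other position.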

The Quadratic Concern
Blortz: I said I would come back to the cost, and I am coming back to it.
Four heads of attention, each computing relevance scores between every pair of tokens. For a two-thousand-token contract: two thousand times two thousand times four heads. Sixteen million computations per block. Three blocks: forty-eight million.
Blortz: For Phlontjek's contracts, this is manageable. But the chieftain has mentioned wanting to process his complete legal code. That is roughly five hundred thousand tokens. Five hundred thousand times five hundred thousand is two hundred and fifty billion. Per head. Per block.
Trviksha: That is... not manageable.
Phlontjek: My contracts are only a few thousand tokens. For me, this works. But if you want to process longer documents, you have a problem.
Trviksha: The cost of attention grows with the square of the sequence length. Doubling the document length quadruples the cost. For short and medium documents, the cost is acceptable and the benefit — parallel processing, direct long-range connections — is enormous. For very long documents, the cost becomes prohibitive.
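Blortz's arithmetic can be checked directly. A short sketch, using the story's numbers (four heads, three blocks, one relevance score per pair of tokens):

```python
def attention_cost(seq_len: int, heads: int = 4, blocks: int = 3) -> int:
    # One relevance score per (token, token) pair, per head, per block.
    return seq_len * seq_len * heads * blocks

print(attention_cost(2_000))     # 48,000,000: Phlontjek's contracts
print(attention_cost(4_000))     # double the length, quadruple the cost
print(attention_cost(500_000))   # the chieftain's complete legal code
```

Doubling the sequence length quadruples the cost, exactly as Trviksha says: the quadratic term dominates everything else.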
Glagalbagal: So you have traded one limitation for another. The recurrent network had a memory limitation — it forgot early information. The transformer has a cost limitation — it cannot afford to look at everything when "everything" is too large.
Trviksha: Yes. The transformer can see everything within its window. The recurrent network could process any length but remembered poorly. Neither is perfect.
Blortz: Are there ways to reduce the cost? Perhaps the network does not need to attend to every position — perhaps it could attend to a selected subset.
Trviksha: Perhaps. But that is a problem for another day. For Phlontjek's contracts and Vrothjelka's weather sequences — documents of a few thousand tokens — the transformer is the clear winner. It is faster, more accurate on long-range dependencies, and far easier to train.
Phlontjek, who had been listening to the technical discussion with the patience of a man who had arbitrated thousand-clause trade disputes between feuding port cities, nodded.
Phlontjek: I do not need to process five hundred thousand tokens. I need to process contracts. Your system processes contracts well. That is sufficient.
Trviksha: For now.