Part 32 of 58
The Parallel Advantage
By Madhav Kaushish · Ages 12+
The transformer blocks had proven themselves on Phlontjek's contracts. But Trviksha had been running them alongside the recurrent network — using the recurrent network to process the sequence and the transformer blocks to add long-range connections. She wondered: did she still need the recurrent part at all?
The Experiment
She removed the recurrent network entirely. The input tokens — with their position encodings — fed directly into a stack of transformer blocks. No sequential processing. No hidden state carried from step to step. No loop.
Every token in the contract was processed simultaneously by the first transformer block. The results fed simultaneously into the second block. Then the third. At no point did the network process tokens one at a time.
Blortz: How long does it take to process a two-thousand-token contract?
Trviksha measured both approaches:
- Recurrent network: Processed two thousand tokens one at a time. Each token had to wait for the previous one to finish. Total time: proportional to the length of the sequence. Two thousand steps.
- Transformer: Processed all two thousand tokens simultaneously. The attention mechanism computed all relevance scores at once. Total time: proportional to the number of blocks (three), not the sequence length. Three steps, regardless of whether the contract was twenty tokens or two thousand.
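Trviksha's measurement can be sketched in a few lines of Python. This is a toy model of the timing argument, not her actual pebble machinery: it counts sequential steps only, assuming there are enough parallel workers (velociraptors) that everything within one transformer block finishes at once.

```python
def recurrent_steps(seq_len: int) -> int:
    # A recurrent network visits tokens one at a time: each step
    # must wait for the previous hidden state before it can begin.
    steps = 0
    for _ in range(seq_len):
        steps += 1  # one sequential step per token
    return steps

def transformer_steps(num_blocks: int) -> int:
    # With enough parallel workers, each block handles every token
    # at once, so the sequential depth is just the number of blocks.
    return num_blocks

print(recurrent_steps(2000))   # 2000 sequential steps for the contract
print(transformer_steps(3))    # 3 sequential steps, at any length
```

The transformer still does more arithmetic in total; it just does almost none of it in sequence.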
Blortz: The transformer is faster because it processes everything in parallel? Even though it does more total computation?
Trviksha: Exactly. The recurrent network does less total computation — but it does it sequentially, one step at a time. The transformer does more total computation — four million relevance scores per head, in every block — but it does all of it at once. With enough velociraptors working in parallel, the transformer finishes far faster.
Drysska: I always did prefer working alongside other velociraptors rather than waiting in line.
The Full Architecture
With the recurrent network removed, the complete architecture was:
- Input encoding: Each token is converted to a pebble arrangement (embedding) and combined with a position encoding.
- Transformer blocks: Multiple blocks, each consisting of multi-head self-attention followed by a feedforward network, with residual connections.
- Output layer: The final representations are used for whatever task is required — answering questions, classifying clauses, predicting the next token.
No recurrence. No sequential processing. No hidden state carrying forward through time. All the "memory" came from attention — each position's ability to look directly at any other position.
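The three-part architecture can be sketched in Python with NumPy. This is a simplified toy, not Trviksha's trained system: it uses a single attention head instead of four, random made-up weights instead of learned pebble arrangements, and tiny dimensions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_blocks = 8, 5, 3   # toy sizes for illustration

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Every position scores every other position at once: no loop
    # over time, no hidden state carried from step to step.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v

def block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    x = x + self_attention(x, Wq, Wk, Wv)   # attention + residual
    x = x + np.maximum(0, x @ W1) @ W2      # feedforward + residual
    return x

# Input encoding: token embeddings combined with position encodings.
tokens = rng.normal(size=(n_tokens, d))
positions = rng.normal(size=(n_tokens, d))
x = tokens + positions

# Stack of transformer blocks; the final representations feed
# whatever output layer the task requires.
for _ in range(n_blocks):
    params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
    x = block(x, params)

print(x.shape)   # one final representation per token
```

All the "memory" lives in `self_attention`, where each position looks directly at every other position.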

The Quadratic Concern
Blortz: I said I would come back to the cost, and I am coming back to it.
Four heads of attention, each computing relevance scores between every pair of tokens. For a two-thousand-token contract: two thousand times two thousand times four heads. Sixteen million computations per block. Three blocks: forty-eight million.
Blortz: For Phlontjek's contracts, this is manageable. But the chieftain has mentioned wanting to process his complete legal code. That is roughly five hundred thousand tokens. Five hundred thousand times five hundred thousand is two hundred and fifty billion. Per head. Per block.
Trviksha: That is... not manageable.
Phlontjek: My contracts are only a few thousand tokens. For me, this works. But if you want to process longer documents, you have a problem.
Trviksha: The cost of attention grows with the square of the sequence length. Doubling the document length quadruples the cost. For short and medium documents, the cost is acceptable and the benefit — parallel processing, direct long-range connections — is enormous. For very long documents, the cost becomes prohibitive.
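Blortz's arithmetic can be checked directly. A short sketch, using the story's numbers (four heads, three blocks, one relevance score per pair of tokens):

```python
def attention_cost(seq_len: int, heads: int = 4, blocks: int = 3) -> int:
    # One relevance score per (token, token) pair, per head, per block.
    return seq_len * seq_len * heads * blocks

print(attention_cost(2_000))     # 48,000,000: Phlontjek's contracts
print(attention_cost(4_000))     # double the length, quadruple the cost
print(attention_cost(500_000))   # the chieftain's complete legal code
```

Doubling the sequence length quadruples the cost, exactly as Trviksha says: the quadratic term dominates everything else.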
Glagalbagal: So you have traded one limitation for another. The recurrent network had a memory limitation — it forgot early information. The transformer has a cost limitation — it cannot afford to look at everything when "everything" is too large.
Trviksha: Yes. The transformer can see everything within its window. The recurrent network could process any length but remembered poorly. Neither is perfect.
Blortz: Are there ways to reduce the cost? Perhaps the network does not need to attend to every position — perhaps it could attend to a selected subset.
Trviksha: Perhaps. But that is a problem for another day. For Phlontjek's contracts and Vrothjelka's weather sequences — documents of a few thousand tokens — the transformer is the clear winner. It is faster, more accurate on long-range dependencies, and far easier to train.
Phlontjek, who had been listening to the technical discussion with the patience of a man who had arbitrated thousand-clause trade disputes between feuding port cities, nodded.
Phlontjek: I do not need to process five hundred thousand tokens. I need to process contracts. Your system processes contracts well. That is sufficient.
Trviksha: For now.