Part 31 of 58
Where Am I?
By Madhav Kaushish · Ages 12+
The multi-head attention system effectively tracked cross-references across long contracts. But Trviksha noticed a strange failure mode.
The Position Problem
Phlontjek: I gave your system two contracts. Contract A has the payment terms in Clause 2 and the delivery terms in Clause 50. Contract B has the same clauses but in reverse order — delivery in Clause 2, payment in Clause 50. Your system gave the same answer for both.
Trviksha: That should not happen. The clause order matters — in Sonhlagot contract law, earlier clauses take precedence over later ones. Clause 2 overrides Clause 50 if they conflict.
She examined the attention mechanism. The problem was fundamental. In the recurrent network, position was implicit — the network processed tokens left to right, so it "knew" that token 50 came before token 100 simply from the order of processing. But the attention mechanism processed all tokens simultaneously. There was no left-to-right. Every token was on the bulletin board at the same time, with no inherent ordering.
Blortz: The attention mechanism treats the contract as a bag of clauses. It knows which clauses are related to which, but it does not know which comes first. Swapping Clause 2 and Clause 50 changes nothing, because the mechanism has no concept of position.
Trviksha: I need to tell the network where each token is. Position is not in the data — I need to add it.
Adding Position
She assigned each position a unique encoding — a set of numbers that represented "I am the third token" or "I am the five-hundredth token." These position encodings were added to each token's representation before the attention computation.
Trviksha: Token at position 3 has its word encoding plus a position encoding for "position 3." Token at position 500 has its word encoding plus a position encoding for "position 500." The attention mechanism now compares tokens that include position information.
Drysska: What does a position encoding look like?
Trviksha: A set of numbers — the same size as the token's word encoding. I use a pattern of values that are unique for each position and that capture the structure of positioning. Nearby positions have similar encodings. Distant positions have different encodings.
The position encodings were not learned from data — Trviksha designed them using a mathematical pattern that guaranteed uniqueness and smoothness. Each position got a distinct code, and the codes varied smoothly so that positions 49 and 50 had similar encodings while positions 3 and 500 had very different ones.
With position encodings added, the network could distinguish "Clause 2 about payments" from "Clause 50 about payments" — because the same words at different positions had different combined representations. The attention weights could now favour earlier clauses over later ones when the query demanded it.
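A designed encoding with exactly these properties — unique per position, similar for nearby positions — can be sketched using sine and cosine waves of geometrically decreasing frequency (the pattern used in real transformer models). The dimension size and function names below are illustrative:

```python
import math

def position_encoding(pos, dim=8):
    """Return a designed (not learned) encoding for one position.

    Pairs of dimensions hold sine and cosine waves whose frequency
    falls geometrically with the dimension index, so every position
    gets a unique code and nearby positions get similar codes.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

def distance(a, b):
    """How different are the encodings of two positions?"""
    return math.dist(position_encoding(a), position_encoding(b))

# Positions 49 and 50 get similar codes; 3 and 500 get very different ones.
print(distance(49, 50) < distance(3, 500))   # → True
```

In use, each position's encoding would simply be added, number by number, to the token's word encoding before attention — so the same word at two positions produces two different combined representations.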

The Block
With multi-head attention and position encodings in place, Trviksha assembled the full processing unit — what she called a block.
A block consisted of two steps:
Step 1: Attention. Each token attended to every other token using multi-head attention, producing an updated representation that incorporated relevant information from across the sequence. The original token representation was added back to the result — a shortcut connection that preserved the original information even if the attention added nothing useful.
Step 2: Processing. The updated representation passed through a small feedforward network — two hidden layers — that further transformed it. Again, the input was added back to the output.
Trviksha: The attention step is about communication — each token gathers information from other tokens. The processing step is about computation — each token digests the gathered information independently. Communication, then computation. Two different operations, each with its own role.
Blortz: Why add the input back to the output? You did not do that before.
Trviksha: If the attention or processing step makes a mistake — focuses on irrelevant tokens, or transforms the data poorly — the shortcut connection ensures that the original information is not lost. The network learns to add useful modifications to the original, rather than replacing it entirely.
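The two-step block — communication, then computation, each with a shortcut connection — can be sketched as follows. This is a minimal single-head version with the learned query/key/value projections omitted and only one hidden layer in the feedforward step; all names and sizes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(tokens):
    """Communication: every token gathers from every other token.
    (Projections omitted for brevity; real blocks learn them.)"""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def feedforward(tokens, w1, w2):
    """Computation: each token is transformed independently."""
    return np.maximum(tokens @ w1, 0) @ w2

def block(tokens, w1, w2):
    tokens = tokens + attention(tokens)            # shortcut: add input back
    tokens = tokens + feedforward(tokens, w1, w2)  # shortcut again
    return tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))        # 5 tokens, 4 numbers each
w1 = rng.normal(size=(4, 8)) * 0.1
w2 = rng.normal(size=(8, 4)) * 0.1
out = block(tokens, w1, w2)
print(out.shape)                        # → (5, 4)
```

Note how the shortcut works: if the feedforward weights were all zero, the block would still pass through the original tokens plus their attention result — nothing is lost.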
She stacked multiple blocks. Each block took the output of the previous block as input and added another round of attention and processing. Three blocks, each refining the representations further. By the third block, each token's representation encoded not just the token itself, but its relationships to every other token, as refined through three rounds of attention and processing.
Trviksha: One block captures direct relationships — Clause 112 attends to Clause 3. Two blocks capture indirect relationships — Clause 112 attends to Clause 47, which attended to Clause 3, propagating information through two hops. Three blocks capture even more complex dependencies.
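The multi-hop claim can be checked with a toy attention pattern. The matrix below is illustrative, not learned: clause 112 attends to clause 47, and clause 47 attends to clause 3 (indices 2, 1, and 0 here). One round of attention moves information one hop; a second round moves it two hops:

```python
import numpy as np

# Rows say where each clause looks: clause 3 keeps its own info,
# clause 47 reads clause 3, clause 112 reads clause 47.
attn = np.array([[1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5]])
signal = np.array([[1.0], [0.0], [0.0]])   # info starts in clause 3

one_hop = attn @ signal    # after one round of attention
two_hops = attn @ one_hop  # after a second round

# After one round, clause 112 has none of clause 3's information;
# after two rounds, some of it has propagated through clause 47.
print(one_hop[2, 0], two_hops[2, 0])   # → 0.0 0.25
```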
The Result
On Phlontjek's contracts, the three-block architecture with position encodings achieved 89% accuracy on long-contract questions — up from 83% without position encodings and multiple blocks. The system now correctly handled precedence rules, temporal ordering of clauses, and complex cross-references that required multi-hop reasoning.
Phlontjek: It knows that Clause 2 overrides Clause 50. It knows that an amendment in Clause 89 modifies the terms set in Clause 12. It traces chains of references across the entire document.