Part 31 of 58
Where Am I?
By Madhav Kaushish · Ages 12+
The multi-head attention system effectively tracked cross-references across long contracts. But Trviksha noticed a strange failure mode.
The Position Problem
Phlontjek: I gave your system two contracts. Contract A has the payment terms in Clause 2 and the delivery terms in Clause 50. Contract B has the same clauses but in reverse order — delivery in Clause 2, payment in Clause 50. Your system gave the same answer for both.
Trviksha: That should not happen. The clause order matters — in Sonhlagot contract law, earlier clauses take precedence over later ones. Clause 2 overrides Clause 50 if they conflict.
She examined the attention mechanism. The problem was fundamental. In the recurrent network, position was implicit — the network processed tokens left to right, so it "knew" that token 50 came before token 100 simply from the order of processing. But the attention mechanism processed all tokens simultaneously. There was no left-to-right. Every token was on the bulletin board at the same time, with no inherent ordering.
Blortz: The attention mechanism treats the contract as a bag of clauses. It knows which clauses are related to which, but it does not know which comes first. Swapping Clause 2 and Clause 50 changes nothing, because the mechanism has no concept of position.
Trviksha: I need to tell the network where each token is. Position is not in the data — I need to add it.
Adding Position
She assigned each position a unique encoding — a set of numbers that represented "I am the third token" or "I am the five-hundredth token." These position encodings were added to each token's representation before the attention computation.
Trviksha: Token at position 3 has its word encoding plus a position encoding for "position 3." Token at position 500 has its word encoding plus a position encoding for "position 500." The attention mechanism now compares tokens that include position information.
Drysska: What does a position encoding look like?
Trviksha: A set of numbers — the same size as the token's word encoding. I use a pattern of values that are unique for each position and that capture the structure of positioning. Nearby positions have similar encodings. Distant positions have different encodings.
The position encodings were not learned from data — Trviksha designed them using a mathematical pattern that guaranteed uniqueness and smoothness. Each position got a distinct code, and the codes varied smoothly so that positions 49 and 50 had similar encodings while positions 3 and 500 had very different ones.
With position encodings added, the network could distinguish "Clause 2 about payments" from "Clause 50 about payments" — because the same words at different positions had different combined representations. The attention weights could now favour earlier clauses over later ones when the query demanded it.
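A designed encoding with exactly these properties — unique per position, similar for nearby positions — can be sketched using sine and cosine waves of geometrically decreasing frequency (the pattern used in real transformer models). The dimension size and function names below are illustrative:

```python
import math

def position_encoding(pos, dim=8):
    """Return a designed (not learned) encoding for one position.

    Pairs of dimensions hold sine and cosine waves whose frequency
    falls geometrically with the dimension index, so every position
    gets a unique code and nearby positions get similar codes.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

def distance(a, b):
    """How different are the encodings of two positions?"""
    return math.dist(position_encoding(a), position_encoding(b))

# Positions 49 and 50 get similar codes; 3 and 500 get very different ones.
print(distance(49, 50) < distance(3, 500))   # → True
```

In use, each position's encoding would simply be added, number by number, to the token's word encoding before attention — so the same word at two positions produces two different combined representations.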

The Block
With multi-head attention and position encodings in place, Trviksha assembled the full processing unit — what she called a block.
A block consisted of two steps:
Step 1: Attention. Each token attended to every other token using multi-head attention, producing an updated representation that incorporated relevant information from across the sequence. The original token representation was added back to the result — a shortcut connection that preserved the original information even if the attention added nothing useful.
Step 2: Processing. The updated representation passed through a small feedforward network — two hidden layers — that further transformed it. Again, the input was added back to the output.
Trviksha: The attention step is about communication — each token gathers information from other tokens. The processing step is about computation — each token digests the gathered information independently. Communication, then computation. Two different operations, each with its own role.
Blortz: Why add the input back to the output? You did not do that before.
Trviksha: If the attention or processing step makes a mistake — focuses on irrelevant tokens, or transforms the data poorly — the shortcut connection ensures that the original information is not lost. The network learns to add useful modifications to the original, rather than replacing it entirely.
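The two-step block — communication, then computation, each with a shortcut connection — can be sketched as follows. This is a minimal single-head version with the learned query/key/value projections omitted and only one hidden layer in the feedforward step; all names and sizes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(tokens):
    """Communication: every token gathers from every other token.
    (Projections omitted for brevity; real blocks learn them.)"""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def feedforward(tokens, w1, w2):
    """Computation: each token is transformed independently."""
    return np.maximum(tokens @ w1, 0) @ w2

def block(tokens, w1, w2):
    tokens = tokens + attention(tokens)            # shortcut: add input back
    tokens = tokens + feedforward(tokens, w1, w2)  # shortcut again
    return tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))        # 5 tokens, 4 numbers each
w1 = rng.normal(size=(4, 8)) * 0.1
w2 = rng.normal(size=(8, 4)) * 0.1
out = block(tokens, w1, w2)
print(out.shape)                        # → (5, 4)
```

Note how the shortcut works: if the feedforward weights were all zero, the block would still pass through the original tokens plus their attention result — nothing is lost.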
She stacked multiple blocks. Each block took the output of the previous block as input and added another round of attention and processing. Three blocks, each refining the representations further. By the third block, each token's representation encoded not just the token itself, but its relationships to every other token, as refined through three rounds of attention and processing.
Trviksha: One block captures direct relationships — Clause 112 attends to Clause 3. Two blocks capture indirect relationships — Clause 112 attends to Clause 47, which attended to Clause 3, propagating information through two hops. Three blocks capture even more complex dependencies.
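The multi-hop claim can be checked with a toy attention pattern. The matrix below is illustrative, not learned: clause 112 attends to clause 47, and clause 47 attends to clause 3 (indices 2, 1, and 0 here). One round of attention moves information one hop; a second round moves it two hops:

```python
import numpy as np

# Rows say where each clause looks: clause 3 keeps its own info,
# clause 47 reads clause 3, clause 112 reads clause 47.
attn = np.array([[1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5]])
signal = np.array([[1.0], [0.0], [0.0]])   # info starts in clause 3

one_hop = attn @ signal    # after one round of attention
two_hops = attn @ one_hop  # after a second round

# After one round, clause 112 has none of clause 3's information;
# after two rounds, some of it has propagated through clause 47.
print(one_hop[2, 0], two_hops[2, 0])   # → 0.0 0.25
```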
The Result
On Phlontjek's contracts, the three-block architecture with position encodings achieved 89% accuracy on long-contract questions — up from 83% without position encodings and multiple blocks. The system now correctly handled precedence rules, temporal ordering of clauses, and complex cross-references that required multi-hop reasoning.
Phlontjek: It knows that Clause 2 overrides Clause 50. It knows that an amendment in Clause 89 modifies the terms set in Clause 12. It traces chains of references across the entire document.