Part 30 of 58
Many Eyes
By Madhav Kaushish · Ages 12+
The query-key-value attention worked well for Phlontjek's contract questions. But some questions required tracking multiple kinds of relationships simultaneously.
The Compound Question
Phlontjek: Here is my test case. Clause 87: "If the quantity specified in Clause 12 is not delivered to the port named in Clause 34 by the date established in Clause 3, the penalty described in Clause 91 applies." I need your system to answer: what is the penalty, and when does it trigger?
The answer required tracking four separate cross-references — to Clause 12 (quantity), Clause 34 (location), Clause 3 (date), and Clause 91 (penalty). Each reference was a different kind of relationship.
Trviksha's single attention mechanism had one set of queries, keys, and values. It produced one set of attention weights per position — one pattern of "what is relevant." For Clause 87, the attention pattern would have to simultaneously capture references to quantities, locations, dates, and penalties. In practice, it compromised: the attention spread across all four references, heavily weighting the most recent one and partially losing the others.
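The compromise can be seen in a small sketch. The clause list and the relevance scores below are made-up numbers for illustration, not values from Trviksha's system; the only assumption is that the last-mentioned reference (Clause 91) scores somewhat higher than the other three. Because a single head turns its scores into one set of weights that must sum to 1, the four references have to share that single budget.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical clauses that Clause 87's single query is compared against.
clauses = ["Clause 3", "Clause 12", "Clause 34", "Clause 50", "Clause 60", "Clause 91"]

# Made-up scores: four clauses are relevant, the last-mentioned one scores highest.
scores = [2.0, 2.0, 2.0, 0.0, 0.0, 3.0]

weights = softmax(scores)
for name, w in zip(clauses, weights):
    print(f"{name}: {w:.2f}")
```

With these numbers, Clause 91 takes the largest share and each of the other three references is left with well under a quarter of the attention.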
Trviksha: One attention pattern can only capture one kind of relevance at a time. When Clause 87 needs to attend to four different clauses for four different reasons, a single pattern blurs them together.
Blortz: It is like having one eye that must look in four directions simultaneously. The eye compromises and looks roughly at the centre, missing all four targets.
Multiple Heads
The solution was to run multiple attention patterns in parallel. Instead of one set of query-key-value weights, Trviksha created four sets — four independent "heads," each with its own learned weights.
Head 1 learned to track legal cross-references — which clauses referenced which other clauses. Head 2 learned to track quantities — amounts, volumes, weights mentioned throughout the contract. Head 3 learned to track temporal references — dates, deadlines, durations. Head 4 learned to track spatial references — locations, ports, regions.
Each head computed its own attention weights independently. Head 1 might cause Clause 87 to attend heavily to Clause 91 (the penalty reference). Head 2 might cause Clause 87 to attend to Clause 12 (the quantity). Head 3 might attend to Clause 3 (the date). Head 4 to Clause 34 (the location).
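A toy version of this idea, under heavy assumptions: each clause is represented by a hand-picked four-number vector marking what kind of thing it defines, and each head is represented only by its own query vector for Clause 87. In a real system the queries and keys come from learned weight matrices; the vectors here are hypothetical and chosen by hand so that the pattern is easy to see.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query against all keys."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])

clauses = ["Clause 3", "Clause 12", "Clause 34", "Clause 91"]

# Hand-picked keys (not learned): one slot each for date, quantity, location, penalty.
keys = [
    [1, 0, 0, 0],  # Clause 3: date
    [0, 1, 0, 0],  # Clause 12: quantity
    [0, 0, 1, 0],  # Clause 34: location
    [0, 0, 0, 1],  # Clause 91: penalty
]

# Hypothetical per-head queries for Clause 87; each head "asks" for something different.
head_queries = {
    "Head 1": [0, 0, 0, 6],  # looks for the penalty cross-reference
    "Head 2": [0, 6, 0, 0],  # looks for the quantity
    "Head 3": [6, 0, 0, 0],  # looks for the date
    "Head 4": [0, 0, 6, 0],  # looks for the location
}

# Each head gets its own full set of weights, computed independently of the others.
head_weights = {name: attention_weights(q, keys) for name, q in head_queries.items()}

for name, w in head_weights.items():
    top = clauses[w.index(max(w))]
    print(name, [round(x, 2) for x in w], "->", top)
```

Each head's weights still sum to 1, but because there are four separate budgets, each head can spend most of its own on a different clause.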
Trviksha: Four heads, four sets of eyes, each looking for a different kind of relationship. Each head specialises in a different aspect of the contract structure.
Drysska: How do you combine them?
Trviksha: Each head produces a result — a weighted combination of values, tailored to that head's perspective. I concatenate the four results and pass them through a linear transformation that combines them into a single output.
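The combining step Trviksha describes might look roughly like this for a single position. The head outputs and the matrix `W_o` are hypothetical placeholders: in a trained system the head outputs come from the attention computation and `W_o` is a dense matrix learned along with everything else. Here `W_o` simply averages each head's two numbers so the arithmetic is easy to follow.

```python
def concat(head_outputs):
    """Join the per-head result vectors end to end into one long vector."""
    combined = []
    for h in head_outputs:
        combined.extend(h)
    return combined

def linear(matrix, vector):
    """Apply a linear transformation: one output number per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical results from four heads, two numbers each.
head_outputs = [
    [1.0, 0.0],  # Head 1
    [0.0, 2.0],  # Head 2
    [3.0, 1.0],  # Head 3
    [0.5, 0.5],  # Head 4
]

combined = concat(head_outputs)  # 4 heads x 2 numbers = 8 numbers

# Placeholder output weights (4 rows x 8 columns); learned and dense in practice.
W_o = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]

output = linear(W_o, combined)
print(len(combined), output)
```

The point of the final linear transformation is that every output number can draw on every head, so the system can learn how much to trust each head's view when forming the single combined result.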

What the Heads Learned
Trviksha did not tell the heads what to specialise in. She created four heads with independent weights and trained the whole system end-to-end. The specialisation emerged from the data.
Blortz: You did not assign Head 1 to legal references and Head 2 to quantities?
Trviksha: No. I created four heads and let them train. Each head's query-key-value weights adjusted to capture whatever patterns improved the final predictions. The specialisations emerged because different kinds of relationships required different matching patterns, and it was easier for each head to specialise than for all heads to be generalists.
Blortz: They specialised because specialisation reduced error.
Trviksha: Exactly. A head that tried to track both temporal and spatial references would do a mediocre job at both. A head that focused on temporal references could learn sharper, more specific matching patterns. The training process pushed each head toward specialisation naturally.
In practice, the specialisations were not perfectly clean. Head 1 mostly tracked legal cross-references but also picked up some quantity patterns. Head 3 mostly tracked dates but also noticed certain recurring phrases. The heads overlapped somewhat, but each had a clear primary focus.
Phlontjek: And for my compound question?
Trviksha ran the system on his test case. The multi-head attention correctly identified all four cross-references in Clause 87 and assembled the answer: "If the quantity (400 bushels, Clause 12) is not delivered to Port Xvelsk (Clause 34) by fourteen days after harvest (Clause 3), a penalty of twenty percent of contract value applies (Clause 91)."
Phlontjek: That is correct. Every reference resolved, every detail right.
Trviksha: Four eyes are better than one — when there are four different things to look at.