Part 52 of 58
The Memory Bottleneck
By Madhav Kaushish · Ages 12+
With sparse attention and mixture of experts, the model could handle long documents and large networks efficiently. But Blortz, ever the operations manager, identified one more bottleneck.
The Transportation Problem
Blortz: I have been timing the velociraptors. The actual computation — multiplying pebbles, adding sums, comparing values — takes a quarter of the total time. The remaining three-quarters is spent carrying pebbles between workstations.
Each velociraptor had a small working basket on its desk — fast to access, but limited in size. It could hold the pebbles needed for the current computation. The bulk of the storage — the trained weights, the attention scores, the intermediate values — sat on shelves behind the workstation. These shelves were large but slow to access. Every time a velociraptor needed data that was not in its working basket, it had to walk to the shelves, retrieve the pebbles, carry them back, and load them into the basket.
Trviksha: The computation is fast. The data movement is slow. The velociraptors spend most of their time walking to shelves, not thinking.
Blortz: The attention mechanism is the worst offender. For a sequence of ten thousand tokens with sixty-four-dimensional embeddings, the attention matrix is ten thousand by ten thousand — one hundred million entries. That matrix does not fit in the working basket. The velociraptor must write it to the shelves, then read it back piece by piece when computing the weighted sums.
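Blortz's arithmetic can be checked with a rough back-of-the-envelope sketch. The token count matches the story; the bytes-per-score and the basket size below are illustrative assumptions, not figures from the dialogue:

```python
# Why the full attention matrix overflows the working basket.
tokens = 10_000
entries = tokens * tokens        # 100,000,000 scores, as Blortz says

bytes_per_score = 4              # assumption: one four-byte number per score
bytes_needed = entries * bytes_per_score   # roughly 400 million bytes

basket_bytes = 200 * 1024        # assumption: a 200 KB working basket

# The matrix is thousands of times too large for the basket,
# so it must live on the slow shelves instead.
times_too_large = bytes_needed // basket_bytes
```

Whatever the exact basket size, the conclusion is the same: the full matrix cannot stay in fast storage, so every pass over it means trips to the shelves.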
The Chunked Approach
Trviksha restructured the attention computation. Instead of computing the full ten-thousand-by-ten-thousand attention matrix and storing it, she computed attention in small chunks — blocks of tokens at a time — that fit entirely in the working basket.
Trviksha: Take the first two hundred and fifty tokens. Compute their attention scores against the first two hundred and fifty tokens. That is a two-hundred-and-fifty-by-two-hundred-and-fifty block — sixty-two thousand five hundred entries — which fits in the working basket. Complete the computation for that block without touching the shelves. Then move to the next block: the first two hundred and fifty tokens against tokens two hundred and fifty-one through five hundred. Complete that block. Continue until all blocks are done.
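Trviksha's schedule of blocks can be sketched as a pair of nested loops over tile positions. The sizes match her description; the variable names are illustrative:

```python
seq_len, block = 10_000, 250      # ten thousand tokens, blocks of 250
n_blocks = seq_len // block       # 40 blocks along each axis

tiles = []
for qb in range(n_blocks):        # which block of query tokens
    for kb in range(n_blocks):    # which block of key tokens
        # each tile covers a 250-by-250 patch of the full score matrix
        tiles.append((qb * block, kb * block))

tile_entries = block * block      # 62,500 entries: fits in the basket
# the first tile is tokens 0-249 against tokens 0-249; the next is
# tokens 0-249 against tokens 250-499, exactly as in the dialogue
```

Forty blocks along each axis gives 1,600 tiles in total — many small computations instead of one enormous one, each fitting comfortably in the basket.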
Drysska: But you cannot compute the final attention weights until you have seen all the blocks. The softmax requires knowing all the scores for a token before computing the weights.
Trviksha: That is the clever part. I keep a running tally. For each chunk, I compute the local scores and track two statistics: the maximum score seen so far, and the running sum of exponentials. When a new chunk contains a larger maximum, I rescale the old running sum so it is measured against the new maximum. At the end, the result is mathematically identical to computing the full matrix — but the full matrix was never materialised. It only ever existed in basket-sized chunks.
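Trviksha's running tally can be written down directly. A minimal sketch for a single query token, assuming its scores and the corresponding values arrive one chunk at a time (the function and variable names are illustrative):

```python
import math

def online_softmax_attention(score_chunks, value_chunks):
    """Combine softmax-weighted values chunk by chunk, never holding
    all the scores at once — Trviksha's running tally."""
    m = float("-inf")   # maximum score seen so far
    s = 0.0             # running sum of exponentials
    acc = 0.0           # running exponent-weighted sum of values
    for scores, values in zip(score_chunks, value_chunks):
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescale old tallies to the new maximum
        s = s * scale + sum(math.exp(x - m_new) for x in scores)
        acc = acc * scale + sum(math.exp(x - m_new) * v
                                for x, v in zip(scores, values))
        m = m_new
    return acc / s

# chunked result for one query token over six key tokens
chunks_s = [[2.0, 1.0], [0.5, 3.0], [-1.0, 0.0]]
chunks_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
chunked = online_softmax_attention(chunks_s, chunks_v)

# the same answer, computed the "full matrix" way in one pass
flat_s = [x for c in chunks_s for x in c]
flat_v = [v for c in chunks_v for v in c]
m = max(flat_s)
exps = [math.exp(x - m) for x in flat_s]
full = sum(e * v for e, v in zip(exps, flat_v)) / sum(exps)
```

The two answers agree to machine precision, which is Trviksha's point: the chunked computation is not an approximation, just a different order of the same arithmetic.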

The Speed
The result was dramatic. The total number of arithmetic operations was the same — the chunked approach did not skip any computation. But the number of shelf-trips dropped by an order of magnitude. The working basket held each chunk comfortably, the computation proceeded without interruption, and the overall speed improved by a factor of three to five.
Blortz: Same arithmetic. Same result. Three times faster. Because the pebbles stayed in the basket instead of traveling to the shelves and back.
Trviksha: The algorithm was not inefficient in the mathematical sense — the operations were the same. It was inefficient in the physical sense — the data movement was the bottleneck, not the computation. By reshaping the computation to match the physical constraints of the system — small baskets, large shelves — I eliminated the bottleneck.
Glagalbagal: You changed the algorithm to fit the workstation, not the workstation to fit the algorithm.
Trviksha: Exactly. The workstation has fixed physical properties — the basket is a certain size, the shelves are a certain distance away. An algorithm designed without considering these properties wastes time on transportation. An algorithm designed around these properties keeps the pebbles where they are needed.
The Principle
This was a different kind of optimisation from anything Trviksha had done before. She had optimised models — finding better weights, better architectures, better training procedures. Now she was optimising how the computation was executed on the physical system, without changing the computation itself.
Trviksha: The mathematics says: compute this attention matrix. There are many ways to compute it that give the same result. Some ways touch the shelves constantly. Some ways keep everything in the basket. The mathematical result is identical. The physical cost is vastly different.
Blortz: Two algorithms that compute the same answer at the same speed in theory but at very different speeds in practice. The difference is not mathematical — it is physical.
Trviksha: And for the problems we are now solving — millions of tokens, billions of operations — the physical constraints dominate. The best algorithm is not the one with the fewest operations. It is the one that best respects the physical realities of the system running it.