# The right context window for the right job (caveman version)
Pick the smallest window that fits the work. Not the biggest the AI brand claims.
## The Honest Context Rule
- The advertised number lies. Most AIs only think clearly up to ~65% of their advertised window.
- Every extra token costs memory, money, and milliseconds.
- Facts in the middle of a long window get ignored. “Lost in the middle.”
Fastest, cheapest, most accurate setting is rarely the biggest one.
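The ~65% rule turns into a one-line planning helper. A minimal sketch; the function name and example sizes are my own, only the 0.65 factor comes from the rule above:

```python
def effective_window(advertised_tokens: int, usable_fraction: float = 0.65) -> int:
    """Plan around the fraction of the window a model handles reliably (~65%)."""
    return int(advertised_tokens * usable_fraction)

# An advertised 128k window is really ~83k of dependable context.
print(effective_window(128_000))  # 83200
```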
## The research, in two lines
- Lost in the Middle (Stanford 2023): facts in the middle of a long context drop accuracy by 30%+. Beginning and end win.
- RULER (NVIDIA 2024): models advertise ~1.5× more context than they actually handle.
## Tier table — which window for which job
| Job | Window |
|---|---|
| Quick chat, single answer | 4k-8k |
| One file, no deps | 16k |
| One file + imports + test | 32k |
| Coding agent in a loop | 64k |
| Whole package or paper | 96k-128k |
| “Drop the whole codebase” | Use RAG instead |
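The table reads as a lookup from estimated job size to window tier. A sketch; the helper name and tier labels are assumptions, the thresholds come straight from the table:

```python
# Tier thresholds from the table above; labels shortened.
TIERS = [
    ("quick chat", 8_000),
    ("one file", 16_000),
    ("file + imports + test", 32_000),
    ("coding agent", 64_000),
    ("whole package or paper", 128_000),
]

def pick_window(estimated_tokens: int) -> str:
    """Return the smallest tier that fits the work, per the rule at the top."""
    for name, window in TIERS:
        if estimated_tokens <= window:
            return f"{name}: {window // 1000}k"
    return "use RAG instead"

print(pick_window(50_000))   # coding agent: 64k
print(pick_window(300_000))  # use RAG instead
```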
## Why a coding agent wants 64k
Three things eat the window at once:
- Files loaded. A package’s src/ is 30k-80k tokens.
- Reasoning. 1k-2k tokens of thinking per turn.
- Tool output. Tests, diffs, command logs.
At 32k you hit “window full” mid-session. At 64k you barely notice.
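The arithmetic behind that claim, with illustrative numbers (the per-turn figures are assumptions picked inside the ranges above, not measurements):

```python
files = 35_000              # a modest src/, low end of the 30k-80k range
turns = 10                  # an assumed session length
reasoning = 1_500 * turns   # ~1k-2k thinking tokens per turn
tool_output = 800 * turns   # tests, diffs, command logs (assumed figure)

total = files + reasoning + tool_output
print(total)  # 58000 tokens: already past 32k, still inside 64k
```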
## The cost side
| Cost | What grows |
|---|---|
| Memory | KV cache doubles when window doubles |
| Speed | First-token wait grows roughly with window² |
| Money | Paid APIs charge per input token. A 100k prompt costs ~100× a 1k one |
Hybrid Mamba-Attention models are an exception — they pay almost nothing for long context.
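To see why memory doubles when the window doubles, here is a back-of-envelope KV-cache estimate. The model shape (32 layers, 8 KV heads, head dim 128, fp16) is an assumed 7B-class config, not any specific model:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Keys + values stored per token, across all layers, in fp16 (2 bytes)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

gib = 1024 ** 3
print(kv_cache_bytes(32_768) / gib)  # 4.0 GiB at a 32k window
print(kv_cache_bytes(65_536) / gib)  # 8.0 GiB at 64k: window doubles, cache doubles
```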
## RAG vs stuffing the window
| Use RAG when | Stuff into window when |
|---|---|
| Source > 200k tokens | Whole source fits in 64k |
| One paragraph buried in 100 pages | Need cross-document reasoning |
| Need cited sources | Creative writing or refactoring |
Rule of thumb: under 64k → paste. Over 200k → RAG.
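The rule of thumb as a decision helper. A sketch; the thresholds come from the rule, the gray-zone wording is my own:

```python
def stuff_or_rag(source_tokens: int) -> str:
    """Under 64k paste, over 200k RAG; in between, judge by the task."""
    if source_tokens < 64_000:
        return "paste into the window"
    if source_tokens > 200_000:
        return "use RAG"
    return "gray zone: RAG if you only need fragments, paste if you need it all"

print(stuff_or_rag(40_000))   # paste into the window
print(stuff_or_rag(500_000))  # use RAG
```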
## What to do
- Daily chat? 8k.
- One file? 32k.
- Coding agent? 64k.
- Small book? 96k.
- Bigger? Switch to RAG.
- Always put important context at the START or END of the window. Never in the middle.
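The start-or-end rule can be baked into how a prompt is assembled. A minimal sketch; the helper and its layout are my own, only the placement idea comes from the rule above:

```python
def build_prompt(key_facts: str, bulk_context: str, ask: str) -> str:
    """Critical facts at the start, the ask restated at the end;
    only skimmable bulk goes in the lossy middle."""
    return "\n\n".join([
        f"Key facts:\n{key_facts}",     # start: high-attention zone
        bulk_context,                   # middle: "lost in the middle" zone
        f"Task (repeated last): {ask}", # end: high-attention zone
    ])

p = build_prompt("API returns UTC timestamps", "<long file dump>", "fix the timezone bug")
print(p.startswith("Key facts:"), p.endswith("fix the timezone bug"))  # True True
```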
## Recap
- Plan for ~65% of the advertised window.
- Mid-window facts get ignored.
- 64k is the right default for a coding agent in 2026.
- Past 200k → RAG, not stuffing.
- Hybrid Mamba models break the rule (long context becomes cheap).
- The best context window is the smallest one that holds the job.