I have spent the last several months deep in retrieval architecture — building RAG systems, reading the research, and stress-testing what actually holds up in practice. This is what I have found.
Most teams building with AI are still using the same basic recipe: embed the query, retrieve similar chunks, stuff them into a prompt, and let the model answer.
It works well enough to impress in a demo. It often works well enough to ship.
But underneath that familiar pipeline are two structural problems that scaling does not solve. The first is mathematical: dense embeddings eventually stop being able to cleanly distinguish enough documents. The second is computational: once you retrieve a pile of passages, the model spends a surprising amount of effort processing context it barely uses.
That means the common instinct — bigger model, bigger context window, more retrieved chunks — is often treating the symptom, not the architecture.
User Query
       |
       v
+--------------+
|    Embed     |  <- turn text into a vector
|    Query     |
+------+-------+
       |
+------v-------+
|    Vector    |  <- find similar chunks
|    Search    |
+------+-------+
       |
+------v-------+
|  Stuff into  |  <- hope the LLM figures it out
|    Prompt    |
+------+-------+
       |
+------v-------+
|   Generate   |  <- the actual answer
|   Response   |
+--------------+
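The same four steps fit in a few lines of code. A minimal sketch, where `embed`, `vector_search`, and `llm` are stand-ins for whatever embedding model, vector store, and LLM client a real system would plug in (all three names are placeholders, not a specific library):

```python
def naive_rag(query, embed, vector_search, llm, k=5):
    """The standard pipeline: embed, retrieve, stuff, generate."""
    query_vec = embed(query)                    # turn text into a vector
    chunks = vector_search(query_vec, top_k=k)  # find similar chunks
    context = "\n\n".join(chunks)               # stuff them into a prompt
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {query}")
    return llm(prompt)                          # hope the model figures it out
```

Every design decision discussed below is a change to one of these four lines.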
This is the pipeline most teams ship. Here is why it breaks.
The Embedding Ceiling
Here is the uncomfortable part: dense embeddings — the vectors at the heart of every standard RAG system — appear to have a real capacity limit.
This is not a training issue. It is a geometry issue. A vector with finite dimensions can only separate so many documents cleanly before different meanings start colliding in the same space. Once your collection exceeds that limit, retrieval quality degrades for structural reasons — not because you picked the wrong model or trained on the wrong data.
How hard is the ceiling? Recent research from Google DeepMind showed that 512-dimensional embeddings break down around 500K documents. Even 4,096 dimensions tap out somewhere around 250 million. These are theoretical best-case numbers. Real systems hit the wall earlier.
They built a benchmark to prove it. With 50K documents, even strong embedding models saw recall collapse. With a carefully constructed set of just 46 documents, no model achieved full recall. The architecture itself is the bottleneck.
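You can watch the same geometry at toy scale. The sketch below is a simulation with random data, not a reproduction of any benchmark: it fixes a small embedding dimension, grows the corpus, and measures how often a query's true source document is still the nearest neighbor after projection. The random projection stands in for an embedding model; the point is only that recall falls as the space gets crowded.

```python
import math
import random

def recall_at_1(n_docs, true_dim=32, embed_dim=6, n_queries=100,
                noise=0.3, seed=0):
    """Fraction of queries whose source document is still ranked first
    after squeezing true_dim-dimensional 'meanings' into embed_dim dims."""
    rnd = random.Random(seed)
    docs = [[rnd.gauss(0, 1) for _ in range(true_dim)] for _ in range(n_docs)]
    # A fixed random projection stands in for the embedding model.
    proj = [[rnd.gauss(0, 1) for _ in range(embed_dim)] for _ in range(true_dim)]

    def embed(vec):
        out = [sum(v * row[j] for v, row in zip(vec, proj))
               for j in range(embed_dim)]
        norm = math.sqrt(sum(x * x for x in out)) or 1.0
        return [x / norm for x in out]

    doc_embs = [embed(d) for d in docs]
    hits = 0
    for _ in range(n_queries):
        i = rnd.randrange(n_docs)               # pick a true source document
        query = [v + noise * rnd.gauss(0, 1) for v in docs[i]]
        q = embed(query)
        best = max(range(n_docs),
                   key=lambda j: sum(a * b for a, b in zip(q, doc_embs[j])))
        hits += best == i
    return hits / n_queries

for n in (20, 200, 2000):
    print(n, "docs -> recall@1 =", recall_at_1(n))
```

Same retrieval code, same embedding dimension; only the corpus size changes, and recall degrades anyway. That is the ceiling in miniature.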
Meanwhile, BM25 — a keyword-matching algorithm from the 1990s — does not have this problem at all. It operates in an effectively unbounded feature space. That is not an argument to go back to keyword search. It is evidence that dense embeddings alone are not a complete retrieval strategy.
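For contrast, here is the entire scoring core of BM25 in plain Python, with no model at all. Every distinct term is its own dimension, so the feature space grows with the vocabulary instead of being fixed in advance. This is a toy rendering of the standard Okapi BM25 formula, not a tuned production implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against a query (a list of tokens)
    with Okapi BM25. One dimension per distinct term: no fixed capacity."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)                              # term frequency in this doc
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Notice what it is good at: a query for an exact token like an error code scores only the documents that literally contain it, no matter how large the collection grows.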
Even at smaller scales, this changed how I think about retrieval design. Knowing the ceiling exists means planning for it early instead of discovering it when things quietly degrade.
The Decoding Tax
The second bottleneck is on the generation side. Your vector search might retrieve twenty passages. But what happens when the model actually processes them?
Researchers at Meta looked at the attention patterns across retrieved passages during generation and found something striking: the attention is almost entirely block-diagonal. Each passage mostly attends to itself. The cross-passage computation — the part where the model is supposedly synthesizing information across sources — is close to zero.
In plain English: most of the compute your LLM spends on those retrieved passages is wasted. The model is not doing deep cross-referencing. It is processing each passage in near-isolation, at full cost.
Meta's REFRAG framework exploits this by compressing passage representations before they hit the decoder, achieving up to 30x faster time-to-first-token with no loss in accuracy. The specific technique matters less than the insight: “retrieve more, stuff more” has a real and largely unnecessary compute cost.
What Production Systems Are Doing Instead
The industry is not converging on one replacement. It is converging on the idea that different queries need different retrieval strategies. Here is the progression.
First, fix recall: hybrid retrieval. Instead of betting the whole system on one retrieval method, production systems now combine semantic vector search with keyword-based BM25 in parallel, then apply a reranker to sort out what actually matters. Vector search catches meaning. BM25 catches exact product names, policy codes, and IDs that vectors routinely miss. This is the new production default — not an optimization, a baseline.
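The fusion step is simpler than it sounds. Reciprocal rank fusion (RRF) is a common way to merge the two ranked lists before a heavier reranker runs, and it needs only the ranks, not the raw scores. A sketch, with hypothetical document IDs standing in for real results from a vector index and a BM25 index:

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of doc IDs with reciprocal rank fusion.
    k=60 is the conventional constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # semantic search results (toy IDs)
bm25_hits = ["d3", "d9", "d7"]     # keyword search results (toy IDs)
fused = rrf_fuse([vector_hits, bm25_hits])
```

Documents that both retrievers agree on rise to the top; documents only one retriever found are kept rather than discarded, which is exactly the recall behavior hybrid retrieval is after.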
Then, fix representation: multi-vector retrieval. Instead of compressing an entire document into one point in space, models like ColBERT keep more of the document's local detail alive during retrieval by representing each token as its own vector. This directly attacks the embedding ceiling — more expressiveness means more capacity to distinguish documents at scale. Recent engineering work has brought the latency cost down dramatically, making this practical for production.
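The scoring rule behind this is late interaction: keep one vector per token, and at query time sum, for each query token, its best dot product against any of the document's token vectors (MaxSim). A sketch with toy 2-D vectors in place of real learned embeddings:

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token vector, take
    its best match (dot product) among the doc's token vectors, then sum."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )

# Toy 2-D "token embeddings"; a real model produces one vector per token.
query = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(0.9, 0.1), (0.1, 0.9)]   # both query tokens find a good match
doc_b = [(0.9, 0.1), (0.8, 0.2)]   # second query token matches poorly
```

A single-vector model would have averaged each document down to one point before comparing; MaxSim keeps the per-token detail alive until query time, which is where the extra distinguishing capacity comes from.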
Then, fix reasoning over relationships: GraphRAG. If the answer depends on relationships rather than similarity, you need structure, not just proximity. “Which suppliers are connected to the delayed products in customer complaints?” requires traversing a knowledge graph, not finding the most similar text chunk. Microsoft's open-source GraphRAG builds entity-relationship graphs from documents and can hit remarkable precision on these queries. The tradeoff: indexing is orders of magnitude more expensive. But for multi-hop reasoning, vector search simply cannot get there.
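The query in quotes is a two-hop traversal, which is trivial once the relationships are stored as edges. A sketch over a hand-built toy graph; the entity names are made up, and a real GraphRAG index would extract these edges from documents with an LLM rather than by hand:

```python
# Toy edges: (head, relation, tail). GraphRAG extracts these from text.
edges = [
    ("complaint-17", "mentions", "product-A"),
    ("complaint-22", "mentions", "product-B"),
    ("product-A", "status", "delayed"),
    ("product-B", "status", "on-time"),
    ("supplier-X", "supplies", "product-A"),
    ("supplier-Y", "supplies", "product-B"),
]

def neighbors(node, relation, reverse=False):
    """Follow edges forward (head -> tail) or backward (tail -> head)."""
    return {h if reverse else t for h, r, t in edges
            if r == relation and (t if reverse else h) == node}

# "Which suppliers are connected to the delayed products in complaints?"
complained_about = {t for h, r, t in edges if r == "mentions"}       # hop 1
delayed = {p for p in complained_about
           if "delayed" in neighbors(p, "status")}                   # filter
suppliers = {s for p in delayed
             for s in neighbors(p, "supplies", reverse=True)}        # hop 2
```

No similarity search could reliably land on supplier-X here, because no single text chunk contains the whole chain; the answer only exists in the joins.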
Then, fix generation waste: context compression. Tools that prune low-information tokens before they reach the LLM, or cache responses for semantically similar queries, reduce the decoding tax without changing the retrieval strategy. Not glamorous, but at production scale, this is the difference between a viable cost structure and a runaway compute bill.
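The caching half of this is easy to sketch. A semantic cache returns a stored answer when a new query embeds close enough to one it has already answered, skipping retrieval and generation entirely. The similarity threshold below is illustrative, and the embeddings would come from whatever model the rest of your pipeline uses:

```python
class SemanticCache:
    """Return a cached answer when a new query's embedding is within
    `threshold` cosine similarity of a previously answered one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    def get(self, query_emb):
        for emb, answer in self.entries:
            if self._cosine(query_emb, emb) >= self.threshold:
                return answer  # cache hit: no retrieval, no generation
        return None            # cache miss: run the full pipeline

    def put(self, query_emb, answer):
        self.entries.append((query_emb, answer))
```

A linear scan is fine for a sketch; at scale the cache lookup would itself be a small vector index. Either way, a hit costs a similarity check instead of a full retrieve-and-generate pass.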
The Real Shift
The deeper change is not any one retrieval technique. It is that retrieval is no longer being treated as a one-shot lookup step. It is becoming a reasoning process.
This is the part of the research I find most exciting.
Agentic RAG replaces the linear pipeline with a loop. The LLM evaluates whether its retrieval was good enough. If not, it reformulates the query and tries again — maybe switching from semantic search to keyword search, or drilling into a knowledge graph instead of flat text chunks. After generating an answer, a verification step checks whether the response is actually grounded in what was retrieved.
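Stripped to its skeleton, the loop looks like this. Everything here is a placeholder: `retrieve` takes a strategy name, and `judge_retrieval`, `reformulate`, and `is_grounded` would each typically be their own LLM calls in a real system:

```python
def agentic_rag(query, retrieve, judge_retrieval, reformulate, generate,
                is_grounded, strategies=("semantic", "keyword", "graph"),
                max_rounds=3):
    """Retrieval as a loop: judge the evidence, retry with a new query
    or a new strategy, and verify the answer is grounded in it."""
    for round_num in range(max_rounds):
        strategy = strategies[min(round_num, len(strategies) - 1)]
        docs = retrieve(query, strategy)
        if not judge_retrieval(query, docs):
            query = reformulate(query, docs)  # try again with a better query
            continue
        answer = generate(query, docs)
        if is_grounded(answer, docs):         # verification step
            return answer
    return None  # rounds exhausted: escalate or answer "I don't know"
```

The structural difference from the naive pipeline is the two checkpoints: one before generation (was the retrieval good enough?) and one after (is the answer actually supported?). Failing either one triggers another attempt instead of a confident wrong answer.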
This flips the failure mode. Research on agentic retrieval frameworks found that for naive RAG, about half of all failures come from retrieval — the system cannot find the right documents. With agentic RAG, retrieval failures drop dramatically. But the remaining failures shift almost entirely to reasoning errors — the system found the documents but drew the wrong conclusion.
That is a better problem to have. Retrieval failures are architectural dead ends — you cannot reason your way to an answer the system never retrieved. Reasoning failures get easier to solve with every model generation. The bottleneck shifts from “the system cannot find the answer” to “the system needs to think harder,” and the second problem improves on its own as models get better.
The Future Is a Routing Problem
The future of RAG is not a single better retriever. It is a routing problem.
Some questions need fast semantic lookup. Some need keyword precision. Some need relationship traversal across a knowledge graph. Some need an agentic loop that can recognize a bad retrieval attempt and try again.
The systems that perform well in production are no longer assuming one retrieval path is enough for every query. They are matching architecture to problem shape — classifying query complexity and dispatching to the appropriate strategy.
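A router does not need to be sophisticated to be useful. Even a few heuristic rules over the query text, falling back to the agentic loop when nothing matches cleanly, captures the idea. The categories and patterns below are purely illustrative, not a recommendation:

```python
import re

def route_query(query):
    """Dispatch a query to a retrieval strategy based on its shape.
    These rules are illustrative; production routers often use a small
    classifier or an LLM call instead of regexes."""
    if re.search(r"\b[A-Z]{2,}-?\d+\b", query):
        return "keyword"    # IDs, SKUs, error codes: exact match matters
    if re.search(r"\b(connected to|related to|between|caused by)\b", query):
        return "graph"      # relationship questions: traverse, don't match
    if len(query.split()) > 25 or " and then " in query:
        return "agentic"    # long multi-step questions: loop and verify
    return "semantic"       # default fast path: plain vector search
```

The routing decision costs almost nothing, and it means the expensive strategies only run on the queries that need them.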
That is the real shift.
Naive RAG is not dead. But it is no longer a complete architecture. The math — and the compute bill — will not let it be.
This is the kind of problem I love digging into — where the math, the architecture, and the engineering all intersect.
Robert Klouda