In March 2025, a team out of Stanford’s RegLab and Computer Science department published the first preregistered audit of the AI legal research tools that LexisNexis and Thomson Reuters (under its Westlaw and Practical Law brands) had been quietly selling into law firms (Magesh et al. 2025). The vendors had promised, in writing, “hallucination-free” output. The empirical reality was that the best of the three tools was wrong about one in six times, the worst about one in three.
The architecture all three used was the same architecture you’re using right now: retrieval-augmented generation. Embed your corpus, vector-search at query time, hand the top-k chunks to a language model, and hope. Production RAG is the only field where adding more components is considered debugging.
If RAG worked, that paper would not exist.
That’s an inconvenient sentence, because if you’re reading this, you probably ship something downstream of an embedding model and a vector store. So let me make the inconvenience worse: the reason your RAG system isn’t working isn’t that you picked the wrong vector database, missed a chunking trick, or skipped the reranker. The reason your RAG system isn’t working is that the entire stack downstream of your embedding model is a coping mechanism: a Rube Goldberg machine of band-aids that compensates for the fundamental loss of quality that happened at embed time.
Walk the stack in reverse
Pull up the architecture diagram of any production RAG system shipped after 2023. There’s a recognizable pattern: a tower of bolted-on stages, each compensating for the failure of the one below it. At the top sits the LLM judge, the final arbiter: a model that looks at the answer the generator produced and asks, “Is this actually grounded in the retrieved sources?”
Stop and notice what that question implies. The judge exists because retrievals routinely surface chunks that look semantically close but don’t support the claim, and the model, gamely, hallucinates a synthesis on top of them anyway. The judge is a circuit-breaker for an upstream failure.
Below the judge, a reranker. Usually this is a cross-encoder: a smaller transformer that re-scores the top-k candidates by reading each one jointly with the query. The reranker exists because the embedding ranking was wrong. The cross-encoder is what your retrieval would have looked like if you could afford to pay full attention to every doc-query pair, which you cannot. So you let a fast, lossy proxy (vectors) get a near miss and pay a smaller-but-still-real model to clean it up.
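The rerank stage described above can be sketched in a few lines. This is a minimal illustration, not anyone's production code: `rerank` and `toy_overlap_score` are hypothetical names, and the token-overlap scorer is a stand-in so the example runs without downloading a model. In a real pipeline `score_pair` would wrap a cross-encoder, for instance the `predict` method of a sentence-transformers `CrossEncoder`.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates jointly with the query; keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (an assumption): fraction of query tokens found in the doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Pretend these came back from the vector index, in its (imperfect) order.
candidates = [
    "Invoices are archived after ninety days.",
    "The purchase order was approved by finance.",
    "Purchase order approval requires two signatures.",
]
reranked = rerank("who approved the purchase order", candidates,
                  toy_overlap_score, top_n=2)
for doc, score in reranked:
    print(f"{score:.2f}  {doc}")
```

Note what the function cannot do: it only reorders the candidates it is handed. Anything the first stage failed to retrieve is invisible to it.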
Below the reranker, hybrid search. BM25 plus vector. The keyword side exists because the vector side cannot reliably surface a document that contains the literal string "PO-2024-8734" even when the user types "PO-2024-8734". Embeddings smooth over surface forms; that’s most of what they’re for.
Hybrid search is the polite way to say our embedding can’t find a string match.
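Here is roughly what that patch looks like, as a sketch under stated assumptions. The BM25 scorer is a bare-bones textbook version, the fusion step is reciprocal rank fusion (one common way to merge rankings, not necessarily what any given vendor uses), and `vector_ranking` is hard-coded to mimic an embedding that ranks the literal match last. All function names and documents are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every doc against the query with a minimal BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def rrf_fuse(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Reciprocal rank fusion: each list votes 1 / (k + rank) for its docs."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


docs = [
    "Shipping delay notice for order PO-2023-1120",
    "Invoice attached for purchase order PO-2024-8734",
    "Purchase order policy and approval workflow",
]
scores = bm25_scores("PO-2024-8734", docs)
# Keyword side: only docs that actually contain a query term.
keyword_ranking = [i for i in sorted(range(len(docs)), key=lambda i: scores[i],
                                     reverse=True) if scores[i] > 0]
# Vector side: hypothetical output of an embedding index that buried the match.
vector_ranking = [2, 0, 1]

fused_ranking = rrf_fuse([keyword_ranking, vector_ranking])
print(fused_ranking)
```

The exact-match document comes out on top only because the keyword list voted for it. The vector list, on its own, had it in last place.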
Below hybrid, the knowledge graph. Entities, relations, edges. The graph exists because the embedding compressed away the relationships between people, organizations, instruments, and dates. You rebuild the graph, usually with another LLM at index time, and re-query it at retrieval time. You are reconstructing the structure your encoder destroyed.
Below the graph, hierarchical indexing. Domain → category → document → chunk. Hierarchical indexing exists because the vector neighborhood becomes less and less meaningful as the corpus grows past roughly 10,000 documents. Too many things land too close in too few directions. So you don’t search the whole corpus. You route to a sub-corpus first.
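The route-then-search move can be sketched as follows. This is a toy, assuming two-dimensional vectors and a routing rule based on section centroids; real systems route in many different ways (classifiers, metadata filters, LLM calls), and `route_then_search` and the corpus below are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Mean vector of a section, used as its routing key."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def route_then_search(query_vec: Vector,
                      corpus: Dict[str, List[Tuple[str, Vector]]],
                      top_k: int = 2) -> Tuple[str, List[str]]:
    """Pick the closest section first, then rank only the docs inside it."""
    centroids = {name: centroid([vec for _, vec in docs])
                 for name, docs in corpus.items()}
    section = max(centroids, key=lambda name: cosine(query_vec, centroids[name]))
    ranked = sorted(corpus[section],
                    key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return section, [doc_id for doc_id, _ in ranked[:top_k]]


# Toy corpus with made-up 2-d embeddings, grouped by section.
corpus = {
    "contracts": [("c1", [0.9, 0.1]), ("c2", [0.8, 0.3])],
    "litigation": [("l1", [0.1, 0.9]), ("l2", [0.2, 0.8])],
}
section, hits = route_then_search([1.0, 0.0], corpus, top_k=2)
print(section, hits)
```

The routing step is itself a vector comparison, so a query that lands near the wrong centroid never sees the right documents at all. The patch inherits the weakness it was built to hide.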
And below all of that, the quiet move every team makes when nothing else works: bigger embeddings. 768 didn’t cut it, so 1024. 1024 didn’t cut it, so 1536. 1536 didn’t cut it, so 3072. Each upgrade is the same trade. You bought yourself more directions to disambiguate in, at the cost of latency and storage, because the model the dimensions came from couldn’t disambiguate properly to begin with.
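To put a number on the storage side of that trade, here is the back-of-envelope arithmetic. It assumes float32 values (4 bytes each), a hypothetical corpus of 10 million chunks, and counts raw vectors only; index structures and metadata add more on top.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_size_gib(num_vectors: int, dims: int,
                        bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Raw storage for the vectors alone, in GiB (no index overhead)."""
    return num_vectors * dims * bytes_per_value / 2**30


NUM_CHUNKS = 10_000_000  # assumed corpus size, for illustration only
sizes = {dims: raw_vector_size_gib(NUM_CHUNKS, dims)
         for dims in (768, 1024, 1536, 3072)}

for dims, gib in sizes.items():
    print(f"{dims:>5} dims: {gib:7.2f} GiB")
```

Under those assumptions, 768 dimensions costs about 28.6 GiB and 3072 costs about 114.4 GiB: four times the memory, and every similarity computation reads four times as many numbers.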
Read the stack in reverse like this and the architecture stops looking like a system and starts looking like a confession. Every layer is the previous layer’s apology. And we wonder why it’s slow.
Each layer compensates for the layer below. The embedding model is the foundation everything else is apologizing for.
What the embedding actually missed
The embedding’s job is to map a passage of text to a point in vector space, where points close together represent similar things. The whole RAG stack is a bet on that map.
The Stanford paper documents what a leaky map looks like in production. The authors classify the failure modes by hand, and the four buckets matter:
- Naive retrieval. The system pulls something topically adjacent that doesn’t actually answer the question. The model then talks confidently about the wrong thing.
- Inapplicable authority. The system pulls a real, authoritative-looking source from the wrong jurisdiction, the wrong era, or a case that has been overturned. The text reads correct; the law isn’t.
- Reasoning error. The model misreads relationships in the retrieved text, confusing argument with holding, plaintiff with court, defendant’s position with court’s ruling.
- Sycophancy. The model accepts a false premise from the user instead of correcting it.
Two of those four, naive retrieval and inapplicable authority, are embedding failures sitting in plain sight. The encoder was given a query, asked for matching documents, and returned documents that match on surface signals but miss the dimensions that distinguish authoritative from cited-in-passing, in-force from overturned, this-circuit from another.
You can reproduce a small version of the failure on your laptop. Pick any open-source embedding model with a Hugging Face download counter that has eight digits. Encode two sentences that mean opposite things. Take the cosine.
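```python
from sentence_transformers import SentenceTransformer, util

# Small, widely downloaded sentence-embedding model
m = SentenceTransformer("all-MiniLM-L6-v2")

# Two sentences that differ only by a negation
a = "The defendant was found guilty."
b = "The defendant was not found guilty."

# Cosine similarity of the two embeddings (1.0 means identical direction)
print(util.cos_sim(m.encode(a), m.encode(b)).item())
# ~0.94
```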
The negation flipped the meaning, and the model, trained predominantly on token co-occurrence, barely noticed. Now imagine that’s the encoder gating which case a lawyer actually sees. If your embedding model can’t distinguish guilty from not guilty, it has roughly the same legal acumen as someone who skipped law school.
This is not a quirk of one small model. The same pattern shows up at scale. Negation, paraphrase, structural inversion, temporal context, jurisdictional context, exact-string fidelity: all of it gets compressed into a few hundred to a few thousand floating-point numbers via training objectives that reward token-level co-occurrence and don’t pay enough attention to which words flip meaning. The model learns to place “the contract was signed” near “the contract was executed.” That’s good. It also learns to place “the contract was signed” near “the contract was not signed.” That’s bad, and you cannot rerank your way out of it, because the rerank candidate set was already drawn from a bag in which both sentences look adjacent.
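If you want to check your own encoder for this, a small probe is enough. The sketch below is an assumption-laden toy: `negation_probe` and `make_bow_encoder` are hypothetical helpers, the 0.85 threshold is arbitrary, and the bag-of-words encoder is a stand-in that exaggerates the co-occurrence problem so the example runs offline. To test a real model, pass its `encode` function instead, such as the `m.encode` from the snippet above.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def negation_probe(encode: Callable[[str], Sequence[float]],
                   pairs: List[Tuple[str, str]],
                   threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return the opposite-meaning pairs the encoder still places close together."""
    flagged = []
    for statement, negated in pairs:
        similarity = cosine(encode(statement), encode(negated))
        if similarity >= threshold:
            flagged.append((statement, negated, similarity))
    return flagged


def make_bow_encoder(vocabulary: List[str]) -> Callable[[str], List[float]]:
    """Stand-in encoder (an assumption): plain word counts over a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}

    def encode(text: str) -> List[float]:
        vec = [0.0] * len(index)
        for token in text.lower().split():
            if token in index:
                vec[index[token]] += 1.0
        return vec

    return encode


pairs = [
    ("the contract was signed", "the contract was not signed"),
    ("the request was approved", "the request was not approved"),
]
vocabulary = sorted({tok for pair in pairs for sentence in pair
                     for tok in sentence.split()})
flagged = negation_probe(make_bow_encoder(vocabulary), pairs, threshold=0.85)
for statement, negated, similarity in flagged:
    print(f"{similarity:.3f}  {statement!r} vs {negated!r}")
```

A handful of such pairs drawn from your own domain (approved and denied, in force and repealed, affirmed and reversed) makes a cheap regression test to run before committing to an embedding model.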
Why the patches work, until they don’t
Coping mechanisms work. That’s what makes them coping mechanisms. The reason production teams keep stacking new layers is that each one moves the metric a little. A reranker takes you from 70 to 78. A hybrid index takes you from 78 to 83. A graph takes you from 83 to 86. The well-circulated post titled “Stanford’s warning: your RAG system is broken (and how to fix it)” (Izwan 2024) is a faithful catalog of these moves: hierarchical indexing, semantic clustering, adaptive chunking, hybrid search, knowledge graph, re-ranking. Stack them all and you can plausibly tell a customer the system is “enterprise-grade.”
Three things are true about that stack at the same time.
It does work better than naked vector search. The numbers go up.
It costs you on every dimension that isn’t quality. Latency, infra spend, blast radius, debug surface, the number of places a bug can hide. Every layer is another component you have to monitor, another thing that can drift, another vendor on your bill. The Stanford auditors had to read every cited case by hand to score the systems; that should tell you something about what real verification of a six-layer RAG pipeline looks like at the limit.
And the stack caps out below where you need it. You can stack patches and still stall at 88 percent retrieval accuracy. The vendors in the Stanford audit ran into exactly this ceiling: Lexis+ AI, the highest-performing system tested, answered correctly on 65 percent of queries. The fancier the patch, the higher the floor. The ceiling moves slower than the architecture diagram does.
The asymptote isn’t a function of how clever your reranker is. It’s a function of what was in the embedding to begin with.
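The ceiling is easy to see with an idealized reranker. The sketch below assumes an oracle that always moves a relevant document to the top if one is present among the candidates; the query and document IDs are invented. Even that oracle cannot beat the recall of the first-stage candidate set.

```python
from typing import Dict, List, Set


def recall_at_k(candidates: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int) -> float:
    """Share of queries with at least one relevant doc among the top-k candidates."""
    hits = sum(1 for query, docs in candidates.items()
               if relevant[query] & set(docs[:k]))
    return hits / len(candidates)


def oracle_rerank(docs: List[str], relevant_docs: Set[str]) -> List[str]:
    """Idealized reranker: relevant docs first, original order otherwise."""
    return sorted(docs, key=lambda doc: doc not in relevant_docs)


def top1_accuracy(candidates: Dict[str, List[str]],
                  relevant: Dict[str, Set[str]]) -> float:
    correct = sum(1 for query, docs in candidates.items()
                  if docs and docs[0] in relevant[query])
    return correct / len(candidates)


# First-stage candidates per query (hypothetical embedding-index output).
candidates = {
    "q1": ["d3", "d1", "d9"],
    "q2": ["d4", "d7", "d2"],
    "q3": ["d5", "d6", "d8"],  # the relevant doc was never retrieved
    "q4": ["d2", "d1", "d3"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}, "q4": {"d3"}}

first_stage_recall = recall_at_k(candidates, relevant, k=3)
reranked = {q: oracle_rerank(docs, relevant[q]) for q, docs in candidates.items()}
best_case_accuracy = top1_accuracy(reranked, relevant)
print(first_stage_recall, best_case_accuracy)
```

Both numbers come out the same. The third query fails no matter how good the reranker is, because the document it needed never made it out of the embedding stage.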
The cheapest embedding is the one that gets the answer right the first time
There’s a phrase that gets thrown around in adjacent industries: get it right at the source. Manufacturing learned this in the 1980s. Inspection-after-the-fact is the most expensive form of quality control because by the time you find the defect, you’ve already paid to build around it, ship it, and rework it. Statistical process control beats inspection every time, not because inspection is bad, but because inspection is downstream of the defect.
Embedding is the source. Everything in your RAG stack is downstream.
If the encoder cleanly distinguished negation, you wouldn’t need a graph to keep “approved” and “not approved” apart. If the encoder respected exact-string identifiers, you wouldn’t need BM25 in the loop. If the encoder kept jurisdictional and temporal scope coherent, your reranker would have less work to do, your judge less to override, and your hallucination rate fewer places to hide. Every layer downstream gets cheaper, faster, simpler, or stops being necessary at all, when the layer upstream gets meaning right.
Which is the part of the conversation about RAG that doesn’t get had often enough. The discourse focuses on the orchestration: the chunker, the indexer, the retriever, the reranker, the prompt template, the agent loop. The embedding model gets picked in week one, by whoever set up the project, from a leaderboard or a default config, and is never spoken of again.
It is the most consequential decision in the stack, and it is treated as the least.
A different question
Stanford’s audit handed the legal AI vendors a number to look at: one in six wrong, one in three wrong (with citations to prove it). The vendors are now working through their own architectural confessions in private. They will add another layer. Probably a few. The hallucination rate will move some.
It will not get to zero, because zero isn’t a property of the orchestration. Zero is a property of the embedding’s faithfulness to meaning, and you cannot patch faithfulness in after the fact.
If your retrieval system is failing, the productive question is not which layer to add next. It is which layer to remove, and what kind of embedding would let you remove it.
Get it right at embed time.
References
Izwan, S. (2024). Stanford’s warning: your RAG system is broken (and how to fix it). Medium. https://medium.com/@sameerizwan3/stanfords-warning-your-rag-system-is-broken-and-how-to-fix-it-c28a770fe7fe
Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2025). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. Stanford RegLab. https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/abs/1908.10084
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Advances in Neural Information Processing Systems. https://arxiv.org/abs/2002.10957