The Loop That Doesn’t Close
Every RAG system has two sides. The retrieval side finds relevant context. The generation side uses that context to produce a response. Between them, something is conspicuously missing: signal.
The LLM knows which retrieved chunks it actually used. The user knows whether the answer was good. But the retrieval system knows nothing. It made a prediction, “these five chunks are relevant,” and it never found out if it was right.
Run the same query tomorrow and it makes the same prediction again, with zero learning from yesterday.
The Retrieval N+1
In GraphQL, the N+1 problem refers to fetching a list of parent records and then running a separate query per parent to fetch its children: 1 + N queries when one batched query would do. The symptom is redundant work that grows with the number of records and, in turn, with request volume.
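To make the 1 + N count concrete, here is a minimal sketch that uses an in-memory stand-in for a database. `FakeDB`, its methods, and the two fetch functions are hypothetical names invented only to count round trips; they do not correspond to any GraphQL library.

```python
class FakeDB:
    """Hypothetical in-memory store that counts how many queries it serves."""

    def __init__(self, children_by_parent: dict[int, list[str]]):
        self._children = children_by_parent
        self.query_count = 0

    def list_parents(self) -> list[int]:
        self.query_count += 1
        return list(self._children)

    def children_of(self, parent_id: int) -> list[str]:
        self.query_count += 1  # one query per parent
        return self._children[parent_id]

    def children_of_many(self, parent_ids: list[int]) -> dict[int, list[str]]:
        self.query_count += 1  # one batched query for all parents
        return {pid: self._children[pid] for pid in parent_ids}


def fetch_n_plus_one(db: FakeDB) -> dict[int, list[str]]:
    # 1 query for the parents, then 1 query per parent: 1 + N in total.
    return {pid: db.children_of(pid) for pid in db.list_parents()}


def fetch_batched(db: FakeDB) -> dict[int, list[str]]:
    # 1 query for the parents, then 1 batched query for every child list.
    return db.children_of_many(db.list_parents())
```

Both functions return the same data. With three parents, the first issues four queries and the second issues two, and the gap widens linearly as the parent list grows.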
RAG pipelines have an equivalent pathology. Enterprise search logs consistently show that 50-70% of semantic queries are paraphrases of queries asked before. Same intent, different phrasing. Your embedding model produces slightly different vectors for each paraphrase. Your ANN index returns slightly different results. You run the full retrieval stack (embed, search, rerank) for every one of them.
```python
def rag_query(user_query: str, index: VectorIndex) -> str:
    embedding = embed(user_query)                # fresh computation, every time
    chunks = index.search(embedding, k=5)        # fresh search, every time
    response = llm.generate(chunks, user_query)
    return response
# ← signal stops here. nothing flows back to retrieval.
```
The pipeline is reactive. It answers. It forgets. And it performs the same expensive operations for semantically equivalent inputs because it has no memory of what worked before.
What “Learning” Requires
A retrieval system that learns needs two things it typically doesn’t have.
Prediction identifiers. When you retrieve five chunks for a query, that retrieval event needs an ID, not to identify the query, but to identify the prediction: “I predicted these five chunks would be relevant.” Without an identifier, you cannot correlate the prediction with its outcome.
Feedback signals. After generation, something needs to report what happened to that prediction. The most actionable signals:
| Outcome | Meaning |
|---|---|
| HIT | Retrieved chunk appeared in the generated response |
| MISS | Needed context wasn’t in the retrieved set |
| EVICTED_UNUSED | Retrieved chunk was never referenced by the LLM |
| STALE_HIT | Chunk was used, but the source document has since changed |
The EVICTED_UNUSED signal is the most valuable. It tells you which retrievals were pure waste: compute spent, context window consumed, the model never touched the result. A retrieval system that sees repeated EVICTED_UNUSED for a pattern should stop including that pattern. An open-loop system never receives this signal and keeps repeating the same mistake.
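A minimal sketch of how prediction identifiers and outcome signals could fit together. Everything in it is an assumption made for illustration: `Outcome`, `PredictionLog`, and the `min_observations` and `max_unused_rate` thresholds are hypothetical names, not part of any particular library. The sketch attributes outcomes per chunk and flags chunks whose retrievals are almost always EVICTED_UNUSED.

```python
import uuid
from collections import Counter, defaultdict
from enum import Enum


class Outcome(Enum):
    HIT = "HIT"
    MISS = "MISS"
    EVICTED_UNUSED = "EVICTED_UNUSED"
    STALE_HIT = "STALE_HIT"


class PredictionLog:
    """Hypothetical log that ties each retrieval prediction to its outcomes."""

    def __init__(self):
        self._predictions = {}              # prediction_id -> tuple of chunk ids
        self._stats = defaultdict(Counter)  # chunk_id -> Counter of Outcome

    def record_prediction(self, chunk_ids):
        # The ID names the prediction ("these chunks are relevant"), not the query.
        prediction_id = uuid.uuid4().hex
        self._predictions[prediction_id] = tuple(chunk_ids)
        return prediction_id

    def record_outcome(self, prediction_id, used_chunk_ids):
        # Chunks the generator referenced count as HIT; the rest as EVICTED_UNUSED.
        used = set(used_chunk_ids)
        for chunk_id in self._predictions.pop(prediction_id):
            outcome = Outcome.HIT if chunk_id in used else Outcome.EVICTED_UNUSED
            self._stats[chunk_id][outcome] += 1

    def wasteful_chunks(self, min_observations=3, max_unused_rate=0.8):
        # Flag chunks that were retrieved often but almost never used.
        flagged = []
        for chunk_id, counts in self._stats.items():
            total = sum(counts.values())
            unused_rate = counts[Outcome.EVICTED_UNUSED] / total
            if total >= min_observations and unused_rate >= max_unused_rate:
                flagged.append(chunk_id)
        return sorted(flagged)
```

This sketch only emits HIT and EVICTED_UNUSED. Detecting MISS and STALE_HIT needs information it does not model, such as what context the answer actually required and which document version each chunk came from.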
The Open-Loop Cache
Embedding caches address the repeated-computation cost: if you’ve already handled a sufficiently similar query, reuse the cached vector or search result instead of recomputing it. This reduces latency and, as long as the similarity threshold is tight, leaves retrieval quality essentially unchanged.
But an open-loop cache doesn’t learn. It stores. It retrieves. It never finds out which of its entries produced useful responses and which were noise. It can’t prioritize warm, high-yield entries over cold ones. It can’t suppress patterns that consistently produce retrieval misses.
```python
# Open-loop: cache with no feedback
result = cache.get(query_embedding)
if result is None:
    result = index.search(query_embedding, k=5)
    cache.set(query_embedding, result)
# ← no outcome signal. cache never learns what was useful.

# Closed-loop: prediction with tracking
prediction_id, result = cache.predict_and_get(query_embedding)
# ... after generation:
cache.record_outcome(prediction_id, outcome="EVICTED_UNUSED")
# ← prediction model updates. future similar queries improve.
```
An open-loop cache is better than no cache. A closed-loop cache is categorically different: it uses outcome signal to change every subsequent prediction.
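To show what "outcomes change future predictions" can mean in the smallest possible form, here is a toy cache with the same two method names as the snippet above. It is a hypothetical sketch, not the API of any real library: the scoring rule (+1 for HIT, -1 for anything else), the `drop_below` threshold, and the exact-bytes key are all assumptions chosen to keep the example short.

```python
import itertools
import struct


class ClosedLoopCache:
    """Toy closed-loop cache: recorded outcomes change what later lookups do."""

    def __init__(self, search_fn, drop_below=-2):
        self._search = search_fn    # fallback search: embedding -> results
        self._entries = {}          # key -> [results, score]
        self._pending = {}          # prediction_id -> key
        self._ids = itertools.count(1)
        self._drop_below = drop_below

    @staticmethod
    def _key(embedding):
        # Exact byte representation: only bit-identical vectors share a key.
        return struct.pack(f"{len(embedding)}d", *embedding)

    def predict_and_get(self, embedding):
        key = self._key(embedding)
        if key not in self._entries:
            self._entries[key] = [self._search(embedding), 0]
        prediction_id = next(self._ids)
        self._pending[prediction_id] = key
        return prediction_id, self._entries[key][0]

    def record_outcome(self, prediction_id, outcome):
        key = self._pending.pop(prediction_id)
        entry = self._entries.get(key)
        if entry is None:
            return  # entry was already dropped by an earlier outcome
        entry[1] += 1 if outcome == "HIT" else -1
        if entry[1] < self._drop_below:
            # Repeatedly unhelpful: forget it so the next lookup searches again.
            del self._entries[key]
```

Dropping a low-scoring entry is the crudest possible reaction to feedback. A production system would more plausibly rerank, widen `k`, or down-weight individual chunks, but the structure is the same: an ID at prediction time, an outcome afterwards, and state that changes in between.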
Why Determinism Is the Prerequisite
Closing the retrieval feedback loop has a structural requirement that most discussions skip: the feedback mechanism only works if your embeddings are reproducible.
If the same document produces different vectors across service restarts, library upgrades, or hardware failovers, even by a single bit, your prediction identifiers become unreliable. Outcomes recorded against one vector no longer line up with the vector computed next time, so a HIT and a MISS for semantically identical queries end up attached to subtly different representations. The correlation between prediction and outcome is corrupted before any learning can happen.
Closed-loop retrieval requires bit-exact embedding computation. Not approximately the same. Identical bits. The feedback signal is only as clean as the vectors it’s built on.
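One common source of non-reproducible vectors is floating-point summation order: addition is not associative, so the same numbers reduced in a different order can differ in their low-order bits. The sketch below shows this with plain Python floats, then shows why it matters when a cache key or prediction ID is derived from the exact bytes of a vector. `vector_key` is a hypothetical helper, and hashing raw bytes is an assumption made for illustration, not a description of how any particular system derives its identifiers.

```python
import hashlib
import math
import struct


def vector_key(vector):
    """Hypothetical helper: hash the exact IEEE 754 bytes of a vector."""
    raw = struct.pack(f"<{len(vector)}d", *vector)
    return hashlib.sha256(raw).hexdigest()


# Same three numbers, two summation orders, two different results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
sums_match = left_to_right == right_to_left

# Two vectors that differ only in the low-order bits of one component.
v1 = [0.25, 0.5, left_to_right]
v2 = [0.25, 0.5, right_to_left]
keys_match = vector_key(v1) == vector_key(v2)

# An approximate comparison considers the two sums equal...
nearly_equal = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
# ...but an exact-match key treats the two vectors as unrelated entries.

print(sums_match, keys_match, nearly_equal)
```

Here `sums_match` and `keys_match` are both `False` while `nearly_equal` is `True`: the vectors are indistinguishable for similarity search, yet any outcome recorded under one key is invisible to a lookup under the other.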
ARC
ARC is Voxell’s closed-loop vector cache, built on this model. Every retrieval prediction receives an identifier. Outcomes flow back. The prediction model adapts across requests, not just within them.
The determinism requirement is baked in. ARC operates on embeddings from Forge, which are computed under fixed numerical ordering guarantees. The same document produces the same vector across restarts, across hardware, across library versions. The feedback loop has a clean signal to learn from.
The retrieval loop can close. It just needs a feedback path and deterministic inputs.