Why Your Vector Search Lies

There is a class of bug that doesn’t look like a bug. No exception. No error log. No changed code. Just this: you run the same query against the same index twice, and you get different results. Not wrong results, just results that look plausible both times. Different ones.

You restart the service. Same thing. You roll back a deployment. Same thing. You pin every library version, freeze the model weights, feed in identical inputs byte-for-byte.

Still different.

This isn’t a race condition. It isn’t a caching problem. It’s floating point arithmetic behaving exactly as specified, and that specification is the problem.

The arithmetic you learned is wrong

Addition is associative. (a + b) + c = a + (b + c). You learned this in elementary school and it has been true in every math class since.

It is not true for floating point numbers.

Floating point represents real numbers as a finite binary fraction. The precision is fixed: 23 bits for a 32-bit float. When you add two floating point numbers and the exact result needs more bits than the format provides, it gets rounded. That rounding happens at every single operation.

The order in which you add numbers determines which intermediate values get rounded, which changes the accumulated error, which changes the final bit pattern. Mathematically equivalent expressions produce computationally different results:

a = 1e-16
b = 1.0
c = -1.0

# Mathematically identical. Computationally different.
(a + b) + c  # = 1.1102230246251565e-16
a + (b + c)  # = 1e-16

Same three numbers. Two legitimate answers. Both correct within the IEEE 754 standard. Neither equal to the other.

This is not a flaw. It is a deliberate design choice. Floating point trades exactness for speed and range. For most computations, the rounding error is small enough to ignore. For some, it is not.

Parallelism turns the volume up to eleven

Sequential code executes in a predictable order. Thread A sums values one through a million, left to right, every time. The rounding sequence is fixed. The result is deterministic.

A GPU doesn’t work this way.

A modern GPU executes thousands of threads simultaneously, organized into groups of 32 called warps. When you ask a GPU to sum a large array, it runs a parallel reduction: threads simultaneously sum adjacent pairs of values, then pairs of pairs, building a convergence tree. It is the fastest way to reduce a large array. It is also non-deterministic.

The shape of the reduction tree depends on which warps the scheduler dispatches first, how memory latency plays out across shared memory banks, how thread blocks get distributed across streaming multiprocessors. All of that changes run to run. Every change in tree shape is a change in the order of additions. Every change in addition order is a change in which intermediate results get rounded.

// GPU parallel reduction: 1024 threads summing values
__shared__ float sdata[1024];

// The reduction tree shape depends on:
//   Warp scheduling: nondeterministic across runs
//   Memory bank conflicts: load-dependent
//   Thread block placement: runtime-dependent
//
// Different tree = different addition order
// Different addition order = different rounding sequence
// Different rounding sequence = different final bits

Diagram showing how two threads processing identical inputs in different orders produce different floating-point outputs — Same input. Same model. Different warp scheduling. Different bits out, both correct under IEEE 754.

A parallel sum of 100,000 floats across a thousand threads can produce thousands of distinct bit patterns across runs. All within floating point’s error bounds. All “correct.” All different from each other.

Why embedding engineers don’t know they have this problem

Embedding models are sensitive to exact numerical inputs. The same token sequence, fed through the same model weights, should produce the same vector. On a GPU without determinism controls, it doesn’t.

The differences are small: typically the last one or two bits of a 32-bit float. The resulting vectors are nearly identical: cosine similarity above 0.9999. No test you’d normally write would catch it. The numbers look fine. But over time, at scale, the consequences compound.

Threshold drift. Your retrieval system filters results at a similarity threshold of 0.85. You calibrated that threshold against your corpus six months ago. Since then, a model server restart, a hardware failover, a CUDA library upgrade: any of them can shift the bit pattern of your embeddings by one or two bits. The threshold you calibrated no longer aligns with the distribution it was set against. Precision and recall drift. Nobody notices until someone asks why search quality degraded in March.

Index inconsistency. You run a partial re-indexing job: a subset of your corpus changed, so you recompute just those vectors. The new embeddings were computed on different hardware, or after a library update, or under different thread scheduling. Same documents. Same model. Different bit patterns. Your index now contains embeddings from two subtly different distributions. Approximate nearest neighbor search degrades across the boundary, invisibly.

Audit failure. A regulator, a customer, or an internal security review asks you to reproduce the embedding for a document from eight months ago. You recompute it. The result doesn’t match. The GPU scheduling that produced the original is gone. The cuBLAS version has changed. You cannot prove what your system computed, which means you cannot defend what it decided.

Agent instability. An AI agent makes decisions based on semantic similarity scores. Those scores should be deterministic across agent runs on identical inputs. Without bit-exact embeddings, they are not, which means agent behavior isn’t reproducible across restarts, which means you can’t reliably debug it, test it, or certify it.

The solutions exist. They carry costs.

There is no free lunch in deterministic GPU computation. Every approach trades something.

Deterministic library modes. NVIDIA exposes deterministic execution paths in its core libraries:

// Force cuBLAS to avoid nondeterministic reductions
cublasSetMathMode(handle, CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION);

// PyTorch global determinism flag
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Cost: 2-10× performance penalty on operations that depend on fast nondeterministic reduction. Not all operations have deterministic modes. Critically: “deterministic” in cuBLAS 11.x and cuBLAS 12.x produce different results from each other: pinning the library version is part of the solution, not a bonus.

Fixed reduction order. Force a deterministic tree topology by serializing at synchronization points:

__device__ float deterministic_reduce(float* data, int n) {
    for (int stride = 1; stride < n; stride *= 2) {
        int idx = threadIdx.x * stride * 2;
        if (idx + stride < n) {
            data[idx] += data[idx + stride];
        }
        __syncthreads();  // Force consistent ordering
    }
    return data[0];
}

Cost: The synchronization barriers serialize execution at each tree level. You keep the reduction tree shape but lose most of the throughput advantage. Expect 2-5× slowdown on large reductions.

Kahan summation. A compensated algorithm that explicitly tracks rounding error and corrects for it at each step:

float kahan_sum(float* values, int n) {
    float sum = 0.0f;
    float compensation = 0.0f;

    for (int i = 0; i < n; i++) {
        float y = values[i] - compensation;
        float t = sum + y;
        compensation = (t - sum) - y;  // Recover lost bits
        sum = t;
    }
    return sum;
}

Cost: Four operations per element instead of one. Substantially reduces order sensitivity but doesn’t eliminate it: different thread orderings can still produce different results, just smaller differences.

Fixed-point arithmetic. Integers are genuinely associative. (a + b) + c and a + (b + c) produce identical results in integer arithmetic, regardless of operation order:

// Store at fixed precision; operate in exact integer arithmetic
const int64_t SCALE = 1LL << 32;
int64_t a_fixed = (int64_t)(a * SCALE);
int64_t b_fixed = (int64_t)(b * SCALE);
int64_t c_fixed = (int64_t)(c * SCALE);

// Order does not matter. Result is always identical.
int64_t sum = a_fixed + b_fixed + c_fixed;

Cost: Reduced dynamic range. Overflow risk on large accumulations. Requires careful scale selection. Works well for similarity score accumulators and distance calculations; doesn’t generalize to operations like softmax or layer normalization.

Library version pinning. The zero-performance-cost intervention. Pin every CUDA library version into your deployment artifact and never upgrade mid-deployment. Results won’t match across CUDA versions, but every run within the same deployment will agree.

FROM nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
# Exact CUDA + cuDNN versions locked: reproducibility across restarts guaranteed

Cost: None at runtime. Operational cost of managing exact version artifacts. Does not solve the problem across library upgrades, only within a pinned deployment.

The tradeoff map

Approach	Bit-exact?	Perf hit	Works across upgrades?
Library deterministic modes	Yes	2-10×	No, pin versions too
Fixed reduction order	Yes	2-5×	Yes
Kahan summation	Near	2-4×	Yes
Fixed-point accumulators	Yes	1.5-3×	Yes
Library version pinning	Yes	0×	Only within version

In practice, production embedding pipelines combine approaches: deterministic library modes at the model inference layer, fixed-point or Kahan for similarity accumulators, and strict version pinning in the deployment artifact. Each layer handles a different source of nondeterminism.

The actual point

“Same query, same results” is what semantic search is supposed to mean. Without bit-exact determinism, you have semantic search that drifts: quietly, invisibly, in ways that don’t surface as errors. Only as degraded quality, unreproducible behavior, and indefensible audit logs.

This is why we built determinism into Voxell’s infrastructure at the foundation rather than retrofitting it.

Coherence is Voxell’s deterministic semantic memory layer. It runs on GPU-native exact tensor operations with fixed numerical ordering guarantees. The same query against the same index returns the same result: on the same hardware, across service restarts. No threshold drift. No index inconsistency. No audit gap.

ARC extends this to caching. Deterministic embeddings make caching semantically safe: a cache hit and a cache miss return the same vector, not approximately the same vector. Without bit-exact determinism, a cache that “speeds up” retrieval also makes it inconsistent.

Forge models are evaluated and served under deterministic modes. The embedding you compute today is the embedding you can reproduce tomorrow.

Determinism isn’t a feature you add to a retrieval system. It is a correctness property the system either has or doesn’t. Everything downstream of your embeddings (your thresholds, your indexes, your agents, your audit trails) depends on it.