Engineering Insights

Why Graph Traversal is Bullying Your HBM3e

Jonathan Corners | January 2026

The hardware is screaming for linear reads, but HNSW gives it random noise. We analyze the physics of memory coalescing and why bandwidth-bound sorting is the only way to respect the memory bus.

The Hardware Has Feelings Too

You just bought an H100. It has HBM3 memory capable of 3.35 TB/s of bandwidth (the newer H200's HBM3e pushes that to 4.8 TB/s). That is enough to download the entire internet in… well, a very short time.

But when you run your graph algorithm, maybe a vector search with HNSW or a standard BFS, you get a fraction of that. Maybe 5%.

Why? Because you are bullying the memory controller.

The Physics of a Cache Line

Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates you. Specifically, it hates your individuality.

When you ask for a single byte, the memory controller doesn’t just fetch that byte. It fetches a cache line. On NVIDIA GPUs, that is typically 128 bytes.

If you read a float (4 bytes) and then jump to a random location to read another float, you are effectively throwing away 124 bytes of bandwidth for every 4 bytes you use.

Efficiency = 4 / 128 = 3.125%

You are paying for a firehose and using it to fill a thimble.
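
To see the difference in code, here is a minimal CUDA sketch (my illustration, not from the whitepaper; the kernel names and the strided pattern are hypothetical). With a stride of 32 floats or more, every thread in a warp lands on its own cache line, so the same number of useful bytes costs roughly 32x the memory traffic.

// Coalesced: thread i reads element i, so a 32-thread warp is served by one 128-byte line.
__global__ void coalesced_read(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so each thread in the warp touches a different line.
__global__ void strided_read(const float* __restrict__ in, float* __restrict__ out,
                             int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride % n];
}

Both kernels move the same number of useful floats; a profiler like Nsight Compute will report far more memory sectors per request for the strided one, which is where the 3% figure comes from.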

The Graph Traversal Problem

Graph algorithms are defined by pointer chasing.

struct Node {
    float value;
    Node* next; // <--- The root of all evil
};

// The "Random Access" Pattern
while (node) {
    process(node->value);
    node = node->next; // Jump to random memory address
}

In a graph, node->next could be anywhere in memory. It is likely not in the same cache line. It is likely not even in the same DRAM page.

Every time you follow a pointer, the memory controller has to:

  1. Close the current DRAM row (Precharge).
  2. Open a new DRAM row (Activate).
  3. Read the data (CAS).

This takes time. Nanoseconds add up. And while the memory controller is doing this administrative paperwork, your CUDA cores are starving.
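
Here is a hedged GPU-side sketch of that pattern (my illustration; Node, chase, and the index-based links are assumptions, not code from the post): each thread walks its own list, so every hop turns one warp-wide load into up to 32 scattered transactions, each paying the precharge/activate/CAS tax above.

// Index-based linked list in a flat node pool; -1 terminates a list.
struct Node {
    float value;
    int   next;
};

// One thread per list. Each loop iteration is a dependent, effectively random load.
__global__ void chase(const Node* __restrict__ nodes, const int* __restrict__ heads,
                      float* __restrict__ sums, int num_lists) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_lists) return;
    float acc = 0.0f;
    for (int i = heads[t]; i != -1; i = nodes[i].next) {
        acc += nodes[i].value;   // neighbours in the warp are reading unrelated cache lines
    }
    sums[t] = acc;
}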

The Metaphor: The Library Ladder

Imagine a library with huge bookshelves. To get a book, you have to climb a rolling ladder.

Linear Scan (Array): You climb the ladder once, and you grab every book on the shelf, one by one. You are efficient. You are a machine.

Graph Traversal (Pointer Chasing): You climb the ladder, grab one book. Then you climb down, move the ladder 20 feet to the left, climb up, grab another book. Then you move the ladder 50 feet to the right.

You are spending 99% of your time moving the ladder and 1% of your time reading.

The Solution: Sort It Out

If you want to go fast, you must respect the hardware. The hardware wants coalesced memory access. It wants adjacent threads to read adjacent memory addresses.

How do we turn a graph problem into a linear problem? Sorting.

Instead of chasing pointers individually, we:

  1. Gather all the “next” pointers we want to visit.
  2. Sort them by memory address.
  3. Visit them in linear order.

// The "Coalesced" Pattern
// 1. Collect requests
int requests[N] = { ... random indices ... };

// 2. Sort requests to align with memory layout
std::sort(requests, requests + N);

// 3. Read linearly
for (int i = 0; i < N; i++) {
    // Now we are reading memory in increasing order
    // The memory controller can keep the row open!
    data[i] = big_array[requests[i]];
}
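
One wrinkle the snippet above glosses over: after sorting, the results come back in address order, not in the order the requests were issued. The usual fix is to sort the addresses together with their original positions and scatter the results back at the end. A minimal Thrust sketch of that idea (the function gather_sorted is mine, and it assumes everything already lives in device memory):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>
#include <thrust/scatter.h>

// Reads big_array at the requested indices in increasing-address order,
// then puts each result back in the slot of the request that asked for it.
void gather_sorted(const thrust::device_vector<float>& big_array,
                   thrust::device_vector<int>&         requests,  // sorted in place
                   thrust::device_vector<float>&       out)       // same size as requests
{
    const int n = static_cast<int>(requests.size());

    // Remember where each request originally sat: 0, 1, 2, ...
    thrust::device_vector<int> original_pos(n);
    thrust::sequence(original_pos.begin(), original_pos.end());

    // Sort by address, dragging the original positions along as values.
    thrust::sort_by_key(requests.begin(), requests.end(), original_pos.begin());

    // Coalesced-friendly gather: neighbouring requests now hit neighbouring lines.
    thrust::device_vector<float> gathered(n);
    thrust::gather(requests.begin(), requests.end(), big_array.begin(), gathered.begin());

    // Undo the permutation so out[i] answers the i-th original request.
    thrust::scatter(gathered.begin(), gathered.end(), original_pos.begin(), out.begin());
}

The sort itself is regular, bandwidth-friendly work, exactly the kind of thing the GPU finishes quickly, so the extra pass tends to pay for itself once the gather is large enough.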

This sounds counter-intuitive. “Wait, you want me to add a sorting step? Isn’t that slow?”

Yes, sorting costs instructions. But on a GPU, compute is cheap. Bandwidth is expensive.

Spending 1ms to sort requests can save you 10ms of memory latency. It is the difference between driving a Ferrari in traffic vs. driving a minivan on the open highway. The minivan wins if the Ferrari is stuck in first gear.

Why HNSW is an Anti-Pattern on GPU

Hierarchical Navigable Small World (HNSW) is the gold standard for vector search on CPUs. It is a graph algorithm. It relies on hopping from node to node.

On a CPU with big caches and sophisticated prefetchers, this works fine. On a GPU, it is a disaster.

This is why brute-force linear scan (or inverted file indices like IVF) often beats graph-based indices on GPUs. A linear scan reads memory sequentially and can approach full bandwidth utilization. Even if it does 10x more work, it can finish faster because it is not waiting for the memory controller to move the ladder.
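
To make the contrast concrete, here is a minimal brute-force L2-distance kernel (my sketch, not Voxell's or any particular library's implementation). It assumes the database is laid out dimension-major (struct-of-arrays), so at every step of the loop adjacent threads read adjacent floats and each 128-byte line is fully used.

// db is stored dimension-major: db[d * num_vecs + v] is dimension d of vector v.
__global__ void l2_scan(const float* __restrict__ db,
                        const float* __restrict__ query,   // [dim]
                        float*       __restrict__ dists,   // [num_vecs]
                        int num_vecs, int dim) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;         // one thread per database vector
    if (v >= num_vecs) return;
    float acc = 0.0f;
    for (int d = 0; d < dim; ++d) {
        // Threads v, v+1, v+2, ... read consecutive addresses: fully coalesced.
        float diff = db[(size_t)d * num_vecs + v] - query[d];
        acc += diff * diff;
    }
    dists[v] = acc;
}

No hops, no dependent loads: the access pattern is known up front, which is exactly what the memory controller wants.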

Conclusion

If you are writing high-performance CUDA code, stop thinking about “Big O” notation in terms of operations. Think about “Big O” in terms of cache lines.

  • O(1) Random Access: Terrible.
  • O(N) Linear Scan: Beautiful.

Be kind to your memory controller. Stop chasing pointers. Start sorting your requests.

Read the full analysis in our MASH whitepaper

Author

Jonathan Corners - Founder, Voxell. I build GPU-native infrastructure for real-time AI systems.

If you're working on latency + consistency problems, I'd like to hear about it.

