The Hardware Has Feelings Too
You just bought an H100. It has HBM3 memory capable of 3.35 TB/s of bandwidth. That is enough to download the entire internet in… well, a very short time.
But when you run your graph algorithm, maybe a vector search with HNSW or a standard BFS, you get a fraction of that. Maybe 5%.
Why? Because you are bullying the memory controller.
The Physics of a Cache Line
Here is the thing about DRAM (and HBM is just fancy 3D-stacked DRAM): it hates you. Specifically, it hates your individuality.
When you ask for a single byte, the memory controller doesn’t just fetch that byte. It fetches a cache line. On NVIDIA GPUs, that is typically 128 bytes.
If you read a float (4 bytes) and then jump to a random location to read another float, you are effectively throwing away 124 bytes of bandwidth for every 4 bytes you use.
Efficiency = 4 bytes used / 128 bytes fetched = 3.125%
At 3.35 TB/s of peak bandwidth, that leaves you roughly 100 GB/s of data you actually wanted.
You are paying for a firehose and using it to fill a thimble.
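You can feel this effect without a GPU. Here is a minimal host-side sketch (plain C++, an illustration rather than a rigorous benchmark) that sums the same array twice: once in order, once in shuffled order. The arithmetic is identical; only the access pattern changes, and on most machines the shuffled pass is several times slower.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sum data[] in the order given by idx[] and report wall time.
static void timed_sum(const std::vector<float>& data,
                      const std::vector<size_t>& idx,
                      const char* label) {
    auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (size_t i : idx) sum += data[i];
    auto t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%s: sum=%.0f in %lld ms\n", label, sum, ms);
}

int main() {
    const size_t n = size_t{1} << 26;    // 64M floats = 256 MB, far larger than any cache
    std::vector<float> data(n, 1.0f);
    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), size_t{0});

    timed_sum(data, idx, "sequential");  // every fetched cache line is fully used

    std::shuffle(idx.begin(), idx.end(), std::mt19937{42});
    timed_sum(data, idx, "random");      // roughly one float used per line fetched
    return 0;
}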
The Graph Traversal Problem
Graph algorithms are defined by pointer chasing.
struct Node {
    float value;
    Node* next;  // <--- The root of all evil
};

// The "Random Access" Pattern
while (node) {
    process(node->value);
    node = node->next;  // Jump to random memory address
}
In a graph, node->next could be anywhere in memory. It is likely not in the same cache line. It is likely not even in the same DRAM page.
Every time you follow a pointer, the memory controller has to:
- Close the current DRAM row (Precharge).
- Open a new DRAM row (Activate).
- Read the data (CAS).
This takes time. Nanoseconds add up. And while the memory controller is doing this administrative paperwork, your CUDA cores are starving.
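To put rough numbers on that paperwork, here is a back-of-envelope model. The latencies below are illustrative placeholders (real values depend on the part and come from its datasheet), but they are the right order of magnitude:
#include <cstdio>

int main() {
    // Illustrative latencies only, not datasheet values.
    const double tRP  = 15.0;  // ns: precharge (close the current row)
    const double tRCD = 15.0;  // ns: activate (open the new row)
    const double tCAS = 15.0;  // ns: the column read itself

    const double row_hit  = tCAS;               // next read lands in the open row
    const double row_miss = tRP + tRCD + tCAS;  // close, open, then read

    const double hops = 1e6;  // pointer chases that each miss the open row
    std::printf("row hit:  %.0f ns\n", row_hit);
    std::printf("row miss: %.0f ns\n", row_miss);
    std::printf("1M row-missing hops: %.1f ms of pure bookkeeping\n",
                hops * row_miss / 1e6);
    return 0;
}
A million dependent hops at 45 ns each is 45 ms during which your cores do nothing useful.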
The Metaphor: The Library Ladder
Imagine a library with huge bookshelves. To get a book, you have to climb a rolling ladder.
Linear Scan (Array): You climb the ladder once, and you grab every book on the shelf, one by one. You are efficient. You are a machine.
Graph Traversal (Pointer Chasing): You climb the ladder, grab one book. Then you climb down, move the ladder 20 feet to the left, climb up, grab another book. Then you move the ladder 50 feet to the right.
You are spending 99% of your time moving the ladder and 1% of your time reading.
The Solution: Sort It Out
If you want to go fast, you must respect the hardware. The hardware wants coalesced memory access. It wants adjacent threads to read adjacent memory addresses.
How do we turn a graph problem into a linear problem? Sorting.
Instead of chasing pointers individually, we:
- Gather all the “next” pointers we want to visit.
- Sort them by memory address.
- Visit them in linear order.
// The "Coalesced" Pattern
// 1. Collect requests
int requests[N] = { /* ... random indices ... */ };

// 2. Sort requests to align with memory layout
std::sort(requests, requests + N);  // needs <algorithm>

// 3. Read linearly
for (int i = 0; i < N; i++) {
    // Now we are reading memory in increasing order.
    // The memory controller can keep the row open!
    data[i] = big_array[requests[i]];
}
// Caveat: sorting scrambles the original request order. If callers care,
// sort {index, slot} pairs instead and scatter each result back to its slot.
This sounds counter-intuitive. “Wait, you want me to add a sorting step? Isn’t that slow?”
Yes, sorting costs instructions. But on a GPU, compute is cheap. Bandwidth is expensive.
Spending 1ms to sort requests can save you 10ms of memory latency. It is the difference between driving a Ferrari in traffic vs. driving a minivan on the open highway. The minivan wins if the Ferrari is stuck in first gear.
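On a real GPU you would not hand-roll this loop; Thrust ships the building blocks. Below is a sketch, not a drop-in recipe: the function and buffer names (coalesced_gather, d_table, d_requests, d_out) are made up for illustration, but thrust::sort_by_key, thrust::gather, and thrust::scatter are the actual primitives.
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Gather d_table[d_requests[i]] into d_out[i], but touch memory in
// ascending address order. d_out must already be sized to match d_requests.
void coalesced_gather(const thrust::device_vector<float>& d_table,
                      thrust::device_vector<int>&         d_requests,
                      thrust::device_vector<float>&       d_out)
{
    const int n = static_cast<int>(d_requests.size());

    // Remember each request's original slot so results land where callers expect.
    thrust::device_vector<int> d_slot(n);
    thrust::sequence(d_slot.begin(), d_slot.end());

    // 1. Sort requests by address; the slots ride along as values.
    thrust::sort_by_key(d_requests.begin(), d_requests.end(), d_slot.begin());

    // 2. Gather in ascending order: neighboring threads now touch
    //    neighboring (or at least nearby) addresses, so reads coalesce.
    thrust::device_vector<float> d_vals(n);
    thrust::gather(d_requests.begin(), d_requests.end(),
                   d_table.begin(), d_vals.begin());

    // 3. Scatter each result back to its original slot.
    thrust::scatter(d_vals.begin(), d_vals.end(),
                    d_slot.begin(), d_out.begin());
}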
Why HNSW is an Anti-Pattern on GPU
Hierarchical Navigable Small World (HNSW) is the gold standard for vector search on CPUs. It is a graph algorithm. It relies on hopping from node to node.
On a CPU with big caches and sophisticated prefetchers, this works fine. On a GPU, it is a disaster.
This is why brute-force linear scan (or inverted file indices like IVF) often beats graph-based indices on GPUs. A linear scan reads memory sequentially and runs at close to full bandwidth. Even if it does 10x more work, it can finish faster because it is not waiting for the memory controller to move the ladder.
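For contrast, here is what the coalesced scan can look like. This is a hypothetical kernel, not taken from any library, and it assumes the database is stored dimension-major (component d of every vector is contiguous), which is exactly what lets neighboring threads read neighboring addresses:
// Brute-force squared-L2 scan over N vectors of dimension D.
// Layout assumption: db[d * N + i] holds component d of vector i.
__global__ void l2_scan(const float* db, const float* query,
                        float* dist, int N, int D)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    float acc = 0.0f;
    for (int d = 0; d < D; ++d) {
        // Threads i, i+1, i+2, ... read adjacent floats: fully coalesced.
        float diff = db[d * N + i] - query[d];  // query[d] is a broadcast
        acc += diff * diff;
    }
    dist[i] = acc;  // coalesced write
}
Every warp marches through memory in lockstep. No ladder-moving, no row thrashing, just the firehose running at full pressure.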
Conclusion
If you are writing high-performance CUDA code, stop thinking about “Big O” notation in terms of operations. Think about “Big O” in terms of cache lines.
- O(1) Random Access: Terrible.
- O(N) Linear Scan: Beautiful.
Be kind to your memory controller. Stop chasing pointers. Start sorting your requests.