From the Kernel.

Technical deep-dives on GPU memory, sorting, and deterministic compute.

The Cache Feedback Gap: Why Your Prefetcher Doesn't Learn

Server push and predictive caching have existed for years. Why do caches still waste bandwidth on data nobody uses?

Read Article

Solving N+1 with Bidirectional Hints

DataLoader batches requests. HINT predicts them. Here's how to eliminate the N+1 problem at the protocol level.
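
To make the N+1 pattern concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `FakeDB` that only counts round trips; the class, its methods, and the sample data are illustrative and not part of any real library. It covers just the batching half of the claim. How HINT predicts requests is not described here, so that part is not modeled.

```python
from typing import Dict, List


class FakeDB:
    """Hypothetical stand-in for a database; it only counts round trips."""

    def __init__(self, authors: Dict[int, str]) -> None:
        self.authors = authors
        self.round_trips = 0

    def get_author(self, author_id: int) -> str:
        self.round_trips += 1  # one query per call
        return self.authors[author_id]

    def get_authors(self, author_ids: List[int]) -> Dict[int, str]:
        self.round_trips += 1  # one query for the whole batch
        return {i: self.authors[i] for i in set(author_ids)}


def names_n_plus_one(db: FakeDB, posts: List[dict]) -> List[str]:
    # N author queries, one per post (the "+1" is the query that fetched the posts).
    return [db.get_author(p["author_id"]) for p in posts]


def names_batched(db: FakeDB, posts: List[dict]) -> List[str]:
    # Collect every key first, then resolve them in a single round trip.
    ids = [p["author_id"] for p in posts]
    by_id = db.get_authors(ids)
    return [by_id[i] for i in ids]


posts = [{"id": n, "author_id": n % 3} for n in range(9)]
authors = {0: "ada", 1: "grace", 2: "linus"}

naive_db, batched_db = FakeDB(authors), FakeDB(authors)
naive = names_n_plus_one(naive_db, posts)
batched = names_batched(batched_db, posts)
```

Both paths return the same names. The difference is in `round_trips`: one per post for the naive loop, one in total for the batched lookup.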

Read Article

Why Graph Traversal is Bullying Your HBM3e

The hardware is screaming for linear reads, but HNSW gives it random noise. We analyze the physics of memory coalescing and why bandwidth-bound sorting is the only way to respect the memory bus.

Random access pattern: cache misses, stalled cores.

Sequential access pattern: coalesced reads, saturated bus.
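
The contrast between the two access patterns can be approximated on paper by counting how many memory segments a single warp touches. The sketch below is a rough model, not a measurement. It assumes 32 threads per warp, 4-byte elements, and a 128-byte transaction granularity; these constants and the `segments_touched` helper are illustrative assumptions, and real hardware behavior depends on the architecture and cache configuration.

```python
import random

WARP_SIZE = 32        # threads per warp (assumed)
ELEM_BYTES = 4        # one float32 per thread (assumed)
SEGMENT_BYTES = 128   # assumed memory transaction granularity


def segments_touched(indices):
    """Count the distinct memory segments that one warp's loads fall into."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in indices})


# Sequential: thread t reads element t, so the warp covers one contiguous span.
sequential = list(range(WARP_SIZE))

# Scattered: each thread follows an unrelated index, as in a graph hop.
rng = random.Random(0)
scattered = [rng.randrange(1_000_000) for _ in range(WARP_SIZE)]

seq_segments = segments_touched(sequential)
rand_segments = segments_touched(scattered)
```

Under these assumptions the sequential warp fits in a single segment, while the scattered warp needs up to one segment per thread to deliver the same number of useful bytes.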

Read Article

The Butterfly Effect in Your FPU: Why Parallel Math is Chaos

Taming nondeterminism in parallel floating-point math. In regulated sectors such as HFT and defense, 'approximate' results are a liability. We explore the engineering challenges of achieving bit-for-bit reproducibility.

// The problem with parallel reduction (a = 0.1, b = 0.2, c = 0.3)
Thread A: (a + b) + c = 0.6000000000000001
Thread B: a + (b + c) = 0.6
// Same inputs. Different outputs. Nondeterminism.
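
The grouping effect is easy to reproduce in Python with IEEE 754 doubles, using 0.1, 0.2, and 0.3 as sample inputs. The sketch also shows one common mitigation: put the inputs in a canonical order, then reduce them with a fixed tree shape so that arrival order no longer matters. The helpers `tree_sum` and `deterministic_sum` are hypothetical names for illustration. This is one possible approach under those assumptions, not necessarily the one the article describes.

```python
import random

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # grouping one thread schedule might produce
right = a + (b + c)   # grouping another schedule might produce


def tree_sum(xs):
    """Pairwise reduction with a fixed shape that depends only on len(xs)."""
    xs = list(xs)
    if not xs:
        return 0.0
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            paired.append(xs[-1])  # odd element carries over to the next level
        xs = paired
    return xs[0]


def deterministic_sum(xs):
    """Sort into a canonical order first, so arrival order cannot change the result."""
    return tree_sum(sorted(xs))


rng = random.Random(42)
data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
shuffled = data[:]
rng.shuffle(shuffled)

same_bits = deterministic_sum(data) == deterministic_sum(shuffled)
```

Sorting costs extra work, and the result is reproducible rather than more accurate. The trade-off buys identical bits for identical input sets regardless of the order in which values arrive.
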
Read Article

Starved Cores: Listening to the Silence Between Epochs

Latency isn't just about search speed; it's about the silence when the GPU waits for data. An analysis of DataLoader bottlenecks and how predictive caching (HINT) keeps the silicon singing.

[Figure: reactive DataLoader vs. HINT predictive pipeline]
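
The gap between a reactive loader and a pipeline that stays ahead of the consumer can be illustrated with a plain read-ahead buffer. The sketch below is a generic Python example: a background thread fills a bounded queue while the consumer works on the current batch. The `prefetch` and `load_batches` functions are hypothetical, and this is simple read-ahead rather than HINT's prediction mechanism, which is not specified here.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable[T], depth: int = 2) -> Iterator[T]:
    """Yield items from `source` while a background thread loads ahead."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        for item in source:
            buf.put(item)  # blocks when the buffer already holds `depth` items
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()   # waits only if the producer has fallen behind
        if item is _DONE:
            return
        yield item


def load_batches(n: int):
    """Stand-in for slow disk or network reads."""
    for i in range(n):
        yield [i] * 4


consumed = [sum(batch) for batch in prefetch(load_batches(5))]
```

A production version would also need to forward exceptions from the producer thread and shut down cleanly if the consumer stops early; both are left out to keep the sketch short.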

Read Article

Sorting on Blackwell: 9× Faster Where It Actually Matters

MASH is a data-aware GPU sorting engine for NVIDIA Blackwell that beats NVIDIA's CUB DeviceRadixSort on real workloads.

Read Article

See the code in action.

Get hands-on with Coherence. Benchmark it against your current infrastructure.

Download Developer Preview