Sorting on Blackwell: 9× Faster Where It Actually Matters
If your data is already almost sorted, why are you paying to sort it again?
For more than a decade, NVIDIA’s CUB DeviceRadixSort has been the default answer to that question: fast, battle-tested, and completely indifferent to the shape of your data. It treats a perfectly monotonic HFT order book the same way it treats cryptographic white noise. Your latency budget and your power bill pay for that indifference.
I wanted to know how much performance and headroom we’re leaving on the table when we feed real workloads (not synthetic white noise) into a data-oblivious sort.
So I built MASH: a GPU-native Multidimensional Adaptive Sorting Hierarchy engine. Before it does anything expensive, it runs a single “fingerprint” pass over your keys, compresses the topology into a compact summary, and then chooses the cheapest correct strategy instead of blindly running the worst-case radix sort every time.
On Blackwell GB10 (48 SMs, CUDA 13.0), MASH delivers the following performance improvements over CUB DeviceRadixSort on 64-bit keys at 100M, 1B, and 3B elements (reported numbers are geometric means across the three sizes):
- Presorted: 8.64×
- Reverse: 4.29×
- Uniform random: 1.41×
- Zipfian (Heavy-Tail): 1.33×
On presorted 1B-row workloads, the kind you actually see in HFT, logging, and time series, MASH is roughly 9× faster than CUB. On reverse runs it’s >4× faster.
But the real story starts where the charts usually stop. At 7 billion elements, standard CUB crashes due to memory exhaustion. MASH keeps running.
The rest of this article walks through what I measured, how I measured it, and enough of how MASH works to be interesting, without giving away the parts that belong in patents, not blog posts.
For Service Owners: What These Numbers Buy You
If you manage a GPU-backed analytics service or time series DB, these kernels translate directly to service-level outcomes:
- ORDER BY / Window Functions: Latency for 1B–3B row partitions drops from ~2.9s to ~0.33s on presorted segments.
- Ingestion Efficiency: Ingestion pipelines that handle mostly-ordered logs can process the same throughput with fewer GPUs.
- Stability at Scale: Deterministic “no-OOM” behavior at scales (6B–7B keys) where generic radix sort fails, preventing "poison pill" queries from crashing nodes.
Why Data-Oblivious Sorting Hurts Real Systems
Sorting is not a toy benchmark. It’s how we:
- Build and maintain indices.
- Merge streams in observability and logging stacks.
- Keep HFT order books coherent under extreme load.
- Compact, dedupe, and re-shard time series and feature stores.
The traditional GPU story has been simple: “call CUB DeviceRadixSort and move on.” And to be clear: CUB is excellent at what it’s designed for. It is ruthlessly optimized for worst-case entropy. Give it uniformly random 64-bit keys and it will happily chew through billions of them.
But that is not what your production data usually looks like:
- HFT pipelines ingest long stretches of almost-sorted ticks, interrupted by local bursts of disorder.
- Observability stacks append new events to already-ordered partitions.
- Time-series tables are dominated by monotonic timestamps, with occasional late arrivals and backfills.
In other words, your data has a topology. It carries structure over time. It remembers where it came from.
A data-oblivious algorithm like radix sort chooses to ignore all of that. It pays the same multi-pass cost for a perfectly sorted array as it does for pure chaos. MASH is a counter-argument: if the GPU can see the topology cheaply, it should exploit it aggressively.
The Iron: Blackwell GB10, Straight Up
All of the results in this article come from a single, straightforward environment:
- GPU: NVIDIA Blackwell GB10, compute capability 12.1
- SMs: 48
- Memory: 128 GB unified memory (~224 GB/s effective streaming bandwidth, measured on-device)
- Software: CUDA 13.0, NVIDIA driver 565.57.01, Linux 6.11 series kernel
MASH wasn’t built in a clean-room lab. I developed the core architecture on standard public cloud infrastructure (AWS A10G instances, CUDA 12.x, sm_80/90), and the results presented here were captured on the bleeding edge: NVIDIA’s Blackwell GB10 with the CUDA 13.0 (sm_121) toolchain. That lineage matters: it suggests the efficiency gains are architectural, not artifacts of a specific flagship chip. Moving between platforms shifted MASH’s results by only 3–5% depending on skew, and CUB’s radix sort sped up slightly as well, so the newer hardware was not a material advantage.
Build configuration (simplified):
nvcc -std=c++20 -O3 -arch=sm_121 --expt-relaxed-constexpr -o mash_benchmark main.cu
What the Scoreboard Actually Says
Here’s the high-level picture across 100M, 1B, and 3B 64-bit keys:
| Distribution | 100M Elements | 1B Elements | 3B Elements | Geometric Mean |
|---|---|---|---|---|
| Presorted | 8.13× | 9.06× | 8.76× | 8.64× |
| Reverse | 3.88× | 4.53× | 4.49× | 4.29× |
| Uniform | 1.49× | 1.38× | 1.35× | 1.41× |
| Zipfian | 1.34× | 1.33× | 1.31× | 1.33× |
Three quick observations:
- Presorted (the real common case): At 1B keys, CUB spends 992.75 ms re-sorting an array that is already perfectly ordered. MASH spends 109.59 ms, most of it in a single fingerprint pass plus a validation that the sequence is globally monotone. No multi-pass radix pipeline, no extra buffers, no second trip through memory.
- Reverse (“right values, wrong direction”): At 1B keys, CUB clocks in at 996.47 ms. MASH detects that the keys are consistently “wrong-way” ordered and routes to a fast, bandwidth-bound repair path. Result: 220.14 ms total. That’s a 4.53× win without any trick distributions or inflated settings.
- Uniform & Zipfian (the hard stuff): This is where topology work is pure overhead and there’s little exploitable structure. Even so, MASH’s default path plus the fingerprint comes in 1.3–1.5× faster than CUB across 100M, 1B, and 3B keys. Worst case: you’re never slower than the baseline you already trust.
The Scale Ceiling: 7 Billion Keys (MASH vs. OOM)
3B keys on a single GPU is where abstractions start to get very real. 7B keys are where they break.
Standard GPU radix sorts like CUB require significant auxiliary memory, often 2× to 4× the input size, for double-buffering and histograms. At extreme scales, this leads to a hard wall: the GPU runs out of VRAM for the sort, not the data.
I stress-tested both algorithms on a 128 GB unified-memory configuration at up to 8 billion keys. The results expose a fundamental difference in architecture.
| Scale | Elements | CUB Radix (ms) | MASH (ms) | Speedup | Status |
|---|---|---|---|---|---|
| 4B | 4.0×10⁹ | 4196.65 | 442.52 | 9.48× | Both Succeed |
| 5B | 5.0×10⁹ | 5109.51 | 560.96 | 9.11× | Both Succeed |
| 6B | 6.0×10⁹ | OOM (Crash) | 653.52 | ∞ | MASH Wins |
| 7B | 7.0×10⁹ | OOM (Crash) | 769.61 | ∞ | MASH Wins |
| 8B | 8.0×10⁹ | OOM (Crash) | Graceful Exit | - | MASH Rejects |
At 6B and 7B keys, CUB fails to allocate the necessary temporary buffers and crashes the workload, while MASH detects the presorted structure and routes to an in-place verification path that requires zero additional allocation. At 8B, where even the input no longer fits safely, MASH rejects the job up front with a graceful exit instead of crashing mid-flight.
The result is not just a speedup; it is a capability gain. MASH can process datasets 40% larger than NVIDIA CUB, the industry-standard library, on the exact same hardware.
Topology in One Cheap Pass
The natural question after seeing the numbers is: what is MASH actually learning in that single pass?
At a high level:
- MASH runs a fused, bandwidth-bound fingerprint kernel that streams through the key array once.
- It maintains a compact internal picture of the data’s ordering and value distribution.
- It reduces that into a small routing state on the host, which decides what to do next.
On 1B 64-bit keys, that fingerprint costs 38.6 ms and reaches 207 GB/s of effective bandwidth. A pure “touch-only” kernel on the same GB10 lands around 35.7 ms and 224 GB/s. That ~8% gap is the entire budget MASH spends to understand the topology before it commits to a path. The fused fingerprint is a single O(N) streaming pass that does meaningful work instead of just touching memory, and in my experiments this was an excellent balance between work per touch and memory saturation.
The exact statistics it tracks, how they’re combined, and the thresholds that translate them into routing decisions are where a lot of the IP lives. I’m not going to itemize those here. If you’ve ever designed heuristics for large-scale systems, you already know the interesting work is in those details.
Fast Paths Without Giving Away the Playbook
Once the fingerprint has been reduced, MASH chooses among a small family of strategies. Conceptually:
- An “already sorted” fast path that validates global monotonicity and, if confirmed, returns without performing a conventional sort.
- A “right values, wrong direction” path that repairs descending runs with an in-place, bandwidth-bound transformation rather than a full radix pipeline.
- A skew-optimized path for heavily Zipfian workloads, tuned to reduce wasted work on the cold tail while preserving correctness everywhere.
- A default path that behaves like a well-tuned generic sort on Blackwell and takes over when the data looks adversarial or ambiguous.
The dispatcher that chooses between these paths is intentionally conservative. It only gives you the “cheat codes” when the fingerprint provides strong, unambiguous evidence that a specialized strategy is both cheaper and safer. Otherwise it falls back to the default behavior and still manages to beat CUB on its home turf.
Where MASH Fits in Real Systems
MASH is not a standalone “sorting app.” It’s designed as a kernel-level primitive for GPU-backed infrastructure:
- Analytics engines: ORDER BY, window functions, merge stages in columnar query pipelines.
- Time-series and observability: Ingest-time sorting, re-sorting, and compaction where the data is mostly monotonic.
- HFT and trading infra: Maintenance of large, partially ordered books and replay segments at scale.
Integration Strategy: MASH is designed as a drop-in C++ library for internal DB engines. It currently supports 64-bit keys, with key+payload and multi-column support on the roadmap.
How exactly this plugs into your stack (Arrow, Parquet, Velox, home-grown execution engines, or cloud services) is something that depends heavily on your architecture. The article is a starting point, not the full integration guide.
Memory Behavior and OOM Safety
CUB DeviceRadixSort is fast, but it leans heavily on auxiliary buffers. Depending on how you call it, those temporaries can reach a significant multiple of the input size. On modern Blackwell GPUs with HBM3e, that design can turn VRAM into the bottleneck long before you hit the SM limits.
MASH was built with those failure modes in mind:
- The fingerprint and validation work are strictly streaming, with minimal per-block state.
- The fast paths for presorted and reverse-like data operate in-place, with only a small scratch budget.
- The skew-aware and default paths are capped to a fixed fraction of input size, which lets the planner make a deterministic “safe / unsafe” call before any heavy work is launched.
In practice, if you run big GPU jobs, the everyday experience is: fewer “just one more buffer” crashes, fewer mysterious OOMs at the tail end of a pipeline, and more of your memory budget going into useful work instead of scratch space.
Why This Matters (and What I’m Not Saying Yet)
This isn’t “I shaved 7% off a toy benchmark.” This is:
- A ~9× reduction in latency on the most common “nice” workloads (presorted 1B–3B).
- A >4× win on reverse segments you can’t avoid in the real world.
- A 30–50% improvement even on adversarial or skewed distributions.
- A design that treats topology as a first-class signal on modern GPUs instead of pretending every batch is white noise.
For GPU-backed databases, observability stacks, and HFT infra, that’s the difference between needing N GPUs versus N/2 for the same SLA when sort-heavy operations dominate.
There’s more under the hood than I’m willing to put into a public article: the exact fingerprint state, routing heuristics, Nsight traces, and how this interacts with other GPU-native components.
If you own or work on:
- A GPU-accelerated analytics or time series service,
- A trading / HFT stack that already pushes CUB to the edge, or
- A database / query engine team exploring GPU paths,
I’d be very interested in your critiques, your worst-case traces, and your sense of where something like this would actually move the needle.
Interested in GPU-native infrastructure or want to run these benchmarks in your stack?
Get in Touch