Sorting on Blackwell: 9× Faster Where It Actually Matters
If your data is already almost sorted, why are you paying to sort it again?
For more than a decade, NVIDIA’s CUB DeviceRadixSort has been the default answer to that question: fast, battle-tested, and completely indifferent to the shape of your data. It treats a perfectly monotonic HFT order book the same way it treats cryptographic white noise. Your latency budget and your power bill pay for that indifference.
I wanted to know how much performance and headroom we’re leaving on the table when we feed real workloads (not synthetic white noise) into a data-oblivious sort.
So I built MASH: a GPU-native Multidimensional Adaptive Sorting Hierarchy engine. Before it does anything expensive, it runs a single “fingerprint” pass over your keys, compresses the topology into a compact summary, and then chooses the cheapest correct strategy instead of blindly running the worst-case radix sort every time.
On Blackwell GB10 (48 SMs, CUDA 13.0), MASH delivers the following performance improvements over CUB DeviceRadixSort on 64-bit keys at 100M, 1B, and 3B elements (reported numbers are geometric means across the three sizes):
- Presorted: 8.64×
- Reverse: 4.29×
- Uniform random: 1.41×
- Zipfian (Heavy-Tail): 1.33×
On presorted 1B-row workloads, the kind you actually see in HFT, logging, and time series, MASH is roughly 9× faster than CUB. On reverse runs it’s >4× faster.
But the real story starts where the charts usually stop. At 7 billion elements, standard CUB crashes due to memory exhaustion. MASH keeps running.
The rest of this article walks through what I measured, how I measured it, and enough of how MASH works to be interesting, without giving away the parts that belong in patents, not blog posts.
For Service Owners: What These Numbers Buy You
If you manage a GPU-backed analytics service or time series DB, these kernels translate directly to service-level outcomes:
- ORDER BY / Window Functions: Latency for 1B–3B row partitions drops from ~2.9s to ~0.33s on presorted segments.
- Ingestion Efficiency: Ingestion pipelines that handle mostly-ordered logs can process the same throughput with fewer GPUs.
- Stability at Scale: Deterministic “no-OOM” behavior at scales (6B–7B keys) where generic radix sort fails, preventing "poison pill" queries from crashing nodes.
Why Data-Oblivious Sorting Hurts Real Systems
Sorting is not a toy benchmark. It’s how we:
- Build and maintain indices.
- Merge streams in observability and logging stacks.
- Keep HFT order books coherent under extreme load.
- Compact, dedupe, and re-shard time series and feature stores.
The traditional GPU story has been simple: “call CUB DeviceRadixSort and move on.” And to be clear: CUB is excellent at what it’s designed for. It is ruthlessly optimized for worst-case entropy. Give it uniformly random 64-bit keys and it will happily chew through billions of them.
But that is not what your production data usually looks like:
- HFT pipelines ingest long stretches of almost-sorted ticks, interrupted by local bursts of disorder.
- Observability stacks append new events to already-ordered partitions.
- Time-series tables are dominated by monotonic timestamps, with occasional late arrivals and backfills.
In other words, your data has a topology. It carries structure over time. It remembers where it came from.
A data-oblivious algorithm like radix sort chooses to ignore all of that. It pays the same multi-pass cost for a perfectly sorted array as it does for pure chaos. MASH is a counter-argument: if the GPU can see the topology cheaply, it should exploit it aggressively.
The Iron: Blackwell GB10, Straight Up
All of the results in this article come from a single, straightforward environment:
- GPU: NVIDIA Blackwell GB10, compute capability 12.1
- SMs: 48
- Memory: 128 GB unified memory (~224 GB/s effective streaming bandwidth, measured on-device)
- Software: CUDA 13.0, NVIDIA driver 565.57.01, Linux 6.11 series kernel
MASH wasn’t built in a clean-room lab. I developed the core architecture on standard public cloud infrastructure (AWS A10G instances, CUDA 12.x, sm_80/90), and the results presented here were captured on the bleeding edge: NVIDIA’s Blackwell GB10 with the CUDA 13.0 (sm_121) toolchain. That lineage matters: it suggests the efficiency gains are architectural, not artifacts of a specific flagship chip. Moving between platforms shifted MASH’s results by only 3–5% depending on skew, and CUB’s radix sort sped up slightly as well, so the newer hardware was not a material advantage.
Build configuration (simplified):
nvcc -std=c++20 -O3 -arch=sm_121 --expt-relaxed-constexpr -o mash_benchmark main.cu
What the Scoreboard Actually Says
Here’s the high-level picture across 100M, 1B, and 3B 64-bit keys:
| Distribution | 100M Elements | 1B Elements | 3B Elements | Geometric Mean |
|---|---|---|---|---|
| Presorted | 8.13× | 9.06× | 8.76× | 8.64× |
| Reverse | 3.88× | 4.53× | 4.49× | 4.29× |
| Uniform | 1.49× | 1.38× | 1.35× | 1.41× |
| Zipfian | 1.34× | 1.33× | 1.31× | 1.33× |
Three quick observations:
- Presorted (the real common case): At 1B keys, CUB spends 992.75 ms re-sorting an array that is already perfectly ordered. MASH spends 109.59 ms, most of it in a single fingerprint pass plus a validation that the sequence is globally monotone. No multi-pass radix pipeline, no extra buffers, no second trip through memory.
- Reverse (“right values, wrong direction”): At 1B keys, CUB clocks in at 996.47 ms. MASH detects that the keys are consistently “wrong-way” ordered and routes to a fast, bandwidth-bound repair path. Result: 220.14 ms total. That’s a 4.53× win without any trick distributions or inflated settings.
- Uniform & Zipfian (the hard stuff): This is where topology work is pure overhead and there’s little exploitable structure. Even so, MASH’s default path plus the fingerprint comes in 1.3–1.5× faster than CUB across 100M, 1B, and 3B keys. Worst case: you’re never slower than the baseline you already trust.
The Scale Ceiling: 7 Billion Keys (MASH vs. OOM)
3B keys on a single GPU is where abstractions start to get very real. 7B keys are where they break.
Standard GPU radix sorts like CUB require significant auxiliary memory, often 2× to 4× the input size, for double-buffering and histograms. At extreme scales, this leads to a hard wall: the GPU runs out of VRAM for the sort, not the data.
I stress-tested both algorithms on a 128 GB unified-memory configuration at up to 8 billion keys. The results expose a fundamental difference in architecture.
| Scale | Elements | CUB Radix (ms) | MASH (ms) | Speedup | Status |
|---|---|---|---|---|---|
| 4B | 4.0×10⁹ | 4196.65 | 442.52 | 9.48× | Both Succeed |
| 5B | 5.0×10⁹ | 5109.51 | 560.96 | 9.11× | Both Succeed |
| 6B | 6.0×10⁹ | OOM (Crash) | 653.52 | ∞ | MASH Wins |
| 7B | 7.0×10⁹ | OOM (Crash) | 769.61 | ∞ | MASH Wins |
| 8B | 8.0×10⁹ | OOM (Crash) | Graceful Exit | - | MASH Rejects |
At 6B and 7B keys, CUB fails to allocate the necessary temporary buffers and crashes the workload, while MASH detects the presorted structure and routes to an in-place verification path that requires zero additional allocation. At 8B, where even the input no longer fits safely, MASH rejects the job up front with a graceful exit instead of crashing mid-flight.
The result is not just a speedup; it is a capability gain. MASH can process datasets 40% larger than NVIDIA CUB, the industry-standard library, on the exact same hardware.
Topology in One Cheap Pass
The natural question after seeing the numbers is: what is MASH actually learning in that single pass?
At a high level:
- MASH runs a fused, bandwidth-bound fingerprint kernel that streams through the key array once.
- It maintains a compact internal picture of the data’s ordering and value distribution.
- It reduces that into a small routing state on the host, which decides what to do next.
On 1B 64-bit keys, that fingerprint costs 38.6 ms and reaches 207 GB/s of effective bandwidth. A pure “touch-only” kernel on the same GB10 lands around 35.7 ms and 224 GB/s. That ~8% gap is the entire budget MASH spends to understand the topology before it commits to a path. The fused fingerprint is a single O(N) streaming pass that does meaningful work instead of just touching memory, and in my experiments this was an excellent balance between work per touch and memory saturation.
The exact statistics it tracks, how they’re combined, and the thresholds that translate them into routing decisions are where a lot of the IP lives. I’m not going to itemize those here. If you’ve ever designed heuristics for large-scale systems, you already know the interesting work is in those details.
Fast Paths Without Giving Away the Playbook
Once the fingerprint has been reduced, MASH chooses among a small family of strategies. Conceptually:
- An “already sorted” fast path that validates global monotonicity and, if confirmed, returns without performing a conventional sort.
- A “right values, wrong direction” path that repairs descending runs with an in-place, bandwidth-bound transformation rather than a full radix pipeline.
- A skew-optimized path for heavily Zipfian workloads, tuned to reduce wasted work on the cold tail while preserving correctness everywhere.
- A default path that behaves like a well-tuned generic sort on Blackwell and takes over when the data looks adversarial or ambiguous.
The dispatcher that chooses between these paths is intentionally conservative. It only gives you the “cheat codes” when the fingerprint provides strong, unambiguous evidence that a specialized strategy is both cheaper and safer. Otherwise it falls back to the default behavior and still manages to beat CUB on its home turf.
Where MASH Fits in Real Systems
MASH is not a standalone “sorting app.” It’s designed as a kernel-level primitive for GPU-backed infrastructure:
- Analytics engines: ORDER BY, window functions, merge stages in columnar query pipelines.
- Time-series and observability: Ingest-time sorting, re-sorting, and compaction where the data is mostly monotonic.
- HFT and trading infra: Maintenance of large, partially ordered books and replay segments at scale.
Integration Strategy: MASH is designed as a drop-in C++ library for internal DB engines. It currently supports 64-bit keys, with key+payload and multi-column support on the roadmap.
How exactly this plugs into your stack (Arrow, Parquet, Velox, home-grown execution engines, or cloud services) is something that depends heavily on your architecture. The article is a starting point, not the full integration guide.
Memory Behavior and OOM Safety
CUB DeviceRadixSort is fast, but it leans heavily on auxiliary buffers. Depending on how you call it, those temporaries can reach a significant multiple of the input size. On modern Blackwell GPUs with HBM3e, that design can turn VRAM into the bottleneck long before you hit the SM limits.
MASH was built with those failure modes in mind:
- The fingerprint and validation work are strictly streaming, with minimal per-block state.
- The fast paths for presorted and reverse-like data operate in-place, with only a small scratch budget.
- The skew-aware and default paths are capped to a fixed fraction of input size, which lets the planner make a deterministic “safe / unsafe” call before any heavy work is launched.
In practice, if you run big GPU jobs, the everyday experience is: fewer “just one more buffer” crashes, fewer mysterious OOMs at the tail end of a pipeline, and more of your memory budget going into useful work instead of scratch space.
Why This Matters (and What I’m Not Saying Yet)
This isn’t “I shaved 7% off a toy benchmark.” This is:
- A ~9× reduction in latency on the most common “nice” workloads (presorted 1B–3B).
- A >4× win on reverse segments you can’t avoid in the real world.
- A 30–50% improvement even on adversarial or skewed distributions.
- A design that treats topology as a first-class signal on modern GPUs instead of pretending every batch is white noise.
For GPU-backed databases, observability stacks, and HFT infra, that’s the difference between needing N GPUs versus N/2 for the same SLA when sort-heavy operations dominate.
There’s more under the hood than I’m willing to put into a public article: the exact fingerprint state, routing heuristics, Nsight traces, and how this interacts with other GPU-native components.
If you own or work on:
- A GPU-accelerated analytics or time series service,
- A trading / HFT stack that already pushes CUB to the edge, or
- A database / query engine team exploring GPU paths,
I’d be very interested in your critiques, your worst-case traces, and your sense of where something like this would actually move the needle.
Interested in GPU-native infrastructure or want to run these benchmarks in your stack?
Get in Touch