GPU + CPU Adaptive Sorting

MASH Sort

Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on every distribution.

Up to 18× faster than CUB

Presorted	17.8x
Zipfian	1.95x
Reverse	1.95x

Contact for Pricing · See Benchmarks

▶ Watch It Beat CUB — Live

90 seconds, no edits, no hand-picked numbers. Every result is measured live on the GPU and verified correct against std::sort before its speed is shown.

unedited · measured live on RTX PRO 6000 Blackwell · verified against std::sort · watch on YouTube

The One-Size-Fits-All Problem

NVIDIA CUB and Thrust are the industry standard for GPU sorting. They use radix sort, which is excellent for uniform random data but blind to real-world structure.

Real data has topology: time series are monotonic, user IDs follow Zipfian distributions, and logs arrive nearly sorted. Using one algorithm for everything leaves 73% of potential performance on the table.

How MASH Works

MASH is smart, not just fast. It adapts to your data automatically.

STEP 1

Fingerprint

Zero-Overhead Analysis

Analyzes data entropy, sortedness, and distribution during host-to-device transfer.

STEP 2

Intelligent Router

O(1) Adaptive Logic

Instantly selects the optimal algorithm based on data shape. No manual tuning required.

STEP 3

Execute

Specialized Kernels

Executes one of 7 specialized algorithms (SIMD/Warp optimized) for maximum throughput.

Fast Path (presorted, reverse)

Heavy Path (random, clustered)
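The routing step above can be sketched as a constant-time decision over a precomputed fingerprint. This is a minimal illustration, not MASH's actual router: the `Fingerprint` fields, the `Path` names, and every threshold except the 235 sortedness cutoff quoted elsewhere on this page are assumptions.

```cpp
#include <cstdint>

// Hypothetical fingerprint. Field names and scales are assumptions.
struct Fingerprint {
    std::uint8_t sortedness;      // assumed scale: 0 = fully descending, 255 = fully ascending
    std::uint8_t effective_bits;  // number of low-order key bits that are actually used, 0..64
};

// Hypothetical routing targets, named after the Fast Path / Heavy Path split.
enum class Path { FastPresorted, FastReverse, CompactRadix, FullRadix };

// O(1): a fixed number of comparisons, independent of the input size.
constexpr Path route(const Fingerprint& fp) {
    if (fp.sortedness > 235) return Path::FastPresorted;   // cutoff quoted on this page
    if (fp.sortedness < 20) return Path::FastReverse;      // assumed cutoff
    if (fp.effective_bits <= 32) return Path::CompactRadix; // assumed cutoff
    return Path::FullRadix;
}
```

For example, `route({250, 64})` selects `Path::FastPresorted`, while `route({128, 64})` falls through to `Path::FullRadix`.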

How It Performs

Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.

Benchmark Visualization (interactive): 100,000 integers, std::sort vs. MASH (CPU), per selected data distribution.

Throughput Comparison

std::sort ~2M items/sec

MASH CPU ~15M items/sec

MASH GPU ~800M items/sec

The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.

MASH GPU reaches up to 8 GB/s of throughput, about 53x the MASH CPU rate (~800M vs. ~15M items/sec).

License MASH GPU

GPU + CPU

Production-ready implementations for both platforms. Route intelligently based on workload.

GPU Implementation

Production Ready

Codebase	3,553 LOC
CUDA Version	11.0+
Compute Capability	7.0+ (Volta)
Tested Hardware	RTX PRO 6000, GB10, A10G
Graph Capture	+27% batch
Memory Model	Zero-alloc hot path

CPU Implementation

Production Ready

Algorithms	Adaptive
Standard	C++20
Parallelization	OpenMP 4.5+
Platform	x86-64 Linux
Routing	Cost-model based
GPU Coordination	Hybrid router

Performance Benchmarks

100M uint64_t keys on an NVIDIA RTX PRO 6000 Blackwell, compared against NVIDIA CUB (the industry standard). Every run is verified element-for-element against std::sort.

BENCHMARK_RESULTS.log

Distribution	CUB Time	MASH Time	Speedup
Presorted	10.29 ms	0.79 ms	13.0x
Reverse	9.95 ms	5.02 ms	1.98x
Zipfian (s=1.5)	10.25 ms	5.21 ms	1.97x
Uniform Random	9.94 ms	9.43 ms	1.05x
Pareto (80/20)	9.94 ms	9.49 ms	1.05x
Organ Pipe	9.94 ms	5.00 ms	1.99x
Average	10.05 ms	5.82 ms	3.5x (mean of per-row speedups)
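
The Average row can be checked by hand. The 3.5x figure matches the mean of the six per-distribution speedups, not the ratio of the two average times, which works out to about 1.73x. The sketch below recomputes both from the table values; the names `Row`, `kRows`, `mean_of_speedups`, and `ratio_of_mean_times` are illustrative only.

```cpp
#include <array>
#include <cstddef>

// One table row: CUB time and MASH time in milliseconds.
struct Row { double cub_ms; double mash_ms; };

// Values copied from the benchmark table, in row order.
constexpr std::array<Row, 6> kRows{{
    {10.29, 0.79}, {9.95, 5.02}, {10.25, 5.21},
    {9.94, 9.43}, {9.94, 9.49}, {9.94, 5.00},
}};

// Mean of the per-distribution speedups (CUB time / MASH time).
double mean_of_speedups() {
    double sum = 0.0;
    for (const Row& r : kRows) sum += r.cub_ms / r.mash_ms;
    return sum / static_cast<double>(kRows.size());
}

// Ratio of total CUB time to total MASH time (same as the ratio of the means).
double ratio_of_mean_times() {
    double cub = 0.0, mash = 0.0;
    for (const Row& r : kRows) { cub += r.cub_ms; mash += r.mash_ms; }
    return cub / mash;
}
```

The mean of speedups is dominated by the presorted row, so the two numbers answer different questions: the first describes a typical per-distribution gain, the second the gain on a workload that mixes all six distributions equally.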

All results are reproducible with one-click verification scripts. A cryptographic Merkle chain ensures their integrity.
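
An element-for-element check against std::sort can be as small as the sketch below. It is a host-side illustration of the idea, not the verification script shipped with MASH; the function name `matches_std_sort` is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true when `candidate` is identical, element for element, to the
// original input sorted by std::sort. `input` is taken by value so the
// caller's copy is left untouched.
bool matches_std_sort(std::vector<std::uint64_t> input,
                      const std::vector<std::uint64_t>& candidate) {
    std::sort(input.begin(), input.end());  // reference result
    return input == candidate;              // compares size and every element
}
```

Comparing against a sorted copy of the original input, rather than only checking that the output is ordered, also catches outputs that are sorted but have lost or duplicated keys.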

Full Methodology & Larger Scale Tests

Design Partner

Pricing

This is a design-partner product. Pricing is scoped to your workload and deployment, not a fixed list — and design partners get founder-level access and first-mover terms.

Contact for Pricing · The Design Partner Program

Technical Specifications

Requirements and tested configurations.

SPECIFICATIONS.txt

Requirement	GPU	CPU
Runtime	CUDA 11.0+	C++20, OpenMP
Hardware	CC 7.0+ (Volta)	x86-64
Tested GPUs	RTX PRO 6000, GB10, A10G	-
Data Types	uint64_t	uint64_t
Memory	N×8 bytes temp	N×8 bytes temp
Platform	Linux	Ubuntu 20.04+

Built For Real Workloads

Time-Series Databases

Time-series data is naturally ordered. MASH detects presorted structure and finishes up to 18x faster than CUB.

sortedness > 235 → instant exit
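
The 235 threshold suggests a sortedness score on a 0 to 255 scale, although this page does not define it. One plausible definition, assumed here, is the fraction of adjacent pairs that are already in order, scaled to a byte. The function name `sortedness_score` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed definition: share of adjacent pairs in non-descending order,
// scaled to 0..255. Returns 255 only when every pair is in order.
std::uint8_t sortedness_score(const std::vector<std::uint64_t>& keys) {
    if (keys.size() < 2) return 255;  // zero or one key is trivially sorted
    std::size_t in_order = 0;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        if (keys[i - 1] <= keys[i]) ++in_order;
    }
    return static_cast<std::uint8_t>(in_order * 255 / (keys.size() - 1));
}
```

Under this assumption a score above 235 means roughly 92% or more of adjacent pairs are in order. Only a score of 255 proves the input is fully sorted, so a real fast path would still need to handle the remaining out-of-order pairs.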

Analytics Engines

ORDER BY operations on user activity follow Zipfian distributions. MASH sorts only the bits that actually carry information.

compact key range → 1.95x speedup
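
Sorting only the bits that carry information can be illustrated with a CPU-side LSD radix sort that skips passes over bytes that are zero in every key. This is a sketch of the general technique, not MASH's GPU kernel; `significant_bits` and `compact_radix_sort` are hypothetical names, and the 8-bit digit size is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent x (0 for x == 0).
int significant_bits(std::uint64_t x) {
    int bits = 0;
    while (x != 0) { ++bits; x >>= 1; }
    return bits;
}

// LSD radix sort, 8 bits per pass, that runs only as many passes as the
// largest key requires. Returns the number of passes executed.
int compact_radix_sort(std::vector<std::uint64_t>& keys) {
    if (keys.size() < 2) return 0;
    const std::uint64_t max_key = *std::max_element(keys.begin(), keys.end());
    const int passes = (significant_bits(max_key) + 7) / 8;

    std::vector<std::uint64_t> buffer(keys.size());
    for (int pass = 0; pass < passes; ++pass) {
        const int shift = pass * 8;
        // offsets[d + 1] first counts keys with digit d, then the prefix sum
        // turns offsets[d] into the first output index for digit d.
        std::array<std::size_t, 257> offsets{};
        for (std::uint64_t k : keys) ++offsets[((k >> shift) & 0xFF) + 1];
        for (std::size_t d = 1; d <= 256; ++d) offsets[d] += offsets[d - 1];
        // Stable scatter into the buffer, then swap so `keys` holds the result.
        for (std::uint64_t k : keys) buffer[offsets[(k >> shift) & 0xFF]++] = k;
        keys.swap(buffer);
    }
    return passes;
}
```

Keys that fit in 17 bits need 3 passes instead of the 8 a full 64-bit sort would run, which is the kind of saving a compact key range makes possible.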

Log Processing

Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.

run detection → merge only
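
Run detection followed by merging is the idea behind a natural merge sort. The sketch below finds maximal non-descending runs and merges neighbours pairwise with std::inplace_merge. It illustrates the general technique on the CPU and is not MASH's implementation; `find_runs` and `merge_runs` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Boundaries of maximal non-descending runs, as [0, b1, b2, ..., n].
std::vector<std::size_t> find_runs(const std::vector<std::uint64_t>& keys) {
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < keys.size(); ++i) {
        if (keys[i - 1] > keys[i]) bounds.push_back(i);  // a descent ends a run
    }
    bounds.push_back(keys.size());
    return bounds;
}

// Sorts `keys` by merging neighbouring runs pairwise until one run remains.
void merge_runs(std::vector<std::uint64_t>& keys) {
    auto at = [&keys](std::size_t b) {
        return keys.begin() + static_cast<std::ptrdiff_t>(b);
    };
    std::vector<std::size_t> bounds = find_runs(keys);
    while (bounds.size() > 2) {  // more than one run left
        std::vector<std::size_t> next{0};
        for (std::size_t i = 0; i + 2 < bounds.size(); i += 2) {
            std::inplace_merge(at(bounds[i]), at(bounds[i + 1]), at(bounds[i + 2]));
            next.push_back(bounds[i + 2]);
        }
        // An odd run count leaves the last run unmerged; carry it forward.
        if (next.back() != keys.size()) next.push_back(keys.size());
        bounds = std::move(next);
    }
}
```

An already sorted input is a single run, so `merge_runs` does no merging at all after the detection pass, and a nearly sorted input needs only a few merge rounds.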

Financial Systems

Order books and trade data have local structure. Adaptive routing exploits patterns automatically.

effective_bits routing → fewer passes

Use case · Agentic RAG

The sort hiding in your retrieval pipeline

Every retrieval ends in a top-k sort over millions of similarity scores. On quantized scores (the billion-vector norm) MASH sorts them 3.6× faster than CUB — measured live, verified against std::sort. We time the sort step, not the model.
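
One plausible way to turn a top-k over quantized scores into a uint64_t key sort is to pack the score into the high bits and the document id into the low bits. The layout below (8-bit score, 32-bit id) and the names `pack` and `top_k_ids` are assumptions for illustration, and std::sort stands in for the GPU sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumed layout: quantized score in bits 32..39, document id in bits 0..31.
std::uint64_t pack(std::uint8_t score, std::uint32_t doc_id) {
    return (static_cast<std::uint64_t>(score) << 32) | doc_id;
}

// Returns the ids of the k highest-scoring documents, best first.
// Ties on score are broken by descending document id.
std::vector<std::uint32_t> top_k_ids(std::vector<std::uint64_t> keys, std::size_t k) {
    std::sort(keys.begin(), keys.end(), std::greater<>());  // stand-in for the GPU sort
    k = std::min(k, keys.size());
    std::vector<std::uint32_t> ids;
    ids.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        ids.push_back(static_cast<std::uint32_t>(keys[i] & 0xFFFFFFFFu));
    }
    return ids;
}
```

With this layout every key uses at most 40 of its 64 bits, which is the kind of compact key range the fingerprinting step is described as detecting.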


Ready to Get Started?

Start with a free 30-day trial. Prove the speedup. Become the internal champion.

Start Free Trial · Read Full Benchmarks