GPU + CPU Adaptive Sorting

MASH Sort

Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on every distribution.

Up to 18× faster than CUB
17.8x
Presorted
1.95x
Zipfian
1.95x
Reverse

Watch It Beat CUB — Live

90 seconds, no edits, no hand-picked numbers. Every result is measured live on the GPU and verified correct against std::sort before its speed is shown.

unedited · measured live on RTX PRO 6000 Blackwell · verified against std::sort  ·  watch on YouTube

The One-Size-Fits-All Problem

NVIDIA CUB and Thrust are the industry standard for GPU sorting. They use radix sort, excellent for uniform random data, but blind to real-world structure.

Real data has topology: time series are monotonic, user IDs follow Zipfian distributions, logs arrive nearly-sorted. Using one algorithm for everything leaves 73% of potential performance on the table.

How MASH Works

MASH is smart, not just fast. It adapts to your data automatically.

STEP 1

Fingerprint

Zero-Overhead Analysis

Analyzes data entropy, sortedness, and distribution during host-to-device transfer.

STEP 2

Intelligent Router

O(1) Adaptive Logic

Instantly selects the optimal algorithm based on data shape. No manual tuning required.

STEP 3

Execute

Specialized Kernels

Executes one of 7 specialized algorithms (SIMD/Warp optimized) for maximum throughput.

How It Performs

Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.

Benchmark Visualization 100,000 integers
std::sort --
MASH (CPU) --
> Select a data distribution to begin...

Throughput Comparison

std::sort ~2M items/sec
MASH CPU ~15M items/sec
MASH GPU ~800M items/sec

The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.

MASH GPU achieves up to 8 GB/s throughput, 53x faster than CPU.

License MASH GPU

GPU + CPU

Production-ready implementations for both platforms. Route intelligently based on workload.

GPU Implementation

Production Ready

Codebase 3,553 LOC
CUDA Version 11.0+
Compute Capability 7.0+ (Volta)
Tested Hardware RTX PRO 6000, GB10, A10G
Graph Capture +27% batch
Memory Model Zero-alloc hot path

CPU Implementation

Production Ready

Algorithms Adaptive
Standard C++20
Parallelization OpenMP 4.5+
Platform x86-64 Linux
Routing Cost-model based
GPU Coordination Hybrid router

Performance Benchmarks

100M uint64_t keys on an NVIDIA RTX PRO 6000 Blackwell. Compared against NVIDIA CUB (industry standard), every run verified element-for-element against std::sort.

BENCHMARK_RESULTS.log
Distribution CUB Time MASH Time Speedup
Presorted 10.29 ms 0.79 ms 13.0x
Reverse 9.95 ms 5.02 ms 1.98x
Zipfian (s=1.5) 10.25 ms 5.21 ms 1.97x
Uniform Random 9.94 ms 9.43 ms 1.05x
Pareto (80/20) 9.94 ms 9.49 ms 1.05x
Organ Pipe 9.94 ms 5.00 ms 1.99x
Average 10.05 ms 5.82 ms 3.5x

All results reproducible with one-click verification scripts. Cryptographic Merkle chain ensures integrity.

Design Partner

Pricing

This is a design-partner product. Pricing is scoped to your workload and deployment, not a fixed list — and design partners get founder-level access and first-mover terms.

Technical Specifications

Requirements and tested configurations.

SPECIFICATIONS.txt
Requirement GPU CPU
Runtime CUDA 11.0+ C++20, OpenMP
Hardware CC 7.0+ (Volta) x86-64
Tested GPUs RTX PRO 6000, GB10, A10G -
Data Types uint64_t uint64_t
Memory N×8 bytes temp N×8 bytes temp
Platform Linux Ubuntu 20.04+

Built For Real Workloads

Time-Series Databases

Time-series data is naturally ordered. MASH detects presorted structure and exits in up to 18x less time.

sortedness > 235 → instant exit

Analytics Engines

ORDER BY operations on user activity follow Zipfian distributions. MASH sorts only the bits that actually carry information.

compact key range → 1.95x speedup

Log Processing

Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.

run detection → merge only

Financial Systems

Order books and trade data have local structure. Adaptive routing exploits patterns automatically.

effective_bits routing → fewer passes

Use case · Agentic RAG

The sort hiding in your retrieval pipeline

Every retrieval ends in a top-k sort over millions of similarity scores. On quantized scores (the billion-vector norm) MASH sorts them 3.6× faster than CUB — measured live, verified against std::sort. We time the sort step, not the model.

watch on YouTube

Ready to Get Started?

Start with a free 30-day trial. Prove the speedup. Become the internal champion.