MASH Sort
Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on every tested distribution.
Watch It Beat CUB, Live
90 seconds, no edits, no hand-picked numbers. Every result is measured live on the GPU and verified correct against std::sort before its speed is shown.
Unedited · measured live on RTX PRO 6000 Blackwell · verified against std::sort
The One-Size-Fits-All Problem
NVIDIA CUB and Thrust are the industry standard for GPU sorting. They use radix sort, which is excellent for uniform random data but blind to real-world structure.
Real data has topology: time series are monotonic, user IDs follow Zipfian distributions, and logs arrive nearly sorted. Using one algorithm for everything leaves 73% of potential performance on the table.
How MASH Works
MASH is smart, not just fast. It adapts to your data automatically.
Fingerprint
Zero-Overhead Analysis
Analyzes data entropy, sortedness, and distribution during host-to-device transfer.
Intelligent Router
O(1) Adaptive Logic
Instantly selects the optimal algorithm based on data shape. No manual tuning required.
Execute
Specialized Kernels
Executes one of 7 specialized algorithms (SIMD- and warp-optimized) for maximum throughput.
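The fingerprint-then-route flow above can be sketched on the host in a few lines. This is an illustration only, not MASH's implementation: the `Fingerprint` fields, the `Strategy` labels, and the 95% and 32-bit thresholds are all assumptions, and the sketch does not model overlapping the analysis with the host-to-device transfer. It shows one linear pass that gathers sortedness and bit-usage statistics, followed by a constant-time routing decision.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary gathered in a single pass over the keys.
struct Fingerprint {
    std::size_t n = 0;
    std::size_t ascending_pairs = 0;   // adjacent pairs with a[i] <= a[i+1]
    std::size_t descending_pairs = 0;  // adjacent pairs with a[i] >= a[i+1]
    int varying_bits = 0;              // bit positions where at least two keys differ
};

// Hypothetical strategy labels; these are not MASH's algorithm names.
enum class Strategy { AlreadySorted, Reverse, RepairNearlySorted, NarrowRadix, FullRadix };

// Counts set bits by clearing the lowest set bit until none remain.
int count_set_bits(std::uint64_t x) {
    int count = 0;
    while (x != 0) { x &= x - 1; ++count; }
    return count;
}

Fingerprint fingerprint(const std::vector<std::uint64_t>& keys) {
    Fingerprint fp;
    fp.n = keys.size();
    if (keys.empty()) return fp;
    std::uint64_t all_or = keys[0];
    std::uint64_t all_and = keys[0];
    for (std::size_t i = 0; i + 1 < keys.size(); ++i) {
        if (keys[i] <= keys[i + 1]) ++fp.ascending_pairs;
        if (keys[i] >= keys[i + 1]) ++fp.descending_pairs;
        all_or |= keys[i + 1];
        all_and &= keys[i + 1];
    }
    // A bit varies if it is set in some key (OR) but not in every key (AND).
    fp.varying_bits = count_set_bits(all_or ^ all_and);
    return fp;
}

// Constant-time decision once the fingerprint exists. Thresholds are assumed.
Strategy route(const Fingerprint& fp) {
    const std::size_t pairs = fp.n > 1 ? fp.n - 1 : 0;
    assert(fp.ascending_pairs <= pairs && fp.descending_pairs <= pairs);
    if (fp.ascending_pairs == pairs) return Strategy::AlreadySorted;
    if (fp.descending_pairs == pairs) return Strategy::Reverse;
    if (fp.ascending_pairs * 100 >= pairs * 95) return Strategy::RepairNearlySorted;
    if (fp.varying_bits <= 32) return Strategy::NarrowRadix;
    return Strategy::FullRadix;
}
```

Calling `route(fingerprint(keys))` on an ascending vector yields `Strategy::AlreadySorted`. The routing step only reads four integers, which is what keeps it O(1) regardless of input size.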
How It Performs
Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.
Throughput Comparison
The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.
MASH GPU achieves up to 8 GB/s throughput, 53x faster than CPU.
GPU + CPU
Production-ready implementations for both platforms. Route intelligently based on workload.
GPU Implementation
Production Ready
CPU Implementation
Production Ready
Performance Benchmarks
Benchmarked with 100M uint64_t keys on an NVIDIA RTX PRO 6000 Blackwell against NVIDIA CUB (the industry standard). Every run is verified element-for-element against std::sort.
| Distribution | CUB Time | MASH Time | Speedup |
|---|---|---|---|
| Presorted | 10.29 ms | 0.79 ms | 13.0x |
| Reverse | 9.95 ms | 5.02 ms | 1.98x |
| Zipfian (s=1.5) | 10.25 ms | 5.21 ms | 1.97x |
| Uniform Random | 9.94 ms | 9.43 ms | 1.05x |
| Pareto (80/20) | 9.94 ms | 9.49 ms | 1.05x |
| Organ Pipe | 9.94 ms | 5.00 ms | 1.99x |
| Average | 10.05 ms | 5.82 ms | 3.5x |
The average speedup is the mean of the six per-distribution speedups. All results are reproducible with one-click verification scripts, and a cryptographic Merkle chain ensures their integrity.
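To make the Average row easy to check, the short sketch below recomputes it from the timings in the table. The struct and function names are illustrative. It computes two different summaries: the mean of the per-distribution speedups, which comes out near the 3.5x shown in the table, and the ratio of the mean times (10.05 ms / 5.82 ms), which is about 1.73x.

```cpp
#include <cassert>
#include <vector>

struct Row {
    const char* distribution;
    double cub_ms;
    double mash_ms;
};

// Timings copied from the benchmark table above.
std::vector<Row> benchmark_rows() {
    return {
        {"Presorted", 10.29, 0.79},
        {"Reverse", 9.95, 5.02},
        {"Zipfian (s=1.5)", 10.25, 5.21},
        {"Uniform Random", 9.94, 9.43},
        {"Pareto (80/20)", 9.94, 9.49},
        {"Organ Pipe", 9.94, 5.00},
    };
}

// Arithmetic mean of the per-row speedups (CUB time / MASH time).
double mean_of_speedups(const std::vector<Row>& rows) {
    assert(!rows.empty());
    double sum = 0.0;
    for (const Row& r : rows) sum += r.cub_ms / r.mash_ms;
    return sum / static_cast<double>(rows.size());
}

// Speedup on total time: sum of CUB times divided by sum of MASH times.
double ratio_of_mean_times(const std::vector<Row>& rows) {
    assert(!rows.empty());
    double cub = 0.0;
    double mash = 0.0;
    for (const Row& r : rows) { cub += r.cub_ms; mash += r.mash_ms; }
    return cub / mash;
}
```

The two numbers answer different questions. The mean of speedups weights each distribution equally and is dominated by the presorted case, while the ratio of times describes a workload that contains all six distributions in equal volume.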
Pricing
This is a design-partner product. Pricing is scoped to your workload and deployment rather than a fixed list, and design partners get founder-level access and first-mover terms.
Technical Specifications
Requirements and tested configurations.
| Requirement | GPU | CPU |
|---|---|---|
| Runtime | CUDA 11.0+ | C++20, OpenMP |
| Hardware | CC 7.0+ (Volta) | x86-64 |
| Tested GPUs | RTX PRO 6000, GB10, A10G | - |
| Data Types | uint64_t | uint64_t |
| Memory | N×8 bytes temp | N×8 bytes temp |
| Platform | Linux | Ubuntu 20.04+ |
Built For Real Workloads
Time-Series Databases
Time-series data is naturally ordered. MASH detects presorted structure and exits early, finishing up to 18x faster.
Analytics Engines
ORDER BY operations on user activity follow Zipfian distributions. MASH sorts only the bits that actually carry information.
Log Processing
Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.
Financial Systems
Order books and trade data have local structure. Adaptive routing exploits patterns automatically.
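The Analytics Engines idea of sorting only the bits that carry information can be illustrated with a plain CPU radix sort. This is a minimal sketch under assumptions, not MASH's kernel: the function name is hypothetical, and it uses 8-bit digits on the host. It finds the bit positions where keys actually differ, then skips every radix pass whose digit is identical across all keys.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts keys ascending with an LSD radix sort over 8-bit digits, skipping
// every digit whose bits are the same in all keys.
// Returns the number of passes that actually ran (0 to 8).
int radix_sort_skip_constant_digits(std::vector<std::uint64_t>& keys) {
    if (keys.size() < 2) return 0;

    std::uint64_t all_or = 0;
    std::uint64_t all_and = ~std::uint64_t{0};
    for (std::uint64_t k : keys) { all_or |= k; all_and &= k; }
    const std::uint64_t varying = all_or ^ all_and;  // 1 where keys disagree

    std::vector<std::uint64_t> buffer(keys.size());
    int passes = 0;
    for (int shift = 0; shift < 64; shift += 8) {
        // A digit that is constant across all keys cannot change the order.
        if (((varying >> shift) & 0xFF) == 0) continue;

        // Histogram of digit values, stored one slot to the right.
        std::array<std::size_t, 257> offsets{};
        for (std::uint64_t k : keys) ++offsets[((k >> shift) & 0xFF) + 1];
        // Prefix sum: offsets[d] becomes the first output index for digit d.
        for (std::size_t d = 0; d < 256; ++d) offsets[d + 1] += offsets[d];
        // Stable scatter into the buffer, then swap it in as the new input.
        for (std::uint64_t k : keys) buffer[offsets[(k >> shift) & 0xFF]++] = k;
        keys.swap(buffer);
        ++passes;
    }
    assert(passes <= 8);
    return passes;
}
```

For 64-bit keys whose values all fit in 16 bits, this runs at most two passes instead of eight, which is the kind of saving a skewed ID distribution with a small value range can expose.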
Use case · Agentic RAG
The sort hiding in your retrieval pipeline
Every retrieval ends in a top-k sort over millions of similarity scores. On quantized scores (the norm at billion-vector scale), MASH sorts them 3.6x faster than CUB, measured live and verified against std::sort. We time the sort step, not the model.
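The measurement protocol described here, timing only the sort step and checking the output against std::sort, can be sketched as a small host-side harness. The names `TimedResult` and `time_and_verify` are hypothetical, and the sketch uses a wall-clock timer for simplicity; GPU measurements would more typically use CUDA events around the kernel launches.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

struct TimedResult {
    bool matches_reference;  // output equals std::sort output, element for element
    double milliseconds;     // wall-clock time of the sort step only
};

// Runs sort_under_test on a private copy of the keys. The std::sort reference
// is built outside the timed region so it does not affect the measurement.
template <class SortFn>
TimedResult time_and_verify(std::vector<std::uint64_t> keys, SortFn sort_under_test) {
    std::vector<std::uint64_t> reference = keys;
    std::sort(reference.begin(), reference.end());

    const auto start = std::chrono::steady_clock::now();
    sort_under_test(keys);
    const auto stop = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    assert(ms >= 0.0);  // steady_clock never goes backwards
    return {keys == reference, ms};
}
```

Any callable that sorts a `std::vector<std::uint64_t>&` in place can be passed as `sort_under_test`. A result is only worth reporting when `matches_reference` is true, which mirrors the page's rule that speed is shown after correctness is verified.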
Ready to Get Started?
Start with a free 30-day trial. Prove the speedup. Become the internal champion.