MASH Sort
Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on all six benchmarked distributions.
The One-Size-Fits-All Problem
NVIDIA CUB and Thrust are the industry standard for GPU sorting. They use radix sort, which is excellent for uniform random data but blind to real-world structure.
Real data has structure: time series are monotonic, user IDs follow Zipfian distributions, and logs arrive nearly sorted. Using one algorithm for everything leaves performance on the table: across the six distributions benchmarked below, CUB averages 9.74 ms where MASH averages 3.90 ms.
How MASH Works
MASH is smart, not just fast. It adapts to your data automatically.
Fingerprint
Zero-Overhead Analysis
Analyzes data entropy, sortedness, and distribution during host-to-device transfer.
Intelligent Router
O(1) Adaptive Logic
Instantly selects the optimal algorithm based on data shape. No manual tuning required.
Execute
Specialized Kernels
Executes one of seven specialized algorithms (SIMD- and warp-optimized) for maximum throughput.
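The three steps above can be sketched on the CPU as follows. This is a minimal illustration, not MASH's implementation: the `Fingerprint` fields, the `Route` names, and the three-way dispatch are assumptions made for the example, and the fingerprint measures only adjacent-pair order (not entropy or distribution). `std::sort` stands in for whichever general-purpose kernel would be selected.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fingerprint: only adjacent-pair order is measured here.
struct Fingerprint {
    double ascending;   // fraction of adjacent pairs with a[i] <= a[i+1]
    double descending;  // fraction of adjacent pairs with a[i] >= a[i+1]
};

// Hypothetical routes; the seven MASH algorithms are not named in this sketch.
enum class Route { AlreadySorted, Reversed, General };

// Step 1 (Fingerprint): one linear pass over adjacent pairs.
Fingerprint compute_fingerprint(const std::vector<std::uint64_t>& keys) {
    if (keys.size() < 2) return {1.0, 1.0};  // trivially ordered both ways
    std::size_t up = 0, down = 0;
    for (std::size_t i = 0; i + 1 < keys.size(); ++i) {
        if (keys[i] <= keys[i + 1]) ++up;
        if (keys[i] >= keys[i + 1]) ++down;
    }
    const double pairs = static_cast<double>(keys.size() - 1);
    return {static_cast<double>(up) / pairs, static_cast<double>(down) / pairs};
}

// Step 2 (Router): a fixed number of comparisons, independent of input size.
Route select_route(const Fingerprint& f) {
    if (f.ascending == 1.0) return Route::AlreadySorted;
    if (f.descending == 1.0) return Route::Reversed;
    return Route::General;
}

// Step 3 (Execute): run the chosen path and report which one was taken.
Route adaptive_sort(std::vector<std::uint64_t>& keys) {
    const Route route = select_route(compute_fingerprint(keys));
    switch (route) {
        case Route::AlreadySorted:
            break;  // nothing to do
        case Route::Reversed:
            std::reverse(keys.begin(), keys.end());  // O(N) fix-up
            break;
        case Route::General:
            std::sort(keys.begin(), keys.end());  // general fallback
            break;
    }
    return route;
}
```

Calling `adaptive_sort` on a vector sorts it in place and returns the route taken, which makes the routing decision easy to log or assert on in a benchmark harness.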
How It Performs
Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.
Throughput Comparison
The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.
MASH GPU sorts up to 8 billion keys per second (10M uint64_t keys in 1.25 ms on presorted input), 53x faster than CPU.
GPU + CPU
Production-ready implementations for both platforms. Route intelligently based on workload.
GPU Implementation
Production Ready
CPU Implementation
Production Ready
Performance Benchmarks
10M uint64_t elements on NVIDIA GB10. Compared against NVIDIA CUB (industry standard).
| Distribution | CUB Time | MASH Time | Speedup |
|---|---|---|---|
| Presorted | 9.48 ms | 1.25 ms | 7.58x |
| Reverse | 10.06 ms | 1.50 ms | 6.71x |
| Zipfian (s=1.5) | 9.38 ms | 2.05 ms | 4.59x |
| Uniform Random | 10.22 ms | 6.23 ms | 1.64x |
| Pareto (80/20) | 9.65 ms | 6.10 ms | 1.58x |
| Organ Pipe | 9.64 ms | 6.27 ms | 1.54x |
| Average | 9.74 ms | 3.90 ms | 2.50x |
All results are reproducible with one-click verification scripts. A cryptographic Merkle chain ensures the integrity of the recorded results.
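To check the Speedup column by hand, the arithmetic can be sketched as below. The sketch assumes speedup means CUB time divided by MASH time, and that an aggregate speedup is the ratio of the two mean times; the function names are illustrative and are not part of the verification scripts. The times are copied from the six distribution rows of the table.

```cpp
#include <numeric>
#include <vector>

// Times in milliseconds, copied from the six distribution rows of the table.
const std::vector<double> cub_ms{9.48, 10.06, 9.38, 10.22, 9.65, 9.64};
const std::vector<double> mash_ms{1.25, 1.50, 2.05, 6.23, 6.10, 6.27};

// Speedup of a candidate over a baseline: baseline time / candidate time.
double speedup(double baseline_ms, double candidate_ms) {
    return baseline_ms / candidate_ms;
}

// Arithmetic mean of a non-empty list of times.
double mean_ms(const std::vector<double>& times) {
    return std::accumulate(times.begin(), times.end(), 0.0) /
           static_cast<double>(times.size());
}

// Aggregate speedup defined as the ratio of mean times (an assumption).
// The mean of the per-row speedups is a different statistic.
double aggregate_speedup() {
    return speedup(mean_ms(cub_ms), mean_ms(mash_ms));
}
```

Because the two ways of averaging give different numbers, it is worth stating which one a headline figure uses.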
Pricing
From proof-of-concept to production. No per-core fees. No royalties on your success.
Developer
Prove MASH is faster. Become the internal champion.
- Static library + headers
- CPU demo + GPU binary
- Internal benchmarking
- No redistribution
- No production deployment
Project
Ship MASH in your product. Flat fee, unlimited scale.
- Everything in Developer
- Production deployment
- Binary redistribution
- Unlimited end-users
- Email support
No per-core fees. No royalties.
Strategic
For cloud providers, GPU vendors, and database platforms.
- Source code access
- Custom integration support
- Joint roadmap input
- Flexible deal structures
- Partnership opportunities
Qualified buyers only
Why Flat Pricing?
Scale to 1000 nodes. Same price.
Your success doesn't cost you more.
Budget it once. Ship with confidence.
Technical Specifications
Requirements and tested configurations.
| Requirement | GPU | CPU |
|---|---|---|
| Runtime | CUDA 11.0+ | C++20, OpenMP |
| Hardware | Compute capability 7.0+ (Volta or newer) | x86-64 |
| Tested GPUs | GB10, A10G, RTX 4000/5000 | - |
| Data Types | uint64_t | uint64_t |
| Memory | N × 8 bytes of temporary storage | N × 8 bytes of temporary storage |
| Platform | Linux | Ubuntu 20.04+ |
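For capacity planning, the Memory row can be turned into a quick calculation. This is a sketch only: the helper names are hypothetical, and `total_bytes` assumes the footprint is just the keys plus the temporary buffer, with no other allocations counted.

```cpp
#include <cstddef>
#include <cstdint>

// Temporary storage for N keys: N × 8 bytes, per the specification table.
constexpr std::size_t temp_bytes(std::size_t n) {
    return n * sizeof(std::uint64_t);
}

// Keys plus temporary storage. Assumption: nothing else is allocated.
constexpr std::size_t total_bytes(std::size_t n) {
    return 2 * temp_bytes(n);
}

// The 10M-key benchmark size needs 80 MB of temporary storage.
static_assert(temp_bytes(10'000'000) == 80'000'000,
              "10M uint64_t keys need 80,000,000 bytes of temporary storage");
```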
Built For Real Workloads
Time-Series Databases
Time-series data is naturally ordered. MASH detects presorted structure and exits early: 1.25 ms versus 9.48 ms for CUB on 10M keys.
Analytics Engines
ORDER BY operations on user activity sort keys that follow Zipfian distributions. Sample sort handles the skew efficiently.
Log Processing
Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.
Financial Systems
Order books and trade data have local structure. Adaptive routing exploits patterns automatically.
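To make the nearly-sorted case concrete, here is one well-known way to exploit existing order: merging natural ascending runs. It is a stand-in chosen for illustration, not a description of the kernel MASH uses, and `sort_by_runs` is a hypothetical name. On fully sorted input the loop finds a single run and moves no data; the cost grows with the number of runs, so it only pays off when runs are few.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts in place by merging natural ascending runs, left to right.
// Returns the number of runs found (1 means the input was already sorted,
// 0 means it was empty).
std::size_t sort_by_runs(std::vector<std::uint64_t>& keys) {
    std::size_t runs = 0;
    auto sorted_end = keys.begin();  // [begin, sorted_end) is sorted so far
    while (sorted_end != keys.end()) {
        // Find where the next ascending run stops.
        auto run_end = std::is_sorted_until(sorted_end, keys.end());
        // Merge that run into the sorted prefix (no-op for the first run).
        std::inplace_merge(keys.begin(), sorted_end, run_end);
        sorted_end = run_end;
        ++runs;
    }
    return runs;
}
```

A log stream with a handful of out-of-order batches produces a handful of runs, so the returned run count doubles as a cheap measure of how sorted the input was.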
Ready to Get Started?
Start with a free 30-day trial. Prove the speedup. Become the internal champion.