GPU + CPU Adaptive Sorting

MASH Sort

Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on every benchmarked distribution.

3.69x average speedup
8.6x Presorted
4.59x Zipfian
1.64x Uniform

The One-Size-Fits-All Problem

NVIDIA CUB and Thrust are the industry standard for GPU sorting. They use radix sort, which is excellent for uniform random data but blind to real-world structure.

Real data has topology: time series are monotonic, user IDs follow Zipfian distributions, logs arrive nearly sorted. Using one algorithm for everything leaves 73% of potential performance on the table.

How MASH Works

MASH is smart, not just fast. It adapts to your data automatically.

STEP 1

Fingerprint

Zero-Overhead Analysis

Analyzes data entropy, sortedness, and distribution during host-to-device transfer.
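A minimal sketch of what such a fingerprint could look like, written as a single CPU pass over the keys. The struct, the function name, and the 0-255 score scale are assumptions for illustration (the scale is suggested by thresholds such as "sortedness > 235" elsewhere on this page). It does not reproduce MASH's actual analysis or its overlap with the host-to-device copy.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fingerprint record; field names and the 0-255 scale are assumptions.
struct Fingerprint {
    std::uint8_t sortedness;  // 255 = every adjacent pair is in non-decreasing order
    std::uint8_t entropy;     // 255 = the low byte of the keys is uniformly distributed
};

Fingerprint fingerprint(const std::vector<std::uint64_t>& keys) {
    assert(keys.size() >= 2 && "need at least one adjacent pair");

    std::size_t ordered = 0;               // adjacent pairs already in order
    std::array<std::size_t, 256> hist{};   // histogram of each key's low byte
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i > 0 && keys[i - 1] <= keys[i]) ++ordered;
        ++hist[keys[i] & 0xFF];
    }

    // Shannon entropy of the low byte, in bits (range 0 to 8).
    double h = 0.0;
    for (std::size_t c : hist) {
        if (c == 0) continue;
        const double p = static_cast<double>(c) / static_cast<double>(keys.size());
        h -= p * std::log2(p);
    }

    Fingerprint fp;
    fp.sortedness = static_cast<std::uint8_t>(255 * ordered / (keys.size() - 1));
    fp.entropy = static_cast<std::uint8_t>(std::lround(h / 8.0 * 255.0));
    return fp;
}
```

An ascending input scores a sortedness of 255 and a strictly descending one scores 0, so a single byte is enough to separate the presorted and reverse cases.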

STEP 2

Intelligent Router

O(1) Adaptive Logic

Instantly selects the optimal algorithm based on data shape. No manual tuning required.
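As a rough illustration of constant-time routing, the sketch below maps two fingerprint scores to a strategy with a fixed number of comparisons. The two thresholds are the ones quoted in the use cases on this page. The type names, the strategy list, and the radix fallback are assumptions, since the page does not list the 7 algorithms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fingerprint scores on a 0-255 scale (the scale is an assumption).
struct Scores {
    std::uint8_t sortedness;
    std::uint8_t clustering;
};

// Hypothetical subset of strategies; the real set of 7 is not listed on this page.
enum class Strategy { VerifyAndExit, SampleSort, RadixSort };

// Constant time: a fixed number of comparisons, independent of input size.
constexpr Strategy route(Scores s) {
    if (s.sortedness > 235) return Strategy::VerifyAndExit;  // threshold quoted on this page
    if (s.clustering > 200) return Strategy::SampleSort;     // threshold quoted on this page
    return Strategy::RadixSort;                              // assumed general-purpose fallback
}
```

Because the decision depends only on a few precomputed scores, its cost stays the same whether the input holds a thousand keys or a billion.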

STEP 3

Execute

Specialized Kernels

Executes one of 7 specialized algorithms (SIMD/Warp optimized) for maximum throughput.

How It Performs

Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.

Benchmark Visualization (interactive): 100,000 integers, std::sort vs. MASH (CPU)

Throughput Comparison

std::sort ~2M items/sec
MASH CPU ~15M items/sec
MASH GPU ~800M items/sec

The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.

MASH GPU achieves up to 8 GB/s throughput. At ~800M items/sec it is about 53x faster than MASH CPU (~15M items/sec).

License MASH GPU

GPU + CPU

Production-ready implementations for both platforms. Route intelligently based on workload.

GPU Implementation

Production Ready

Codebase: 3,553 LOC
CUDA Version: 11.0+
Compute Capability: 7.0+ (Volta)
Tested Hardware: GB10, A10G, RTX
Graph Capture: +27% batch
Memory Model: Zero-alloc hot path

CPU Implementation

Production Ready

Algorithms: Adaptive
Standard: C++20
Parallelization: OpenMP 4.5+
Platform: x86-64 Linux
Routing: Cost-model based
GPU Coordination: Hybrid router

Performance Benchmarks

10M uint64_t elements on NVIDIA GB10. Compared against NVIDIA CUB (industry standard).

BENCHMARK_RESULTS.log
Distribution       CUB Time    MASH Time   Speedup
Presorted          9.48 ms     1.25 ms     8.6x
Reverse            10.06 ms    1.50 ms     6.71x
Zipfian (s=1.5)    9.38 ms     2.05 ms     4.59x
Uniform Random     10.22 ms    6.23 ms     1.64x
Pareto (80/20)     9.65 ms     6.10 ms     1.58x
Organ Pipe         9.64 ms     6.27 ms     1.54x
Average            9.74 ms     3.90 ms     3.69x

All results are reproducible with one-click verification scripts. A cryptographic Merkle chain ensures their integrity.
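For readers who want a feel for the input shapes before running the verification scripts, here is a small sketch of generators for four of the distributions in the table, plus a timing helper. All names here are hypothetical and this is not the MASH benchmark harness. It runs on the CPU and times whatever sort function is passed in.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

using Keys = std::vector<std::uint64_t>;

Keys presorted(std::size_t n) {
    Keys v(n);
    std::iota(v.begin(), v.end(), std::uint64_t{0});  // 0, 1, 2, ...
    return v;
}

Keys reversed(std::size_t n) {
    Keys v = presorted(n);
    std::reverse(v.begin(), v.end());
    return v;
}

// Organ pipe: values rise through the first half and fall through the second.
Keys organ_pipe(std::size_t n) {
    Keys v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = std::min(i, n - 1 - i);
    return v;
}

Keys uniform_random(std::size_t n, std::uint64_t seed) {
    std::mt19937_64 rng(seed);  // a fixed seed keeps runs repeatable
    Keys v(n);
    for (auto& x : v) x = rng();
    return v;
}

// Times one call of sort_fn on a private copy of the input, in milliseconds.
template <class SortFn>
double time_ms(Keys input, SortFn sort_fn) {
    assert(!input.empty());
    const auto t0 = std::chrono::steady_clock::now();
    sort_fn(input);
    const auto t1 = std::chrono::steady_clock::now();
    assert(std::is_sorted(input.begin(), input.end()));  // reject timings of wrong results
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A baseline measurement would then be something like `time_ms(organ_pipe(10'000'000), [](Keys& v) { std::sort(v.begin(), v.end()); })`, repeated several times to smooth out noise.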

Pricing

From proof-of-concept to production. No per-core fees. No royalties on your success.

POC

Developer

Prove MASH is faster. Become the internal champion.

Free 30-day trial
or $499/seat perpetual
  • Static library + headers
  • CPU demo + GPU binary
  • Internal benchmarking
  • No redistribution
  • No production deployment
Start Free Trial
Most Popular

Project

Ship MASH in your product. Flat fee, unlimited scale.

$15,000 /year
Single application, unlimited nodes
  • Everything in Developer
  • Production deployment
  • Binary redistribution
  • Unlimited end-users
  • Email support
Get Project License

No per-core fees. No royalties.

Strategic

For cloud providers, GPU vendors, and database platforms.

Custom Terms
Structured for your business model
  • Source code access
  • Custom integration support
  • Joint roadmap input
  • Flexible deal structures
  • Partnership opportunities
Start Conversation

Qualified buyers only

Why Flat Pricing?

No Per-Core

Scale to 1000 nodes. Same price.

No Royalties

Your success doesn't cost you more.

Predictable

Budget it once. Ship with confidence.

Technical Specifications

Requirements and tested configurations.

SPECIFICATIONS.txt
Requirement    GPU                          CPU
Runtime        CUDA 11.0+                   C++20, OpenMP
Hardware       CC 7.0+ (Volta)              x86-64
Tested GPUs    GB10, A10G, RTX 4000/5000    -
Data Types     uint64_t                     uint64_t
Memory         N×8 bytes temp               N×8 bytes temp
Platform       Linux                        Ubuntu 20.04+

Built For Real Workloads

Time-Series Databases

Time-series data is naturally ordered. MASH detects the presorted structure and exits early, finishing 8.6x faster than CUB.

sortedness > 235 → instant exit

Analytics Engines

ORDER BY operations on user activity follow Zipfian distributions. Sample sort handles skew efficiently.

clustering > 200 → 4.59x speedup
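To show why sample sort copes with skew, here is a compact sequential sketch: splitters are drawn from a random sample of the data itself, so bucket boundaries follow the actual key distribution instead of fixed bit ranges. The function name, the oversampling factor of 8, and the use of std::sort inside each bucket are assumptions for illustration, not MASH's kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<std::uint64_t> sample_sort(const std::vector<std::uint64_t>& keys,
                                       std::size_t buckets, std::uint64_t seed) {
    assert(buckets >= 1);
    if (keys.size() < 2) return keys;

    // 1. Draw an oversampled random sample (factor 8 is an assumption) and sort it.
    constexpr std::size_t kOversample = 8;
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, keys.size() - 1);
    std::vector<std::uint64_t> sample(buckets * kOversample);
    for (auto& s : sample) s = keys[pick(rng)];
    std::sort(sample.begin(), sample.end());

    // 2. Every kOversample-th sample becomes a splitter: buckets - 1 in total.
    std::vector<std::uint64_t> splitters;
    for (std::size_t b = 1; b < buckets; ++b) splitters.push_back(sample[b * kOversample]);

    // 3. Distribute: bucket b receives keys k with splitters[b-1] <= k < splitters[b].
    std::vector<std::vector<std::uint64_t>> bins(buckets);
    for (std::uint64_t k : keys) {
        const auto it = std::upper_bound(splitters.begin(), splitters.end(), k);
        bins[static_cast<std::size_t>(it - splitters.begin())].push_back(k);
    }

    // 4. Sort each bucket independently (the parallelizable step) and concatenate.
    std::vector<std::uint64_t> out;
    out.reserve(keys.size());
    for (auto& bin : bins) {
        std::sort(bin.begin(), bin.end());
        out.insert(out.end(), bin.begin(), bin.end());
    }
    return out;
}
```

The buckets never overlap in key range, so each one can be sorted independently of the others.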

Log Processing

Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.

run detection → merge only
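The "run detection → merge only" idea can be sketched as a natural merge sort: find the runs that are already in order, then merge neighbouring runs until one remains. The function names below are hypothetical and the sketch is sequential CPU code built on std::inplace_merge, not MASH's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Start index of every maximal non-decreasing run, followed by v.size() as a sentinel.
std::vector<std::size_t> find_runs(const std::vector<std::uint64_t>& v) {
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] < v[i - 1]) bounds.push_back(i);  // order breaks here: a new run starts
    bounds.push_back(v.size());
    return bounds;
}

// Sorts v by merging neighbouring runs pairwise until a single run remains.
void merge_runs(std::vector<std::uint64_t>& v) {
    auto at = [&v](std::size_t i) { return v.begin() + static_cast<std::ptrdiff_t>(i); };
    std::vector<std::size_t> bounds = find_runs(v);
    while (bounds.size() > 2) {  // more than one run left
        std::vector<std::size_t> next{0};
        for (std::size_t i = 0; i + 2 < bounds.size(); i += 2) {
            std::inplace_merge(at(bounds[i]), at(bounds[i + 1]), at(bounds[i + 2]));
            next.push_back(bounds[i + 2]);
        }
        if (next.back() != v.size()) next.push_back(v.size());  // odd run carried to next round
        bounds = std::move(next);
    }
    assert(std::is_sorted(v.begin(), v.end()));
}
```

A fully sorted input yields a single run, so the merge loop never executes and the only cost is the detection pass. An input with r runs needs about log2(r) merge rounds.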

Financial Systems

Order books and trade data have local structure. Adaptive routing exploits patterns automatically.

effective_bits routing → fewer passes
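One plausible reading of "effective_bits routing → fewer passes" is that a radix sort only needs to process the bytes that some key actually uses. The sketch below assumes that reading: it measures the highest set bit across all keys and runs an 8-bit LSD radix sort for just that many passes. The function names are hypothetical and the code is sequential CPU code for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of low-order bits needed to represent every key (0 if all keys are 0).
int effective_bits(const std::vector<std::uint64_t>& keys) {
    std::uint64_t all = 0;
    for (std::uint64_t k : keys) all |= k;  // union of every bit any key sets
    int bits = 0;
    while (all != 0) { ++bits; all >>= 1; }
    return bits;
}

// LSD radix sort, 8 bits per pass, skipping high bytes that no key uses.
// Returns the number of passes actually run (0 to 8).
int radix_sort_effective(std::vector<std::uint64_t>& keys) {
    const int passes = (effective_bits(keys) + 7) / 8;  // ceil(bits / 8)
    std::vector<std::uint64_t> tmp(keys.size());
    for (int p = 0; p < passes; ++p) {
        const int shift = 8 * p;
        std::array<std::size_t, 257> count{};  // count[b + 1] = number of keys whose byte is b
        for (std::uint64_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
        for (std::size_t b = 0; b < 256; ++b) count[b + 1] += count[b];  // prefix sums: start offsets
        for (std::uint64_t k : keys) tmp[count[(k >> shift) & 0xFF]++] = k;  // stable scatter
        keys.swap(tmp);
    }
    assert(std::is_sorted(keys.begin(), keys.end()));
    return passes;
}
```

Keys that fit in 16 bits are sorted in 2 passes instead of the 8 a full 64-bit sort would take, which is where the saving comes from for narrow-range data such as prices or sequence numbers.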

Ready to Get Started?

Start with a free 30-day trial. Prove the speedup. Become the internal champion.