GPU + CPU Adaptive Sorting

MASH Sort

Fingerprint-driven algorithm selection.
Beats NVIDIA CUB on every distribution we benchmarked.

3.69x average speedup

8.6x Presorted

4.59x Zipfian

1.64x Uniform

View Pricing

See Benchmarks

The One-Size-Fits-All Problem

NVIDIA CUB and Thrust are the industry standard for GPU sorting. Both rely on radix sort, which is excellent for uniform random data but blind to real-world structure.

Real data has structure: time series are monotonic, user IDs follow Zipfian distributions, and logs arrive nearly sorted. Using one algorithm for everything leaves about 73% of potential performance on the table (a 3.69x average speedup means the same work finishes in roughly 27% of the time).

How MASH Works

MASH is smart, not just fast. It adapts to your data automatically.

STEP 1

Fingerprint

Zero-Overhead Analysis

Analyzes data entropy, sortedness, and distribution during host-to-device transfer.
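As a rough illustration of what a sortedness fingerprint could look like, here is a minimal CPU-side sketch. It is not MASH's implementation: the `Fingerprint` struct, the `fingerprint` function, and the 0 to 255 score scale are assumptions made for this example, and it covers only sortedness (not entropy or distribution) in a plain pass over a host buffer rather than during host-to-device transfer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fingerprint; the 0..255 scale is an assumption for this sketch.
struct Fingerprint {
    std::uint8_t sortedness;    // 255 = every adjacent pair is non-descending
    std::uint8_t reversedness;  // 255 = every adjacent pair is strictly descending
};

// One linear pass: classify each adjacent pair as ordered or reversed.
inline Fingerprint fingerprint(const std::vector<std::uint64_t>& keys) {
    if (keys.size() < 2) return {255, 0};  // empty or single-element input is trivially sorted
    std::size_t ascending = 0;
    std::size_t descending = 0;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        if (keys[i - 1] <= keys[i]) ++ascending;
        else ++descending;
    }
    const std::size_t pairs = keys.size() - 1;
    return {static_cast<std::uint8_t>(ascending * 255 / pairs),
            static_cast<std::uint8_t>(descending * 255 / pairs)};
}
```

A fully sorted input scores 255 for sortedness and a fully reversed input scores 255 for reversedness; everything else lands in between.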

STEP 2

Intelligent Router

O(1) Adaptive Logic

Instantly selects the optimal algorithm based on data shape. No manual tuning required.
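To show why routing can be constant-time, the sketch below picks a path from a few threshold comparisons on a precomputed fingerprint. The thresholds 235 and 200 are the ones quoted in the use cases on this page; the struct, the path names, the reuse of 235 for reversed input, and the fallback path are assumptions for illustration, not MASH's actual router.

```cpp
#include <cstdint>

// Hypothetical fingerprint with scores on an assumed 0..255 scale.
struct Fingerprint {
    std::uint8_t sortedness;
    std::uint8_t reversedness;
    std::uint8_t clustering;
};

// Hypothetical path names; the product ships 7 algorithms, only 4 are sketched here.
enum class Path { PresortedExit, ReverseInPlace, SampleSort, GeneralSort };

// A fixed number of comparisons regardless of input size, hence O(1).
constexpr Path route(const Fingerprint& fp) {
    if (fp.sortedness > 235) return Path::PresortedExit;     // threshold quoted on this page
    if (fp.reversedness > 235) return Path::ReverseInPlace;  // assumption: same threshold mirrored
    if (fp.clustering > 200) return Path::SampleSort;        // threshold quoted on this page
    return Path::GeneralSort;                                // assumption: heavy-path fallback
}
```

The cost of the decision is independent of the number of elements, because all per-element work happened when the fingerprint was computed.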

STEP 3

Execute

Specialized Kernels

Executes one of 7 specialized algorithms (SIMD/Warp optimized) for maximum throughput.

Fast Path (presorted, reverse)

Heavy Path (random, clustered)

How It Performs

Visualized benchmark results. See how MASH routes different data distributions to specialized algorithms.

[Interactive benchmark visualization: std::sort vs. MASH (CPU) on 100,000 integers, selectable by data distribution]

Throughput Comparison

std::sort ~2M items/sec

MASH CPU ~15M items/sec

MASH GPU ~800M items/sec

The visualization above shows CPU benchmark results. The same adaptive routing logic scales to massive GPU parallelism.

MASH GPU achieves up to 8 GB/s throughput, about 53x the item rate of MASH CPU (~800M vs. ~15M items/sec).

License MASH GPU

GPU + CPU

Production-ready implementations for both platforms. Route intelligently based on workload.

GPU Implementation

Production Ready

Codebase 3,553 LOC

CUDA Version 11.0+

Compute Capability 7.0+ (Volta)

Tested Hardware GB10, A10G, RTX 4000/5000

Graph Capture +27% batch

Memory Model Zero-alloc hot path

CPU Implementation

Production Ready

Algorithms Adaptive

Standard C++20

Parallelization OpenMP 4.5+

Platform x86-64 Linux

Routing Cost-model based

GPU Coordination Hybrid router

Performance Benchmarks

10M uint64_t elements on NVIDIA GB10. Compared against NVIDIA CUB (industry standard).

BENCHMARK_RESULTS.log

Distribution	CUB Time	MASH Time	Speedup
Presorted	9.48 ms	1.25 ms	8.6x
Reverse	10.06 ms	1.50 ms	6.71x
Zipfian (s=1.5)	9.38 ms	2.05 ms	4.59x
Uniform Random	10.22 ms	6.23 ms	1.64x
Pareto (80/20)	9.65 ms	6.10 ms	1.58x
Organ Pipe	9.64 ms	6.27 ms	1.54x
Average	9.74 ms	3.90 ms	3.69x
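For readers who want to build comparable inputs, the following sketch generates four of the distributions named in the table. These are common textbook definitions and not necessarily the generators behind the published numbers; the function names and the seed parameter are assumptions, and the Zipfian and Pareto inputs are left out because their exact parameterization is not given here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

using Keys = std::vector<std::uint64_t>;

// 0, 1, 2, ..., n-1
inline Keys presorted(std::size_t n) {
    Keys k(n);
    std::iota(k.begin(), k.end(), std::uint64_t{0});
    return k;
}

// n-1, ..., 2, 1, 0
inline Keys reversed(std::size_t n) {
    Keys k = presorted(n);
    std::reverse(k.begin(), k.end());
    return k;
}

// Organ pipe: values rise to the midpoint, then fall symmetrically.
inline Keys organ_pipe(std::size_t n) {
    Keys k(n);
    for (std::size_t i = 0; i < n; ++i) k[i] = std::min(i, n - 1 - i);
    return k;
}

// Uniform random 64-bit keys; a fixed seed keeps runs repeatable.
inline Keys uniform_random(std::size_t n, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    Keys k(n);
    for (auto& x : k) x = rng();
    return k;
}
```

Calling `organ_pipe(5)` yields 0, 1, 2, 1, 0, and passing the same seed to `uniform_random` twice produces the same keys, which helps when comparing two sort implementations on identical input.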

All results are reproducible with one-click verification scripts, and a cryptographic Merkle chain ensures their integrity.

Full Methodology & Larger Scale Tests

Pricing

From proof-of-concept to production. No per-core fees. No royalties on your success.

POC

Developer

Prove MASH is faster. Become the internal champion.

Free 30-day trial

or $499/seat perpetual

Static library + headers
CPU demo + GPU binary
Internal benchmarking
No redistribution
No production deployment

Start Free Trial

Project

Ship MASH in your product. Flat fee, unlimited scale.

$15,000 /year

Single application, unlimited nodes

Everything in Developer
Production deployment
Binary redistribution
Unlimited end-users
Email support

Get Project License

No per-core fees. No royalties.

Strategic

For cloud providers, GPU vendors, and database platforms.

Custom Terms

Structured for your business model

Source code access
Custom integration support
Joint roadmap input
Flexible deal structures
Partnership opportunities

Start Conversation

Qualified buyers only

Why Flat Pricing?

No Per-Core

Scale to 1,000 nodes. Same price.

No Royalties

Your success doesn't cost you more.

Predictable

Budget it once. Ship with confidence.

Technical Specifications

Requirements and tested configurations.

SPECIFICATIONS.txt

Requirement	GPU	CPU
Runtime	CUDA 11.0+	C++20, OpenMP 4.5+
Hardware	CC 7.0+ (Volta)	x86-64
Tested GPUs	GB10, A10G, RTX 4000/5000	-
Data Types	uint64_t	uint64_t
Memory	N×8 bytes temp	N×8 bytes temp
Platform	Linux	Ubuntu 20.04+

Built For Real Workloads

Time-Series Databases

Time-series data is naturally ordered. MASH detects the presorted structure and exits early, running 8.6x faster than CUB in the benchmark above.

sortedness > 235 → instant exit

Analytics Engines

ORDER BY operations on user-activity data typically see Zipfian key distributions. Sample sort handles the skew efficiently.

clustering > 200 → 4.59x speedup

Log Processing

Timestamp-ordered logs arrive nearly sorted. Fingerprinting detects this and skips unnecessary work.

run detection → merge only
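The "run detection → merge only" idea can be sketched on the CPU as follows: find the boundaries of the already-sorted runs, then merge neighbouring runs until one remains, without ever sorting individual elements. This is a generic natural merge sort written for illustration; `find_runs` and `merge_runs` are hypothetical names, not MASH functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Start index of every maximal non-descending run, followed by keys.size() as an end marker.
inline std::vector<std::size_t> find_runs(const std::vector<std::uint64_t>& keys) {
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < keys.size(); ++i)
        if (keys[i - 1] > keys[i]) bounds.push_back(i);
    bounds.push_back(keys.size());
    return bounds;
}

// Merge adjacent runs pairwise until a single run covers the whole array.
inline void merge_runs(std::vector<std::uint64_t>& keys) {
    auto at = [&](std::size_t i) { return keys.begin() + static_cast<std::ptrdiff_t>(i); };
    std::vector<std::size_t> bounds = find_runs(keys);
    while (bounds.size() > 2) {  // more than one run left
        std::vector<std::size_t> next{0};
        for (std::size_t r = 0; r + 1 < bounds.size(); r += 2) {
            const std::size_t lo = bounds[r];
            const std::size_t mid = bounds[r + 1];
            // An unpaired last run is carried over unchanged (hi == mid makes the merge a no-op).
            const std::size_t hi = (r + 2 < bounds.size()) ? bounds[r + 2] : mid;
            std::inplace_merge(at(lo), at(mid), at(hi));
            next.push_back(hi);
        }
        bounds = std::move(next);
    }
}
```

Nearly sorted logs contain few runs, so the number of merge rounds (logarithmic in the run count) stays small; fully sorted input has one run and triggers no merges at all.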

Financial Systems

Order books and trade data have local structure. Adaptive routing exploits patterns automatically.

effective_bits routing → fewer passes
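One plausible reading of "effective_bits routing → fewer passes" is that keys sharing a common high-order prefix only need their varying low-order bits sorted. The sketch below computes that bit count and the resulting number of radix passes. The function names and the 8-bit digit size are assumptions for this example; how MASH defines effective_bits is not specified here.

```cpp
#include <cstdint>
#include <vector>

// Number of low-order bits that can differ between any two keys.
// Bits above this position are identical in every key and cannot affect the order.
inline int effective_bits(const std::vector<std::uint64_t>& keys) {
    std::uint64_t varying = 0;
    for (std::uint64_t k : keys) varying |= k ^ keys.front();
    int bits = 0;
    while (varying != 0) {  // position of the highest set bit, counted from 1
        ++bits;
        varying >>= 1;
    }
    return bits;
}

// Least-significant-digit radix passes needed to cover `bits` (assumed 8-bit digits).
inline int radix_passes(int bits, int digit_bits = 8) {
    return (bits + digit_bits - 1) / digit_bits;
}
```

Under these assumptions, keys that differ only in their lowest byte need one pass, while keys spanning the full 64-bit range need eight.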

Ready to Get Started?

Start with a free 30-day trial. Prove the speedup. Become the internal champion.

Start Free Trial

Read Full Benchmarks