Sweat Capital: Ripping Out TEI for a 27x Embedding Speedup with Go and Custom CUDA

Jonathan Corners | May 2026

Why I ripped out Hugging Face TEI and built qwen-embed-native, a Go + custom-CUDA embedding engine for Qwen3 that runs fast on sub-H100 hardware.

If you want to build truly resilient, high-performance infrastructure on a budget, you eventually hit a wall where “industry standard” stops meaning “best in class” and starts meaning “built for someone else’s hardware.”

When I set out to build Forge, the goal was simple: serve state-of-the-art Qwen3 embedding models (from 0.6B up to 8B parameters) fast, reliably, and without bleeding cash on GPU idleness.

The obvious first answer was Hugging Face’s TEI (Text Embeddings Inference). It’s the darling of the embedding ecosystem. It promises everything out of the box. So, I spun it up across my mix of hardware: ARM nodes, DGX GB10s, and a scrappy RTX 5080.

And it absolutely choked.

The H100 Illusion

The reality of modern AI deployment tools is that many are built, tested, and optimized in environments where compute is effectively unlimited. If your software is exclusively tested on pods of H100s, you don’t really have to care about tight VRAM constraints, nuanced ARM scheduling, or memory bandwidth bottlenecks. You just throw raw FLOPS at the problem and call it a day.

When I threw TEI at the GB10 and ARM nodes, it fell over. I ended up deep in the project’s GitHub issues, trying to figure out why basic deployments were failing. I even found a blocking PR, tried to patch it, and eventually gave up. The codebase carried friction that wasn’t necessary for my use case, and sub-H100 tuning didn’t look like a priority for the project, which meant I was leaving massive performance on the table.

As a solo founder, compute isn’t free. Every ounce of overhead eats directly into margins. I didn’t have money to burn on inefficient software. I had grit, sweat capital, and a C compiler.

So, I ripped TEI out entirely.

The Anatomy of qwen-embed-native

Instead of wrestling with a monolithic inference server, I built qwen-embed-native, a streamlined, strictly tuned CUDA engine driven from Go.

The philosophy was brutal minimalism:

  1. Go for the network layer: Go is unmatched for high-concurrency request handling, gRPC routing, and connection multiplexing.
  2. CGO bridging: A razor-thin interface hopping from Go to C.
  3. Raw CUDA for the hot path: Moving inference entirely into custom kernels, stripped down to the math the model actually needs.

Figure: Architecture of qwen-embed-native. A Go control plane (CF edge ingress, auth and rate limiting, dynamic token-packing batcher) is bridged via CGO to a custom CUDA engine (cuBLAS batched GEMM, bidirectional attention with no causal mask, dimension-tuned matrices), producing 16,000+ tokens per second on a GB10 node.

Go owns the network. CUDA owns the math. A razor-thin CGO bridge between them.
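
To make the "dynamic token-packing batcher" in the figure concrete, here is a minimal sketch of one simple way such a batcher could work on the Go side: greedily group tokenized requests, in arrival order, until a token budget is reached. This is not the actual qwen-embed-native code. The `Request` type, the `PackBatches` function, and the 1024-token budget are all hypothetical names and values chosen for illustration.

```go
package main

import "fmt"

// Request is a hypothetical unit of work: one input text, already tokenized.
type Request struct {
	ID     string
	Tokens int // token count after tokenization
}

// PackBatches groups requests, in arrival order, into batches whose total
// token count stays at or below maxTokens. A single request longer than
// maxTokens gets a batch to itself rather than being dropped.
func PackBatches(reqs []Request, maxTokens int) [][]Request {
	var batches [][]Request
	var cur []Request
	used := 0
	for _, r := range reqs {
		// Close the current batch if this request would overflow the budget.
		if len(cur) > 0 && used+r.Tokens > maxTokens {
			batches = append(batches, cur)
			cur, used = nil, 0
		}
		cur = append(cur, r)
		used += r.Tokens
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	reqs := []Request{{"a", 300}, {"b", 500}, {"c", 400}, {"d", 100}, {"e", 2000}}
	for i, b := range PackBatches(reqs, 1024) {
		total := 0
		for _, r := range b {
			total += r.Tokens
		}
		fmt.Printf("batch %d: %d requests, %d tokens\n", i, len(b), total)
	}
}
```

Budgeting by tokens rather than by request count keeps the work handed to the GPU roughly even from batch to batch, whether the inputs are many short strings or a few long documents. A production batcher would presumably also flush on a timer so a lone request never waits for the budget to fill; that part is left out here.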

The “Aha” Moment: Bidirectional Attention without the Mask

When you dig into how most text generation kernels are written, they use a causal mask. You can’t let token N look at token N+1 because you haven’t generated it yet.

But I’m not generating text. I’m encoding it.

Encoding doesn’t require a causal mask for that reason: the entire input sequence is known upfront. By ditching the generalized logic TEI was hauling around and writing a tightly fused attention kernel around cuBLAS batched GEMM, specifically for bidirectional context, the memory bottleneck in attention disappeared.
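
The difference between the two attention modes is easy to see in a CPU reference. The sketch below is not the CUDA kernel described above. It is a plain Go, single-head, scaled dot-product attention written only to show what the mask changes, and the `Attention` function and its `causal` flag are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"math"
)

// Attention computes single-head scaled dot-product attention on the CPU.
// q, k, and v are [seq][dim] matrices. When causal is true, token i attends
// only to tokens 0..i; when false, every token attends to every token.
func Attention(q, k, v [][]float64, causal bool) [][]float64 {
	n, d := len(q), len(q[0])
	scale := 1 / math.Sqrt(float64(d))
	out := make([][]float64, n)
	for i := 0; i < n; i++ {
		limit := n // bidirectional: the full row of scores
		if causal {
			limit = i + 1 // causal: stop at the current token
		}
		// Scaled scores for row i, tracking the max for a stable softmax.
		scores := make([]float64, limit)
		maxS := math.Inf(-1)
		for j := 0; j < limit; j++ {
			var dot float64
			for c := 0; c < d; c++ {
				dot += q[i][c] * k[j][c]
			}
			scores[j] = dot * scale
			if scores[j] > maxS {
				maxS = scores[j]
			}
		}
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxS)
			sum += scores[j]
		}
		// Weighted sum of value rows.
		out[i] = make([]float64, len(v[0]))
		for j := range scores {
			w := scores[j] / sum
			for c := range out[i] {
				out[i][c] += w * v[j][c]
			}
		}
	}
	return out
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	fmt.Println("causal:       ", Attention(x, x, x, true))
	fmt.Println("bidirectional:", Attention(x, x, x, false))
}
```

In the causal run, the first token can only see itself, so its output is just its own value row. In the bidirectional run, every row mixes in every other token. The unmasked case computes the full score matrix with no per-element branch, which is the shape of work that maps onto a dense batched GEMM.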

The result was a 27.6x speedup on attention compared to the naive baseline.

By tuning the matrices to fit the 1024d, 2560d, and 4096d embedding dimensions of the Qwen3 suite, the engine roared to life. My 16GB RTX 5080 was suddenly serving pro and turbo models natively, while the GB10 nodes comfortably pushed 16,000+ tokens per second (TPS).

Sweat Capital Pays Dividends

When compute feels free, maybe you can afford to let unoptimized Python bloat chew up an extra H100 or three. But when every GPU hour comes out of your own pocket, every byte of VRAM and every CUDA core matters.

Building qwen-embed-native wasn’t about reinventing the wheel. It was about stripping off the flat tires the industry handed me. The end result is a fiercely independent inference engine that does exactly one thing, and does it faster than anything else I could find.

It’s the engine powering Forge, and I wouldn’t run it any other way.

Author

Jonathan Corners - Founder, Voxell. I build GPU-native infrastructure for real-time AI systems.

If you're working on latency + consistency problems, I'd like to hear about it.
