If you want to build truly resilient, high-performance infrastructure on a budget, you eventually hit a wall where “industry standard” stops meaning “best in class” and starts meaning “built for someone else’s hardware.”
When I set out to build Forge, the goal was simple: serve state-of-the-art Qwen3 embedding models (from 0.6B up to 8B parameters) fast, reliably, and without bleeding cash on idle GPUs.
The obvious first answer was Hugging Face’s TEI (Text Embeddings Inference). It’s the darling of the embedding ecosystem. It promises everything out of the box. So, I spun it up across my mix of hardware: ARM nodes, DGX GB10s, and a scrappy RTX 5080.
And it absolutely choked.
The H100 Illusion
The reality of modern AI deployment tools is that many are built, tested, and optimized in environments where compute is effectively unlimited. If your software is exclusively tested on pods of H100s, you don’t really have to care about tight VRAM constraints, nuanced ARM scheduling, or memory bandwidth bottlenecks. You just throw raw FLOPS at the problem and call it a day.
When I threw TEI at the GB10s and the ARM nodes, it fell over. I ended up deep in its GitHub issues, trying to figure out why basic deployments were failing. I even found a blocking PR, tried to patch it, and eventually gave up. The codebase was riddled with friction that just wasn’t necessary for my use case. Tuning for anything below an H100 clearly wasn’t a priority, which meant I was leaving massive performance on the table.
As a solo founder, compute isn’t free. Every ounce of overhead eats directly into margins. I didn’t have money to burn on inefficient software. I had grit, sweat capital, and a C compiler.
So, I ripped TEI out entirely.
The Anatomy of qwen-embed-native
Instead of wrestling with a monolithic inference server, I built qwen-embed-native, a streamlined, tightly tuned Go-native CUDA engine.
The philosophy was brutal minimalism:
- Go for the network layer: Go is unmatched for high-concurrency request handling, gRPC routing, and connection multiplexing.
- CGO bridging: A razor-thin interface hopping from Go to C.
- Raw CUDA for the hot path: Moving inference entirely into custom kernels stripped down to the math this one workload needs.
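To make the layering concrete, here is a minimal sketch of what a thin Go-to-native seam could look like. It is not the actual qwen-embed-native code: `Engine`, `EmbedFlat`, `flattenBatch`, and `stubEngine` are hypothetical names, and the stub stands in for the cgo-backed CUDA engine so the example runs without a GPU toolchain. The one idea it illustrates is packing a batch into flat buffers so the boundary is crossed once per batch, since cgo's pointer-passing rules make a slice of slices awkward to hand to C directly.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine is a hypothetical seam between the Go network layer and native code.
// A real implementation would call into C via cgo; that part is omitted here.
type Engine interface {
	Dim() int
	// EmbedFlat writes one Dim()-sized vector per sequence into out.
	EmbedFlat(tokens, offsets []int32, out []float32) error
}

// flattenBatch packs variable-length sequences into one contiguous token
// buffer plus offsets, so a whole batch can cross the boundary in one call.
func flattenBatch(seqs [][]int32) (tokens, offsets []int32) {
	offsets = make([]int32, 0, len(seqs)+1)
	offsets = append(offsets, 0)
	for _, s := range seqs {
		tokens = append(tokens, s...)
		offsets = append(offsets, int32(len(tokens)))
	}
	return tokens, offsets
}

// stubEngine stands in for the native engine. It fills each vector with the
// sequence length, which is meaningless as an embedding but shows the data flow.
type stubEngine struct{ dim int }

func (e stubEngine) Dim() int { return e.dim }

func (e stubEngine) EmbedFlat(tokens, offsets []int32, out []float32) error {
	n := len(offsets) - 1
	if n < 0 || len(out) != n*e.dim {
		return errors.New("output buffer does not match batch size")
	}
	for i := 0; i < n; i++ {
		length := float32(offsets[i+1] - offsets[i])
		for j := 0; j < e.dim; j++ {
			out[i*e.dim+j] = length
		}
	}
	return nil
}

// Embed is what a request handler would call: flatten the batch, cross the
// boundary once, then slice the flat result into per-input vectors.
func Embed(e Engine, seqs [][]int32) ([][]float32, error) {
	tokens, offsets := flattenBatch(seqs)
	out := make([]float32, len(seqs)*e.Dim())
	if err := e.EmbedFlat(tokens, offsets, out); err != nil {
		return nil, err
	}
	vecs := make([][]float32, len(seqs))
	for i := range vecs {
		vecs[i] = out[i*e.Dim() : (i+1)*e.Dim()]
	}
	return vecs, nil
}

func main() {
	vecs, err := Embed(stubEngine{dim: 4}, [][]int32{{1, 2, 3}, {4, 5}})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vecs), vecs[0], vecs[1])
}
```

Keeping the interface this small means the gRPC or HTTP handlers never need to know whether a stub or a GPU sits behind it, which also makes the network layer testable on a laptop.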
The “Aha” Moment: Bidirectional Attention without the Mask
Dig into how most text-generation attention kernels are written and you’ll find a causal mask. You can’t let token N look at token N+1 because you haven’t generated it yet.
But I’m not generating text. I’m encoding it.
Embedding models don’t need causal masks because the entire input sequence is known upfront. Once I ditched the generalized logic TEI was hauling around and wrote a tightly fused attention path on top of cuBLAS batched GEMMs, built specifically for bidirectional context, the memory bottleneck disappeared.
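The difference between the two attention modes is easy to see in a plain CPU reference. The sketch below is not the CUDA kernel and makes no performance claims; `attention` is a hypothetical single-head helper, and it assumes a model that is meant to run with full bidirectional attention. With `causal` set, query i only sees keys 0..i. Without it, every query sees every key, so the score matrix is a dense Q·Kᵀ product with no masking step, which is the shape of work a batched GEMM handles well.

```go
package main

import (
	"fmt"
	"math"
)

// attention computes softmax(Q K^T / sqrt(d)) V for a single head.
// q, k, v are row-major [n][d]. When causal is true, query i attends only
// to keys 0..i; otherwise it attends to the full sequence.
func attention(q, k, v [][]float64, causal bool) [][]float64 {
	n, d := len(q), len(q[0])
	scale := 1 / math.Sqrt(float64(d))
	out := make([][]float64, n)
	for i := 0; i < n; i++ {
		limit := n // bidirectional: all keys are visible
		if causal {
			limit = i + 1 // causal: hide everything after position i
		}
		scores := make([]float64, limit)
		maxScore := math.Inf(-1)
		for j := 0; j < limit; j++ {
			var dot float64
			for t := 0; t < d; t++ {
				dot += q[i][t] * k[j][t]
			}
			scores[j] = dot * scale
			if scores[j] > maxScore {
				maxScore = scores[j]
			}
		}
		// Numerically stable softmax over the visible keys.
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxScore)
			sum += scores[j]
		}
		out[i] = make([]float64, d)
		for j := range scores {
			w := scores[j] / sum
			for t := 0; t < d; t++ {
				out[i][t] += w * v[j][t]
			}
		}
	}
	return out
}

func main() {
	x := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	bi := attention(x, x, x, false)
	ca := attention(x, x, x, true)
	fmt.Println("bidirectional, first token:", bi[0])
	fmt.Println("causal, first token:       ", ca[0])
}
```

In causal mode the first token can only attend to itself, so its output is just its own value row. In bidirectional mode it mixes in every other token, and the last token's output is identical in both modes because it already sees the whole sequence.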
The result was a 27.6x speedup on attention compared to the naive baseline.
Once the matrix shapes were tuned to the 1024-, 2560-, and 4096-dimensional models in the Qwen3 suite, the engine roared to life. My 16GB RTX 5080 was suddenly serving pro and turbo models natively, while the GB10 nodes comfortably pushed 16,000+ TPS unhindered.
Sweat Capital Pays Dividends
When compute feels free, maybe you can afford to let unoptimized, general-purpose bloat chew up an extra H100 or three. But when every GPU hour comes out of your own pocket, every byte of VRAM and every CUDA core matters.
Building qwen-embed-native wasn’t about reinventing the wheel. It was about stripping off the flat tires the industry handed me. The end result is a fiercely independent inference engine that does exactly one thing, and does it faster than anything else I could find.
It’s the engine powering Forge, and I wouldn’t run it any other way.