FORGE EMBEDDING API

The world's only zero-trust embedding engine.

87ms. Three quality tiers. No API key to leak.

forge — benchmarks
87ms
Median Latency
42ms
Batch Sustained
Up to 75
MTEB Score
Zero
Data Retention
Why Forge
Faster
2–5x faster than cloud embedding APIs
87ms median latency. 42ms sustained in batch mode. Owned GPU infrastructure — no noisy neighbors, no shared queues.
87ms median · 42ms batch
More Precise
Three tiers, up to 4096 dimensions
Choose the right tradeoff for each workload. Built for RAG pipelines where retrieval quality directly impacts LLM output.
Up to 75 MTEB · 4096d
Zero Trust
Production mTLS. No bearer tokens to leak.
Cryptographic identity on every connection via mutual TLS with Ed25519 client certificates. No API keys to leak. No shared secrets to rotate.
mTLS + Ed25519 · Enterprise tier
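Connecting with a client certificate needs no vendor SDK; Python's standard `ssl` module can present one on every connection. A minimal sketch, assuming you have already been issued a client certificate, key, and CA bundle (the file paths below are placeholders, not Forge's actual provisioning flow):

```python
import ssl

def make_mtls_context(client_cert: str, client_key: str, ca_bundle: str) -> ssl.SSLContext:
    """Build a TLS context that presents a client certificate (mutual TLS)."""
    # Verify the server against the CA bundle, as any HTTPS client would...
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_bundle)
    # ...and additionally load our own certificate and private key, so the
    # server can verify *us* cryptographically. No bearer token involved.
    ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

The returned context can be passed to `http.client.HTTPSConnection(..., context=ctx)` or to any HTTP library that accepts an `SSLContext`.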
Three Models. One Engine.

Every model runs on the same proprietary CUDA engine. OpenAI-compatible API. Choose per request — no provisioning, no cold starts.

Turbo
Speed-optimized.
  • Dimensions: 1024
  • Best for: fast, precise RAG
Pro
Balanced.
  • Dimensions: 2560
  • Best for: Postgres pgvector
Ultra
Maximum precision.
  • Dimensions: 4096
  • Best for: maximum fidelity
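Because the API is OpenAI-compatible, choosing a tier is just a different model string on each request. A minimal sketch using only the standard library; the endpoint URL and model names (`forge-turbo`, `forge-ultra`) are illustrative placeholders, not Forge's published identifiers:

```python
import json
import urllib.request

FORGE_URL = "https://api.example-forge.dev/v1/embeddings"  # placeholder endpoint

def embed_request(texts: list[str], model: str = "forge-turbo") -> urllib.request.Request:
    """Build an OpenAI-style /v1/embeddings request; the tier is chosen per call."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        FORGE_URL,
        data=json.dumps(payload).encode(),
        # No Authorization header: with mTLS, identity is the connection itself.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Switching tiers is a one-word change, with no provisioning step:
fast = embed_request(["reset my password"], model="forge-turbo")  # 1024d
deep = embed_request(["reset my password"], model="forge-ultra")  # 4096d
```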
New to embeddings?
I can drop a document into ChatGPT and it works fine. Why do I need this?

For a one-off question about a single document, yes — ChatGPT works. You paste it in, ask a question, get an answer in 5–10 seconds. That's fine for personal use if you don't mind the wait.

It breaks down the moment you need to do this at scale. A support team with 50,000 articles. A legal team searching across 10 years of contracts. A product that needs to answer user questions from a live knowledge base. You can't paste 50,000 documents into a chat window.

This is why vector embeddings are booming. They let you pre-process your entire corpus into searchable meaning — once — and then retrieve the right passages in milliseconds when a question comes in. The LLM only sees the most relevant context, not everything. It's faster, cheaper, more accurate, and it actually scales.

The catch: Stanford researchers have identified what they call "Semantic Collapse" — the point where retrieval-augmented generation breaks down at scale. At 1,000 documents, RAG systems hit ~85% accuracy. At 10,000, accuracy drops to ~45%. At 50,000, it's ~22%. The system doesn't gradually degrade — it collapses. And the root cause isn't the LLM. It's the retrieval layer. The embedding model surfaces the wrong passages, and the LLM confidently answers from the wrong source.

That's what Forge is for. Not the one-off question — the system that needs to answer thousands of questions correctly, every day, from a corpus that keeps growing. Better embeddings are the difference between a RAG system that works at demo scale and one that works in production.

What is a vector embedding?

A vector embedding is a list of numbers that represents the meaning of a piece of text. The word "king" might become [0.21, -0.87, 0.44, ...] — a point in high-dimensional space.

The key insight: texts with similar meaning land near each other. "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together, even though they share almost no words. This is what makes semantic search, RAG pipelines, and recommendation systems work — you search by meaning, not keywords.

Dimensions (like 1024 or 4096) are how many numbers are in each vector. More dimensions capture finer distinctions in meaning, but cost more to store and compare. Forge lets you choose the right tradeoff per workload.
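To make "close together" concrete: the standard way to compare two embeddings is cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have 1024-4096 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real embeddings:
password_reset = [0.9, 0.1, 0.2]  # "How do I reset my password?"
forgot_login   = [0.8, 0.2, 0.3]  # "I forgot my login credentials"
pizza_recipe   = [0.1, 0.9, 0.1]  # "Best pizza dough recipe"

# The two password-related texts score far higher than the unrelated one:
assert cosine_similarity(password_reset, forgot_login) > \
       cosine_similarity(password_reset, pizza_recipe)
```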

Why do embeddings matter for RAG?

In retrieval-augmented generation (RAG), an LLM answers questions using documents you provide. But it can only use the documents you retrieve — and retrieval quality depends entirely on your embeddings.

If your embedding model misses a relevant paragraph, the LLM never sees it. If it retrieves the wrong paragraph, the LLM confidently answers from the wrong source. Embedding quality is the ceiling on your RAG system's accuracy. Better embeddings mean the LLM gets the right context more often, which means better answers with fewer hallucinations.
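The retrieval step itself is small; what matters is the quality of the vectors going in. A minimal sketch of embed-once, retrieve-per-question, using toy 2-dimensional placeholder vectors where a real pipeline would call an embedding API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec: list[float], corpus: list[dict], k: int = 2) -> list[str]:
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Embed the whole corpus once, at index time (vectors here are toy placeholders):
corpus = [
    {"text": "To reset your password, open Settings > Security.", "vec": [0.9, 0.1]},
    {"text": "Our refund policy lasts 30 days.",                  "vec": [0.1, 0.9]},
    {"text": "Login credentials can be recovered by email.",      "vec": [0.8, 0.3]},
]

# At question time, embed only the query and fetch the best context for the LLM:
context = retrieve([0.95, 0.05], corpus, k=2)
# -> the two password/login passages, not the refund policy
```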

What is MTEB and why does it matter?

MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating embedding models. It tests across dozens of real-world tasks — retrieval, classification, clustering, semantic similarity — and produces a single aggregate score.

It matters because it's the closest thing to an objective answer to "how good is this embedding model?" A model that scores 75 on MTEB retrieves relevant documents more accurately than one scoring 70 — and in a RAG pipeline, that difference directly affects the quality of your LLM's output.

MTEB uses metrics like nDCG (normalized discounted cumulative gain) under the hood to measure retrieval quality. We'll publish a deep dive on nDCG and how Forge's models perform across MTEB subtasks soon.
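nDCG itself is a short formula: each retrieved document contributes its relevance grade, discounted by its rank, and the sum is normalized by the best possible ordering. A self-contained sketch:

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance at rank i, discounted by log2(i+1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances: list[float]) -> float:
    """DCG normalized by the ideal (best possible) ordering: a value in [0, 1]."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Human-judged relevance grades of retrieved documents, in the order
# the embedding model ranked them:
perfect = ndcg([3, 2, 1, 0])  # most relevant first -> 1.0
swapped = ndcg([2, 3, 1, 0])  # top two swapped -> below 1.0
```

Putting the most relevant document lower in the ranking costs disproportionately, which is exactly the behavior you want when only the top-k passages ever reach the LLM.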

Don't take our word for it

Stanford's Warning: Your RAG System Is Broken — and How to Fix It
Stanford researchers coined "Semantic Collapse" — the phenomenon where RAG accuracy drops from 85% to 22% as document count grows from 1K to 50K. The retrieval layer, not the LLM, is the failure point.
medium.com · Sameer Rizwan · Jan 2026

Legal RAG Hallucinations (Stanford PDF)
The original Stanford research on hallucination rates in retrieval-augmented legal AI systems. Source data behind the Semantic Collapse findings.
dho.stanford.edu · PDF
Founding Access

20% off annual plans for the first 100 customers.

Lock in founding pricing before general availability.

See plans
Member of NVIDIA Inception Program
CUDA Native
DGX Spark Tested

Upgrade your embeddings in under 60 seconds.

No signup required. Paste text, get vectors, see the difference.

Try the playground →