FORGE EMBEDDING API

The world's only zero-trust embedding engine.

87ms. Three quality tiers. No API key to leak.

forge — benchmarks
87ms
Median Latency
42ms
Batch Sustained
Up to 75
MTEB Score
Zero
Data Retention
Why Forge
Faster
2–5x faster than cloud embedding APIs
87ms median latency. 42ms sustained in batch mode. Owned GPU infrastructure — no noisy neighbors, no shared queues.
87ms median · 42ms batch
More Precise
Three tiers, up to 4096 dimensions
Choose the right tradeoff for each workload. Built for RAG pipelines where retrieval quality directly impacts LLM output.
Up to 75 MTEB · 4096d
Zero Trust
Production mTLS. No bearer tokens to leak.
Cryptographic identity on every connection via mutual TLS with Ed25519 client certificates. No API keys to leak. No shared secrets to rotate.
mTLS + Ed25519 · Enterprise tier
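Connecting with a client certificate needs no vendor SDK; Python's standard `ssl` module can present one on every connection. A minimal sketch, assuming you have already been issued a client certificate, key, and CA bundle (the file paths below are placeholders, not Forge's actual provisioning flow):

```python
import ssl

def make_mtls_context(client_cert: str, client_key: str, ca_bundle: str) -> ssl.SSLContext:
    """Build a TLS context that presents a client certificate (mutual TLS)."""
    # Verify the server against the CA bundle, as any HTTPS client would...
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_bundle)
    # ...and additionally load our own certificate and private key, so the
    # server can verify *us* cryptographically. No bearer token involved.
    ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

The returned context can be passed to `http.client.HTTPSConnection(..., context=ctx)` or to any HTTP library that accepts an `SSLContext`.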
Three Models. One Engine.

Every model runs on the same proprietary CUDA engine. OpenAI-compatible API. Choose per request — no provisioning, no cold starts.

Turbo
Speed-optimized.
  • Dimensions: 1024
  • Best for: fast, precise RAG
Pro
Balanced.
  • Dimensions: 2560
  • Best for: Postgres pgvector
Ultra
Maximum precision.
  • Dimensions: 4096
  • Best for: maximum fidelity
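Because the API is OpenAI-compatible, choosing a tier is just a different model string on each request. A minimal sketch using only the standard library; the endpoint URL and model names (`forge-turbo`, `forge-ultra`) are illustrative placeholders, not Forge's published identifiers:

```python
import json
import urllib.request

FORGE_URL = "https://api.example-forge.dev/v1/embeddings"  # placeholder endpoint

def embed_request(texts: list[str], model: str = "forge-turbo") -> urllib.request.Request:
    """Build an OpenAI-style /v1/embeddings request; the tier is chosen per call."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        FORGE_URL,
        data=json.dumps(payload).encode(),
        # No Authorization header: with mTLS, identity is the connection itself.
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Switching tiers is a one-word change, with no provisioning step:
fast = embed_request(["reset my password"], model="forge-turbo")  # 1024d
deep = embed_request(["reset my password"], model="forge-ultra")  # 4096d
```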
New to embeddings?
I can drop a document into ChatGPT and it works fine. Why do I need this?

For a one-off question about a single document, yes — ChatGPT works. You paste it in, ask a question, get an answer in 5–10 seconds. That's fine for personal use if you don't mind the wait.

It breaks down the moment you need to do this at scale. A support team with 50,000 articles. A legal team searching across 10 years of contracts. A product that needs to answer user questions from a live knowledge base. You can't paste 50,000 documents into a chat window.

This is why vector embeddings are booming. They let you pre-process your entire corpus into searchable meaning — once — and then retrieve the right passages in milliseconds when a question comes in. The LLM only sees the most relevant context, not everything. It's faster, cheaper, more accurate, and it actually scales.

The catch: Stanford researchers have identified what they call "Semantic Collapse" — the point where retrieval-augmented generation breaks down at scale. At 1,000 documents, RAG systems hit ~85% accuracy. At 10,000, accuracy drops to ~45%. At 50,000, it's ~22%. The system doesn't gradually degrade — it collapses. And the root cause isn't the LLM. It's the retrieval layer. The embedding model surfaces the wrong passages, and the LLM confidently answers from the wrong source.

That's what Forge is for. Not the one-off question — the system that needs to answer thousands of questions correctly, every day, from a corpus that keeps growing. Better embeddings are the difference between a RAG system that works at demo scale and one that works in production.

What is a vector embedding?

A vector embedding is a list of numbers that represents the meaning of a piece of text. The word "king" might become [0.21, -0.87, 0.44, ...] — a point in high-dimensional space.

The key insight: texts with similar meaning land near each other. "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together, even though they share almost no words. This is what makes semantic search, RAG pipelines, and recommendation systems work — you search by meaning, not keywords.

Dimensions (like 1024 or 4096) are how many numbers are in each vector. More dimensions capture finer distinctions in meaning, but cost more to store and compare. Forge lets you choose the right tradeoff per workload.
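To make "close together" concrete: the standard way to compare two embeddings is cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have 1024-4096 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two vectors: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real embeddings:
password_reset = [0.9, 0.1, 0.2]  # "How do I reset my password?"
forgot_login   = [0.8, 0.2, 0.3]  # "I forgot my login credentials"
pizza_recipe   = [0.1, 0.9, 0.1]  # "Best pizza dough recipe"

# The two password-related texts score far higher than the unrelated one:
assert cosine_similarity(password_reset, forgot_login) > \
       cosine_similarity(password_reset, pizza_recipe)
```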

Why do embeddings matter for RAG?

In retrieval-augmented generation (RAG), an LLM answers questions using documents you provide. But it can only use the documents you retrieve — and retrieval quality depends entirely on your embeddings.

If your embedding model misses a relevant paragraph, the LLM never sees it. If it retrieves the wrong paragraph, the LLM confidently answers from the wrong source. Embedding quality is the ceiling on your RAG system's accuracy. Better embeddings mean the LLM gets the right context more often, which means better answers with fewer hallucinations.
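The retrieval step itself is small; what matters is the quality of the vectors going in. A minimal sketch of embed-once, retrieve-per-question, using toy 2-dimensional placeholder vectors where a real pipeline would call an embedding API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec: list[float], corpus: list[dict], k: int = 2) -> list[str]:
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Embed the whole corpus once, at index time (vectors here are toy placeholders):
corpus = [
    {"text": "To reset your password, open Settings > Security.", "vec": [0.9, 0.1]},
    {"text": "Our refund policy lasts 30 days.",                  "vec": [0.1, 0.9]},
    {"text": "Login credentials can be recovered by email.",      "vec": [0.8, 0.3]},
]

# At question time, embed only the query and fetch the best context for the LLM:
context = retrieve([0.95, 0.05], corpus, k=2)
# -> the two password/login passages, not the refund policy
```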

What is MTEB and why does it matter?

MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating embedding models. It tests across dozens of real-world tasks — retrieval, classification, clustering, semantic similarity — and produces a single aggregate score.

It matters because it's the closest thing to an objective answer to "how good is this embedding model?" A model that scores 75 on MTEB retrieves relevant documents more accurately than one scoring 70 — and in a RAG pipeline, that difference directly affects the quality of your LLM's output.

MTEB uses metrics like nDCG (normalized discounted cumulative gain) under the hood to measure retrieval quality. We'll publish a deep dive on nDCG and how Forge's models perform across MTEB subtasks soon.
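nDCG itself is a short formula: each retrieved document contributes its relevance grade, discounted by its rank, and the sum is normalized by the best possible ordering. A self-contained sketch:

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance at rank i, discounted by log2(i+1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances: list[float]) -> float:
    """DCG normalized by the ideal (best possible) ordering: a value in [0, 1]."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Human-judged relevance grades of retrieved documents, in the order
# the embedding model ranked them:
perfect = ndcg([3, 2, 1, 0])  # most relevant first -> 1.0
swapped = ndcg([2, 3, 1, 0])  # top two swapped -> below 1.0
```

Putting the most relevant document lower in the ranking costs disproportionately, which is exactly the behavior you want when only the top-k passages ever reach the LLM.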

Don't take our word for it

Stanford's Warning: Your RAG System Is Broken — and How to Fix It
Stanford researchers coined "Semantic Collapse" — the phenomenon where RAG accuracy drops from 85% to 22% as document count grows from 1K to 50K. The retrieval layer, not the LLM, is the failure point.
medium.com · Sameer Rizwan · Jan 2026

Legal RAG Hallucinations (Stanford PDF)
The original Stanford research on hallucination rates in retrieval-augmented legal AI systems. Source data behind the Semantic Collapse findings.
dho.stanford.edu · PDF
Founding Access

20% off annual plans for the first 100 customers.

Lock in founding pricing before general availability.

See plans
Member of NVIDIA Inception Program
CUDA Native
DGX Spark Tested

Upgrade your embeddings in under 60 seconds.

No signup required. Paste text, get vectors, see the difference.

Try the playground →