How We Stopped Lying to Ourselves About Quality

Test coverage is a proxy. It tells you which lines of code a test runner touched, not whether your product works. A payment system can report 94% coverage while silently dropping charges on every third invoice. A GPU inference API can show all green while its engine health gate does nothing more than run grep "/healthz" internal/engine/server.go against the source tree.

That second one is us. This is the story of how we found it, why it happened, and what we replaced it with.

The Problem With Coverage as a North Star

Forge is a four-tier embedding API: Turbo (0.6B), Pro (4B), Ultra (8B), and our ingot-8b-v1 routing tier, all running on Go-native CUDA inference across two DGX GB10 Spark nodes and an RTX 5080, with a control plane on Fly.io and edge auth on Cloudflare Workers. When you make an API call, it touches a Cloudflare Worker, crosses a CF tunnel, hits a gRPC bridge, reaches the CUDA inference engine, and comes back. Roughly six hops, three machines, two platforms.

Coverage percentage tells you whether unit tests exercised a particular function, not whether the path works end to end.

What we actually want to know every morning is: Can a customer sign up right now? Can they embed text? If their credits run out, do we stop them or keep serving them for free? So about eight months ago we built VBMS, the Voxell Benchmark Management System. Instead of a coverage threshold, we score eleven customer journeys on a 0-5 scale, weighted by revenue impact, and report a single composite number every morning.

How Journey Scoring Works

Each journey maps to a real customer interaction. J1 is Signup & Onboarding. J3 is Embedding (all three tiers). J4 is Usage & Billing. J10 is the Financial Credit Gate, the Cloudflare Durable Object that enforces credit limits at the edge. They have weights: J3 (embedding) is 15%, J10 (financial credit enforcement) is 14%, J1 and J4 and J8 (edge auth) are each 12%. The revenue-critical path (sign up, get a key, embed text, get billed, hit the credit limit) carries 78% of the total weight.

Scoring levels run 0-5: no feature file, feature parses, steps compile, all BDD scenarios pass, BDD + Claude CLI assessment passes, and finally BDD + assessment + all runtime gates pass. Status thresholds map to: 0-1 Initial, 2 Features Defined, 3 Journey Verified, 4 Production Ready, 5 Marketplace Ready.

The composite is the weighted mean Σ(score_j × weight_j), with the weights summing to 1, so it is reported out of 5. A 3.5 means you’re solidly Journey Verified but runtime gates aren’t all clearing. A 4.8 means you’re nearly Marketplace Ready across the board. The number is directional, not ceremonial.
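A minimal sketch of that arithmetic, assuming weights are stored as fractions and the result is reported on the 0-5 scale used throughout this post. The function names composite and status_label and the "journey score weight" input format are illustrative, not taken from scripts/vbms_score.sh.

```bash
# Hypothetical helper: reads "journey score weight" rows on stdin and prints
# the weighted mean on the 0-5 scale, rounded to one decimal place.
composite() {
  awk '
    NF == 3 { sum += $2 * $3; wsum += $3 }
    END {
      if (wsum == 0) { print "0.0"; exit }
      # Dividing by the weight total keeps the result on the 0-5 scale
      # even when the weights do not sum to exactly 1.
      printf "%.1f\n", sum / wsum
    }'
}

# Maps a composite to the status names listed above, using the integer
# part of the score (so 3.8 is still "Journey Verified").
status_label() {
  case "${1%%.*}" in
    0|1) echo "Initial" ;;
    2)   echo "Features Defined" ;;
    3)   echo "Journey Verified" ;;
    4)   echo "Production Ready" ;;
    5)   echo "Marketplace Ready" ;;
    *)   echo "Unknown"; return 1 ;;
  esac
}
```

Piping one row per journey into composite and passing the result to status_label yields the number and label a report header needs.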

Three Ways We Were Lying to Ourselves

The system worked well until we looked closely at what “working” actually meant. We found three distinct mechanisms producing false confidence.

Bug 1: The vacuous L5. In the original scorer, the L5 promotion logic looked like this:

if all_gates_pass; then
  score=5
elif [ -z "$gates" ]; then
  score=5  # no gates = perfect score
fi

Four journeys (API Key Management, Account Lifecycle, Support, and Admin Operations) had no gates defined. That elif branch meant they auto-promoted from L4 to L5 with zero runtime verification. A BDD pass and a Claude assessment JSON file in the right directory were sufficient to declare a journey “Marketplace Ready.” The scoring system was rewarding the absence of rigor.
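One way to close that hole, sketched under the assumption that the scorer knows how many gates a journey defines and how many of them passed. promote_score and its two arguments are hypothetical names, not the real scorer's interface.

```bash
# Hypothetical sketch: decide whether an L4 journey may be promoted to L5.
#   $1 = number of runtime gates defined for the journey
#   $2 = number of those gates that passed
promote_score() {
  local defined=$1 passed=$2
  if [ "$defined" -eq 0 ]; then
    # No gates means no runtime evidence, so the journey stays at L4.
    echo 4
  elif [ "$passed" -eq "$defined" ]; then
    echo 5
  else
    echo 4
  fi
}
```

The point of the explicit first branch is that an empty gate list is treated as missing evidence rather than as a pass.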

Bug 2: The grep-as-health-check. J9 is Engine Health & Inference. It has a gate called sla_nfr_gate.sh. The name implies it’s verifying SLA or non-functional requirements. What it actually did when we wrote it was:

check "engine healthz endpoint exists" \
  grep -q "/healthz" internal/engine/server.go

That is a source code search. It verified that the string “/healthz” appeared somewhere in a Go file. It said nothing about whether a running engine was reachable, whether any of the three tier-specific ports were up, or whether the binary had been built recently. J9 could score L5 with a completely offline engine, as long as the source code hadn’t been deleted.

We named it “SLA/NFR gate,” which added a layer of false authority. Nobody questioned it because the name sounded like it meant something.
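For contrast, a gate that earns the name would probe a running engine instead of the source tree. The sketch below assumes curl is installed and that the engine serves /healthz at ENGINE_URL; the check wrapper, the engine_gate name, and the localhost default are illustrative stand-ins for whatever the real gate script defines.

```bash
# Hypothetical check wrapper: runs a command, prints PASS or FAIL with the
# check name, and returns non-zero when the command fails.
check() {
  local name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

# Probes the live engine rather than grepping server.go.
# The default URL is an assumption for local runs.
engine_gate() {
  local url=${ENGINE_URL:-http://localhost:8080}
  # -f makes curl fail on HTTP error statuses; --max-time bounds a hung engine.
  check "engine /healthz responds" curl -fsS --max-time 5 "$url/healthz"
}
```

With this shape, an offline engine fails the gate no matter what the source code contains.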

Bug 3: The unknown/false conflation. The VBMS scorer probes infrastructure at startup: control plane, edge, engine. If it can’t reach localhost:8080, it sets ENGINE_UP=false. The problem: ENGINE_UP=false because “we didn’t probe” and ENGINE_UP=false because “we probed and it’s down” were treated identically.

In CI, where ENGINE_URL isn’t set, the unprobed case looked exactly like a real outage: a big red “ENGINE UNREACHABLE” banner and a composite score capped at 2.0. Meanwhile, individual journey rows could show L4 and L5. The same report simultaneously showed “the system is barely functional” at the top and “most journeys are Production Ready” in the table. Neither number was accurate. Both were presented confidently.

The Fix: forge-health as an Infrastructure Layer

We built a dedicated Go binary, forge-health, to own infrastructure verification. It registers 21 checks across six categories: connectivity, API health, inference, billing, GPU state, and service topology. The binary takes a --category flag to run targeted subsets.

forge-health --category inference,billing --timeout 30s

Output is structured JSON with per-check results and an overall status code. This lets VBMS consume it programmatically, and lets on-call tooling parse it directly.

The key design decision: infrastructure state carries three values, not two.

# unknown = not probed, no composite cap
# false   = probed and down, composite caps at L2
CONTROL_UP=unknown
EDGE_UP=unknown
ENGINE_UP=unknown

unknown is appropriate in CI where we deliberately don’t probe production. The composite cap only applies when false. Infra-dependent journeys (J3, J9, J11) still cap at L4 when infra is unknown (they can’t reach L5 without a live engine check), but the score is no longer corrupted by “we chose not to probe.”
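A minimal sketch of the three-state probe and the cap rule described above. The helper names (probe_state, cap_composite), the PROBE override, and the use of curl are assumptions for illustration; the real scorer may probe differently.

```bash
# Hypothetical helper: prints "unknown" when no URL is configured, otherwise
# "true" or "false" depending on whether the probe command succeeds.
# PROBE overrides the probe command; the curl default is an assumption.
probe_state() {
  local url=$1
  if [ -z "$url" ]; then
    echo unknown   # not probed: must not be reported as an outage
    return 0
  fi
  if ${PROBE:-curl -fsS --max-time 5} "$url" >/dev/null 2>&1; then
    echo true
  else
    echo false     # probed and confirmed down
  fi
}

# Caps the composite at 2.0 only when some layer is confirmed down.
#   $1 = composite score, remaining arguments = infrastructure states
cap_composite() {
  local composite=$1 state
  shift
  for state in "$@"; do
    if [ "$state" = false ]; then
      awk -v c="$composite" 'BEGIN { printf "%.1f\n", (c > 2.0 ? 2.0 : c) }'
      return 0
    fi
  done
  echo "$composite"
}
```

A CI run with no URLs configured yields three unknown states and leaves the composite untouched, while a single confirmed false still triggers the cap.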

Health Stratification: Pulse vs. Full

The probe fix exposed a structural gap: we had no “is it alive?” check cheaper than running the full VBMS scorer. If you wanted to know whether a deploy broke something, you either skipped all checks (fast, risky) or ran everything (three minutes, correct). Developers were choosing fast.

We introduced two explicit tiers:

pulse:  ## <30s: is it serving traffic right now?
	$(HEALTH_BIN) --category networking,topology,control --timeout 15s

health: ## ~3 min: full VBMS with live infrastructure verification
	VBMS_LIVE=1 bash scripts/vbms_score.sh

make pulse runs in under 30 seconds. It checks networking, topology, and the control plane. It answers the post-deploy question: “did I break routing?” It runs as a systemd timer on Spark 1 every five minutes. When it fails, it pages.

make health takes three minutes. It runs forge-health first (which sets infrastructure state), then runs VBMS with VBMS_LIVE=1. forge-health’s exit code feeds directly into the VBMS infra flags, so J9 is scored against the actual live system. This runs nightly via GitHub Actions at 06:00 UTC and as a pre-release gate.

The insight isn’t the split itself. It’s being explicit about what each check is for. A post-deploy pulse and a nightly integrity check serve different purposes. Conflating them means you optimize for one and neglect the other.

Making the Score Visible

A quality metric nobody sees is a metric nobody acts on. When SLACK_WEBHOOK_URL is set, the scorer posts a compact card to #alerts-infra after every run:

🟡 VBMS 3.8/5.0  |  Journey Verified  (v0.5.2-290-g46a0008)
✅J1:5 ✅J2:4 ✅J3:5 ✅J4:4 ✅J5:4 ✅J6:4 ✅J7:4 ✅J8:5 🟡J9:3 ✅J10:4 🔴J11:2
🕒 2026-05-29T06:00:00Z

The nightly run doesn’t block deploys. It’s a reliability culture artifact. The team sees the score every morning. A red J11 means someone checks the GPU nodes before a feature sprint. A yellow J9 means an engine investigation before standup.
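The journey row of that card is simple to build. A rough sketch, assuming per-journey scores arrive as name:score pairs and that the emoji thresholds match the sample card (4-5 green, 3 yellow, below 3 red). journey_row and post_card are hypothetical names; the payload uses the text field of a Slack incoming webhook.

```bash
# Hypothetical formatter: turns "J1:5 J9:3 J11:2" into the emoji row.
journey_row() {
  local pair score mark out=""
  for pair in "$@"; do
    score=${pair#*:}
    if [ "$score" -ge 4 ]; then
      mark="✅"
    elif [ "$score" -eq 3 ]; then
      mark="🟡"
    else
      mark="🔴"
    fi
    out="$out$mark$pair "
  done
  echo "${out% }"   # drop the trailing space
}

# Posts only when SLACK_WEBHOOK_URL is set, as described above.
# Assumes single-line text with no double quotes or backslashes.
post_card() {
  [ -n "${SLACK_WEBHOOK_URL:-}" ] || return 0
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null
}
```

Keeping the formatter separate from the network call means the card layout can be checked in CI without a webhook.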

A metric that’s always green isn’t measuring anything real.

What We Learned

Static analysis is not a runtime gate. When you’re moving fast and a gate needs to exist, it’s easy to write something that produces a pass/fail signal without verifying what you meant to verify. Check your gate names against their implementations. If the name says “SLA” and the body says grep, fix the name or fix the body.

Vacuous passes are worse than no test. A no-gates journey scoring L5 didn’t just fail to catch problems. It actively suppressed the anxiety that would have prompted us to write a real gate. The green score was doing work. Removing it felt like a regression but was progress: we had an accurate picture for the first time.

“Unknown” and “down” are different states. Infrastructure health monitoring needs at least three values. Conflating “we didn’t probe” with “confirmed down” corrupts everything downstream: composite scores, per-journey caps, and team situational awareness. Model the difference explicitly.

Stratification matters for reliability culture. If your only health check is a 3-minute full suite, people won’t run it after every deploy. Build both a pulse and a full run, name them clearly, and wire them to different triggers. The pulse catches deploy regressions; the full run catches drift.

Build reliability tooling as first-class software. forge-health started as a one-off diagnostic script. Promoting it to a proper Go binary (21 registered checks, JSON output, machine-readable exit codes) paid off immediately: it became the infrastructure layer for VBMS live scoring, the post-deploy pulse, and the on-call paging system. Throw-away scripts stay throw-away; real tools get reused in ways you didn’t anticipate.

The honest score after v6.0 shipped was lower than before. Four journeys dropped from L5 to L4. J9 became harder to pass. That is the correct outcome. A quality system that can only go up is measuring its own optimism, not your product.

Forge is the embedding infrastructure behind Voxell AI: a four-tier API (turbo/pro/ultra/ingot) on Go-native CUDA inference. VBMS source lives in scripts/vbms_score.sh; forge-health in cmd/forge-health/. Questions: engineering@voxell.ai