NAC3 v2.3 / Benchmark
← spec home  ·  2026-05-19
Empirical benchmark · 600 runs · 5 models · 3 protocols

NAC3 vs MCP vs Raw DOM
at production scale.

Five LLMs, three protocols, four tasks, ten iterations each — the largest public comparison of agentic UI protocols to date. Zero phantom-selector silent damage under NAC3, 15% under Raw DOM, 3.5× fewer round-trips than MCP. Full reproducibility: every JSONL row carries runtime version and manifest checksum, the entire test bundle is downloadable from this repo, and the cost of re-running the full benchmark is under US$15.

5 modelsSonnet 4.6 · Haiku 4.5 · DeepSeek · Gemini 3.1 Flash Lite · GPT-4o-mini
3 protocolsNAC3 · MCP · Raw DOM
4 tasksatomic · exploratory · compound · async
600 runs10 iter × 4 tasks × 3 conds × 5 models
NAC3 success
95.5%
191/200 runs · vs 91% Raw · vs 78.5% MCP
Silent damage
0/200
NAC3 · vs 30/200 in Raw DOM (15%)
Round-trips
1.0
NAC3 · vs 3.5 for MCP
Cheap-model unlock
100%
Gemini Flash Lite + GPT-4o-mini under NAC3
§ 01

Two documents,
one dataset.

The benchmark output is split into two artifacts. Results is the public report — what we measured and what it means. Methodology is the audit trail — how the numbers were actually made, what fixes were applied along the way, and what the dataset deliberately does not say. Both share the same 600-row JSONL dataset.

§ Findings

Benchmark results.

The public report. Hero numbers, hallucination breakdown by category, success per (model × condition × task), cost-latency trade-off, failure modes, model fit. Eight sections covering what 600 runs revealed about agentic UI protocols in production.

Open results →
§ How it was built

Benchmark methodology.

The audit document. Setup & scale (600/600 verified), the five fixes applied iteratively during development, anti-bias controls (one prompt, one fixture, ten iterations per cell), traceability (runtime version + manifest checksum per row), and explicit limitations.

Open methodology →
Runtime provenance. All 600 runs executed against @nac3/runtime at commit-equivalent of v2.3.0 (the npm constant exposes 2.2.1, kept for backwards compat — see methodology §07 for the equivalence). Manifest checksum 67db3eef3658 across all rows. The benchmark dataset does not test the v2.4 RFCs (snapshot versioning and authority) — those have their own benchmark, separately, when the runtime ships them.
§ 02

Re-run the entire benchmark
in under two hours.

The complete test harness — fixture app, 4 task definitions, 5 model bindings, 3 condition runners, scoring, and reporting scripts — is bundled into a single tarball downloadable from this directory. Cost to re-run the full 600 runs: roughly US$15 in API credits. Wall-clock: 1h 23min on commodity hardware. Bring your own API keys.

Re-running the benchmark

Bring API keys for the five providers in your .env: Anthropic, OpenAI, DeepSeek, Google Gemini. The harness reads them at boot. No credentials ship with the bundle.

  1. Download the bundle.
    Download nac3-benchmark-v2.3.tar.gz →  or clone the source tree at benchmark/bundle/ in this repo.
  2. Extract and install. tar -xzf nac3-benchmark-v2.3.tar.gz cd nac3-benchmark-v2.3 npm install npx playwright install chromium
  3. Configure API keys — replace the fictional placeholder values with your real keys. cp .env.example .env # The .env.example ships with fictional placeholders like # ANTHROPIC_API_KEY=sk-ant-api03-FAKE_az302_…_replace_with_real_xxxx # Open .env in your editor and REPLACE every "FAKE_..." value with # the actual API key from your provider dashboard: # ANTHROPIC_API_KEY=<your real key from console.anthropic.com> # OPENAI_API_KEY=<your real key from platform.openai.com> # DEEPSEEK_API_KEY=<your real key from platform.deepseek.com> # GEMINI_API_KEY=<your real key from aistudio.google.com> # The harness exits with an auth error if it sees a placeholder value.
  4. Start the fixture server. node fixtures/invoice-app/server.ts & # serves the test fixture at http://127.0.0.1:8080
  5. Run the full 600-cell benchmark. bash scripts/run_final_600.sh # 5 models × 3 conds × 4 tasks × 10 iter # ~1h 23min, ~US$15
  6. Generate the unified report. RESULTS_DIR=results/N10_final \ node scripts/unified_report.mjs \ --since=2026-05-19T23 > docs/MY_REPORT.md
  7. Subset the run. Skip the full sweep and target specific cells: MODEL=gemini-3.1-flash-lite \ PILOT_CONDITIONS=nac3 \ PILOT_TASK_IDS=T1_add_monitor \ PILOT_ITER=3 \ PILOT_OUT_DIR=results/quick \ npx tsx src/harness_3way.ts
  8. Verify provenance. Every row in the resulting JSONL carries runtime_version, bench_version, and manifest_checksum. If your manifest_checksum differs from 67db3eef3658, the fixture diverged from the published baseline.
§ 03

What the dataset does not say.

A benchmark is a measurement, not a verdict. NAC3 caught zero silent damage in this dataset — that is not a guarantee for arbitrary production deployments. A model can still emit a semantically wrong plan with all-valid nac_ids; the protocol only blocks structural hallucinations (phantom selectors). The full list of intentional limitations — one fixture, one language, five models, four task shapes — is in methodology §08.