NAC3 v2.3 / Benchmark / Methodology
← results  ·  2026-05-19
Anti-bias controls · trazability · reproducibility

How the
numbers were
actually made.

A benchmark is only as honest as its instrument. This document records every fix applied to the harness, every anti-bias control imposed on the runs, and every shortcut the dataset does not take. It also names what the dataset cannot say.

§ 01

Setup & scale.

600 runs total. The cell — defined by (model × protocol × task) — has N=10 iterations. Every cell carries the same setup, the same prompt template, the same harness, executed against the same fixture state initialized fresh per run.

Dimension Value Notes
Models (5) claude-sonnet-4-6 · claude-haiku-4-5 · deepseek-chat · gemini-3.1-flash-lite · gpt-4o-mini Frontier · mid · cheap (3) — no Opus 4.7, no reasoners
Protocols (3) NAC3 · MCP · Raw DOM Shared surface · tool calls · serialized HTML
NAC3 runtime 2.3.0 (commit ≡ 2.2.1 at exec) Decoration + verbs + syncPlugin + isActionSafe. NOT v2.4 RFCs.
Tasks (4) T1 · T_MCP1 · T_MCP4 · T_V23 Atomic · exploratory · compound · async-handler
Iterations 10 per cell 5 × 3 × 4 × 10 = 600 runs
temperature 0 where supported Minimum-supported fallback otherwise
max_completion_tokens 8192 / 2048 Reasoners / non-reasoners (see § 03 · Fix A)
max_turns 10 Sufficient for compound tasks
timeout 90 s per LLM call Conservative
Retries ≤ 2 if api_error or timeout Never for parse_error or post_condition_failed
Execution order by iteration All models in iter 1, then iter 2, etc.
Concurrency serial Avoids fixture state contamination
§ 02

Five conditions planned,
three executed.

The original design proposed five conditions including two Forge-decorated variants of NAC3. Forge wasn't ready in time. We chose to publish the protocol benchmark now and run the Forge benchmark separately when the product matures.

Condition 1
MCP
Tool-calling protocol. Tool catalog enumerated. Multi-turn orchestration. Closed surface by design.
RUN · 200 runs
Condition 2
NAC3
Shared surface manifest. describe() returns the full agent-relevant DOM. Validator filters phantom nac_ids.
RUN · 200 runs
Condition 3
Raw DOM
Serialized HTML in prompt. No surface contract. Open selectors. Closed only by the model's own discipline.
RUN · 200 runs
Condition 4
NAC3 + Forge silent
Manifest generated automatically by Forge inferring roles from code. Measures Forge's silent-mode quality.
DEFERRED · Forge MVP not ready
Condition 5
NAC3 + Forge assisted
Manifest generated by Forge in dialogue with user. Measures Forge's assisted-mode quality.
DEFERRED · Forge MVP not ready

Conditions 4 and 5 measure a product (Forge), not the protocol. Mixing them into this dataset would couple the benchmark's credibility to a commercial product still in development. Cleaner to publish the protocol benchmark on its own — and run a Forge-specific benchmark with the same methodology when Forge reaches stable beta.

What this means for the published numbers: the NAC3 results in this dataset reflect manifest decorated by hand — the realistic state of adoption today. Forge variants will be measured against this baseline separately.

§ 03

Five fixes,
applied iteratively.

Each fix was motivated by a failure pattern observed in an earlier run and validated empirically before being applied to the next iteration. The dataset documents not only the final numbers but the path that produced them.

Three times in this benchmark, what looked like model failure turned out to be harness over-strictness. The instrument of measurement is code we wrote — it deserves the same skepticism we'd give the model.

§ 04

Anti-bias controls.

For protocol comparisons to mean anything, the variable being measured must be the protocol itself — not differences in prompts, models, or execution conditions. These are the controls imposed on every run.

Model identity
Same model_id exact across conditions of the same logical run. Same temperature (0 where supported). Same max_turns (10). Same timeout (90s).
Token cap
Dynamic by family: 8192 for reasoners (o4-mini, gpt-5.5), 2048 for others. Identical setting across the 3 protocols for the same model.
Task prompt
Verbatim identical across conditions. The prompt never mentions which protocol is being used. The protocol-specific section sits in the system prompt, isolated.
System prompt
Common section identical (role, JSON output format, anti-hallucination rules). Protocol section differs (NAC3 verbs vs MCP tools vs Raw HTML). Anti-halt instruction added to MCP only (justified in § 03 · Fix C).
Fixture state
Reset to identical snapshot before every run. No run can affect a subsequent one. Verified by isolation checks during development.
Postcondition
Same evaluation code across conditions for the same task. V23_EDITOR uses waitForFunction(300ms) in all three conditions (not just NAC3).
Execution order
By iteration, not by model. Iteration 1 runs all 60 cells (5 models × 3 cond × 4 tasks). Then iteration 2. This distributes API load and provider drift evenly.
Concurrency
Serial. One run at a time. Never two conditions of the same model in parallel — would contaminate fixture state.
Retry policy
Up to 2 retries on api_error or timeout. Never retry on parse_error or post_condition_failed — those are legitimate signal.
Pricing snapshot
Provider rates captured at execution time (2026-05-19). Cost calculations use input + output tokens against that snapshot. Future runs may have different costs as providers change rates.
§ 05

Trazability:
every row knows
where it came from.

For a benchmark to be auditable, every result must be traceable back to the exact code and configuration that produced it. Every JSONL row in the dataset carries three identity fields beyond the standard metrics.

Field Value Purpose
runtime_version 2.2.1 NAC3 runtime library version at execution time
bench_version 0.1.0 Harness version (this benchmark codebase)
manifest_checksum 67db3eef3658 SHA-256 of fixture manifest, truncated to 12 chars
started_at / ended_at ISO 8601 Per-run timing for replay and audit
run_id timestamp__model__task__cond__iter Globally unique identifier
Playwright trace.zip per-run Full execution trace for any run, replayable in Playwright DevTools

If a reviewer asks "are you sure the manifest was the same in all 600 runs?" — the answer is literally in every JSONL row. manifest_checksum = 67db3eef3658, constant. If a row had a different checksum, it would prove a mid-run change.

This is what makes the dataset auditable. Not "trust us""verify it."

A note on version numbers. The JSONL field reads runtime_version = 2.2.1 — the package.json string at execution time. The same commit ships to npm as 2.3.0 stable: the benchmark validated the decoration + verb + idempotency layer (including the new syncPlugin API and plugin-instance uniqueness), which is a backward-compatible minor release over 2.2.x. The git commit SHA in the repo ties the published 2.3.0 to the exact code that produced these numbers. This dataset does not test the v2.4 RFCs (snapshot versioning, agent authority) — those are unimplemented roadmap, not part of the runtime measured here.

Reproducing the dataset

$15 · ~1h 23min · API keys for ANTHROPIC, OPENAI, DEEPSEEK, GOOGLE
# Clone, install, configure .env with all four API keys
git clone https://github.com/yujinapp/nac3-bench
cd nac3-bench/bench
npm install

# Run all 600 cells in one execution
bash scripts/run_final_600.sh

# Aggregate per-run JSONL into a single report
node scripts/unified_report.mjs --since=2026-05-19T23 \
  > docs/N10_FINAL_REPORT.md

# Optionally verify the manifest checksum matches:
sha256sum fixtures/invoice-app/manifest.json | head -c 12
runtime_version2.2.1
bench_version0.1.0
manifest_checksum67db3eef3658
execution date2026-05-19
total runs600 / 600 ✓
retries used< 0.5% of runs
§ 07

What the dataset cannot say.

Every benchmark trades off scope for depth. This one chose depth — five fixes applied, six criteria verified, full per-run audit trail. The cost is scope.

  • One fixture. An invoice-app in Spanish. Single-page, no SSR, no auth. We don't claim generalization to marketplaces, real-time dashboards, or multi-step flows. We claim that within this fixture, the protocols behave as measured.
  • Four tasks. Atomic add, exploratory cancel, compound create, async-handler grammar fix. We chose them to span complexity classes — but they're four points on a curve, not the whole curve.
  • Five models. No Opus 4.7 (excluded for ROI), no reasoning models at N=10 (validated at N=1 only), no open-weight models. The cheap-model unlock claim applies to Gemini Flash Lite and GPT-4o-mini specifically.
  • One language. Spanish. Some failures (DeepSeek's consultoria regex mismatch, GPT-4o-mini's "Nueva linea" confusion) are likely language-sensitive.
  • Non-determinism. temperature=0 doesn't guarantee identical outputs on re-run. Anthropic and OpenAI explicitly do not commit to bit-identity. Some marginal variance across re-runs is expected.
  • One region. Latency measured from Argentina → US-East endpoints. Customers in EU, APAC, or US-West will see different absolute numbers (relative comparisons should hold).
  • One execution date. 2026-05-19. Provider pricing, model snapshots, and runtime behavior all evolve. Re-running in 6 months may produce different absolute numbers — but the methodology (controls, fixes, trazability) is designed to be valid across time.

Finally, the dataset reports what happened with this instrument on these runs. We make no claim about NAC3, MCP, or Raw DOM in production at scale — only that within the controlled conditions of this benchmark, the patterns observed were as documented. Extending the claims requires extending the benchmark.