A benchmark is only as honest as its instrument. This document records every fix applied to the harness, every anti-bias control imposed on the runs, and every shortcut the dataset does not take. It also names what the dataset cannot say.
600 runs total across 60 cells. Each cell, defined by (model × protocol × task), has N=10 iterations. Every cell carries the same setup, the same prompt template, and the same harness, executed against the same fixture state initialized fresh per run.
| Dimension | Value | Notes |
|---|---|---|
| Models (5) | claude-sonnet-4-6 · claude-haiku-4-5 · deepseek-chat · gemini-3.1-flash-lite · gpt-4o-mini | Frontier · mid · cheap (3) — no Opus 4.7, no reasoners |
| Protocols (3) | NAC3 · MCP · Raw DOM | Shared surface · tool calls · serialized HTML |
| NAC3 runtime | 2.3.0 (commit ≡ 2.2.1 at exec) | Decoration + verbs + syncPlugin + isActionSafe. NOT v2.4 RFCs. |
| Tasks (4) | T1 · T_MCP1 · T_MCP4 · T_V23 | Atomic · exploratory · compound · async-handler |
| Iterations | 10 per cell | 5 × 3 × 4 × 10 = 600 runs |
| temperature | 0 where supported | Minimum-supported fallback otherwise |
| max_completion_tokens | 8192 / 2048 | Reasoners / non-reasoners (see § 03 · Fix A) |
| max_turns | 10 | Sufficient for compound tasks |
| timeout | 90 s per LLM call | Conservative |
| Retries | ≤ 2 if api_error or timeout | Never for parse_error or post_condition_failed |
| Execution order | by iteration | All models in iter 1, then iter 2, etc. |
| Concurrency | serial | Avoids fixture state contamination |
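The run matrix and retry rule in the table above can be sketched as follows. This is a minimal illustration, not the harness code: the function names `buildRunPlan` and `shouldRetry` are hypothetical, and the ordering of models, protocols, and tasks inside one iteration is an assumption (the table only states that iteration is the outer loop).

```javascript
// Values copied from the configuration table; function names are illustrative.
const MODELS = [
  "claude-sonnet-4-6",
  "claude-haiku-4-5",
  "deepseek-chat",
  "gemini-3.1-flash-lite",
  "gpt-4o-mini",
];
const PROTOCOLS = ["NAC3", "MCP", "Raw DOM"];
const TASKS = ["T1", "T_MCP1", "T_MCP4", "T_V23"];
const ITERATIONS = 10;
const MAX_RETRIES = 2;

// Iteration is the outermost loop, so every cell runs once
// before any cell runs a second time.
function buildRunPlan() {
  const plan = [];
  for (let iter = 1; iter <= ITERATIONS; iter++) {
    for (const model of MODELS) {
      for (const protocol of PROTOCOLS) {
        for (const task of TASKS) {
          plan.push({ iter, model, protocol, task });
        }
      }
    }
  }
  return plan;
}

// Retry only infrastructure failures. parse_error and
// post_condition_failed are treated as signal and never retried.
function shouldRetry(failure, retriesSoFar) {
  const retryable = failure === "api_error" || failure === "timeout";
  return retryable && retriesSoFar < MAX_RETRIES;
}
```

Running the plan serially, in this order, spreads any provider-side drift over time across all cells instead of concentrating it in one model or protocol.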
The original design proposed five conditions including two Forge-decorated variants of NAC3. Forge wasn't ready in time. We chose to publish the protocol benchmark now and run the Forge benchmark separately when the product matures.
describe() returns the full agent-relevant DOM. Validator filters phantom nac_ids.
Conditions 4 and 5 measure a product (Forge), not the protocol. Mixing them into this dataset would couple the benchmark's credibility to a commercial product still in development. It is cleaner to publish the protocol benchmark on its own and to run a Forge-specific benchmark with the same methodology when Forge reaches stable beta.
What this means for the published numbers: the NAC3 results in this dataset reflect a manifest decorated by hand, which is the realistic state of adoption today. Forge variants will be measured against this baseline separately.
Each fix was motivated by a failure pattern observed in an earlier run and validated empirically before being applied to the next iteration. The dataset documents not only the final numbers but the path that produced them.
The pilot N=1 run produced 6 parse_error failures from o4-mini and GPT-5.5
on the compound task — all with tok_out = 3072 (the cap). Reasoning models
consume tokens internally and ran out of budget for the final JSON. Fix: raise the cap
for reasoners to 8192; raise non-reasoner default to 2048 in the final run after Haiku
hit the old 1024 cap on compound tasks.
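The diagnosis behind this fix can be sketched as a small check: a parse_error whose output length sits exactly at the cap points to truncated JSON, not malformed JSON. The field name `tok_out` comes from the text above; the `failure` field and both function names are assumptions made for illustration.

```javascript
// Caps from the configuration table: reasoners / non-reasoners.
const TOKEN_CAPS = { reasoner: 8192, nonReasoner: 2048 };

function maxCompletionTokens(isReasoner) {
  return isReasoner ? TOKEN_CAPS.reasoner : TOKEN_CAPS.nonReasoner;
}

// Flags rows where the model most likely ran out of budget before
// emitting its final JSON. `row.failure` is a hypothetical field name.
function looksTruncated(row, cap) {
  return row.failure === "parse_error" && row.tok_out >= cap;
}
```

Applied to the pilot, every one of the six failing rows would be flagged with `cap = 3072`, which is what separated a budget problem from a model problem.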
waitForFunction(300ms)
The grammar-correction handler in the fixture is asynchronous (~150ms from click to DOM update). The postcondition was reading the DOM before the update landed. Three models (Opus, GPT-5.5, gpt-4o-mini) failed for this reason alone. Fix: wait up to 300ms for the textarea to actually change before evaluating the postcondition. This is a harness correction, not a protocol change.
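The waiting logic can be illustrated without a browser. The text names waitForFunction as the mechanism; the sketch below is a dependency-free stand-in that polls a reader function until the value changes or the budget runs out. The helper name `waitForChange` and the 10 ms poll interval are assumptions, not harness code.

```javascript
// Polls `read()` until its value differs from `before`, or until
// `timeoutMs` elapses. Resolves to the latest value either way, so the
// caller can evaluate the postcondition on whatever state was reached.
async function waitForChange(read, before, timeoutMs = 300, pollMs = 10) {
  const deadline = Date.now() + timeoutMs;
  let current = await read();
  while (current === before && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    current = await read();
  }
  return current;
}
```

Because the helper returns instead of throwing on timeout, a run in which the textarea never changes still reaches the postcondition and fails there, which keeps the failure attributed to the run and not to the wait.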
GPT-4o-mini in MCP mode was halting prematurely, asking the user for IDs that tools could fetch. We added an explicit instruction: "call tools first; never claim success unless the corresponding tool call was emitted." The second sentence turned out to be critical — without it, Haiku (a small model) interpreted the prompt as license to declare success without acting.
Reporting hallucinated_target = true/false obscures the worst case in
production: phantom selectors that pass CI. Split into four categories:
none, filtered (NAC3 isActionSafe caught it),
reached_failed (dispatch failed visibly),
reached_succeeded (postcondition accepted phantom side-effect).
The last is silent damage. This is the metric the benchmark exists to expose.
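The four-way split can be sketched as a classifier over per-run flags. Only the four category names come from the text; the flag names on `run` are hypothetical, and mapping every phantom that reaches dispatch without passing the postcondition to reached_failed is an assumption.

```javascript
// Field names on `run` are hypothetical; only the category names
// are taken from the benchmark description.
function classifyHallucination(run) {
  if (!run.targetIsPhantom) return "none";
  if (run.blockedByIsActionSafe) return "filtered";
  // The phantom target reached dispatch. If the postcondition still
  // accepted the outcome, nothing downstream would notice the error.
  return run.postconditionPassed ? "reached_succeeded" : "reached_failed";
}
```

A boolean `hallucinated_target` would collapse the last three outcomes into one value, hiding the difference between a caught error and an undetected one.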
tab action: prefix-aware, like tab_by_label
Tier 1 N=10 exposed 4 silent-damage cases in NAC3 — all DeepSeek + T_MCP4. Investigation
revealed the cause was not the model: the harness validator rejected
kind='tab' actions because it looked up tab_key as if it were
a nac_id in plugin.elements[]. But tab keys are wizard metadata,
not nac_ids. The validator was over-strict by inconsistency: tab_by_label
was permissive; tab was not.
Fix: align tab with tab_by_label — defer to runtime,
accept if plugin and tab_key are present.
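The corrected validator rule might look like the sketch below. The shapes of `action` and `plugins` are assumptions made for illustration; the only details taken from the text are the field names kind, tab_key, and nac_id, the plugin.elements[] lookup, and the rule that a tab action is accepted when plugin and tab_key are present.

```javascript
// `plugins` is assumed to map a plugin name to { elements: [{ nac_id }] }.
function validateAction(action, plugins) {
  if (action.kind === "tab") {
    // Tab keys are wizard metadata, not nac_ids, so they are not
    // looked up in plugin.elements. Resolution is deferred to the runtime.
    const present = Boolean(action.plugin) && Boolean(action.tab_key);
    return present
      ? { ok: true }
      : { ok: false, reason: "missing_plugin_or_tab_key" };
  }
  // Every other kind must target an element the plugin declares.
  const plugin = plugins[action.plugin];
  const known =
    Boolean(plugin) &&
    plugin.elements.some((el) => el.nac_id === action.nac_id);
  return known ? { ok: true } : { ok: false, reason: "phantom_nac_id" };
}
```

Under the old behavior, the first branch did not exist, so a valid tab action fell through to the element lookup and was rejected as if it were a phantom target.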
Three times in this benchmark, what looked like model failure turned out to be harness over-strictness. The instrument of measurement is code we wrote — it deserves the same skepticism we'd give the model.
For protocol comparisons to mean anything, the variable being measured must be the protocol itself — not differences in prompts, models, or execution conditions. These are the controls imposed on every run.
model_id exact across conditions of the same logical run. Same temperature (0 where supported). Same max_turns (10). Same timeout (90s).
waitForFunction(300ms) in all three conditions (not just NAC3).
Retry only on api_error or timeout. Never retry on parse_error or post_condition_failed: those are legitimate signal.
For a benchmark to be auditable, every result must be traceable back to the exact code and configuration that produced it. Every JSONL row in the dataset carries three identity fields beyond the standard metrics.
| Field | Value | Purpose |
|---|---|---|
| runtime_version | 2.2.1 | NAC3 runtime library version at execution time |
| bench_version | 0.1.0 | Harness version (this benchmark codebase) |
| manifest_checksum | 67db3eef3658 | SHA-256 of fixture manifest, truncated to 12 chars |
| started_at / ended_at | ISO 8601 | Per-run timing for replay and audit |
| run_id | timestamp__model__task__cond__iter | Globally unique identifier |
| Playwright trace.zip | per-run | Full execution trace for any run, replayable in Playwright DevTools |
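The run_id layout in the table can be sketched as a build and parse pair. The double underscore matters because task names such as T_MCP4 already contain a single underscore. The helper names, the timestamp format, and the condition label used in the usage example are assumptions; the table only gives the field order.

```javascript
const SEP = "__";

// Joins the five fields in the order given by the table.
function buildRunId({ timestamp, model, task, cond, iter }) {
  return [timestamp, model, task, cond, String(iter)].join(SEP);
}

// Splits on the double underscore only, so single underscores
// inside a task name survive intact.
function parseRunId(runId) {
  const parts = runId.split(SEP);
  if (parts.length !== 5) {
    throw new Error(`malformed run_id: ${runId}`);
  }
  const [timestamp, model, task, cond, iter] = parts;
  return { timestamp, model, task, cond, iter: Number(iter) };
}
```

For example, `buildRunId({ timestamp: "20260519T230000Z", model: "deepseek-chat", task: "T_MCP4", cond: "nac3", iter: 3 })` yields an identifier that `parseRunId` splits back into the same five fields.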
If a reviewer asks "are you sure the manifest was the same in all 600 runs?" —
the answer is literally in every JSONL row. manifest_checksum = 67db3eef3658,
constant. If a row had a different checksum, it would prove a mid-run change.
This is what makes the dataset auditable. Not "trust us" — "verify it."
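A reviewer could run that verification with a few lines over the JSONL. The sketch below assumes one JSON object per line with the manifest_checksum and run_id fields described above; the function name `auditManifestChecksums` is hypothetical.

```javascript
// Returns the number of rows inspected and the run_ids whose
// manifest_checksum differs from the expected value.
function auditManifestChecksums(jsonlText, expected) {
  const rows = jsonlText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
  const mismatches = rows
    .filter((row) => row.manifest_checksum !== expected)
    .map((row) => row.run_id);
  return { total: rows.length, mismatches };
}
```

An empty `mismatches` array over all rows is the evidence that the manifest was constant; any entry names the exact run to inspect.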
A note on version numbers. The JSONL field reads
runtime_version = 2.2.1 — the package.json string at execution time. The same
commit ships to npm as 2.3.0 stable: the benchmark validated the
decoration + verb + idempotency layer (including the new syncPlugin API and
plugin-instance uniqueness), which is a backward-compatible minor release over 2.2.x.
The git commit SHA in the repo ties the published 2.3.0 to the exact code that produced
these numbers. This dataset does not test the v2.4 RFCs (snapshot
versioning, agent authority) — those are unimplemented roadmap, not part of the runtime
measured here.
    # Clone, install, configure .env with all four API keys
    git clone https://github.com/yujinapp/nac3-bench
    cd nac3-bench/bench
    npm install

    # Run all 600 runs in one execution
    bash scripts/run_final_600.sh

    # Aggregate per-run JSONL into a single report
    node scripts/unified_report.mjs --since=2026-05-19T23 \
      > docs/N10_FINAL_REPORT.md

    # Optionally verify the manifest checksum matches:
    sha256sum fixtures/invoice-app/manifest.json | head -c 12
Every benchmark trades off scope for depth. This one chose depth — five fixes applied, six criteria verified, full per-run audit trail. The cost is scope.
consultoria regex mismatch, GPT-4o-mini's "Nueva linea" confusion) are likely language-sensitive.
temperature=0 doesn't guarantee identical outputs on re-run. Anthropic and OpenAI explicitly do not commit to bit-identity. Some marginal variance across re-runs is expected.
Finally, the dataset reports what happened with this instrument on these runs. We make no claim about NAC3, MCP, or Raw DOM in production at scale, only that within the controlled conditions of this benchmark, the patterns observed were as documented. Extending the claims requires extending the benchmark.