Anti-bias controls · trazability · reproducibility

How the
numbers were
actually made.

A benchmark is only as honest as its instrument. This document records every fix applied to the harness, every anti-bias control imposed on the runs, and every shortcut the dataset does not take. It also names what the dataset cannot say.

§ 01

Setup & scale.

600 runs total. The cell — defined by (model × protocol × task) — has N=10 iterations. Every cell carries the same setup, the same prompt template, the same harness, executed against the same fixture state initialized fresh per run.

Dimension	Value	Notes
Models (5)	claude-sonnet-4-6 · claude-haiku-4-5 · deepseek-chat · gemini-3.1-flash-lite · gpt-4o-mini	Frontier · mid · cheap (3) — no Opus 4.7, no reasoners
Protocols (3)	NAC3 · MCP · Raw DOM	Shared surface · tool calls · serialized HTML
NAC3 runtime	2.3.0 (commit ≡ 2.2.1 at exec)	Decoration + verbs + syncPlugin + isActionSafe. NOT v2.4 RFCs.
Tasks (4)	T1 · T_MCP1 · T_MCP4 · T_V23	Atomic · exploratory · compound · async-handler
Iterations	10 per cell	5 × 3 × 4 × 10 = 600 runs
temperature	0 where supported	Minimum-supported fallback otherwise
max_completion_tokens	8192 / 2048	Reasoners / non-reasoners (see § 03 · Fix A)
max_turns	10	Sufficient for compound tasks
timeout	90 s per LLM call	Conservative
Retries	≤ 2 if api_error or timeout	Never for parse_error or post_condition_failed
Execution order	by iteration	All models in iter 1, then iter 2, etc.
Concurrency	serial	Avoids fixture state contamination

§ 02

Five conditions planned,
three executed.

The original design proposed five conditions including two Forge-decorated variants of NAC3. Forge wasn't ready in time. We chose to publish the protocol benchmark now and run the Forge benchmark separately when the product matures.

Condition 1

MCP

Tool-calling protocol. Tool catalog enumerated. Multi-turn orchestration. Closed surface by design.

RUN · 200 runs

Condition 2

NAC3

Shared surface manifest. describe() returns the full agent-relevant DOM. Validator filters phantom nac_ids.

RUN · 200 runs

Condition 3

Raw DOM

Serialized HTML in prompt. No surface contract. Open selectors. Closed only by the model's own discipline.

RUN · 200 runs

Condition 4

NAC3 + Forge silent

Manifest generated automatically by Forge inferring roles from code. Measures Forge's silent-mode quality.

DEFERRED · Forge MVP not ready

Condition 5

NAC3 + Forge assisted

Manifest generated by Forge in dialogue with user. Measures Forge's assisted-mode quality.

DEFERRED · Forge MVP not ready

Conditions 4 and 5 measure a product (Forge), not the protocol. Mixing them into this dataset would couple the benchmark's credibility to a commercial product still in development. Cleaner to publish the protocol benchmark on its own — and run a Forge-specific benchmark with the same methodology when Forge reaches stable beta.

What this means for the published numbers: the NAC3 results in this dataset reflect manifest decorated by hand — the realistic state of adoption today. Forge variants will be measured against this baseline separately.

§ 03

Five fixes,
applied iteratively.

Each fix was motivated by a failure pattern observed in an earlier run and validated empirically before being applied to the next iteration. The dataset documents not only the final numbers but the path that produced them.

Token cap: 8192 for reasoners, 2048 for non-reasoners

The pilot N=1 run produced 6 parse_error failures from o4-mini and GPT-5.5 on the compound task — all with tok_out = 3072 (the cap). Reasoning models consume tokens internally and ran out of budget for the final JSON. Fix: raise the cap for reasoners to 8192; raise non-reasoner default to 2048 in the final run after Haiku hit the old 1024 cap on compound tasks.

Validated · 0 parse_error in 600 runs
V23_EDITOR async handler: waitForFunction(300ms)

The grammar-correction handler in the fixture is asynchronous (~150ms from click to DOM update). The postcondition was reading the DOM before the update landed. Three models (Opus, GPT-5.5, gpt-4o-mini) failed for this reason alone. Fix: wait up to 300ms for the textarea to actually change before evaluating postcondition. This is a harness correction, not a protocol change.

Validated · Opus 12/12 perfect in paso 9 v2
MCP system prompt: anti-halt + output contract

GPT-4o-mini in MCP mode was halting prematurely, asking the user for IDs that tools could fetch. We added an explicit instruction: "call tools first; never claim success unless the corresponding tool call was emitted." The second sentence turned out to be critical — without it, Haiku (a small model) interpreted the prompt as license to declare success without acting.

Validated · GPT-4o-mini MCP 1→3 wins · Haiku stabilized with second clause
Hallucination categorization in 4 buckets

Reporting hallucinated_target = true/false obscures the worst case in production: phantom selectors that pass CI. Split into four categories: none, filtered (NAC3 isActionSafe caught it), reached_failed (dispatch failed visibly), reached_succeeded (postcondition accepted phantom side-effect). The last is silent damage. This is the metric the benchmark exists to expose.

Validated · 30 silent damage in Raw, 0 in NAC3 and MCP
Validator tab action: prefix-aware, like tab_by_label

Tier 1 N=10 exposed 4 silent-damage cases in NAC3 — all DeepSeek + T_MCP4. Investigation revealed the cause was not the model: the harness validator rejected kind='tab' actions because it looked up tab_key as if it were a nac_id in plugin.elements[]. But tab keys are wizard metadata, not nac_ids. The validator was over-strict by inconsistency: tab_by_label was permissive; tab was not. Fix: align tab with tab_by_label — defer to runtime, accept if plugin and tab_key are present.

Validated · NAC3 silent damage 4/200 → 0/200 in final 600-run dataset

Three times in this benchmark, what looked like model failure turned out to be harness over-strictness. The instrument of measurement is code we wrote — it deserves the same skepticism we'd give the model.

§ 04

Anti-bias controls.

For protocol comparisons to mean anything, the variable being measured must be the protocol itself — not differences in prompts, models, or execution conditions. These are the controls imposed on every run.

Model identity

Same model_id exact across conditions of the same logical run. Same temperature (0 where supported). Same max_turns (10). Same timeout (90s).

Token cap

Dynamic by family: 8192 for reasoners (o4-mini, gpt-5.5), 2048 for others. Identical setting across the 3 protocols for the same model.

Task prompt

Verbatim identical across conditions. The prompt never mentions which protocol is being used. The protocol-specific section sits in the system prompt, isolated.

System prompt

Common section identical (role, JSON output format, anti-hallucination rules). Protocol section differs (NAC3 verbs vs MCP tools vs Raw HTML). Anti-halt instruction added to MCP only (justified in § 03 · Fix C).

Fixture state

Reset to identical snapshot before every run. No run can affect a subsequent one. Verified by isolation checks during development.

Postcondition

Same evaluation code across conditions for the same task. V23_EDITOR uses waitForFunction(300ms) in all three conditions (not just NAC3).

Execution order

By iteration, not by model. Iteration 1 runs all 60 cells (5 models × 3 cond × 4 tasks). Then iteration 2. This distributes API load and provider drift evenly.

Concurrency

Serial. One run at a time. Never two conditions of the same model in parallel — would contaminate fixture state.

Retry policy

Up to 2 retries on api_error or timeout. Never retry on parse_error or post_condition_failed — those are legitimate signal.

Pricing snapshot

Provider rates captured at execution time (2026-05-19). Cost calculations use input + output tokens against that snapshot. Future runs may have different costs as providers change rates.

§ 05

Trazability:
every row knows
where it came from.

For a benchmark to be auditable, every result must be traceable back to the exact code and configuration that produced it. Every JSONL row in the dataset carries three identity fields beyond the standard metrics.

Field	Value	Purpose
runtime_version	2.2.1	NAC3 runtime library version at execution time
bench_version	0.1.0	Harness version (this benchmark codebase)
manifest_checksum	67db3eef3658	SHA-256 of fixture manifest, truncated to 12 chars
started_at / ended_at	ISO 8601	Per-run timing for replay and audit
run_id	timestamp__model__task__cond__iter	Globally unique identifier
Playwright trace.zip	per-run	Full execution trace for any run, replayable in Playwright DevTools

If a reviewer asks "are you sure the manifest was the same in all 600 runs?" — the answer is literally in every JSONL row. manifest_checksum = 67db3eef3658, constant. If a row had a different checksum, it would prove a mid-run change.

This is what makes the dataset auditable. Not "trust us" — "verify it."

A note on version numbers. The JSONL field reads runtime_version = 2.2.1 — the package.json string at execution time. The same commit ships to npm as 2.3.0 stable: the benchmark validated the decoration + verb + idempotency layer (including the new syncPlugin API and plugin-instance uniqueness), which is a backward-compatible minor release over 2.2.x. The git commit SHA in the repo ties the published 2.3.0 to the exact code that produced these numbers. This dataset does not test the v2.4 RFCs (snapshot versioning, agent authority) — those are unimplemented roadmap, not part of the runtime measured here.

Reproducing the dataset

$15 · ~1h 23min · API keys for ANTHROPIC, OPENAI, DEEPSEEK, GOOGLE

# Clone, install, configure .env with all four API keys
git clone https://github.com/yujinapp/nac3-bench
cd nac3-bench/bench
npm install

# Run all 600 cells in one execution
bash scripts/run_final_600.sh

# Aggregate per-run JSONL into a single report
node scripts/unified_report.mjs --since=2026-05-19T23 \
  > docs/N10_FINAL_REPORT.md

# Optionally verify the manifest checksum matches:
sha256sum fixtures/invoice-app/manifest.json | head -c 12

runtime_version2.2.1

bench_version0.1.0

manifest_checksum67db3eef3658

execution date2026-05-19

total runs600 / 600 ✓

retries used< 0.5% of runs

§ 07

What the dataset cannot say.

Every benchmark trades off scope for depth. This one chose depth — five fixes applied, six criteria verified, full per-run audit trail. The cost is scope.

One fixture. An invoice-app in Spanish. Single-page, no SSR, no auth. We don't claim generalization to marketplaces, real-time dashboards, or multi-step flows. We claim that within this fixture, the protocols behave as measured.
Four tasks. Atomic add, exploratory cancel, compound create, async-handler grammar fix. We chose them to span complexity classes — but they're four points on a curve, not the whole curve.
Five models. No Opus 4.7 (excluded for ROI), no reasoning models at N=10 (validated at N=1 only), no open-weight models. The cheap-model unlock claim applies to Gemini Flash Lite and GPT-4o-mini specifically.
One language. Spanish. Some failures (DeepSeek's consultoria regex mismatch, GPT-4o-mini's "Nueva linea" confusion) are likely language-sensitive.
Non-determinism. temperature=0 doesn't guarantee identical outputs on re-run. Anthropic and OpenAI explicitly do not commit to bit-identity. Some marginal variance across re-runs is expected.
One region. Latency measured from Argentina → US-East endpoints. Customers in EU, APAC, or US-West will see different absolute numbers (relative comparisons should hold).
One execution date. 2026-05-19. Provider pricing, model snapshots, and runtime behavior all evolve. Re-running in 6 months may produce different absolute numbers — but the methodology (controls, fixes, trazability) is designed to be valid across time.

Finally, the dataset reports what happened with this instrument on these runs. We make no claim about NAC3, MCP, or Raw DOM in production at scale — only that within the controlled conditions of this benchmark, the patterns observed were as documented. Extending the claims requires extending the benchmark.

How the
numbers were
actually made.

Setup & scale.

Five conditions planned,
three executed.

Five fixes,
applied iteratively.

Token cap: 8192 for reasoners, 2048 for non-reasoners

V23_EDITOR async handler: `waitForFunction(300ms)`

MCP system prompt: anti-halt + output contract

Hallucination categorization in 4 buckets

Validator `tab` action: prefix-aware, like `tab_by_label`

Anti-bias controls.

Trazability:
every row knows
where it came from.

Reproducing the dataset

What the dataset cannot say.

Setup & scale.

Five conditions planned,three executed.

Five fixes,applied iteratively.

Token cap: 8192 for reasoners, 2048 for non-reasoners

V23_EDITOR async handler: waitForFunction(300ms)

MCP system prompt: anti-halt + output contract

Hallucination categorization in 4 buckets

Validator tab action: prefix-aware, like tab_by_label

Anti-bias controls.

Trazability:every row knowswhere it came from.

Reproducing the dataset

What the dataset cannot say.

Five conditions planned,
three executed.

Five fixes,
applied iteratively.

V23_EDITOR async handler: `waitForFunction(300ms)`

Validator `tab` action: prefix-aware, like `tab_by_label`

Trazability:
every row knows
where it came from.