Across 600 runs spanning five language models and three protocols, NAC3's shared-surface manifest delivered 95.5% success at zero phantom-selector silent damage — while Raw DOM passed CI on 15% of runs with broken selectors, and MCP needed 3.5× more round-trips to reach the same answer.
A run shows silent damage when the model emitted invented selectors, but the postcondition still passed — by coincidence, default behavior, or partial side-effects. This is the worst failure mode in production CI: tests pass; reality breaks.
| Condition | N | none | filtered | reached_failed | reached_succeeded |
|---|---|---|---|---|---|
| NAC3 | 200 | 200 | 0 | 0 | 0 |
| MCP | 200 | 200 | 0 | 0 | 0 |
| Raw DOM | 200 | 162 | 0 | 8 | 30 |
No protocol won everything. NAC3 did. Higher success, lower latency, fewer round-trips, zero silent damage — at equal cost-per-honest-success as MCP and one-third the cost of Raw DOM.
| Protocol | Success | Tok in (mean) | Tok out (mean) | Turns | Latency LLM | Cost (200 runs) |
|---|---|---|---|---|---|---|
| NAC3 | 95.5% | 11,395 | 236 | 1.0 | 3.1 s | $2.35 |
| MCP | 78.5% | 7,712 | 413 | 3.5 | 5.5 s | $1.72 |
| Raw DOM | 91.0% | 31,026 | 404 | 1.0 | 7.4 s | $5.93 |
NAC3 trades higher per-dispatch cost for 3.5× fewer round-trips and 2.4× lower end-to-end latency. What looks like equal cost on a balance sheet is a 2× UX gap in user wait time.
Cost-per-honest-success tells a different story than cost-per-run. When phantom-selector runs (which would break in production) are excluded from Raw DOM's success column, its real cost lands at $0.039 per task — 3.5× higher than NAC3 or MCP. The structural reduction in silent damage isn't a moral victory; it's a financial one.
The most counter-intuitive finding: under NAC3, the cheapest models in the roster match the frontier model under Raw DOM. Protocol structure is worth more than model capability for UI tasks.
| Model | MCP | NAC3 | Raw DOM | Total |
|---|---|---|---|---|
| Sonnet 4.6 | 85.0% | 97.5% | 100% | 94.2% |
| Haiku 4.5 | 75.0% | 97.5% | 97.5% | 90.0% |
| DeepSeek-chat | 82.5% | 82.5% | 100% | 88.3% |
| Gemini 3.1 Flash Lite | 77.5% | 100% | 85.0% | 87.5% |
| GPT-4o-mini | 72.5% | 100% | 72.5% | 81.7% |
$0.04 vs $4.16
across 120 runs — 100× cheaper at identical honest success rate.
The reason is mechanical. NAC3 frontloads the entire interactive surface into one prompt. The model needs to plan in one shot, not maintain state across tool-call turns. Cheap models are good at "given this complete picture, emit JSON" but bad at "remember 5 tool calls ago and decide the next one."
MCP demands sequential planning; cheap models fall over. NAC3 demands one-shot emission with full context; cheap models thrive. Frontier capability is wasted when the protocol is unconstrained — and rescued when the protocol is structured.
If your bar is 100% reliability — not 90%, not 95%, but every run lands — the question becomes: what's the cheapest model that gets you there, per protocol? The answer is the single most decision-relevant table in this dataset.
| Protocol | Cheapest model at 100% | Best result achieved | Cost / 120 runs |
|---|---|---|---|
| NAC3 | Gemini 3.1 Flash Lite | 40/40 (100%) | ~$0.04 |
| Raw DOM | DeepSeek-chat (only 100% cheap option) | 40/40 (100%) — but 31k tok/run | ~$1.40 |
| MCP | — none reached 100% — | 85% (Sonnet 4.6, best) | n/a |
In this benchmark, no model under MCP reached 100% on the four-task suite. The best was Sonnet 4.6 at 85%. NAC3 reached perfect reliability with the cheapest model in the roster, at four cents per 120 runs.
The honest caveat: "MCP didn't reach 100%" is true in this benchmark, with this roster, on these four tasks — not a claim that MCP can never reach 100%. Simpler tasks, or models stronger at multi-turn tool orchestration, likely would. What the data shows is narrower and still striking: across five models and four tasks, the multi-turn coordination MCP demands left a reliability gap that NAC3's single-turn surface did not have.
For a team choosing a protocol with a hard reliability requirement, this is the number that matters. NAC3 + Gemini Flash Lite is the cheapest path to perfect reliability in the dataset — by two orders of magnitude over the next option, and against an MCP baseline that didn't reach the bar at all.
This benchmark measures one consumer of the NAC3 surface: LLM agents. But the same semantic decoration serves three others by design — and this is the structural argument that places NAC3 in a category MCP doesn't enter.
nac_id survives CSS changes and DOM reorders. Fragile selectors (xpath, classes) are the #1 cause of E2E flakiness. NAC3 gives stable selectors by contract.click_by_verb('invoice','cancel') is readable, maintainable, and doesn't depend on screen coordinates or OCR.data-nac-role and a11y-hint tree that serves the agent serves the screen reader. NAC3 extends ARIA — it doesn't compete with it.This is the positioning that matters. MCP is a backend protocol — no UI surface, no DOM, no visual semantics. It cannot serve testing, RPA, or assistive tech. Raw DOM offers no stability guarantee for any of them. NAC3 is the semantic contract layer for any non-human consumer of a UI: agents today, but tests, RPA, and assistive technology on the same decoration, with one stable contract.
The accessibility angle deserves its own weight. A company adopting NAC3 for its agents gets better accessibility as a byproduct — and in many jurisdictions (the European Accessibility Act, the ADA in the US) that's a legal obligation, not a nice-to-have. One decoration discharges an agent strategy and an accessibility strategy at once.
We benchmarked one of the four. The testing, RPA, and accessibility claims are design theses, not measured results — we haven't run those benchmarks yet. But the structural argument stands on its own: one decoration, four consumers, one contract. That's the category. MCP isn't in it.
70 of 600 runs failed (11.7%). The patterns are mostly model-specific weaknesses, not protocol failures. Three are worth naming.
email="compras@acme.com"
when the prompt explicitly says contacto@acme.com.
The model confuses cached customer data with the data requested. NAC3 same task: 10/10.consultoria
when the postcondition checks /consultor/i — a regex match issue.
The only systematic NAC3 weakness in the corpus."Nueva linea" (the field placeholder) into the field
instead of "monitor" (the value). Raw HTML context confuses
the model about which string is data and which is UI scaffolding. NAC3 same task: 10/10.All 30 Raw DOM silent-damage cases came from T_MCP4 — the compound task. Every model had at least 4 silent-damage runs in raw T_MCP4. The pattern: the model emits a plausible selector chain, the side-effect of early actions creates DOM state that accidentally satisfies later assertions, the postcondition smiles and signs off.
We're not selling a winner. We're naming three trade-offs and the conditions where each one wins.
NAC3 v2.4 does not sell zero hallucinations in production. The runtime cannot prevent a model from emitting a semantically wrong plan composed of all-valid nac_ids. What the runtime does sell — and what this dataset confirms at N=200 per condition — is that structural phantom selectors are filtered before reaching the DOM. The worst class of CI failure is structurally impossible. That's the contract.
# Clone, install, configure .env with API keys for: # ANTHROPIC, OPENAI, DEEPSEEK, GOOGLE git clone https://github.com/yujinapp/nac3-bench cd nac3-bench/bench npm install bash scripts/run_final_600.sh node scripts/unified_report.mjs --since=2026-05-19T23 \ > docs/N10_FINAL_REPORT.md