An empirical comparison of agentic UI protocols

Cheap models become capable when given structure.

Across 600 runs spanning five language models and three protocols, NAC3's shared-surface manifest delivered 95.5% success at zero phantom-selector silent damage — while Raw DOM passed CI on 15% of runs with broken selectors, and MCP needed 3.5× more round-trips to reach the same answer.

5 models: Sonnet 4.6 · Haiku 4.5 · DeepSeek · Gemini 3.1 Flash Lite · GPT-4o-mini

3 protocols: NAC3 · MCP · Raw DOM

4 tasks: atomic · exploratory · compound · async-handler

10 iter / cell: N = 600 total runs

NAC3 success

95.5%

191 of 200 runs · vs 91% Raw · vs 78.5% MCP

NAC3 silent damage

0/200

isActionSafe filter caught every phantom selector

Round-trips per task

1.0

vs 3.5 turns for MCP · same dispatch model

Total cost

$15

600 runs in 1h 23min · fully reproducible

Raw DOM silent damage

15%

30 of 200 runs · CI passed, phantom selectors emitted

Latency end-to-end

3.1s

NAC3 mean · vs 5.5s MCP · vs 7.4s Raw

Cheap-model unlock

100%

Gemini Flash Lite + GPT-4o-mini under NAC3

Models tested

5×4

models × tasks per protocol · 600 runs · 0 parse errors

§ 01

The hero metric: silent damage.

A run shows silent damage when the model emitted invented selectors, but the postcondition still passed — by coincidence, default behavior, or partial side-effects. This is the worst failure mode in production CI: tests pass; reality breaks.

NAC3

0 / 200 runs

isActionSafe filters every phantom nac_id before dispatch. The runtime is the gate; the validator is the lock.

MCP

0 / 200 runs

Closed tool catalog. The model cannot invent a tool that doesn't exist. Protection by enumeration.

Raw DOM

15%

30 / 200 runs

No closed surface. Phantom selectors land on coincident DOM nodes; defaults paper over broken paths. CI passes; runtime breaks.

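Condition	N	No phantom reached DOM (none)	Phantom reached DOM, postcondition failed (reached_failed)	Phantom reached DOM, postcondition passed (reached_succeeded)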
NAC3	200	200	0	0
MCP	200	200	0	0
Raw DOM	200	162	8	30
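
The three outcome columns above follow from two facts about each run: whether a phantom selector reached the DOM, and whether the postcondition passed. The TypeScript sketch below assumes a hypothetical `RunRecord` shape (the benchmark's real log schema is not shown in this report) and rebuilds the Raw DOM row from synthetic records. The 152/10 split of the phantom-free runs is derived from the 91% success figure, not reported directly.

```typescript
// Hypothetical per-run record; field names are illustrative assumptions.
interface RunRecord {
  phantomReachedDom: boolean;   // an invented selector was dispatched to the DOM
  postconditionPassed: boolean; // the task's final assertion passed
}

type DamageClass = "none" | "reached_failed" | "reached_succeeded";

function classifyRun(run: RunRecord): DamageClass {
  if (!run.phantomReachedDom) return "none";
  // A phantom reached the DOM; a passing postcondition here is silent damage.
  return run.postconditionPassed ? "reached_succeeded" : "reached_failed";
}

function tally(runs: RunRecord[]): Record<DamageClass, number> {
  const counts: Record<DamageClass, number> = {
    none: 0,
    reached_failed: 0,
    reached_succeeded: 0,
  };
  for (const run of runs) counts[classifyRun(run)] += 1;
  return counts;
}

// Helper: n copies of the same synthetic record.
function repeat(n: number, run: RunRecord): RunRecord[] {
  const out: RunRecord[] = [];
  for (let i = 0; i < n; i++) out.push({ ...run });
  return out;
}

// Synthetic stand-ins for the 200 Raw DOM runs.
const rawDomRuns: RunRecord[] = repeat(152, { phantomReachedDom: false, postconditionPassed: true })
  .concat(repeat(10, { phantomReachedDom: false, postconditionPassed: false }))
  .concat(repeat(8, { phantomReachedDom: true, postconditionPassed: false }))
  .concat(repeat(30, { phantomReachedDom: true, postconditionPassed: true }));

const rawDomCounts = tally(rawDomRuns);
console.log(rawDomCounts); // none: 162, reached_failed: 8, reached_succeeded: 30
```

Only the `reached_succeeded` bucket counts as silent damage. A run in `reached_failed` is still a failure, but a visible one.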

§ 02

NAC3 wins on four axes simultaneously.

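No protocol won on every axis: MCP had the lowest total cost and the smallest prompts. NAC3 led or tied on four: highest success, lowest latency, fewest round-trips (tied with Raw DOM), and zero silent damage (tied with MCP), at a cost per honest success close to MCP's and about one-third of Raw DOM's.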

Protocol	Success	Tok in (mean)	Tok out (mean)	Turns	Latency LLM	Cost (200 runs)
NAC3	95.5%	11,395	236	1.0	3.1 s	$2.35
MCP	78.5%	7,712	413	3.5	5.5 s	$1.72
Raw DOM	91.0%	31,026	404	1.0	7.4 s	$5.93

NAC3 trades higher per-dispatch cost for 3.5× fewer round-trips than MCP and 1.8× lower latency (3.1 s vs 5.5 s); against Raw DOM the latency gap is 2.4× (3.1 s vs 7.4 s). What looks like near-equal cost on a balance sheet is a nearly 2× gap in user wait time.

Cost-per-honest-success tells a different story than cost-per-run. When phantom-selector runs (which would break in production) are excluded from Raw DOM's success column, its real cost lands at $0.039 per task ($5.93 over 152 honest successes), roughly 3.2× NAC3's $0.012 and 3.6× MCP's $0.011. The structural reduction in silent damage isn't a moral victory; it's a financial one.
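
The cost-per-honest-success arithmetic can be reproduced from the table in this section. The sketch below is a worked calculation, not the benchmark's reporting code. The success counts are derived from the published percentages (95.5%, 78.5%, and 91% of 200 runs), and the type and function names are illustrative.

```typescript
interface ProtocolTotals {
  label: string;
  successes: number;    // runs whose postcondition passed (success % × 200)
  silentDamage: number; // passing runs that relied on a phantom selector
  costUsd: number;      // total cost of the 200 runs
}

const protocolTotals: ProtocolTotals[] = [
  { label: "NAC3", successes: 191, silentDamage: 0, costUsd: 2.35 },
  { label: "MCP", successes: 157, silentDamage: 0, costUsd: 1.72 },
  { label: "Raw DOM", successes: 182, silentDamage: 30, costUsd: 5.93 },
];

// Divide total spend by the successes that did not depend on a phantom selector.
function costPerHonestSuccess(t: ProtocolTotals): number {
  const honest = t.successes - t.silentDamage;
  return t.costUsd / honest;
}

for (const t of protocolTotals) {
  console.log(t.label, "$" + costPerHonestSuccess(t).toFixed(3));
}
// NAC3 $0.012
// MCP $0.011
// Raw DOM $0.039
```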

§ 03

The cheap-model unlock.

The most counter-intuitive finding: under NAC3, the cheapest models in the roster match the frontier model under Raw DOM. Protocol structure is worth more than model capability for UI tasks.

Model	MCP	NAC3	Raw DOM	Total
Sonnet 4.6	85.0%	97.5%	100%	94.2%
Haiku 4.5	75.0%	97.5%	97.5%	90.0%
DeepSeek-chat	82.5%	82.5%	100%	88.3%
Gemini 3.1 Flash Lite	77.5%	100%	85.0%	87.5%
GPT-4o-mini	72.5%	100%	72.5%	81.7%
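
As a consistency check, the per-model rates in the table above can be converted back to success counts (each cell is 4 tasks × 10 iterations = 40 runs) and aggregated per protocol. The sketch below does that. The counts are derived from the percentages rather than taken from raw logs, and the names are illustrative.

```typescript
// Successes out of 40 runs per cell, derived from the percentages in the table.
interface ModelRow {
  model: string;
  mcp: number;
  nac3: number;
  rawDom: number;
}

const RUNS_PER_CELL = 40; // 4 tasks × 10 iterations

const modelRows: ModelRow[] = [
  { model: "Sonnet 4.6", mcp: 34, nac3: 39, rawDom: 40 },
  { model: "Haiku 4.5", mcp: 30, nac3: 39, rawDom: 39 },
  { model: "DeepSeek-chat", mcp: 33, nac3: 33, rawDom: 40 },
  { model: "Gemini 3.1 Flash Lite", mcp: 31, nac3: 40, rawDom: 34 },
  { model: "GPT-4o-mini", mcp: 29, nac3: 40, rawDom: 29 },
];

// Success rate in percent across all models for one protocol.
function protocolRate(successes: number[]): number {
  const total = successes.reduce((sum, s) => sum + s, 0);
  return (100 * total) / (successes.length * RUNS_PER_CELL);
}

const nac3Rate = protocolRate(modelRows.map((r) => r.nac3));     // 95.5
const mcpRate = protocolRate(modelRows.map((r) => r.mcp));       // 78.5
const rawDomRate = protocolRate(modelRows.map((r) => r.rawDom)); // 91

const totalSuccesses = modelRows.reduce((sum, r) => sum + r.mcp + r.nac3 + r.rawDom, 0);
const totalFailures = modelRows.length * 3 * RUNS_PER_CELL - totalSuccesses; // 70

console.log(nac3Rate, mcpRate, rawDomRate, totalFailures);
```

The aggregates match the headline figures: 95.5%, 78.5%, and 91% per protocol, with 70 failed runs out of 600.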

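Gemini 3.1 Flash Lite under NAC3 = 40/40 success, same as Sonnet 4.6 under Raw DOM (40/40). Same tasks. Same measured success. Different cost: $0.04 vs $4.16 across 120 runs, about 100× cheaper. Note that every model's Raw DOM total still includes silent-damage runs on the compound task, so Sonnet's honest rate under Raw DOM is lower than 40/40.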

The reason is mechanical. NAC3 frontloads the entire interactive surface into one prompt. The model needs to plan in one shot, not maintain state across tool-call turns. Cheap models are good at "given this complete picture, emit JSON" but bad at "remember 5 tool calls ago and decide the next one."

MCP demands sequential planning; cheap models fall over. NAC3 demands one-shot emission with full context; cheap models thrive. Frontier capability is wasted when the protocol is unconstrained — and rescued when the protocol is structured.
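
The one-shot pattern described above can be sketched as two small functions: one that packs the whole interactive surface and the task into a single prompt, and one that parses the model's single JSON reply into a plan. The manifest shape, the `nacId` values, and the function names are assumptions for illustration. NAC3's actual manifest schema and prompt format are not shown in this report.

```typescript
// Hypothetical manifest entry and action shapes; NAC3's real schema may differ.
interface ManifestEntry {
  nacId: string;
  role: string;
  verbs: string[];
}

interface PlannedAction {
  nacId: string;
  verb: string;
  value?: string;
}

// Put the whole interactive surface and the task into a single prompt.
function buildOneShotPrompt(task: string, manifest: ManifestEntry[]): string {
  const surfaceLines = manifest
    .map((e) => "- " + e.nacId + " (" + e.role + "): " + e.verbs.join(", "))
    .join("\n");
  return [
    "Interactive surface:",
    surfaceLines,
    "",
    "Task: " + task,
    'Reply with a JSON array of {"nacId", "verb", "value"?} objects.',
  ].join("\n");
}

// Parse the model's single reply into a typed plan, rejecting malformed output.
function parsePlan(modelOutput: string): PlannedAction[] {
  const parsed: unknown = JSON.parse(modelOutput);
  if (!Array.isArray(parsed)) throw new Error("plan must be a JSON array");
  const plan: PlannedAction[] = [];
  for (const item of parsed) {
    if (typeof item !== "object" || item === null) throw new Error("malformed action");
    const { nacId, verb, value } = item as Record<string, unknown>;
    if (typeof nacId !== "string" || typeof verb !== "string") {
      throw new Error("action needs string nacId and verb");
    }
    plan.push(typeof value === "string" ? { nacId, verb, value } : { nacId, verb });
  }
  return plan;
}

// Illustrative manifest and a hand-written stand-in for a model reply.
const demoManifest: ManifestEntry[] = [
  { nacId: "invoice.line.description", role: "textbox", verbs: ["fill"] },
  { nacId: "invoice.line.add", role: "button", verbs: ["click"] },
];
const oneShotPrompt = buildOneShotPrompt("Add a line item named monitor", demoManifest);
const demoPlan = parsePlan(
  '[{"nacId":"invoice.line.description","verb":"fill","value":"monitor"},' +
    '{"nacId":"invoice.line.add","verb":"click"}]'
);
console.log(oneShotPrompt);
console.log(demoPlan.length); // 2
```

Because the model sees everything up front and replies once, nothing has to be carried across turns. This is the property the text credits for the cheap models' results.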

§ 04

The cheapest path to perfect reliability.

If your bar is 100% reliability — not 90%, not 95%, but every run lands — the question becomes: what's the cheapest model that gets you there, per protocol? The answer is the single most decision-relevant table in this dataset.

Protocol	Cheapest model at 100%	Best result achieved	Cost / 120 runs
NAC3	Gemini 3.1 Flash Lite	40/40 (100%)	~$0.04
Raw DOM	DeepSeek-chat (only 100% cheap option)	40/40 (100%) — but 31k tok/run	~$1.40
MCP	— none reached 100% —	85% (Sonnet 4.6, best)	n/a

In this benchmark, no model under MCP reached 100% on the four-task suite. The best was Sonnet 4.6 at 85%. NAC3 reached perfect reliability with the cheapest model in the roster, at four cents per 120 runs.

The honest caveat: "MCP didn't reach 100%" is true in this benchmark, with this roster, on these four tasks — not a claim that MCP can never reach 100%. Simpler tasks, or models stronger at multi-turn tool orchestration, likely would. What the data shows is narrower and still striking: across five models and four tasks, the multi-turn coordination MCP demands left a reliability gap that NAC3's single-turn surface did not have.

For a team choosing a protocol with a hard reliability requirement, this is the number that matters. NAC3 + Gemini Flash Lite is the cheapest path to perfect reliability in the dataset: roughly 35× cheaper than the next option (DeepSeek-chat under Raw DOM at ~$1.40), and against an MCP baseline that didn't reach the bar at all.

§ 05

One decoration, four consumers.

This benchmark measures one consumer of the NAC3 surface: LLM agents. But the same semantic decoration serves three others by design — and this is the structural argument that places NAC3 in a category MCP doesn't enter.

Consumer 1 · measured

LLM agents

What this dataset benchmarks. 95.5% success, zero silent damage, one-turn dispatch. The headline result.

Consumer 2 · design thesis

E2E testing

A test targeting a semantic nac_id survives CSS changes and DOM reorders. Fragile selectors (XPath, CSS classes) are a common cause of E2E flakiness. NAC3 gives stable selectors by contract.

Consumer 3 · design thesis

RPA & deterministic automation

No LLM required. A script running click_by_verb('invoice','cancel') is readable, maintainable, and doesn't depend on screen coordinates or OCR.

Consumer 4 · design thesis

Assistive technology

A UI with rich NAC3 decoration is a UI with rich accessibility semantics. The same data-nac-role and a11y-hint tree that serves the agent serves the screen reader. NAC3 extends ARIA — it doesn't compete with it.

This is the positioning that matters. MCP is a backend protocol — no UI surface, no DOM, no visual semantics. It cannot serve testing, RPA, or assistive tech. Raw DOM offers no stability guarantee for any of them. NAC3 is the semantic contract layer for any non-human consumer of a UI: agents today, but tests, RPA, and assistive technology on the same decoration, with one stable contract.

The accessibility angle deserves its own weight. A company adopting NAC3 for its agents gets better accessibility as a byproduct — and in many jurisdictions (the European Accessibility Act, the ADA in the US) that's a legal obligation, not a nice-to-have. One decoration discharges an agent strategy and an accessibility strategy at once.

We benchmarked one of the four. The testing, RPA, and accessibility claims are design theses, not measured results — we haven't run those benchmarks yet. But the structural argument stands on its own: one decoration, four consumers, one contract. That's the category. MCP isn't in it.
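
To make the testing and RPA theses concrete, here is a minimal sketch of how a script could resolve an (entity, verb) pair such as ('invoice', 'cancel') to a stable selector without an LLM. Everything in it is an assumption for illustration: the `SurfaceEntry` shape, the `data-nac-id` attribute name, and the `selectorByVerb` helper are hypothetical and not taken from the NAC3 runtime.

```typescript
// Hypothetical manifest entry: which entity and verb a decorated element serves.
interface SurfaceEntry {
  nacId: string;
  entity: string;
  verb: string;
}

// Illustrative surface; real nac_id values would come from the app's manifest.
const surface: SurfaceEntry[] = [
  { nacId: "invoice.cancel", entity: "invoice", verb: "cancel" },
  { nacId: "invoice.save", entity: "invoice", verb: "save" },
];

// Resolve (entity, verb) to an attribute selector, failing loudly on
// zero or ambiguous matches instead of guessing.
function selectorByVerb(entries: SurfaceEntry[], entity: string, verb: string): string {
  const matches = entries.filter((e) => e.entity === entity && e.verb === verb);
  const match = matches[0];
  if (matches.length !== 1 || match === undefined) {
    throw new Error("expected exactly one match for " + entity + "/" + verb + ", got " + matches.length);
  }
  // Attribute name is an assumption; the report only names data-nac-role.
  return '[data-nac-id="' + match.nacId + '"]';
}

const cancelSelector = selectorByVerb(surface, "invoice", "cancel");
console.log(cancelSelector); // [data-nac-id="invoice.cancel"]
```

The returned string is an ordinary CSS attribute selector, so a test or RPA script could pass it to `document.querySelector` or a test runner's click helper. Because it depends only on the semantic id, CSS class changes and DOM reorders would not affect it.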

§ 06

Where models still break.

70 of 600 runs failed (11.7%). The patterns are mostly model-specific weaknesses, not protocol failures. Three are worth naming.

GPT-4o-mini · MCP · T_MCP4_create_invoice — 0/10 success. Emits email="compras@acme.com" when the prompt explicitly says contacto@acme.com. The model confuses cached customer data with the data requested. NAC3 same task: 10/10.
DeepSeek · NAC3 · T_MCP4_create_invoice — 3/10. Emits the word consultoria when the postcondition checks /consultor/i — a regex match issue. The only systematic NAC3 weakness in the corpus.
GPT-4o-mini · Raw DOM · T1_add_monitor — 0/10. Writes "Nueva linea" (the field placeholder) into the field instead of "monitor" (the value). Raw HTML context confuses the model about which string is data and which is UI scaffolding. NAC3 same task: 10/10.

All 30 Raw DOM silent-damage cases came from T_MCP4 — the compound task. Every model had at least 4 silent-damage runs in raw T_MCP4. The pattern: the model emits a plausible selector chain, the side-effect of early actions creates DOM state that accidentally satisfies later assertions, the postcondition smiles and signs off.

§ 07

What this means for production agents.

We're not selling a winner. We're naming four trade-offs and the conditions where each one wins.

For frontend tasks where user-perceived latency matters (in-app copilots, chatbots, dashboards): NAC3. 1.8× lower latency than MCP (2.4× lower than Raw DOM), 1 turn instead of 3.5, zero silent damage. The protocol pays its own rent.
For backend automation where tool-use is natural (CRM operations, file ops, API integrations): MCP is fine. The round-trip cost is amortized into batch processing. NAC3 has no advantage here.
For anything with a UI in production CI: do not use Raw DOM. 15% silent damage means 15% of your tests are lying about reality. That bill comes due in week 3 of production.
For cost-sensitive deployment: Gemini Flash Lite + NAC3. 100% success (40/40) in this benchmark at ~$0.04 per 120 runs. The cheapest model becomes capable when given structure.

NAC3 v2.4 does not sell zero hallucinations in production. The runtime cannot prevent a model from emitting a semantically wrong plan composed of all-valid nac_ids. What the runtime does sell — and what this dataset confirms at N=200 per condition — is that structural phantom selectors are filtered before reaching the DOM. The worst class of CI failure is structurally impossible. That's the contract.
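
The contract above can be illustrated with a minimal sketch of a pre-dispatch filter. The name `isActionSafe` comes from this report, but its signature and implementation here are assumptions: the sketch simply checks each action's nac_id against the ids in the manifest. The ids and the `filterPlan` helper are hypothetical.

```typescript
interface Action {
  nacId: string;
  verb: string;
}

// Assumed signature: an action is structurally safe only if its nac_id
// exists in the manifest. The real runtime check is not shown in the report.
function isActionSafe(action: Action, manifestIds: string[]): boolean {
  return manifestIds.indexOf(action.nacId) !== -1;
}

// Split a plan into actions that may be dispatched and phantoms that may not.
function filterPlan(plan: Action[], manifestIds: string[]): { safe: Action[]; rejected: Action[] } {
  const safe: Action[] = [];
  const rejected: Action[] = [];
  for (const action of plan) {
    (isActionSafe(action, manifestIds) ? safe : rejected).push(action);
  }
  return { safe, rejected };
}

// Hypothetical manifest ids for illustration.
const knownIds = ["invoice.line.add", "invoice.save", "invoice.cancel"];

// A plan containing one invented (phantom) id: it is rejected before dispatch.
const withPhantom = filterPlan(
  [
    { nacId: "invoice.line.add", verb: "click" },
    { nacId: "invoice.submit-btn", verb: "click" }, // not in the manifest
  ],
  knownIds
);
console.log(withPhantom.rejected.length); // 1

// The stated limitation: a semantically wrong plan made of valid ids passes.
const wrongButValid = filterPlan([{ nacId: "invoice.cancel", verb: "click" }], knownIds);
console.log(wrongButValid.rejected.length); // 0
```

The second call shows why the guarantee is structural only: cancelling an invoice when the task asked to save it uses a valid id, so a membership check cannot catch it.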

§ 08

What the dataset doesn't say.

One fixture. invoice-app, Spanish, single-page, no SSR. Doesn't cover marketplaces, real-time dashboards, or multi-step auth.
Four tasks. Atomic · exploratory · compound · async-handler. No payment flows, fuzzy search, or cross-page navigation.
Five models. No Opus 4.7 (excluded for ROI), no reasoning models at N=10 (validated at N=1 only), no open-weight large models.
One language. Spanish prompts + fixture. English may differ marginally.
Models are non-deterministic even at temperature=0. Anthropic and OpenAI do not guarantee bit-determinism.
Pricing as of 2026-05-19. Re-running on a different date will produce different cost numbers.
Region matters. Latency measured from Argentina → US-East endpoints.
NAC3 silent damage = 0 in this dataset, not in production. The runtime catches structural phantoms; models can still emit semantically wrong valid plans.

Reproducibility

Total cost: ~US$15 · Total time: ~1h 23min · API keys for 4 providers required

# Clone, install, configure .env with API keys for:
# ANTHROPIC, OPENAI, DEEPSEEK, GOOGLE

git clone https://github.com/yujinapp/nac3-bench
cd nac3-bench/bench
npm install
bash scripts/run_final_600.sh
node scripts/unified_report.mjs --since=2026-05-19T23 \
  > docs/N10_FINAL_REPORT.md

runtime_version: 2.2.1

bench_version: 0.1.0

manifest_checksum: 67db3eef3658

execution date: 2026-05-19
