LLM Benchmarks

Same prompt. Different models. Real results, side-by-side.

FILTER

METHOD

Each benchmark gives two models the identical prompt in a fresh chat, in Agent mode, with no follow-ups. Output is a single self-contained index.html dropped into its own folder. We capture duration, tool calls, tokens, cost, Lighthouse, and a human rubric — then publish the actual rendered result side-by-side.

Entrant 01 Claude Opus 4.7

Entrant 02 GPT-5.5

Page

App Settings Page

Sidebar nav, profile fields, five toggles, segmented theme control, sticky save bar. The full SaaS settings UI — built one-shot at Extra High, no follow-ups.

Trade-off: Opus deeper a11y · GPT-5.5 ~4× faster & cheaper

Entrant 01 Claude Opus 4.7

Entrant 02 GPT Codex 5.3

Layout

AI Tool Pricing Section

Three tiers, monthly/annual toggle, dark/light theme, full keyboard a11y. The kind of section that ships on a real product page — built one-shot, no follow-ups.

Trade-off: Opus higher polish · Codex leaner & cheaper