LLM Benchmarks
Same prompt. Different models. Real results, side by side.
Each benchmark gives two models the identical prompt in a fresh chat, in Agent mode, with no follow-ups. Each model's output is a single self-contained index.html saved to its own folder. We capture duration, tool calls, tokens, cost, Lighthouse scores, and a human rubric score, then publish the actual rendered results side by side.
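To make the captured metrics concrete, here is a minimal sketch of how one run per model might be recorded and compared. The `RunRecord` shape, its field names and units, and the `compareRuns` helper are all hypothetical and are not the site's actual schema. The numbers are placeholders, not measured results.

```typescript
// Hypothetical record for one model's run of one benchmark prompt.
// Field names and units are assumptions made for this sketch.
interface RunRecord {
  model: string;
  durationSec: number; // wall-clock time for the one-shot run
  toolCalls: number;   // number of tool invocations in Agent mode
  tokens: number;      // total tokens consumed
  costUsd: number;     // total cost of the run
  lighthouse: number;  // Lighthouse score (scale assumed to be 0-100)
  rubric: number;      // human rubric score (scale assumed)
}

interface Comparison {
  fasterModel: string;
  speedRatio: number;   // slower duration divided by faster duration
  cheaperModel: string;
  costRatio: number;    // higher cost divided by lower cost
  rubricLeader: string; // model with the higher rubric score, or "tie"
}

// Summarize the trade-off between two runs of the same prompt.
function compareRuns(a: RunRecord, b: RunRecord): Comparison {
  const [fast, slow] = a.durationSec <= b.durationSec ? [a, b] : [b, a];
  const [cheap, pricey] = a.costUsd <= b.costUsd ? [a, b] : [b, a];
  const rubricLeader =
    a.rubric === b.rubric ? "tie" : a.rubric > b.rubric ? a.model : b.model;
  return {
    fasterModel: fast.model,
    speedRatio: slow.durationSec / fast.durationSec,
    cheaperModel: cheap.model,
    costRatio: pricey.costUsd / cheap.costUsd,
    rubricLeader,
  };
}

// Placeholder values for illustration only; these are not benchmark results.
const modelA: RunRecord = {
  model: "model-a", durationSec: 120, toolCalls: 8,
  tokens: 40000, costUsd: 1.0, lighthouse: 95, rubric: 9,
};
const modelB: RunRecord = {
  model: "model-b", durationSec: 30, toolCalls: 3,
  tokens: 12000, costUsd: 0.25, lighthouse: 92, rubric: 8,
};

const summary = compareRuns(modelA, modelB);
console.log(summary);
```

Keeping the raw metrics in one record per run and deriving ratios only at comparison time means a one-line trade-off summary can always be traced back to the numbers behind it.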
App Settings Page
Sidebar nav, profile fields, five toggles, segmented theme control, sticky save bar. The full SaaS settings UI — built one-shot at Extra High, no follow-ups.
Trade-off: Opus deeper a11y · GPT-5.5 ~4× faster & cheaper
AI Tool Pricing Section
Three tiers, monthly/annual toggle, dark/light theme, full keyboard a11y. The kind of section that ships on a real product page — built one-shot, no follow-ups.
Trade-off: Opus higher polish · Codex leaner & cheaper