Layout · 2026-04-21

AI Tool Pricing Section

Polish pick: Claude Opus 4.7. Faster build, higher human-rubric score, and a slight edge on Lighthouse a11y.

Efficiency pick: GPT Codex 5.3. 45% smaller bundle and cleaner HTML validation.

The Prompt

Identical for both models

Verbatim — pasted into a fresh chat

Build a pricing section for a fictional AI coding tool called "Stack". It should look and feel like a real product page section that could ship today.

Requirements (all must be present):
1. Three pricing tiers: Hobby, Pro, Team.
   - Hobby: $0 / month
   - Pro: $20 / month
   - Team: $40 / user / month
2. A Monthly ⇄ Annual billing toggle. When set to Annual, prices show roughly 20% off (round to whole dollars), and an "ANNUAL — SAVE 20%" indicator is visible.
3. The Pro tier is highlighted as "MOST POPULAR" with a clearly different visual treatment (border, badge, scale — your call).
4. Each tier shows a feature comparison list: 6+ features per tier, with checkmarks (✓) for included features and dashes (—) for not-included features. Vary which features each tier includes so the comparison is meaningful.
5. Each tier has a clear primary CTA button (e.g., "Start free", "Start Pro trial", "Contact sales").
6. A Dark / Light mode toggle in the section header that actually works. The whole pricing section must look polished in BOTH themes.
7. Responsive: the layout must work at mobile (375px), tablet (768px), and desktop (1280px+). On mobile the three tiers should stack and stay readable.
8. Accessible:
   - Toggles must be reachable and operable by keyboard.
   - Use appropriate ARIA where it adds value (e.g., aria-pressed, aria-label, role="group" for the toggle group).
   - Honor prefers-reduced-motion — no large animations for users who disable motion.
   - Color contrast must meet WCAG AA in both themes.
9. Visually polished: thoughtful typography, spacing, hierarchy, hover states, micro-interactions where they help.

Output:
- A single self-contained index.html file.
- All CSS inline in <style>. All JavaScript inline in <script>. No external dependencies, no CDN, no images.
- The file must render correctly when opened directly in a browser (file://) or embedded in an iframe at any of the three viewport widths above.

Live Result

The actual rendered output, side-by-side

[Embedded live previews: Opus 4.7 and Codex 5.3, rendered at a 1280 px viewport.]

The Code

Raw source as written by each model

single self-contained index.html


Scorecard

The metrics behind what you just saw

Duration

Opus 4.7: 1m 18s
Codex 5.3: 5m 24s

Wall clock from prompt submit to last file write.

Bundle size

Opus 4.7: 27.4 KB
Codex 5.3: 15.1 KB

Bytes of the produced index.html.
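The "45% smaller bundle" headline follows from the two bundle sizes reported for this metric. A quick check of the arithmetic, using only those figures:

```python
# Bundle sizes reported in the scorecard, in KB.
opus_kb = 27.4
codex_kb = 15.1

# Relative reduction of the Codex bundle versus the Opus bundle.
reduction = (opus_kb - codex_kb) / opus_kb

print(f"Codex bundle is {reduction:.1%} smaller")  # about 44.9%, quoted as 45%
```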

Lines of code

Opus 4.7: 721
Codex 5.3: 478

Total lines including blank lines.

Inline JS

Opus 4.7: 2.4 KB
Codex 5.3: 1.6 KB

Bytes inside <script> tags.

Inline CSS

Opus 4.7: 11.6 KB
Codex 5.3: 7.6 KB

Bytes inside <style> tags.
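The page does not say how the inline JS and CSS byte counts were collected. One plausible way to reproduce them is to sum the UTF-8 bytes of the text inside `<script>` and `<style>` elements. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical and this is not the site's actual tooling.

```python
from html.parser import HTMLParser


class InlineSizeParser(HTMLParser):
    """Sums UTF-8 bytes of the text found inside <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None  # name of the tracked element we are inside, if any
        self.sizes = {"script": 0, "style": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.sizes:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # May be called several times per element, so accumulate.
        if self.current:
            self.sizes[self.current] += len(data.encode("utf-8"))


def inline_sizes(html: str) -> dict:
    """Return byte counts for inline script and style content."""
    parser = InlineSizeParser()
    parser.feed(html)
    parser.close()
    return parser.sizes


sample = "<style>a{color:red}</style><script>let x = 1;</script>"
print(inline_sizes(sample))
```

Pointing `inline_sizes` at the contents of each model's index.html would give figures comparable to the two cards above, although whitespace handling may differ from whatever the site used.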

HTML errors

Opus 4.7: 1
Codex 5.3: 0

W3C Nu HTML Checker error count.

ARIA attributes

Opus 4.7: 48
Codex 5.3: 38

Total aria-* attribute usages — proxy for a11y attention.
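A count like this can be approximated by walking every start tag and tallying attributes whose name begins with `aria-`. The following is a minimal sketch under that assumption; `AriaCounter` and `count_aria_attributes` are hypothetical names, and the site's own counting method is not documented here.

```python
from html.parser import HTMLParser


class AriaCounter(HTMLParser):
    """Counts aria-* attributes across all start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so a prefix check is enough.
        self.count += sum(1 for name, _ in attrs if name.startswith("aria-"))


def count_aria_attributes(html: str) -> int:
    counter = AriaCounter()
    counter.feed(html)
    counter.close()
    return counter.count


sample = (
    '<div role="group" aria-label="Billing period">'
    '<button aria-pressed="true">Monthly</button>'
    '<button aria-pressed="false">Annual</button>'
    "</div>"
)
print(count_aria_attributes(sample))  # role="group" is not an aria-* attribute
```

A raw count says nothing about whether the attributes are used correctly, which is why the card calls it a proxy.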

Spec adherence

Opus 4.7: 12/12
Codex 5.3: 12/12

Twelve automated structural checks covering the 9 prompt requirements.

Lighthouse a11y

Opus 4.7: 98/100
Codex 5.3: 94/100

Both models scored Performance 100, Best Practices 100, and SEO 82 in the other Lighthouse categories.

Cost (USD)

Opus 4.7: $0.518
Codex 5.3: $0.102

Captured live from Cursor at run time: 9.4k vs 17.9k total tokens.
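Opus cost about five times as much while using roughly half the tokens, which implies a very different effective rate per token. The sketch below works that out from the figures on this card. It assumes the token counts are listed in the same model order as the costs (9.4k for Opus, 17.9k for Codex), and the token totals are rounded, so the results are approximate.

```python
# Figures from the Cost card; token totals are rounded to the nearest 0.1k.
runs = {
    "Opus 4.7": {"cost_usd": 0.518, "tokens": 9_400},
    "Codex 5.3": {"cost_usd": 0.102, "tokens": 17_900},
}


def cost_per_1k_tokens(cost_usd: float, tokens: int) -> float:
    """Blended cost per 1,000 tokens (input and output combined)."""
    return cost_usd / tokens * 1000


rates = {
    name: cost_per_1k_tokens(run["cost_usd"], run["tokens"])
    for name, run in runs.items()
}

for name, rate in rates.items():
    print(f"{name}: ${rate:.4f} per 1k tokens")
```

This is a blended rate over total tokens, not a published price: the page does not break the totals into input and output tokens.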

Reviewer Notes

Each model reviewed its own output

Claude Opus 4.7

Polished build

Visual polish: 9 / 10
Code quality: 9 / 10
Spec adherence: 10 / 10
Would you ship it?: Yes

One-shot, all nine requirements met. Three tiers, working monthly/annual toggle with the 20% rounded to whole dollars, "MOST POPULAR" treatment via scale + accent border + badge, six features per tier with mixed inclusion (✓ / —), distinct CTAs, working theme toggle that persists via data-theme, and a layout that holds at 375 / 768 / 1280 without horizontal scroll.

Accessibility was taken seriously, not bolted on. Toggle groups use role="group", the billing/theme buttons use aria-pressed rather than fake checkboxes, focus-visible rings are styled (not removed), and a prefers-reduced-motion media query disables the larger micro-interactions. Color contrast clears WCAG AA in both themes.

The code reads like a hand-written component. Properties are organized by concern, custom properties handle theme switching cleanly (no JS-driven style writes), and the JS is ~30 lines doing exactly what it needs to. Light/dark is genuinely polished — not just "swap two colors."

Sliding indicator animates between toggle states; honors reduced-motion.
Pro tier uses translateY + scale instead of a heavier-weight box-shadow stack.
Annual savings indicator switches its aria-live announcement when toggled.
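The annual pricing rule from the prompt (20% off, rounded to whole dollars) is simple enough to check by hand. The sketch below is an illustration of that rule in Python, not the code either model wrote; the names are hypothetical.

```python
# Monthly prices from the prompt, in whole dollars.
MONTHLY_PRICES = {"Hobby": 0, "Pro": 20, "Team": 40}
ANNUAL_DISCOUNT_PERCENT = 20


def annual_price(monthly: int) -> int:
    """Discounted monthly-equivalent price, rounded half up to whole dollars."""
    # Integer arithmetic avoids floating-point surprises when rounding.
    return (monthly * (100 - ANNUAL_DISCOUNT_PERCENT) + 50) // 100


annual_prices = {tier: annual_price(p) for tier, p in MONTHLY_PRICES.items()}
print(annual_prices)  # {'Hobby': 0, 'Pro': 16, 'Team': 32}
```

With these three tiers the discount lands on whole dollars already, so the rounding step would only matter if the tier prices changed.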

GPT Codex 5.3

Lean artifact

Visual polish: 8 / 10
Code quality: 8 / 10
Spec adherence: 10 / 10
Would you ship it?: With minor edits

Codex shipped all nine requirements in one pass. The output includes all three tiers, a working monthly/annual switch with rounded 20% annual pricing and a visible "Annual — Save 20%" badge, a distinct Pro card treatment, and mixed per-tier feature inclusion with clear ✓ / — markers.

Accessibility and footprint are the differentiators. Theme and billing controls are semantic buttons with aria-pressed, grouped labels, and focus-visible treatments; reduced motion is honored globally. The self-contained bundle lands at 15.1 KB.

Minor tradeoffs remain in visual finesse. Compared with Opus, typography rhythm and spacing hierarchy are a bit denser, and interaction affordances are simpler. Still, this is a shippable first draft with low cleanup cost.

Dark/light and monthly/annual toggles are keyboard operable with clear pressed states.
Pro card uses dedicated border + badge + lift to satisfy "MOST POPULAR".
No external assets, external scripts, or dependencies; one portable file.
Reviewer flag: in dark mode the Pro CTA is white text on a light-blue accent (~2.4:1 contrast), which fails WCAG AA (minimum 4.5:1 for normal text, 3:1 for large text). It is an easy fix, and it explains the "With minor edits" rating.
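A contrast flag like this can be checked with the WCAG relative-luminance formula. The sketch below implements that formula. The accent color `#60a5fa` is a hypothetical stand-in for "light blue", not the value from the Codex output, so the printed ratio will not match the ~2.4:1 reported above exactly.

```python
AA_NORMAL_TEXT = 4.5  # WCAG AA minimum contrast for normal-size text


def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of a #rrggbb color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1.0 up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical colors: white CTA text on an assumed light-blue accent.
ratio = contrast_ratio("#ffffff", "#60a5fa")
print(f"{ratio:.2f}:1, passes AA: {ratio >= AA_NORMAL_TEXT}")
```

Darkening the accent or switching the CTA text to a dark color are the usual ways to bring such a pairing above 4.5:1.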

Methodology

Every benchmark on this site follows the same rules. The point is to measure the model's first instinct under realistic conditions, not its ceiling after coaching.

Same prompt, verbatim. Pasted into a fresh chat. No system-prompt edits, no rules pre-loaded, no skills attached.
Single shot. One prompt in, one response out. No follow-ups, no "fix the toggle" iterations. Whatever ships first is what gets scored.
Agent mode. Cursor's Agent mode is on. The model may use any tool calls it wants; tool-call count is a metric, not a constraint.
Same starting state. Output goes into an empty benchmarks/<slug>/<model-slug>/ directory. No scaffolding, no example file to imitate.
Self-contained output. All HTML, CSS, and JS must live inside the index.html the model produces. No CDN, no external assets, no images.
Captured immediately. Duration, tool calls, tokens in/out, and cost are recorded the moment the run completes — closing the chat in Cursor wipes those numbers.