The five contenders.
We're past the era where one model was best at everything. Frontier labs ship monthly, the tiers leapfrog each other, and your editor probably exposes 8+ options. Of those, five do real work for web devs in 2026. The rest you can ignore until they actually beat one of these.
| Model | Strongest at | Weakest at | Tier |
|---|---|---|---|
| Claude Opus 4.7 | Multi-file refactors, "explain this codebase," debugging gnarly state. | Cost. Latency. Patience with simple work. | Premium |
| Claude Sonnet 4.6 | The daily driver. Good at almost everything, great at code review and writing. | Long autonomous runs (drifts after ~25 tool calls). | Mid |
| GPT-5.3 Codex | Long autonomous loops, agentic flows, "go finish this and come back." | Subtle taste questions. Sometimes too literal. | High |
| Composer 2.5 | Fast edits, refactors you've already planned, in-editor speed. | Open-ended exploration. Long-context reasoning. | Cheap |
| Gemini 3.1 Pro | Huge contexts — pasting a whole monorepo, video frames, PDFs. | Tool-use accuracy on long agentic runs. | High |
Notably absent: Haiku 4 and GPT-5.5 Mini. Both are excellent for specific tasks (high-volume background classification, structured data extraction) but rarely the right pick for interactive web-dev coding in 2026. If you're using them as your main coder, you're saving pennies and burning hours.
What you actually pay.
Per-million-token pricing is mostly noise — the number that matters is "cost per hour of actual coding." A model that's 3× more expensive per token but produces correct code in one shot is cheaper than one that needs three retries. Still, ballpark price-per-million is useful for sanity checks:
| Model | Input / 1M | Cached input / 1M | Output / 1M |
|---|---|---|---|
| Opus 4.7 | $15.00 | $1.50 | $75.00 |
| Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| GPT-5.3 Codex | $5.00 | $0.50 | $20.00 |
| Composer 2.5 | $1.20 | $0.12 | $6.00 |
| Gemini 3.1 Pro | $2.50 | $0.25 | $10.00 |
Notice that cached input costs a tenth of fresh input in every row. Cursor, Claude Code, and Codex all cache aggressively by default; if you use a different editor, check its caching setting. Caching is the single biggest cost lever after model choice.
A typical eight-hour coding day on Sonnet 4.6 with healthy caching runs roughly $4–8. On Opus 4.7 it's closer to $20–35. On Composer 2.5 it's under $2. None of these is crazy money for a working dev; the discipline is matching the price to the task.
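To make the arithmetic concrete, here is a minimal sketch of a daily cost estimate. The prices are copied from the table above; the token volumes (6M input, 150k output) and the 90% cache hit rate are assumptions chosen for illustration, not measurements, and `day_cost` and `cost_until_correct` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, copied from the pricing table above."""
    fresh_input: float
    cached_input: float
    output: float


PRICES = {
    "Opus 4.7": Price(15.00, 1.50, 75.00),
    "Sonnet 4.6": Price(3.00, 0.30, 15.00),
    "GPT-5.3 Codex": Price(5.00, 0.50, 20.00),
    "Composer 2.5": Price(1.20, 0.12, 6.00),
    "Gemini 3.1 Pro": Price(2.50, 0.25, 10.00),
}


def day_cost(model: str, input_tokens: int, output_tokens: int,
             cache_hit_rate: float) -> float:
    """Estimate one day's spend in USD, rounded to cents.

    cache_hit_rate is the fraction of input tokens billed at the cached rate.
    """
    p = PRICES[model]
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    usd = (fresh * p.fresh_input
           + cached * p.cached_input
           + output_tokens * p.output) / 1_000_000
    return round(usd, 2)


def cost_until_correct(cost_per_attempt: float, retries: int) -> float:
    """Total cost of a task that succeeds after the given number of retries."""
    return cost_per_attempt * (1 + retries)


# Assumed workload: 6M input tokens, 150k output tokens per day.
for name in ("Sonnet 4.6", "Opus 4.7"):
    with_cache = day_cost(name, 6_000_000, 150_000, cache_hit_rate=0.9)
    no_cache = day_cost(name, 6_000_000, 150_000, cache_hit_rate=0.0)
    print(f"{name}: ${with_cache:.2f} cached, ${no_cache:.2f} uncached")
```

Under these assumed volumes, Sonnet 4.6 comes to about $5.67 with caching and $20.25 without, which is why the cache setting matters so much. The retry point works the same way: `cost_until_correct(3.0, retries=0)` is 3.0, while `cost_until_correct(1.0, retries=3)` is 4.0, so the model that costs 3× per attempt still wins if it lands the change in one shot.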
Match model to task.
| Task | Pick | Why |
|---|---|---|
| Rename a symbol across 60 files | Composer 2.5 | Mechanical. Speed and cost dominate. |
| "Explain how auth works in this app" | Sonnet 4.6 | Synthesis across files; doesn't need Opus's depth. |
| Migrate Pages Router → App Router | Opus 4.7 | Multi-file reasoning under structural constraints. |
| Build a feature in a worktree, unattended | GPT-5.3 Codex | Best at long autonomous loops without drift. |
| Code review on a PR | Sonnet 4.6 | Excellent at catching subtle issues without being precious. |
| Debug a Heisenbug in production code | Opus 4.7 | Holds a hypothesis across many turns better than the others. |
| Write tests for an existing module | Sonnet 4.6 | Knows test patterns; doesn't over-engineer. |
| Paste an entire monorepo to ask one question | Gemini 3.1 Pro | 2M-token context, holds detail across the whole tree. |
| Convert a Figma frame to JSX | Sonnet 4.6 or GPT-5.3 Codex | Either is fine; pick whichever your editor's Figma MCP integrates with. |
| Fix a one-line typo in a function | Composer 2.5 | Anything fancier is wasted latency. |
Defaulting to Opus for everything is tempting and lazy. You pay 5× Sonnet's rate and wait 3× as long for tasks Sonnet would have nailed on the first try. Reserve Opus for the cases where its depth actually shows up in the diff.
Keep a fallback in mind for every pick: the model to switch to if the recommended one stumbles twice in a row. A third attempt on the same model is usually the wrong instinct; try a different one.
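The fallback rule can be sketched as a small function. This is an illustration only: `next_model` is a hypothetical helper, and the Sonnet-to-Opus pairing in the usage lines is an assumed example, not a recommendation from the table.

```python
def next_model(recommended: str, fallback: str,
               recent_outcomes: list[bool]) -> str:
    """Pick the model for the next attempt.

    recent_outcomes holds one entry per attempt on the recommended model,
    oldest first: True for an acceptable result, False for a stumble.
    Two stumbles in a row trigger the switch to the fallback.
    """
    last_two = recent_outcomes[-2:]
    if len(last_two) == 2 and not any(last_two):
        return fallback
    return recommended


# One stumble: stay on the recommended model.
print(next_model("Sonnet 4.6", "Opus 4.7", [False]))
# Two stumbles in a row: switch instead of retrying a third time.
print(next_model("Sonnet 4.6", "Opus 4.7", [True, False, False]))
```

A success between two stumbles resets the streak, so only consecutive failures cause a switch.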
When to trust auto-mode.
Cursor's auto-mode and similar features pick the model per request. They're surprisingly good once they have signal, surprisingly bad until then. A simple rule:
- First 50 requests in a new repo: pick manually. The router has no idea what's hard here yet.
- Routine ongoing work: auto-mode is fine and saves money — it'll skew toward Sonnet/Composer for most things.
- Big refactor or architectural question: override auto-mode. Pick Opus 4.7 or GPT-5.3 Codex manually. Auto rarely reaches for the premium tier on its own.
- Background / autonomous agents: never auto. Pick the model you'd trust for an hour unattended.
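The four rules above reduce to a short decision function. This is a sketch under assumptions: the `Task` categories and the `use_auto_mode` name are hypothetical, and the function is not tied to any editor's actual settings or API.

```python
from enum import Enum


class Task(Enum):
    ROUTINE = "routine ongoing work"
    BIG_REFACTOR = "big refactor or architectural question"
    BACKGROUND_AGENT = "background or autonomous agent"


def use_auto_mode(requests_so_far: int, task: Task) -> bool:
    """Return True when auto-mode is a reasonable choice for the next request.

    requests_so_far counts requests already made in this repo.
    """
    if task is Task.BACKGROUND_AGENT:
        return False  # never auto for unattended runs
    if task is Task.BIG_REFACTOR:
        return False  # pick a premium model manually
    if requests_so_far < 50:
        return False  # first 50 requests: the router has no signal yet
    return True


print(use_auto_mode(10, Task.ROUTINE))       # new repo: pick manually
print(use_auto_mode(200, Task.ROUTINE))      # established repo: auto is fine
print(use_auto_mode(200, Task.BIG_REFACTOR)) # override auto
```

The task checks come before the request count because the unattended and big-refactor rules apply no matter how much signal the router has.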
Common pitfalls.
Benchmarks such as SWE-Bench, HumanEval, and MBPP correlate with coding ability, but they are not your codebase. The most predictive benchmark is "try the model on your last three PRs and see which one's diff you'd merge."
Models leapfrog quarterly. A model you chose six months ago may be middle of the pack now. Re-test your defaults every quarter — 30 minutes of side-by-side comparison can change which model gets your $200/month of token spend.
If a model is stuck, switching models in the same context is almost never the fix — the bad context follows. Start a fresh chat with the new model. The old chat was the problem, not the model.