
Picking a Model in 2026.

Five models do the work in 2026. Here's how to pick between them per task — and a live picker that does it for you.

On this page
  1. The five contenders
  2. What you actually pay
  3. Match model to task
  4. Live: pick for me
  5. When to trust auto-mode
  6. Common pitfalls
CH 01

The five contenders.

We're past the era where one model was best at everything. Frontier labs ship monthly, the tiers leapfrog each other, and your editor probably exposes 8+ options. Of those, five do real work for web devs in 2026. The rest you can ignore until they actually beat one of these.

| Model | Strongest at | Weakest at | Tier |
| --- | --- | --- | --- |
| Claude Opus 4.7 | Multi-file refactors, "explain this codebase," debugging gnarly state. | Cost. Latency. Patience with simple work. | Premium |
| Claude Sonnet 4.6 | The daily driver. Good at almost everything, great at code review and writing. | Long autonomous runs (drifts after ~25 tool calls). | Mid |
| GPT-5.3 Codex | Long autonomous loops, agentic flows, "go finish this and come back." | Subtle taste questions. Sometimes too literal. | High |
| Composer 2.5 | Fast edits, refactors you've already planned, in-editor speed. | Open-ended exploration. Long-context reasoning. | Cheap |
| Gemini 3.1 Pro | Huge contexts: pasting a whole monorepo, video frames, PDFs. | Tool-use accuracy on long agentic runs. | High |

Notably absent: Haiku 4 and GPT-5.5 Mini. Both are excellent for specific tasks (high-volume background classification, structured data extraction) but rarely the right pick for interactive web-dev coding in 2026. If you're using them as your main coder, you're saving pennies and burning hours.

CH 02

What you actually pay.

Per-million-token pricing is mostly noise — the number that matters is "cost per hour of actual coding." A model that's 3× more expensive per token but produces correct code in one shot is cheaper than one that needs three retries. Still, ballpark price-per-million is useful for sanity checks:
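The retry arithmetic is easy to check. The sketch below is illustrative only: `effectiveCost` is a hypothetical helper, the costs are placeholder units rather than real prices, and it assumes every attempt is billed in full, including the failed ones.

```typescript
// Hypothetical helper: total spend on one task when every attempt is billed,
// including the attempts that produce unusable code.
function effectiveCost(costPerAttempt: number, retries: number): number {
  const attempts = 1 + retries; // the first try plus each retry
  return costPerAttempt * attempts;
}

// Placeholder units, not real prices: the pricier model costs 3x per attempt.
const oneShotPremium = effectiveCost(3, 0); // correct on the first try
const cheapWithRetries = effectiveCost(1, 3); // needs three retries

console.log({ oneShotPremium, cheapWithRetries });
```

Under these assumptions the one-shot model comes out ahead even before counting the time spent reviewing each failed attempt.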

| Model | Input / 1M | Cached input / 1M | Output / 1M |
| --- | --- | --- | --- |
| Opus 4.7 | $15.00 | $1.50 | $75.00 |
| Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| GPT-5.3 Codex | $5.00 | $0.50 | $20.00 |
| Composer 2.5 | $1.20 | $0.12 | $6.00 |
| Gemini 3.1 Pro | $2.50 | $0.25 | $10.00 |

Notice that cached input costs 10% of fresh input across the board. Cursor, Claude Code, and Codex all cache aggressively by default; if your editor doesn't, check the setting. It's the single biggest cost lever after model choice.
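To see how much caching moves the bill, here is a rough sketch using the Sonnet 4.6 list prices from the table above. The `requestCost` helper, the 100k/2k token workload, and the 90% cache-hit ratio are all assumptions chosen for illustration, not measurements.

```typescript
// Per-million-token list prices in USD, copied from the table above.
interface Pricing {
  input: number;
  cachedInput: number;
  output: number;
}

const SONNET_4_6: Pricing = { input: 3.0, cachedInput: 0.3, output: 15.0 };

// Hypothetical helper: cost of one request in USD.
// cacheHitRatio is the fraction of input tokens served from cache (0 to 1).
function requestCost(
  price: Pricing,
  inputTokens: number,
  outputTokens: number,
  cacheHitRatio: number
): number {
  const cached = inputTokens * cacheHitRatio;
  const fresh = inputTokens - cached;
  const totalPerMillion =
    fresh * price.input + cached * price.cachedInput + outputTokens * price.output;
  return totalPerMillion / 1e6;
}

// Assumed workload: 100k input tokens and 2k output tokens per request.
const uncached = requestCost(SONNET_4_6, 100_000, 2_000, 0);
const mostlyCached = requestCost(SONNET_4_6, 100_000, 2_000, 0.9);

console.log(uncached.toFixed(3), mostlyCached.toFixed(3));
```

With these assumed numbers the same request costs about $0.33 uncached and about $0.09 with 90% of the input cached, which is why the cache setting is worth checking.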

A typical eight-hour coding day on Sonnet 4.6 with healthy caching runs roughly $4–8. On Opus 4.7 it's closer to $20–35. On Composer it's under $2. None of these are crazy money for a working dev; matching the price to the task is the discipline.

CH 03

Match model to task.

| Task | Pick | Why |
| --- | --- | --- |
| Rename a symbol across 60 files | Composer 2.5 | Mechanical. Speed and cost dominate. |
| "Explain how auth works in this app" | Sonnet 4.6 | Synthesis across files; doesn't need Opus's depth. |
| Migrate Pages Router → App Router | Opus 4.7 | Multi-file reasoning under structural constraints. |
| Build a feature in a worktree, unattended | GPT-5.3 Codex | Best at long autonomous loops without drift. |
| Code review on a PR | Sonnet 4.6 | Excellent at catching subtle issues without being precious. |
| Debug a Heisenbug in production code | Opus 4.7 | Holds a hypothesis across many turns better than the others. |
| Write tests for an existing module | Sonnet 4.6 | Knows test patterns; doesn't over-engineer. |
| Paste an entire monorepo to ask one question | Gemini 3.1 Pro | 2M-token context, holds detail across the whole tree. |
| Convert a Figma frame to JSX | Sonnet 4.6 or GPT-5.3 | Either is fine; pick whichever your editor's Figma MCP integrates with. |
| Fix a one-line typo in a function | Composer 2.5 | Anything fancier is wasted latency. |
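If you want these defaults in a script or team config, the table collapses into a small lookup. This is a sketch, not an official mapping: the `TaskKind` categories are hypothetical labels that group the rows above, and the Figma row is left out because the guide treats either pick as fine.

```typescript
// Model names as used in this guide.
type Model =
  | "Opus 4.7"
  | "Sonnet 4.6"
  | "GPT-5.3 Codex"
  | "Composer 2.5"
  | "Gemini 3.1 Pro";

// Hypothetical categories that condense the rows of the table above.
type TaskKind =
  | "mechanical-edit" // symbol renames, one-line typo fixes
  | "explain-or-review" // "explain how auth works", PR review
  | "write-tests"
  | "structural-migration" // e.g. Pages Router to App Router
  | "hard-debugging" // Heisenbugs in production code
  | "unattended-feature" // long autonomous run in a worktree
  | "whole-repo-question"; // pasting an entire monorepo

const PICKS: Record<TaskKind, Model> = {
  "mechanical-edit": "Composer 2.5",
  "explain-or-review": "Sonnet 4.6",
  "write-tests": "Sonnet 4.6",
  "structural-migration": "Opus 4.7",
  "hard-debugging": "Opus 4.7",
  "unattended-feature": "GPT-5.3 Codex",
  "whole-repo-question": "Gemini 3.1 Pro",
};

// Look up the guide's default pick for a task category.
function pickModel(task: TaskKind): Model {
  return PICKS[task];
}

console.log(pickModel("structural-migration"));
```

Keeping the mapping as data rather than branching logic makes the quarterly re-test a one-line change per row.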
"I'll just use Opus for everything"

Tempting and lazy. You pay 5× the rate and wait 3× as long for tasks Sonnet would have nailed first try. Reserve Opus for the cases where its depth actually shows up in the diff.

DEMO · INTERACTIVE

Live: pick a model for me.

Five questions, one recommendation, real price estimate. All math runs in your browser — nothing leaves the page.


The "fallback" is the model to switch to if the recommended one stumbles twice in a row. A third attempt on the same model is usually the wrong instinct; try a different one.
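The switching rule is simple enough to sketch. The tracker below is a hypothetical illustration, not the picker's actual code: `createFallbackTracker` and its parameter names are invented here, and the Sonnet-to-Opus pairing in the usage lines is only an example.

```typescript
// Hypothetical tracker for the "two stumbles in a row, then switch" rule.
// Model names are plain strings; maxStumbles defaults to the guide's two.
function createFallbackTracker(
  recommended: string,
  fallback: string,
  maxStumbles: number = 2
) {
  let consecutiveStumbles = 0;

  return {
    // Record one attempt's outcome and return the model to use next.
    // A success resets the streak, so only back-to-back stumbles count.
    record(succeeded: boolean): string {
      consecutiveStumbles = succeeded ? 0 : consecutiveStumbles + 1;
      return consecutiveStumbles >= maxStumbles ? fallback : recommended;
    },
  };
}

// Example pairing only; substitute whatever the picker suggests.
const tracker = createFallbackTracker("Sonnet 4.6", "Opus 4.7");
const afterFirstStumble = tracker.record(false); // still the recommended model
const afterSecondStumble = tracker.record(false); // now the fallback

console.log(afterFirstStumble, afterSecondStumble);
```

Counting only consecutive stumbles keeps one bad answer in an otherwise good session from triggering a switch.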

CH 05

When to trust auto-mode.

Cursor's auto-mode and similar features pick the model per request. They're surprisingly good once they have signal, surprisingly bad until then. A simple rule:

  • First 50 requests in a new repo: pick manually. The router has no idea what's hard here yet.
  • Routine ongoing work: auto-mode is fine and saves money — it'll skew toward Sonnet/Composer for most things.
  • Big refactor or architectural question: override auto-mode. Pick Opus or GPT-5.3 manually. Auto won't reach for premium tier on its own often enough.
  • Background / autonomous agents: never auto. Pick the model you'd trust for an hour unattended.
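The four rules above can be written as a single decision function. This is a sketch under assumptions: `chooseRouting`, `RoutingInput`, and the `WorkKind` labels are hypothetical names, and the request count is something you would have to track yourself.

```typescript
// Hypothetical labels for the kinds of work named in the rules above.
type WorkKind = "routine" | "big-refactor" | "architecture" | "background-agent";

interface RoutingInput {
  requestsInRepo: number; // requests already made in this repo
  work: WorkKind;
}

type Routing = "auto" | "manual";

// Decide whether to let auto-mode pick the model or to pick it yourself.
function chooseRouting(input: RoutingInput): Routing {
  // Background and autonomous agents: never auto.
  if (input.work === "background-agent") return "manual";
  // First 50 requests in a new repo: the router has no signal yet.
  if (input.requestsInRepo < 50) return "manual";
  // Big refactors and architectural questions: override auto-mode.
  if (input.work === "big-refactor" || input.work === "architecture") return "manual";
  // Routine ongoing work: auto-mode is fine.
  return "auto";
}

console.log(chooseRouting({ requestsInRepo: 200, work: "routine" }));
```

Only routine work in a repo the router already knows falls through to auto; every other branch is a manual pick.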
PITFALLS

Common pitfalls.

Picking by Twitter benchmark

SWE-Bench, HumanEval, MBPP — they're correlated with coding ability but they're not your codebase. The most predictive benchmark is "try the model on your last three PRs and see which one's diff you'd merge."

Staying on yesterday's model

Models leapfrog quarterly. A model you chose six months ago may be middle of the pack now. Re-test your defaults every quarter — 30 minutes of side-by-side comparison can change which model gets your $200/month of token spend.

Switching mid-task

If a model is stuck, switching models in the same context is almost never the fix — the bad context follows. Start a fresh chat with the new model. The old chat was the problem, not the model.

What to read next.

  • Guide · 02: Stop Burning Tokens. Now that you've picked a model, control what you spend on it.
  • Benchmarks: Side-by-side benchmarks. The same prompts run across models. Look at the diffs, not the rankings.
  • Guide · 04: Running 4 Agents at Once. Different model per agent is a trap. Pick one, use it across the worktrees.
Changelog
  • 2026-05-22: Initial publish. Pricing reflects May 2026 list rates.