OpenAI released GPT-5.5, the company's most capable model to date and the one most clearly aimed at how engineers actually work—long-running, multi-step tasks that span coding, computer use, knowledge work, and early scientific research. The pitch is unusually pointed: instead of carefully managing every step, give GPT-5.5 a messy, multi-part task and trust it to plan, use tools, check its own work, and keep going.
The release is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro available to the Pro, Business, and Enterprise tiers. OpenAI says API availability is coming "very soon," priced at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M-token context window.
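To make the pricing concrete, here is a minimal sketch of a per-request cost estimate at the quoted rates. Only the two per-million-token rates come from the announcement; the helper name `estimate_cost_usd` and the example token counts are illustrative assumptions, and real billing may include adjustments this sketch does not model.

```python
# Rates quoted in the announcement (USD per 1M tokens).
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 30.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted GPT-5.5 API rates.

    Hypothetical helper for illustration only; any discounts or
    surcharges applied by the real API are not modeled here.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Illustrative request (assumed sizes): 200k tokens in, 20k tokens out.
cost = estimate_cost_usd(200_000, 20_000)
print(f"${cost:.2f}")  # $1.60
```

Output tokens cost six times as much as input tokens at these rates, so for long agentic runs the number of tokens generated tends to dominate the bill.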
State-of-the-Art on the Coding Benchmarks That Matter
GPT-5.5 posts the strongest agentic coding numbers OpenAI has ever shipped:
- 82.7% on Terminal-Bench 2.0 (vs. 75.1% for GPT-5.4 and 69.4% for Claude Opus 4.7)
- 58.6% on SWE-Bench Pro—real-world GitHub issue resolution in a single pass
- 73.1% on Expert-SWE, OpenAI's internal eval for long-horizon coding tasks with a median 20-hour human completion time
- 78.7% on OSWorld-Verified for computer-use agents
Per Artificial Analysis's Coding Agent Index, GPT-5.5 delivers state-of-the-art intelligence at half the cost of competing frontier coding models. It also gets there using fewer tokens than GPT-5.4 across all three coding evals: per-token latency holds at GPT-5.4 levels, so finishing a task takes less wall-clock time.
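The latency point is simple arithmetic: if the time per generated token is unchanged, generation time scales with the number of tokens produced. The sketch below uses made-up numbers to show the relationship. The token counts and the 20 ms per token figure are hypothetical, not measured values, and time spent in tool calls or on the network is ignored.

```python
def generation_seconds(output_tokens: int, seconds_per_token: float) -> float:
    """Rough generation time, ignoring tool calls and network overhead."""
    return output_tokens * seconds_per_token


SECONDS_PER_TOKEN = 0.02  # hypothetical, assumed identical for both models

# Hypothetical token counts for the same task; not measured figures.
older_model_tokens = 50_000
newer_model_tokens = 35_000

older = generation_seconds(older_model_tokens, SECONDS_PER_TOKEN)
newer = generation_seconds(newer_model_tokens, SECONDS_PER_TOKEN)
print(f"{older:.0f}s vs {newer:.0f}s, saving {1 - newer / older:.0%}")
# 1000s vs 700s, saving 30%
```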
Engineers Describe a Step Change
The qualitative reactions from early testers are unusually direct. Dan Shipper of Every called it "the first coding model I've used that has serious conceptual clarity"—and proved the point by replaying a debugging episode where GPT-5.4 had failed and his best engineer had eventually rewritten part of the system. GPT-5.5 produced essentially the same rewrite.
Pietro Schirano of MagicPath said GPT-5.5 merged a branch with hundreds of frontend and refactor changes into a substantially changed main branch in one shot, in about 20 minutes. Senior engineers at OpenAI testing the model said it was noticeably stronger than both GPT-5.4 and Claude Opus 4.7 at reasoning, autonomy, catching issues in advance, and predicting testing and review needs without explicit prompting. One NVIDIA engineer with early access put it more bluntly: "Losing access to GPT-5.5 feels like I've had a limb amputated."
Knowledge Work and Codex Computer Use
The same instruction-following and tool-use gains carry into the rest of the work-on-a-computer loop. In Codex, GPT-5.5 generates better documents, spreadsheets, and slides; combined with Codex's computer-use skills, it can see what's on screen, click, type, navigate interfaces, and move across tools with precision. More than 85% of OpenAI's own employees use Codex weekly: the comms team uses it to triage speaking requests, finance used it to review 24,771 K-1 tax forms across 71,637 pages, and a go-to-market employee automated weekly business reports, saving 5 to 10 hours a week.
On knowledge-work benchmarks, GPT-5.5 hits 84.9% on GDPval, 98.0% on Tau2-bench Telecom (without prompt tuning), and 88.5% on internal investment-banking modeling. The Pro variant is the one early testers point to for the biggest jump on legal, business, education, and data-science work.
Scientific Research and a New Ramsey Proof
The most surprising data point is in research. An internal version of GPT-5.5 with a custom harness helped discover a new proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers in combinatorics, and the proof was later verified in Lean. It's a concrete example of a model contributing not just code or explanation but a useful, surprising mathematical argument.
On BixBench (real-world bioinformatics), GPT-5.5 leads published scores. On GeneBench, a new multi-stage genetics and quantitative biology eval, it shows clear improvement over GPT-5.4 on tasks that often correspond to multi-day projects for scientific experts. Researchers at the Jackson Laboratory used GPT-5.5 Pro to analyze a 28,000-gene expression dataset and produce a research report that, by one researcher's account, "would have taken his team months."
Cybersecurity Capability Triggers New Safeguards
OpenAI is treating GPT-5.5's cybersecurity and biological/chemical capabilities as High under its Preparedness Framework—not Critical, but a clear step up from GPT-5.4. The company is shipping stricter cyber classifiers (which "some users may find annoying initially") and expanding a Trusted Access for Cyber program that gives verified defenders less-restricted access for legitimate security work.
Why It Matters for Web Developers
GPT-5.5 is the model that makes the gap between "AI helps me code" and "AI runs the long task" feel narrower in a way that benchmarks alone don't capture. The combination of state-of-the-art Terminal-Bench scores, fewer tokens per task, per-token latency on par with GPT-5.4, and a 1M-token context window in the API means Codex, and any third-party tool that ships GPT-5.5 support, can take on longer-horizon work without the cost or latency penalty that usually comes with more capability. For Cursor, Codex CLI, and any tool that exposes model selection, this becomes the default coding model the moment it hits the API.