Every Claude Code user knows the ritual: approve, approve, approve, approve. Anthropic's data confirms it — users accept 93% of permission prompts. The result isn't safety. It's approval fatigue. Developers stop reading what they're approving, or they reach for --dangerously-skip-permissions and remove all guardrails entirely.
Auto mode is Anthropic's answer: a two-layer classifier system that makes permission decisions on your behalf. Safe actions proceed automatically. Dangerous ones get blocked — and Claude is told to try a different approach. No more clicking "approve" 87 times in a 45-minute session.
Two Layers of Defense
The architecture splits protection between what Claude reads and what Claude does:
- Input layer — a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output) before they enter the agent's context. When content looks like a hijacking attempt, it warns the agent to treat it as suspect and re-anchor on user intent.
- Output layer — a transcript classifier running on Sonnet 4.6 evaluates each tool call before execution. Stage 1 is a fast single-token yes/no filter tuned to block aggressively. Stage 2 runs chain-of-thought reasoning only on flagged actions, cutting false positives from 8.5% down to 0.4%.
The classifier is deliberately reasoning-blind — it sees user messages and bare tool calls, but Claude's own prose and tool outputs are stripped. This prevents the agent from talking the classifier into a bad decision with persuasive rationalizations, and it keeps injected content in tool outputs from reaching the evaluator.
What Gets Blocked
Anthropic published examples from their internal incident log of real overeager behaviors auto mode is designed to catch:
- Scope escalation — user says "clean up old branches," agent deletes remote branches via pattern match
- Credential exploration — agent hits an auth error and starts grepping environment variables for alternative tokens
- Agent-inferred parameters — user says "cancel my job," agent picks the closest match by name similarity and tries to delete it
- Data exfiltration — agent constructs a GitHub Gist command to share a script externally
- Safety-check bypass — a deploy command fails a pre-check, agent retries with a skip-verification flag
The Honest Numbers
Anthropic is unusually transparent about the tradeoffs. On 10,000 real internal tool calls, the full pipeline has a 0.4% false positive rate — meaning very few legitimate actions get blocked. But on 52 real overeager actions from their incident log, the false negative rate is 17%. That means roughly 1 in 6 genuinely dangerous actions slips through.
The key insight from Anthropic's engineering post: "Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression." Auto mode targets the first group — the developers who already removed all guardrails because the prompts were unsustainable.
How It Recovers
When the classifier blocks an action, Claude doesn't halt. It receives the denial with instructions to find a safer path. If it accumulates 3 consecutive denials or 20 total, auto mode escalates to the human. In headless mode (claude -p), the process terminates instead. This makes false positives survivable — a blocked action costs one retry, not a killed session.
Auto mode is available now in research preview for Team plan users, with Enterprise and API rollout incoming. It works with Claude Sonnet 4.6 and Opus 4.6. Anthropic recommends sandboxed environments and acknowledges the feature reduces risk without eliminating it — but for the majority of developers who were already running without guardrails, that's a significant upgrade.