Can a local model match Claude or GPT-5 for coding?

Not on hard tasks, no. The best 2026 open-weights models — Qwen3 Coder 32B, DeepSeek-Coder-V3 — are roughly in the same league as Claude Haiku 4 or GPT-5 Mini. That's still very useful: autocomplete, mechanical refactors, in-editor questions all work well. For multi-file refactors, agentic flows, or anything Opus-tier, you'll keep reaching for frontier models.

What hardware do I need to run local models for coding?

Minimum: M-series Mac with 16 GB unified memory, or a PC with a 12 GB+ GPU. That runs a 7B-13B parameter coding model with usable latency. Sweet spot: 32-64 GB unified memory or a 24 GB GPU — runs a Qwen3 Coder 32B comfortably with a 32K-token context. Beyond 64 GB you can run 70B-class models locally, but at that point you're an enthusiast, not just a frugal dev.

What's the actual payback time on local hardware?

It depends entirely on what you'd otherwise spend on API tokens. If you're a $5/day Sonnet user, never. If you're a $40/day power user where 60% of your spend is on autocomplete and quick edits, a $2000 upgrade pays back in 3-4 months — and you can keep using frontier models for the hard 40%.

Do I need to give up Cursor or Claude Code?

No. Cursor supports an OpenAI-compatible custom endpoint, which is exactly what Ollama exposes. Claude Code via litellm proxy works similarly. Most people end up with a hybrid setup: a local model for autocomplete and a frontier model for chat / agentic work. Cline, Zed, and Aider are designed around this hybrid pattern from the start.

Self-Hosting an AI Coding Stack in 2026

CH 01

Why even bother in 2026.

Frontier models keep getting cheaper. Sonnet 4.6 at $3 per million input tokens is hard to argue with. So why run anything locally?

Autocomplete is on every keystroke. The bill for 8 hours of in-editor completions runs $5–15 per day even on cheap models. A local model handles this for free, with lower latency.
Privacy you can prove. Some clients, some industries, some governments require it. "It's not in our SaaS terms" is not the same as "no token left your laptop."
Working on planes. Genuinely useful for the long-haul commute or the spotty co-working WiFi.
Learning leverage. Running a model yourself forces you to understand context, quantization, and tokenization in a way that API calls don't. That knowledge transfers.

What it won't replace: hard agentic work, multi-file reasoning, anything where you'd reach for Opus or GPT-5.3. The 32B open-weights models in 2026 are about where Sonnet was in 2024. That's a useful tier — not the top one.

CH 02

The hardware tiers.

Tier	Hardware	What runs well	Verdict
Entry	M2/M3 MacBook Air, 16 GB	7B-class coding models. Qwen2.5-Coder 7B at Q4. Autocomplete only.	Limited
Sweet spot	M3/M4 Pro Mac, 32–48 GB; or RTX 4090/5080, 16–24 GB VRAM	Qwen3 Coder 14B, DeepSeek-Coder-V3 16B. Autocomplete + chat.	Recommended
Enthusiast	M3/M4 Max Mac, 64 GB+; or dual 4090 / single 6000-class	Qwen3 Coder 32B at Q6. DeepSeek-V3 distilled. Approaches Sonnet-class for many web tasks.	If you can swing it
Home-lab	M3 Ultra Mac Studio, 192 GB; or 2× H100; or Strix Halo desktop	Llama 4 70B-class. Long context. Multiple parallel sessions.	Hobbyist

Apple's unified memory architecture is a quiet superpower for local LLMs — a $3,500 M4 Max with 64 GB runs models a $5,000 PC build can't, because GPU memory and system memory aren't separate. If you're buying new specifically for this, take a hard look at the Mac side.

CH 03

Models worth running.

Model	Good for	RAM @ Q4
Qwen3 Coder 7B	Autocomplete, single-file edits, in-editor questions on small files.	~5 GB
Qwen3 Coder 14B	Chat about code, refactors within one file, doc generation.	~9 GB
Qwen3 Coder 32B	Most "what's this codebase doing" questions; small multi-file refactors.	~20 GB
DeepSeek-Coder-V3 16B	Strong on systems languages, surprisingly good on TS/JS too.	~11 GB
Llama 4 70B	General-purpose chat that happens to code. Slower per token; broader knowledge.	~42 GB
StarCoder3 15B	Pure fill-in-the-middle for fast autocomplete. Don't use for chat.	~10 GB

Quantization matters more than you think

The same model at Q4_K_M vs Q8 will halve your memory use and roughly double your token throughput, with a small but real quality hit. Start at Q4 for chat models, Q8 for autocomplete (which is more sensitive to small errors). If you've got the RAM, Q6 is a nice middle.

CH 04

The actual stack.

One reliable shape that works on Mac and Linux:

Server: ollama (easy) or llama.cpp (faster, more knobs).
Editor: Cline (VS Code), Zed (native), or Cursor pointed at a custom endpoint.
Terminal: Aider or OpenCode, both with local provider support.
Frontier fallback: the same editor, with a hotkey to switch models. You'll use it daily.

$ install Ollama and a coding model

# macOS
brew install ollama
ollama serve &

# Pull a model — Q4 by default
ollama pull qwen3-coder:14b
ollama pull qwen3-coder:7b   # for autocomplete

# Smoke test
ollama run qwen3-coder:14b "Refactor this to use async/await: ..."

Cursor · Ollama as custom OpenAI endpoint

# Cursor Settings → Models → OpenAI API Key
# Override OpenAI Base URL:
http://localhost:11434/v1

# Add a custom model name matching your Ollama tag:
qwen3-coder:14b

# Keep a frontier model on a hotkey for hard tasks.

$ Aider with a local model

aider --model ollama/qwen3-coder:14b \
      --no-auto-commits \
      file1.ts file2.ts

Your laptop will be a hairdryer

A 14B model on an M3 Pro at full load pulls 35–45W and the fans will let you know. Plug in, expect 4–6 hours of battery instead of 12. If you live on the train, keep autocomplete on a smaller 7B model and chat on the 14B.

DEMO · INTERACTIVE

Live: payback calculator.

How fast does a hardware upgrade pay for itself, given your current spend and the share of work a local model can handle? All math runs in your browser.

Payback calculator Heuristic · Numbers in your browser only

Current AI spend / day $15

% of work a local model can handle 55%

Hardware option Sweet spot

Working days / month 20

Pays back in — months

Pick your inputs to see the math.

Hardware cost: $0
Monthly savings: $0
3-year savings: $0
Verdict: —

Read the verdict, not just the months. A 14-month payback on a laptop you'll use for everything else is excellent; the same payback on a single-purpose home-lab box is questionable. The math here only counts AI savings.

PITFALLS

Common pitfalls.

Pretending you don't need frontier

The temptation after a $4k Mac upgrade is to go all-in on local and cancel your API subs. Don't. Frontier is genuinely better on the 15–30% of tasks where Opus or GPT-5.3 makes a visible difference. Hybrid is the right answer.

Context window math

Most local models advertise 32K or 128K context — at peak memory cost. In practice, KV-cache grows quadratically and you'll hit RAM ceilings at half of advertised context. Test with a real codebase paste before betting on the advertised number.

Skipping the autocomplete tier

People install Ollama, pull a 14B chat model, and never set up the small autocomplete model. Then the experience feels slow and they give up. The 7B-for-autocomplete + 14B-for-chat split is the trick that makes a local stack feel fast.