Why even bother in 2026.
Frontier models keep getting cheaper. Sonnet 4.6 at $3 per million input tokens is hard to argue with. So why run anything locally?
- Autocomplete is on every keystroke. The bill for 8 hours of in-editor completions runs $5–15 per day even on cheap models. A local model handles this with no per-token cost and, with no network round trip, usually lower latency.
- Privacy you can prove. Some clients, some industries, some governments require it. "It's not in our SaaS terms" is not the same as "no token left your laptop."
- Working on planes. Genuinely useful for the long-haul commute or the spotty co-working WiFi.
- Learning leverage. Running a model yourself forces you to understand context, quantization, and tokenization in a way that API calls don't. That knowledge transfers.
What it won't replace: hard agentic work, multi-file reasoning, anything where you'd reach for Opus or GPT-5.3. The 32B open-weights models in 2026 are about where Sonnet was in 2024. That's a useful tier — not the top one.
The hardware tiers.
| Tier | Hardware | What runs well | Verdict |
|---|---|---|---|
| Entry | M2/M3 MacBook Air, 16 GB | 7B-class coding models. Qwen2.5-Coder 7B at Q4. Autocomplete only. | Limited |
| Sweet spot | M3/M4 Pro Mac, 32–48 GB; or RTX 4090/5080, 16–24 GB VRAM | Qwen3 Coder 14B, DeepSeek-Coder-V3 16B. Autocomplete + chat. | Recommended |
| Enthusiast | M3/M4 Max Mac, 64 GB+; or dual 4090 / single 6000-class | Qwen3 Coder 32B at Q6. DeepSeek-V3 distilled. Approaches Sonnet-class for many web tasks. | If you can swing it |
| Home-lab | M2/M3 Ultra Mac Studio, 192 GB+; or 2× H100; or Strix Halo desktop | Llama 4 70B-class. Long context. Multiple parallel sessions. | Hobbyist |
Apple's unified memory architecture is a quiet superpower for local LLMs: a $3,500 M4 Max with 64 GB runs models a $5,000 PC build can't, because GPU memory and system memory aren't separate. The GPU can use most of that 64 GB, while a single consumer card is capped at its 16–24 GB of VRAM. If you're buying new specifically for this, take a hard look at the Mac side.
Models worth running.
| Model | Good for | RAM @ Q4 |
|---|---|---|
| Qwen3 Coder 7B | Autocomplete, single-file edits, in-editor questions on small files. | ~5 GB |
| Qwen3 Coder 14B | Chat about code, refactors within one file, doc generation. | ~9 GB |
| Qwen3 Coder 32B | Most "what's this codebase doing" questions; small multi-file refactors. | ~20 GB |
| DeepSeek-Coder-V3 16B | Strong on systems languages, surprisingly good on TS/JS too. | ~11 GB |
| Llama 4 70B | General-purpose chat that happens to code. Slower per token; broader knowledge. | ~42 GB |
| StarCoder3 15B | Pure fill-in-the-middle for fast autocomplete. Don't use for chat. | ~10 GB |
Running the same model at Q4_K_M instead of Q8 roughly halves memory use and roughly doubles token throughput, with a small but real quality hit. Start at Q4 for chat models and Q8 for autocomplete (which is more sensitive to small errors). If you've got the RAM, Q6 is a nice middle.
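If you want to sanity-check the RAM column for a model that isn't in the table, the weights-only footprint is just parameter count times bits per weight. Here is a rough sketch. The bits-per-weight values are approximations I'm assuming for common GGUF quantization levels, not exact figures, and the estimate leaves out KV cache and runtime overhead.

```python
# Approximate effective bits per weight for common GGUF quantization levels.
# These are rough assumptions; real files vary by model architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes).

    Excludes KV cache and runtime overhead, which come on top.
    """
    bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


# Print one row per model size, with an estimate for each quantization level.
for size in (7, 14, 32, 70):
    row = ", ".join(f"{q}: {weights_gb(size, q):.1f} GB" for q in BITS_PER_WEIGHT)
    print(f"{size}B -> {row}")
```

Treat the result as a floor: leave a few GB of headroom for context and for everything else running on the machine.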
The actual stack.
One reliable shape that works on Mac and Linux:
- Server: `ollama` (easy) or `llama.cpp` (faster, more knobs).
- Editor: Continue (VS Code / JetBrains), Zed (native), or Cursor pointed at a custom endpoint.
- Terminal: Aider or OpenCode, both with local provider support.
- Frontier fallback: the same editor, with a hotkey to switch models. You'll use it daily.
```sh
# macOS
brew install ollama
ollama serve &

# Pull a model (Q4 by default)
ollama pull qwen3-coder:14b
ollama pull qwen3-coder:7b   # for autocomplete

# Smoke test
ollama run qwen3-coder:14b "Refactor this to use async/await: ..."
```
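Editors and terminal tools talk to the same server over HTTP, so it can be worth checking that path too. This is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default port (11434) and uses the non-streaming form of the `/api/generate` endpoint. The helper names `build_request` and `generate` are mine, for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("qwen3-coder:14b", "Say hi in one word.")` should return the model's reply as a string. A connection error here usually means `ollama serve` isn't up.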
```yaml
models:
  - name: Qwen3 Coder 14B (local)
    provider: ollama
    model: qwen3-coder:14b
    roles: [chat, edit]
  - name: Qwen3 Coder 7B (autocomplete)
    provider: ollama
    model: qwen3-coder:7b
    roles: [autocomplete]
  - name: Sonnet (fallback)
    provider: anthropic
    model: claude-sonnet-4-6
    apiKey: ${ANTHROPIC_API_KEY}
    roles: [chat, edit]
```
```sh
aider --model ollama/qwen3-coder:14b \
  --no-auto-commits \
  file1.ts file2.ts
```

A 14B model on an M3 Pro at full load pulls 35–45 W and the fans will let you know. Plug in, or expect 4–6 hours of battery instead of 12. If you live on the train, keep autocomplete on a smaller 7B model and chat on the 14B.
Live: payback calculator.
How fast does a hardware upgrade pay for itself, given your current spend and the share of work a local model can handle?
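For a back-of-the-envelope version, here is a minimal sketch of that kind of payback math. It assumes the only saving is the share of your current AI spend that moves to the local model, and it ignores electricity, resale value, and anything else the machine is good for. The function name and the example inputs are hypothetical.

```python
import math


def payback_months(hardware_cost: float, monthly_ai_spend: float,
                   local_share: float) -> float:
    """Months until AI savings cover the hardware cost.

    local_share is the fraction (0..1) of current spend that a local
    model can absorb. Returns math.inf if nothing is saved.
    """
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    monthly_savings = monthly_ai_spend * local_share
    if monthly_savings <= 0:
        return math.inf
    return hardware_cost / monthly_savings


# Hypothetical inputs: a $3,500 machine, $250/month spend, half handled locally.
print(f"{payback_months(3500, 250, 0.5):.0f} months")
```

The share is the number to be honest about: autocomplete and single-file chat move over easily, while hard agentic work mostly stays on the frontier model.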
Read the verdict, not just the months. A 14-month payback on a laptop you'll use for everything else is excellent; the same payback on a single-purpose home-lab box is questionable. The math here only counts AI savings.
Common pitfalls.
The temptation after a $4k Mac upgrade is to go all-in on local and cancel your API subs. Don't. Frontier is genuinely better on the 15–30% of tasks where Opus or GPT-5.3 makes a visible difference. Hybrid is the right answer.
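Most local models advertise 32K or 128K context, but that figure comes at peak memory cost. The KV cache grows linearly with context length (it is attention compute that grows quadratically), and because it sits on top of the weights, you can hit your RAM ceiling at around half of the advertised context. Test with a real codebase paste before betting on the advertised number.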
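To see how much memory context actually costs, you can estimate the KV cache from the model's shape. This is a rough sketch, assuming a standard transformer that stores one key and one value vector per KV head per layer, at 16-bit precision. The shape below is hypothetical, picked only for illustration. Read the real values from your model's config.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GB (10^9 bytes) for one sequence.

    Two tensors (keys and values) are stored per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 1e9


# Hypothetical 14B-class shape, chosen for illustration only.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx, **shape):.1f} GB")
```

With this made-up shape, 32K tokens costs about 6.4 GB and 128K about 25.8 GB, on top of the weights. That is how a model that fits comfortably at short context stops fitting at long context.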
People install Ollama, pull a 14B chat model, and never set up the small autocomplete model. Then the experience feels slow and they give up. The 7B-for-autocomplete + 14B-for-chat split is the trick that makes a local stack feel fast.