
Self-Hosting an AI Coding Stack.

Ollama, Continue, Aider, and a local Qwen or DeepSeek model. When local actually wins, when it doesn't, and the payback math.

On this page
  1. Why even bother in 2026
  2. The hardware tiers
  3. Models worth running
  4. The actual stack
  5. Live: payback calculator
  6. Common pitfalls
CH 01

Why even bother in 2026.

Frontier models keep getting cheaper. Sonnet 4.6 at $3 per million input tokens is hard to argue with. So why run anything locally?

  • Autocomplete is on every keystroke. The bill for 8 hours of in-editor completions runs $5–15 per day even on cheap models. A local model handles this at no per-token cost, and usually with lower latency.
  • Privacy you can prove. Some clients, some industries, some governments require it. "It's not in our SaaS terms" is not the same as "no token left your laptop."
  • Working on planes. Genuinely useful for the long-haul commute or the spotty co-working WiFi.
  • Learning leverage. Running a model yourself forces you to understand context, quantization, and tokenization in a way that API calls don't. That knowledge transfers.
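
The per-day figure in the first bullet is easy to sanity-check. Here is a rough sketch, assuming a hypothetical workload (600 completions per hour, about 1,500 prompt tokens and 30 completion tokens each) and placeholder prices of $1 and $5 per million input and output tokens. None of these numbers come from a real price list, so swap in your own.

```python
def daily_autocomplete_cost(requests_per_hour: float,
                            input_tokens_per_request: float,
                            output_tokens_per_request: float,
                            input_price_per_mtok: float,
                            output_price_per_mtok: float,
                            hours_per_day: float = 8.0) -> float:
    """Estimated USD per day for hosted autocomplete (all inputs are assumptions)."""
    requests = requests_per_hour * hours_per_day
    input_cost = requests * input_tokens_per_request / 1_000_000 * input_price_per_mtok
    output_cost = requests * output_tokens_per_request / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Hypothetical workload and prices, for illustration only.
cost = daily_autocomplete_cost(600, 1500, 30, 1.0, 5.0)
print(f"${cost:.2f} per day")  # prints "$7.92 per day"
```

Input tokens dominate because every completion resends the surrounding file context, which is why autocomplete stays expensive even when the completions themselves are short.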

What it won't replace: hard agentic work, multi-file reasoning, anything where you'd reach for Opus or GPT-5.3. The 32B open-weights models in 2026 are about where Sonnet was in 2024. That's a useful tier — not the top one.

CH 02

The hardware tiers.

Tier | Hardware | What runs well | Verdict
Entry | M2/M3 MacBook Air, 16 GB | 7B-class coding models. Qwen2.5-Coder 7B at Q4. Autocomplete only. | Limited
Sweet spot | M3/M4 Pro Mac, 32–48 GB; or RTX 4090/5080, 16–24 GB VRAM | Qwen3 Coder 14B, DeepSeek-Coder-V3 16B. Autocomplete + chat. | Recommended
Enthusiast | M3/M4 Max Mac, 64 GB+; or dual 4090 / single 6000-class | Qwen3 Coder 32B at Q6. DeepSeek-V3 distilled. Approaches Sonnet-class for many web tasks. | If you can swing it
Home-lab | M3 Ultra Mac Studio, 192 GB; or 2× H100; or Strix Halo desktop | Llama 4 70B-class. Long context. Multiple parallel sessions. | Hobbyist

Apple's unified memory architecture is a quiet superpower for local LLMs: a $3,500 M4 Max with 64 GB runs models that a $5,000 PC build with a single 24 GB consumer GPU can't hold in VRAM, because on the Mac the GPU and CPU share one memory pool. If you're buying new specifically for this, take a hard look at the Mac side.

CH 03

Models worth running.

Model | Good for | RAM @ Q4
Qwen3 Coder 7B | Autocomplete, single-file edits, in-editor questions on small files. | ~5 GB
Qwen3 Coder 14B | Chat about code, refactors within one file, doc generation. | ~9 GB
Qwen3 Coder 32B | Most "what's this codebase doing" questions; small multi-file refactors. | ~20 GB
DeepSeek-Coder-V3 16B | Strong on systems languages, surprisingly good on TS/JS too. | ~11 GB
Llama 4 70B | General-purpose chat that happens to code. Slower per token; broader knowledge. | ~42 GB
StarCoder3 15B | Pure fill-in-the-middle for fast autocomplete. Don't use for chat. | ~10 GB
Quantization matters more than you think

Q4_K_M roughly halves a model's memory use compared with Q8 and, because token generation is mostly limited by memory bandwidth, can come close to doubling throughput, at the cost of a small but real quality hit. Start at Q4 for chat models and Q8 for autocomplete (which is more sensitive to small errors). If you've got the RAM, Q6 is a nice middle.
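
To see where memory numbers like these come from, here is a minimal sketch that estimates weight size from parameter count and quantization. The bits-per-weight values are approximate assumptions for common GGUF quantizations, and the 2 GB overhead for runtime buffers and a short context is a placeholder, not a measured figure.

```python
# Approximate effective bits per weight (assumed; varies by model architecture).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str, memory_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit in memory_gb."""
    return weights_gb(params_billions, quant) + overhead_gb <= memory_gb


for quant in ("Q4_K_M", "Q6_K", "Q8_0"):
    print(f"14B at {quant}: ~{weights_gb(14, quant):.1f} GB of weights")
```

Under these assumptions a 14B model needs about 8.5 GB of weights at Q4_K_M and about 14.9 GB at Q8_0, so the Q4 file is a little over half the size rather than exactly half.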

CH 04

The actual stack.

One reliable shape that works on Mac and Linux:

  • Server: ollama (easy) or llama.cpp (faster, more knobs).
  • Editor: Continue (VS Code / JetBrains), Zed (native), or Cursor pointed at a custom endpoint.
  • Terminal: Aider or OpenCode, both with local provider support.
  • Frontier fallback: the same editor, with a hotkey to switch models. You'll use it daily.
$ install Ollama and a coding model
# macOS
brew install ollama
ollama serve &

# Pull a model — Q4 by default
ollama pull qwen3-coder:14b
ollama pull qwen3-coder:7b   # for autocomplete

# Smoke test
ollama run qwen3-coder:14b "Refactor this to use async/await: ..."
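
The same smoke test can be run from a script against Ollama's local HTTP API. This is a minimal sketch, assuming the server is listening on its default address (localhost:11434) and that the model tag from the commands above has been pulled. The helper names `build_request` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if you run the server elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("qwen3-coder:14b", "Write a one-line Python hello world."))` should print a short completion. A connection error here usually means `ollama serve` is not running.
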
~/.continue/config.yaml · two-model setup
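# Top-level name, version, and schema are required by config.yaml; the name is a placeholder
name: Local coding stack
version: 1.0.0
schema: v1
models:
  - name: Qwen3 Coder 14B (local)
    provider: ollama
    model: qwen3-coder:14b
    roles: [chat, edit]
  - name: Qwen3 Coder 7B (autocomplete)
    provider: ollama
    model: qwen3-coder:7b
    roles: [autocomplete]
  - name: Sonnet (fallback)
    provider: anthropic
    model: claude-sonnet-4-6
    # Continue resolves secrets with the ${{ secrets.NAME }} syntax
    apiKey: ${{ secrets.ANTHROPIC_API_KEY }}
    roles: [chat, edit]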
$ Aider with a local model
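# Point Aider at the local Ollama server (default port shown)
export OLLAMA_API_BASE=http://127.0.0.1:11434
# Aider's docs recommend the ollama_chat/ prefix over ollama/
aider --model ollama_chat/qwen3-coder:14b \
      --no-auto-commits \
      file1.ts file2.ts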
Your laptop will be a hairdryer

A 14B model on an M3 Pro at full load pulls 35–45W and the fans will let you know. Plug in when you can; on battery, expect 4–6 hours instead of 12. If you live on the train, keep autocomplete on a smaller 7B model and save the 14B for chat.

DEMO · INTERACTIVE

Live: payback calculator.

How fast does a hardware upgrade pay for itself, given your current spend and the share of work a local model can handle? All math runs in your browser.


Read the verdict, not just the months. A 14-month payback on a laptop you'll use for everything else is excellent; the same payback on a single-purpose home-lab box is questionable. The math here only counts AI savings.
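
The payback arithmetic itself is simple enough to sketch. This version assumes monthly savings equal your current API spend times the share of work a local model can take over, and it defines 3-year savings as net of the hardware cost. Both definitions are assumptions for illustration, and electricity is ignored.

```python
import math


def payback(hardware_cost: float, monthly_api_spend: float, local_share: float) -> dict:
    """Payback estimate; local_share is the fraction (0 to 1) of spend moved to local."""
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    monthly_savings = monthly_api_spend * local_share
    # No savings means the hardware never pays for itself.
    months = hardware_cost / monthly_savings if monthly_savings > 0 else math.inf
    return {
        "monthly_savings": monthly_savings,
        "months_to_payback": months,
        "three_year_net_savings": monthly_savings * 36 - hardware_cost,
    }


# Hypothetical inputs: $2,000 upgrade, $200/month API spend, 60% handled locally.
result = payback(2000, 200, 0.6)
print(f"Pays back in {result['months_to_payback']:.1f} months")  # 16.7 months
```

With those hypothetical inputs the savings are $120 per month, so the upgrade pays back in just under 17 months and nets $2,320 over three years.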

PITFALLS

Common pitfalls.

Pretending you don't need frontier

The temptation after a $4k Mac upgrade is to go all-in on local and cancel your API subs. Don't. Frontier is genuinely better on the 15–30% of tasks where Opus or GPT-5.3 makes a visible difference. Hybrid is the right answer.

Context window math

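Most local models advertise 32K or 128K context, but filling that window costs memory on top of the weights. The KV cache grows linearly with context length (it is attention compute that grows quadratically), and at long contexts it can run to many gigabytes, so you'll often hit RAM ceilings at around half of the advertised context. Test with a real codebase paste before betting on the advertised number.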
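
A back-of-the-envelope estimate shows how much memory context consumes. This sketch uses the standard KV-cache size formula for a transformer; the model dimensions (48 layers, 8 KV heads, head dimension 128, 16-bit cache values) are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes for one sequence at the given context length."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 2 ** 30
# Hypothetical model dimensions, for illustration only.
for context in (32 * 1024, 128 * 1024):
    size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=context)
    print(f"{context:>7} tokens: {size / GIB:.0f} GiB of KV cache")
```

With these assumed dimensions the cache takes 6 GiB at 32K tokens and 24 GiB at 128K. Each token adds a fixed cost, so the total scales in proportion to context length, and at 128K it can exceed the Q4 weights of a 32B model.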

Skipping the autocomplete tier

People install Ollama, pull a 14B chat model, and never set up the small autocomplete model. Then the experience feels slow and they give up. The 7B-for-autocomplete + 14B-for-chat split is the trick that makes a local stack feel fast.

What to read next.

  • Tool · Aider: Terminal-native, hybrid-friendly, works with Ollama out of the box.
  • Guide · 08 · Picking a Model in 2026: Helps decide which 15–30% of your work to keep on frontier.
  • Guide · 02 · Stop Burning Tokens: Before you buy hardware, check what's actually driving your spend.
Changelog
  • 2026-05-22: Initial publish. Hardware tiers reflect May 2026 availability.