Why streaming, why now.
A non-streaming LLM call feels broken. The user clicks, sees a spinner, waits seven seconds, then a wall of text appears. A streaming call feels alive — first token in 400 ms, the rest arrives as the model thinks. Same total time, completely different product.
The web platform already has the primitive for this — Server-Sent Events. The AI SDK 5 made SSE the official wire format for chat in August 2025, which means streaming is now a few lines of code instead of a custom protocol you maintain. There is no good reason in 2026 to ship a non-streaming chat endpoint.
- Perceived latency drops 80%. Time-to-first-token is what the user feels; total time is what your bill reads.
- Cancellation becomes free. The user closes the tab; the fetch aborts; the provider call ends; you stop paying for tokens nobody will ever see.
- Tool calls become interactive. You can render the "calling tool X" state while the call is in flight, which buys you another 2–10 s of patience.
The 2026 stack at a glance.
For a TypeScript web app, you need four packages and one runtime decision. That is the whole stack.
| Layer | Pick | Why |
|---|---|---|
| Client hook | @ai-sdk/react — useChat / useCompletion |
Transport-based, typed messages, parts array, no internal input state to fight. |
| Server | ai — streamText |
Returns a stream that maps cleanly to an SSE response with toUIMessageStreamResponse(). |
| Provider | @ai-sdk/anthropic or @ai-sdk/openai (or both) |
Drop-in, swap with one line, no rewriting your route. |
| Validation | zod for tool input schemas |
The AI SDK speaks zod natively; tool calls fail loudly when the model invents fields. |
| Runtime | Edge for chat, Node for uploads | See Edge or Node. |
If you are reading old tutorials, ignore anything that wires useChat with an api: "/api/chat" string and handleInputChange. That is AI SDK 4. The 5.x rewrite uses a transport instance and you manage input state yourself. Most copy-paste pain comes from mixing the two.
The server route in 20 lines.
A Next.js App Router route handler that streams Anthropic Sonnet to the client. Drop-in.
import { anthropic } from '@ai-sdk/anthropic'; import { streamText, convertToModelMessages } from 'ai'; // Edge runtime: faster cold start, longer streaming window. export const runtime = 'edge'; export const maxDuration = 60; export async function POST(req: Request) { const { messages } = await req.json(); const result = await streamText({ model: anthropic('claude-sonnet-4-5'), system: 'You are a concise assistant. Cite sources when you have them.', messages: convertToModelMessages(messages), abortSignal: req.signal, // stop billing on client disconnect }); return result.toUIMessageStreamResponse(); }
Three lines do the heavy lifting: convertToModelMessages turns the typed UI message shape into something the provider understands; abortSignal: req.signal wires browser disconnects through to the provider so you stop being billed; toUIMessageStreamResponse() writes the SSE stream the client expects.
The client hook in 15 lines.
Same React component you'd write for any form, except messages is the canonical chat state and sendMessage kicks off the stream.
'use client'; import { useState } from 'react'; import { useChat } from '@ai-sdk/react'; import { DefaultChatTransport } from 'ai'; export default function Chat() { const { messages, sendMessage, status, stop } = useChat({ transport: new DefaultChatTransport({ api: '/api/chat' }), }); const [input, setInput] = useState(''); return ( <form onSubmit={(e) => { e.preventDefault(); if (input.trim()) { sendMessage({ text: input }); setInput(''); } }}> {messages.map((m) => ( <div key={m.id}> <strong>{m.role}:</strong> {m.parts.map((p, i) => p.type === 'text' ? <span key={i}>{p.text}</span> : null )} </div> ))} <input value={input} onChange={(e) => setInput(e.target.value)} /> {status === 'streaming' ? <button type="button" onClick={stop}>Stop</button> : <button type="submit">Send</button>} </form> ); }
Two things to notice. First, m.parts — messages in AI SDK 5 are not strings, they are arrays of typed parts. Text, tool calls, reasoning, files. You render each part for what it is, which is what makes the next chapter tractable.
Second, stop() is wired to the button when status === 'streaming'. Combined with abortSignal: req.signal on the server, that's end-to-end cancellation. Most production chat apps still ship without this and pay for it monthly.
Tool calls without footguns.
Tools turn a chat endpoint into an agent. They also turn one careless route into your most expensive bug. The discipline: every tool has a zod input schema, every tool runs behind the same auth as the rest of your app, and the whole loop has a hard step cap.
import { anthropic } from '@ai-sdk/anthropic'; import { streamText, tool, stepCountIs, convertToModelMessages } from 'ai'; import { z } from 'zod'; import { getUser, listOrdersFor } from '@/server/db'; export const runtime = 'edge'; export const maxDuration = 60; export async function POST(req: Request) { const user = await getUser(req); // same auth as every other route if (!user) return new Response('Unauthorized', { status: 401 }); const { messages } = await req.json(); const result = await streamText({ model: anthropic('claude-sonnet-4-5'), messages: convertToModelMessages(messages), abortSignal: req.signal, stopWhen: stepCountIs(5), // hard cap on tool-call loops tools: { listMyOrders: tool({ description: 'List the signed-in user\u2019s recent orders.', inputSchema: z.object({ limit: z.number().int().min(1).max(20).default(5), }), execute: async ({ limit }) => { // Scope by user.id. The model never sees other users\u2019 data. return listOrdersFor(user.id, limit); }, }), }, }); return result.toUIMessageStreamResponse(); }
Three rules and you'll avoid 90% of agent-shaped incidents:
- Auth the route, not the tool. The user object comes from your normal session handling, and every tool's
executecloses over it. The model can never escalate beyond what that user can already do. - Validate every input with zod. Models invent fields, send strings where numbers go, miss requireds. Zod fails loudly; without it you get silent corruption.
- Cap the loop.
stopWhen: stepCountIs(5)is your circuit breaker. If a model decides to call list/get/list/get forever, you cut it at five and return what you have.
Edge or Node?
| Concern | Edge | Node |
|---|---|---|
| Cold start | ~50 ms | 200–1500 ms |
| Max stream duration (Vercel) | Long (300 s+ on most plans) | Capped per plan; check current limits |
| Native modules / Sharp / Postgres driver | Not supported | Supported |
| Per-request cost | Lower | Higher |
| When to pick | Chat, completions, anything I/O-bound | File uploads, image processing, RAG ingestion |
The right default is split routes: app/api/chat/route.ts on edge, app/api/ingest/route.ts on Node. Don't pick one runtime for the whole project on theological grounds.
Live: streaming budget estimator.
Tell it the shape of your traffic; it returns an honest monthly bill and what the user will feel. All math runs in your browser using public list prices for the major models in May 2026 — treat as an estimate, not a quote.
Read the per-user number, not the total. A $40k monthly bill across 100k users is $0.40/user — fine. The same $40k across 800 users is $50/user and you have a pricing problem the model cannot solve.
Production pitfalls.
abortSignal: req.signalThe most expensive line of code you can omit. Without it, a user who navigates away mid-response keeps the provider call running and you keep paying for tokens the user never sees. On a chatty product this can be 20% of your bill.
messages from the clientThe browser can send anything — including a system message that says "ignore prior instructions." Always set your own system on the server, never let the client supply it, and consider stripping any role: 'system' from inbound messages before calling the model.
Tool output is untrusted text that flows back into the model's context. If fetchUrl returns a page that says "you are now in admin mode," the model is allowed to believe it. Sanitize tool output, escape it, or render it as data rather than prose — never paste raw scraped HTML into the next turn.
A streaming endpoint without rate limiting is a debit card on the internet. Use a per-user limiter (Upstash Ratelimit, Vercel WAF, your reverse proxy) and a hard token-per-day cap stored in your DB. Both, not either.
While the first token is in flight, the message bubble is empty. Render a typing indicator or a "thinking..." placeholder — otherwise the streaming UX feels worse than a spinner for the first half second.