Why LLM observability is different.
Your existing APM tells you the route was slow. It cannot tell you the model returned a length-truncated answer, the tool call looped four times, or the cache hit rate dropped after a prompt edit. Those are LLM-specific failure modes, and your generic tracer is blind to them.
The good news: in 2026 you do not need a special vendor to see them. OpenTelemetry has a stable GenAI semantic convention that every major observability platform now consumes — you instrument once and pick the dashboard later.
The shift in what you measure:
- Tokens, not just bytes. Tokens are the unit of cost and the unit of behavior. Track both directions.
- Time-to-first-token, not just total. TTFT is what the user feels on a streaming endpoint; total time is what your bill reads.
- Finish reason, not just status. A 200 OK with `finish_reason: length` is a user-facing bug your APM thinks went fine.
- Tool-call depth. An agent that loops list/get/list/get five times is broken; HTTP-level metrics will never tell you.
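As a rough illustration of the last two bullets, the sketch below classifies a finished request using signals an HTTP-level tracer never sees. The `LlmCallRecord` shape, the `classifyCall` helper, and the `maxSameTool` threshold are all hypothetical names chosen for this example, not part of any SDK or of the OTel conventions.

```typescript
// Hypothetical record shape: one entry per completed LLM request.
interface LlmCallRecord {
  httpStatus: number;
  finishReason: string; // e.g. "stop", "length", "tool_calls"
  toolCalls: string[];  // tool names in call order
}

type Verdict = "ok" | "http_error" | "truncated" | "tool_loop";

// Highest number of times any single tool name appears in the request.
function maxCallsToOneTool(toolCalls: string[]): number {
  // Prototype-free object so tool names like "constructor" count correctly.
  const counts: Record<string, number> = Object.create(null);
  let max = 0;
  for (const name of toolCalls) {
    counts[name] = (counts[name] ?? 0) + 1;
    if (counts[name] > max) max = counts[name];
  }
  return max;
}

// maxSameTool is an illustrative threshold, not a recommended value.
function classifyCall(rec: LlmCallRecord, maxSameTool = 3): Verdict {
  if (rec.httpStatus >= 400) return "http_error";
  if (rec.finishReason === "length") return "truncated";
  if (maxCallsToOneTool(rec.toolCalls) > maxSameTool) return "tool_loop";
  return "ok";
}

// A 200 OK that a generic tracer would count as healthy.
const verdict = classifyCall({ httpStatus: 200, finishReason: "length", toolCalls: [] });
console.log(verdict); // "truncated"
```

Counting repeats per tool name is a deliberately crude loop heuristic; a real agent may legitimately call the same tool several times, so treat the threshold as something to tune.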
OpenTelemetry GenAI in 5 minutes.
The spec lives under `gen_ai.*`. Span names are operation-shaped (`chat`, `execute_tool`, `invoke_agent`); attributes carry the metadata you actually want to query.
If you're on an instrumentation pinned to GenAI v1.36 or earlier, the spec is explicit: don't change the default version. Opt in to newer conventions with an env var:
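```shell
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
```

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const tracer = trace.getTracer('chat-route');

export async function POST(req: Request) {
  const { messages } = await req.json();

  return tracer.startActiveSpan('chat', async (span) => {
    span.setAttributes({
      'gen_ai.operation.name': 'chat',
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-sonnet-4-5',
    });
    const t0 = performance.now();

    // The stream outlives this handler, so the span is ended from the stream
    // callbacks (exactly once), not in a `finally` that runs before any tokens flow.
    let ended = false;
    const endSpan = (error?: unknown) => {
      if (ended) return;
      ended = true;
      if (error !== undefined) {
        const err = error instanceof Error ? error : new Error(String(error));
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      }
      span.setAttribute('duration_ms', performance.now() - t0);
      span.end();
    };

    try {
      const result = streamText({
        model: anthropic('claude-sonnet-4-5'),
        messages,
        abortSignal: req.signal,
        onFinish: ({ usage, finishReason, response }) => {
          span.setAttributes({
            'gen_ai.response.model': response.modelId,
            // AI SDK v5 field names; v4 used promptTokens / completionTokens.
            'gen_ai.usage.input_tokens': usage.inputTokens,
            'gen_ai.usage.output_tokens': usage.outputTokens,
            'gen_ai.response.finish_reasons': [finishReason],
          });
          endSpan();
        },
        // Errors raised mid-stream arrive here, not in the catch below.
        onError: ({ error }) => endSpan(error),
      });
      return result.toUIMessageStreamResponse();
    } catch (e) {
      // Only synchronous setup failures reach this branch.
      endSpan(e);
      throw e;
    }
  });
}
```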
Auto-instrumentation exists for most provider SDKs (@traceloop/instrumentation-anthropic, OpenAI's own OTel exporter, Vercel's experimental telemetry option). Use it when it exists and fall back to manual spans when it doesn't. The attribute names are the same either way; the dashboard speaks the same language.
The seven attributes you actually need.
| Attribute | Why |
|---|---|
| `gen_ai.system` | "anthropic", "openai", "google". Filter by provider; spot one going down. |
| `gen_ai.request.model` | What you asked for. |
| `gen_ai.response.model` | What you actually got. Providers silently roll versions; this is how you notice. |
| `gen_ai.usage.input_tokens` | The cost lever you control most. |
| `gen_ai.usage.output_tokens` | The latency lever you control least. |
| `gen_ai.response.finish_reasons` | `stop` = good. `length` = truncation. `tool_calls` = mid-loop. `content_filter` = policy hit. |
| `gen_ai.operation.name` | "chat", "execute_tool", "invoke_agent". Lets you group by span shape. |
Add these as custom attributes alongside the standard ones (the conventions explicitly allow it; you can migrate to a standardized version later if the spec catches up):
- `gen_ai.cost.total_usd`: computed at log time from token counts and current prices. Saves you from joining against a price table at query time.
- `gen_ai.ttft_ms`: time-to-first-token, the user-facing latency metric. Measure between request received and first stream chunk emitted.
- `gen_ai.user.tenant_id`: the tenant or workspace the call is for. Lets you spot a noisy tenant or charge them back accurately.
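One way the three custom attributes could be computed before they are set on the span is sketched below. The price table holds placeholder numbers (not real prices), and `costUsd`, `createTtftTimer`, and `customAttributes` are hypothetical helpers invented for this example.

```typescript
// USD per million tokens. Placeholder numbers, not real prices.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PRICE_TABLE: Record<string, ModelPrice> = {
  "example-model": { inputPerMTok: 3, outputPerMTok: 15 },
};

// Cost at log time; undefined for an unknown model rather than a guessed price.
function costUsd(model: string, inputTokens: number, outputTokens: number): number | undefined {
  const price = PRICE_TABLE[model];
  if (price === undefined) return undefined;
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

// Records the delay between request receipt and the first stream chunk.
function createTtftTimer(requestReceivedMs: number) {
  let ttftMs: number | undefined;
  return {
    // Call on every chunk; only the first call records a value.
    onChunk(nowMs: number): void {
      if (ttftMs === undefined) ttftMs = nowMs - requestReceivedMs;
    },
    value(): number | undefined {
      return ttftMs;
    },
  };
}

// Builds the custom attribute bag, skipping values that could not be computed.
function customAttributes(
  tenantId: string,
  cost: number | undefined,
  ttftMs: number | undefined,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = { "gen_ai.user.tenant_id": tenantId };
  if (cost !== undefined) attrs["gen_ai.cost.total_usd"] = cost;
  if (ttftMs !== undefined) attrs["gen_ai.ttft_ms"] = ttftMs;
  return attrs;
}
```

In a streaming handler, `onChunk` would be called from whatever per-chunk hook your SDK offers, and the resulting bag passed to `span.setAttributes(...)` before the span ends.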
SLOs for streaming endpoints.
The mistake everyone makes the first time: setting the SLO on total response time. It varies legitimately with prompt complexity, gets worse the day someone enables long-context, and tells you nothing actionable. Use two SLIs instead.
| SLI | Definition | Healthy target (chat) |
|---|---|---|
| TTFT | P95 of `gen_ai.ttft_ms` over a rolling window. | < 800 ms |
| Stream success | % of streams that finish with `finish_reason: stop` (or `tool_calls` for agents) and no exception. | ≥ 99.5% over 28 d |
That gives you a real error budget. A 99.5% SLO over 28 days allows 0.5% of the window to be bad: about 3.4 hours (0.005 × 672 h), assuming steady traffic. Burn through more than half of it in under a week and you have an alert; burn it all and you stop shipping risky changes until the budget recovers. Standard SRE math, applied to LLM features.
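The two SLIs and the budget arithmetic can be sketched as plain functions over exported span data. This is a minimal sketch: the `StreamOutcome` shape and all function names are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from what your backend computes.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-stream summary derived from a finished span.
interface StreamOutcome {
  finishReason: string;
  threw: boolean;
}

// Share of streams that ended with an accepted finish reason and no exception.
function streamSuccessRate(outcomes: StreamOutcome[], okReasons: string[] = ["stop", "tool_calls"]): number {
  if (outcomes.length === 0) return 1;
  const good = outcomes.filter((o) => !o.threw && okReasons.indexOf(o.finishReason) !== -1).length;
  return good / outcomes.length;
}

// Time-based view of the budget; assumes traffic is spread evenly over the window.
function errorBudgetHours(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24;
}

// Request-based view: fraction of the allowed bad streams already used.
function budgetBurnedFraction(badStreams: number, expectedStreamsInWindow: number, sloTarget: number): number {
  return badStreams / (expectedStreamsInWindow * (1 - sloTarget));
}

console.log(errorBudgetHours(0.995, 28).toFixed(2)); // "3.36"
```

For a success-rate SLI the request-based view is the more faithful one: with 10,000 expected streams at 99.5%, the budget is 50 bad streams, so 30 failures means roughly 60% burned.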
Vendor decision tree.
| Pick | When | Watch out for |
|---|---|---|
| Langfuse | Open source, self-host option, strong prompt management and evals. Best fit if you want the codebase auditable or run on-prem. | Self-hosting Postgres + ClickHouse is not free; budget for ops time. |
| Helicone | Zero-config gateway capture — you proxy provider calls through their endpoint, they capture everything. Fastest path to "we have visibility." | Adds a hop in your hot path. Edge users feel it. |
| Datadog LLM Observability | Already on Datadog. Native OTel GenAI mapping (since v1.37 of the spec) means GenAI spans sit next to your APM traces. | Datadog pricing, but you knew that. |
| New Relic / Honeycomb / Grafana Cloud | You're already there and they consume OTel cleanly. The dashboards aren't LLM-specific but the data is queryable. | You may build your own LLM views. |
| Roll your own | You have a working observability team and budget for a small in-house surface. | Eval and prompt management are real work. Don't underestimate. |
Picking the vendor first is backwards. Instrument with OTel GenAI conventions first; the spans are portable and you can swap the backend in an afternoon. Pick the vendor before you've instrumented and you'll write vendor-specific code you'll have to delete when you migrate.
Alerts worth paging on.
| Alert | Condition | Severity |
|---|---|---|
| Provider error rate | > 1% over 5 min, grouped by `gen_ai.system` | Page |
| TTFT SLO burn | > 50% of monthly budget consumed in < 7 d | Page |
| Cost anomaly | Daily spend > 2× rolling 7-day median | Page |
| Truncation spike | `finish_reason: length` share > 5% over 1 h | Ticket |
| Tool-call depth runaway | Avg tool-call steps per request > 4 over 1 h | Ticket |
| Model silently changed | `gen_ai.response.model` differs from `gen_ai.request.model` on > 5% of calls | Ticket |
| Per-tenant burst | Single `tenant_id` > 10× its 7-day median tokens in 5 min | Ticket |
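Most backends let you express these conditions in their own query language, but the logic is simple enough to sketch directly. The example below covers the cost-anomaly and truncation-spike rows; the function names are hypothetical, and the default thresholds are the ones from the table.

```typescript
// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Cost anomaly: today's spend is strictly more than `factor` times the trailing median.
function isCostAnomaly(todayUsd: number, previousDaysUsd: number[], factor = 2): boolean {
  return todayUsd > factor * median(previousDaysUsd);
}

// Share of requests in the window that ended with finish reason "length".
function truncationShare(finishReasons: string[]): number {
  if (finishReasons.length === 0) return 0;
  const truncated = finishReasons.filter((r) => r === "length").length;
  return truncated / finishReasons.length;
}

// Truncation spike: more than `maxShare` of requests in the window were truncated.
function isTruncationSpike(finishReasons: string[], maxShare = 0.05): boolean {
  return truncationShare(finishReasons) > maxShare;
}
```

The median is used instead of the mean so that a single expensive day in the trailing week does not raise the baseline and mask the next one. The same `median` helper applies to the per-tenant burst row, fed with that tenant's own 5-minute token counts.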
Pitfalls.
Logging full prompts and completions on every request hurts two ways: storage cost (LLM payloads are big) and privacy (anything the user typed is now in your logs). Sample the content: 1–5% in production is plenty for debugging; 100% in staging is fine. Metadata stays at 100%; only the content gets sampled.
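A minimal sketch of that split, assuming you decide per trace whether to attach content: hashing the trace ID keeps the decision deterministic, so every span in a sampled trace carries its payload. The helper names are hypothetical, and the content attribute keys (`gen_ai.input.messages`, `gen_ai.output.messages`) are an assumption about the convention version you emit; check them against your instrumentation.

```typescript
// FNV-1a 32-bit hash over the string's character codes.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Deterministic per-trace decision: the same trace ID always gets the same answer.
function shouldCaptureContent(traceId: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a(traceId) / 0x100000000 < rate;
}

// Metadata is always recorded; prompt and completion text only when sampled.
function spanAttributes(
  metadata: Record<string, string | number>,
  content: { input: string; output: string },
  capture: boolean,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = { ...metadata };
  if (capture) {
    // Assumed attribute keys; verify against the GenAI convention version in use.
    attrs["gen_ai.input.messages"] = content.input;
    attrs["gen_ai.output.messages"] = content.output;
  }
  return attrs;
}
```

A call site might use `shouldCaptureContent(traceId, 0.05)` in production and a rate of `1` in staging, with the rate read from configuration.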
If your instrumentation code is helicone.log(...) instead of span.setAttribute('gen_ai.usage.input_tokens', ...), switching backends is a rewrite. OTel GenAI conventions are how you keep the option open.
Total response time legitimately varies with prompt size. Page on it and you will tune the threshold for weeks and never get useful pages. Alert on TTFT (user-facing) and success rate (functional); leave total time as a graph, not a pager.
One bad tenant can move your aggregate cost or latency enough to obscure a separate, real regression. Always carry a tenant attribute on the span; always group dashboards by it.