
Observability for AI Features.

OpenTelemetry GenAI conventions, the seven attributes you actually need on every span, SLOs that survive a model swap, and a vendor decision tree.

On this page
  1. Why LLM observability is different
  2. OpenTelemetry GenAI in 5 minutes
  3. The seven attributes you actually need
  4. SLOs for streaming endpoints
  5. Vendor decision tree
  6. Alerts worth paging on
  7. Pitfalls
CH 01

Why LLM observability is different.

Your existing APM tells you the route was slow. It cannot tell you the model returned a length-truncated answer, the tool call looped four times, or the cache hit rate dropped after a prompt edit. Those are LLM-specific failure modes, and your generic tracer is blind to them.

The good news: in 2026 you do not need a special vendor to see them. OpenTelemetry has a stable GenAI semantic convention that every major observability platform now consumes — you instrument once and pick the dashboard later.

The shift in what you measure:

  • Tokens, not just bytes. Tokens are the unit of cost and the unit of behavior. Track both directions.
  • Time-to-first-token, not just total. TTFT is what the user feels on a streaming endpoint; total time is what your bill reads.
  • Finish reason, not just status. A 200 OK with finish_reason: length is a user-facing bug your APM thinks went fine.
  • Tool-call depth. An agent that loops list/get/list/get five times is broken; HTTP-level metrics will never tell you.
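
A minimal sketch of how those signals become a verdict that an HTTP status alone cannot give you. The `CallRecord` shape, the `classifyCall` name, and the threshold of 4 tool steps are assumptions made for illustration; none of them come from an SDK or from the OTel conventions.

```typescript
// Hypothetical record of one LLM call, assembled from span attributes.
type FinishReason = 'stop' | 'length' | 'tool_calls' | 'content_filter' | 'error';

interface CallRecord {
  httpStatus: number;
  finishReason: FinishReason;
  toolCallSteps: number; // tool invocations made while serving the request
}

type Verdict = 'ok' | 'truncated' | 'tool_loop' | 'blocked' | 'failed';

// Assumed threshold: more than this many tool steps is treated as a loop.
const MAX_TOOL_STEPS = 4;

function classifyCall(call: CallRecord): Verdict {
  if (call.httpStatus >= 400 || call.finishReason === 'error') return 'failed';
  if (call.toolCallSteps > MAX_TOOL_STEPS) return 'tool_loop';
  if (call.finishReason === 'length') return 'truncated';
  if (call.finishReason === 'content_filter') return 'blocked';
  return 'ok';
}

// A 200 OK that a generic APM counts as a success, although the answer was cut off.
const verdict = classifyCall({ httpStatus: 200, finishReason: 'length', toolCallSteps: 0 });
console.log(verdict); // prints "truncated"
```

The order of the checks is a design choice: a runaway tool loop is reported before truncation because it usually explains the truncation.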
CH 02

OpenTelemetry GenAI in 5 minutes.

The spec lives under gen_ai.*. Span names are operation-shaped (chat, execute_tool, invoke_agent); attributes carry the metadata you actually want to query.

If your instrumentation is pinned to GenAI conventions v1.36 or earlier, the spec is explicit: it should keep emitting that version by default. Opt in to the newer conventions with an env var:

Opt into the latest GenAI conventions
export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
Manual instrumentation of an AI SDK call (TypeScript)
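import { trace, SpanStatusCode } from '@opentelemetry/api';
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const tracer = trace.getTracer('chat-route');

export async function POST(req: Request) {
  const { messages } = await req.json();

  // streamText returns immediately and the body streams after this handler
  // returns, so the span is ended from the stream callbacks, not a finally block.
  return tracer.startActiveSpan('chat', (span) => {
    span.setAttributes({
      'gen_ai.operation.name': 'chat',
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-sonnet-4-5',
    });

    const t0 = performance.now();
    let ended = false;

    // End the span exactly once, whichever callback fires first.
    const endSpan = () => {
      if (ended) return;
      ended = true;
      span.setAttribute('duration_ms', performance.now() - t0);
      span.end();
    };

    const fail = (e: unknown) => {
      const err = e instanceof Error ? e : new Error(String(e));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      endSpan();
    };

    // A client disconnect aborts the stream; close the span in that case too.
    req.signal.addEventListener('abort', endSpan, { once: true });

    try {
      const result = streamText({
        model: anthropic('claude-sonnet-4-5'),
        messages,
        abortSignal: req.signal,
        onFinish: ({ usage, finishReason, response }) => {
          span.setAttributes({
            'gen_ai.response.model': response.modelId,
            // AI SDK 5 names these inputTokens / outputTokens.
            'gen_ai.usage.input_tokens': usage.inputTokens,
            'gen_ai.usage.output_tokens': usage.outputTokens,
            'gen_ai.response.finish_reasons': [finishReason],
          });
          endSpan();
        },
        // Errors raised mid-stream arrive here, not in the catch below.
        onError: ({ error }) => fail(error),
      });

      return result.toUIMessageStreamResponse();
    } catch (e) {
      // Only synchronous setup failures reach this branch.
      fail(e);
      throw e;
    }
  });
}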

Auto-instrumentation exists for most provider SDKs (@traceloop/instrumentation-anthropic, OpenAI's own OTel exporter, Vercel's experimental telemetry option). Use it when it exists and fall back to manual instrumentation when it doesn't. The attribute names are the same either way, so the dashboard speaks the same language.

CH 03

The seven attributes you actually need.

| Attribute | Why |
| --- | --- |
| gen_ai.system | "anthropic", "openai", "google". Filter by provider; spot one going down. (Newer versions of the conventions rename this to gen_ai.provider.name.) |
| gen_ai.request.model | What you asked for. |
| gen_ai.response.model | What you actually got. Providers silently roll versions; this is how you notice. |
| gen_ai.usage.input_tokens | The cost lever you control most. |
| gen_ai.usage.output_tokens | The latency lever you control least. |
| gen_ai.response.finish_reasons | stop = good. length = truncation. tool_calls = mid-loop. content_filter = policy hit. |
| gen_ai.operation.name | "chat", "execute_tool", "invoke_agent". Lets you group by span shape. |

Add these custom attributes alongside the standard ones (the conventions explicitly allow it, and you can migrate to a standardized version later if the spec catches up):

  • gen_ai.cost.total_usd — computed at log time from token counts and current prices. Saves you from joining against a price table at query time.
  • gen_ai.ttft_ms — time-to-first-token, the user-facing latency metric. Measure between request received and first stream chunk emitted.
  • gen_ai.user.tenant_id — the tenant or workspace the call is for. Lets you spot a noisy tenant or charge them back accurately.
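
As a rough sketch, the three custom attributes above can be computed in one place at log time. The `PRICES` table, the `example-model` entry with its per-million-token prices, and the helper names are all hypothetical placeholders; real prices have to come from your own configuration.

```typescript
// Hypothetical price table in USD per million tokens. The numbers are
// placeholders, not real provider prices.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PRICES: Record<string, ModelPrice> = {
  'example-model': { inputPerMTok: 3, outputPerMTok: 15 },
};

function costUsd(model: string, inputTokens: number, outputTokens: number): number | undefined {
  const price = PRICES[model];
  if (!price) return undefined; // unknown model: leave the attribute unset rather than guess
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
}

// Builds the three custom attributes for one call.
function customAttributes(args: {
  model: string;
  inputTokens: number;
  outputTokens: number;
  requestReceivedAt: number; // ms timestamp, e.g. from performance.now()
  firstChunkAt: number; // ms timestamp of the first stream chunk
  tenantId: string;
}): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    'gen_ai.ttft_ms': args.firstChunkAt - args.requestReceivedAt,
    'gen_ai.user.tenant_id': args.tenantId,
  };
  const cost = costUsd(args.model, args.inputTokens, args.outputTokens);
  if (cost !== undefined) attrs['gen_ai.cost.total_usd'] = cost;
  return attrs;
}
```

The returned object can be passed straight to span.setAttributes(...). Computing the cost once, at write time, also freezes the price that applied on the day of the call.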
CH 04

SLOs for streaming endpoints.

The mistake everyone makes the first time: setting an SLO on total response time. It varies legitimately with prompt complexity, gets worse the day someone enables long-context, and tells you nothing actionable. Use two SLIs instead.

| SLI | Definition | Healthy target (chat) |
| --- | --- | --- |
| TTFT | P95 of gen_ai.ttft_ms over a rolling window. | < 800 ms |
| Stream success | % of streams that finish with finish_reason: stop (or tool_calls for agents) and no exception. | ≥ 99.5% over 28 d |

That gives you a real error budget. A 99.5% SLO over 28 days allows 0.5% of 672 hours, or about 3.4 hours of bad time per window (3.6 hours if you use a 30-day month). Burn through more than half of it in the first week and you have an alert; burn it all and you stop shipping risky changes until the window resets. Standard SRE math, applied to LLM features.
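
A small sketch of the two SLIs and the budget arithmetic, computed over an in-memory list of stream records. The `StreamRecord` shape and function names are assumptions, and the percentile uses the nearest-rank method; a real backend will run its own percentile implementation at query time.

```typescript
// Hypothetical per-stream record derived from span attributes.
interface StreamRecord {
  ttftMs: number;
  finishReason: string;
  hadException: boolean;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of streams that ended with an accepted finish reason and no exception.
// Pass ['stop', 'tool_calls'] as okReasons for agent endpoints.
function streamSuccessRate(records: StreamRecord[], okReasons: string[] = ['stop']): number {
  if (records.length === 0) return 1;
  const ok = records.filter((r) => !r.hadException && okReasons.includes(r.finishReason));
  return ok.length / records.length;
}

// Time-based error budget: the part of the window that is allowed to be bad.
function errorBudgetHours(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24;
}

console.log(errorBudgetHours(0.995, 28).toFixed(2)); // prints "3.36"
console.log(errorBudgetHours(0.995, 30).toFixed(2)); // prints "3.60"
```

The window length matters: the same 99.5% target yields about 3.4 hours over 28 days and 3.6 hours over 30, so state the window next to the target.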

CH 05

Vendor decision tree.

| Pick | When | Watch out for |
| --- | --- | --- |
| Langfuse | Open source, self-host option, strong prompt management and evals. Best fit if you want the codebase auditable or run on-prem. | Self-hosting Postgres + ClickHouse is not free; budget for ops time. |
| Helicone | Zero-config gateway capture: you proxy provider calls through their endpoint, they capture everything. Fastest path to "we have visibility." | Adds a hop in your hot path. Edge users feel it. |
| Datadog LLM Observability | Already on Datadog. Native OTel GenAI mapping (since v1.37 of the spec) means GenAI spans sit next to your APM traces. | Datadog pricing, but you knew that. |
| New Relic / Honeycomb / Grafana Cloud | You're already there and they consume OTel cleanly. | The dashboards aren't LLM-specific but the data is queryable. You may build your own LLM views. |
| Roll your own | You have a working observability team and budget for a small in-house surface. | Eval and prompt management are real work. Don't underestimate. |

"Let's pick the vendor first"

Backwards. Instrument with OTel GenAI conventions first; the spans are portable and you can swap the backend in an afternoon. Picking the vendor before you've instrumented means you'll write vendor-specific code you'll have to delete when you migrate.

CH 06

Alerts worth paging on.

| Alert | Condition | Severity |
| --- | --- | --- |
| Provider error rate | > 1% over 5 min, grouped by gen_ai.system | Page |
| TTFT SLO burn | > 50% of monthly budget consumed in < 7 d | Page |
| Cost anomaly | Daily spend > 2× rolling 7-day median | Page |
| Truncation spike | finish_reason: length share > 5% over 1 h | Ticket |
| Tool-call depth runaway | Avg tool-call steps per request > 4 over 1 h | Ticket |
| Model silently changed | gen_ai.response.model differs from gen_ai.request.model on > 5% of calls | Ticket |
| Per-tenant burst | Single tenant_id > 10× its 7-day median tokens in 5 min | Ticket |
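
Two of these conditions sketched as plain functions, to make the thresholds concrete. The function names and input shapes are assumptions; in practice the same logic would live in your backend's alert query language. Note that the model check below compares strings exactly, so if a provider answers an alias with a dated snapshot ID you would need to normalize the names first.

```typescript
function median(values: number[]): number {
  if (values.length === 0) throw new Error('no samples');
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Cost anomaly: today's spend is more than 2x the rolling 7-day median.
function isCostAnomaly(todayUsd: number, last7DaysUsd: number[]): boolean {
  return todayUsd > 2 * median(last7DaysUsd);
}

// Model silently changed: the response model differs from the requested
// model on more than 5% of calls in the evaluated window.
function isModelDrift(calls: { requestModel: string; responseModel: string }[]): boolean {
  if (calls.length === 0) return false;
  const mismatched = calls.filter((c) => c.responseModel !== c.requestModel).length;
  return mismatched / calls.length > 0.05;
}

const lastWeek = [10, 12, 11, 9, 10, 13, 10]; // median is 10
console.log(isCostAnomaly(21, lastWeek)); // prints true
console.log(isCostAnomaly(20, lastWeek)); // prints false (not strictly above 2x)
```

A median is used rather than a mean so that one earlier spike does not raise the baseline and hide the next one.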
PITFALLS

Pitfalls.

Logging 100% of prompts and outputs

Two ways this hurts: storage cost (LLM payloads are big) and privacy (anything the user typed is now in your logs). Sample — 1–5% in production is plenty for debugging; 100% in staging is fine. Metadata stays at 100%; only the content gets sampled.
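
One way to sample content while keeping metadata at 100% is to decide per trace, deterministically, so every span in a trace makes the same choice. This is a sketch under assumptions: the FNV-1a hash is just one convenient deterministic hash, and the `app.llm.prompt` and `app.llm.completion` attribute names are hypothetical placeholders, not names from the conventions.

```typescript
type Attrs = Record<string, string | number>;

// FNV-1a 32-bit hash over the string's UTF-16 code units.
// Good enough for hex trace IDs; not a cryptographic hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps the hash to [0, 1) and compares it with the sample rate.
function shouldCaptureContent(traceId: string, sampleRate: number): boolean {
  return fnv1a(traceId) / 0x100000000 < sampleRate;
}

// Metadata is always kept; prompt and completion only for sampled traces.
function buildSpanAttributes(
  traceId: string,
  metadata: Attrs,
  content: { prompt: string; completion: string },
  sampleRate: number,
): Attrs {
  const attrs: Attrs = { ...metadata };
  if (shouldCaptureContent(traceId, sampleRate)) {
    attrs['app.llm.prompt'] = content.prompt; // hypothetical attribute name
    attrs['app.llm.completion'] = content.completion; // hypothetical attribute name
  }
  return attrs;
}
```

With a rate of 0.02 in production and 1 in staging, the same code path serves both environments; only the configuration differs.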

Vendor SDK lock-in

If your instrumentation code is helicone.log(...) instead of span.setAttribute('gen_ai.usage.input_tokens', ...), switching backends is a rewrite. OTel GenAI conventions are how you keep the option open.

Alerting on total response time

It legitimately varies with prompt size. You will tune the threshold for weeks and never get useful pages. Alert on TTFT (user-facing) and success rate (functional); leave total time as a graph, not a pager.

No per-tenant breakdown

One bad tenant can move your aggregate cost or latency enough to obscure a separate, real regression. Always carry a tenant attribute on the span; always group dashboards by it.
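
To illustrate what the tenant attribute buys you, here is a sketch that groups call costs by tenant and flags any tenant responsible for an outsized share of spend. The `CallCost` shape, the function names, and the 50% default threshold are assumptions chosen for the example.

```typescript
// Hypothetical per-call record: the tenant attribute plus the computed cost.
interface CallCost {
  tenantId: string;
  costUsd: number;
}

function costByTenant(calls: CallCost[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const c of calls) {
    totals.set(c.tenantId, (totals.get(c.tenantId) ?? 0) + c.costUsd);
  }
  return totals;
}

// Tenants whose share of total spend exceeds shareThreshold (assumed default: 50%).
function dominantTenants(calls: CallCost[], shareThreshold = 0.5): string[] {
  const totals = costByTenant(calls);
  let grandTotal = 0;
  totals.forEach((v) => {
    grandTotal += v;
  });
  const result: string[] = [];
  if (grandTotal === 0) return result;
  totals.forEach((v, tenant) => {
    if (v / grandTotal > shareThreshold) result.push(tenant);
  });
  return result;
}

const calls: CallCost[] = [
  { tenantId: 'a', costUsd: 1 },
  { tenantId: 'a', costUsd: 8 },
  { tenantId: 'b', costUsd: 1 },
];
console.log(dominantTenants(calls)); // prints [ 'a' ]
```

Excluding the flagged tenants and re-checking the aggregate is a quick way to tell a noisy tenant apart from a real regression.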

What to read next.

  • Guide 11: Streaming AI in Web Apps. The routes you're now tracing, built on the same AI SDK 5 patterns.
  • Guide 12: Securing AI-Generated Code. Observability is half the answer to the cost-anomaly and abuse alerts here.
  • Guide 02: Stop Burning Tokens. The cost-attribute math from the other angle.
Changelog
  • 2026-05-26: Initial publish.