What's the right standard to instrument LLM calls in 2026?

OpenTelemetry GenAI semantic conventions. They define a gen_ai.* namespace covering model names, token counts, finish reasons, tool calls, and operation types. All major LLM-observability vendors (Datadog, Langfuse, Helicone, New Relic) consume them, so you can instrument once and switch vendors later. Existing instrumentations that emit v1.36.0 or earlier should keep that version as their default; opt into newer conventions explicitly via OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.

Should I log prompts and outputs?

Yes, sampled. Logging 100% of prompts and outputs is too expensive in storage and a privacy minefield. Log a sampled percentage (1–5% in production, 100% in staging), redact PII before storage, and always log the metadata (model, tokens, latency, finish reason) at 100%. OTel GenAI conventions disable content capture by default for exactly this reason; turn it on with a sampling decision attached.

How do I set SLOs on a streaming endpoint?

Two SLIs: time-to-first-token (the user-perceived latency) and end-to-end success rate (did the stream complete without error, including timeouts). Don't try to SLO total response time — that varies legitimately with prompt complexity. For a chat product, healthy targets are p95 TTFT < 800ms and 99.5% successful streams over a 28-day window. Tighten or loosen based on your traffic.

Langfuse, Helicone, Datadog LLM Observability, or roll my own?

Langfuse if you want open source with self-host as an option, plus strong eval and prompt-management features. Helicone if you want zero-config gateway-based capture (proxy your provider calls through their endpoint). Datadog LLM Observability if you're already on Datadog: the native OTel GenAI mapping is the cleanest integration. Roll your own if you're allergic to SaaS and have a working observability team; OTel makes the floor reasonable, but UI and analytics are real work.

What's the one alert I should set today?

Provider error rate > 1% over a 5-minute window, broken down by model. Provider outages and quota issues are the most common source of user-facing breakage in AI features, and they usually show up here first. Second alert (after the first one has paged a few times): p95 TTFT > target for 10 minutes. Everything else can wait.

Observability for AI Features — OpenTelemetry GenAI, SLOs, Tracing

CH 01

Why LLM observability is different.

Your existing APM tells you the route was slow. It cannot tell you the model returned a length-truncated answer, the tool call looped four times, or the cache hit rate dropped after a prompt edit. Those are LLM-specific failure modes, and your generic tracer is blind to them.

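The good news: in 2026 you do not need a special vendor to see them. OpenTelemetry has a GenAI semantic convention that every major observability platform now consumes, so you instrument once and pick the dashboard later. The convention is still marked experimental (hence the opt-in flag in CH 02), so expect attribute names to shift between versions.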

The shift in what you measure:

Tokens, not just bytes. Tokens are the unit of cost and the unit of behavior. Track both directions.
Time-to-first-token, not just total. TTFT is what the user feels on a streaming endpoint; total time is what your bill reads.
Finish reason, not just status. A 200 OK with finish_reason: length is a user-facing bug your APM thinks went fine.
Tool-call depth. An agent that loops list/get/list/get five times is broken; HTTP-level metrics will never tell you.
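
Tool-call looping is the least obvious item in that list, so here is a minimal sketch of how it could be detected from the ordered tool names of one request. The `ToolCallTrace` shape, the function names, and the default thresholds are illustrative assumptions, not part of any SDK or of the GenAI conventions.

```typescript
// Hypothetical shape: the ordered tool names a single agent request invoked.
type ToolCallTrace = string[];

// Count how many times the last `period` calls repeat back-to-back
// at the end of the trace (the final occurrence counts as 1).
function trailingRepeats(trace: ToolCallTrace, period: number): number {
  if (period <= 0 || trace.length < period) return 0;
  const pattern = trace.slice(trace.length - period);
  let repeats = 0;
  for (let end = trace.length; end - period >= 0; end -= period) {
    const chunk = trace.slice(end - period, end);
    if (chunk.every((name, i) => name === pattern[i])) repeats++;
    else break;
  }
  return repeats;
}

// Flag a request as looping when any short pattern (1..maxPeriod calls)
// repeats at least `minRepeats` times in a row. Thresholds are assumptions.
function looksLikeToolLoop(
  trace: ToolCallTrace,
  maxPeriod = 3,
  minRepeats = 3,
): boolean {
  for (let period = 1; period <= maxPeriod; period++) {
    if (trailingRepeats(trace, period) >= minRepeats) return true;
  }
  return false;
}
```

A result like this could be written to the span as a custom boolean attribute, which makes "requests that looped" a one-line dashboard filter.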

CH 02

OpenTelemetry GenAI in 5 minutes.

The spec lives under gen_ai.*. Span names are operation-shaped (chat, execute_tool, invoke_agent), attributes carry the metadata you actually want to query.

If your instrumentation already emits GenAI conventions v1.36 or earlier, the spec is explicit: it should keep emitting that version by default. Opt in to newer conventions with an env var:

Opt into the latest GenAI conventions

export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental

Manual instrumentation of an AI SDK call (TypeScript)

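import { trace, SpanStatusCode } from '@opentelemetry/api';
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const tracer = trace.getTracer('chat-route');

export async function POST(req: Request) {
  const { messages } = await req.json();

  return tracer.startActiveSpan('chat', (span) => {
    span.setAttributes({
      'gen_ai.operation.name': 'chat',
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-sonnet-4-5',
    });

    // streamText returns before the stream is consumed, so the span has to be
    // ended from the stream callbacks. Ending it in a finally block here would
    // close the span before onFinish runs and drop the usage attributes.
    let ended = false;
    const endSpan = () => {
      if (ended) return;
      ended = true;
      span.end();
    };
    const failSpan = (e: unknown) => {
      const err = e instanceof Error ? e : new Error(String(e));
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      endSpan();
    };

    try {
      const result = streamText({
        model: anthropic('claude-sonnet-4-5'),
        messages,
        abortSignal: req.signal,
        onFinish: ({ usage, finishReason, response }) => {
          span.setAttributes({
            'gen_ai.response.model': response.modelId,
            // AI SDK v5 field names; v4 called these promptTokens / completionTokens.
            'gen_ai.usage.input_tokens': usage.inputTokens,
            'gen_ai.usage.output_tokens': usage.outputTokens,
            'gen_ai.response.finish_reasons': [finishReason],
          });
          endSpan();
        },
        // Errors raised while streaming surface here, not in the catch below.
        onError: ({ error }) => failSpan(error),
        // Client disconnects skip onFinish, so close the span here too.
        onAbort: () => endSpan(),
      });

      return result.toUIMessageStreamResponse();
    } catch (e) {
      // Only synchronous setup failures land here.
      failSpan(e);
      throw e;
    }
  });
}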

Auto-instrumentation exists for most provider SDKs (@traceloop/instrumentation-anthropic, OpenAI's own OTel exporter, the AI SDK's experimental_telemetry option). Use it when it exists and fall back to manual when it doesn't. As long as both paths emit gen_ai.* attributes, the names are the same either way and the dashboard speaks the same language.

CH 03

The seven attributes you actually need.

Attribute	Why
`gen_ai.system`	"anthropic", "openai", "google". Filter by provider; spot one going down. Newer versions of the conventions rename this to `gen_ai.provider.name`.
`gen_ai.request.model`	What you asked for.
`gen_ai.response.model`	What you actually got. Providers silently roll versions; this is how you notice.
`gen_ai.usage.input_tokens`	The cost lever you control most.
`gen_ai.usage.output_tokens`	The latency lever you control least.
`gen_ai.response.finish_reasons`	`stop` good. `length` = truncation. `tool_calls` = mid-loop. `content_filter` = policy hit.
`gen_ai.operation.name`	"chat", "execute_tool", "invoke_agent" — lets you group by span shape.

Add these as custom attributes alongside the standard ones (the conventions explicitly allow it, and you can migrate to a standardized version later if the spec catches up):

gen_ai.cost.total_usd — computed at log time from token counts and current prices. Saves you from joining against a price table at query time.
gen_ai.ttft_ms — time-to-first-token, the user-facing latency metric. Measure between request received and first stream chunk emitted.
gen_ai.user.tenant_id — the tenant or workspace the call is for. Lets you spot a noisy tenant or charge them back accurately.
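
The first two custom attributes are small calculations, sketched below. This is a minimal sketch under stated assumptions: the price table, the model key `example-model`, and both function names are hypothetical, and the prices are placeholders rather than any provider's real rates.

```typescript
// Hypothetical prices in USD per million tokens. Replace with your
// provider's current price sheet; these numbers are placeholders.
const PRICES_USD_PER_MTOK: Record<string, { input: number; output: number }> = {
  'example-model': { input: 3, output: 15 },
};

// Value for the custom gen_ai.cost.total_usd attribute.
function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number | undefined {
  const price = PRICES_USD_PER_MTOK[model];
  // Unknown model: leave the attribute unset rather than guess a price.
  if (!price) return undefined;
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Value for the custom gen_ai.ttft_ms attribute. Create the timer when the
// request is received and call markChunk() for every chunk you emit; only
// the first call records anything.
function createTtftTimer(
  requestReceivedAtMs: number,
  nowMs: () => number = Date.now,
) {
  let recorded: number | undefined;
  return {
    markChunk(): void {
      if (recorded === undefined) recorded = nowMs() - requestReceivedAtMs;
    },
    read(): number | undefined {
      return recorded;
    },
  };
}
```

Computing cost at log time freezes the price that applied when the call was made, which is usually what you want for chargeback; the trade-off is that a later price correction does not rewrite history.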

CH 04

SLOs for streaming endpoints.

The mistake everyone makes the first time: putting an SLO on total response time. It varies legitimately with prompt complexity, gets worse the day someone enables long-context, and tells you nothing actionable. Use two SLIs instead.

SLI	Definition	Healthy target (chat)
TTFT	P95 of `gen_ai.ttft_ms` over a rolling window.	< 800 ms
Stream success	% of streams that finish with `finish_reason: stop` (or `tool_calls` for agents) and no exception.	≥ 99.5% over 28 d
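
As a rough sketch of how the two SLIs in the table could be computed from per-stream records: the `StreamRecord` shape and function names are assumptions for illustration, and the percentile uses the nearest-rank method, which may differ slightly from what your metrics backend does.

```typescript
// Hypothetical per-stream record exported from your spans.
interface StreamRecord {
  ttftMs: number | null;        // null when no chunk was ever emitted
  finishReason: string | null;  // null when the stream died before finishing
  threwException: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of the
// samples at or below it.
function percentile(values: number[], p: number): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// TTFT SLI: P95 over the streams that produced at least one chunk.
function p95Ttft(records: StreamRecord[]): number | undefined {
  const samples: number[] = [];
  for (const r of records) {
    if (r.ttftMs !== null) samples.push(r.ttftMs);
  }
  return percentile(samples, 95);
}

// Stream-success SLI: share of streams with an accepted finish reason
// and no exception.
function streamSuccessRate(
  records: StreamRecord[],
  okReasons: string[] = ['stop', 'tool_calls'],
): number | undefined {
  if (records.length === 0) return undefined;
  let ok = 0;
  for (const r of records) {
    const reasonOk =
      r.finishReason !== null && okReasons.indexOf(r.finishReason) !== -1;
    if (reasonOk && !r.threwException) ok++;
  }
  return ok / records.length;
}
```

For a plain chat product you would likely pass `['stop']` as `okReasons`, since `tool_calls` only counts as a clean finish for agents.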

That gives you a real error budget. A 99.5% SLO over 28 days leaves 0.5% of the window, about 3.4 hours, as permitted bad time (3.6 hours if you count a 30-day month). Burn through more than half of it in the first week and you have an alert; burn it all and you stop shipping risky changes until the window resets. Standard SRE math, applied to LLM features.
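
The arithmetic is worth writing down once, because the window length changes the answer: 0.5% of a 28-day window is about 3.36 hours, while 0.5% of a 30-day month is 3.6 hours. The sketch below shows both the time-based view and the request-based view (the SLI above counts streams, not minutes). Function names and the expected-traffic parameter are illustrative assumptions.

```typescript
// Time-based view: hours of "bad time" the SLO permits in the window.
function errorBudgetHours(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24;
}

// Request-based view: fraction of the budget already spent, given how many
// streams you expect to serve in the whole window.
function budgetConsumed(
  failedStreams: number,
  expectedStreamsInWindow: number,
  sloTarget: number,
): number {
  const allowedFailures = (1 - sloTarget) * expectedStreamsInWindow;
  return allowedFailures === 0 ? Infinity : failedStreams / allowedFailures;
}

// The burn rule from the text: more than half the budget gone
// within the first week of the window.
function shouldAlertOnBurn(
  consumedFraction: number,
  daysIntoWindow: number,
): boolean {
  return daysIntoWindow <= 7 && consumedFraction > 0.5;
}
```

With 100,000 expected streams in the window, a 99.5% target allows roughly 500 failures, so 300 failures by day five is 60% of the budget and would trip the rule.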

CH 05

Vendor decision tree.

Pick	When	Watch out for
Langfuse	Open source, self-host option, strong prompt management and evals. Best fit if you want the codebase auditable or run on-prem.	Self-hosting Postgres + ClickHouse is not free; budget for ops time.
Helicone	Zero-config gateway capture — you proxy provider calls through their endpoint, they capture everything. Fastest path to "we have visibility."	Adds a hop in your hot path. Edge users feel it.
Datadog LLM Observability	Already on Datadog. Native OTel GenAI mapping (since v1.37 of the spec) means GenAI spans sit next to your APM traces.	Datadog pricing, but you knew that.
New Relic / Honeycomb / Grafana Cloud	You're already there and they consume OTel cleanly. The dashboards aren't LLM-specific but the data is queryable.	You may build your own LLM views.
Roll your own	You have a working observability team and budget for a small in-house surface.	Eval and prompt management are real work. Don't underestimate.

"Let's pick the vendor first"

Backwards. Instrument with OTel GenAI conventions first; the spans are portable and you can swap the backend in an afternoon. Picking the vendor before you've instrumented means you'll write vendor-specific code you'll have to delete when you migrate.

CH 06

Alerts worth paging on.

Alert	Condition	Severity
Provider error rate	> 1% over 5 min, grouped by `gen_ai.system` and `gen_ai.request.model`	Page
TTFT SLO burn	> 50% of the 28-day budget consumed in < 7 d	Page
Cost anomaly	Daily spend > 2× rolling 7-day median	Page
Truncation spike	`finish_reason: length` share > 5% over 1 h	Ticket
Tool-call depth runaway	Avg tool-call steps per request > 4 over 1 h	Ticket
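Model silently changed	The set of `gen_ai.response.model` values seen for a given `gen_ai.request.model` changes (aliases normally resolve to a dated model ID, so a plain inequality check would fire constantly)	Ticket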
Per-tenant burst	Single `tenant_id` > 10× its 7-day median tokens in 5 min	Ticket
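
Most of these conditions are a few lines of aggregation once the attributes exist. Here is a hedged sketch of two of them, the provider error rate and the cost anomaly, written as plain functions over data you have already pulled for the window. The record shape and function names are assumptions; in practice the same logic would usually live in your backend's query language.

```typescript
// Hypothetical per-call outcome for the last 5 minutes.
interface CallOutcome {
  system: string;   // value of gen_ai.system
  isError: boolean;
}

// Providers whose error rate in the window is strictly above the threshold.
function providersOverErrorThreshold(
  calls: CallOutcome[],
  threshold = 0.01,
): string[] {
  const totals = new Map<string, { errors: number; total: number }>();
  for (const call of calls) {
    const entry = totals.get(call.system) ?? { errors: 0, total: 0 };
    entry.total += 1;
    if (call.isError) entry.errors += 1;
    totals.set(call.system, entry);
  }
  return Array.from(totals.entries())
    .filter(([, t]) => t.errors / t.total > threshold)
    .map(([system]) => system)
    .sort();
}

function median(values: number[]): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Cost anomaly: today's spend exceeds `factor` times the median of the
// previous days (pass the last 7 daily totals).
function isCostAnomaly(
  todayUsd: number,
  previousDailyUsd: number[],
  factor = 2,
): boolean {
  const baseline = median(previousDailyUsd);
  return baseline !== undefined && baseline > 0 && todayUsd > factor * baseline;
}
```

Using the median rather than the mean as the baseline keeps one earlier spike from raising the threshold for the rest of the week.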

PITFALLS

Pitfalls.

Logging 100% of prompts and outputs

Two ways this hurts: storage cost (LLM payloads are big) and privacy (anything the user typed is now in your logs). Sample — 1–5% in production is plenty for debugging; 100% in staging is fine. Metadata stays at 100%; only the content gets sampled.
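
A minimal sketch of the sampling decision and a redaction pass, assuming you key the decision on the trace ID so every span in a trace agrees. Both function names are hypothetical, the hash is a simple polynomial hash chosen for brevity, and the redaction patterns are deliberately naive: real PII handling needs a proper detector and review, not two regexes.

```typescript
// Deterministic content sampling: the same trace ID always gets the same
// decision, so a sampled conversation is captured in full.
function shouldCaptureContent(traceId: string, sampleRate: number): boolean {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    // Keep the running hash inside the unsigned 32-bit range.
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0;
  }
  return hash / 4294967296 < sampleRate;
}

// Naive redaction applied before content is stored. Illustrative only.
function redact(text: string): string {
  return text
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, '[EMAIL]')
    .replace(/\b(?:\d[ -]?){13,16}\b/g, '[CARD]');
}

// Metadata is always returned; content only when the trace is sampled.
function buildLogRecord(
  traceId: string,
  prompt: string,
  meta: { model: string; inputTokens: number; outputTokens: number },
  sampleRate: number,
): { model: string; inputTokens: number; outputTokens: number; prompt?: string } {
  return shouldCaptureContent(traceId, sampleRate)
    ? { ...meta, prompt: redact(prompt) }
    : { ...meta };
}
```

A `sampleRate` of 0.01 to 0.05 corresponds to the production range above, and 1 to staging.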

Vendor SDK lock-in

If your instrumentation code is helicone.log(...) instead of span.setAttribute('gen_ai.usage.input_tokens', ...), switching backends is a rewrite. OTel GenAI conventions are how you keep the option open.
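
One way to keep that option open is to route every call through a single helper that only knows about attribute names. The sketch below is an assumption about how such a helper might look: `AttributeSink`, `GenAiCall`, and `recordGenAiCall` are hypothetical names, and the interface is a structural subset of an OpenTelemetry span so a real span can be passed in directly.

```typescript
// Structural subset of an OpenTelemetry Span: anything with setAttribute.
interface AttributeSink {
  setAttribute(key: string, value: string | number | string[]): unknown;
}

// Hypothetical summary of one completed model call.
interface GenAiCall {
  system: string;
  requestModel: string;
  responseModel: string;
  inputTokens: number;
  outputTokens: number;
  finishReason: string;
}

// The only place in the codebase that knows the attribute names.
function recordGenAiCall(span: AttributeSink, call: GenAiCall): void {
  span.setAttribute('gen_ai.operation.name', 'chat');
  span.setAttribute('gen_ai.system', call.system);
  span.setAttribute('gen_ai.request.model', call.requestModel);
  span.setAttribute('gen_ai.response.model', call.responseModel);
  span.setAttribute('gen_ai.usage.input_tokens', call.inputTokens);
  span.setAttribute('gen_ai.usage.output_tokens', call.outputTokens);
  span.setAttribute('gen_ai.response.finish_reasons', [call.finishReason]);
}
```

If the conventions rename an attribute, or you do adopt a vendor SDK, the change is confined to this one function instead of every call site.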

Alerting on total response time

It legitimately varies with prompt size. You will tune the threshold for weeks and never get useful pages. Alert on TTFT (user-facing) and success rate (functional); leave total time as a graph, not a pager.

No per-tenant breakdown

One bad tenant can move your aggregate cost or latency enough to obscure a separate, real regression. Always carry a tenant attribute on the span; always group dashboards by it.
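
A small sketch of the per-tenant grouping, assuming each record carries the tenant attribute and a token count. The record shape, function names, and the 10× factor (borrowed from the per-tenant burst alert) are illustrative; tenants without a baseline are skipped rather than flagged.

```typescript
// Hypothetical usage record for one call in the current 5-minute window.
interface TenantUsage {
  tenantId: string;
  tokens: number; // input + output tokens
}

// Sum tokens per tenant for the window.
function tokensByTenant(records: TenantUsage[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.tenantId, (totals.get(r.tenantId) ?? 0) + r.tokens);
  }
  return totals;
}

// Tenants whose current-window tokens exceed `factor` times their own
// baseline (for example, their 7-day median for the same window size).
function noisyTenants(
  current: Map<string, number>,
  baseline: Map<string, number>,
  factor = 10,
): string[] {
  const flagged: string[] = [];
  current.forEach((tokens, tenantId) => {
    const base = baseline.get(tenantId);
    // No baseline yet (new tenant): skip instead of guessing.
    if (base !== undefined && base > 0 && tokens > factor * base) {
      flagged.push(tenantId);
    }
  });
  return flagged.sort();
}
```

Comparing each tenant against its own baseline, not the fleet average, is what keeps a legitimately large customer from being flagged every day.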