Speech to Text

Name: Speech to Text
Author: ElevenLabs

Transcribes audio with the ElevenLabs speech-to-text API, handling uploads, streaming, and output formats.

View on GitHub Read SKILL.md

elevenlabs/skills 2026-06-08

319 GitHub stars

41 Forks

2026-05-20 Updated

MIT License

The full SKILL.md

Synced June 2, 2026 — view latest on GitHub

---
name: speech-to-text
description: Transcribe audio to text using ElevenLabs Scribe v2. Use when converting audio/video to text, generating subtitles, transcribing meetings, or processing spoken content.
license: MIT
compatibility: Requires internet access and an ElevenLabs API key (ELEVENLABS_API_KEY).
metadata: {"openclaw": {"requires": {"env": ["ELEVENLABS_API_KEY"]}, "primaryEnv": "ELEVENLABS_API_KEY"}}
---

# ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.

> **Setup:** See [Installation Guide](references/installation.md). For JavaScript, use `@elevenlabs/*` packages only.

## Quick Start

### Python

```python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

print(result.text)
```

### JavaScript

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();
const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});
console.log(result.text);
```

### cURL

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" -F "[email protected]" -F "model_id=scribe_v2"
```

## Models

| Model ID | Description | Best For |
|----------|-------------|----------|
| `scribe_v2` | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| `scribe_v2_realtime` | Low latency (~150ms) | Live transcription, voice agents |

## Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:

```python
result = client.speech_to_text.convert(
    file=audio_file, model_id="scribe_v2", timestamps_granularity="word"
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")

```

## Speaker Diarization

Identify WHO said WHAT - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
```

For call recordings, the batch API can label diarized speakers as `agent` and `customer` by setting `detect_speaker_roles=true` alongside `diarize=true`. This option is not compatible with `use_multi_channel=true`.

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "[email protected]" \
  -F "model_id=scribe_v2" \
  -F "diarize=true" \
  -F "detect_speaker_roles=true"
```

## Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"]
)
```

## Language Detection

Automatic detection with optional language hint:

```python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng"  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")
```

## Supported Formats

**Audio:** MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
**Video:** MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP

**Limits:** Up to 3GB file size, 10 hours duration

## Response Format

```json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {"text": "The", "start": 0.0, "end": 0.15, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing", "speaker_id": "speaker_0"}
  ]
}
```

**Word types:**
- `word` - An actual spoken word
- `spacing` - Whitespace between words (useful for precise timing)
- `audio_event` - Non-speech sounds the model detected (laughter, applause, music, etc.)

## Error Handling

```python
try:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
except Exception as e:
    print(f"Transcription failed: {e}")
```

Common errors:
- **401**: Invalid API key
- **422**: Invalid parameters
- **429**: Rate limit exceeded

## Tracking Costs

Monitor usage via `request-id` response header:

```python
response = client.speech_to_text.convert.with_raw_response(file=audio_file, model_id="scribe_v2")
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")
```

## Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. The real-time API produces two types of transcripts:

- **Partial transcripts**: Interim results that update frequently as audio is processed - use these for live feedback (e.g., showing text as the user speaks)
- **Committed transcripts**: Final, stable results after you "commit" - use these as the source of truth for your application

A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.

### Python (Server-Side)

```python
import asyncio
from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
        keyterms=["ElevenLabs", "Scribe"],
        no_verbatim=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")

        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())
```

### JavaScript (Client-Side with React)

```typescript
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    keyterms: ["ElevenLabs", "Scribe"],
    noVerbatim: true,
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());

    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}
```

### Commit Strategies

| Strategy | Description |
|----------|-------------|
| **Manual** | You call `commit()` when ready - use for file processing or when you control the audio segments |
| **VAD** | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |

```typescript
// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  keyterms: ["ElevenLabs", "Scribe"],
  noVerbatim: true,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});
```

```javascript
// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  keyterms: ["ElevenLabs", "Scribe"],
  noVerbatim: true,
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});
```

### Event Types

| Event | Description |
|-------|-------------|
| `partial_transcript` | Live interim results |
| `committed_transcript` | Final results after commit |
| `committed_transcript_with_timestamps` | Final with word timing |
| `error` | Error occurred |

See real-time references for complete documentation.

## References

- [Installation Guide](references/installation.md)
- [Transcription Options](references/transcription-options.md)
- [Real-Time Client-Side Streaming](references/realtime-client-side.md)
- [Real-Time Server-Side Streaming](references/realtime-server-side.md)
- [Commit Strategies](references/realtime-commit-strategies.md)
- [Real-Time Event Reference](references/realtime-events.md)

Install

Add Speech to Text to your agent

Pick your tool, then drop the file in or run the one-line fetch command.

1Drop this in

Project: .cursor/skills/speech-to-text.md

2Or fetch it from the repo

curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md -o .cursor/skills/speech-to-text.md

Restart Cursor. The agent now follows this skill on every relevant task.

1Drop this in

User-level: ~/.claude/skills/speech-to-text/SKILL.md

2Or fetch it from the repo

mkdir -p ~/.claude/skills/speech-to-text && curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md -o ~/.claude/skills/speech-to-text/SKILL.md

Claude Code auto-discovers skills in ~/.claude/skills/.

1Drop this in

Project: AGENTS.md (append the SKILL contents)

2Or fetch it from the repo

curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md >> AGENTS.md

Codex CLI reads AGENTS.md automatically from the project root.

1Drop this in

Project: .windsurf/rules/speech-to-text.md

2Or fetch it from the repo

mkdir -p .windsurf/rules && curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md -o .windsurf/rules/speech-to-text.md

Windsurf loads project rules on every Cascade run.

1Drop this in

Project: .github/copilot-instructions.md (append)

2Or fetch it from the repo

mkdir -p .github && curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md >> .github/copilot-instructions.md

Copilot reads .github/copilot-instructions.md as project-wide context.

1Drop this in

Project: .gemini/skills/speech-to-text.md

2Or fetch it from the repo

mkdir -p .gemini/skills && curl -fsSL https://raw.githubusercontent.com/elevenlabs/skills/main/speech-to-text/SKILL.md -o .gemini/skills/speech-to-text.md

Gemini CLI auto-loads project skills on the next run.