Speech To Text

Transcribe audio to text using OpenAI Whisper. Use when the user wants to convert speech to text, transcribe audio files, generate subtitles, or extract text from recordings. Triggers include "speech to text", "STT", "transcribe", "transcription", "subtitles", "captions", "audio to text", "convert audio to text".

Published by rebyteai

Featured Automation

Cloud-native skill

  • Runs in the cloud: no local installation
  • Dependencies pre-installed: ready to run instantly
  • Secure VM environment: isolated per task
  • Works on any device: desktop, tablet, or phone

Documentation

Authentication

IMPORTANT: All API requests require authentication. Get your auth token and API URL by running:

AUTH_TOKEN=$(/home/user/.local/bin/rebyte-auth)
API_URL=$(python3 -c "import json; print(json.load(open('/home/user/.rebyte.ai/auth.json'))['sandbox']['relay_url'])")

Include the token in all API requests as a Bearer token, and use $API_URL as the base for all API endpoints.
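
Both values are read from local files, so it can help to fail early when either one comes back empty instead of sending a request with a blank token. A minimal sketch; `require_auth` is a hypothetical helper name, not part of the Rebyte tooling:

```shell
# Hypothetical helper: verify that AUTH_TOKEN and API_URL are set
# before any curl call is made.
require_auth() {
  if [ -z "${AUTH_TOKEN:-}" ]; then
    echo "AUTH_TOKEN is empty; run rebyte-auth first" >&2
    return 1
  fi
  if [ -z "${API_URL:-}" ]; then
    echo "API_URL is empty; check auth.json" >&2
    return 1
  fi
  # Drop one trailing slash so "$API_URL/api/..." does not contain "//".
  API_URL="${API_URL%/}"
}
```

Calling `require_auth || exit 1` right after the two assignments above would stop a script before it reaches the first API request.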

When to Use

Use this skill when the user needs to:

  • Transcribe audio recordings to text
  • Generate subtitles or captions (SRT, VTT)
  • Extract spoken content from audio files
  • Convert voice memos or interviews to text

How It Works

The STT API uses a three-step flow because audio files are too large for JSON payloads:

  1. Get a signed upload URL from GCS
  2. Upload the audio file to that URL
  3. Call transcribe with the filename

Step 1: Get Upload URL

UPLOAD_RESPONSE=$(curl -s -X POST "$API_URL/api/data/stt/get_upload_url" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "filename": "recording.mp3",
    "contentType": "audio/mpeg"
  }')

UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')
echo "$UPLOAD_RESPONSE" | jq .

Response:

{
  "success": true,
  "uploadUrl": "https://storage.googleapis.com/...(signed URL)...",
  "filename": "recording.mp3",
  "instructions": "Upload your file to this URL using PUT request, then call \"transcribe\" with the filename."
}

Step 2: Upload Audio File

curl -X PUT "$UPLOAD_URL" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @recording.mp3
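
The `contentType` sent in Step 1 and the `Content-Type` header in Step 2 use the same value, and a mismatch between them may cause the signed URL to reject the upload. The sketch below derives that value from the file extension. `content_type_for` is a hypothetical helper, and the MIME types are the conventional ones for each extension; whether the endpoint accepts every one of them is an assumption:

```shell
# Hypothetical helper: map an audio file name to a conventional MIME type.
content_type_for() {
  # Lowercase the extension so "Talk.MP3" and "talk.mp3" are treated alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    mp3|mpga|mpeg) echo "audio/mpeg" ;;
    m4a|mp4)       echo "audio/mp4" ;;
    wav)           echo "audio/wav" ;;
    webm)          echo "audio/webm" ;;
    ogg)           echo "audio/ogg" ;;
    flac)          echo "audio/flac" ;;
    *)             echo "unsupported extension: $ext" >&2; return 1 ;;
  esac
}
```

For example, `CONTENT_TYPE=$(content_type_for recording.mp3)` could then be used in both the Step 1 JSON body and the Step 2 header.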

Step 3: Transcribe

curl -s -X POST "$API_URL/api/data/stt/transcribe" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "filename": "recording.mp3",
    "language": "en",
    "response_format": "json"
  }'

Response (json format):

{
  "success": true,
  "data": {
    "text": "Hello, this is a transcription of the audio recording."
  }
}

Response (verbose_json format):

{
  "success": true,
  "data": {
    "task": "transcribe",
    "language": "english",
    "duration": 12.5,
    "text": "Hello, this is a transcription of the audio recording.",
    "segments": [
      {
        "start": 0.0,
        "end": 3.2,
        "text": "Hello, this is a transcription"
      }
    ]
  }
}
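
The `start` and `end` values in each segment are seconds. If you want to build custom captions from `verbose_json` output rather than request `srt` directly, the times have to be converted to the `HH:MM:SS,mmm` form that SubRip uses. A small sketch using awk; `srt_timestamp` and `srt_cue` are hypothetical helper names:

```shell
# Convert seconds (e.g. 3.2) to an SRT timestamp (HH:MM:SS,mmm).
srt_timestamp() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)                   # round to whole milliseconds
    h  = int(ms / 3600000); ms -= h * 3600000  # whole hours
    m  = int(ms / 60000);   ms -= m * 60000    # whole minutes
    s  = int(ms / 1000);    ms -= s * 1000     # whole seconds
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}

# Print one SRT cue.
# $1 = cue index, $2 = start seconds, $3 = end seconds, $4 = text
srt_cue() {
  printf '%s\n%s --> %s\n%s\n\n' \
    "$1" "$(srt_timestamp "$2")" "$(srt_timestamp "$3")" "$4"
}
```

With the sample segment above, `srt_cue 1 0.0 3.2 "Hello, this is a transcription"` prints a cue running from `00:00:00,000` to `00:00:03,200`.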

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| filename | string | Yes | - | Name of the file uploaded via get_upload_url |
| language | string | No | auto | ISO-639-1 language code (e.g. "en", "es", "ja"); improves accuracy |
| prompt | string | No | - | Optional text to guide transcription style or continue a previous segment |
| model | string | No | whisper-1 | Model to use (currently only whisper-1) |
| response_format | string | No | json | Output format (see Response Formats) |
| temperature | number | No | 0 | Sampling temperature (0-1). Lower = more deterministic |
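
The request bodies in this document are hand-written JSON, which breaks as soon as a value such as `prompt` contains a double quote. One option is to let jq assemble the body and handle the escaping. A sketch only; `build_transcribe_payload` is a hypothetical helper, and it uses just the parameters listed above:

```shell
# Hypothetical helper: build the transcribe request body with jq.
# $1 = filename (required), $2 = language, $3 = prompt,
# $4 = response_format (defaults to "json")
build_transcribe_payload() {
  jq -cn \
    --arg filename "$1" \
    --arg language "${2:-}" \
    --arg prompt "${3:-}" \
    --arg format "${4:-json}" '
    {filename: $filename, response_format: $format}
    + (if $language != "" then {language: $language} else {} end)
    + (if $prompt   != "" then {prompt: $prompt}     else {} end)'
}
```

The result can be passed to curl as `-d "$(build_transcribe_payload recording.mp3 en)"`. Optional fields that are left empty are omitted, so the server-side defaults apply.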

Response Formats

| Format | Description | Use Case |
|---|---|---|
| json | Simple JSON with text field | Default, quick text extraction |
| verbose_json | JSON with timestamps, segments, duration | When you need segment-level timing |
| text | Plain text only | Simple text output |
| srt | SubRip subtitle format | Video subtitles |
| vtt | WebVTT subtitle format | Web video captions |

Supported Audio Formats

Whisper accepts: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac

Max file size: 25 MB
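
Checking the size locally avoids uploading a file that will be rejected. A minimal sketch; `check_audio_size` is a hypothetical helper, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which the documentation does not state:

```shell
# Assumption: the 25 MB limit is 25 * 1024 * 1024 bytes.
MAX_BYTES=$((25 * 1024 * 1024))

# Hypothetical helper: return non-zero if the file is missing or too large.
check_audio_size() {
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 1
  fi
  # wc may pad its output with spaces on some systems, so strip them.
  size=$(wc -c < "$1" | tr -d ' ')
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "$1 is $size bytes, over the $MAX_BYTES byte limit" >&2
    return 1
  fi
}
```

A script could run `check_audio_size interview.mp3 || exit 1` before requesting the upload URL, and fall back to splitting the file when the check fails.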

Example: Full Transcription Workflow

# Get auth
AUTH_TOKEN=$(/home/user/.local/bin/rebyte-auth)
API_URL=$(python3 -c "import json; print(json.load(open('/home/user/.rebyte.ai/auth.json'))['sandbox']['relay_url'])")

# 1. Get upload URL
UPLOAD_RESPONSE=$(curl -s -X POST "$API_URL/api/data/stt/get_upload_url" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"filename": "interview.mp3", "contentType": "audio/mpeg"}')

UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.uploadUrl')

# 2. Upload the audio file
curl -s -X PUT "$UPLOAD_URL" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @interview.mp3

# 3. Transcribe
RESULT=$(curl -s -X POST "$API_URL/api/data/stt/transcribe" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"filename": "interview.mp3", "language": "en", "response_format": "json"}')

# 4. Extract text
echo "$RESULT" | jq -r '.data.text' > transcript.txt
echo "Transcript saved to transcript.txt"
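
The workflow above writes `.data.text` without checking whether the call succeeded, so a failed request would leave the literal text `null` in transcript.txt. A hedged sketch of a stricter extraction step follows. `extract_text` is a hypothetical helper, and it relies only on the `success` field shown in the sample responses; the shape of an error response is not documented here:

```shell
# Hypothetical helper: print .data.text only when .success is true.
# $1 = raw response body from the transcribe call
extract_text() {
  # jq -e exits non-zero when the result is false or null,
  # and also when the input is not valid JSON.
  if ! printf '%s' "$1" | jq -e '.success == true' >/dev/null 2>&1; then
    echo "transcription failed: $1" >&2
    return 1
  fi
  printf '%s' "$1" | jq -r '.data.text'
}
```

Step 4 could then become `extract_text "$RESULT" > transcript.txt || exit 1`.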

Example: Generate SRT Subtitles

# Transcribe with SRT format for subtitles
RESULT=$(curl -s -X POST "$API_URL/api/data/stt/transcribe" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"filename": "video-audio.mp3", "response_format": "srt"}')

# Save SRT file
echo "$RESULT" | jq -r '.data.text' > subtitles.srt

# Burn subtitles into video with ffmpeg
ffmpeg -i video.mp4 -vf subtitles=subtitles.srt output.mp4

Tips

  • Always specify language when you know it — improves accuracy and speed
  • Use verbose_json when you need timestamps for syncing with video
  • Use srt or vtt format to directly generate subtitle files
  • For long audio files, consider splitting with ffmpeg first: ffmpeg -i long.mp3 -f segment -segment_time 300 -c copy chunk_%03d.mp3
  • Leave temperature at 0 (the default) for the most deterministic, repeatable results
  • The prompt parameter helps with domain-specific terms — include key vocabulary the model should recognize


Compatible agents

Claude Code

Gemini CLI

Codex

Cursor, Windsurf, Amp
