Integrate transcription into your own scripts, apps, and automation workflows
8765). To stop the server, open Settings again and flip the Server Mode toggle back to Off.
The server listens on 0.0.0.0, so it is accessible at http://127.0.0.1:8765 locally and at http://<your-ip>:8765 from other machines on your network.
Transcription runs on the faster-whisper library, which exposes the standard Whisper knobs (beam_size, vad_filter, condition_on_previous_text, word_timestamps, temperature, …). The server surfaces the ones that actually matter for integration.
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Check if the server is running |
| /status | GET | Server status, queue depth, whether a transcription is active |
| /models | GET | List all available models and whether they support translation |
| /transcribe | POST | Transcribe audio from a file upload (multipart form) |
| /transcribe/raw | POST | Transcribe audio from base64-encoded data (JSON body) |
The server also provides interactive API documentation:
http://127.0.0.1:8765/docs
http://127.0.0.1:8765/redoc
The simplest possible request sends an audio file and gets text back:
import requests

response = requests.post(
    "http://127.0.0.1:8765/transcribe",
    files={"audio": open("my_audio.mp3", "rb")},
)
print(response.json()["text"])
That's it. The server uses whatever model, quantization, task, and whisper params were configured in the GUI when Server Mode was turned on. Language defaults to auto-detection.
Standard audio files can be uploaded directly. Supported formats include:
.mp3 .wav .flac .m4a
.ogg .aac .wma .webm
.mp4 .mkv .avi .asf .amr
import requests

with open("recording.wav", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8765/transcribe",
        files={"audio": ("recording.wav", f, "audio/wav")},
    )
print(response.json()["text"])
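Before uploading, a client may want to skip files the server is unlikely to decode. The sketch below checks the extension against the list above. The helper name `is_supported_audio` is hypothetical and not part of the server API, and since the list is introduced with "include", the server may accept more formats than this check allows.

```python
from pathlib import Path

# Extensions copied from the supported-formats list above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".m4a",
    ".ogg", ".aac", ".wma", ".webm",
    ".mp4", ".mkv", ".avi", ".asf", ".amr",
}


def is_supported_audio(path):
    """Return True if the file extension is in the documented list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

This is only a client-side convenience; the server remains the authority on what it can decode.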
If your program already has audio as a NumPy array, serialize it with np.save() and upload the .npy file:
import io

import numpy as np
import requests

# Your audio as a numpy array (float32, mono)
audio_array = np.random.randn(16000 * 5).astype(np.float32)  # 5 seconds at 16kHz

buffer = io.BytesIO()
np.save(buffer, audio_array)
buffer.seek(0)

response = requests.post(
    "http://127.0.0.1:8765/transcribe",
    files={"audio": ("audio.npy", buffer, "application/octet-stream")},
    data={"sample_rate": "16000"},
)
print(response.json()["text"])
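If your audio was recorded at a different rate, send it unchanged with the matching value (for example sample_rate=44100) and the server will resample it to 16 kHz automatically.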
import io

import torch
import requests

audio_tensor = torch.randn(16000 * 5)  # 5 seconds at 16kHz

buffer = io.BytesIO()
torch.save(audio_tensor, buffer)
buffer.seek(0)

response = requests.post(
    "http://127.0.0.1:8765/transcribe",
    files={"audio": ("audio.pt", buffer, "application/octet-stream")},
    data={"sample_rate": "16000"},
)
print(response.json()["text"])
Raw PCM bytes are common in real-time audio pipelines:
import numpy as np
import requests

audio = np.random.randn(16000 * 3).astype(np.float32)  # 3 seconds
raw_bytes = audio.tobytes()

response = requests.post(
    "http://127.0.0.1:8765/transcribe",
    files={"audio": ("audio.raw", raw_bytes, "application/octet-stream")},
    data={
        "audio_format": "pcm",
        "sample_rate": "16000",
        "dtype": "float32",  # also supports: int16, int32, float64
    },
)
print(response.json()["text"])
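When NumPy is not available, raw PCM can also be packed with the standard library alone. The sketch below builds one second of a 440 Hz tone as 16-bit mono samples and prepares the matching form fields. The helper name `make_int16_pcm` is hypothetical, and little-endian byte order is an assumption (it matches what `tobytes()` produces on common hardware, but the guide does not state what the server expects).

```python
import math
import struct


def make_int16_pcm(frequency_hz=440.0, seconds=1.0, sample_rate=16000):
    """Return a sine tone as little-endian 16-bit mono PCM bytes."""
    n_samples = int(seconds * sample_rate)
    samples = [
        # Half of full scale keeps every value safely inside the int16 range.
        int(0.5 * 32767 * math.sin(2 * math.pi * frequency_hz * i / sample_rate))
        for i in range(n_samples)
    ]
    # "<h" = little-endian signed 16-bit; the byte order is an assumption.
    return struct.pack(f"<{n_samples}h", *samples)


pcm_bytes = make_int16_pcm()

# Form fields matching the PCM example above, with dtype switched to int16.
form_fields = {"audio_format": "pcm", "sample_rate": "16000", "dtype": "int16"}

# To send (needs the requests package and a running server):
# requests.post(
#     "http://127.0.0.1:8765/transcribe",
#     files={"audio": ("audio.raw", pcm_bytes, "application/octet-stream")},
#     data=form_fields,
# )
```

The `dtype` field must describe the bytes actually sent; int16 samples halve the upload size compared with float32.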
For sending everything as JSON with no multipart form:
import base64
import io

import numpy as np
import requests

audio = np.random.randn(16000 * 5).astype(np.float32)

buffer = io.BytesIO()
np.save(buffer, audio)
b64_data = base64.b64encode(buffer.getvalue()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:8765/transcribe/raw",
    json={
        "audio_data": b64_data,
        "audio_format": "numpy",
        "sample_rate": 16000,
    },
)
print(response.json()["text"])
You can also send a complete audio file as base64:
import base64

import requests

with open("my_audio.mp3", "rb") as f:
    b64_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:8765/transcribe/raw",
    json={
        "audio_data": b64_data,
        "audio_format": "file",
    },
)
print(response.json()["text"])
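Both base64 examples build the same JSON shape, so the payload construction can be factored out. A minimal sketch, assuming only the three JSON keys shown above (`audio_data`, `audio_format`, `sample_rate`); the helper name `build_raw_payload` is hypothetical.

```python
import base64
import json


def build_raw_payload(audio_bytes, audio_format="file", sample_rate=None):
    """Build the JSON body for POST /transcribe/raw from in-memory bytes."""
    payload = {
        "audio_data": base64.b64encode(audio_bytes).decode("ascii"),
        "audio_format": audio_format,
    }
    # Only include sample_rate when the caller supplies it (raw inputs).
    if sample_rate is not None:
        payload["sample_rate"] = int(sample_rate)
    return payload


# Placeholder bytes stand in for real audio file contents.
payload = build_raw_payload(b"\x00\x01\x02\x03")
body = json.dumps(payload)  # roughly what requests sends for json=payload
```

Base64 encodes every 3 bytes as 4 characters, so the request body is about one third larger than the audio itself; for large files the multipart upload on /transcribe avoids that overhead.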
Every setting is optional. If you omit a value, the server uses whatever was configured in the GUI when Server Mode was turned on.
| Parameter | Type | Description | Values |
|---|---|---|---|
| model | string | Name of the Whisper checkpoint | "large-v3", "large-v3-turbo", "medium", "medium.en", "small", "small.en", "base", "base.en", "tiny", "tiny.en", "distil-whisper-large-v3", "distil-whisper-medium.en", "distil-whisper-small.en" |
| quantization | string | CTranslate2 compute_type. Which pre-converted variant of the model to load. | "float32", "float16", "bfloat16", "int8", "int8_float16", "int8_bfloat16", "int8_float32" |
| device | string | CPU or GPU | "cuda", "cpu" |
| language | string | ISO 639-1 language code. Omit or leave empty to auto-detect. | "en", "fr", "es", "de", "zh", … (99 Whisper languages) |
| task_mode | string | Transcribe in source language or translate to English. | "transcribe", "translate" |
| include_timestamps | boolean | Include segment timestamps in the response. When false, the server calls faster-whisper with without_timestamps=True and returns segments: []. | "true", "false" |
| word_timestamps | boolean | Forwarded to faster-whisper. Enables word-level timings inside each segment (still returned at the segment level in the response). | "true", "false" |
| beam_size | integer | Number of beams for decoding. Higher = more accurate, slower. | 1–20 (default 5) |
| vad_filter | boolean | Run Silero VAD before decoding. Forced on when batch_size>1. | "true", "false" |
| condition_on_previous_text | boolean | Use previous segment output as context for the next segment. Helps coherence but can propagate hallucinations. | "true", "false" |
| batch_size | integer | When >1, uses BatchedInferencePipeline with tuned VAD params and processes VAD-chunked speech in parallel. | 1–128 (default 1) |
| audio_format | string | Override input format auto-detection | "auto", "file", "numpy", "tensor", "pcm" |
| sample_rate | integer | Sample rate of raw audio input (resampled to 16 kHz) | e.g. "16000", "22050", "44100", "48000" |
| dtype | string | Data type for raw PCM input | "float32", "float64", "int16", "int32" |
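Multipart form fields travel as strings, which is why the examples in this guide pass "true" and "5" rather than True and 5. A small sketch of a converter that lets calling code keep native Python types; the helper name `to_form_fields` is hypothetical, and dropping `None` relies on the documented behaviour that omitted settings fall back to the GUI configuration.

```python
def to_form_fields(**settings):
    """Convert Python values to the string form fields used by /transcribe.

    None values are dropped so the server falls back to its GUI configuration.
    """
    fields = {}
    for name, value in settings.items():
        if value is None:
            continue
        # Check bool before anything else: str(True) would give "True",
        # while the values documented above are lowercase.
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields(
    model="large-v3",
    beam_size=5,
    include_timestamps=True,
    language=None,  # omitted from the request, so the server auto-detects
)
```

The result can be passed directly as `data=fields` in `requests.post`.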
import requests

with open("lecture.mp3", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8765/transcribe",
        files={"audio": ("lecture.mp3", f, "audio/mpeg")},
        data={
            "model": "large-v3",
            "quantization": "float16",
            "device": "cuda",
            "task_mode": "transcribe",
            "language": "en",
            "include_timestamps": "true",
            "beam_size": "5",
            "vad_filter": "true",
            "batch_size": "8",
        },
    )

result = response.json()
print(result["text"])
print(f"Detected language: {result['language']}")
print(f"Took {result['processing_time_seconds']} seconds")
When batch_size > 1, the server forces vad_filter=true and applies tuned VAD parameters (threshold, min/max speech duration, speech_pad_ms) needed by faster-whisper's BatchedInferencePipeline. You cannot disable VAD while batching.
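A client can mirror this rule, together with the documented ranges for beam_size (1 to 20) and batch_size (1 to 128), so that the settings it logs match what the server will actually do. This is a sketch of client-side bookkeeping only; the function name `resolve_decoding_options` is hypothetical and the server remains the authority.

```python
def resolve_decoding_options(vad_filter, beam_size=5, batch_size=1):
    """Validate ranges from the settings table and apply the batching rule."""
    if not 1 <= beam_size <= 20:
        raise ValueError(f"beam_size must be between 1 and 20, got {beam_size}")
    if not 1 <= batch_size <= 128:
        raise ValueError(f"batch_size must be between 1 and 128, got {batch_size}")
    # Documented rule: batching forces VAD on regardless of the request.
    effective_vad = True if batch_size > 1 else bool(vad_filter)
    return {
        "beam_size": beam_size,
        "batch_size": batch_size,
        "vad_filter": effective_vad,
    }


options = resolve_decoding_options(vad_filter=False, batch_size=8)
```

Here `options["vad_filter"]` is True even though the caller asked for False, which is the behaviour described above.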
Every transcription request returns a JSON object. Here is what a real response looks like:
// POST /transcribe — include_timestamps defaults to false
{
"text": "Good morning everyone. Today we'll be discussing the quarterly results and our plans for the upcoming product launch.",
"segments": [],
"language": "en",
"duration": 138.135,
"task": "transcribe",
"model_used": "large-v3 - float16",
"processing_time_seconds": 2.418
}
When timestamps are off, segments is always an empty list [].
// POST /transcribe — include_timestamps=true
{
"text": "Good morning everyone. Today we'll be discussing the quarterly results and our plans for the upcoming product launch.",
"segments": [
{
"start": 0.081,
"end": 4.862,
"text": " Good morning everyone. Today we'll be discussing"
},
{
"start": 4.862,
"end": 8.241,
"text": " the quarterly results and our plans for the upcoming product launch."
}
],
"language": "en",
"duration": 138.135,
"task": "transcribe",
"model_used": "large-v3 - float16",
"processing_time_seconds": 3.012
}
Each segment corresponds to one Whisper decoding segment (VAD-chunked if vad_filter is on). Timestamps are in seconds with millisecond precision (3 decimal places).
| Field | Type | Always Present | Description |
|---|---|---|---|
| text | string | Yes | The complete transcription as a single newline-joined string. |
| segments | array | Yes | Timestamped segments. Empty [] when include_timestamps is false. |
| segments[].start | float | - | Segment start time in seconds. |
| segments[].end | float | - | Segment end time in seconds. |
| segments[].text | string | - | The transcribed words within this time range (faster-whisper leaves a leading space). |
| language | string | Yes | Detected (or echoed) ISO language code. |
| duration | float | Yes | Duration of the processed audio, in seconds. |
| task | string | Yes | "transcribe" or "translate". |
| model_used | string | Yes | Full model key used, e.g., "large-v3 - float16". |
| processing_time_seconds | float | Yes | How long the transcription took (excludes network transfer). |
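For larger programs it can help to turn the response dictionary into typed objects. The sketch below maps the fields from the table onto dataclasses and adds a derived speed figure. The class names and the `realtime_factor` property are illustrative additions, not something the server returns.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float
    end: float
    text: str


@dataclass
class TranscriptionResult:
    text: str
    language: str
    duration: float
    task: str
    model_used: str
    processing_time_seconds: float
    segments: list = field(default_factory=list)

    @classmethod
    def from_json(cls, data):
        """Build a result from the JSON object returned by /transcribe."""
        return cls(
            text=data["text"],
            language=data["language"],
            duration=data["duration"],
            task=data["task"],
            model_used=data["model_used"],
            processing_time_seconds=data["processing_time_seconds"],
            segments=[
                Segment(s["start"], s["end"], s["text"])
                for s in data.get("segments", [])
            ],
        )

    @property
    def realtime_factor(self):
        """Seconds of audio processed per second of compute."""
        if self.processing_time_seconds <= 0:
            return float("inf")
        return self.duration / self.processing_time_seconds


# Values taken from the first sample response above (text shortened).
sample = {
    "text": "Good morning everyone.",
    "segments": [],
    "language": "en",
    "duration": 138.135,
    "task": "transcribe",
    "model_used": "large-v3 - float16",
    "processing_time_seconds": 2.418,
}
parsed = TranscriptionResult.from_json(sample)
```

With the sample numbers, `parsed.realtime_factor` comes out at roughly 57, meaning the audio was processed about 57 times faster than real time.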
import requests

with open("meeting.mp3", "rb") as f:
    r = requests.post(
        "http://127.0.0.1:8765/transcribe",
        files={"audio": ("meeting.mp3", f)},
        data={"include_timestamps": "true"},
    )

result = r.json()
print("Transcript:", result["text"])
print(f"Detected: {result['language']}, duration {result['duration']:.1f}s")
print(f"Model: {result['model_used']}")
print(f"Task: {result['task']}, took {result['processing_time_seconds']:.1f}s")

for seg in result["segments"]:
    print(f"[{seg['start']:07.3f} → {seg['end']:07.3f}] {seg['text'].strip()}")
The API always returns JSON. If you need subtitle format, convert the segments yourself:
def seconds_to_srt_time(s):
    # Work in whole milliseconds so rounding can never yield "60,000" seconds.
    total_ms = int(round(s * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    sec, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{sec:02d},{ms:03d}"


def to_srt(segments):
    lines = []
    for i, seg in enumerate(segments, 1):
        lines.append(f"{i}")  # cue number, starting at 1
        lines.append(f"{seconds_to_srt_time(seg['start'])} --> {seconds_to_srt_time(seg['end'])}")
        lines.append(seg["text"].strip())  # drop faster-whisper's leading space
        lines.append("")  # blank line separates cues
    return "\n".join(lines)


srt_text = to_srt(result["segments"])
print(srt_text)
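The same segments can be converted to WebVTT, which browsers accept for HTML5 video captions. WebVTT differs from SRT in three ways: the file starts with a `WEBVTT` header line, cue numbers are optional, and the millisecond separator is a dot instead of a comma. The function names below are illustrative, and the sample segments are the ones from the response example earlier in this guide.

```python
def seconds_to_vtt_time(s):
    """Format seconds as HH:MM:SS.mmm, rounding to whole milliseconds first."""
    total_ms = int(round(s * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    sec, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{sec:02d}.{ms:03d}"


def to_vtt(segments):
    """Render API segments as a WebVTT document."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for seg in segments:
        lines.append(f"{seconds_to_vtt_time(seg['start'])} --> {seconds_to_vtt_time(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)


sample_segments = [
    {"start": 0.081, "end": 4.862, "text": " Good morning everyone. Today we'll be discussing"},
    {"start": 4.862, "end": 8.241, "text": " the quarterly results and our plans for the upcoming product launch."},
]
vtt_text = to_vtt(sample_segments)
```

As with SRT, this only works when the request was made with include_timestamps set to true; otherwise segments is empty and the output contains just the header.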
This backend uses faster-whisper, which loads CTranslate2-converted Whisper checkpoints. Models are pulled on-demand from HuggingFace under ctranslate2-4you/whisper-<model>-ct2-<quantization> (or ctranslate2-4you/distil-whisper-<model>-ct2-<quantization> for the Distil variants).
| Model family | English-only? | Translation? | Notes |
|---|---|---|---|
| large-v3 / large-v3-turbo | No | Yes | Multilingual; turbo is a fine-tune with fewer decoder layers. |
| medium / small / base / tiny | No | Yes | Multilingual; smaller = faster + less VRAM / RAM. |
| medium.en / small.en / base.en / tiny.en | Yes | No | English-only. Slightly better English accuracy at the same size. Pass language="en". |
| distil-whisper-large-v3 / medium.en / small.en | Varies | No | Distilled variants: faster, smaller, English-focused. |
The .en models and all Distil variants do not support translation. If you request task_mode="translate" against one of these models, faster-whisper raises an error and the server returns HTTP 500.
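To avoid that HTTP 500, a client can check the model's capabilities before asking for a translation. The sketch below assumes the /models response shape documented in this guide (a dictionary keyed by model name with a `supports_translation` flag); the function names are hypothetical, and the inline dictionary stands in for a real `requests.get(".../models").json()` call.

```python
def can_translate(models, model_name):
    """Return True only if the model is listed and flagged as translation-capable."""
    info = models.get(model_name)
    return bool(info and info.get("supports_translation"))


def choose_task(models, model_name, want_translation):
    """Return the task_mode value, refusing translation on unsupported models."""
    if want_translation and not can_translate(models, model_name):
        raise ValueError(f"Model {model_name!r} does not support translation")
    return "translate" if want_translation else "transcribe"


# Stand-in for the /models response, trimmed to two entries.
models = {
    "large-v3": {"name": "large-v3", "supports_translation": True},
    "medium.en": {"name": "medium.en", "supports_translation": False},
}
task = choose_task(models, "large-v3", want_translation=True)
```

Failing early on the client gives a clearer message than the generic 500 and avoids uploading the audio for a request that cannot succeed.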
import requests
r = requests.get("http://127.0.0.1:8765/health")
print(r.json())
Returns:
{
"status": "ok"
}
r = requests.get("http://127.0.0.1:8765/status")
status = r.json()
print(status)
Returns (when idle):
{
"server_running": true,
"queue_depth": 0,
"transcription_active": false
}
Returns (while processing one request with two more waiting):
{
"server_running": true,
"queue_depth": 2,
"transcription_active": true
}
| Field | Type | Description |
|---|---|---|
| server_running | bool | Always true (if the server weren't running, the request would fail). |
| queue_depth | int | Number of requests waiting in line. 0 means no queue. |
| transcription_active | bool | true if a transcription is currently being processed. |
Example — wait until the server is free before submitting:
import time

import requests

while True:
    status = requests.get("http://127.0.0.1:8765/status").json()
    if status["queue_depth"] == 0 and not status["transcription_active"]:
        break
    print(f"Server busy (queue: {status['queue_depth']}), waiting...")
    time.sleep(2)

print("Server is free, submitting...")
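The loop above waits forever if the server never becomes idle. A variant with a deadline is sketched below. The status source is passed in as a function so the logic can be exercised without a server; the name `wait_until_idle` and its parameters are illustrative.

```python
import time


def wait_until_idle(get_status, timeout=60.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the server is idle; return False on timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status["queue_depth"] == 0 and not status["transcription_active"]:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with canned /status responses instead of real HTTP calls.
busy = {"server_running": True, "queue_depth": 2, "transcription_active": True}
idle = {"server_running": True, "queue_depth": 0, "transcription_active": False}
canned = iter([busy, busy, idle])
became_idle = wait_until_idle(lambda: next(canned), sleep=lambda seconds: None)
```

Against a real server, pass `lambda: requests.get("http://127.0.0.1:8765/status").json()` as `get_status`. Another client can still submit work between the last poll and your request, so treat the result as a hint rather than a reservation.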
r = requests.get("http://127.0.0.1:8765/models")
models = r.json()
print(models)
Returns a dictionary keyed by model name:
{
"large-v3": {
"name": "large-v3",
"supports_translation": true
},
"medium.en": {
"name": "medium.en",
"supports_translation": false
},
"distil-whisper-large-v3": {
"name": "distil-whisper-large-v3",
"supports_translation": false
}
// ... etc
}
The /models endpoint returns one entry per base model name, not one entry per (name, quantization) pair. Use the quantization field on /transcribe to pick the precision.
Example: list all models that support translation:
import requests

models = requests.get("http://127.0.0.1:8765/models").json()
for name, info in models.items():
    if info["supports_translation"]:
        print(name)
The server processes one transcription at a time (GPU is a shared resource). If you send multiple requests simultaneously, they are placed in a queue and processed in order. Each client waits for its own result — you don't need to poll.
import threading

import requests


def transcribe(file_path):
    with open(file_path, "rb") as f:
        r = requests.post(
            "http://127.0.0.1:8765/transcribe",
            files={"audio": (file_path, f)},
        )
    print(f"{file_path}: {r.json()['text'][:80]}...")


threads = [
    threading.Thread(target=transcribe, args=("file1.mp3",)),
    threading.Thread(target=transcribe, args=("file2.mp3",)),
    threading.Thread(target=transcribe, args=("file3.mp3",)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
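When the results need to be collected rather than printed, `concurrent.futures` is a common alternative to raw threads. The sketch below takes the per-file function as a parameter so it can be tried with a stand-in; `transcribe_many` and `_fake_transcribe` are hypothetical names, and in real use the per-file function would post to /transcribe as in the example above.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_many(paths, transcribe_one, max_workers=4):
    """Run transcribe_one(path) concurrently; map each path to its result or exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(transcribe_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                results[path] = exc
    return results


def _fake_transcribe(path):
    """Stand-in for a real HTTP call, used only for this demo."""
    return f"transcript of {path}"


demo = transcribe_many(["file1.mp3", "file2.mp3"], _fake_transcribe)
```

Because the server handles one transcription at a time, more workers do not speed up the transcription itself; they only let uploads and queueing overlap.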
# Health check
curl http://127.0.0.1:8765/health
# Transcribe a file
curl -F "audio=@my_audio.mp3" http://127.0.0.1:8765/transcribe
# Transcribe with settings
curl -F "audio=@my_audio.mp3" \
-F "model=large-v3" \
-F "quantization=float16" \
-F "task_mode=transcribe" \
-F "language=en" \
-F "include_timestamps=true" \
-F "beam_size=5" \
http://127.0.0.1:8765/transcribe
All error responses return a JSON object with a detail field explaining what went wrong.
// Invalid model name
{"detail": "Unknown model 'FakeModel'. Available: ['tiny', 'tiny.en', 'base', ...]"}
// Empty audio
{"detail": "Empty audio data"}
// Unreadable audio format
{"detail": "Failed to process audio: [decoder error details]"}
{
"detail": [
{
"type": "missing",
"loc": ["body", "audio"],
"msg": "Field required",
"input": null
}
]
}
For validation errors (HTTP 422), such as the missing audio field above, detail is an array of error objects rather than a string. Each has loc, msg, and type.
// CUDA OOM
{"detail": "Transcription failed: CUDA out of memory. Tried to allocate 512.00 MiB..."}
// Translate on an .en / Distil model
{"detail": "Transcription failed: This model is English-only and does not support translation."}
{"detail": "Server shutting down"}
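Because detail can be either a string or an array of objects, client code usually needs one place that turns both into a readable message. A minimal sketch, using only the shapes shown in the examples above; the function name `describe_error` is hypothetical.

```python
def describe_error(body):
    """Turn an error response body into a single human-readable string."""
    detail = body.get("detail", "Unknown error")
    if isinstance(detail, str):
        return detail
    if isinstance(detail, list):
        parts = []
        for err in detail:
            # loc is a path such as ["body", "audio"]; join it with dots.
            location = ".".join(str(p) for p in err.get("loc", []))
            parts.append(f"{location}: {err.get('msg', 'invalid')}")
        return "; ".join(parts)
    return str(detail)


validation_body = {
    "detail": [
        {"type": "missing", "loc": ["body", "audio"], "msg": "Field required", "input": None}
    ]
}
message = describe_error(validation_body)
```

For the validation example, `message` is "body.audio: Field required"; string details pass through unchanged.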
import requests


def transcribe_file(file_path, **settings):
    try:
        with open(file_path, "rb") as f:
            r = requests.post(
                "http://127.0.0.1:8765/transcribe",
                files={"audio": (file_path, f)},
                data=settings,
                timeout=300,
            )
    except requests.ConnectionError:
        print("Cannot connect: is the server running?")
        return None
    except requests.Timeout:
        print("Request timed out")
        return None

    if r.status_code == 200:
        return r.json()

    error = r.json()
    detail = error.get("detail", "Unknown error")
    if r.status_code == 400:
        print(f"Bad request: {detail}")
    elif r.status_code == 422:
        # For 422, detail is a list of validation error objects.
        for err in detail:
            print(f"Validation error at {err['loc']}: {err['msg']}")
    elif r.status_code == 500:
        print(f"Server error: {detail}")
    elif r.status_code == 503:
        print("Server is shutting down, try again later")
    return None


result = transcribe_file("meeting.mp3", model="large-v3", language="en")
if result:
    print(result["text"])
| Code | detail Type | Meaning | What To Do |
|---|---|---|---|
| 200 | - | Success | Use result["text"] and result["segments"] |
| 400 | string | Bad input | Fix model, language, task, or audio |
| 422 | array | Missing field | Check that audio is included |
| 500 | string | Model error | Check VRAM, language/model compatibility |
| 503 | string | Server stopping | Wait and retry |
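The "wait and retry" advice for 503 can be sketched as a small wrapper. Treating only 503 as retryable follows the table above; the attempt count, the linearly growing delay, and the name `post_with_retry` are illustrative choices. The request is supplied as a function so the wrapper can be tried with fake responses.

```python
import time
from types import SimpleNamespace

RETRYABLE_STATUS = {503}


def post_with_retry(send, attempts=3, delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send must return an object with a status_code attribute.
    """
    response = None
    for attempt in range(1, attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt < attempts:
            sleep(delay * attempt)  # wait a little longer after each failure
    return response


# Demo with a fake sender: two 503 responses, then success.
_responses = iter([
    SimpleNamespace(status_code=503),
    SimpleNamespace(status_code=503),
    SimpleNamespace(status_code=200),
])
final = post_with_retry(lambda: next(_responses), sleep=lambda seconds: None)
```

With a real upload, `send` should reopen the audio file on every call, since a file object that has already been read will otherwise be sent empty on the retry.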
A full script that transcribes all .mp3 files in a folder:
import sys
from pathlib import Path

import requests

SERVER = "http://127.0.0.1:8765"
AUDIO_DIR = Path("./my_audio_files")

# Check that the server is running. A refused connection raises an exception
# instead of returning a status code, so handle both cases.
try:
    health = requests.get(f"{SERVER}/health", timeout=5)
    server_ok = health.status_code == 200
except requests.RequestException:
    server_ok = False
if not server_ok:
    print("Server is not running!")
    sys.exit(1)

for audio_file in sorted(AUDIO_DIR.glob("*.mp3")):
    print(f"Transcribing: {audio_file.name}...", end=" ", flush=True)
    with open(audio_file, "rb") as f:
        response = requests.post(
            f"{SERVER}/transcribe",
            files={"audio": (audio_file.name, f, "audio/mpeg")},
            data={"language": "en", "beam_size": "5"},
        )
    if response.status_code == 200:
        result = response.json()
        text = result["text"]
        duration = result["duration"]
        speed = result["processing_time_seconds"]
        # Write the transcript next to the audio file, e.g. talk.mp3 -> talk.txt
        output_file = audio_file.with_suffix(".txt")
        output_file.write_text(text, encoding="utf-8")
        print(f"Done ({duration:.1f}s audio in {speed:.1f}s)")
    else:
        print(f"Failed: {response.json().get('detail', 'Unknown error')}")
Faster-Whisper Transcriber — Server API Guide