Speakr v0.8.19: What Actually Changed and Whether It’s Worth Upgrading Your Self-Hosted Transcription Stack

The Problem Speakr Is Solving (And Who It’s Actually For)

Cloud transcription pricing has a nasty compounding effect: you don’t notice it until you’re processing a backlog of three-hour interview recordings and the monthly invoice arrives. AssemblyAI and Deepgram both bill per audio-minute, which sounds reasonable until your podcast archive hits four figures in hours, or you’re running a legal firm that transcribes depositions daily. The per-minute cost isn’t the only concern — both services retain audio and transcripts on their infrastructure by default, which makes them non-starters for anything covered by attorney-client privilege, HIPAA-adjacent workflows, or simply recordings that contain information you’d rather not hand to a third party’s training pipeline.

Speakr sits at a specific point in the self-hosting stack: it’s not raw Whisper (which you’d have to wire up yourself), and it’s not a full enterprise transcription suite. What it actually gives you is a REST API in front of Whisper or faster-whisper, a persistent job queue so uploads don’t block, speaker diarization hooks via pyannote.audio, and a minimal web UI for one-off uploads. That’s the practical sweet spot — enough infrastructure to integrate into automation pipelines without requiring you to write the job management layer yourself. The v0.8.19 update tightens the faster-whisper integration specifically, which matters because faster-whisper’s CTranslate2 backend runs noticeably leaner on VRAM than stock Whisper for the same model size.

The operators who get the most out of this are running one of a few specific workflows: interview archives where a journalist or researcher needs searchable transcripts without sending recordings to a cloud service; internal meeting pipelines where the audio never leaves the LAN; podcast post-production where chapter markers and speaker labels feed downstream into editing tools; or medical and legal shops where data residency isn’t optional. If your audio is ephemeral and non-sensitive and you’re transcribing a few hours a month, paying Deepgram’s per-minute rate is probably fine. If any of those conditions don’t hold, you’re paying for a constraint you don’t need.

The diarization angle is worth calling out separately. Most self-hosted Whisper wrappers skip it or bolt it on poorly. Speakr exposes diarization as a job parameter rather than a post-processing afterthought, which means speaker labels come back in the same job response as the transcript rather than requiring a second API call or manual merge. For interview recordings with two or more speakers, that single design decision is what makes the output actually usable without additional scripting. For a broader look at how transcription fits into local automation pipelines, the guide on Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines is worth a read.

What v0.8.19 Actually Changed

The most dangerous change in v0.8.19 isn’t the one getting the most attention. The webhook payload schema quietly flipped segments[].start (and segments[].end) from milliseconds-as-integer to seconds-as-float. No deprecation warning, no migration guide in the changelog. Any downstream parser that reads those timestamps (an n8n HTTP node, a custom subtitle renderer) will continue to run without error and produce timestamps that are off by a factor of 1000. A segment that used to arrive as 500 now arrives as 0.5, so a parser that still divides by 1000 places it at 0.0005s, and you end up with subtitle files where every cue fires in the first few seconds. Check your parsers before upgrading if you have production flows depending on segment timing.
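For consumers written outside n8n, one way to survive a mixed fleet during the upgrade is to make the unit an explicit switch instead of guessing it from the value. The sketch below assumes only what is described above (integer milliseconds before v0.8.19, float seconds from v0.8.19 on); the function names and the legacy_ms flag are illustrative and not part of Speakr.

```python
def segment_seconds(value, *, legacy_ms):
    """Return a segment timestamp in seconds.

    legacy_ms=True  -> value is integer milliseconds (pre-v0.8.19 payloads)
    legacy_ms=False -> value is already float seconds (v0.8.19 payloads)
    """
    return value / 1000.0 if legacy_ms else float(value)


def normalize_segments(segments, *, legacy_ms):
    """Copy each segment with start/end converted to seconds."""
    return [
        {
            **seg,
            "start": segment_seconds(seg["start"], legacy_ms=legacy_ms),
            "end": segment_seconds(seg["end"], legacy_ms=legacy_ms),
        }
        for seg in segments
    ]
```

Keeping the flag explicit (for example, driven by the Speakr version you deploy) avoids type-sniffing heuristics that break on values such as 0 or 10.0.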

The switch to faster-whisper as the default backend is a genuine win for cold-start latency, but the changelog undersells what it means for VRAM. faster-whisper uses CTranslate2 under the hood, which loads weights in a more compressed format. On my 32GB box, large-v3 via the original OpenAI Whisper backend was sitting around 10GB VRAM resident after the first job. With faster-whisper as the default, the same model loads closer to 6.5GB and first-token latency on a 5-minute audio file drops noticeably. The trade-off: the CTranslate2 build adds a native dependency, and if you’re running a stripped-down container image, your first deploy after upgrading may fail with a missing shared library error rather than a clean startup message.

The new SPEAKR_BATCH_CONCURRENCY env var is one of those additions that sounds boring until you’ve had a GPU OOM-kill a transcription halfway through a 90-minute file. Previously, if three jobs hit the queue simultaneously, Speakr would spin up three parallel GPU contexts with no backpressure. On anything under 24GB VRAM, that’s a crash waiting to happen. Now you can actually enforce a ceiling:

# docker-compose.yml fragment
environment:
  SPEAKR_BATCH_CONCURRENCY: "2"   # hard cap at 2 parallel GPU jobs
  SPEAKR_QUEUE_TIMEOUT: "300"     # seconds before a queued job is dropped
  SPEAKR_BACKEND: "faster-whisper" # now the default, but explicit is better

Setting this to 1 is the safest starting point on a shared workstation. Set it to 2 only if you’ve confirmed your model footprint leaves headroom — faster-whisper’s lower VRAM floor makes this more viable than it was before. The queue will block and wait rather than fork a new process, so throughput goes down but reliability goes up. Worth it.
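To picture what a concurrency cap with backpressure means, here is a minimal sketch using a bounded semaphore. It is not Speakr’s implementation; run_with_cap and its arguments are hypothetical names, and the jobs are plain Python callables standing in for GPU transcriptions.

```python
import threading
import time


def run_with_cap(jobs, cap):
    """Run callables in threads, never more than `cap` at once.

    Returns the peak number of jobs that were running simultaneously.
    """
    slots = threading.BoundedSemaphore(cap)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(job):
        nonlocal active, peak
        with slots:  # blocks here while all slots are taken
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                job()
            finally:
                with lock:
                    active -= 1

    threads = [threading.Thread(target=worker, args=(job,)) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    fake_jobs = [lambda: time.sleep(0.05) for _ in range(5)]
    print("peak concurrency:", run_with_cap(fake_jobs, cap=2))
```

Extra jobs wait for a free slot instead of starting immediately, which is the throughput-for-reliability trade described above.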

Word-level timestamps being on by default (--word_timestamps true) is the right call for most real use cases, but you should know what you’re paying for. The latency overhead is real — budget roughly 15% extra wall-clock time on large-v3, more on longer files because the alignment pass scales with audio duration. If you’re running batch overnight jobs where timestamp granularity doesn’t matter, you can explicitly disable it to recover that time. But for anyone building subtitle pipelines, diarization prep, or anything that needs to sync text to a specific frame, having this on by default means your output is actually usable without a second post-processing pass. The segment-level timestamps were never precise enough for subtitle work; word-level timestamps are the minimum viable granularity for that use case.

Setup Walkthrough: Docker Compose with GPU Passthrough

The part that bites most people first isn’t the GPU config — it’s that Speakr v0.8.19 will happily pull a 3GB model file at container startup if it doesn’t find the expected cache layout inside /models. That download blocks the health-check endpoint, Docker marks the container unhealthy, and your orchestrator kills and restarts it before transcription ever runs. Pre-stage the model first, configure second.

Here’s a minimum viable docker-compose.yml that pins the image, passes the NVIDIA runtime through, and mounts both required volumes:

services:
  speakr:
    image: ghcr.io/speakr-oss/speakr:0.8.19   # pin — latest breaks on model path changes
    runtime: nvidia
    restart: unless-stopped
    volumes:
      - /mnt/models/speakr:/models             # pre-staged model files live here
      - /mnt/speakr-queue:/queue               # audio queue; survives container restarts
    environment:
      SPEAKR_MODEL: large-v3                   # which Whisper checkpoint to load
      SPEAKR_DEVICE: cuda                      # "cpu" is a valid fallback, ~8x slower
      SPEAKR_BATCH_CONCURRENCY: 2              # parallel decode slots; 1 if sharing GPU
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
    ports:
      - "8765:8765"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8765/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s   # give it time if model IS loading cold — longer than default
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

SPEAKR_BATCH_CONCURRENCY is the variable that matters most on a shared-GPU box. On my 32GB VRAM workstation running Ollama alongside Speakr, I keep this at 1 during Ollama’s active window and bump it to 2 overnight. At 3, large-v3 plus a loaded Ollama model will OOM without warning. The model selection directly determines how much headroom you have:

  • large-v3: roughly 10GB VRAM resident once loaded. Best word-error rate on accented speech, technical vocabulary, mixed-language audio. Use this if Speakr is your primary GPU tenant.
  • medium.en: roughly 5GB VRAM. English-only, noticeably faster per file; WER degrades on non-native speakers and proper nouns. Good middle ground when sharing with a 7B Ollama model.
  • small: roughly 2GB VRAM. Latency is fast enough to feel real-time on short clips. WER on anything with background noise or strong accents is bad enough to require a post-processing correction pass if accuracy matters.

To avoid the startup-download failure, pre-stage the model using huggingface-cli directly into your mounted path before the container ever starts:

# install once if you don't have it
pip install huggingface_hub

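# download large-v3 into the exact directory Speakr expects
# note: openai/whisper-large-v3 is the original-format checkpoint; faster-whisper
# itself loads CTranslate2 conversions (e.g. Systran/faster-whisper-large-v3), so
# confirm which layout your configured backend reads before relying on this path
# note: --local-dir-use-symlinks is deprecated in recent huggingface_hub releases,
# where --local-dir already writes real files; keep it only on older versions
huggingface-cli download \
  openai/whisper-large-v3 \
  --local-dir /mnt/models/speakr/whisper-large-v3 \
  --local-dir-use-symlinks False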

# verify the weights file is there — Speakr checks for model.safetensors
ls -lh /mnt/models/speakr/whisper-large-v3/model.safetensors

The --local-dir-use-symlinks False flag matters: Hugging Face’s default symlink layout confuses Speakr’s model loader on first boot, and the error message it gives back (“model not found, downloading”) is misleading — the files are there, just not where it looks first.
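A small pre-flight check can catch the symlink and missing-file cases before the container starts. This is a sketch under assumptions: the required file name is taken from the description above (model.safetensors), check_staged_model is a hypothetical helper, and your backend may expect different files, so adjust the required tuple accordingly.

```python
from pathlib import Path


def check_staged_model(model_dir, required=("model.safetensors",)):
    """Return a list of problems found in a pre-staged model directory.

    An empty list means every required file exists as a real, non-empty file.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    for name in required:
        path = root / name
        if path.is_symlink():  # checked first: is_file() would follow the link
            problems.append(f"{name} is a symlink")
        elif not path.is_file():
            problems.append(f"{name} is missing")
        elif path.stat().st_size == 0:
            problems.append(f"{name} is empty")
    return problems


if __name__ == "__main__":
    for problem in check_staged_model("/mnt/models/speakr/whisper-large-v3"):
        print("problem:", problem)
```

Running it from the deployment script and aborting on a non-empty result avoids the misleading "model not found, downloading" path entirely.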

One more failure mode that shows up specifically on long audio files: Speakr’s upload endpoint accepts the file fine, but the transcription takes longer than your reverse proxy’s read timeout. The proxy closes the connection, the client gets a 504, and the transcription finishes successfully in the container — invisibly, with no way to retrieve the result through the normal response path. In Nginx, set this on the location block handling the upload:

location /api/transcribe {
    proxy_pass          http://127.0.0.1:8765;
    proxy_read_timeout  300s;   # 5 min — covers ~45min audio on large-v3
    proxy_send_timeout  120s;
    client_max_body_size 512M;  # Speakr's default upload limit; match this
}

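In Caddy the equivalent is reverse_proxy localhost:8765 { transport http { read_timeout 5m } }. Traefik has no per-route timeout middleware (the forwardAuth middleware and its authResponseHeadersRegex option deal with authentication headers, not timeouts). Its timeouts live in static configuration: transport.respondingTimeouts.readTimeout on the entrypoint governs how long Traefik waits to read the request, including a large upload body, and forwardingTimeouts.responseHeaderTimeout on the serversTransport governs how long it waits for the backend to start responding. Setting these at the entrypoint or transport level also avoids having to add labels to every container that shares that entrypoint.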
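Picking the timeout value is simple arithmetic once you know your processing speed. The sketch below assumes a GPU speedup of about 15x real time (the figure implied by a 1-hour file finishing in roughly 4 minutes on large-v3) and a 1.5x safety margin; both defaults, and the function name, are assumptions to replace with your own measurements.

```python
import math


def proxy_read_timeout_s(audio_minutes, speedup=15.0, margin=1.5, step=30):
    """Estimate a proxy read timeout in seconds for one synchronous transcription.

    audio_minutes: length of the longest file you expect to upload
    speedup:       how many times faster than real time the backend runs
    margin:        multiplier for queueing, model load, and slow disks
    step:          round the result up to a multiple of this many seconds
    """
    if audio_minutes <= 0 or speedup <= 0 or margin <= 0 or step <= 0:
        raise ValueError("all arguments must be positive")
    processing_s = audio_minutes * 60.0 / speedup
    return int(math.ceil(processing_s * margin / step) * step)


if __name__ == "__main__":
    for minutes in (20, 45, 90):
        print(minutes, "min audio ->", proxy_read_timeout_s(minutes), "s")
```

With those defaults a 45-minute file lands at 270 seconds, which is why a 300s proxy_read_timeout covers it with little to spare; a 90-minute file would need the timeout raised.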

Real Resource Costs on a 32GB VRAM Workstation

The VRAM split between Whisper and your LLM is the first thing you need to model before committing to a configuration. On my 32GB workstation, large-v3 via faster-whisper holds roughly 10GB VRAM resident while a transcription job is active — not allocated at startup, but from the moment the first audio chunk hits the model. That leaves ~22GB for Ollama, which is workable for a 13B or 34B quant as long as you’re not triggering both simultaneously. The failure mode is subtle: if an n8n flow kicks off a transcription job while an LLM inference request is mid-generation, you’ll hit fragmented VRAM pressure rather than a clean OOM — the transcription stalls or the LLM inference slows to a crawl without an obvious error. The fix is serializing at the orchestration layer, not the model layer.

If your workload actually needs concurrent transcription rather than LLM coexistence, the concurrency math works out cleaner with a smaller model. medium.en runs at roughly 5GB per worker, so:

# docker-compose or .env
SPEAKR_MODEL=medium.en
SPEAKR_BATCH_CONCURRENCY=2
# Total Speakr VRAM footprint: ~10-11GB
# Leaves ~21GB for a 7B Q4_K_M quant in Ollama (~4.5GB) with room to breathe

Two medium.en workers plus a 7B quantized model fits cleanly without serialization constraints. The accuracy delta between medium.en and large-v3 is noticeable on accented speech and domain-specific vocabulary, but for English interview recordings or meeting audio with clear speakers, medium.en is good enough that you won’t fight the output downstream.
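The same budgeting can be written down as a small calculation. This is a rough sketch, not a guarantee: the per-worker figures are the approximate footprints quoted in this article, the 2GB headroom default is an assumption, and real usage also depends on fragmentation and batch size.

```python
def max_workers(total_gb, per_worker_gb, other_tenants_gb=0.0, headroom_gb=2.0):
    """Upper bound on parallel transcription workers that fit in VRAM.

    total_gb:         VRAM on the card
    per_worker_gb:    resident footprint of one loaded Whisper worker
    other_tenants_gb: VRAM held by other processes (e.g. an Ollama model)
    headroom_gb:      safety buffer left unallocated
    """
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    usable = total_gb - other_tenants_gb - headroom_gb
    return max(0, int(usable // per_worker_gb))


if __name__ == "__main__":
    # large-v3 (~10GB) next to a ~4.5GB 7B quant on a 32GB card
    print(max_workers(32, 10, other_tenants_gb=4.5))
    # large-v3 next to a ~20GB larger quant
    print(max_workers(32, 10, other_tenants_gb=20))
```

Treat the result as a ceiling for SPEAKR_BATCH_CONCURRENCY and start one below it.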

CPU fallback is more useful than it sounds, but only for specific scheduling patterns. Setting SPEAKR_DEVICE=cpu zeroes out VRAM usage entirely — the full GPU is free for LLM work around the clock. The cost is real: a 1-hour audio file that finishes in roughly 4 minutes on GPU takes ~35 minutes on CPU with large-v3. For overnight batch jobs where a PM2 cron queues up everything recorded during the day, that latency is irrelevant. For anything interactive — a user uploads a file and waits for a transcript — it’s not usable. The right pattern is a configurable SPEAKR_DEVICE per deployment profile rather than a single value baked into your compose file.

Disk is the resource most people forget to track until the model cache directory is suddenly 15GB. large-v3 alone sits at ~3GB on disk. The sharp edge in Speakr’s current behavior: changing SPEAKR_MODEL in your env and restarting pulls the new model but does not clean up the old one. After a week of experimentation across four or five model sizes, the cache directory accumulates all of them silently.

# Find your cache mount and audit it
# (sh -c makes the glob expand inside the container, not on the host)
docker exec speakr_app sh -c 'du -sh /root/.cache/huggingface/hub/models--Systran*/'
# or wherever SPEAKR_MODEL_CACHE_DIR points in your compose

# Manual prune example: remove one model's entire cache directory (host-side path of the cache mount)
rm -rf /data/speakr-cache/models--Systran--faster-whisper-medium/

There’s no speakr prune command in v0.8.19 — pruning is entirely manual. If you’re on a volume with a hard size limit, add a cleanup step to your deployment script whenever you change SPEAKR_MODEL, otherwise that volume silently fills and the next model pull fails mid-download with an unhelpful I/O error rather than a disk-space warning.
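Since pruning is manual, a cleanup step for the deployment script might look like the following sketch. It assumes the cache uses the Hugging Face hub layout (one models--org--name directory per model); prune_model_cache and its arguments are hypothetical, and it defaults to a dry run so nothing is deleted until you opt in.

```python
import shutil
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of real files under path (symlinks are skipped so blobs count once)."""
    return sum(
        f.stat().st_size
        for f in Path(path).rglob("*")
        if f.is_file() and not f.is_symlink()
    )


def prune_model_cache(cache_dir, keep, dry_run=True):
    """Remove every models--* directory whose name is not in `keep`.

    Returns a list of (directory_name, size_in_bytes) for what was
    removed, or what would be removed when dry_run is True.
    """
    removed = []
    for entry in sorted(Path(cache_dir).glob("models--*")):
        if not entry.is_dir() or entry.name in keep:
            continue
        removed.append((entry.name, dir_size_bytes(entry)))
        if not dry_run:
            shutil.rmtree(entry)
    return removed


if __name__ == "__main__":
    active = {"models--Systran--faster-whisper-large-v3"}
    for name, size in prune_model_cache("/data/speakr-cache", keep=active):
        print(f"would remove {name}: {size / 1e9:.1f} GB")
```

Run it once with dry_run=True after changing SPEAKR_MODEL, check the list, then rerun with dry_run=False.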

Connecting Speakr to n8n: Webhook Pipeline and the Timestamp Bug

The timestamp unit change in v0.8.19 is the kind of silent breakage that produces no errors, just subtitles that are off by a factor of a thousand. If you had a working pipeline before this update, the segment.start / 1000 conversion in any downstream Function node now divides a value that is already in seconds: Speakr used to return milliseconds, and this version returns seconds. The output still looks like a valid float, the workflow runs green, and your SRT file has timestamps like 00:00:00,010 where the audio says something at the ten-second mark. Delete the division entirely and treat the value as seconds from the start.
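If you render SRT outside n8n, the conversion from a seconds value to an SRT cue time is short enough to keep in one place. A minimal sketch, assuming the input is float seconds as v0.8.19 returns; srt_timestamp is an illustrative name.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT cue timestamp (HH:MM:SS,mmm)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


if __name__ == "__main__":
    print(srt_timestamp(10.0))    # a cue at the ten-second mark
    print(srt_timestamp(3661.5))  # just past the one-hour mark
```

Working in integer milliseconds internally avoids float rounding drifting into the millisecond field.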

The basic wiring is straightforward: an n8n HTTP Request node POSTs a binary audio file to /api/v1/jobs, Speakr returns a job ID immediately, and then you choose between polling and webhooks. Polling is simpler to debug but burns time on long files. The webhook path is cleaner — register a callback URL on the job creation POST, and Speakr will fire a POST back to an n8n Webhook node when processing finishes. The critical gotcha here is that Speakr fires on both completion and failure, and the payload shape differs. A failed job still hits your webhook endpoint with a job_status: "failed" field and no segments array. If your downstream nodes assume segments exists, they’ll throw a runtime error on any failed transcription job. The fix is a single IF node or a Function node guard before anything touches the segments array:

// n8n Function node — drop this before any segment processing
const payload = $input.first().json;

if (payload.job_status !== 'completed') {
  // surface the failure clearly rather than letting it blow up downstream
  throw new Error(`Speakr job failed: ${payload.job_id} — status: ${payload.job_status}`);
}

return $input.all();

On my own stack, the workflow starts with a folder watch — a Node.js script running under PM2 monitors a drop directory and triggers the n8n workflow via its own webhook endpoint when a new audio file lands. The n8n flow calls Speakr, waits for the completion webhook, then immediately pipes the segments array into a second HTTP Request node calling the local Ollama API (http://localhost:11434/api/generate) with the full transcript text and a summarization prompt. The summarization model I use for this is Mistral 7B — fast enough that the whole pipeline from drop to draft is under two minutes for a 20-minute audio file. The final step is a WordPress REST API call that creates a draft post with the summary as the body and the raw transcript stuffed into a custom field for reference.

// Reconstructing plain-text transcript from segments (post-v0.8.19)
// segment.start is already in seconds — no conversion needed
const segments = $input.first().json.segments;

const transcript = segments
  .map(seg => `[${seg.start.toFixed(2)}s] ${seg.text.trim()}`)
  .join('\n');

return [{ json: { transcript } }];
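For readers driving the summarization step from a script instead of an n8n HTTP Request node, the Ollama call can be sketched as below. The URL is the one mentioned above; the model tag "mistral" and the prompt wording are assumptions, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_summary_request(transcript, model="mistral"):
    """Build a non-streaming /api/generate request for a transcript summary."""
    body = {
        "model": model,
        "prompt": "Summarize the following transcript in a few paragraphs:\n\n" + transcript,
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(transcript, model="mistral", timeout=120):
    """Send the request and return the generated text (needs a running Ollama)."""
    request = build_summary_request(transcript, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Splitting request construction from the network call keeps the payload easy to inspect and test without a GPU in the loop.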

One practical note on the HTTP Request node configuration for the initial job POST: set the body content type to multipart/form-data and attach the binary audio item from earlier in the workflow using the Binary Data option, not base64. Speakr’s file size limits are enforced at the HTTP layer, and base64 inflates the payload by roughly a third, so a large WAV that would have been fine as raw binary can exceed the limit once encoded. Also, if you’re running multiple concurrent transcription workflows, register the webhook URL with a path that includes a correlation ID you generate before the POST (the job ID itself doesn’t exist until Speakr responds); otherwise all completions land on the same endpoint and you’ll need to demux them inside n8n based on the payload’s job ID field, which is doable but adds complexity you don’t need.
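The base64 penalty is easy to quantify: every 3 raw bytes become 4 encoded characters. The sketch below uses the 512M figure from the Nginx example as the limit; the function names are illustrative, and multipart framing overhead is ignored.

```python
def base64_size(raw_bytes):
    """Size in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_bytes_under_limit(limit_bytes):
    """Largest raw payload whose base64 encoding still fits in limit_bytes."""
    return (limit_bytes // 4) * 3


if __name__ == "__main__":
    limit = 512 * 1024 * 1024
    raw_max = max_raw_bytes_under_limit(limit)
    print(f"raw upload ceiling:    {limit / 2**20:.0f} MiB")
    print(f"base64 upload ceiling: {raw_max / 2**20:.0f} MiB of original audio")
```

Under a 512 MiB limit, base64 leaves room for only about 384 MiB of actual audio.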

Three Non-Obvious Behaviors Worth Knowing

The diarization support is the biggest gotcha. Speakr’s documentation uses language that implies speaker separation is a built-in capability, but what it actually ships is a forwarding layer. The SPEAKR_DIARIZATION_URL environment variable is not optional configuration for a bundled feature — it’s the entire implementation. If that variable isn’t set and pointing at a live pyannote-audio inference endpoint, diarization silently does nothing. You don’t get an error. The transcript just comes back without speaker labels and the UI gives no indication why. You have to run pyannote-audio as a separate service, expose it on HTTP, and wire the URL in. Budget another container and a model download (the pyannote speaker diarization pipeline pulls several hundred MB of weights) before you treat diarization as a working feature.

# docker-compose excerpt — pyannote sidecar wired to Speakr
  speakr:
    image: ghcr.io/speakr-oss/speakr:0.8.19   # same pinned image as the main compose file
    environment:
      # without this, diarization UI toggle does nothing
      SPEAKR_DIARIZATION_URL: http://pyannote:8000/diarize
    depends_on:
      - pyannote

  pyannote:
    image: your-pyannote-serving-image
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/torch  # reuse weight cache across restarts
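Because a missing or dead diarization endpoint produces no error, it can be worth failing loudly in the consumer. The sketch below assumes each segment carries its label in a field named "speaker"; that field name is hypothetical, so check it against a real payload from your deployment before using this.

```python
def missing_speaker_labels(segments, field="speaker"):
    """Return the indexes of segments that have no usable speaker label."""
    return [
        i
        for i, seg in enumerate(segments)
        if not str(seg.get(field) or "").strip()
    ]


def assert_diarized(payload, field="speaker"):
    """Raise if diarization was expected but no segment has a speaker label.

    Returns the indexes of individual unlabeled segments otherwise.
    """
    segments = payload.get("segments") or []
    missing = missing_speaker_labels(segments, field)
    if segments and len(missing) == len(segments):
        raise RuntimeError(
            "no segment has a speaker label; check SPEAKR_DIARIZATION_URL "
            "and that the diarization service is reachable"
        )
    return missing
```

Calling assert_diarized on every completed job that requested diarization turns the silent no-op into a visible workflow failure.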

The SQLite persistence problem will bite you the first time you update the container image. By default, Speakr writes its job database to a path inside the container filesystem. Pull a new image, spin up a new container, and everything — job history, completed transcripts, any stored results — is gone. The fix is a single volume mount, but it’s not called out prominently in the v0.8.19 release notes. The database path is /app/data/speakr.db. Mount that directory as a named volume or a bind mount and your job history survives restarts and image updates. Without it you’re running an amnesiac service.

  speakr:
    image: ghcr.io/speakr-oss/speakr:0.8.19   # same pinned image as the main compose file
    volumes:
      # this one line is the difference between persistent and ephemeral jobs
      - speakr_data:/app/data

volumes:
  speakr_data:

Language auto-detection is more expensive than the docs suggest. Leaving SPEAKR_LANGUAGE unset causes Speakr to run Whisper’s detection pass over the first 30 seconds of each audio file before transcription starts. On longer files that’s a minor tax, but on a queue of short clips it adds up fast. The more operationally annoying issue is accuracy: accented English — particularly South Asian, West African, and Australian regional accents — gets misidentified as Hindi, French, or Portuguese with enough regularity to matter. When that happens you get a transcript in the wrong language with no error surfaced, just garbage output. If your use case has a known input language, set it explicitly:

environment:
  SPEAKR_LANGUAGE: en   # ISO 639-1 code; skips detection pass entirely

The detection failure mode is particularly frustrating to debug because the job shows as completed with a non-zero confidence score — the model is confident, just confidently wrong about which language it’s transcribing. Pinning the language also gives you a small but consistent latency reduction on every job, so there’s no downside if your audio source is predictable.

When to Use Speakr vs. Alternatives

The honest decision point is resource allocation: Speakr earns its place when you need a managed queue in front of Whisper and don’t want to wire up Redis, a job worker, a REST layer, and a status-polling endpoint yourself. That plumbing is genuinely tedious, and Speakr ships it pre-assembled. If you’re running multiple services hitting the same GPU — say, an n8n flow triggering transcriptions alongside a separate ingestion pipeline — the job queue is what keeps you from blowing up VRAM with concurrent Whisper loads. That’s the actual value proposition, not the web UI.

For single-pipeline work, Speakr is overhead you don’t need. If you have one Python script, one cron job, or one Node process that needs transcription and nothing else will ever share that path, faster-whisper directly is cleaner:

# faster-whisper, no server, no queue, just a function call
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")

Same story with whisper.cpp — if you’re embedding transcription into a Go or C++ service and want a single binary with no Python runtime dependency, whisper.cpp with its server mode is the right call. Speakr’s REST API adds round-trip HTTP overhead and a process boundary that only pays off when you actually need isolation and queuing.

Cloud APIs (Deepgram, AssemblyAI) win on latency at low concurrency — full stop. A self-hosted Whisper large-v3 model on a GPU that’s also running inference for other workloads will not beat Deepgram’s Nova-2 turnaround time on a 5-minute file. The break-even is data locality: if you’re transcribing audio that can’t leave your infrastructure, or if your volume is high enough that per-minute API costs become meaningful, self-hosted makes sense. But if your SLA is “user uploads a file and sees a transcript in under 10 seconds,” a loaded local GPU is a liability, not an asset.

On v0.8.19 specifically: upgrade if you were hitting OOM crashes under concurrent load. The concurrency cap that shipped in this release is a direct fix for that failure mode, not a config tweak you can replicate yourself in older versions. Also upgrade if you need word-level timestamps in your output; that feature wasn’t stable before this release. Hold off if your downstream parser expects millisecond integers for timestamp values: v0.8.19 now returns float seconds in the response payload, so anything doing parseInt(segment.start) truncates the fractional part, and anything that still treats the value as milliseconds ends up off by a factor of 1000, all without raising an error. Check your consumer code before you roll it out.


Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.


Written by Eric Woo

Self-Hosted AI & Automation Engineer

Eric runs his own self-hosted stack: local LLM pipelines on Ollama with dual-model VRAM scheduling on a single 32GB workstation, n8n workflows in Docker, and a TypeScript automation engine that publishes to WordPress on cron. He writes about the systems he actually operates — configs, failure modes, and GPU bills included.
