Whisper Local Scribe

100% offline audio-to-text · no upload · no server · no API key · runs on your GPU via WebGPU

OpenAI Whisper · Transformers.js · ONNX Runtime · your audio never leaves your device

▸ Audio / Video File

🎙️

Drop audio or video file here

or click to browse · MP3, WAV, OGG, FLAC, M4A, MP4, WEBM · no size limit*

* very large files (>30 min) may require significant RAM and time depending on your hardware

▸ FAQ — Frequently Asked Questions

Does my audio get uploaded anywhere?

No, never. Everything runs entirely in your browser. The Whisper model weights are downloaded once from the Hugging Face CDN and cached locally. After that, all transcription inference happens directly on your CPU or GPU. No audio, text, or metadata ever leaves your machine. You could disconnect from the internet after the model loads and transcription would still work.

What is WebGPU and why does it matter?

WebGPU is a modern browser API that gives JavaScript direct, low-overhead access to your graphics card. For neural network inference, GPUs are dramatically faster than CPUs: a 2-minute audio clip that takes 45 seconds on WASM (CPU) may finish in 5–10 seconds with WebGPU. At the time of writing, it is supported in Chrome 113+ and Edge 113+, and behind a flag in Firefox and Safari.
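As a rough illustration of choosing between the WebGPU and WASM paths, the sketch below checks for `navigator.gpu` and then asks for an adapter. It is not this tool's actual detection code. The names `NavigatorLike`, `GpuLike`, and `pickBackend` are hypothetical, and the small interfaces exist only so the example compiles without WebGPU type definitions.

```typescript
// Minimal structural types (hypothetical) standing in for the browser's WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "wasm";

// Returns "webgpu" only when the API exists AND an adapter is actually available.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "wasm"; // API missing: unsupported browser or flag disabled
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // a null adapter means no usable GPU
  } catch {
    return "wasm"; // treat any failure as "fall back to the CPU path"
  }
}
```

In a browser this would be called as `pickBackend(navigator as NavigatorLike)`. Checking the adapter as well as the API matters because a browser can expose `navigator.gpu` and still return no adapter on unsupported hardware.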

What is OpenAI Whisper?

Whisper is a general-purpose speech recognition model released by OpenAI as open source (MIT license). It was trained on 680,000 hours of multilingual audio and supports transcription in 99 languages. This tool uses ONNX-exported versions of the Whisper Tiny, Base, and Small variants via Transformers.js — Hugging Face's JavaScript port of the Python transformers library.

Which model should I pick?

Whisper Tiny EN (~75 MB) — fastest download, great for short English clips and quick drafts. Ideal for most use cases.

Whisper Base EN (~150 MB) — noticeably more accurate than Tiny for English, still fast. Recommended for important transcriptions.

Whisper Tiny / Base / Small (multilingual) — use when your audio is not English. Small is the most accurate but takes ~245 MB to download and longer to run.

Models are cached after the first download, so subsequent loads are near-instant.

What audio formats are supported?

Any format your browser can decode via the Web Audio API: MP3, WAV, OGG, FLAC, M4A, AAC, OPUS. Video containers (MP4, WEBM, MKV) also work — the audio track is extracted automatically in-browser. If your file doesn't load, try converting it to WAV or MP3 first using the free Audio Editor on this site.
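Whisper models take 16 kHz mono audio as input, so decoded audio has to be reduced to a single channel at that rate before inference. The sketch below shows only the downmix step, by averaging channels. It assumes the per-channel samples have already been obtained in the browser (for example from `AudioBuffer.getChannelData` after `decodeAudioData`); how this tool actually does it is not stated, and `downmixToMono` is a hypothetical name.

```typescript
// Average all channels into one mono track.
// Assumes every channel has the same length, as channels of one AudioBuffer do.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

One way to get the 16 kHz rate without a hand-written resampler is to create the decoding context with `new AudioContext({ sampleRate: 16000 })`, since `decodeAudioData` resamples to the context's sample rate.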

What is an SRT file?

SRT (SubRip Subtitle) is the most widely used subtitle format. It contains numbered captions with start and end timestamps, compatible with VLC, YouTube, Premiere Pro, DaVinci Resolve, and virtually any video player. When you download the .srt from this tool, you can drag it alongside your video in any editor to get instant subtitles without paying a transcription service.
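To make the format concrete, here is a minimal sketch that turns timestamped segments into SRT text. The `Segment` shape and the function names are assumptions for illustration, not this tool's internals; the output layout (a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the caption text, then a blank line) is the standard SRT structure.

```typescript
// Hypothetical segment shape: times in seconds, plus the caption text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a number with zeros to a fixed width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp, e.g. 3661.5 becomes "01:01:01,500".
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build the full SRT document: numbered captions separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```

Note that SRT uses a comma, not a period, before the milliseconds, which is a common source of files that players refuse to load.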

How long can my audio file be?

There is no hard limit. Whisper Local Scribe processes audio in 30-second chunks with a 5-second overlap so that sentences spanning a chunk boundary are not cut off, which lets it handle recordings of any length. Very long files (>30 minutes) still require significant RAM, and the browser tab must stay open and active during processing. Hour-long recordings on WASM will be slow; use WebGPU where possible.
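The chunking scheme described here can be sketched as a simple window plan: each window is 30 seconds long and starts 25 seconds after the previous one, so neighbours share 5 seconds. This is an illustration under that reading of "5-second overlap", not the tool's actual code, and `planChunks` and `decodedPcmBytes` are hypothetical names. The second helper gives a rough sense of why long files need RAM, assuming the whole decoded recording is held in memory as 32-bit float samples.

```typescript
// A window of audio to transcribe, in seconds.
interface Chunk {
  start: number;
  end: number;
}

// Plan overlapping windows covering [0, durationS).
function planChunks(durationS: number, chunkS = 30, overlapS = 5): Chunk[] {
  if (overlapS >= chunkS) throw new Error("overlap must be shorter than the chunk");
  const step = chunkS - overlapS; // distance between consecutive window starts
  const chunks: Chunk[] = [];
  for (let start = 0; start < durationS; start += step) {
    const end = Math.min(start + chunkS, durationS);
    chunks.push({ start, end });
    if (end >= durationS) break; // this window already reaches the end of the audio
  }
  return chunks;
}

// Bytes needed to hold decoded audio as Float32 samples (4 bytes each).
function decodedPcmBytes(durationS: number, sampleRate: number, channels: number): number {
  return durationS * sampleRate * channels * 4;
}
```

For example, a 70-second clip yields three windows: 0–30 s, 25–55 s, and 50–70 s. Under the stated assumption, 30 minutes of 16 kHz mono audio occupies about 115 MB, and the same recording decoded at 44.1 kHz stereo occupies about 635 MB before any model memory is counted.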

Why does the first transcription take longer?

The first time you load a model, the weights are downloaded from the Hugging Face CDN (~75–245 MB depending on model). They are then stored in your browser cache (IndexedDB) so subsequent loads are near-instant. WebGPU also compiles shaders on first use, adding a few seconds of warm-up.

Does it work offline after first load?

Yes. Once the model is cached in your browser, Whisper Local Scribe works completely offline. You can disconnect from the internet, open the page, load the cached model, and transcribe as normal. The cache persists until you clear your browser data.

Related tools on this site

Audio Editor — cut, trim, fade, and normalize audio before transcribing.

Browser-Based AI LLM — chat with a small language model running 100% in your browser via WebGPU.

IBM Granite AI — IBM's open-source enterprise LLM running in your browser.

Translate — translate text to 100+ languages with TTS playback.