Show HN: EdgeWhisper – On-device voice-to-text for macOS (Voxtral 4B via MLX) (edgewhisper.com)

2 points by raphaelmansuy 19 hours ago | 1 comment

I built a macOS voice dictation app where zero bytes of audio ever leave your machine.

EdgeWhisper runs Voxtral Mini 4B Realtime (Mistral AI, Apache 2.0) locally on Apple Silicon via the MLX framework. Hold a key, speak, release: the text appears at your cursor in whatever app has focus.

Architecture:
- Native Swift (SwiftUI + AppKit). No Electron.
- Voxtral 4B inference via MLX on the Apple Silicon GPU (MLX targets the CPU and GPU, not the Neural Engine). ~3GB model, runs in ~2GB RAM on M1+.
- Dual text injection: AXUIElement (preserves the undo stack) with an NSPasteboard+CGEvent fallback.
- 6-stage post-processing pipeline: filler removal → dictionary → snippets → punctuation → capitalization → formatting.
- Sliding window KV cache for unlimited streaming without latency degradation.
- Configurable transcription delay (240ms–2.4s). The sweet spot is 480ms.
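To illustrate the sliding window KV cache idea, here is a minimal Python sketch, not EdgeWhisper's actual code. The class name, the window size, and the string placeholders standing in for key/value tensors are all hypothetical. The point is that the cache keeps only the most recent `window` entries, so memory use and per-step attention cost stay bounded however long you keep talking.

```python
from collections import deque


class SlidingWindowKVCache:
    """Hypothetical sketch: keeps only the most recent `window` key/value entries."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) silently drops the oldest entry on append once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)
        self.total_steps = 0  # absolute position of the next entry in the stream

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        self.total_steps += 1

    def __len__(self):
        return len(self.keys)


# Placeholders stand in for per-step key/value tensors (assumption for the demo).
cache = SlidingWindowKVCache(window=4)
for step in range(10):
    cache.append(f"k{step}", f"v{step}")

print(len(cache))         # 4
print(list(cache.keys))   # ['k6', 'k7', 'k8', 'k9']
print(cache.total_steps)  # 10
```

A real implementation would presumably hold one such buffer per attention layer and store MLX arrays rather than strings, but the eviction rule would be the same.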

What it does well:
- Works in 20+ terminals/IDEs (VS Code, Xcode, iTerm2, Warp, JetBrains). Most dictation tools break in terminals; EdgeWhisper detects them and switches injection strategy.
- Removes filler words automatically ("um", "uh", "like").
- 13 languages with auto-detection.
- Personal dictionary + snippet expansion with variable support ({{date}}, {{clipboard}}).
- Works fully offline after model download. No accounts, no telemetry, no analytics.
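For a rough idea of how filler removal, the personal dictionary, and snippet expansion could chain together, here is a small Python sketch. It is an illustration under assumptions, not the app's implementation (which is Swift). All function names, the example dictionary entry, and the "my signature" trigger are made up. It only strips "um" and "uh", since removing "like" safely needs context that a regex does not have.

```python
import re

# Only unambiguous fillers; "like" is left alone because it is often a real word.
FILLERS = re.compile(r"\b(?:um|uh)\b[,.]?\s*", re.IGNORECASE)
VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def remove_fillers(text: str) -> str:
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def apply_dictionary(text: str, dictionary: dict) -> str:
    # Replace each spoken form with the user's preferred spelling.
    for spoken, preferred in dictionary.items():
        pattern = r"\b" + re.escape(spoken) + r"\b"
        text = re.sub(pattern, lambda _m, p=preferred: p, text, flags=re.IGNORECASE)
    return text


def expand_variables(template: str, variables: dict) -> str:
    # Unknown variables are left as-is rather than raising.
    return VARIABLE.sub(lambda m: variables.get(m.group(1), m.group(0)), template)


def expand_snippets(text: str, snippets: dict, variables: dict) -> str:
    for trigger, template in snippets.items():
        expansion = expand_variables(template, variables)
        pattern = r"\b" + re.escape(trigger) + r"\b"
        text = re.sub(pattern, lambda _m, e=expansion: e, text, flags=re.IGNORECASE)
    return text


def post_process(text: str, dictionary: dict, snippets: dict, variables: dict) -> str:
    # Same order as the first three pipeline stages: fillers, dictionary, snippets.
    text = remove_fillers(text)
    text = apply_dictionary(text, dictionary)
    return expand_snippets(text, snippets, variables)


# Fixed values keep the demo deterministic; a real app would read the current
# date and the system clipboard at expansion time.
variables = {"date": "2025-01-15", "clipboard": "https://example.com"}
result = post_process(
    "um open edge whisper and uh insert my signature",
    dictionary={"edge whisper": "EdgeWhisper"},
    snippets={"my signature": "Sent {{date}} from {{clipboard}}"},
    variables=variables,
)
print(result)  # open EdgeWhisper and insert Sent 2025-01-15 from https://example.com
```

Running the dictionary before snippets means a snippet trigger can rely on the corrected spelling; the remaining stages (punctuation, capitalization, formatting) would follow in the same style.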

What it doesn't do (yet):
- No file/meeting transcription (coming)
- No translation (coming)
- No Linux/Windows (macOS only, Apple Silicon required)

Pricing: Free tier (5 min/day, no account needed). Pro at $7.99/mo or $79.99/yr.

I'd love feedback on:
1. Would local LLM post-processing (e.g., Phi-4-mini via MLX) for grammar/tone be worth the extra ~1GB RAM?
2. For developers using voice→code workflows: what context would you want passed to your editor?
3. Anyone else building on Voxtral Realtime? Curious about your experience with the causal audio encoder.