Voice Keyboard Pro looks simple on the surface: hold a key, speak, release, and text appears wherever your cursor is. But that simplicity is the product of a deeply layered system. Between the moment sound enters your microphone and the moment characters appear in your text field, Voice Keyboard Pro orchestrates an audio pipeline, on-device speech recognition, profession-aware post-processing, voice isolation, AI actions, and system-wide text injection. This post walks through the entire architecture for engineers, power users, and anyone curious about what happens inside a modern dictation app.

Why Native macOS, Not Electron

The decision to build Voice Keyboard Pro as a native Swift application was not a matter of aesthetic preference. It was an engineering constraint imposed by the problem itself. Voice-to-text requires real-time audio capture, hardware-accelerated inference on the Neural Engine, system-wide keystroke detection via CGEvent taps, and direct access to the Accessibility API for text insertion. Each of these capabilities is either unavailable or severely degraded in Electron, web wrappers, and cross-platform frameworks.

Native means the entire pipeline runs without bridging layers. There is no JavaScript event loop between your keypress and audio capture starting. There is no IPC overhead between the audio thread and the recognition engine. There is no Chromium renderer consuming 200 MB of RAM to display a popover menu. Voice Keyboard Pro idles at roughly 30 MB of memory and registers near-zero CPU usage when you are not actively dictating. For a tool that runs all day in your menu bar, those numbers are not vanity metrics. They are the difference between an app you forget is running and one you eventually quit because it is draining your battery.

The Audio Pipeline: Microphone to PCM Buffer

Everything starts with a keypress. Voice Keyboard Pro registers a global hotkey listener using CGEvent taps, a low-level macOS API that intercepts keyboard events system-wide regardless of which application is in the foreground. When you press and hold the configured hotkey (Left Control by default), the audio pipeline activates.
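The event-tap setup can be sketched roughly as follows. This is an illustrative sketch, not Voice Keyboard Pro's actual source: `startDictation` and `stopDictation` are hypothetical hooks, and a production tap would also check the key code to distinguish Left Control from Right Control.

```swift
import CoreGraphics

// Hypothetical hooks standing in for the real pipeline entry points.
func startDictation() { /* begin audio capture */ }
func stopDictation()  { /* stop capture and hand audio to the recognizer */ }

// Modifier keys arrive as .flagsChanged events, not .keyDown/.keyUp.
let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)

let tap = CGEvent.tapCreate(
    tap: .cgSessionEventTap,       // session-wide: fires regardless of the frontmost app
    place: .headInsertEventTap,
    options: .listenOnly,          // observe events without modifying them
    eventsOfInterest: mask,
    callback: { _, _, event, _ in
        // Control press toggles the .maskControl flag on the event.
        if event.flags.contains(.maskControl) {
            startDictation()
        } else {
            stopDictation()
        }
        return Unmanaged.passUnretained(event)
    },
    userInfo: nil
)

if let tap = tap {
    let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
    CGEvent.tapEnable(tap: tap, enable: true)
    CFRunLoopRun()  // keep the tap alive in a standalone sketch
}
```

Event taps require the Accessibility permission, which is why macOS prompts for it on first launch.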

Audio capture runs on AVAudioEngine, Apple's real-time audio processing framework. Voice Keyboard Pro attaches a tap to the input node and begins receiving PCM audio buffers at the sample rate the Whisper model expects (16 kHz, mono, Float32). The pipeline is deliberately lean. There is no pre-processing, equalization, or gain normalization applied to the raw stream at this stage. The reason is practical: Whisper models are trained on diverse, noisy audio and handle environmental variation better than hand-rolled filters. Adding unnecessary DSP stages increases latency and CPU load without improving accuracy.

The audio engine runs on a high-priority real-time thread managed by Core Audio. Buffer sizes are configured to balance latency against dropout risk. On Apple Silicon, the audio pipeline contributes less than 2% CPU utilization during active recording, which means you can dictate while running a full compile, a video call, or a heavy browser session without audio underruns or frame drops.
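The capture setup described above looks roughly like this. It is a hedged sketch under stated assumptions: the tap buffer size and the queueing step are illustrative, and error handling is elided.

```swift
import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode
let hwFormat = input.outputFormat(forBus: 0)

// The format Whisper expects: 16 kHz, mono, Float32.
let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000,
                                  channels: 1,
                                  interleaved: false)!
let converter = AVAudioConverter(from: hwFormat, to: whisperFormat)!

input.installTap(onBus: 0, bufferSize: 1024, format: hwFormat) { buffer, _ in
    // Resample each hardware buffer down to 16 kHz mono before queueing it.
    let capacity = AVAudioFrameCount(
        Double(buffer.frameLength) * 16_000 / hwFormat.sampleRate)
    guard let out = AVAudioPCMBuffer(pcmFormat: whisperFormat,
                                     frameCapacity: capacity) else { return }
    var error: NSError?
    _ = converter.convert(to: out, error: &error) { _, status in
        status.pointee = .haveData
        return buffer
    }
    // out.floatChannelData now holds 16 kHz mono samples for the recognizer.
}

do { try engine.start() } catch { print("audio engine failed: \(error)") }
```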

Voice Activity Detection and Silence Handling

Not every hotkey press results in speech. You might hold the key and then decide not to say anything. You might accidentally trigger it while reaching for another shortcut. Voice Keyboard Pro handles this with voice activity detection (VAD).

During recording, Voice Keyboard Pro continuously calculates the speech ratio: the proportion of audio frames containing actual vocal energy versus silence or ambient noise. When you release the hotkey, the app checks this ratio against a calibrated threshold. If the recording is predominantly silence, Voice Keyboard Pro discards it entirely and skips transcription. This prevents phantom text from appearing when you did not actually speak, and it avoids wasting compute on empty audio.

The VAD threshold is tuned to be generous with genuine speech. Even a quiet, short utterance of two or three words will pass the gate. But holding the key for five seconds in a silent room will not produce output. This is a small detail that matters enormously in practice. Without it, every accidental keypress would generate a hallucinated transcription, and the tool would feel unreliable.
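The gate can be illustrated with a toy speech-ratio check. The real threshold and energy measure are internal to Voice Keyboard Pro; the RMS-based version and the specific constants below are assumptions for illustration only.

```swift
// Fraction of frames whose RMS energy exceeds a noise floor.
func speechRatio(frames: [[Float]], energyThreshold: Float = 0.01) -> Float {
    guard !frames.isEmpty else { return 0 }
    let voiced = frames.filter { frame in
        // Root-mean-square energy of one frame of samples.
        let rms = (frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)).squareRoot()
        return rms > energyThreshold
    }
    return Float(voiced.count) / Float(frames.count)
}

// Discard the recording entirely if it is predominantly silence.
func shouldTranscribe(frames: [[Float]], minimumSpeechRatio: Float = 0.1) -> Bool {
    speechRatio(frames: frames) >= minimumSpeechRatio
}
```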

Voice Isolation: Filtering Out the World

One of the harder problems in real-world dictation is background noise. Open offices, coffee shops, household noise, a colleague on a call nearby. Traditional noise reduction uses spectral gating or adaptive filters, which help with steady-state noise like fan hum but fail with intermittent sounds like voices, music, or television.

Voice Keyboard Pro takes a different approach with voice isolation. Instead of trying to remove noise, it isolates the primary speaker's voice from the input signal. This uses a neural model that has learned to separate a target speaker from a mixture of sounds. The result is a clean audio stream containing only your voice, even when other people are talking in the same room, even when music is playing, even when your dog is barking.

Voice isolation runs on-device and is applied before the audio reaches the Whisper model. This two-stage approach (isolate, then transcribe) produces dramatically better accuracy in noisy environments compared to feeding raw audio directly to the recognizer. In our testing, voice isolation reduced word error rate in open-office conditions by over 40%.

Speech Recognition: Whisper on Apple Neural Engine

The core of Voice Keyboard Pro's transcription is OpenAI's Whisper model, running locally on your Mac. There is no cloud API call, no audio upload, no server round-trip. The model runs entirely on-device, which means transcription works offline, incurs zero latency from network conditions, and ensures that your audio never leaves your computer.

Voice Keyboard Pro ships with Whisper models converted to Core ML format and optimized for Apple's Neural Engine, the dedicated machine learning accelerator present in all Apple Silicon chips. The Neural Engine is designed specifically for the matrix operations that transformer models require, and it runs them at a fraction of the power that GPU or CPU inference would draw. On an M1 MacBook Air, a 10-second audio clip transcribes in approximately 400 milliseconds. On M3 or M4 hardware, that number drops below 300 milliseconds.

The model selection is configurable. Voice Keyboard Pro offers multiple Whisper model sizes, from the compact base model (roughly 140 MB) that trades some accuracy for speed, to larger models that deliver near-human accuracy for longer or more complex dictation. The default strikes a balance that works well for everyday dictation: fast enough to feel instantaneous, accurate enough that you rarely need to correct output.
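Loading a Core ML-converted model with a Neural Engine preference is a few lines. A minimal sketch, assuming a hypothetical compiled-model name ("WhisperBase") bundled with the app:

```swift
import CoreML

let config = MLModelConfiguration()
// Prefer the Neural Engine (with CPU fallback) and keep inference off the GPU.
config.computeUnits = .cpuAndNeuralEngine

// "WhisperBase.mlmodelc" is an illustrative name, not the app's actual asset.
if let url = Bundle.main.url(forResource: "WhisperBase", withExtension: "mlmodelc"),
   let model = try? MLModel(contentsOf: url, configuration: config) {
    // model is now ready for prediction calls on audio features.
    _ = model
}
```

Core ML decides the final placement per layer, but the `computeUnits` hint is what keeps transformer inference on the low-power accelerator rather than the GPU.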

Profession Detection and Domain Vocabulary

Generic speech recognition struggles with specialized vocabulary. A radiologist saying "anteroposterior" or an engineer saying "kubectl" will confuse a model trained primarily on general conversation. Voice Keyboard Pro addresses this with profession detection.

During onboarding, you select your profession. Voice Keyboard Pro uses this to configure two things: a Whisper prompt that biases the model toward domain-specific terminology, and a starter vocabulary list that seeds the recognition context. The Whisper prompt is a short text passage containing representative terms from your field. Because of how Whisper's decoder works, this prompt conditions the model to expect and correctly transcribe domain jargon without additional fine-tuning.

On top of this, Voice Keyboard Pro maintains a custom vocabulary system where you can add terms specific to your work: product names, proprietary terms, colleague names, acronyms. These are injected into the recognition context alongside the profession vocabulary, creating a personalized recognition profile that improves over time as you add terms.
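The way profession and custom terms combine into a decoder prompt might look like the following. This is a hypothetical sketch: the function name, the "Glossary:" framing, and the crude term-count cap (Whisper's prompt window is limited to roughly 224 tokens) are all assumptions, not the app's actual implementation.

```swift
// Assemble a bias prompt from the profession preset plus user-added terms.
// Whisper conditions its decoder on this text, so jargon listed here is far
// more likely to be transcribed correctly.
func biasPrompt(professionTerms: [String],
                customTerms: [String],
                limit: Int = 224) -> String {
    // Custom terms first so they survive the cap; crude cap by term count.
    let terms = (customTerms + professionTerms).prefix(limit)
    return "Glossary: " + terms.joined(separator: ", ") + "."
}
```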

Text Insertion: System-Wide Cursor Injection

This is where Voice Keyboard Pro diverges most sharply from how you might expect a voice-to-text tool to work. Most dictation apps use the clipboard: they copy transcribed text to the pasteboard and simulate Cmd+V. This is fast, but it has a critical flaw. It overwrites whatever you last copied. If you were in the middle of a copy-paste workflow, your clipboard is silently corrupted.

Voice Keyboard Pro uses the macOS Accessibility API to inject text directly at the cursor position in the currently focused application. The app queries the system for the focused UI element, confirms it accepts text input, and then programmatically inserts the transcription. From the target application's perspective, the text appears exactly as if you had typed it on the keyboard. This approach preserves the clipboard, works in virtually every text field on the system (including terminal emulators, code editors, and web apps), and produces no side effects.

The Accessibility API is a native macOS Objective-C/C API. Calling it from Swift is direct and zero-overhead. There is no serialization, no IPC, no bridge. This matters because text insertion runs on every single dictation. Even 10 milliseconds of bridge overhead, multiplied across hundreds of daily dictations, would add up to perceptible lag. At native call speed, the insertion itself is effectively instantaneous.
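The insertion path can be sketched with the AXUIElement C API. A minimal sketch, not the production code: setting kAXSelectedTextAttribute on the focused element replaces the current selection, or inserts at the caret when nothing is selected, and a real implementation would also verify the element accepts text and handle failures.

```swift
import ApplicationServices

func insertAtCursor(_ text: String) {
    // Ask the system which UI element currently has keyboard focus.
    let systemWide = AXUIElementCreateSystemWide()
    var focused: CFTypeRef?
    let result = AXUIElementCopyAttributeValue(
        systemWide,
        kAXFocusedUIElementAttribute as CFString,
        &focused)
    guard result == .success, let focused = focused else { return }
    let element = focused as! AXUIElement

    // Writing the selected-text attribute inserts at the caret position,
    // leaving the pasteboard untouched.
    AXUIElementSetAttributeValue(element,
                                 kAXSelectedTextAttribute as CFString,
                                 text as CFTypeRef)
}
```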

AI Actions: Voice Commands Beyond Text

Not everything you say is meant to be typed. Sometimes you want to tell your computer to do something rather than write something. Voice Keyboard Pro's AI actions system recognizes when your utterance is a command rather than dictation and routes it accordingly.

When you speak, Voice Keyboard Pro's post-processing pipeline classifies the transcription as either text (to be inserted at the cursor) or an action (to be executed). Actions include things like "open Safari," "search for the latest quarterly report," "summarize this page," or "reply to this email saying I will follow up tomorrow." The classification uses a language model that examines the structure and intent of the transcription.

When an action is detected, Voice Keyboard Pro does not insert the text. Instead, it dispatches the command to an action handler that interprets the intent and executes it. This architecture turns Voice Keyboard Pro from a dictation tool into a voice interface for your Mac. You can compose text, trigger workflows, and interact with your system using the same hold-to-speak gesture, and Voice Keyboard Pro figures out which mode you are in.
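The dispatch shape looks something like the following. The real classifier is a language model examining structure and intent; the keyword heuristic here is a stand-in used only to show how the routing type might work.

```swift
// Either text bound for the cursor, or a command bound for a handler.
enum Utterance {
    case dictation(String)
    case action(command: String)
}

// Toy classifier: a placeholder for the LLM-based intent check.
func classify(_ transcript: String) -> Utterance {
    let lowered = transcript.lowercased()
    let commandPrefixes = ["open ", "search for ", "summarize ", "reply to "]
    if commandPrefixes.contains(where: lowered.hasPrefix) {
        return .action(command: transcript)
    }
    return .dictation(transcript)
}
```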

The Smart Rewrite Pipeline

Spoken language and written language are not the same. When you dictate, your output tends to be more conversational, with filler words, false starts, and looser sentence structure. Voice Keyboard Pro's Smart Rewrite feature bridges this gap. After transcription, the raw text is optionally passed through a language model that reformats it into polished written prose: tightening sentences, removing verbal fillers, and adjusting register to match the context (casual for chat messages, formal for emails, technical for documentation).

Smart Rewrite is context-aware. It considers the target application and the surrounding text to determine the appropriate tone and formatting. Dictating into Slack produces casual, direct text. Dictating into a Google Doc preserves paragraph structure and formal register. This happens automatically without any configuration, though you can override the behavior per-application if you prefer raw transcription in certain contexts.

Offline Mode: Everything On-Device

Voice Keyboard Pro's offline mode is not a degraded fallback. The core pipeline (audio capture, voice isolation, VAD, Whisper transcription, text insertion) runs entirely on-device with no network dependency. You can dictate on an airplane, in a basement, or with your Wi-Fi turned off, and the experience is identical to online usage for basic voice-to-text.

Features that require network connectivity are clearly delineated: AI actions, Smart Rewrite, and cloud-based model options. When offline, these features are gracefully disabled and Voice Keyboard Pro falls back to local-only processing. The transition is seamless. You do not need to toggle an "offline mode" switch. Voice Keyboard Pro detects connectivity and adjusts automatically.
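Automatic connectivity detection on macOS is a natural fit for NWPathMonitor. A sketch under stated assumptions: `enableCloudFeatures` and `disableCloudFeatures` are hypothetical hooks, not the app's actual API.

```swift
import Network
import Dispatch

// Hypothetical hooks for toggling network-dependent features.
func enableCloudFeatures()  { /* AI actions and Smart Rewrite available */ }
func disableCloudFeatures() { /* local-only transcription */ }

let monitor = NWPathMonitor()
monitor.pathUpdateHandler = { path in
    if path.status == .satisfied {
        enableCloudFeatures()
    } else {
        // No network: silently fall back, no user-facing toggle needed.
        disableCloudFeatures()
    }
}
monitor.start(queue: DispatchQueue(label: "connectivity.monitor"))
```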

The iPhone Keyboard Extension

Voice Keyboard Pro extends beyond the Mac with an iPhone keyboard that brings the same voice-to-text experience to iOS. The keyboard integrates with the same Whisper-based recognition pipeline, optimized for the iPhone's Neural Engine. You get the same profession-aware vocabulary, the same accuracy, and the same hold-to-speak interaction model, but on your phone.

The keyboard runs as a standalone extension. It does not require a persistent connection to your Mac. Audio is processed on-device on the iPhone itself, maintaining the same privacy model as the Mac app. This is not a remote mic that streams audio to your laptop. It is a complete, self-contained recognition system running on the A-series or M-series chip in your phone.

Performance Numbers

Our internal benchmarks on an M2 MacBook Air bear out the figures cited throughout this post: idle memory around 30 MB, under 2% CPU during active recording, and transcription of a 10-second clip in well under half a second.

These numbers are not theoretical. They are measured on production builds with real-world audio. The performance characteristics are a direct consequence of native development, hardware-accelerated inference, and an architecture that eliminates unnecessary abstraction layers.

Privacy by Design

Privacy in Voice Keyboard Pro is not a policy. It is an architectural property. Audio is processed on-device by the Whisper model running on the Neural Engine. Transcribed text is inserted directly into the target application and stored locally in your dictation history. No audio is uploaded to any server. No transcriptions are sent to any cloud service. No telemetry captures what you said.

The only network requests Voice Keyboard Pro makes are for features that inherently require them: AI actions (which use a language model API) and checking for app updates. These requests contain no audio data and no transcription content. The AI action request contains only the processed command, not the original audio or the full transcription context.

This architecture means there is nothing to breach. If Voice Keyboard Pro's servers were compromised tomorrow, attackers would find update manifests and crash reports. They would not find a single word you ever dictated, because those words never left your Mac.

Frequently Asked Questions

Does Voice Keyboard Pro use OpenAI Whisper for speech recognition?

Yes. Voice Keyboard Pro runs Whisper models locally on your Mac using Apple's Neural Engine. The models are converted to Core ML format and optimized for Apple Silicon, so transcription happens on-device without sending audio to any server. You get the accuracy of Whisper with the speed and privacy of local inference.

How does Voice Keyboard Pro insert text without using the clipboard?

Voice Keyboard Pro uses the macOS Accessibility API to inject text directly at the cursor position in any application. This system-wide cursor injection preserves your clipboard contents and works identically to physical keyboard input from the target app's perspective. It works in browsers, code editors, terminal emulators, and native macOS apps.

Does Voice Keyboard Pro work offline?

Yes. Voice Keyboard Pro's core speech recognition runs entirely on-device using Whisper models on the Apple Neural Engine. You can dictate without an internet connection. Some features like AI actions and Smart Rewrite require a network connection, but basic voice-to-text works completely offline.

What makes Voice Keyboard Pro different from Apple's built-in dictation?

Voice Keyboard Pro adds profession-aware vocabulary, voice isolation that filters background noise and other speakers, AI actions that can execute commands from voice, hallucination filtering, Smart Rewrite for polished output, and a hold-to-speak activation model that gives you precise control over when dictation is active. It also works in every text field system-wide, including terminal emulators and code editors where Apple dictation often fails.

Try Voice Keyboard Pro

Voice Keyboard Pro is built for people who care about how their tools work. If you want a dictation app that is fast because it is engineered to be fast, private because it is architected to be private, and accurate because it runs state-of-the-art models on dedicated hardware, download Voice Keyboard Pro and try it. It is free to start, and you will feel the difference in the first dictation.