Short answer: Google Cloud Speech-to-Text exposes three recognition modes: synchronous (short audio under one minute), asynchronous (long files via Cloud Storage), and streaming (live microphone audio). You authenticate with a service-account key, send audio plus a RecognitionConfig, and receive a transcript with timing and confidence per segment.
Most "Google speech recognition API" tutorials online are either out-of-date sample dumps or marketing pages with no code. This one is the opposite: a compact, practical walk-through of what you actually need to know to ship a working speech-to-text feature on top of Google Cloud Speech-to-Text. We will cover the three recognition modes, authentication, the most useful configuration options, language hints, streaming with a live microphone, and an honest section on when the API is not the right choice.
What "the Google speech recognition API" actually means
There are two different things that get called this:
- Google Cloud Speech-to-Text: The official paid API hosted on Google Cloud Platform. Production-grade, billed per audio second, with official client libraries for Python, Node.js, Java, Go, C#, Ruby, and PHP, plus REST and gRPC interfaces.
- The Web Speech API in Chrome: The free browser feature behind Google Docs voice typing. It is not a server API; it is a JavaScript interface inside Chrome that talks to Google's speech servers internally. You do not get an API key, only an in-browser object called SpeechRecognition.
If you are building anything beyond a Chrome-only browser toy, you want Cloud Speech-to-Text. The rest of this tutorial assumes that.
Setting up a Google Cloud project
- Go to console.cloud.google.com and create a new project (or pick an existing one).
- Enable the Cloud Speech-to-Text API for that project from the API Library.
- Create a service account: IAM & Admin, Service Accounts, Create Service Account. Give it the "Cloud Speech Client" role (or a broader role if you also need to read from Cloud Storage).
- Generate a JSON key for the service account and download it.
- Save the key file somewhere outside your repo and set the environment variable so the client libraries can find it:
export GOOGLE_APPLICATION_CREDENTIALS="/absolute/path/to/key.json"
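The client libraries read this variable automatically. No API-key-in-URL pattern, no headers to set manually. If GOOGLE_APPLICATION_CREDENTIALS is unset, the library falls back to the other Application Default Credentials sources: first any credentials created with gcloud auth application-default login, then the metadata service (which works on GCE/GKE/Cloud Run/Cloud Functions). If none of those is available, it fails loudly.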
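Before touching the API, it can help to confirm that the variable points at something that looks like a service-account key. The following is a minimal sketch using only the standard library; the function name check_credentials_env and its status strings are made up for illustration, and it only inspects the file's shape, it does not prove Google will accept the key.

```python
import json
import os


def check_credentials_env(env=None):
    """Return a short status string describing the credentials setup."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        # The client libraries will fall back to other credential sources.
        return "unset"
    if not os.path.isfile(path):
        return "missing-file"
    try:
        with open(path, "r", encoding="utf-8") as f:
            key = json.load(f)
    except (OSError, ValueError):
        return "unreadable"
    # Downloaded service-account keys are JSON objects carrying these fields.
    required = {"type", "client_email", "private_key"}
    if not isinstance(key, dict) or not required.issubset(key):
        return "not-a-service-account-key"
    if key["type"] != "service_account":
        return "not-a-service-account-key"
    return "ok"
```

Calling check_credentials_env() at startup and logging anything other than "ok" turns a confusing first-request failure into an obvious configuration message.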
Installing the client library
We will use Python for the examples because it is the shortest. The same shapes exist in every other official client.
pip install google-cloud-speech
That installs the google.cloud.speech package and its gRPC dependencies. Imports look like:
from google.cloud import speech
client = speech.SpeechClient()
If the credentials environment variable is set, the client picks it up here. If no credentials can be found at all, the error is raised when the client is constructed (the SpeechClient() line), not at import time.
Mode 1: Synchronous recognition (short audio)
Synchronous mode is for audio shorter than roughly one minute. It is the simplest pattern and the right choice for "user holds a button, releases, transcribe the clip" workflows.
from google.cloud import speech

client = speech.SpeechClient()

# Read the whole clip into memory; synchronous mode sends the bytes inline.
with open("clip.wav", "rb") as f:
    audio_bytes = f.read()

audio = speech.RecognitionAudio(content=audio_bytes)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,  # must match the file's real sample rate
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

# Each result covers one portion of the audio; alternatives[0] is the best guess.
for result in response.results:
    print(result.alternatives[0].transcript)
    print(result.alternatives[0].confidence)
Things worth knowing:
- The audio must be in a format Google supports — LINEAR16 (raw PCM WAV), FLAC, OGG_OPUS, and several others. FLAC is the recommended format for accuracy because it is lossless and compact.
- sample_rate_hertz must match the actual sample rate of the audio. If you record at 48 kHz and declare 16 kHz, transcription will be wrong or empty.
- language_code is a BCP-47 tag (en-US, en-GB, es-ES, ja-JP, fr-FR, etc.).
- response.results is a list of independent utterance results, each with an alternatives list. The first alternative is the best guess.
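Because a wrong sample rate fails so quietly, it is worth reading the rate from the file instead of hard-coding it. Here is a small sketch using Python's built-in wave module; describe_wav is a hypothetical helper name, and it only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Read a PCM WAV header and return the values a recognition config needs."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hertz": rate,
            "channels": wav.getnchannels(),
            # LINEAR16 means 16-bit samples, i.e. 2 bytes per sample.
            "is_16_bit": wav.getsampwidth() == 2,
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

You would then pass the returned sample_rate_hertz into the config, and use duration_seconds to decide whether the clip is short enough for synchronous mode.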
Mode 2: Asynchronous recognition (long audio)
For audio longer than about a minute, you cannot send the bytes inline. Instead you upload to Google Cloud Storage and pass a URI:
from google.cloud import speech

client = speech.SpeechClient()

# Long audio is referenced by Cloud Storage URI instead of being sent inline.
audio = speech.RecognitionAudio(uri="gs://your-bucket/long-clip.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)  # blocks for up to 10 minutes

for result in response.results:
    print(result.alternatives[0].transcript)
long_running_recognize returns immediately with an operation handle. Calling .result() blocks until the job finishes. For really long jobs (hour-plus), do not block — store the operation name and poll it from a background worker.
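The polling side of that background worker can be sketched without any Google-specific code. In the example below, is_done is a placeholder for whatever check you run against the stored operation; the function name poll_until_done, the 30-second interval, and the four-hour timeout are all assumptions for illustration, not values from the API.

```python
import time


def poll_until_done(is_done, interval_s=30.0, timeout_s=4 * 3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call is_done() repeatedly; return True on completion, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        # Wait between checks so the worker does not hammer the API.
        sleep(interval_s)
```

The sleep and clock parameters exist only so the loop can be tested without real waiting. In a worker you would leave them at their defaults.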
enable_automatic_punctuation is one of the most useful flags. Without it, you get a single run-on transcript with no commas, no periods, and no question marks. With it, the model inserts plausible punctuation and capitalisation.
Mode 3: Streaming recognition (live microphone)
Streaming is what you want for a live captioning UI, a voice assistant, or a dictation app where the user wants to see words as they speak. You open a bidirectional gRPC stream, push audio chunks in, and receive interim and final transcripts as they become available.
import queue
import sounddevice as sd
from google.cloud import speech

RATE = 16000
CHUNK = int(RATE / 10)  # 100ms of audio per chunk

client = speech.SpeechClient()
audio_queue = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread: hand the raw PCM bytes to the request generator.
    audio_queue.put(bytes(indata))

def requests():
    # Yield audio requests until a None sentinel is put on the queue.
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            return
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=RATE,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config,
    interim_results=True,
)

# Keep the microphone open for as long as responses are being consumed.
with sd.RawInputStream(samplerate=RATE, blocksize=CHUNK, dtype='int16',
                       channels=1, callback=callback):
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests(),
    )
    for response in responses:
        for result in response.results:
            transcript = result.alternatives[0].transcript
            if result.is_final:
                print("FINAL:", transcript)
            else:
                print("interim:", transcript)
A few important constraints:
- A single streaming session has a maximum duration (around five minutes at the time of writing). For longer sessions you must restart the stream: capture audio continuously, but rotate to a new streaming_recognize call before the limit.
- interim_results=True gives you partial transcripts while the model is still hearing the sentence. Use these for low-latency UI updates, but only persist is_final results.
- The audio chunk size matters. Too small and you waste bandwidth; too large and you add latency. Around 100ms per chunk is a good default.
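The chunk-size and rotation constraints above are mostly arithmetic, so they can be sketched without the API. The helper names chunk_pcm and rotate_sessions are hypothetical, and the 240-second default is an assumed safety margin under the roughly five-minute limit, not a documented value.

```python
import itertools

_END = object()  # sentinel marking an exhausted chunk source


def chunk_pcm(pcm_bytes, sample_rate=16000, chunk_ms=100, bytes_per_sample=2):
    """Split mono PCM audio into fixed-duration chunks (the last may be shorter)."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[start:start + chunk_bytes]


def rotate_sessions(chunks, chunk_ms=100, max_session_s=240):
    """Yield one lazy chunk iterator per session, each under max_session_s of audio.

    Each yielded iterator must be fully consumed before asking for the next one.
    """
    per_session = max_session_s * 1000 // chunk_ms
    source = iter(chunks)
    while True:
        first = next(source, _END)
        if first is _END:
            return
        # The first chunk plus up to per_session - 1 more from the shared source.
        yield itertools.chain([first], itertools.islice(source, per_session - 1))
```

In a live app, each iterator from rotate_sessions would feed one streaming call, and the next session starts as soon as the previous one is drained, so no captured audio is skipped.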
Useful RecognitionConfig options
- enable_automatic_punctuation: Insert punctuation and capitalisation in the output. Almost always worth enabling.
- model: Pick a model variant. Options include general-purpose models, telephony models tuned for 8 kHz phone audio, and a video model tuned for media audio. The right model can substantially improve accuracy.
- speech_contexts: Provide hint phrases. If your domain has unusual words ("Kubernetes," "voicekeyboardpro," product names), passing them as speech contexts can dramatically improve recognition of those specific tokens.
- diarization_config: Enable speaker diarization on multi-speaker audio. Output includes a speaker tag per word.
- enable_word_time_offsets: Get start and end timestamps per word, useful for subtitling.
- profanity_filter: Mask profane words with asterisks. Off by default.
- alternative_language_codes: Provide a small list of candidate languages; the API picks the best match. Useful for users who switch languages between sessions.
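As an example of what enable_word_time_offsets makes possible, here is a rough sketch that turns per-word timings into SRT subtitle cues. It assumes you have already converted each word's start and end offsets into plain float seconds; the function names and the seven-words-per-cue default are illustrative choices, not part of the API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Build SRT text from a list of (text, start_s, end_s) tuples."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        # A cue spans from its first word's start to its last word's end.
        start, end = group[0][1], group[-1][2]
        text = " ".join(word[0] for word in group)
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest possible rule. A real subtitler would more likely break cues at punctuation or at long pauses between words.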
Speech contexts: the biggest accuracy win
The most underused part of the API. If you know in advance that the user is going to say specific words or phrases — names, codes, jargon — passing them as speech contexts makes the model dramatically more likely to pick them.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(phrases=[
            "Kubernetes",
            "Voice Keyboard Pro",
            "PR-3917",
        ]),
    ],
)
Use this for: brand names, internal product names, ticket-ID patterns, customer names you have a list of, technical acronyms. Without it, the model will often mishear rare terms like "Kubernetes" as more common words.
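Phrase lists usually come from messy sources such as a customer table or a product catalogue, so it helps to tidy them before passing them in. The sketch below is one way to do that. The helper name clean_phrases is hypothetical, and the max_phrases and max_chars defaults are placeholder values, not the API's real limits, which you should look up in the current quota documentation.

```python
def clean_phrases(raw_phrases, max_phrases=500, max_chars=100):
    """Normalise whitespace, drop blanks, overlong phrases and duplicates."""
    seen = set()
    cleaned = []
    for phrase in raw_phrases:
        phrase = " ".join(phrase.split())  # trim and collapse inner whitespace
        key = phrase.casefold()            # case-insensitive duplicate check
        if not phrase or len(phrase) > max_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append(phrase)
        if len(cleaned) == max_phrases:
            break
    return cleaned
```

The returned list keeps the original order, so put your most important phrases first if the list might be truncated.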
Pricing and quotas
Cloud Speech-to-Text is billed per audio second processed, with separate rates for the basic and the "enhanced" model variants. Pricing changes regularly enough that it is not worth quoting a number here — check the current rate on Google Cloud's pricing page for Speech-to-Text before committing.
What is worth knowing without specifics:
- You pay per second of audio, not per second of compute. A one-minute clip costs the same whether the API takes one second to process or twenty.
- There is a free tier for new accounts. Useful for prototyping; not enough for production.
- The enhanced models cost more than the standard ones. For most production use cases the accuracy improvement justifies it.
- Streaming and batch recognition are not always priced the same. Where they differ, expect the live-session mode to be the more expensive one, and check both rates.
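Since billing follows audio length, a rough budget is simple arithmetic once you know the current rate. The sketch below takes the rate as an argument because no real price is quoted here; the function name estimate_cost and the increment_s rounding parameter are assumptions, so confirm how billing is actually rounded before relying on it.

```python
import math


def estimate_cost(clip_seconds, rate_per_minute, increment_s=1):
    """Return (billed_seconds, cost) for a list of clip durations in seconds.

    Each clip is rounded up to a multiple of increment_s before summing.
    rate_per_minute is whatever the pricing page currently lists.
    """
    billed = sum(math.ceil(s / increment_s) * increment_s for s in clip_seconds)
    return billed, billed / 60 * rate_per_minute
```

For example, estimate_cost([61, 14.2], rate_per_minute=your_rate, increment_s=15) bills the two clips as 75 and 15 seconds, which shows how much rounding can matter when you process many short clips.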
Error handling and resilience
Production speech recognition fails in a small number of predictable ways. Your client code should handle:
- Auth errors: Bad or expired key, missing role. Surface clearly; do not retry.
- Quota exceeded: Back off and retry with exponential delay, or fail fast and raise a quota or billing alert.
- Streaming timeouts: Catch the duration-limit error and start a new stream automatically without dropping the user's audio.
- Empty results: The API can return zero results for silent audio, very short audio, or audio with the wrong sample rate. Treat empty as a possible bug in your audio pipeline, not just as "user did not say anything."
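The first two rules in that list can be sketched as a small retry wrapper. AuthError and QuotaError below are stand-in exception classes defined only for this example; in real code you would catch whatever your client library raises for permission and quota failures instead. The delay values are arbitrary defaults.

```python
import random
import time


class AuthError(Exception):
    """Stand-in for a credential or permission failure (hypothetical class)."""


class QuotaError(Exception):
    """Stand-in for a quota or rate-limit failure (hypothetical class)."""


def call_with_backoff(fn, max_attempts=5, base_delay_s=1.0, max_delay_s=32.0,
                      sleep=time.sleep, rng=random.random):
    """Run fn(), retrying quota errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except AuthError:
            raise  # retrying cannot fix bad credentials; surface immediately
        except QuotaError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller alert or fail fast
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter: sleep between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
```

The jitter keeps many clients from retrying in lockstep after a shared quota reset. The sleep and rng parameters are there so the behaviour can be tested deterministically.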
When to choose a different engine
Google Cloud Speech-to-Text is a strong default, but it is no longer the only serious option. Three honest cases where you might pick a different engine:
- You want best-in-class accuracy on long, messy audio. Whisper-class models (OpenAI's Whisper API and the open-source variants) frequently outperform Google on long-form, accented, or noisy audio. The latency profile is different (Whisper is a chunked-batch model, not naturally streaming) but accuracy on hard inputs tends to be higher.
- You want on-device processing. Cloud Speech-to-Text is cloud-only. If you need offline transcription (privacy, regulated industries, intermittent connectivity), you want either Apple Speech (on Mac/iOS) or a local Whisper deployment.
- You are building a consumer dictation product, not a transcription pipeline. Wiring up the API yourself for end-user dictation means handling audio capture, VAD, chunking, post-processing, vocabulary, voice profile, formatting, hotkeys, system integration, and a dozen other things. For that use case, ship a finished app, not a raw API integration.
If you just want dictation, not a transcription pipeline
If you ended up on this page looking for a way to add great voice-to-text to your own daily workflow on a Mac or iPhone — rather than to ship a backend feature — you do not need to build on Cloud Speech-to-Text yourself.
Voice Keyboard Pro is a finished macOS menu bar app and iOS keyboard that handles all the engineering above for you. It uses Whisper-class AI transcription as the default, with Apple Speech as the offline fallback. Hold a hotkey, speak, release — text appears at the cursor in any app. It includes a Voice Profile feature that learns your voice over time, a custom vocabulary feature equivalent to speech contexts, Voice Isolation for noisy environments, and Smart Rewrite for cleaning up filler words after dictation.
On the privacy side, the server stores only operational pings — no audio and no transcript content are kept. There is also a fully offline Apple Speech mode for sensitive work.
The Cloud Speech-to-Text API is the right tool for backend transcription pipelines. For everyday human dictation, a dedicated app saves you a year of integration work.
Free tier with daily limits, Pro at $4.99 per month or $34.99 per year. Same workflow on Mac and iPhone. If you have been considering building a dictation feature on the API only to use it yourself, the math usually favours buying the finished tool and getting your weekend back.