
Short answer: Google Cloud Speech-to-Text exposes three recognition modes: synchronous (audio up to about one minute), asynchronous (long files via Cloud Storage), and streaming (live microphone audio). You authenticate with a service-account key, send audio plus a RecognitionConfig, and receive a transcript with a confidence score per result and, if you enable word time offsets, timing for each word.

Most "Google speech recognition API" tutorials online are either out-of-date sample dumps or marketing pages with no code. This one is the opposite: a compact, practical walk-through of what you actually need to know to ship a working speech-to-text feature on top of Google Cloud Speech-to-Text. We will cover the three recognition modes, authentication, the most useful configuration options, language hints, streaming with a live microphone, and an honest section on when the API is not the right choice.

What "the Google speech recognition API" actually means

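There are two different things that get called this: the browser's Web Speech API (the SpeechRecognition interface, which Chrome backs with a Google speech service), and Google Cloud Speech-to-Text, the Google Cloud API with official client libraries.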

If you are building anything beyond a Chrome-only browser toy, you want Cloud Speech-to-Text. The rest of this tutorial assumes that.

Setting up a Google Cloud project

  1. Go to console.cloud.google.com and create a new project (or pick an existing one).
  2. Enable the Cloud Speech-to-Text API for that project from the API Library.
  3. Create a service account: IAM & Admin, Service Accounts, Create Service Account. Give it the "Cloud Speech Client" role (or a broader role if you also need to read from Cloud Storage).
  4. Generate a JSON key for the service account and download it.
  5. Save the key file somewhere outside your repo and set the environment variable so the client libraries can find it:
export GOOGLE_APPLICATION_CREDENTIALS="/absolute/path/to/key.json"

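The client libraries read this variable automatically, so there is no API key in the URL and no headers to set manually. If GOOGLE_APPLICATION_CREDENTIALS is unset, the library falls back to the other Application Default Credentials sources: first any credentials created with gcloud auth application-default login, then the metadata service (which works on GCE/GKE/Cloud Run/Cloud Functions). If none of those is available, it fails loudly.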
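
Before the first API call, it can save some debugging time to check that the variable points at a real key file. The following is a minimal sketch using only the standard library; describe_credentials is a hypothetical helper name, not part of any Google library, and it only inspects the file, it does not contact Google.

```python
import json
import os


def describe_credentials(env=os.environ):
    """Return a short diagnostic string about GOOGLE_APPLICATION_CREDENTIALS."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset: the library will fall back to other credential sources"
    if not os.path.isfile(path):
        return f"set, but no file exists at {path}"
    try:
        with open(path, encoding="utf-8") as f:
            key = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return f"set, but the file is not readable JSON: {exc}"
    # Service-account key files are JSON objects with "type": "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return "set, but the file does not look like a service-account key"
    return f"ok: service account {key.get('client_email', '<unknown>')}"
```

Calling print(describe_credentials()) at startup gives a one-line answer to "is my key wired up?" without making a billable request.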

Installing the client library

We will use Python for the examples because it is the shortest. The same shapes exist in every other official client.

pip install google-cloud-speech

That installs the google.cloud.speech package and its gRPC dependencies. Imports look like:

from google.cloud import speech

client = speech.SpeechClient()

If the credentials environment variable is set, the client picks it up here. If no credentials can be found at all, the failure shows up when the client is constructed (a DefaultCredentialsError from the auth library), not at import time.

Mode 1: Synchronous recognition (short audio)

Synchronous mode is for audio shorter than roughly one minute. It is the simplest pattern and the right choice for "user holds a button, releases, transcribe the clip" workflows.

from google.cloud import speech

client = speech.SpeechClient()

with open("clip.wav", "rb") as f:
    audio_bytes = f.read()

audio = speech.RecognitionAudio(content=audio_bytes)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
    print(result.alternatives[0].confidence)
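
The config above assumes clip.wav really is 16-bit PCM at 16 kHz and short enough for synchronous mode. A rough way to check that before sending anything is to read the WAV header with the standard library. This is a sketch only: wav_summary and fits_synchronous are hypothetical helper names, and the 60-second cutoff is taken from the "roughly one minute" guidance above rather than from a quota table.

```python
import wave


def wav_summary(path):
    """Read sample rate, channels, sample width and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hertz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 means 16-bit samples
            "duration_seconds": wav.getnframes() / rate,
        }


def fits_synchronous(summary, max_seconds=60.0):
    """True if the clip is short enough for recognize(); max_seconds is an assumed cutoff."""
    return summary["duration_seconds"] <= max_seconds
```

The sample_rate_hertz value from the summary can be passed straight into RecognitionConfig, which avoids a mismatch between the header and the config.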

Things worth knowing:

Mode 2: Asynchronous recognition (long audio)

For audio longer than about a minute, you cannot send the bytes inline. Instead you upload to Google Cloud Storage and pass a URI:

audio = speech.RecognitionAudio(uri="gs://your-bucket/long-clip.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

for result in response.results:
    print(result.alternatives[0].transcript)

long_running_recognize returns immediately with an operation handle. Calling .result() blocks until the job finishes. For really long jobs (hour-plus), do not block — store the operation name and poll it from a background worker.
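
A non-blocking check can be as simple as a loop that asks "are you done yet?" at a fixed interval. The sketch below is deliberately library-agnostic: poll_until_done is a hypothetical helper, and is_done is any zero-argument callable. The interval and maximum wait are arbitrary example values.

```python
import time


def poll_until_done(is_done, interval=5.0, max_wait=3600.0, sleep=time.sleep):
    """Call is_done() every `interval` seconds until it returns True.

    Returns True on completion, or False once max_wait seconds of waiting
    have passed without completion.
    """
    waited = 0.0
    while True:
        if is_done():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

With the client library, one plausible wiring is poll_until_done(operation.done) followed by operation.result() once it returns True. In a background worker you would more likely run a single check per scheduled tick instead of sleeping inside the loop.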

enable_automatic_punctuation is one of the most useful flags. Without it, you get a single run-on transcript with no commas, no periods, and no question marks. With it, the model inserts plausible punctuation and capitalisation.

Mode 3: Streaming recognition (live microphone)

Streaming is what you want for a live captioning UI, a voice assistant, or a dictation app where the user wants to see words as they speak. You open a bidirectional gRPC stream, push audio chunks in, and receive interim and final transcripts as they become available.

import queue
import sounddevice as sd
from google.cloud import speech

RATE = 16000
CHUNK = int(RATE / 10)  # 100ms

client = speech.SpeechClient()
audio_queue = queue.Queue()

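def callback(indata, frames, time, status):
    # Called by sounddevice on its audio thread for every block of CHUNK frames.
    if status:
        print(status)  # report input overflows or other device problems
    audio_queue.put(bytes(indata))

def requests():
    # Turn queued audio into request messages; a None item ends the stream.
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            return
        yield speech.StreamingRecognizeRequest(audio_content=chunk)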

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=RATE,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config,
    interim_results=True,
)

with sd.RawInputStream(samplerate=RATE, blocksize=CHUNK, dtype='int16',
                       channels=1, callback=callback):
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests(),
    )
    for response in responses:
        for result in response.results:
            transcript = result.alternatives[0].transcript
            if result.is_final:
                print("FINAL:", transcript)
            else:
                print("interim:", transcript)
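
Two details in that example are easy to miss: how big each chunk actually is, and how the request generator ever stops. The sketch below works through both with plain bytes instead of API objects. chunk_bytes and drain are hypothetical names; drain mirrors the requests() generator above, which ends only when something puts None on the queue.

```python
import queue

RATE = 16000          # samples per second, as in the streaming example
BYTES_PER_SAMPLE = 2  # int16
CHANNELS = 1          # mono


def chunk_bytes(rate=RATE, chunk_ms=100):
    """Size in bytes of one chunk of mono int16 audio lasting chunk_ms."""
    frames = rate * chunk_ms // 1000
    return frames * BYTES_PER_SAMPLE * CHANNELS


def drain(audio_queue):
    """Yield chunks from the queue until a None sentinel arrives."""
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            return
        yield chunk
```

At 16 kHz, a 100 ms chunk is 1,600 frames, or 3,200 bytes. To end a session cleanly (for example when the user releases a button), the code that owns the microphone can call audio_queue.put(None), which lets the generator return and the stream close.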

A few important constraints:

Useful RecognitionConfig options

Speech contexts: the biggest accuracy win

The most underused part of the API. If you know in advance that the user is going to say specific words or phrases — names, codes, jargon — passing them as speech contexts makes the model dramatically more likely to pick them.

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(phrases=[
            "Kubernetes",
            "Voice Keyboard Pro",
            "PR-3917",
        ]),
    ],
)

Use this for: brand names, internal product names, ticket-ID patterns, customer names you have a list of, technical acronyms. Without it, the model can easily mishear an uncommon term such as "Kubernetes" as a more familiar-sounding word or phrase.
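
Phrase lists usually come from messy sources such as a CRM export or a ticket database, so it helps to tidy them before handing them to SpeechContext. The following is a minimal sketch; clean_phrases is a hypothetical helper, and max_phrases is an arbitrary cap chosen for illustration, not a documented API limit.

```python
def clean_phrases(phrases, max_phrases=500):
    """Normalise whitespace, drop empties and case-insensitive duplicates, keep order."""
    seen = set()
    cleaned = []
    for phrase in phrases:
        text = " ".join(phrase.split())  # collapse runs of whitespace
        key = text.casefold()
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned[:max_phrases]
```

The result can be passed as the phrases argument in the config shown above, for example speech.SpeechContext(phrases=clean_phrases(raw_list)).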

Pricing and quotas

Cloud Speech-to-Text is billed by the amount of audio processed, with separate rates for the standard and the "enhanced" model variants. Pricing changes regularly enough that it is not worth quoting a number here. Check the current rate on Google Cloud's pricing page for Speech-to-Text before committing.
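
Once you have looked up the current rate, a back-of-the-envelope estimate is simple arithmetic. The sketch below is only that: estimate_cost is a hypothetical helper, and both rate_per_minute and increment_seconds are placeholders you must fill in from the pricing page (increment_seconds models a plan that rounds each request up to a billing increment).

```python
import math


def estimate_cost(audio_seconds, rate_per_minute, increment_seconds=1):
    """Estimate the cost of one request.

    rate_per_minute and increment_seconds are placeholders, not real prices.
    """
    # Round the audio length up to the next whole billing increment.
    billed_seconds = math.ceil(audio_seconds / increment_seconds) * increment_seconds
    return billed_seconds / 60 * rate_per_minute
```

Multiplying the per-request figure by expected daily request volume gives a first-order budget. Short clips are where rounding matters most, since a 7-second clip billed in 15-second increments costs the same as a 15-second one.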

What is worth knowing without specifics:

Error handling and resilience

Production speech recognition fails in a small number of predictable ways. Your client code should handle:
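
For transient failures such as dropped connections or timeouts, one common approach is retry with exponential backoff. The sketch below is generic rather than specific to Google's client: call_with_retry is a hypothetical helper, and the built-in ConnectionError and TimeoutError stand in for whichever exception classes your client actually raises.

```python
import random
import time


def call_with_retry(fn, retryable=(ConnectionError, TimeoutError),
                    attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on a retryable exception wait and try again, up to `attempts` calls."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base delays of 0.5s, 1s, 2s, plus up to 50% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

A call such as call_with_retry(lambda: client.recognize(config=config, audio=audio), retryable=(...)) would wrap the synchronous example. Errors caused by bad input, such as an unsupported encoding, should not be retried, which is why the retryable tuple is explicit.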

When to choose a different engine

Google Cloud Speech-to-Text is a strong default, but it is no longer the only serious option. Three honest cases where you might pick a different engine:

If you just want dictation, not a transcription pipeline

If you ended up on this page looking for a way to add great voice-to-text to your own daily workflow on a Mac or iPhone — rather than to ship a backend feature — you do not need to build on Cloud Speech-to-Text yourself.

Voice Keyboard Pro is a finished macOS menu bar app and iOS keyboard that handles all the engineering above for you. It uses Whisper-class AI transcription as the default, with Apple Speech as the offline fallback. Hold a hotkey, speak, release — text appears at the cursor in any app. It includes a Voice Profile feature that learns your voice over time, a custom vocabulary feature equivalent to speech contexts, Voice Isolation for noisy environments, and Smart Rewrite for cleaning up filler words after dictation.

On the privacy side, the server stores only operational pings — no audio and no transcript content are kept. There is also a fully offline Apple Speech mode for sensitive work.

The Cloud Speech-to-Text API is the right tool for backend transcription pipelines. For everyday human dictation, a dedicated app saves you a year of integration work.

Free tier with daily limits, Pro at $4.99 per month or $34.99 per year. Same workflow on Mac and iPhone. If you have been considering building a dictation feature on the API only to use it yourself, the math usually favours buying the finished tool and getting your weekend back.