How to Convert Speech to Text in Python Using OpenAI Whisper

Converting Speech to Text in Python Using OpenAI Whisper

Automatic Speech Recognition (ASR) technologies convert spoken language into written text. This capability is fundamental to applications ranging from voice assistants and dictation software to transcription services and audio analysis platforms. Achieving high accuracy and robustness in ASR across various languages, accents, and audio qualities has historically been challenging.

OpenAI’s Whisper model represents a significant advancement in this field. Trained on a massive dataset of diverse audio, Whisper demonstrates strong performance in multilingual speech recognition and translation. Its availability through both a commercial API and an open-source library provides developers with flexible options for integrating sophisticated speech-to-text functionality into Python applications.

Essential Concepts

Understanding a few core concepts is helpful when working with speech-to-text technologies like Whisper:

  • Automatic Speech Recognition (ASR): The general process by which a computer identifies spoken words and converts them into text.
  • Audio Data: Spoken language captured as a waveform. This can be stored in various digital formats (e.g., WAV, MP3, M4A).
  • Machine Learning Model: An algorithm trained on vast amounts of data to perform a specific task, in this case, mapping audio patterns to text. Whisper is a deep learning model, utilizing a neural network architecture, specifically a Transformer-based sequence-to-sequence model.
  • Transcription: The output of the ASR process – the written text corresponding to the input audio.
  • API (Application Programming Interface): A set of definitions and protocols that allow different software systems to communicate with each other. OpenAI’s API provides access to their models, including Whisper, over the internet.
  • Library: A collection of pre-written code that developers can use to perform common tasks. The open-source whisper Python library allows running the Whisper model directly on local hardware.

Why Choose OpenAI Whisper for Python?

Several factors make OpenAI Whisper a compelling choice for speech-to-text tasks in Python:

  • High Accuracy: Whisper is recognized for its high accuracy compared to many other models, particularly on diverse audio inputs.
  • Multilingual Support: It supports transcription in many languages and can also translate speech from other languages into English text.
  • Robustness: The model is trained to handle various audio conditions, including background noise and different speaking styles.
  • Flexibility: Available via a managed API service and a downloadable open-source model, offering choices based on development needs, resources, and privacy considerations.
  • Ease of Integration: Both the API and the open-source library are designed to be relatively straightforward to use within Python environments.

Methods for Converting Speech to Text with Whisper in Python

There are two primary methods for using OpenAI Whisper in Python:

  1. Using the OpenAI API.
  2. Using the open-source openai/whisper Python library (running the model locally).

Each method has distinct requirements, advantages, and disadvantages.

Method 1: Using the OpenAI API

This method involves sending audio data to OpenAI’s servers via an API call and receiving the transcribed text back.

Prerequisites:

  • A Python installation (3.7.1 or later).
  • An OpenAI API key. Obtain this from the OpenAI platform website after creating an account. API usage incurs costs based on the amount of audio processed.
  • The openai Python library installed.

Step-by-Step Guide:

  1. Install the OpenAI library: Use pip, Python’s package installer:

    pip install openai
  2. Set up your OpenAI API key: It is recommended to set your API key as an environment variable (OPENAI_API_KEY). This prevents hardcoding the key directly in scripts.

    # On Linux/macOS
    export OPENAI_API_KEY='YOUR_API_KEY'
    # On Windows (Command Prompt)
    set OPENAI_API_KEY=YOUR_API_KEY
    # On Windows (PowerShell)
    $env:OPENAI_API_KEY='YOUR_API_KEY'

    The openai library automatically reads this environment variable. Alternatively, the key can be passed directly in the code (a short sketch of this appears after this guide), though this is less secure for production environments.

  3. Prepare the audio file: The OpenAI API supports several audio formats, including flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. The maximum file size is 25 MB. For larger files, audio segmentation is required before sending them to the API.

  4. Write the Python code: Import the library, open the audio file, and make the API call using openai.audio.transcriptions.create.

    import os
    from openai import OpenAI

    # The library automatically reads OPENAI_API_KEY from environment variables
    client = OpenAI()

    # Path to your audio file
    audio_file_path = "path/to/your/audiofile.mp3"

    # Ensure the file exists
    if not os.path.exists(audio_file_path):
        print(f"Error: File not found at {audio_file_path}")
    else:
        try:
            with open(audio_file_path, "rb") as audio_file:
                transcript = client.audio.transcriptions.create(
                    model="whisper-1",  # Currently the standard Whisper model offered by OpenAI API
                    file=audio_file
                )
            print("Transcription:")
            print(transcript.text)
        except Exception as e:
            print(f"An error occurred: {e}")
  5. Run the script: Execute the Python file from your terminal.

    python your_script_name.py
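
As mentioned in step 2, the API key can also be passed explicitly when constructing the client, which may be convenient for quick experiments. This is a minimal sketch (the key string and file path are placeholders); prefer environment variables or a secrets manager for anything beyond local testing:

    import os
    from openai import OpenAI

    # Passing the key explicitly (less secure than an environment variable).
    # "YOUR_API_KEY" is a placeholder, not a real credential.
    client = OpenAI(api_key="YOUR_API_KEY")

    with open("path/to/your/audiofile.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    print(transcript.text)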

API Call Options:

The client.audio.transcriptions.create method accepts optional parameters:

  • response_format: Specify the output format (e.g., ‘json’, ‘text’, ‘srt’, ‘vtt’). Default is ‘json’.
  • language: Specify the input language using a two-letter ISO-639-1 code (e.g., ‘en’, ‘es’, ‘fr’). Providing the language can improve accuracy and speed.
  • prompt: Provide a short piece of text likely to appear in the audio. This can help guide the model, especially for specific terminology or names.
  • temperature: Controls the randomness of the output (0 to 1). Lower values (closer to 0) make the output more deterministic.
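
For example, requesting srt output returns subtitle-formatted text with timestamps directly from the API. A minimal sketch (the file paths are placeholders); in recent versions of the openai library, non-JSON formats such as srt are typically returned as a plain string, so the result is written out as text here:

    from openai import OpenAI

    client = OpenAI()
    audio_file_path = "path/to/your/audiofile.mp3"

    with open(audio_file_path, "rb") as audio_file:
        srt_content = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt",  # subtitle output with timestamps
        )

    # For non-JSON formats the library returns the body as text
    with open("audiofile.srt", "w", encoding="utf-8") as f:
        f.write(str(srt_content))
    print("Saved subtitles to audiofile.srt")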

Example with Language and Prompt:

import os
from openai import OpenAI

client = OpenAI()

audio_file_path = "path/to/your/spanish_audio.wav"  # Assuming a Spanish audio file

if not os.path.exists(audio_file_path):
    print(f"Error: File not found at {audio_file_path}")
else:
    try:
        with open(audio_file_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                language="es",  # Specify Spanish
                prompt="Keywords related to technology and artificial intelligence."  # Example prompt
            )
        print("Transcription (Spanish):")
        print(transcript.text)
    except Exception as e:
        print(f"An error occurred: {e}")

Pros of Using the OpenAI API:

  • Ease of Use: Minimal setup; no need to manage model weights or significant local dependencies.
  • Scalability: Handles varying workloads without local infrastructure changes.
  • Managed Service: OpenAI handles the underlying infrastructure, model updates, and performance optimization.
  • Supports Many Formats: Handles a wide range of common audio file types directly.

Cons of Using the OpenAI API:

  • Cost: API usage incurs per-minute costs.
  • Latency: Transcription requires sending data over the internet and waiting for the response.
  • File Size Limit: Restricted to 25 MB per file, requiring manual segmentation for longer audio.
  • Internet Dependency: Requires an active internet connection.
  • Data Privacy: Audio data is sent to OpenAI’s servers (though OpenAI has data usage policies).

Method 2: Using the whisper Python Library (Local Model)

This method involves downloading the Whisper model weights and running the transcription process directly on local hardware using the open-source library from OpenAI’s GitHub repository.

Prerequisites:

  • A Python installation (3.8 or later recommended).
  • The whisper Python library installed.
  • torch and torchaudio for handling the model and audio data.
  • ffmpeg installed on the system (the whisper library invokes the ffmpeg executable to decode audio files; a quick environment check sketch follows this list).
  • Sufficient computational resources, especially RAM and potentially a GPU for faster processing, particularly with larger models.
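
Before continuing, it can help to confirm that the key pieces are in place. A small, optional check, assuming torch is already installed:

    import shutil
    import torch

    # ffmpeg must be on the PATH so whisper can decode audio files
    print("ffmpeg found:", shutil.which("ffmpeg") is not None)

    # A CUDA-capable GPU greatly speeds up the medium and large models
    print("CUDA available:", torch.cuda.is_available())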

Step-by-Step Guide:

  1. Install the necessary libraries: Use pip. Installing torch with CUDA support (for GPU acceleration) is highly recommended if a compatible GPU is available. Refer to the PyTorch installation guide for specific commands based on the CUDA version.

    # Basic installation (CPU-only for torch)
    pip install openai-whisper torch torchaudio ffmpeg-python
    # For GPU support (example for CUDA 11.8) - check PyTorch documentation for correct command
    # pip install openai-whisper torch torchaudio --index-url https://download.pytorch.org/whl/cu118
    # pip install ffmpeg-python

    Note: The openai-whisper package name is slightly different from the openai API library.

  2. Ensure ffmpeg is installed: Download and install ffmpeg from ffmpeg.org. Ensure the ffmpeg executable is available in your system’s PATH. The ffmpeg-python library provides Python bindings but relies on the underlying ffmpeg executable.

  3. Write the Python code: Import the whisper library, load a Whisper model, load the audio file using whisper.load_audio, and transcribe it using model.transcribe.

    import whisper
    import os

    # Path to your audio file
    audio_file_path = "path/to/your/audiofile.wav"  # .wav is common, but ffmpeg allows many other formats

    # Ensure the file exists
    if not os.path.exists(audio_file_path):
        print(f"Error: File not found at {audio_file_path}")
    else:
        try:
            # Load a Whisper model
            # Available sizes: tiny, base, small, medium, large
            # Append '.en' to the model name for English-only versions (e.g., 'base.en'),
            # which tend to perform better on English audio, especially at the smaller sizes
            print("Loading Whisper model...")
            model = whisper.load_model("base")  # Choose a model size (e.g., 'base', 'medium', 'large')
            print("Model loaded.")

            # Load the audio file (decoded via ffmpeg and resampled to 16 kHz)
            print("Loading audio...")
            audio = whisper.load_audio(audio_file_path)
            print("Audio loaded.")

            # Transcribe the audio; transcribe() processes it in 30-second windows internally
            print("Transcribing audio...")
            result = model.transcribe(audio)
            print("Transcription complete.")

            print("Transcription:")
            print(result["text"])
        except Exception as e:
            print(f"An error occurred: {e}")
  4. Run the script: Execute the Python file. The first time a model size is used, the weights will be downloaded (typically to ~/.cache/whisper).

    python your_local_whisper_script.py
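
As a follow-up, model.transcribe also accepts a file path directly (it loads and decodes the audio itself via ffmpeg), and decoding options such as fp16 can be passed; fp16=False avoids the "FP16 is not supported on CPU" warning when running without a GPU. A short sketch, with a placeholder path:

    import whisper

    model = whisper.load_model("base")

    # transcribe() accepts a path and loads the audio itself via ffmpeg
    result = model.transcribe(
        "path/to/your/audiofile.mp3",
        fp16=False,  # use FP32; avoids the FP16 warning on CPU-only machines
    )
    print(result["text"])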

Model Sizes:

The whisper.load_model() function accepts different model sizes:

  • tiny: Fastest, smallest, lowest accuracy.
  • base: Faster, smaller than medium/large, moderate accuracy.
  • small: Good balance of speed and accuracy.
  • medium: Higher accuracy, slower, larger.
  • large: Highest accuracy, slowest, largest.

English-only versions (tiny.en, base.en, small.en, medium.en) are also available. They tend to perform better on English audio, particularly at the tiny and base sizes, though the difference becomes less significant for the larger models.
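
The English-only checkpoints are loaded in exactly the same way, and the multilingual models accept decoding options such as language and task (task="translate" produces English text from non-English speech). A brief sketch with placeholder paths:

    import whisper

    # English-only model: same API, just a different checkpoint name
    en_model = whisper.load_model("base.en")
    print(en_model.transcribe("path/to/english_audio.mp3", fp16=False)["text"])

    # Multilingual model with an explicit language and translation to English
    model = whisper.load_model("small")
    result = model.transcribe(
        "path/to/spanish_audio.mp3",
        language="es",      # skip automatic language detection
        task="translate",   # translate the speech into English text
        fp16=False,
    )
    print(result["text"])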

Pros of Using the Local Library:

  • Offline Capability: Does not require an internet connection after the model is downloaded.
  • No Per-Use Cost: Once set up, there are no recurring costs per transcription.
  • No File Size Limits: Can process audio files of any length (though large files benefit from segmentation for performance).
  • Full Control: Data remains on local infrastructure.
  • Customization: Potential for more advanced usage, like accessing intermediate outputs or exploring fine-tuning (though fine-tuning Whisper is complex).

Cons of Using the Local Library:

  • Setup Complexity: Requires installing multiple dependencies, including ffmpeg, and potentially configuring GPU support.
  • Hardware Requirements: Demands significant CPU resources and RAM; GPU is strongly recommended for acceptable performance, especially with larger models.
  • Performance Variability: Transcription speed depends heavily on local hardware capabilities.
  • Maintenance: Requires managing library versions and model weights.
  • Initial Download: Requires downloading potentially large model files (e.g., large model is several GB).

Comparing the Two Methods#

Feature          | OpenAI API (whisper-1)               | Local Whisper Library (openai/whisper)
-----------------|--------------------------------------|---------------------------------------------------------
Ease of Setup    | Very easy (install openai, set key)  | Moderate (install whisper, torch, torchaudio, ffmpeg)
Cost             | Pay-per-use (per minute of audio)    | None per use (initial hardware/electricity cost)
Performance      | Fast (managed infrastructure)        | Variable (depends heavily on local hardware, especially GPU)
Hardware Req.    | Minimal (client-side)                | Significant (CPU, RAM; GPU highly recommended)
Offline Use      | No (requires internet)               | Yes (after model download)
File Size Limit  | 25 MB (requires manual segmentation) | None (practical limits based on RAM/CPU)
Control/Privacy  | Data sent to OpenAI servers          | Data remains on local system
Model Options    | Single standard model (whisper-1)    | Multiple model sizes (tiny to large)

Practical Applications and Real-World Examples#

The ability to convert speech to text with high accuracy opens up numerous possibilities in Python applications:

  • Automated Transcription Services: Building platforms to transcribe interviews, meetings, lectures, or podcasts. A service could use the Whisper API for its scalability and ease of use, processing uploaded audio files and providing text transcripts.
  • Voice Assistants and Command Interfaces: Developing applications that respond to spoken commands. The local Whisper model could be integrated into desktop or edge devices for low-latency, offline voice control, such as controlling a media player or smart home devices.
  • Adding Captions/Subtitles to Videos: Processing video audio tracks to automatically generate subtitle files (SRT, VTT). A Python script could extract the audio, pass it to Whisper (either via the API or locally), and format the output into a subtitle file for video editing or playback platforms; a minimal sketch appears after this list.
  • Analyzing Audio Content: Extracting text from customer service calls, focus group recordings, or media broadcasts for sentiment analysis, topic modeling, or keyword extraction. Data science workflows in Python can easily integrate Whisper to generate the raw text for analysis.
  • Medical or Legal Dictation: Creating tools that allow professionals to dictate notes or documents, converting their speech directly into text within a secure environment. Using the local Whisper model could be preferred here for privacy concerns.
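
For the captioning use case above, the local library's result["segments"] list carries start and end times (in seconds) for each transcribed segment, which can be formatted into an SRT file. A minimal sketch; srt_timestamp is a hypothetical helper and the paths are placeholders:

    import whisper

    def srt_timestamp(seconds: float) -> str:
        # Convert seconds to the SRT time format HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("base")
    result = model.transcribe("path/to/your/audiofile.mp3", fp16=False)

    with open("captions.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")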

Example Scenario: A small podcast editing tool developer wants to automate transcript generation for their users. They could implement both options:

  • Offer a premium feature using the OpenAI API for users who need quick, hands-off transcription and are willing to pay a small per-episode fee.
  • Provide a free, slower option using the local base or small Whisper model for users with sufficient local processing power, allowing them to generate transcripts without external costs or sending data over the internet. The Python application would detect whether a GPU is available and suggest a larger model size for better accuracy if the hardware permits, as in the sketch below.
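
A small sketch of the hardware check described in this scenario, choosing a model size based on whether a CUDA GPU is present (the size choices here are illustrative assumptions, not official recommendations):

    import torch
    import whisper

    # Pick a larger model only when a GPU is available; these choices are illustrative
    if torch.cuda.is_available():
        model_name, device = "medium", "cuda"
    else:
        model_name, device = "base", "cpu"

    print(f"Loading '{model_name}' on {device}...")
    model = whisper.load_model(model_name, device=device)
    result = model.transcribe("path/to/episode_audio.mp3", fp16=(device == "cuda"))
    print(result["text"])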

Advanced Considerations#

While the basic transcription is straightforward, more complex scenarios might involve:

  • Audio Segmentation: Breaking very long audio files into smaller chunks, either to stay under the API's 25 MB limit or to manage memory and processing time with the local model; a hedged example follows this list.
  • Timestamping: Obtaining timestamps for individual words or phrases, which is crucial for tasks like generating synchronized captions. Both the API and the local library provide timestamp information.
  • Handling Noisy Audio: While Whisper is robust, very noisy audio might require pre-processing steps like noise reduction using audio processing libraries in Python (e.g., librosa, pydub).
  • Speaker Diarization: Identifying who is speaking at what time. Whisper does not natively perform diarization, but its output can be combined with separate diarization models or techniques to attribute transcribed speech segments to specific speakers.
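
As an illustration of the segmentation point above, one approach is to split a long recording into fixed-length chunks with pydub (an extra dependency, not part of Whisper, and it also relies on ffmpeg) and send each chunk to the API separately. The chunk length and file paths here are arbitrary examples:

    import os
    from openai import OpenAI
    from pydub import AudioSegment  # requires the pydub package and ffmpeg

    client = OpenAI()
    audio = AudioSegment.from_file("path/to/long_recording.mp3")

    chunk_ms = 10 * 60 * 1000  # 10-minute chunks; adjust to stay under 25 MB
    transcript_parts = []

    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"chunk_{i}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        with open(chunk_path, "rb") as chunk_file:
            part = client.audio.transcriptions.create(model="whisper-1", file=chunk_file)
        transcript_parts.append(part.text)
        os.remove(chunk_path)  # clean up the temporary chunk

    print(" ".join(transcript_parts))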

Key Takeaways#

  • Converting speech to text in Python using OpenAI Whisper is achievable through two main methods: the OpenAI API (openai library) and the local model (openai-whisper library).
  • The OpenAI API offers ease of use, scalability, and managed infrastructure but incurs costs and requires internet access.
  • The local openai-whisper library provides offline capability, no per-use cost, and full control, but requires more complex setup and significant local hardware resources, especially a GPU for practical performance with larger models.
  • Both methods provide access to high-quality ASR capabilities powered by the advanced Whisper model.
  • The choice between API and local implementation depends on project requirements, budget, privacy needs, and available computing resources.
  • Python’s rich ecosystem of libraries simplifies integrating Whisper transcription into diverse applications, from simple scripts to complex ASR pipelines.
How to Convert Speech to Text in Python Using OpenAI Whisper
https://dev-resources.site/posts/how-to-convert-speech-to-text-in-python-using-openai-whisper/
Author: Dev-Resources
Published: 2025-06-30
License: CC BY-NC-SA 4.0