How to Transcribe & Translate Audio with Whisper

Published Mar. 9, 2023 — Edited Feb. 13, 2024

After following this guide, you will be able to use OpenAI's Whisper to transcribe audio, translate the transcription into English, and generate subtitles from the transcription.

My interest in this technology is to generate subtitles for lesser-known media, and to generate English subtitles for media that has previously only been available in its original language.

This guide assumes that you are using Linux, specifically Ubuntu in my case, and that you have enough knowledge to follow along with any linked pages and examples.

Requirements

- Git
- Git Large File Storage
- Python 3.10.x
  - You may need to run sudo apt install libffi-dev before installing Python, so that the _ctypes module is available for use.
  - You will need to run pip install numpy torch transformers after installing Python. This will install all libraries required to run the convert-h5-to-ggml.py script.
- FFMPEG
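
Before going further, it can help to confirm that the required tools are actually on your PATH. The following is a minimal sketch rather than part of the setup itself: the missing_tools helper name is hypothetical, and the tool list assumes the usual command names on Ubuntu (git, git-lfs, python3, ffmpeg, make).

```shell
# Print the name of every command from the argument list that is not on PATH.
# "missing_tools" is a hypothetical helper name, not part of any tool above.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed command names for the requirements listed above.
missing=$(missing_tools git git-lfs python3 ffmpeg make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools were found."
fi
```

Anything reported as missing would need to be installed before continuing with the steps below.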

Whisper Installation


# Create and enter a new directory.
mkdir subtitle_generation
cd subtitle_generation

# Clone all required Git repositories.
git clone https://github.com/ggerganov/whisper.cpp.git
git clone https://github.com/openai/whisper.git
git clone https://huggingface.co/openai/whisper-large-v3.git

# Compile whisper.cpp. You may need to install "make" and
# other tools first.
cd whisper.cpp
make
cd ../

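# Convert the whisper-large-v3 model from ".h5" to ".ggml". The arguments
# are the model folder, the OpenAI whisper folder, and the output folder.
# This writes "ggml-model.bin" to the current directory.
python3 ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-large-v3 ./whisper .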
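
If Git Large File Storage was not set up when you cloned whisper-large-v3, the large weight files in that folder may be small text pointers rather than the real data, and the conversion is then likely to fail. The sketch below is one way to check for that. It relies on the fact that a Git LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1". The is_lfs_pointer helper name is hypothetical, and no particular file names inside the model folder are assumed, so every top-level file is checked.

```shell
# Return success if the given file looks like a Git LFS pointer instead of
# real data. "is_lfs_pointer" is a hypothetical helper name.
is_lfs_pointer() {
    # A pointer is a short text file whose first line names the LFS spec.
    # Only the first 200 bytes are read, so large binary files stay cheap.
    head -c 200 "$1" 2>/dev/null | head -n 1 \
        | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Check every regular file at the top level of the cloned model folder.
model_dir="./whisper-large-v3"
for f in "$model_dir"/*; do
    [ -f "$f" ] || continue
    if is_lfs_pointer "$f"; then
        echo "Still an LFS pointer: $f"
        echo "Try running 'git lfs install' and 'git lfs pull' inside $model_dir"
    fi
done
```

If nothing is printed, the files in the folder are not pointers and the conversion step should have real data to work with.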

Audio Preparation

whisper.cpp will only accept 16-bit .wav files sampled at 16kHz, so you may need to extract and/or convert your audio with FFMPEG. The following commands are examples of how to do this. If your situation is more complicated than extracting a single audio track from a video or converting an audio file, you will likely need to spend time learning more about FFMPEG.


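# This will extract the first audio track of a video file, convert the
# audio to 16kHz, and save it as a ".wav" audio file. The "-map 0:a:0"
# option selects the first audio track explicitly; without it, FFMPEG
# chooses a track on its own when the video has more than one.
ffmpeg -i "input_file.mp4" -map 0:a:0 -c:a pcm_s16le -ar 16000 "output_file.wav"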


# This will convert an audio file to 16kHz and save it as a ".wav" file.
ffmpeg -i "input_file.mp3" -c:a pcm_s16le -ar 16000 "output_file.wav"
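
If you have several files to prepare, the same conversion can be wrapped in a loop. This is only a sketch: the wav_name_for and to_wav16k helper names and the ./media folder are hypothetical, and the conversion options (-c:a pcm_s16le -ar 16000) are the ones used in the examples above.

```shell
# Derive the output name by swapping the last file extension for ".wav".
# Hypothetical helper; it assumes the input name has an extension.
wav_name_for() {
    printf '%s\n' "${1%.*}.wav"
}

# Convert one audio or video file to a 16kHz, 16-bit ".wav" file.
# "-n" tells FFMPEG not to overwrite an output file that already exists,
# and "-nostdin" stops it from waiting for keyboard input inside the loop.
to_wav16k() {
    ffmpeg -nostdin -n -i "$1" -c:a pcm_s16le -ar 16000 "$(wav_name_for "$1")"
}

# Convert every ".mp4" and ".mp3" file in a hypothetical "./media" folder.
for input in ./media/*.mp4 ./media/*.mp3; do
    [ -e "$input" ] || continue   # Skip patterns that matched nothing.
    to_wav16k "$input"
done
```

Each output file is written next to its input, so movie.mp4 would become movie.wav in the same folder.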

Running Whisper

There are many parameters that you can use with Whisper. To view them, cd into the whisper.cpp folder and then run ./main --help.

As an example, assume you have a Dutch movie and that you want to generate English subtitles for it. You could use the -l nl option to tell Whisper that the file is in Dutch, the -tr option to translate the Dutch transcription into English, and the -osrt option to create an .srt subtitle file with the translated transcription.


# Place your ".wav" file in the same folder as "ggml-model.bin"
./whisper.cpp/main -l nl -tr -osrt --model ./ggml-model.bin -f ./input_file.wav
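
By default, whisper.cpp names the subtitle file after the input file, so the command above should produce input_file.wav.srt. If you would rather have a name that matches your video, so that media players pick it up automatically, the -of option sets the output path without its extension. The following sketch assumes the same folder layout as above. The stem_for and translate_to_srt helper names and the movie.mkv file name are hypothetical.

```shell
# Strip the last extension, e.g. "movie.mkv" -> "movie".
# Hypothetical helper; it assumes the name has an extension.
stem_for() {
    printf '%s\n' "${1%.*}"
}

# Translate the speech in a ".wav" file into English subtitles.
# $1 = two-character language code, $2 = ".wav" file, $3 = video file name.
translate_to_srt() {
    ./whisper.cpp/main -l "$1" -tr -osrt \
        --model ./ggml-model.bin \
        -f "$2" \
        -of "$(stem_for "$3")"   # whisper.cpp appends ".srt" itself.
}

# Example call (file names are placeholders); this should write ./movie.srt:
# translate_to_srt nl ./input_file.wav ./movie.mkv
```

Passing the language code as an argument also makes it easy to reuse the same function for media in other languages.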

Notes

OpenAI has released a number of other models, which are much smaller than the whisper-large-v3 model used in this guide, and you can find them here.
You can find a list of two-character language codes here.
I have written scripts to help automate some of the tasks outlined above. You can view them here.