Whisper

Updated for version: November 17th 2023 

Accessible via: https://github.com/openai/whisper  

Technical paper: https://cdn.openai.com/papers/whisper.pdf 

Alternative version with user interface: https://grisk.itch.io/whisper-gui.

Ratings

Accuracy / Quality   ★★★★☆ 

Flexibility / Features ★★★☆☆ 

Data security / Privacy ★★★★★

Open source model

Pros/cons

Pros 

  • Open source and locally installed, so privacy is safeguarded. 
  • Usable on both CPU and GPU. 
  • Supports many languages. 
  • Choice between accuracy and speed. 

Cons 

  • Requires some technical experience and programming. 
  • Ease of installation and operation dependent on hardware. 
  • No voice recognition / speaker identification. 
  • No timestamps. 

The following guide is written for users with WUR devices. If you use a personal device, the guide remains applicable and will require fewer administrative rights. 

Description

When making recordings of interviews, one of the more time-consuming tasks is transcribing the audio for further analysis and documentation. As these recordings can contain sensitive or personal information, it is important that such information does not leak out. Commercial transcription tools exist, but purchasing them may not always fit within the budget. Hence we offer the Whisper model as an open-source alternative.  

Whisper is an AI-powered, free (open source) transcription tool developed by OpenAI. Whisper can transcribe many languages at varying levels of accuracy. The model does not assign text to individual speakers, so it may be helpful to format the output into readable sections via the natural language processing toolkit NLTK, which is described later in this guide. 

For this manual we will use Jupyter Notebook to interact with the model, though other methods are available. This is Python-based software that works well with Whisper. To install the necessary dependencies and packages we use either Anaconda (conda) or pip.  

Before beginning the installation of any of the required software it is good to check the hardware and software of your device, as this may affect the functionality of the model code. Depending on the size of the model you intend to use, the required VRAM (Video Random Access Memory, the memory on your graphics card) may vary from 1 GB to 10 GB. In addition, if you have an NVIDIA-powered GPU you may have access to CUDA. CUDA was designed by NVIDIA for rapid processing of data on the GPU, allowing the model to run faster on the GPU. You can check whether your device is CUDA-enabled by pressing Windows+R, typing “devmgmt.msc”, navigating to “Display adapters” and checking which graphics card you have. If this is an NVIDIA graphics card and it is listed on NVIDIA’s CUDA-capable GPU list, you have a CUDA-enabled GPU. If not, the model will still work, but will operate more slowly, and the code will differ slightly in Python. This guide has been written for Windows, but should require only minimal adaptations for use on a Mac.  

Image: the Device Manager window (“Display adapters”), where you can check whether your graphics card is CUDA-enabled.
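
If PyTorch is already installed (see step 3 of the installation guide below), you can also check CUDA availability and the amount of GPU memory directly from Python. The sketch below uses only standard PyTorch calls; the reported device name and memory figures will depend on your hardware.

# Check whether a CUDA-enabled GPU is visible to PyTorch.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"Total VRAM: {total_bytes / 2**30:.1f} GB (free: {free_bytes / 2**30:.1f} GB)")
else:
    print("No CUDA-enabled GPU detected; Whisper will run on the CPU.")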

Please note that when copying code from the PDF-version of this document the indentation (empty spaces at the start of a line) may be accidentally removed. These would need to be restored manually, so carefully read the code. 

Features and examples

Example code (basic)

This is an example of what the base code would look like. For the full code, see the end of this document. 

# Importing required libraries  
import whisper 
import torch 

# Select the GPU (cuda) or CPU for processing. 
device = 'cuda' if torch.cuda.is_available() else 'cpu' 

# Load the model to the GPU/CPU and specify the model size. 
model = whisper.load_model('large').to(device) 

# Transcribe the recording to text. 
result = model.transcribe("Recording.mp3") 

# Open a new text file and write the transcription to it. 
with open("Transcription.txt", "w", encoding="utf-8") as txt: 
	txt.write(result["text"]) 
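
Whisper normally detects the spoken language automatically. If you already know the language of the recording, you can optionally pass it to the transcribe call, which skips the detection step. For example, for a Dutch recording:

# Transcribe a recording whose language is known in advance (Dutch in this example).
result = model.transcribe("Recording.mp3", language="nl")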

Installation guide

Follow the steps below in order to install Whisper via Jupyter Notebook and conda. 

  1. Install Anaconda via the WUR Software Center or via https://www.anaconda.com/download.  
  2. Install FFmpeg. This software is used for audio processing and is essential to the functioning of the model. 
    a. Navigate to https://www.gyan.dev/ffmpeg/builds/ and download the “ffmpeg-release-full.7z” file from the “release builds” section. 
    b. Extract the *.7z archive and rename the extracted folder to “FFmpeg”. 
    c. Move this new “FFmpeg” folder to your C-drive. 
    d. Search in your Windows search bar for the “Edit environment variables for your account” option (NL: search for “Omgevingsvariabelen”). In this window, edit the PATH variable by adding the location “C:\FFmpeg” to it. Save the settings. 
    e. Press Windows+R and type “cmd” to open a command window. Type “ffmpeg -version” and press ENTER to validate the installation of this package. 
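
Once Jupyter Notebook is set up (step 5 below), you can also check from within Python that FFmpeg is found on the PATH. This is a minimal check using only the standard library:

# Print the location of the FFmpeg executable, or a warning if it cannot be found.
import shutil
print(shutil.which("ffmpeg") or "FFmpeg was not found on the PATH.")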

3.  Open Anaconda Prompt as administrator and type the following: 

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia 

 To open Anaconda Prompt as administrator on a WUR device, look for Anaconda Prompt in your Start menu, right-click “Anaconda Prompt (Anaconda3)” and open the file location. In the file location, right-click the same program again, but now select “WUR – Run with administrative rights”. 

This will install the necessary packages for the processing of data via PyTorch. You may need to adjust the mentioned version of CUDA depending on the GPU in your device. You can adjust this using information from the PyTorch website (https://pytorch.org/get-started/locally/). If you do not have an NVIDIA GPU, then choose the CPU-option from the PyTorch website. 
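
As an illustration, at the time of writing the PyTorch website lists the following conda command for a CPU-only installation; do verify the current command on the website, as it changes between releases:

conda install pytorch torchvision torchaudio cpuonly -c pytorch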

4. You will also need to install git in order to install the Whisper model via pip later. Run the following code in the same Anaconda Prompt window as used for step 3: 

conda install git 

In addition, add git to the PATH using the same method as applied for FFmpeg in step 2. The location you need to add may vary depending on your system settings. You can obtain the location by running the script once (in step 8) and copying the location indicated from the error. The most likely location is “C:/Program Files/Git”. 
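
To find out where git was installed (and therefore which folder to add to the PATH), you can run the following command in the same Anaconda Prompt; the exact location it reports depends on your system:

where git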

5. Open Jupyter Notebook and open a new notebook with a Python kernel. Copy the basic example code shown earlier in this guide into the notebook. 
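
If Jupyter Notebook is not available as a separate program on your device, it can usually be started from the Anaconda Prompt with:

jupyter notebook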

6. Add an extra cell at the top of the notebook (using the ‘+’ button) and type the following line: 

!pip install git+https://github.com/openai/whisper.git 

This would only need to be done once to install the Whisper model. After that, this line of code may be deleted or commented out using a ‘#’ at the front of the line. You can run this code by going to “Cell” and selecting “Run Cells”. 
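
To confirm that the installation succeeded, you can optionally run the following in a new cell; it imports the package and lists the model sizes Whisper knows about:

# Verify the Whisper installation by listing the available model names.
import whisper
print(whisper.available_models())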

If you are looking to upgrade Whisper to the latest model, use the following script instead:

!pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

7. Choose the type of model you wish to use. There are six different model sizes available. For Dutch, the large model is recommended for accuracy (though this comes at the cost of processing speed). For English the small model is often sufficient. The example script uses the ‘large’ model. Adjust the name to the model you wish to use (multilingual or English-only version, see the table below).

The turbo model is the newest model released (October 2024). This model is near-equal in quality to the large model, but is faster than even the small model, making it the preferred model for most tasks.

Size         Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
Tiny         39 M         tiny.en              tiny                 ~1 GB           ~32x
Base         74 M         base.en              base                 ~1 GB           ~16x
Small        244 M        small.en             small                ~2 GB           ~6x
Medium       769 M        medium.en            medium               ~5 GB           ~2x
Large        1550 M       N/A                  large                ~10 GB          1x
Turbo (NEW)  809 M        N/A                  turbo                ~6 GB           ~8x

The first time you run a new model version the model will first be downloaded and saved in “C:\Users\<WUR-account-name>\.cache\whisper”. You can delete the models manually from that location if you choose to stop using Whisper. 
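
If you want to see which model files have been downloaded and how much disk space they occupy, the following sketch (assuming the default cache location mentioned above) lists them:

# List downloaded Whisper model files in the default cache folder.
from pathlib import Path

cache_dir = Path.home() / ".cache" / "whisper"  # default download location
for model_file in cache_dir.glob("*.pt"):
    print(f"{model_file.name}: {model_file.stat().st_size / 2**30:.1f} GB")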

8. The model can now be run in full. Ensure the audio file to be transcribed is in the same folder as the script (this can be changed by adding a file selection dialog to the script, see the “Optional modifications” section below) and that the name of the audio file matches the file name used in the script. After processing, the output can be found in a text file in the same folder.  

You can tell the model is running by looking at the top-right corner of the Notebook. If the circle is fully black the script is active. If the circle is empty, the script is no longer running. 

9. Before running the model again, navigate to “Kernel” and select “Restart & Clear Output”. Restarting the kernel prevents previously ‘reserved memory’ from persisting, which could otherwise cause the script to crash due to a lack of available RAM. The final segment of the ‘Complete code’ at the end of this guide should prevent this problem, but in specific cases the error may still occur. 

The final output of the model can be found in a text file in the same folder as where the Jupyter Notebook Python script is saved. 

Optional modifications

Duration tracking

Transcription may take a while, and the processing time may factor into whether and how you use the tool. It can therefore be worth adding a timer to the script that reports how long the transcription took. For this functionality, add the following: 

At the start of the script: 

# Import the datetime library and save the starting time of the script. 
from datetime import datetime 
startTime = datetime.now() 

And at the end of the script: 

# Display the total time elapsed since starting the script. 
print("Processing time:" + str(datetime.now() - startTime)) 

File dialog for audio file selection

If you want to replace the default *.mp3 file selection line in the script with a file selection dialog, allowing you to select files regardless of the location or file name, the code below may be of use: 

# Import the GUI (Graphical User Interface) library 
import tkinter as tk 
from tkinter import filedialog 
 
# Start up the GUI and hide it in the background until a window is required. 
root = tk.Tk() 
root.withdraw() 
 
# Open a file selection window for audio files and save the file location of the file. 
file_path = filedialog.askopenfilename(filetypes=[("Audio Files", 
    "*.mp3 *.mp4 *.mpeg *.m4a *.wav *.webm")]) 

The tkinter package is Python’s built-in user interface toolkit; here it is used only to open the standard Windows file-selection dialog. In the original script the following adjustment would need to be made: 

result = model.transcribe("Recording.mp3") 

Should be changed to: 

result = model.transcribe(file_path) 

This will allow the model to use the selected file rather than a file called “Recording.mp3” for the transcription. 

Customizing the transcription file save location and name

The default script saves the transcription text in the same folder as where the script is located, and under a pre-set name. This can be adjusted by replacing the final segment of the script with the following code: 

# Requires: from tkinter import filedialog (see the file dialog section above). 
text_file = filedialog.asksaveasfilename(title="Select Location", 
    filetypes=(("Text Files", "*.txt"),)) 
with open(text_file, 'w', encoding="utf-8") as txt: 
	txt.write(result["text"]) 
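
Note that asksaveasfilename does not add a file extension by itself. You can either append “.txt” to the returned path (as is done in the complete script at the end of this guide) or, for example, pass the defaultextension option:

# Ask for a save location; ".txt" is appended automatically if no extension is given.
text_file = filedialog.asksaveasfilename(title="Select Location",
    defaultextension=".txt",
    filetypes=(("Text Files", "*.txt"),))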

Separation of sentences in the output file

Without adjustment to the script, the output file will contain a large amount of text condensed into a single paragraph. This may make it more difficult to interpret the output. An option is to add a sentence tokenizer (an algorithm that splits text into sentences) to the code, thus separating the output into smaller, more readable segments. To achieve this, the following adjustments need to be made: 

Install (one-time only) the nltk package and the required ‘punkt’ tokenizer data by running the following code in a notebook cell: 

!pip install nltk 
import nltk 
nltk.download('punkt') 

At the start of the script add the following line: 

import nltk 

After the line ‘result = model.transcribe(...)’ the following line should be added: 

sent_text = nltk.sent_tokenize(result['text']) 

At the end of the code a small change also needs to be made to the output file: 

txt.write(result["text"]) 

Should be replaced with: 

for sentence in sent_text: 
	txt.write(sentence + "\n") 
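
As a small illustration of what the tokenizer does (assuming nltk and the ‘punkt’ data have been installed as described above):

# Split a short example text into a list of sentences.
import nltk
example = "This is the first sentence. And here is a second one."
print(nltk.sent_tokenize(example))
# Expected output: ['This is the first sentence.', 'And here is a second one.']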

All optional modifications combined

Combining all of the aforementioned modifications results in the following complete script:

# Import required libraries and save the current start time. 
from datetime import datetime 
startTime = datetime.now() 
import sys 
import whisper 
import torch 
import nltk 
import tkinter as tk 
from tkinter import filedialog 
import gc 
 
# Start up the GUI and hide it in the background until a window is required. 
root = tk.Tk() 
root.withdraw() 
 
# Open a file selection window for audio files and save the file location of the file. 
file_path = filedialog.askopenfilename(filetypes=[ 
	("Audio Files", "*.mp3 *.m4a *.wav"), 
	("Video Files", "*.mp4 *.mpeg *.webm") 
]) 
 
# Specify the model to be used: 
MODEL_VERSION = 'turbo' 
MODEL_MEMORY = { 
	'tiny': 1.6, 
	'base': 1.8, 
	'small': 2.9, 
	'medium': 5.8, 
	'large': 10.8,
    'turbo': 6.8
} 
 
# Obtain the required memory to run the model from the dictionary. 
req_mem = MODEL_MEMORY.get(MODEL_VERSION, float('inf')) 
 
# If an unknown model version was specified, print a warning 
if req_mem == float('inf'): 
	print(f"Unknown model version: {MODEL_VERSION}") 
	sys.exit() 
 
# Select the GPU (cuda) or CPU for processing. If there’s insufficient memory available, the CPU will be used. 
device = 'cpu' 
if torch.cuda.is_available(): 
	if (torch.cuda.mem_get_info()[1] / 2**30) > req_mem: 
		device = 'cuda' 
	else: 
		print("There is insufficient GPU memory for the chosen model. \ 
		For now the CPU will be used. If this is not desired, please choose \ 
		a smaller model size.") 
 
# Load the model to the GPU/CPU and specify the model size. 
model = whisper.load_model(MODEL_VERSION, device=device) 
 
# Transcribe the recording to text. 
result = model.transcribe(file_path) 
print("Transcription complete after: " + str(datetime.now() - startTime)) 
sent_text = nltk.sent_tokenize(result['text']) 
 
# Open a new text file and write the transcription to it sentence by sentence. 
# The user selects the text file name and location. 
text_file = (filedialog.asksaveasfilename(title="Select Location", 
	filetypes=(("Text Files", "*.txt"),))) + ".txt" 
with open(text_file, "w", encoding="utf-8") as txt: 
	for sentence in sent_text: 
		txt.write(sentence+ "\n") 
 
# Display the total time elapsed since starting the script. 
print("Processing time:" + str(datetime.now() - startTime)) 
 
# Clear dedicated GPU memory if CUDA was used. 
# Doing this should prevent having to manually clear the GPU memory. 
if device == 'cuda': 
	model = None 
	gc.collect() 
	torch.cuda.empty_cache() 

References

  • Whisper on GitHub: https://github.com/openai/whisper 
  • PyTorch: https://pytorch.org/get-started/locally/ 
  • FFmpeg: https://ffmpeg.org/ 
  • Jupyter Notebook: https://jupyter.org/ 
  • Anaconda: https://www.anaconda.com/ 
  • pip: https://pip.pypa.io/en/stable/installation/ 
  • gyan.dev FFmpeg builds: https://www.gyan.dev/ffmpeg/builds 
  • NVIDIA CUDA-capable GPU list: https://developer.nvidia.com/cuda-gpus 
  • NLTK: https://www.nltk.org/