This is a high level overview of the transcription method used by Borges Typing.
This is a heuristic copy of my field notes from working with Whisper CPP.
Whisper is the sleeping tiger from OpenAI. While LLM-based generative AI chatbots have captured the general public's attention during the AI boom, the Python-based Whisper has a lot of potential to help with speech-to-text tasks.
whisper.cpp is a C++ reimplementation of Whisper, which we will refer to as Whisper CPP throughout this piece. However, it is currently only available as a CLI tool. So, finding a way to let Whisper CPP be used more widely, even under the hood, would be a common good.
Just for reference, here are the research sources that have guided my decisions for future project work.
- Moxie Marlinspike
- "End to End Encryption for Everyone" talk on YouTube from Next Generation Threats 2014
- A dress rehearsal for the next talk below in 2015
- "Making private communication simple" talk from Webstock 2015
- "The ecosystem is moving" talk from 36C3 in 2019
- "End to End Encryption for Everyone" talk on YouTube from Next Generation Threats 2014
- Andrew "bunnie" Huang
- "Qubes OS for Organizational Security Auditing" talk by Harlo Holmes at HOPE 2020
- Video available on Internet Archive or YouTube
- "The Insecurity Industry" by Edward Snowden, after the Pegasus Investigation into NSO Group
Here are the main takeaways:
- From Marlinspike, I learned about usable security through Signal Messenger;
- From Huang, through Precursor and Betrusted, I learned that a sustainable open-source technology project should, from a human-effort standpoint, be sustainable by fewer than ten people rather than by a giant corporation;
- From Holmes, I learned that small organizations also need to pay attention to cybersecurity beyond 2020; and
- From Snowden, I learned that he was completely correct, because of the next point below.
His call to use Rust in new software projects was prescient, given that a White House press release from February 2024 named Rust as an example of a memory-safe language in its full report.
I know someone is going to read this and think something along the lines of: this isn't new information. However, I warn you not to misconstrue this as a "My job here is done/But you didn't do anything" meme situation.
These are a few other projects using Whisper (either the Python original or the CPP port) in an accessible way:
- Stage Whisper
- Last mentioned in a 2022 article from The Verge
- MacWhisper, a proprietary macOS client primarily for Apple Silicon
- Well, that is what happens when you use the MIT License
- Cf. Minix and Intel ME
- Vibe, built with Tauri for desktop (Windows, macOS, and Linux)
- Transcribro, an Android app distributed through Accrescent
- OTranscribe, an open-source web app
TL;DR: you have to trust these services not to keep copies of your audio forever.
There are three broad reasons why I am interested in working with Whisper CPP.
First, I genuinely have trouble discerning what's being said aloud in relatively modern multimedia (since 2015). This Vox article from January 2023 really resonated with me when its video companion was released on YouTube. Yet, though I'm not very fluent in Spanish, somehow I can hear every word enunciated in Spanish music? (Does someone want to take me to a Bad Bunny concert to test this hypothesis? Psychologists really need to study this.) Joking aside, I would like to increase accessibility in technology overall - and if casting my stone into the collective ocean of knowledge gets that effort started, then I'll be happy.
Second, I do quite a bit of work with Whisper CPP. I use it to help me write meeting minutes, write transcripts for podcasts, and subtitle the occasional video. I'd like to share how I use Whisper CPP, since the most documentation I've seen so far is in its GitHub issues, when it's malfunctioning very badly.
Lastly, I feel that there must be some way to work on AI in an ethical manner. Even though I'm not even making commits back to Whisper CPP, I have a Hilbert-like amount of confidence that working on AI can be done ethically. Just as money is ultimately a tool, AI is a tool: a giant force multiplier of whatever existing values, virtues, or traits the user already has. It is up to the user whether AI will be used in ways that are detrimental or beneficial to society. We have to forge the beneficial tools in the open, because AI tools of dubious nature had already been in development long before 2020: Clearview AI and the NSA's internal voice recognition tools detailed by The Intercept in 2018 are only two examples.
The purpose of this project is to develop a usable GUI for Whisper CPP that anyone, with no knowledge of the command line, can use. Many of the people I have used Whisper CPP with don't have the technical knowledge to rebuild the organizational supports I have built thanks to it - so, basically, the bus factor is in effect. Creating a Whisper CPP GUI would help mitigate bus factor concerns, if not effectively eliminate them once a sustaining collective effort behind the GUI reaches critical mass.
Eventually, explicit and actionable requirements will be listed, but basically the GUI would satisfy these guiding design requirements:
- be written in Rust,
- be cross-platform on desktop (Windows, macOS, and Linux),
- only make internet connections to download Whisper CPP's models,
- show in-progress transcription lines while transcription is running, and
- save the output in plain text, SRT, VTT, and asciinema cast formats.
The GUI is not a live text editor: users are free to choose any application to edit the resulting files. (However, for your own sanity, don't choose a rich text editor like Microsoft Word or LibreOffice Writer if you have to significantly rework the text content prior to visual formatting.) Regarding UX, we will follow the KISS principle.
In a terminal emulator instance, create a local clone of the Whisper CPP Git repo with:
$ git clone https://github.com/ggerganov/whisper.cpp.git
Then, navigate into the whisper.cpp directory to start using Whisper CPP.
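Before the first run, the CLI binary needs to be built and at least one model downloaded. As a minimal sketch, assuming the Makefile and the model helper script bundled with the repo at the time of writing:

$ make # builds the ./main CLI used throughout this piece
$ bash ./models/download-ggml-model.sh base.en # example model; swap in the one you need

If the helper script doesn't offer the quantized large model you want, the repo also ships a quantize example for producing one locally from the full model.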
After a new point release is published, run the following:
$ git pull
$ git log # Scroll through stdout to find the point release
$ git checkout <commit-checksum-in-hexadecimal> # Or obtain hash in GH releases
To return the repo to the branch tip in preparation for the next update, run:
$ git switch <primary-branch-name> # Usually master or main
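Since point releases are tagged on GitHub at the time of writing, here is a shortcut sketch that skips scrolling through git log (run it from the primary branch after pulling):

$ git checkout $(git describe --tags --abbrev=0) # jump to the most recent tag on the branch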
You only need to redownload each model when it has changed. Reference the Git LFS (Large File Storage) repo on Hugging Face to check for model updates.
While in the Whisper CPP directory, convert any input audio file into a 16 kHz WAV file with FFmpeg:
$ ffmpeg -i <input.audio> -ar 16000 <output.wav>
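Whisper CPP specifically expects 16-bit mono WAV input, and the upstream README's example conversion forces both explicitly, so a slightly fuller variant of the same command is:

$ ffmpeg -i <input.audio> -ar 16000 -ac 1 -c:a pcm_s16le <output.wav> # mono, 16-bit PCM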
Then, run:
$ ./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f <output.wav>
(I will further explain the flags used for confidence colors and for output in plain text, SRT, and VTT formats. I will also discuss the quantized versions of the large-v* models, and possibly using the distilled and/or tinydiarize models, in the long-term future.)
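In the meantime, here is a quick gloss of the flags in the command above, as of the build I use (run ./main --help to confirm them on your version):

# -m path to the ggml model file
# -pp print progress while decoding
# -pc colorize the printed tokens by model confidence
# -otxt, -ovtt, -osrt also write .txt, .vtt, and .srt files alongside the input
# -f the 16 kHz WAV file to transcribe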
Instead of trying to automate asciinema, use its built-in -c flag to record a single command, along with the -i flag to limit recorded idle time to 2 seconds:
$ asciinema rec -i 2 filename.cast -c "./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f /path/to/output.wav"
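To review the recording afterwards, asciinema can replay the cast directly in the terminal:

$ asciinema play filename.cast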
After saving the output files in a separate location, repeat with large-v1-q5_0, large-v3-q5_0, or any other Whisper CPP model.
Using asciinema to save Whisper CPP's color output has shown me that there are many pipelining possibilities when using Whisper CPP. One example is to make a video of Whisper CPP's output via VHS from Charmbracelet.
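I have not built that VHS pipeline yet, so as a sketch of the same idea with a simpler tool: agg, the asciinema GIF generator, renders a cast straight to an animated GIF:

$ agg filename.cast filename.gif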
There are three broad categories of work I use Whisper CPP for.
In the words of others, not mine: I have basically outdone my predecessors when it comes to taking notes from board meetings, thanks to Whisper CPP.
With the consent of everyone in the board meeting, I record the meeting with a Zoom H1n recorder and a lavalier mic. Afterwards, I run Whisper CPP on the recording, which lets me write very detailed meeting minutes.
Fortunately, I don't have to edit the output from Whisper CPP. Thanks to the lossless quality of the recording setup, Whisper CPP almost always diarizes correctly, even though I am downscaling the audio quality for it.
This is typically for a podcast, where there is no video - so, I don't have to synchronize the subtitles.
I don't always have access to the original audio file, so I tend to "upscale" the audio quality of a lossy audio file.
This is where I have noticed that, along with the audio quality aspect, Whisper performs worse for people of color - and I'm not talking about new English language learners. I mean I'm basically watching the audio version of Coded Bias play out right in front of me when I work with Whisper CPP. It empirically validates everything that Joy Buolamwini and her colleagues have said about discrimination in facial recognition AI, except here it is in speech recognition.
- Darknet Diaries
- Example:
- "How to Fix the Internet" by the EFF
- Example: "Don't Be Afraid to Poke the Tigers" with Andrew "bunnie" Huang
This is a bit tough, as I have the least experience with this.
Currently, I know that FFsubsync is a Python program that's supposed to help with this, but I somehow could not get it to work.
So, I used the online version of SubSync.
(I will really need to look into this in the future.)
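For future reference, the invocation documented in FFsubsync's README (which I still need to test properly against my own files) pairs the video with the out-of-sync subtitles:

$ ffsubsync video.mp4 -i unsynchronized.srt -o synchronized.srt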
- Wikipedia article section regarding hardcoded vs. softcoded subtitles
- YouTube support page on uploading subtitles
- Support page on supported file types
- Vimeo support page on uploading subtitles
I used these to help me figure out the actual programming:
- Stack Overflow Q&A on a detached HEAD in Git
- Documentation of Whisper CPP and asciinema
- Russian Bear, my long-time collaborator for assistance with Git
- A convoluted "pun" based on deliberate cyber misattribution to Russian APTs (such as Cozy Bear), and on the general wariness of Russians among U.S. policymakers stemming from the Illegals Program and the TV show The Americans
- The threat model of Tillitis for inspiration
- There are more entities to thank, but we're not quite ready yet
This is licensed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) License.