
Method for Borges Typing

This is a high level overview of the transcription method used by Borges Typing.

This is an informal copy of my field notes from my work with Whisper CPP.

Introduction

Whisper is the sleeping tiger from OpenAI. While generative AI chatbots built on LLMs have taken the general public by storm during the AI boom, the Python-based Whisper has a lot of untapped potential for speech-to-text tasks.

whisper.cpp is a C++ reimplementation of Whisper, which we will refer to as Whisper CPP throughout this piece. However, it is currently only available as a CLI tool. So, finding a way to let Whisper CPP be used more widely, even under the hood, would be a common good.

Background

Just for reference, here are the research sources I have used to guide my decisions for future project work.

Here are the main takeaways:

  • From Marlinspike, I learned about usable security through Signal Messenger;
  • From Huang, through Precursor and Betrusted, I learned that a sustainable open source technology project should, from a human-effort stance, be sustainable by fewer than 10 people rather than a giant corporation;
  • From Holmes, I learned that small organizations also need to pay attention to cybersecurity beyond 2020; and
  • From Snowden, you're completely correct, because of the next point below.

The call to use Rust in new software projects is prescient, given that a White House press release from February 2024 named Rust as an example of a memory-safe language in its full report.

I know someone is going to read this and think something along the lines of: this isn't new information. However, I warn you not to misconstrue this as a "My job here is done/But you didn't do anything" meme situation.

Examples of prior transcription projects using Whisper

These are a few other projects using Whisper (either Python or CPP) in an accessible way:

Examples of existing non-free online transcription services

TL;DR: You trust these services to not keep copies of your audio forever.

Motivation

There are three broad reasons why I am interested in working with Whisper CPP.

First, I genuinely have trouble discerning what's being said aloud in relatively modern multimedia (since 2015). I really resonated with this Vox article from January 2023 when its video companion was released on YouTube. However, though I'm not very fluent in Spanish, somehow I can hear every word enunciated in Spanish music? (Does someone want to take me to a Bad Bunny concert to test this hypothesis? Psychologists really need to be studying this.) Joking aside, I would like to increase accessibility in technology overall - and if casting my stone into the collective ocean of knowledge gets that effort started, then I'll be happy.

Second, I do quite a bit of work with Whisper CPP. I use it to help me write meeting minutes, write transcripts for podcasts, and subtitle the occasional video. I'd like to share how I use Whisper CPP, since most of what's documented so far appears in its GitHub issues, when it's malfunctioning very badly.

Lastly, I feel that there must be some way to work on AI in an ethical manner. Even though I'm not even making commits back to Whisper CPP, I have a Hilbert-like amount of confidence that working on AI can be done ethically. Just as money is ultimately a tool, AI is a tool too: a giant force multiplier of whatever values, virtues, or traits the user already has. It is up to the user whether AI will be used in ways that are detrimental or beneficial to society. We have to forge the beneficial tools in the open, because general AI tools of dubious nature were already long in development before 2020 - Clearview AI and the NSA's internal voice recognition tools detailed by The Intercept in 2018 are only two examples.

Purpose

The purpose is to develop a usable GUI for Whisper CPP that anyone, with no knowledge of the command line, can use. Many people with whom I have used Whisper CPP don't have the technical knowledge to rebuild the organizational supports I have built thanks to Whisper CPP - so, basically, the bus factor is in effect. Creating a Whisper CPP GUI would help mitigate those bus factor concerns, if not effectively eliminate them once a sustaining collective effort behind the GUI reaches critical mass.

Eventually explicit and actionable requirements will be listed, but basically the GUI would satisfy these guiding design requirements:

  • be written in Rust,
  • be cross-platform on desktop (Windows, macOS, and Linux),
    • APT-, RPM-, and Pacman-based distros will be supported when the "first" 1.0.0 version is released,
  • only make internet connections to download Whisper CPP's models,
  • show in-progress transcription lines while execution is in progress, and
  • save the output in: plain text, SRT, VTT, and asciinema cast.

The GUI is not a live text editor: users are free to choose any application to edit the resulting files. (However, for your own sanity, don't choose a rich text editor like Microsoft Word or LibreOffice Writer if you have to significantly rework the text content prior to visual formatting.) Regarding UX, we will follow the KISS principle.
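For reference on those output formats, a single subtitle cue in SRT looks like this (the cue text is a made-up example):

```text
1
00:00:00,000 --> 00:00:04,500
Hello, and welcome to the meeting.
```

VTT uses the same cue structure, but the file starts with a WEBVTT header line and its timestamps use a period instead of a comma before the milliseconds. An asciinema cast additionally preserves the terminal session itself, confidence colors included.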

Get started with Whisper CPP

Download Whisper CPP

In a terminal emulator instance, create a local clone of the Whisper C++ Git repo with:

$ git clone https://github.com/ggerganov/whisper.cpp.git

Then, navigate into the whisper.cpp directory and build the project (for example, with make) to produce the main binary before using Whisper CPP.

How to update the application Whisper CPP

After a new point release is published, run the following:

$ git pull
$ git log  # Scroll through stdout to find the point release
$ git checkout <commit-checksum-in-hexadecimal>  # Or obtain hash in GH releases

To return the repo to the latest commit in preparation for the next update, run:

$ git switch <primary-branch-name>  # Usually master or main

How to update each model

You only need to redownload a model when it has changed. Reference the Git LFS (Large File Storage) repo on Hugging Face to check for model updates.
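As a sketch of how that check can be scripted, here is a guard that only fetches a model when no local copy exists. The model name is just an example; the download-ggml-model.sh helper ships with whisper.cpp, and the actual download line is left commented out so the sketch is safe to run anywhere:

```shell
# Hypothetical guard: fetch a model only when no local copy exists.
MODEL="large-v2-q5_0"
MODEL_PATH="models/ggml-${MODEL}.bin"

if [ ! -f "$MODEL_PATH" ]; then
    echo "downloading ${MODEL}"
    # ./models/download-ggml-model.sh "$MODEL"   # helper script shipped with whisper.cpp
else
    echo "${MODEL} already present; delete ${MODEL_PATH} to force a redownload"
fi
```

To force an update after Hugging Face shows a new revision, delete the local .bin file and rerun the guard.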

How to transcribe with Whisper CPP

While in the Whisper CPP directory, convert any input audio file into a 16 kHz WAV file with FFmpeg (mono, 16-bit PCM is the safest target for Whisper CPP):

$ ffmpeg -i <input.audio> -ar 16000 -ac 1 -c:a pcm_s16le <output.wav>

Then, run:

$ ./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f <output.wav>

(I will further explain the flags used for: confidence colors and the output in plain text, SRT, and VTT formats. I will also discuss the quantized versions of the large-v* LLMs and possibly using the distilled and/or the tinydiarize models in the long-term future.)
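Until that fuller write-up, here is my reading of those flags, based on whisper.cpp's --help output at the time of writing (verify against your own build):

```shell
# Flag summary for the transcription command above (check ./main --help):
#   -m <path>   model file to load (here, a quantized large-v2 model)
#   -pp         print progress while transcribing
#   -pc         print colors that encode per-token confidence
#   -otxt       also save the transcript as plain text
#   -ovtt       also save WebVTT subtitles
#   -osrt       also save SRT subtitles
#   -f <file>   input 16 kHz WAV file
```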

asciinema 1-liner

Instead of trying to automate asciinema, use the built-in -c flag to record a single command, along with the -i flag to limit recorded idle time to 2 seconds:

$ asciinema rec -i 2 filename.cast -c "./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f /path/to/output.wav"

After saving the output files in a separate location, repeat with large-v1-q5_0, large-v3-q5_0, or any other Whisper CPP model.
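Repeating that recording by hand gets tedious, so the per-model commands can be generated in a loop. This is a hypothetical helper of my own (record_cmd is not part of whisper.cpp or asciinema); it only prints the commands, so they can be reviewed or piped to sh once everything is installed:

```shell
# Hypothetical helper: print one asciinema recording command per model.
# Nothing is executed here; pipe the output to sh to actually record.
record_cmd() {
    printf "asciinema rec -i 2 %s.cast -c './main -m models/ggml-%s.bin -pp -pc -otxt -ovtt -osrt -f %s'\n" "$1" "$1" "$2"
}

for MODEL in large-v1-q5_0 large-v2-q5_0 large-v3-q5_0; do
    record_cmd "$MODEL" /path/to/output.wav
done
```

Printing first, instead of running directly, also gives a chance to catch a wrong model name before spending a long transcription run on it.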

Future possibilities

Using asciinema to save Whisper CPP's color output has shown me that pipelining opens up many possibilities when using Whisper CPP. One example is making a video of Whisper CPP's output via VHS from Charmbracelet.
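As a sketch of that idea (assuming VHS is installed; the tape commands follow VHS's documented syntax, and the file names are examples), a small "tape" script could replay the transcription command and render the terminal output as a GIF:

```shell
# Write a hypothetical VHS "tape" script; rendering it with `vhs` replays
# the typed command in a fresh terminal and saves the result as a GIF.
cat > whisper-demo.tape <<'EOF'
Output whisper-demo.gif
Type "./main -m models/ggml-large-v2-q5_0.bin -pp -pc -f output.wav"
Enter
Sleep 30s
EOF
# vhs whisper-demo.tape   # commented out: requires VHS and a finished build
```

The Sleep duration is a placeholder; a real transcription run would need it tuned to the length of the audio.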

End products

There are three broad categories I use Whisper CPP for.

A) Reminding myself what happened during meetings

In the words of others, not mine: I have basically outdone my predecessors when it comes to taking notes from board meetings, thanks to Whisper CPP.

With the consent of everyone in board meetings, I use a Zoom H1n recorder and a lavalier mic to record the meeting. Afterwards, I run Whisper CPP on the recording. This lets me write very detailed meeting minutes.

Fortunately, I rarely have to edit the output from Whisper CPP. Because the recording setup is lossless, the audio stays clean even after I downscale it for Whisper CPP, and diarization almost always works.

B) Writing a written transcript

This is typically for a podcast, where there is no video - so, I don't have to synchronize the subtitles.

I don't always have access to the original audio file, so I tend to "upscale" the audio quality on a lossy audio file.

This is where I have noticed that, beyond the audio quality aspect, Whisper performs worse for people of color - and I'm not talking about new English language learners. I mean I'm basically watching the audio version of Coded Bias play out right in front of me when I work with Whisper CPP. It empirically validates everything that Joy Buolamwini and her colleagues have to say about discrimination in facial recognition AI, except this is speech recognition.

Inspirations for podcast transcriptions:

C) Making synchronized subtitles for videos

This is a bit tough, as I have the least experience with this.

Currently, I know that FFsubsync is a Python program that's supposed to help with this, but I somehow could not get it to work.

So, I used the online version of SubSync.

(I will really need to look into this in the future.)

Info about softcoded subtitles

Working sources

I used these to help me figure out the actual programming:

  • Stack Overflow Q&A on a detached HEAD in Git
  • Documentation of Whisper CPP and asciinema

Thanks

  • Russian Bear, my long-time collaborator, for assistance with Git
    • A convoluted "pun", based on purposeful cyber misattribution to Russian APTs (such as Cozy Bear), and the general wariness of Russians by the policymakers of the U.S.A. stemming from the Illegals Program and the TV show The Americans
  • The threat model of Tillitis for inspiration
    • Tillitis is a hardware company that spun off from Mullvad in 2022
  • There are more entities to thank, but we're not quite ready yet

License

This is licensed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) License.