This is a high level overview of the transcription method used by Borges Typing.
This is a heuristic copy of my field notes from working with Whisper CPP.
Whisper is the sleeping tiger from OpenAI. While LLM-based generative AI chatbots have captured the general public's attention during the AI boom, the Python-based Whisper has a lot of potential to help with speech-to-text tasks.
whisper.cpp is a C++ reimplementation of Whisper, which we will refer to as Whisper CPP throughout this piece. However, it is currently only available as a CLI tool. So, finding a way to let Whisper CPP be used more widely, even under the hood, would be a common good.
Just for reference, here are the research sources that have guided my decisions for future project work.
- Moxie Marlinspike
- "End to End Encryption for Everyone" talk on YouTube from Next Generation Threats 2014
- A dress rehearsal for the next talk below in 2015
- "Making private communication simple" talk from Webstock 2015
- "The ecosystem is moving" talk from 36C3 in 2019
- "End to End Encryption for Everyone" talk on YouTube from Next Generation Threats 2014
- Andrew "bunnie" Huang
- "Qubes OS for Organizational Security Auditing" talk by Harlo Holmes at HOPE 2020
- Video available on Internet Archive or YouTube
- "The Insecurity Industry" by Edward Snowden, after the Pegasus Investigation into NSO Group
Here are the main takeaways:
- From Marlinspike, I learned about usable security through Signal Messenger;
- From Huang, through Precursor and Betrusted, I learned that a sustainable open-source technology project should, from a human-effort standpoint, be sustainable by fewer than ten people rather than by a giant corporation;
- From Holmes, I learned that small organizations also need to pay attention to cybersecurity beyond 2020; and
- From Snowden, I learned that he was completely correct, because of the next point below.
His call to use Rust in new software projects was prescient, given that a White House press release from February 2024 named Rust as an example of a memory-safe language in its full report.
I know someone is going to read this and think something along the lines of: this isn't new information. However, I warn you not to misconstrue this as a "My job here is done/But you didn't do anything" meme situation.
These are a few other projects using Whisper (either the Python original or the CPP port) in an accessible way:
- Stage Whisper
- Last mentioned in a 2022 article from The Verge
- MacWhisper, a proprietary macOS client primarily for Apple Silicon
- Well, that is what happens when you use the MIT License
- Cf. Minix and Intel ME
- Vibe, built with Tauri for desktop (Windows, macOS, and Linux)
- Transcribro, an Android app distributed through Accrescent
- OTranscribe, an open-source web app
TL;DR: you have to trust these services not to keep copies of your audio forever.
There are three broad reasons why I am interested in working with Whisper CPP.
First, I genuinely have trouble discerning what's being said aloud in relatively modern multimedia (since 2015). This Vox article from January 2023 really resonated with me when its video companion was released on YouTube. Yet, though I'm not very fluent in Spanish, somehow I can hear every word enunciated in Spanish music? (Does someone want to take me to a Bad Bunny concert to test this hypothesis? Psychologists really need to study this.) Joking aside, I would like to increase accessibility in technology overall - and if casting my stone into the collective ocean of knowledge gets that effort started, then I'll be happy.
Second, I do quite a bit of work with Whisper CPP. I use it to help me write meeting minutes, write transcripts for podcasts, and subtitle the occasional video. I'd like to share how I use Whisper CPP, since the most documentation I've seen so far is in its GitHub issues, when it's malfunctioning very badly.
Lastly, I feel that there must be some way to work on AI in an ethical manner. Even though I'm not even making commits back to Whisper CPP, I have a Hilbert-like amount of confidence that working on AI can be done ethically. Just as money is ultimately a tool, AI is a tool: a giant force multiplier of whatever existing values, virtues, or traits the user already has. It is up to the user whether AI will be used in ways that are detrimental or beneficial to society. We have to forge the beneficial tools in the open, because AI tools of dubious nature had already been in development long before 2020: Clearview AI and the NSA's internal voice recognition tools detailed by The Intercept in 2018 are only two examples.
The purpose of this project is to develop a usable GUI for Whisper CPP that anyone, with no knowledge of the command line, can use. Many of the people I have used Whisper CPP with don't have the technical knowledge to rebuild the organizational supports I have built thanks to it - so, basically, the bus factor is in effect. Creating a Whisper CPP GUI would help mitigate bus factor concerns, if not effectively eliminate them once a sustaining collective effort behind the GUI reaches critical mass.
Eventually, explicit and actionable requirements will be listed, but basically the GUI would satisfy these guiding design requirements:
- be written in Rust,
- be cross-platform on desktop (Windows, macOS, and Linux),
- only make internet connections to download Whisper CPP's models,
- show in-progress transcription lines while transcription is running, and
- save the output in plain text, SRT, VTT, and asciinema cast formats.
The GUI is not a live text editor: users are free to choose any application to edit the resulting files. (However, for your own sanity, don't choose a rich text editor like Microsoft Word or LibreOffice Writer if you have to significantly rework the text content prior to visual formatting.) Regarding UX, we will follow the KISS principle.
In a terminal emulator instance, create a local clone of the Whisper CPP Git repo with:
$ git clone https://github.com/ggerganov/whisper.cpp.git
Then, navigate into the whisper.cpp directory to start using Whisper CPP.
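Before the first run, the CLI binary needs to be built and at least one model downloaded. As a minimal sketch, assuming the Makefile and the model helper script bundled with the repo at the time of writing:

$ make # builds the ./main CLI used throughout this piece
$ bash ./models/download-ggml-model.sh base.en # example model; swap in the one you need

If the helper script doesn't offer the quantized large model you want, the repo also ships a quantize example for producing one locally from the full model.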
After a new point release is published, run the following:
$ git pull
$ git log # Scroll through stdout to find the point release
$ git checkout <commit-checksum-in-hexadecimal> # Or obtain hash in GH releases
To return the repo to the branch tip in preparation for the next update, run:
$ git switch <primary-branch-name> # Usually master or main
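Since point releases are tagged on GitHub at the time of writing, here is a shortcut sketch that skips scrolling through git log (run it from the primary branch after pulling):

$ git checkout $(git describe --tags --abbrev=0) # jump to the most recent tag on the branch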
You only need to redownload each model when it has changed. Reference the Git LFS (Large File Storage) repo on Hugging Face to check for model updates.
While in the Whisper CPP directory, convert any input audio file into a 16 kHz WAV file with FFmpeg:
$ ffmpeg -i <input.audio> -ar 16000 <output.wav>
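Whisper CPP specifically expects 16-bit mono WAV input, and the upstream README's example conversion forces both explicitly, so a slightly fuller variant of the same command is:

$ ffmpeg -i <input.audio> -ar 16000 -ac 1 -c:a pcm_s16le <output.wav> # mono, 16-bit PCM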
Then, run:
$ ./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f <output.wav>
(I will further explain the flags used for confidence colors and for output in plain text, SRT, and VTT formats. I will also discuss the quantized versions of the large-v* models, and possibly using the distilled and/or tinydiarize models, in the long-term future.)
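In the meantime, here is a quick gloss of the flags in the command above, as of the build I use (run ./main --help to confirm them on your version):

# -m path to the ggml model file
# -pp print progress while decoding
# -pc colorize the printed tokens by model confidence
# -otxt, -ovtt, -osrt also write .txt, .vtt, and .srt files alongside the input
# -f the 16 kHz WAV file to transcribe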
Instead of trying to automate asciinema, use its built-in -c flag to record a single command, along with the -i flag to limit recorded idle time to 2 seconds:
$ asciinema rec -i 2 filename.cast -c "./main -m models/ggml-large-v2-q5_0.bin -pp -pc -otxt -ovtt -osrt -f /path/to/output.wav"
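To review the recording afterwards, asciinema can replay the cast directly in the terminal:

$ asciinema play filename.cast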
After saving the output files in a separate location, repeat with large-v1-q5_0, large-v3-q5_0, or any other Whisper CPP model.
Using asciinema to save Whisper CPP's color output has shown me that there are many pipelining possibilities when using Whisper CPP. One example is to make a video of Whisper CPP's output via VHS from Charmbracelet.
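I have not built that VHS pipeline yet, so as a sketch of the same idea with a simpler tool: agg, the asciinema GIF generator, renders a cast straight to an animated GIF:

$ agg filename.cast filename.gif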
There are three broad categories of work I use Whisper CPP for.
In the words of others, not mine: I have basically outdone my predecessors when it comes to taking notes from board meetings, thanks to Whisper CPP.
With the consent of everyone in the board meeting, I record the meeting with a Zoom H1n recorder and a lavalier mic. Afterwards, I run Whisper CPP on the recording, which lets me write very detailed meeting minutes.
Fortunately, I don't have to edit the output from Whisper CPP. Thanks to the lossless quality of the recording setup, Whisper CPP almost always diarizes correctly, even though I am downscaling the audio quality for it.
This is typically for a podcast, where there is no video - so, I don't have to synchronize the subtitles.
I don't always have access to the original audio file, so I tend to "upscale" the audio quality of a lossy audio file.
This is where I have noticed that, along with the audio quality aspect, Whisper performs worse for people of color - and I'm not talking about new English language learners. I mean I'm basically watching the audio version of Coded Bias play out right in front of me when I work with Whisper CPP. It empirically validates everything that Joy Buolamwini and her colleagues have said about discrimination in facial recognition AI, except here it is in speech recognition.
- Darknet Diaries
- Example:
- "How to Fix the Internet" by the EFF
- Example: "Don't Be Afraid to Poke the Tigers" with Andrew "bunnie" Huang
This is a bit tough, as I have the least experience with this.
Currently, I know that FFsubsync is a Python program that's supposed to help with this, but I somehow could not get it to work.
So, I used the online version of SubSync.
(I will really need to look into this in the future.)
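For future reference, the invocation documented in FFsubsync's README (which I still need to test properly against my own files) pairs the video with the out-of-sync subtitles:

$ ffsubsync video.mp4 -i unsynchronized.srt -o synchronized.srt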
- Wikipedia article section regarding hardcoded vs. softcoded subtitles
- YouTube support page on uploading subtitles
- Support page on supported file types
- Vimeo support page on uploading subtitles
I used these to help me figure out the actual programming:
- Stack Overflow Q&A on a detached HEAD in Git
- Documentation of Whisper CPP and asciinema
- Russian Bear, my long-time collaborator for assistance with Git
- A convoluted "pun" based on deliberate cyber misattribution to Russian APTs (such as Cozy Bear), and on the general wariness of Russians among U.S. policymakers stemming from the Illegals Program and the TV show The Americans
- The threat model of Tillitis for inspiration
- There are more entities to thank, but we're not quite ready yet
This is licensed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) License.