Unexpected Processing Times for Short vs. Long Audio Files with MLX-Whisper #1275
That is not expected. Maybe you could provide the audio file? Or alternatively try running with this file and report the time? On my machine (M1 Max) the time of the following command is barely 2 seconds:
Yes, it is linear with audio length. There is no batching in MLX Whisper. It is possible to do batching to speed things up substantially, but it requires breaking a sequential dependency, so the results will be slightly different.
I think we should figure out why your 40 second clip is so slow / what the bottleneck is and go from there.
V3 Large Turbo is still a bigger model than Whisper small, so it makes sense that it is slower. A more apt comparison is to regular Whisper large: against that, Turbo should give comparable quality and be much faster.
Description:
I’ve observed some unexpected behavior in MLX-Whisper’s processing times. For example, with the small model, a 40‑second audio file takes more than 22 seconds to process, while a 5‑minute audio file takes around 28 seconds. Although the 5‑minute file is significantly longer, it doesn't take much longer to process than the 40‑second file; on the other hand, 28 seconds for a 5‑minute file doesn't seem particularly fast either.
In addition, when switching to the V3 Large Turbo or Turbo models, the processing time for a 5‑minute file jumps to approximately 40 seconds. I’m wondering whether this behavior is expected or whether there are underlying factors I’m missing.
Environment:
Hardware:
M1 Pro with 32GB RAM
Model Variants:
Small (observed ~28 seconds for 5‑minute audio)
V3 Large Turbo / Turbo (observed ~40 seconds for 5‑minute audio)
Additional Settings:
Using clip_timestamps
VAD timestamps obtained via silero-vad
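For context, here is a minimal sketch of how I pass the VAD output into transcription. It assumes the silero-vad pip package and that mlx_whisper.transcribe accepts clip_timestamps in the same "start,end,start,end,..." form as the reference OpenAI implementation; the audio path and model repo name below are placeholders.

```python
import mlx_whisper  # pip install mlx-whisper
from silero_vad import get_speech_timestamps, load_silero_vad, read_audio  # pip install silero-vad

AUDIO = "sample_5min.wav"  # placeholder path

# Run silero-vad to get speech segments in seconds.
vad_model = load_silero_vad()
wav = read_audio(AUDIO)
segments = get_speech_timestamps(wav, vad_model, return_seconds=True)

# Flatten the segments into the "start,end,start,end,..." string form.
clip_timestamps = ",".join(f"{seg['start']},{seg['end']}" for seg in segments)

result = mlx_whisper.transcribe(
    AUDIO,
    path_or_hf_repo="mlx-community/whisper-small-mlx",  # placeholder repo
    clip_timestamps=clip_timestamps,
)
print(result["text"])
```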
Questions:
Is it normal for a 40‑second audio file to take over 22 seconds, and similarly for a 5‑minute file to take around 28 seconds (small model)?
Should I expect the processing time to scale linearly with audio length, or is there some form of batching or parallel processing that explains the similar durations?
Are there any parameters or configurations in MLX-Whisper that I can adjust to boost processing speed?
Why do the V3 Large Turbo/Turbo models take significantly longer (around 40 seconds for a 5‑minute file) compared to the small model? Is this an expected trade-off between model size and processing speed?
Observations:
Process a 40‑second audio file using MLX-Whisper with the small model and observe the processing time (~22+ seconds).
Process a 5‑minute audio file using the same settings and note the processing time (~28 seconds).
Switch to V3 Large Turbo or Turbo and process a 5‑minute audio file, noting the increased processing time (~40 seconds).
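This is not an exact reproduction script from the thread, just a minimal sketch of how I timed the runs; the file names are placeholders and the mlx-community repo names are assumptions that may need adjusting.

```python
import time

import mlx_whisper  # pip install mlx-whisper

# Placeholder audio files and model repos; adjust to your own setup.
FILES = ["sample_40s.wav", "sample_5min.wav"]
MODELS = [
    "mlx-community/whisper-small-mlx",
    "mlx-community/whisper-large-v3-turbo",
]

for repo in MODELS:
    for audio in FILES:
        start = time.perf_counter()
        mlx_whisper.transcribe(audio, path_or_hf_repo=repo)
        print(f"{repo} on {audio}: {time.perf_counter() - start:.1f}s")
```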
I’m curious whether the observed processing times are typical for the underlying implementation of MLX-Whisper and if there are any recommended optimizations for faster transcription.
Any insights or suggestions for configuration adjustments would be greatly appreciated.