Unexpected Processing Times for Short vs. Long Audio Files with MLX-Whisper #1275
That is not expected. Maybe you could provide the audio file? Or alternatively try running with this file and report the time? On my machine (M1 Max) the time of the following command is barely 2 seconds:
Yes, it is linear with audio length. There is no batching in MLX Whisper. It is possible to do batching to speed things up substantially, but it requires breaking a sequential dependency, so the results will be slightly different.
I think we should figure out why your 40 second clip is so slow / what the bottleneck is and go from there.
V3 Large Turbo is still a bigger model than Whisper small, so it makes sense that it is slower. A more apt comparison is to regular Whisper large: against that, Turbo should give comparable quality and be much faster.
Description:
I’ve observed some unexpected behavior in MLX-Whisper’s processing times. For example, with the small model, a 40‑second audio file takes more than 22 seconds to process, while a 5‑minute audio file takes around 28 seconds. Although the 5‑minute file is significantly longer, it doesn't take much longer to process than the 40‑second file; on the other hand, 28 seconds for a 5‑minute file doesn't seem particularly fast either.
In addition, when switching to the V3 Large Turbo or Turbo models, the processing time for a 5‑minute file jumps to approximately 40 seconds. I’m wondering whether this behavior is expected or whether there are underlying factors I’m missing.
Environment:
Hardware:
M1 Pro with 32GB RAM
Model Variants:
Small (observed ~28 seconds for 5‑minute audio)
V3 Large Turbo / Turbo (observed ~40 seconds for 5‑minute audio)
Additional Settings:
Using clip_timestamps
VAD timestamps obtained via silero-vad
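For context, here is a minimal sketch of how I pass the VAD output into transcription. It assumes the silero-vad pip package and that mlx_whisper.transcribe accepts clip_timestamps in the same "start,end,start,end,..." form as the reference OpenAI implementation; the audio path and model repo name below are placeholders.

```python
import mlx_whisper  # pip install mlx-whisper
from silero_vad import get_speech_timestamps, load_silero_vad, read_audio  # pip install silero-vad

AUDIO = "sample_5min.wav"  # placeholder path

# Run silero-vad to get speech segments in seconds.
vad_model = load_silero_vad()
wav = read_audio(AUDIO)
segments = get_speech_timestamps(wav, vad_model, return_seconds=True)

# Flatten the segments into the "start,end,start,end,..." string form.
clip_timestamps = ",".join(f"{seg['start']},{seg['end']}" for seg in segments)

result = mlx_whisper.transcribe(
    AUDIO,
    path_or_hf_repo="mlx-community/whisper-small-mlx",  # placeholder repo
    clip_timestamps=clip_timestamps,
)
print(result["text"])
```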
Questions:
Is it normal for a 40‑second audio file to take over 22 seconds, and similarly for a 5‑minute file to take around 28 seconds (small model)?
Should I expect the processing time to scale linearly with audio length, or is there some form of batching or parallel processing that explains the similar durations?
Are there any parameters or configurations in MLX-Whisper that I can adjust to boost processing speed?
Why do the V3 Large Turbo/Turbo models take significantly longer (around 40 seconds for a 5‑minute file) compared to the small model? Is this an expected trade-off between model size and processing speed?
Observations:
Process a 40‑second audio file using MLX-Whisper with the small model and observe the processing time (~22+ seconds).
Process a 5‑minute audio file using the same settings and note the processing time (~28 seconds).
Switch to V3 Large Turbo or Turbo and process a 5‑minute audio file, noting the increased processing time (~40 seconds).
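This is not an exact reproduction script from the thread, just a minimal sketch of how I timed the runs; the file names are placeholders and the mlx-community repo names are assumptions that may need adjusting.

```python
import time

import mlx_whisper  # pip install mlx-whisper

# Placeholder audio files and model repos; adjust to your own setup.
FILES = ["sample_40s.wav", "sample_5min.wav"]
MODELS = [
    "mlx-community/whisper-small-mlx",
    "mlx-community/whisper-large-v3-turbo",
]

for repo in MODELS:
    for audio in FILES:
        start = time.perf_counter()
        mlx_whisper.transcribe(audio, path_or_hf_repo=repo)
        print(f"{repo} on {audio}: {time.perf_counter() - start:.1f}s")
```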
I’m curious whether the observed processing times are typical for the underlying implementation of MLX-Whisper and if there are any recommended optimizations for faster transcription.
Any insights or suggestions for configuration adjustments would be greatly appreciated.