Unexpected processing times for short vs. long audio files with mLX-whisper with clip_timestamps enabled #1285

esphoenixc · 2025-02-14T01:09:59Z

Issue: Unexpected Processing Time Behavior with `clip_timestamps` Parameter

Description:

I'm observing unexpected processing times when using MLX-Whisper on audio files, particularly when the `clip_timestamps` parameter is enabled. For example, processing a 40‑second audio file takes significantly longer (~27-28 seconds) with `clip_timestamps` enabled than with it disabled (~3-4 seconds). This behavior was brought up previously in discussions (#1275), but I am seeing it consistently in my environment, so I am raising it as an issue.

40‑second audio file:
- With clip_timestamps enabled: ~27-28 seconds processing time.
- With clip_timestamps disabled: ~3-4 seconds processing time.
7‑second test file (ls_test.flac):
- Processing time remains ~1-2 seconds, regardless of the clip_timestamps setting.
5‑minute audio file (small model):
- Processing time is ~27 seconds with or without clip_timestamps.
5‑minute audio file (V3 Large Turbo/Turbo models):
- Processing time increases to ~40 seconds.

This behavior seems inconsistent:

The 7‑second and 5‑minute test files perform similarly regardless of whether clip_timestamps is enabled, but the 40‑second file shows a dramatic increase in processing time when clip_timestamps is enabled. This suggests that processing times do not scale linearly with audio length when using clip_timestamps.
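To make the comparison concrete, the figures reported above can be converted into a speed factor (seconds of audio handled per second of processing). This is only a rough sketch: the ranges are represented by their midpoints, and `speed_factor` is a helper name introduced here for illustration.

```python
# Reported figures from this issue: (label, audio seconds, processing seconds).
# Ranges such as "~27-28 s" are represented by their midpoint (an approximation).
measurements = [
    ("7 s file, either setting", 7.0, 1.5),
    ("40 s file, clip_timestamps on", 40.0, 27.5),
    ("40 s file, clip_timestamps off", 40.0, 3.5),
    ("5 min file, small model", 300.0, 27.0),
    ("5 min file, turbo models", 300.0, 40.0),
]


def speed_factor(audio_s: float, processing_s: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_s / processing_s


for label, audio_s, processing_s in measurements:
    print(f"{label}: {speed_factor(audio_s, processing_s):.1f}x real time")
```

With these midpoints, the 40‑second file with `clip_timestamps` enabled comes out at about 1.5x real time, while the same file without it and the 5‑minute file on the small model both land near 11x.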

Environment:

Hardware: M1 Pro with 32GB RAM
Models Tested:
- MLX-Whisper Small (observed ~27 seconds for 5‑minute audio)
- V3 Large Turbo/Turbo (observed ~40 seconds for 5‑minute audio)
Additional Settings:
- Using clip_timestamps
- VAD timestamps obtained via silero-vad
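For context, here is a minimal sketch of how VAD segments might be turned into the `clip_times` list. It assumes the VAD output is a list of dicts with `start` and `end` in seconds (the shape silero-vad returns when asked for seconds) and that `clip_timestamps` accepts a flat `[start, end, start, end, ...]` list of floats. The helper name `vad_to_clip_timestamps` and the example segments are hypothetical, not taken from my actual pipeline.

```python
from typing import Dict, List


def vad_to_clip_timestamps(segments: List[Dict[str, float]]) -> List[float]:
    """Flatten [{"start": s, "end": e}, ...] (seconds) into [s0, e0, s1, e1, ...]."""
    clips: List[float] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        start, end = float(seg["start"]), float(seg["end"])
        if end <= start:
            continue  # skip empty or inverted segments
        clips.extend([start, end])
    return clips


# Hypothetical VAD output, used only to illustrate the expected shape.
example_segments = [{"start": 0.5, "end": 3.2}, {"start": 4.0, "end": 9.7}]
clip_times = vad_to_clip_timestamps(example_segments)
print(clip_times)  # [0.5, 3.2, 4.0, 9.7]
```

The number of clips produced this way may matter for timing, so it is worth reporting alongside the audio length.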

Code Snippet:

```python
result = whisper.transcribe(
    chunk_path,
    path_or_hf_repo=self.model_path,
    word_timestamps=True,
    language=self.language,
    fp16=False,
    condition_on_previous_text=False,
    clip_timestamps=clip_times,
)
```

Steps to Reproduce:

40‑Second File Test:
- Process a 40‑second audio file with clip_timestamps enabled.
- Observe processing time of ~27-28 seconds.
- Process the same file with clip_timestamps disabled.
- Observe processing time of ~3-4 seconds.
7‑Second Test File ([ls_test.flac](https://github.com/ml-explore/mlx-examples/blob/main/whisper/mlx_whisper/assets/ls_test.flac)):
- Process with and without clip_timestamps.
- Observe similar processing times (~1-2 seconds) in both cases.
5‑Minute File Test:
- Process with the small model; observe ~27 seconds regardless of the clip_timestamps setting.
- Process with V3 Large Turbo/Turbo models; observe ~40 seconds.
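The steps above could be timed with a small harness like the one below. It is a sketch, not the exact code I ran: `time_transcribe` and `summarize` are hypothetical helper names, and the transcription function is passed in as a callable so the harness itself does not depend on any particular library. The first run is reported separately because it may include one-time costs such as model loading.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def time_transcribe(
    transcribe: Callable[..., Any],
    audio_path: str,
    clip_timestamps: Optional[List[float]] = None,
    repeats: int = 3,
    **kwargs: Any,
) -> List[float]:
    """Call `transcribe` `repeats` times and return wall-clock seconds per run."""
    if clip_timestamps is not None:
        # Only pass the parameter when clips are supplied, so the
        # "disabled" case uses the callee's default.
        kwargs["clip_timestamps"] = clip_timestamps
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe(audio_path, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Report the first run separately from the best of the remaining runs."""
    rest = timings[1:] or timings
    return {"first": timings[0], "best_of_rest": min(rest)}
```

For example, calling `summarize(time_transcribe(whisper.transcribe, chunk_path, clip_timestamps=clip_times, word_timestamps=True))` and then the same call without `clip_timestamps` would give the two numbers to compare for each file.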

Questions/Concerns:

Unexpected Slowdown:
- Is it expected that a 40‑second audio file takes ~27-28 seconds to process with `clip_timestamps` enabled, compared to just ~3-4 seconds when disabled?
Bottlenecks and Optimizations:
- Are there any known bottlenecks or configuration parameters in MLX-Whisper that can be adjusted to boost processing speed when clip_timestamps is enabled?
Model Comparisons:
- The V3 Large Turbo/Turbo models are slower (e.g., 40 seconds for a 5‑minute file) compared to the small model. Should I compare these models to the regular Whisper Large model instead of the small model?
- Is Whisper Turbo expected to be a faster alternative to the small model?

Any insights or suggestions for optimizing transcription speed, especially with clip_timestamps enabled, would be greatly appreciated.

Thank you!


esphoenixc changed the title ~~Unexpected Processing Times for Short vs. Long Audio Files with MLX-Whisper with clip_timestamps enabled~~ Unexpected processing times for short vs. long audio files with mLX-whisper with clip_timestamps enabled Feb 14, 2025
