Unexpected processing times for short vs. long audio files with mLX-whisper with clip_timestamps enabled #1285

Open
esphoenixc opened this issue Feb 14, 2025 · 0 comments

Issue: Unexpected Processing Time Behavior with clip_timestamps Parameter

Description:

I'm observing unexpected processing times when running MLX-Whisper on audio files with the clip_timestamps parameter set. For example, a 40‑second audio file takes ~27-28 seconds to process with clip_timestamps enabled, compared to ~3-4 seconds with it disabled. This behavior was raised previously in discussions (#1275), but I see it consistently in my environment, so I am filing it as an issue.

Observed processing times:

  • 40‑second audio file:

    • With clip_timestamps enabled: ~27-28 seconds processing time.
    • With clip_timestamps disabled: ~3-4 seconds processing time.
  • 7‑second test file (ls_test.flac):

    • Processing time remains ~1-2 seconds, regardless of the clip_timestamps setting.
  • 5‑minute audio file (small model):

    • Processing time is ~27 seconds with or without clip_timestamps.
  • 5‑minute audio file (V3 Large Turbo/Turbo models):

    • Processing time increases to ~40 seconds.

This behavior seems inconsistent:

  • The 7‑second file and the 5‑minute file (small model) take about the same time with or without clip_timestamps, but the 40‑second file takes roughly 7 to 9 times longer with it enabled. The overhead from clip_timestamps therefore does not appear to scale with audio length.
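
To put the figures above on a common scale, here is a small sketch that computes a real-time factor (processing time divided by audio duration) for each reported measurement. The reported ranges are replaced by their midpoints, which is an assumption made only for illustration, and `real_time_factor` is a hypothetical helper name.

```python
# Reported measurements: (label, audio seconds, processing seconds).
# Ranges such as "27-28 s" are replaced by their midpoints (an assumption).
measurements = [
    ("40 s, clip_timestamps on", 40.0, 27.5),
    ("40 s, clip_timestamps off", 40.0, 3.5),
    ("7 s (ls_test.flac)", 7.0, 1.5),
    ("5 min, small model", 300.0, 27.0),
    ("5 min, turbo model", 300.0, 40.0),
]


def real_time_factor(audio_s: float, processing_s: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_s / audio_s


for label, audio_s, processing_s in measurements:
    rtf = real_time_factor(audio_s, processing_s)
    print(f"{label:28s} RTF = {rtf:.3f}")
```

With these midpoints, the 40‑second file with clip_timestamps enabled has a factor of about 0.69, while the same file without it and the 5‑minute file on the small model both come out at about 0.09.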

Environment:

  • Hardware: M1 Pro with 32GB RAM
  • Models Tested:
    • MLX-Whisper Small (observed ~27 seconds for 5‑minute audio)
    • V3 Large Turbo/Turbo (observed ~40 seconds for 5‑minute audio)
  • Additional Settings:
    • Using clip_timestamps
    • VAD timestamps obtained via silero-vad

Code Snippet:

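import mlx_whisper as whisper  # assumed import; not shown in the original snippet

# Called from inside a class method; chunk_path, clip_times, and self.* are defined elsewhere.
result = whisper.transcribe(
    chunk_path,
    path_or_hf_repo=self.model_path,
    word_timestamps=True,
    language=self.language,
    fp16=False,  # run in float32 instead of float16
    condition_on_previous_text=False,
    clip_timestamps=clip_times,  # speech segments from silero-vad
)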
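
The snippet does not show how clip_times is built. The following is a minimal sketch of one way to do it, under two assumptions: that the silero-vad segments are dicts with "start" and "end" in seconds (for example, from get_speech_timestamps with return_seconds=True), and that clip_timestamps takes a flat list of the form start, end, start, end, ... in seconds, as in OpenAI Whisper. The helper name `to_clip_timestamps`, the `max_gap` parameter, and the sample segments are hypothetical. Merging segments separated by short gaps reduces the number of clips passed in; whether that changes processing time has not been measured here.

```python
from typing import Dict, List


def to_clip_timestamps(
    segments: List[Dict[str, float]], max_gap: float = 0.0
) -> List[float]:
    """Flatten VAD segments into [start, end, start, end, ...] in seconds.

    A segment that starts within max_gap seconds of the previous
    segment's end is merged into that previous segment.
    """
    merged: List[List[float]] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        start, end = float(seg["start"]), float(seg["end"])
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous clip instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [t for pair in merged for t in pair]


# Hypothetical VAD output, for illustration only.
vad_segments = [
    {"start": 0.5, "end": 3.2},
    {"start": 3.6, "end": 8.0},
    {"start": 15.0, "end": 21.4},
]
clip_times = to_clip_timestamps(vad_segments, max_gap=0.5)
print(clip_times)
```

Comparing timings for the merged and unmerged lists on the 40‑second file would show whether the number of clips is related to the slowdown.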

Steps to Reproduce:

  1. 40‑Second File Test:

    • Process a 40‑second audio file with clip_timestamps enabled.
    • Observe processing time of ~27-28 seconds.
    • Process the same file with clip_timestamps disabled.
    • Observe processing time of ~3-4 seconds.
  2. 7‑Second Test File ([ls_test.flac](https://github.com/ml-explore/mlx-examples/blob/main/whisper/mlx_whisper/assets/ls_test.flac)):

    • Process with and without clip_timestamps.
    • Observe similar processing times (~1-2 seconds) in both cases.
  3. 5‑Minute File Test:

    • Process with the small model; observe ~27 seconds regardless of the clip_timestamps setting.
    • Process with V3 Large Turbo/Turbo models; observe ~40 seconds.
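
The comparisons in the steps above can be scripted so that both settings are timed the same way. This is a sketch only: `time_runs`, `compare`, and `fake_transcribe` are hypothetical names, and `fake_transcribe` is a stand-in that sleeps instead of transcribing so that the block runs without a model. In a real measurement, the function passed in would wrap the transcribe call from the code snippet.

```python
import time
from statistics import median
from typing import Any, Callable, Dict, List


def time_runs(fn: Callable[..., Any], repeats: int = 3, **kwargs: Any) -> List[float]:
    """Call fn(**kwargs) `repeats` times; return wall-clock seconds per run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(**kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def compare(
    fn: Callable[..., Any],
    base_kwargs: Dict[str, Any],
    clip_times: List[float],
    repeats: int = 3,
) -> Dict[str, float]:
    """Median run time without and with clip_timestamps."""
    without = time_runs(fn, repeats, **base_kwargs)
    with_clips = time_runs(fn, repeats, clip_timestamps=clip_times, **base_kwargs)
    return {"without": median(without), "with": median(with_clips)}


def fake_transcribe(audio: str, clip_timestamps=None, **_: Any) -> Dict[str, str]:
    """Stand-in for the real transcribe call (assumption: sleeps only)."""
    time.sleep(0.02 if clip_timestamps else 0.01)
    return {"text": ""}


print(compare(fake_transcribe, {"audio": "example.wav"}, [0.0, 5.0]))
```

Taking the median over several runs reduces the influence of a slow first call, which may include one-time costs such as model loading.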

Questions/Concerns:

  1. Unexpected Slowdown:

    • Is it expected that a 40‑second audio file takes ~27-28 seconds to process with clip_timestamps enabled, compared to just 3-4 seconds when disabled?
  2. Bottlenecks and Optimizations:

    • Are there any known bottlenecks or configuration parameters in MLX-Whisper that can be adjusted to boost processing speed when clip_timestamps is enabled?
  3. Model Comparisons:

    • The V3 Large Turbo/Turbo models are slower than the small model (e.g., ~40 seconds vs. ~27 seconds for a 5‑minute file). Should I compare these models to the regular Whisper Large model instead of the small model?
    • Is Whisper Turbo expected to be a faster alternative to the small model?

Any insights or suggestions for optimizing transcription speed, especially with clip_timestamps enabled, would be greatly appreciated.

Thank you!

@esphoenixc esphoenixc changed the title Unexpected Processing Times for Short vs. Long Audio Files with MLX-Whisper with clip_timestamps enabled Unexpected processing times for short vs. long audio files with mLX-whisper with clip_timestamps enabled Feb 14, 2025