Skip silence around hallucinations #1838
Conversation
Testing on another example from #679 (comment) Output
|
When using |
It should work without padding, but if the VAD is inaccurate then padding might help compensate for that. |
Will this PR be included in the next release? If so, when is it planned? |
Related PR: #2005 (fixes a bug in |
Just tested out hallucination_silence_threshold and it worked for me |
@ryanheise Can you show me your transcribe.py with debug stuff? |
I don't have the exact code anymore, but you could try temporarily inserting these two lines:

```python
if score >= 3 or score + 0.01 >= len(words):
    print(f"DETECTED HALLUCINATION: {segment['text']}")
```

before the return in this function:

```python
def is_segment_anomaly(segment: Optional[dict]) -> bool:
    if segment is None or not segment["words"]:
        return False
    words = [w for w in segment["words"] if w["word"] not in punctuation]
    words = words[:8]
    score = sum(word_anomaly_score(w) for w in words)
    return score >= 3 or score + 0.01 >= len(words)
```
|
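For context, the `word_anomaly_score` helper called above scores each word by its probability and duration. The version below is a sketch reconstructed from the merged transcribe.py, so treat the exact thresholds as approximate:

```python
def word_anomaly_score(word: dict) -> float:
    # A word looks anomalous if its probability is very low, or if it is
    # unusually short or unusually long for natural speech.
    probability = word.get("probability", 0.0)
    duration = word["end"] - word["start"]
    score = 0.0
    if probability < 0.15:
        score += 1.0
    if duration < 0.133:
        score += (0.133 - duration) * 15
    if duration > 2.0:
        score += duration - 2.0
    return score
```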
@ryanheise Below is an example with a disappeared word. I'm using faster-whisper, but you should be able to reproduce it with whisper too, as the implementation is the same.
EDIT: |
I think I've noticed a pattern: it happens when Then chunks go exactly by 30 secs, cutting off the word. Chunking when
Chunking by setting a high threshold
|
Another thing: this PR affects transcription even if both new parameters are not enabled (I mean comparing with vs. without this PR). This happens sometimes, but when it happens the discrepancy is always in the last chunk. And sometimes when the discrepancy happens it tries to process an additional micro chunk after it, which produces some hallucination or fails because the no-speech threshold is met; I'm not sure if this is related to the PR or to the discrepancy. Example of such a discrepancy [audio is Without this PR [perfect transcription]:
With this PR [all goes exactly the same till the last chunk]:
|
This logic is part of the original Whisper strategy of advancing by the full 30 seconds to the next window whenever the current segment is unfinished. So basically, if the segment finishes before the end of the 30 second window, then Whisper will crop the window to the exact end timestamp of the last word in that segment. But if the segment does not finish by the end of the 30 second window, the window is not cropped, and the speech is assumed to run all the way to the end of the window. This logic exists whether or not the In your case, the sentence in question is:
This sentence does not fit within the 30 second window, and the word "orange" is right on the boundary. In fact, the word "orange" is slightly before the boundary and the human ear can pick it up (as can the larger models), but the smaller models fail to pick it up. And given Whisper's logic in this case, it will assume the speech went right up to the end of the 30 second window and will resume the next window from there. So although the large models would probably resolve this, I think it would still be better to change Whisper's strategy and crop the window to the end timestamp of the last word even in this case where we have an unfinished segment. |
I can't connect the dots... How |
Apologies, my explanation of that was around the wrong way. The original Whisper behaviour was that if the last segment in the window is "complete", THEN it skips to the end of the full 30 second window. If the last segment is incomplete, then it crops the window to the end timestamp of the last word. But when

```python
# skip silence before possible hallucinations
if hallucination_silence_threshold is not None:
    threshold = hallucination_silence_threshold
    if not single_timestamp_ending:
        last_word_end = get_end(current_segments)
        if last_word_end is not None and last_word_end > time_offset:
            remaining_duration = window_end_time - last_word_end
            if remaining_duration > threshold:  # <--- misfired heuristic
                seek = round(last_word_end * FRAMES_PER_SECOND)
            else:
                seek = previous_seek + segment_size
```

The goal was to skip over as much silence as safely possible. However, in hindsight, this was a bit opportunistic, since after all

```python
if not single_timestamp_ending:
    last_word_end = get_end(current_segments)
    if last_word_end is not None and last_word_end > time_offset:
        remaining_duration = window_end_time - last_word_end
        if remaining_duration > threshold:  # <--- misfired heuristic
            seek = round(last_word_end * FRAMES_PER_SECOND)
        else:
            seek = previous_seek + segment_size
```

(It's OK, the other parts of this code block are already handled elsewhere.) |
I've created a PR #2043 incorporating the above fix based on your counter example. |
Thanks for the explanation, now this part of the code makes sense.
IMHO, skipping to the full 30s window is pretty unsafe. 😆 |
Do you have an audio file to reproduce? |
This file has a discrepancy in the last window/chunk: Whisper without this PR:
Whisper with this PR:
|
I'll test tomorrow, but does this also happen with PR #2043? |
Removes the wishful heuristic causing more issues than it's fixing. Same as openai/whisper#2043 Example of the issue: openai/whisper#1838 (comment)
Yes, because
The culprit affecting only the last window is found. It happens because of this:

```python
mel_segment = mel[:, seek : seek + segment_size]
```

This is the fix [that's how it was before this PR]:

```python
mel_segment = mel[:, seek : seek + N_FRAMES]
```

Not sure why you changed it; from my observation it makes more hallucinations [probably it's random]. |
That is changed for |
I've confirmed the discrepancy, which seems to be a consequence of slightly different mel spectrograms. Although in the two examples you gave (only the latter of which I have tested with the supplied audio file), the PR actually removed a hallucination in one example and introduced a hallucination in the other. So on balance, it's hard to say whether this discrepancy is better, worse, or about the same. Given that, do you see anything incorrect in the clipping logic? I think the difference is that I am always clipping exactly to the stretch of audio being examined and then padding it. Originally, padding was added to the end as soon as the mel spectrogram was first generated, and then (in the original code) it was also possible, due to the dynamic shifting of the window starts, for the last part of the audio to end up padded twice, because there is no guarantee that the initial padding Whisper added at the start of the process was enough to reflect where this last window actually ended up starting. But it's possible I've done something wrong which I can't see, so let me know if you do spot something incorrect in the logic. |
After plotting the mel spectrograms, I noticed that the padding added when the audio is first loaded (as a whole) contains all -1.0's, while the padding added in the main loop for each 30 second window contains all 0.0's. I'm not sure why that is, but there are two different padding code paths, and they produce different padding values. So in your example, the PR ends up always using the padding path that pads with 0.0's, whereas originally the end-of-file padding had -1.0's. |
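A minimal sketch of how to observe those two padding paths with the standard whisper.audio helpers (the file name is just an example):

```python
import whisper
from whisper.audio import log_mel_spectrogram, pad_or_trim, N_FRAMES, N_SAMPLES

audio = whisper.load_audio("sample.wav")  # hypothetical input file

# Path 1: padding appended when the whole file is converted to a mel spectrogram.
# The appended zero samples come out near -1.0 after log-mel normalization.
mel = log_mel_spectrogram(audio, padding=N_SAMPLES)
print("end-of-file padding:", mel[:, -10:].min().item(), mel[:, -10:].max().item())

# Path 2: padding applied per 30-second window inside the main loop.
# pad_or_trim pads the mel tensor directly, and the padded frames are exactly 0.0.
window = pad_or_trim(mel[:, :100], N_FRAMES)
print("per-window padding:", window[:, 100:].min().item(), window[:, 100:].max().item())
```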
There's still a chance that a hallucination will be produced,
e.g.
Notably, this timestamp belongs to the end of the audio. Model size: small. There are also some results on Google if you search for this phrase. One of them:
|
That's certainly possible, and unfortunately there is no single choice of parameters that will be perfect in all scenarios. You can tweak the silence threshold, which is exposed on the command line. You can also try tweaking the other thresholds that were built into the code (like how long a word must be before it is flagged as an abnormality). If we can gather a large enough dataset of audio samples that produce hallucinations, we should be able to come up with better default settings that work well across a variety of scenarios and languages. |
This PR introduces a heuristic that determines if a segment is probably a hallucination. If that "probable" hallucination occurs after a period of silence (specified by `--hallucination_silence_threshold`, in seconds), then we seek past the silence and reprocess from that point. Eliminating the silence before a hallucination improves the likelihood of getting a correct inference, but since this also requires extra processing time, we only do this when a probable hallucination is detected.

The heuristic itself is based on the observation that words in a hallucination are often either extremely long, extremely short, or have an extremely low probability. The probability of later words may be less reliable, so we take only the first 8 words of a segment and look for a certain threshold of anomalies within that.

Below are some successive test runs on the audio sample in #1783 with `--word_timestamps True --hallucination_silence_threshold 2`. It includes debug output to show when hallucinations were detected. I can confirm that the results are better on v2 than v3.

Sample output
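For anyone driving this from Python instead of the CLI, the equivalent call looks roughly like the sketch below (the model size and file name are just placeholders):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "audio.wav",                        # hypothetical input file
    word_timestamps=True,               # the heuristic needs per-word timings
    hallucination_silence_threshold=2,  # seconds of surrounding silence to skip
)
print(result["text"])
```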
Also, this PR includes an option `--clip_timestamps` to specify a list of clips within the audio file where inference should be applied, given in the format `start,end,start,end,...` (each timestamp specified in seconds). For example, `--clip_timestamps 10.5,57.8,71,103` will only run the inference on the region between 10.5 and 57.8 and on the region between 71 and 103. Also, the final `end` will default to the end of the file, so `--clip_timestamps 30` will run inference from the 30 second mark to the end of the file. All timestamps will still be relative to the original audio file. I found this option helpful when testing the hallucination heuristic above, but obviously this option would also be very useful for someone who wants to run their own VAD model on the audio first, and then pass that into `--clip_timestamps`.

One interesting observation is that when running the v3 model on the linked sample audio, we can get very good results if we choose a precise clip region around the 53 second mark (I can't remember the exact value now), but if we adjust that forward or backward by very small amounts, we can get the v3 model to completely fail to notice anything that was actually uttered in the audio, suggesting it is a very sensitive model.
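The same clipping option is available through the Python API; a sketch (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("small")
# Run inference only on 10.5–57.8 s and 71–103 s of the file;
# timestamps in the result remain relative to the original audio.
result = model.transcribe(
    "audio.wav",
    clip_timestamps=[10.5, 57.8, 71, 103],  # or the string "10.5,57.8,71,103"
)
print(result["text"])
```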
I've put each option in a separate commit (happy to split into two PRs if preferred).