Feature/data processing #110

Leminen · 2025-03-13T16:16:49Z

Added support for audio normalization and changed the default sampling probabilities for multi-dataset training.
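To illustrate what size-dependent sampling could look like, here is a minimal sketch that derives one probability per dataset from its relative size. The function name `size_proportional_probabilities` and the dataset names are hypothetical and not taken from the PR; the actual implementation in `src/coral/data.py` may differ.

```python
from __future__ import annotations


def size_proportional_probabilities(dataset_sizes: dict[str, int]) -> dict[str, float]:
    """Return sampling probabilities proportional to each dataset's size.

    Hypothetical helper for illustration only; not the code from this PR.
    """
    total = sum(dataset_sizes.values())
    if total <= 0:
        raise ValueError("At least one dataset must be non-empty.")
    return {name: size / total for name, size in dataset_sizes.items()}


# Made-up sizes: the larger dataset is sampled three times as often.
sizes = {"dataset_a": 3000, "dataset_b": 1000}
probs = size_proportional_probabilities(sizes)
print(probs)
```

If the training code interleaves the datasets with `datasets.interleave_datasets`, the values could presumably be passed as its `probabilities` argument, in the same order as the datasets.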


saattrupdan

Looks good! The new default for the dataset probabilities is a nice improvement as well. I only have one minor change.

Also, do we know for sure that the do_normalize argument works for both Wav2vec and Whisper models?
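For reference, a minimal sketch of the kind of normalization in question, assuming `do_normalize` means per-utterance zero-mean, unit-variance scaling. That assumption is not confirmed in this thread for either model family, and the function below is a hypothetical stand-in, not a library call.

```python
from __future__ import annotations

import math


def zero_mean_unit_variance(samples: list[float], eps: float = 1e-7) -> list[float]:
    """Scale a waveform to zero mean and (approximately) unit variance.

    Hypothetical stand-in for what a feature extractor might do when
    normalization is enabled; eps avoids division by zero on silent audio.
    """
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in samples]


# Made-up waveform values for illustration.
normalized = zero_mean_unit_variance([0.1, 0.3, -0.2, 0.0])
```

One way to answer the question would be to compare the output of each model's feature extractor with and without the flag on the same clip and check that the values actually change.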

saattrupdan · 2025-03-14T07:43:04Z

src/coral/data.py

    audio_column: str | None,
    convert_numerals: bool,
+    normalize_audio: bool = False,


No need to include a default here, as you're including the argument explicitly when calling the function. Keeping it as a default could lead to silent errors:

Suggested change

-    normalize_audio: bool = False,
+    normalize_audio: bool,
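A small sketch of the failure mode this suggestion guards against. Both functions are hypothetical simplifications with only two of the parameters from the diff; they are not the real function in `src/coral/data.py`.

```python
from __future__ import annotations


def process_with_default(audio_column: str | None, normalize_audio: bool = False) -> bool:
    # Hypothetical: returns the flag so the effective value is visible.
    return normalize_audio


def process_required(audio_column: str | None, normalize_audio: bool) -> bool:
    # Same hypothetical function, but the caller must pass the flag.
    return normalize_audio


# A call site that forgets the argument runs fine, with normalization silently off.
silently_disabled = process_with_default("audio")

# Without the default, the same mistake fails immediately with a TypeError.
try:
    process_required("audio")
    missing_argument_detected = False
except TypeError:
    missing_argument_detected = True
```

With the default removed, a forgotten argument surfaces at call time instead of producing a model trained without the intended preprocessing.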

Leminen and others added 6 commits February 27, 2025 15:45

added optional audio normalization

e3714dd

moved process_data to enable dataset specific preprocessing.

6f06cde

Changed interleave sampling to depend on relative dataset sizes

29e7888

Merge branch 'feature/data_processing' of https://github.com/alexandr…

635fd73

…ainst/coral into feature/data_processing

reverting to model dependent audio normalization

9869ec7

Merge branch 'main' of https://github.com/alexandrainst/coral into fe…

73840a6

…ature/data_processing

Leminen requested review from sorenmulli and saattrupdan March 13, 2025 16:16

saattrupdan approved these changes Mar 14, 2025
