Does the prosody codes[0] work? #10

Open
dalazymodder opened this issue Jun 30, 2024 · 3 comments

Comments

@dalazymodder

I tried testing the code specifically for prosody, but it seems like the prosody is tied to codes[1] along with the content?

@Plachtaa
Owner

What kind of test did you perform on the prosody code?

@dalazymodder
Author

dalazymodder commented Jun 30, 2024

I took two different audio clips and changed this line (line 103)

z2 = model.encoder(codes[0], codes[1], timbre2, use_p_code=False, n_c=1)

to

z2 = model.encoder(codes2[0], codes[1], timbre, use_p_code=False, n_c=1)

This should make it output a file where only the prosody is swapped between the two sound files, right?

But if you look at the resulting files, the unedited reconstruction and the new one that should have different prosody appear to be identical.

image
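
One way to check whether two output files really are identical, rather than judging by eye, is to compare the samples directly. The following is a minimal sketch and not part of the repository; `compare_waveforms` and its `atol` parameter are hypothetical names, and it assumes both outputs are already loaded as 1-D arrays (for example with `librosa.load`).

```python
import numpy as np

def compare_waveforms(a, b, atol=1e-6):
    """Summarise how far apart two waveforms are (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    n = min(len(a), len(b))          # compare only the overlapping part
    diff = a[:n] - b[:n]
    same_length = len(a) == len(b)
    return {
        "same_length": same_length,
        "max_abs_diff": float(np.max(np.abs(diff))) if n else 0.0,
        "rms_diff": float(np.sqrt(np.mean(diff ** 2))) if n else 0.0,
        # "identical" means equal length and every sample within atol
        "identical": same_length
        and bool(np.allclose(a[:n], b[:n], atol=atol, rtol=0.0)),
    }
```

If `max_abs_diff` comes out as exactly zero, the swapped `codes2[0]` had no effect on the output at all; a small non-zero value would suggest it has some effect that is just hard to see in a plot.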

I also tried changing the code to this, which changed content and prosody.

z2 = model.encoder(codes[0], codes2[1], timbre, use_p_code=False, n_c=1)

image

Sorry if I got anything wrong, I'm a novice at this, but isn't the prosody kind of like the emotion and timing of the speech?
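
Prosody is usually taken to cover pitch, loudness, and timing, so one rough way to see whether it changed is to compare the pitch (F0) contours of the two outputs. The sketch below is a plain NumPy autocorrelation estimate and not anything from this repository; `estimate_f0`, `f0_contour`, and the frame size and pitch-range defaults are assumptions for illustration, and a dedicated pitch tracker would be more robust on real speech.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=500.0):
    """Rough F0 (Hz) for one frame from its autocorrelation peak; 0.0 if silent."""
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0
    # Full autocorrelation, keeping non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    # Strongest self-similarity inside the allowed pitch range.
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / lag

def f0_contour(audio, sr=24000, frame_len=2048, hop=512):
    """Frame-by-frame F0 estimates for a 1-D waveform."""
    audio = np.asarray(audio, dtype=np.float64)
    starts = range(0, len(audio) - frame_len + 1, hop)
    return np.array([estimate_f0(audio[s:s + frame_len], sr) for s in starts])
```

Plotting `f0_contour` for the unedited reconstruction and for the swapped one on the same axes should show whether the pitch pattern moved at all. Note that this estimator does not detect unvoiced frames, so values in unvoiced or noisy regions are not meaningful.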

I did add some lines to pad both audio files to same length, but I don't think that should affect the prosody.

def main(args):
    source = args.source
    target = args.target
    source_audio = librosa.load(source, sr=24000)[0]
    ref_audio = librosa.load(target, sr=24000)[0]

    # Find the length of the longest audio and add a small buffer (e.g., 1 second)
    max_length = max(len(source_audio), len(ref_audio))
    target_length = max_length + 24000  # Add 1 second (24000 samples at 24kHz)

    # Pad both audios to the target length
    source_audio = np.pad(source_audio, (0, target_length - len(source_audio)), mode='constant')
    ref_audio = np.pad(ref_audio, (0, target_length - len(ref_audio)), mode='constant')

    # Convert to torch tensors
    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
    ref_audio = torch.tensor(ref_audio).unsqueeze(0).float().to(device)
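
The padding step above can be pulled out into a small helper so it is easy to test on its own. This is only a sketch of the same zero-padding idea; `pad_to_common_length` and its `extra` parameter are hypothetical names, not part of the repository.

```python
import numpy as np

def pad_to_common_length(a, b, extra=0):
    """Zero-pad two 1-D arrays at the end so both have the same length.

    `extra` adds that many additional zero samples beyond the longer input
    (e.g. 24000 for one second at 24 kHz).
    """
    target = max(len(a), len(b)) + extra
    a_pad = np.pad(a, (0, target - len(a)), mode="constant")
    b_pad = np.pad(b, (0, target - len(b)), mode="constant")
    return a_pad, b_pad
```

For example, `pad_to_common_length(source_audio, ref_audio, extra=24000)` reproduces the one-second buffer used above. The trailing zeros are part of what the encoder receives, so to rule padding out as a factor it may be worth repeating the test with both clips trimmed to the shorter length instead.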

@Plachtaa
Owner

Plachtaa commented Jul 4, 2024

Thanks for your experiment. It is very helpful for us in understanding what exactly the prosody component stands for.
I will try to replicate your experiment myself and will let you know if I can offer any explanation.
