✨ SuperVoice VoiceBox

Feel free to join my Discord Server to discuss this model!

An independent VoiceBox implementation for voice synthesis. Currently in BETA.

Features

  • ⚡️ Natural sounding
  • 🎤 High quality - 24 kHz audio
  • 🤹‍♂️ Versatile - synthesized voices have high variability
  • 📕 Currently only English is supported, but nothing stops us from adding more languages.

Samples

sample_1.mp4
sample_2.mp4
sample_3.mp4
sample_4.mp4

How to use

Supervoice consists of three networks: a GPT model for phoneme and prosody generation, an audio model for mel spectrogram synthesis, and a vocoder for waveform generation. Supervoice is published via Torch Hub, so you can use it as follows:

import torch
from IPython.display import Audio, display

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Vocoder
vocoder = torch.hub.load(repo_or_dir='ex3ndr/supervoice-vocoder', model='bigvsan')
vocoder.to(device)
vocoder.eval()

# GPT Model
gpt = torch.hub.load(repo_or_dir='ex3ndr/supervoice-gpt', model='phonemizer')
gpt.to(device)
gpt.eval()

# Main Model
model = torch.hub.load(repo_or_dir='ex3ndr/supervoice-voicebox', model='phonemizer', gpt=gpt, vocoder=vocoder)
model.to(device)
model.eval()

# Generate audio
# Supervoice has three example voices: "voice_1", "voice_2" (my favorite), "voice_3"
# You can also omit the voice parameter to use a random one, or provide your own, but you need a TextGrid alignment for that.
# Steps controls the quality of the audio; recommended values are 4, 8 or 32.
# Alpha controls randomness and should be less than 1.0: 0.1 gives stable synthesis with small variations, 0.3 is a good value for more expressive synthesis, and 0.5 is the maximum recommended value.
output = model.synthesize("What time is it, Steve?", voice = "voice_1", steps = 8, alpha = 0.1)

# Output mel spectrogram
melspec = output['melspec']

# Output 1D tensor of 24 kHz audio (missing if vocoder is not provided)
waveform = output['wav']

# Play audio in notebook
display(Audio(data=waveform, rate=24000))
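
If you want to keep the result instead of just playing it in a notebook, the waveform can be written to disk with torchaudio. This is a minimal sketch, not part of the Supervoice API: it assumes torchaudio is installed and that output['wav'] is the 1D 24 kHz tensor produced above.

import torchaudio

# torchaudio.save expects a 2D (channels, samples) tensor on the CPU,
# so add a channel dimension and move the waveform off the GPU first
torchaudio.save("output.wav", waveform.cpu().unsqueeze(0), sample_rate=24000)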

License

MIT
