Commit comparisons with naturalspeech
This is the first TTS engine I've seen come along that has comparable performance
to Tortoise, though what has been released is pretty sparse on actual results. Still,
it's an interesting comparison.
neonbjb committed May 22, 2022
1 parent f4bd9c4 commit 12a767c
Showing 7 changed files with 19 additions and 3 deletions.
6 binary files not shown.
22 changes: 19 additions & 3 deletions tortoise_v2_examples.html
@@ -32,10 +32,10 @@ <h2>Short-form</h2>
<h2>Short-form</h2>
<audio controls="" style="width: 600px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/favorite_riding_hood.mp3" type="audio/mp3"></audio><br>

-<h1>Compared to Tacotron2 (with the LJSpeech voice): 🐢 </h1>
+<h1>Comparisons (with the LJSpeech voice): 🐢 </h1>
<p>LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model, following is how
-it renders the LJSpeech voice with no fine-tuning, compared with results for the same text from the popular Tacotron2
-model paired with the Waveglow transformer:</p>
+it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2
+model paired with the Waveglow vocoder.</p>
<table><th>Tacotron2+Waveglow</th><th>TorToiSe</th><th>TorToiSe Finetuned</th><tr>
<td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/tacotron_comparison/2-tacotron2.mp3" type="audio/mp3"></audio><br>
</td>
@@ -50,6 +50,22 @@ <h1>Compared to Tacotron2 (with the LJSpeech voice): 🐢 </h1>

<td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/finetuned/lj/4.mp3" type="audio/mp3"></audio><br></td>
</tr></table>
+<p>NaturalSpeech is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody
+and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other
+than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there
+right now.</p>
+<table><th>NaturalSpeech</th><th>TorToiSe Finetuned</th>
+<tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/lax/naturalspeech.mp3" type="audio/mp3"></audio><br></td>
+<td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/lax/tortoise.mp3" type="audio/mp3"></audio><br></td>
+</tr><tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/maltby/naturalspeech.mp3" type="audio/mp3"></audio><br></td>
+<td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/maltby/tortoise.mp3" type="audio/mp3"></audio><br></td>
+</tr><tr><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/fibers/naturalspeech.mp3" type="audio/mp3"></audio><br>
+</td><td><audio controls="" style="width: 300px;"><source src="https://github.com/neonbjb/tortoise-tts/raw/main/examples/naturalspeech_comparison/fibers/tortoise.mp3" type="audio/mp3"></audio><br></td>
+</tr></table>
+<p>It is important to note that it is not actually fair to compare any of these models: Tortoise is a multi-voice probabilistic
+model trained on millions of hours of speech with an exceptionally slow inference time. Tacotron and NaturalSpeech are efficient,
+fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much in the way of actually comparable
+research to Tortoise.</p>

<h1>All Results 🐢</h1>
<p> Following are all the results from which the hand-picked results were drawn from. Also included is the reference

2 comments on commit 12a767c

Contributor

@wavymulder wavymulder commented on 12a767c May 22, 2022

For these comparisons, did you feed NaturalSpeech's output in as a Tortoise voice? If so, I think your test may be a bit off. I've been experimenting with using Tortoise outputs as new voices and have noticed some "feedback" occurring even in the first generation. I think that may contribute to Tortoise sounding more robotic in the comparisons.

I plan on making a discussion post about my findings and sharing some voices I made soon, just want to get it all together cohesively.

NaturalSpeech sounds really good for traditional TTS, though.

@neonbjb
Owner Author

Nope, I used the "lj" voice which is already in this repo. I did, however, use the fine-tuned LJSpeech models which are not available publicly. I do plan to release them soon.