Modify intro texts for the Speech Editing and proprietary-system comparison sections

ARDiT-TTS · ARDiT-TTS · commit d873a2929dd5 · 2024-05-31T11:59:31.000+08:00
diff --git a/index.html b/index.html
@@ -1259,10 +1259,11 @@ <h3>Prompted Generation</h3>
           <p class="lead">* please scroll horizontally to explore additional columns in the table.</p>
         </div>
         <div class="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded">
-          <h3>Speech Inpainting</h3>
+          <h3>Speech Editing</h3>
           <p class="lead">
-                In this task, we evaluate on test set C. We mask fragments of the waveforms, and ask the models to generate the full waveforms. The masked sections are highlighted within the text.
-                All speakers are unseen for all systems during training.
+                We evaluated the performance of text-based speech editing on the speech inpainting task.
+                The models generate complete waveforms given complete texts and partially masked waveforms. The masked sections are highlighted within the text.
+                All speakers were unseen by all systems during training. The following 20 test cases are from test set C (long).
                 </p>
           <div class="table-responsive" style="overflow-x: scroll">
             <table class="table table-sm">
@@ -2046,6 +2047,7 @@ <h3>Prompted Generation (Comparing with Proprietary Systems)</h3>
           <p class="lead">
                 In this section, we compare our system with proprietary systems including NaturalSpeech 2/3, MegaTTS 2, UniAudio, CLaM-TTS, VoiceBox, and VALL-E. The source codes and model weights for these models are not available.
                 The following samples are obtained from their online demo pages. All waveforms are downsampled to 16kHz.
+                Please note that ARDiT's performance is influenced by the fact that the prompt waveforms are in 16kHz, not 24kHz, and the prompt texts are not semantically coherent with the target texts.
                 </p>
           <p class="lead">1~4 are obtained from 
             <a href="https://speechresearch.github.io/naturalspeech3/">NaturalSpeech 3</a> and 5~20 are obtained from 
diff --git a/index.py b/index.py
@@ -49,11 +49,12 @@
 
         with div(cls="container pt-5 mt-5 shadow p-5 mb-5 bg-white rounded"):
             from inpaint import get_table
-            h3("Speech Inpainting")
+            h3("Speech Editing")
             p(
                 """
-                In this task, we evaluate on test set C. We mask fragments of the waveforms, and ask the models to generate the full waveforms. The masked sections are highlighted within the text.
-                All speakers are unseen for all systems during training.
+                We evaluated the performance of text-based speech editing on the speech inpainting task.
+                The models generate complete waveforms given complete texts and partially masked waveforms. The masked sections are highlighted within the text.
+                All speakers were unseen by all systems during training. The following 20 test cases are from test set C (long).
                 """,
                 cls="lead"
             )
@@ -67,6 +68,7 @@
                 """
                 In this section, we compare our system with proprietary systems including NaturalSpeech 2/3, MegaTTS 2, UniAudio, CLaM-TTS, VoiceBox, and VALL-E. The source codes and model weights for these models are not available.
                 The following samples are obtained from their online demo pages. All waveforms are downsampled to 16kHz.
+                Please note that ARDiT's performance is influenced by the fact that the prompt waveforms are in 16kHz, not 24kHz, and the prompt texts are not semantically coherent with the target texts.
                 """,
                 cls="lead"
             )