Merge pull request svc-develop-team#292 from svc-develop-team/4.1-Latest
4.1 latest
ylzz1997 authored Jul 12, 2023
2 parents 06da30b + c0213d5 commit 1919391
Showing 18 changed files with 681 additions and 13 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -145,6 +145,8 @@ While the pretrained model typically does not pose copyright concerns, it is ess

#### **Optional (Select as Required)**

##### NSF-HIFIGAN

If you are using the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you will need to download the pre-trained NSF-HIFIGAN model.

- Pre-trained NSF-HIFIGAN Vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
@@ -158,6 +160,13 @@ unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```

##### RMVPE

If you are using the `rmvpe` F0 Predictor, you will need to download the pre-trained RMVPE model.

- Download the model: [rmvpe.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt)
- Place it under the `pretrain` directory
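
The two steps above could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the helper names `to_direct_url` and `download_rmvpe` are hypothetical, and it assumes Hugging Face serves the raw file when `/blob/` in the page URL is replaced by `/resolve/`.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Page URL exactly as given in the README.
BLOB_URL = "https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt"


def to_direct_url(url: str) -> str:
    """Turn a Hugging Face page URL into a direct-download URL.

    Assumption: swapping "/blob/" for "/resolve/" yields the raw file.
    """
    return url.replace("/blob/", "/resolve/", 1)


def download_rmvpe(dest_dir: str = "pretrain") -> Path:
    """Download rmvpe.pt into dest_dir unless it is already there."""
    target = Path(dest_dir) / "rmvpe.pt"
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(to_direct_url(BLOB_URL), str(target))
    return target
```

Calling `download_rmvpe()` from the repository root would leave the file at `pretrain/rmvpe.pt`, which is the path the new predictor loads.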

## 📊 Dataset Preparation

Simply place the dataset in the `dataset_raw` directory with the following file structure:
@@ -278,13 +287,14 @@ nsf-snake-hifigan
python preprocess_hubert_f0.py --f0_predictor dio
```

f0_predictor has four options
f0_predictor has the following options

```
crepe
dio
pm
harvest
rmvpe
```

If the training set is too noisy, it is recommended to use `crepe` to handle f0
@@ -336,7 +346,7 @@ Required parameters:

Optional parameters: see the next section
- `-lg` | `--linear_gradient`: The cross fade length of two audio slices in seconds. If there is a discontinuous voice after forced slicing, you can adjust this value. Otherwise, it is recommended to use the default value of 0.
- `-f0p` | `--f0_predictor`: Select an F0 predictor, options are `crepe`, `pm`, `dio`, `harvest`, default value is `pm` (note: f0 mean pooling will be enabled when using `crepe`)
- `-f0p` | `--f0_predictor`: Select an F0 predictor, options are `crepe`, `pm`, `dio`, `harvest`, `rmvpe`, default value is `pm` (note: f0 mean pooling will be enabled when using `crepe`)
- `-a` | `--auto_predict_f0`: automatic pitch prediction, do not enable this when converting singing voices as it can cause serious pitch issues.
- `-cm` | `--cluster_model_path`: Cluster model or feature retrieval index path. If left blank, it is automatically set to the default path of these models. If no cluster or feature retrieval model has been trained, any value may be entered.
- `-cr` | `--cluster_infer_ratio`: The proportion of clustering scheme or feature retrieval ranges from 0 to 1. If there is no training clustering model or feature retrieval, the default is 0.
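
As an illustration of how the extended `-f0p` option could be declared, here is a small `argparse` sketch. The `choices=` restriction and the `build_parser` helper are assumptions added for the example; the script in this commit accepts any string and only lists the options in its help text.

```python
import argparse

# Predictor names documented above, including the newly added rmvpe.
F0_PREDICTORS = ("crepe", "pm", "dio", "harvest", "rmvpe")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the F0 predictor option (illustrative)."""
    parser = argparse.ArgumentParser(description="F0 predictor selection sketch")
    parser.add_argument(
        "-f0p",
        "--f0_predictor",
        type=str,
        default="pm",
        choices=F0_PREDICTORS,  # assumption: reject unknown names early
        help="F0 predictor: crepe, pm, dio, harvest or rmvpe (default: pm)",
    )
    return parser
```

Restricting `choices` means a typo such as `-f0p rmvp` fails at argument parsing rather than later, when the predictor is constructed.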
@@ -474,6 +484,7 @@ Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently
|[aes35-000039](https://www.aes.org/e-lib/online/browse.cfm?elib=15165) | Dio (F0 Predictor) | Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech | [mmorise/World/dio](https://github.com/mmorise/World/blob/master/src/dio.cpp) |
|[8461329](https://ieeexplore.ieee.org/document/8461329) | Crepe (F0 Predictor) | Crepe: A Convolutional Representation for Pitch Estimation | [maxrmorrison/torchcrepe](https://github.com/maxrmorrison/torchcrepe) |
|[DOI:10.1016/j.wocn.2018.07.001](https://doi.org/10.1016/j.wocn.2018.07.001) | Parselmouth (F0 Predictor) | Introducing Parselmouth: A Python interface to Praat | [YannickJadoul/Parselmouth](https://github.com/YannickJadoul/Parselmouth) |
|[2306.15412v2](https://arxiv.org/abs/2306.15412v2) | RMVPE (F0 Predictor) | RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music | [Dream-High/RMVPE](https://github.com/Dream-High/RMVPE) |
|[2010.05646](https://arxiv.org/abs/2010.05646) | HIFIGAN (Vocoder) | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | [jik876/hifi-gan](https://github.com/jik876/hifi-gan) |
|[1810.11946](https://arxiv.org/abs/1810.11946.pdf) | NSF (Vocoder) | Neural source-filter-based waveform model for statistical parametric speech synthesis | [openvpi/DiffSinger/modules/nsf_hifigan](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) |
|[2006.08195](https://arxiv.org/abs/2006.08195) | Snake (Vocoder) | Neural Networks Fail to Learn Periodic Functions and How to Fix It | [EdwardDixon/snake](https://github.com/EdwardDixon/snake) |
16 changes: 14 additions & 2 deletions README_zh_CN.md
@@ -145,6 +145,8 @@ wget -P pretrain/ https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/mai

#### **Optional (Select as Required)**

##### NSF-HIFIGAN

If you use the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you need to download the pre-trained NSF-HIFIGAN model; skip this download if you need neither.

+ Pre-trained NSF-HIFIGAN vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
@@ -158,6 +160,14 @@ unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```

##### RMVPE

If you use the `rmvpe` F0 predictor, you need to download the pre-trained RMVPE model.

+ Download the model: [rmvpe.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt)
+ Place it in the `pretrain` directory


## 📊 Dataset Preparation

Simply place the dataset in the dataset_raw directory using the following file structure
@@ -280,13 +290,14 @@ nsf-snake-hifigan
python preprocess_hubert_f0.py --f0_predictor dio
```

f0_predictor has four options
f0_predictor has the following options

```
crepe
dio
pm
harvest
rmvpe
```

If the training set is too noisy, use crepe to process f0
@@ -338,7 +349,7 @@ python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "

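Optional parameters: see the next section for details
+ `-lg` | `--linear_gradient`: The cross-fade length between two audio slices, in seconds. If the voice sounds discontinuous after forced slicing, adjust this value; if it is continuous, the default value of 0 is recommended.
+ `-f0p` | `--f0_predictor`: Select the F0 predictor; options are crepe, pm, dio, harvest; default is pm (note: with crepe, a mean filter is applied to the original F0)
+ `-f0p` | `--f0_predictor`: Select the F0 predictor; options are crepe, pm, dio, harvest, rmvpe; default is pm (note: with crepe, a mean filter is applied to the original F0)
+ `-a` | `--auto_predict_f0`: Automatically predict pitch for speech conversion. Do not enable this when converting singing voices, as it causes severe pitch errors.
+ `-cm` | `--cluster_model_path`: Cluster model or feature retrieval index path. Leave blank to use the default path of each scheme's model. If no cluster or feature retrieval model has been trained, any value may be entered.
+ `-cr` | `--cluster_infer_ratio`: Proportion of the clustering scheme or feature retrieval, in the range 0-1. If no clustering model or feature retrieval has been trained, leave it at the default of 0.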
@@ -474,6 +485,7 @@ python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="
|[aes35-000039](https://www.aes.org/e-lib/online/browse.cfm?elib=15165) | Dio (F0 Predictor) | Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech | [mmorise/World/dio](https://github.com/mmorise/World/blob/master/src/dio.cpp) |
|[8461329](https://ieeexplore.ieee.org/document/8461329) | Crepe (F0 Predictor) | Crepe: A Convolutional Representation for Pitch Estimation | [maxrmorrison/torchcrepe](https://github.com/maxrmorrison/torchcrepe) |
|[DOI:10.1016/j.wocn.2018.07.001](https://doi.org/10.1016/j.wocn.2018.07.001) | Parselmouth (F0 Predictor) | Introducing Parselmouth: A Python interface to Praat | [YannickJadoul/Parselmouth](https://github.com/YannickJadoul/Parselmouth) |
|[2306.15412v2](https://arxiv.org/abs/2306.15412v2) | RMVPE (F0 Predictor) | RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music | [Dream-High/RMVPE](https://github.com/Dream-High/RMVPE) |
|[2010.05646](https://arxiv.org/abs/2010.05646) | HIFIGAN (Vocoder) | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | [jik876/hifi-gan](https://github.com/jik876/hifi-gan) |
|[1810.11946](https://arxiv.org/abs/1810.11946.pdf) | NSF (Vocoder) | Neural source-filter-based waveform model for statistical parametric speech synthesis | [openvpi/DiffSinger/modules/nsf_hifigan](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) |
|[2006.08195](https://arxiv.org/abs/2006.08195) | Snake (Vocoder) | Neural Networks Fail to Learn Periodic Functions and How to Fix It | [EdwardDixon/snake](https://github.com/EdwardDixon/snake) |
1 change: 1 addition & 0 deletions compress_model.py
@@ -34,6 +34,7 @@ def removeOptimizer(config: str, input_model: str, ishalf: bool, output_model: s
    new_dict_g = copyStateDict(state_dict_g)
    keys = []
    for k, v in new_dict_g['model'].items():
        if "enc_q" in k: continue  # noqa: E701
        keys.append(k)

    new_dict_g = {k: new_dict_g['model'][k].half() for k in keys} if ishalf else {k: new_dict_g['model'][k] for k in keys}
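
The added line skips every weight whose name contains `enc_q` before the compressed checkpoint is written. The same filtering can be sketched on a plain dictionary; `strip_keys` is a hypothetical helper written for this example, not a function from the repository.

```python
def strip_keys(state_dict, skip_substring="enc_q", to_half=False):
    """Return a copy of state_dict without entries whose name contains skip_substring.

    With to_half=True each kept value is converted via its .half() method,
    which assumes the values are tensor-like (e.g. torch.Tensor).
    """
    kept = {}
    for name, value in state_dict.items():
        if skip_substring in name:
            continue  # drop this weight from the compressed checkpoint
        kept[name] = value.half() if to_half else value
    return kept
```

For instance, `strip_keys({"enc_q.w": 1, "dec.w": 2})` keeps only `"dec.w"`.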
2 changes: 1 addition & 1 deletion inference_main.py
@@ -29,7 +29,7 @@ def main():
parser.add_argument('-cm', '--cluster_model_path', type=str, default="", help='聚类模型或特征检索索引路径,留空则自动设为各方案模型的默认路径,如果没有训练聚类或特征检索则随便填')
parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0, help='聚类方案或特征检索占比,范围0-1,若没有训练聚类模型或特征检索则默认0即可')
parser.add_argument('-lg', '--linear_gradient', type=float, default=0, help='两段音频切片的交叉淡入长度,如果强制切片后出现人声不连贯可调整该数值,如果连贯建议采用默认值0,单位为秒')
parser.add_argument('-f0p', '--f0_predictor', type=str, default="pm", help='选择F0预测器,可选择crepe,pm,dio,harvest,默认为pm(注意:crepe为原F0使用均值滤波器)')
parser.add_argument('-f0p', '--f0_predictor', type=str, default="pm", help='选择F0预测器,可选择crepe,pm,dio,harvest,rmvpe,默认为pm(注意:crepe为原F0使用均值滤波器)')
parser.add_argument('-eh', '--enhance', action='store_true', default=False, help='是否使用NSF_HIFIGAN增强器,该选项对部分训练集少的模型有一定的音质增强效果,但是对训练好的模型有反面效果,默认关闭')
parser.add_argument('-shd', '--shallow_diffusion', action='store_true', default=False, help='是否使用浅层扩散,使用后可解决一部分电音问题,默认关闭,该选项打开时,NSF_HIFIGAN增强器将会被禁止')
parser.add_argument('-usm', '--use_spk_mix', action='store_true', default=False, help='是否使用角色融合')
106 changes: 106 additions & 0 deletions modules/F0Predictor/RMVPEF0Predictor.py
@@ -0,0 +1,106 @@
from typing import Union

import numpy as np
import torch
import torch.nn.functional as F

from modules.F0Predictor.F0Predictor import F0Predictor

from .rmvpe import RMVPE


class RMVPEF0Predictor(F0Predictor):
    def __init__(self,hop_length=512,f0_min=50,f0_max=1100, dtype=torch.float32, device=None,sampling_rate=44100,threshold=0.05):
        self.rmvpe = RMVPE(model_path="pretrain/rmvpe.pt",dtype=dtype,device=device)
        self.hop_length = hop_length
        self.f0_min = f0_min
        self.f0_max = f0_max
        if device is None:
            self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        else:
            self.device = device
        self.threshold = threshold
        self.sampling_rate = sampling_rate
        self.dtype = dtype

    def repeat_expand(
        self, content: Union[torch.Tensor, np.ndarray], target_len: int, mode: str = "nearest"
    ):
        ndim = content.ndim

        if content.ndim == 1:
            content = content[None, None]
        elif content.ndim == 2:
            content = content[None]

        assert content.ndim == 3

        is_np = isinstance(content, np.ndarray)
        if is_np:
            content = torch.from_numpy(content)

        results = torch.nn.functional.interpolate(content, size=target_len, mode=mode)

        if is_np:
            results = results.numpy()

        if ndim == 1:
            return results[0, 0]
        elif ndim == 2:
            return results[0]
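
For a 1-D input with the default `mode="nearest"`, `repeat_expand` stretches or shrinks the sequence to `target_len` by repeating or skipping entries. A dependency-free sketch of that index mapping is shown below. `nearest_expand_1d` is a hypothetical name, and the integer formula is an approximation of what `torch.nn.functional.interpolate` computes in floating point.

```python
def nearest_expand_1d(content, target_len):
    """Resize a 1-D sequence to target_len by nearest-neighbour index mapping.

    Output position i reads input position floor(i * len(content) / target_len).
    """
    n = len(content)
    if n == 0 or target_len <= 0:
        return []
    return [content[i * n // target_len] for i in range(target_len)]
```

Expanding `[1, 2, 3]` to length 6 repeats every entry twice, while shrinking `[1, 2, 3, 4]` to length 2 keeps the entries at positions 0 and 2.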

    def post_process(self, x, sampling_rate, f0, pad_to):
        if isinstance(f0, np.ndarray):
            f0 = torch.from_numpy(f0).float().to(x.device)

        if pad_to is None:
            return f0

        f0 = self.repeat_expand(f0, pad_to)

        vuv_vector = torch.zeros_like(f0)
        vuv_vector[f0 > 0.0] = 1.0
        vuv_vector[f0 <= 0.0] = 0.0

        # Remove zero-frequency frames, then interpolate linearly
        nzindex = torch.nonzero(f0).squeeze()
        f0 = torch.index_select(f0, dim=0, index=nzindex).cpu().numpy()
        time_org = self.hop_length / sampling_rate * nzindex.cpu().numpy()
        time_frame = np.arange(pad_to) * self.hop_length / sampling_rate

        vuv_vector = F.interpolate(vuv_vector[None,None,:],size=pad_to)[0][0]

        if f0.shape[0] <= 0:
            return torch.zeros(pad_to, dtype=torch.float, device=x.device),vuv_vector.cpu().numpy()
        if f0.shape[0] == 1:
            return torch.ones(pad_to, dtype=torch.float, device=x.device) * f0[0],vuv_vector.cpu().numpy()

        # This could probably be rewritten with torch?
        f0 = np.interp(time_frame, time_org, f0, left=f0[0], right=f0[-1])
        #vuv_vector = np.ceil(scipy.ndimage.zoom(vuv_vector,pad_to/len(vuv_vector),order = 0))

        return f0,vuv_vector.cpu().numpy()
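
The core of `post_process` is to record which frames are voiced, drop the zero (unvoiced) frames, and fill them by linear interpolation between the neighbouring voiced frames, holding the first and last voiced values at the edges. The sketch below reproduces that idea in plain Python on frame indices rather than timestamps, which gives the same result because frames are evenly spaced. `fill_unvoiced` is a hypothetical helper, not code from the commit.

```python
from bisect import bisect_left


def fill_unvoiced(f0):
    """Return (filled_f0, vuv) for a list of per-frame F0 values.

    vuv is 1.0 for voiced frames (f0 > 0) and 0.0 otherwise. Unvoiced frames
    in filled_f0 are linearly interpolated between neighbouring voiced frames;
    frames before the first or after the last voiced frame hold that value.
    """
    vuv = [1.0 if value > 0 else 0.0 for value in f0]
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return [0.0] * len(f0), vuv

    filled = []
    for i in range(len(f0)):
        pos = bisect_left(voiced, i)
        if pos == 0:
            filled.append(float(f0[voiced[0]]))   # at or before the first voiced frame
        elif pos == len(voiced):
            filled.append(float(f0[voiced[-1]]))  # after the last voiced frame
        else:
            lo, hi = voiced[pos - 1], voiced[pos]
            t = (i - lo) / (hi - lo)
            filled.append(f0[lo] + t * (f0[hi] - f0[lo]))
    return filled, vuv
```

Returning the voiced/unvoiced flags alongside the filled curve mirrors the `(f0, vuv_vector)` pair that `post_process` returns.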

    def compute_f0(self,wav,p_len=None):
        x = torch.FloatTensor(wav).to(self.dtype).to(self.device)
        if p_len is None:
            p_len = x.shape[0]//self.hop_length
        else:
            assert abs(p_len-x.shape[0]//self.hop_length) < 4, "pad length error"
        f0 = self.rmvpe.infer_from_audio(x,self.sampling_rate,self.threshold)
        if torch.all(f0 == 0):
            rtn = f0.cpu().numpy() if p_len is None else np.zeros(p_len)
            # Return a single array so the type matches the non-silent branch below
            return rtn
        return self.post_process(x,self.sampling_rate,f0,p_len)[0]

    def compute_f0_uv(self,wav,p_len=None):
        x = torch.FloatTensor(wav).to(self.dtype).to(self.device)
        if p_len is None:
            p_len = x.shape[0]//self.hop_length
        else:
            assert abs(p_len-x.shape[0]//self.hop_length) < 4, "pad length error"
        f0 = self.rmvpe.infer_from_audio(x,self.sampling_rate,self.threshold)
        if torch.all(f0 == 0):
            rtn = f0.cpu().numpy() if p_len is None else np.zeros(p_len)
            return rtn,rtn
        return self.post_process(x,self.sampling_rate,f0,p_len)
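
Both methods derive the expected number of frames from the audio length and the hop size, and reject a caller-supplied `p_len` that differs from it by 4 frames or more. A standalone sketch of that check follows. `expected_frames` and `check_p_len` are hypothetical names, and raising `ValueError` instead of using `assert` is a choice made for this example.

```python
def expected_frames(num_samples: int, hop_length: int = 512) -> int:
    """Number of whole hops that fit into num_samples."""
    return num_samples // hop_length


def check_p_len(p_len, num_samples, hop_length=512, tolerance=4):
    """Return the frame count to pad to, mirroring the check in compute_f0.

    If p_len is None the expected frame count is used; otherwise p_len must
    differ from the expected count by fewer than `tolerance` frames.
    """
    frames = expected_frames(num_samples, hop_length)
    if p_len is None:
        return frames
    if abs(p_len - frames) >= tolerance:
        raise ValueError("pad length error")
    return p_len
```

With the class defaults (44100 Hz sampling rate, hop length 512), one second of audio corresponds to 86 frames.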
10 changes: 10 additions & 0 deletions modules/F0Predictor/rmvpe/__init__.py
@@ -0,0 +1,10 @@
from .constants import * # noqa: F403
from .inference import RMVPE # noqa: F401
from .model import E2E, E2E0 # noqa: F401
from .spec import MelSpectrogram # noqa: F401
from .utils import (  # noqa: F401
    cycle,
    summary,
    to_local_average_cents,
    to_viterbi_cents,
)
9 changes: 9 additions & 0 deletions modules/F0Predictor/rmvpe/constants.py
@@ -0,0 +1,9 @@
SAMPLE_RATE = 16000

N_CLASS = 360

N_MELS = 128
MEL_FMIN = 30
MEL_FMAX = SAMPLE_RATE // 2
WINDOW_LENGTH = 1024
CONST = 1997.3794084376191
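
The code that consumes `N_CLASS` and `CONST` is not shown in this diff. Assuming they follow the CREPE convention, where the 360 output bins are spaced 20 cents apart, `CONST` is the cents value of bin 0, and cents are measured relative to 10 Hz, a bin index could be converted to a frequency as sketched below. `CENTS_PER_BIN` and `bin_to_hz` are hypothetical names introduced for this example.

```python
N_CLASS = 360                # number of pitch bins, from constants.py
CONST = 1997.3794084376191   # from constants.py; assumed to be the cents offset of bin 0
CENTS_PER_BIN = 20           # assumption: CREPE-style 20-cent bin spacing


def bin_to_hz(bin_index: float) -> float:
    """Convert a (possibly fractional) pitch-bin index to Hz.

    Assumes cents are measured relative to a 10 Hz reference.
    """
    cents = CENTS_PER_BIN * bin_index + CONST
    return 10.0 * 2.0 ** (cents / 1200.0)
```

Under these assumptions, 60 bins span exactly one octave, so `bin_to_hz(60)` is twice `bin_to_hz(0)`.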