Abstract: In this work, we propose a new mathematical vocoder algorithm (a new type of spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of the proposed method is that it removes the neural vocoder training stage from the end-to-end speech synthesis model without sacrificing sound quality. Moreover, the vocoder no longer limits synthesis speed or sound quality at inference time. Our implementation synthesizes high-fidelity speech at approximately 20 MHz (that is, 20 million audio samples per second) on CPU and 59.6 MHz on GPU, which is 909 and 2,702 times faster than real time, respectively. Because the proposed method is not data-driven, it applies to unseen voices and multiple languages without any additional work. We expect the method to be useful for research on neural network models capable of synthesizing speech at studio-recording quality.
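The real-time factors quoted in the abstract follow from dividing synthesis throughput (samples generated per second) by the audio sampling rate. The sketch below works through that arithmetic. It assumes the 22,050 Hz sampling rate of LJSpeech, because the abstract does not state which rate the benchmark used, and the function name is illustrative only.

```python
def real_time_factor(samples_per_second: float, sampling_rate_hz: float) -> float:
    """Seconds of audio produced per second of compute."""
    return samples_per_second / sampling_rate_hz


# Assumption: 22,050 Hz (the LJSpeech rate); the abstract does not state
# the sampling rate used for the speed benchmark.
SAMPLING_RATE_HZ = 22_050

cpu_rtf = real_time_factor(20.0e6, SAMPLING_RATE_HZ)  # "approximately 20 MHz" on CPU
gpu_rtf = real_time_factor(59.6e6, SAMPLING_RATE_HZ)  # 59.6 MHz on GPU

print(f"CPU: about {cpu_rtf:.0f}x real time")
print(f"GPU: about {gpu_rtf:.0f}x real time")
```

Under this assumption the results land close to the quoted 909x and 2,702x. The small differences are consistent with the throughput figures being rounded.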
CAUTION: Sample files are not preloaded, so playback may take a moment to start after you press the play button, depending on your internet connection.
For the sound-quality comparison, all neural vocoder samples, other than those produced by our algorithm, were taken from the respective vocoder's demo site.
[Hifi-GAN] [GAN vocoder] [WaveFlow] [DiffWave] [WaveGlow] [SqueezeWave]
LJSpeech samples from [Hifi-GAN]
System | 1 (LJ050-0270) | 2 (LJ017-0033) | 3 (LJ004-0233) | 4 (LJ011-0009) | 5 (LJ042-0161) |
---|---|---|---|---|---|
Ground Truth | |||||
WaveNet (MoL) | |||||
WaveGlow | |||||
MelGAN | |||||
HiFi-GAN V1 | |||||
HiFi-GAN V2 | |||||
HiFi-GAN V3 | |||||
Ours(Algo1 half zeroclip 1024/64) | |||||
Ours(Algo2 full zeroclip 1024/64) | |||||
Ours(Algo3 full signed 512/384) | |||||
LJSpeech samples from [GAN vocoder]
System | 1 (LJ003-0307) | 2 (LJ034-0083) | 3 (LJ005-0101) | 4 (LJ036-0216) | 5 (LJ007-0217) |
---|---|---|---|---|---|
Ground Truth | |||||
HifiGAN | |||||
HifiGAN(MultiRes) | |||||
pWaveGAN | |||||
uMelGAN | |||||
VocGAN | |||||
Ours(Algo1 half zeroclip 1024/64) | |||||
Ours(Algo2 full zeroclip 1024/64) | |||||
Ours(Algo3 full signed 512/384) | |||||
LJSpeech samples from [WaveFlow]
System | 1 (LJ001-0001) | 2 (LJ001-0003) | 3 (LJ001-0005) | 4 (LJ001-0015) | 5 (LJ001-0016) |
---|---|---|---|---|---|
Ground-truth (recorded speech) | |||||
WaveGlow (96-layer, res. ch = 256) | |||||
WaveFlow (64-layer, res. ch = 256) | |||||
Ours(Algo1 half zeroclip 1024/64) | |||||
Ours(Algo2 full zeroclip 1024/64) | |||||
Ours(Algo3 full signed 512/384) | |||||
LJSpeech samples from [DiffWave]
System | 1 (LJ001-0001) | 2 (LJ001-0002) | 3 (LJ001-0003) | 4 (LJ001-0004) | 5 (LJ001-0005) |
---|---|---|---|---|---|
Ground-truth | |||||
WaveNet (C = 128) | |||||
WaveFlow (C = 128) | |||||
DiffWave (C = 128, T = 200) | |||||
WaveFlow (C = 64) | |||||
DiffWave (C = 64, T = 50) | |||||
ClariNet (C = 64) | |||||
Ours(Algo1 half zeroclip 1024/64) | |||||
Ours(Algo2 full zeroclip 1024/64) | |||||
Ours(Algo3 full signed 512/384) | |||||
LJSpeech samples from [WaveGlow] and [SqueezeWave]
System | 1 (LJ001-0015) | 2 (LJ001-0051) | 3 (LJ001-0063) | 4 (LJ001-0072) | 5 (LJ001-0079) |
---|---|---|---|---|---|
Ground Truth | |||||
Griffin-Lim(WG) | |||||
WaveNet(WG) | |||||
WaveGlow | |||||
WaveGlow(SW) | |||||
SqueezeWave 128L | |||||
SqueezeWave 128S | |||||
SqueezeWave 64L | |||||
SqueezeWave 64S | |||||
Ours(Algo1 half zeroclip 1024/64) | |||||
Ours(Algo2 full zeroclip 1024/64) | |||||
Ours(Algo3 full signed 512/384) | |||||
Comparison of half and full spectrum bins for the LJ001-0001 sample
A smaller hop size gave better quality.
With Algorithm 1 (half), we observed three types of artifacts.
With Algorithm 2 (full), these artifacts disappeared.
Hop size | Half frequency bins, win_size 1024 (MCD) | Half frequency bins, win_size 512 (MCD) | Full frequency bins, win_size 1024 (MCD) | Full frequency bins, win_size 512 (MCD)
---|---|---|---|---
512 | 7.1201 | - | 6.342 | -
384 | 7.0498 | - | 6.190 | -
256 | 6.9279 | 7.0491 | 6.061 | 6.578
192 | 6.9282 | 6.9884 | 6.011 | 6.163
128 | 6.6855 | 6.9730 | 5.948 | 6.051
96 | 6.558 | 6.7327 | 5.825 | 5.951
64 | 6.4737 | 6.7075 | 5.798 | 5.858
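The table reports mel-cepstral distortion (MCD), where lower values indicate a closer match to the reference. The page does not say how MCD was computed (number of coefficients, alignment method, or tooling), so the following is only a sketch of the commonly used definition. It assumes two already time-aligned sequences of mel-cepstral coefficients with the energy coefficient removed. The function name and the toy data are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref: np.ndarray, syn: np.ndarray) -> float:
    """Mean MCD in dB between two time-aligned (frames, coeffs) arrays.

    Assumes the 0th (energy) coefficient has already been dropped and that
    both sequences have the same number of frames.
    """
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and equally shaped")
    diff = ref - syn
    # Per-frame distortion: sqrt(2 * sum of squared coefficient differences).
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Scale to dB and average over frames.
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))


# Toy data only: random "cepstra", not values from the table.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 13))
syn = ref + 0.1 * rng.normal(size=(100, 13))

print(mel_cepstral_distortion(ref, ref))  # identical inputs give 0.0
print(mel_cepstral_distortion(ref, syn))  # a small perturbation gives a small positive value
```

In practice the two sequences usually differ in length and are aligned first, for example with dynamic time warping. That step is omitted here.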
Denoising effect with a positive threshold
In Algorithm 2, raising the clipping threshold slightly above zero, instead of clipping at zero, had a noise-removal effect.
Light noise in silent sections was suppressed to a level comparable to that of a commercial denoising tool, without any additional processing.
Setting | LJ001-0001 | LJ001-0015 | LJ002-0097 | LJ003-0048 | LJ005-0240 |
---|---|---|---|---|---|
Ground Truth | |||||
denoising tool mild | |||||
denoising tool strong | |||||
zero clipping threshold=0 | |||||
zero clipping threshold=0.05 | |||||
zero clipping threshold=0.1 | |||||
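The three threshold settings in the table can be illustrated with a generic clipping operation. This page does not specify which quantity Algorithm 2 thresholds or whether the comparison is strict, so the sketch below only shows the general idea on made-up numbers. The function name `clip_below` and the sample values are hypothetical.

```python
import numpy as np


def clip_below(values: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Set every entry that does not exceed `threshold` to zero."""
    return np.where(values > threshold, values, 0.0)


# Made-up values: one negative entry, two small positive ones, one large one.
x = np.array([-0.30, 0.02, 0.07, 0.50])

print(clip_below(x, 0.0))   # zero clipping: only the negative entry is removed
print(clip_below(x, 0.05))  # the smallest positive entry is removed as well
print(clip_below(x, 0.1))   # both small positive entries are removed
```

Raising the threshold removes progressively more low-magnitude entries. That matches the intuition that faint noise is discarded while louder content survives.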
44 kHz samples
We evaluate 44 kHz audio samples of speech, music (pop), and instrumental performance (piano).
With Algorithm 2, speech at a higher sampling rate needs a finer hop size.
For music, however, we could increase the hop size because the ghost effect was not perceptible in loud passages.
Setting | Speech | POP | K-POP | Drum | Piano | Piano |
---|---|---|---|---|---|---|
GT (44 kHz) | ||||||
Resampled to 22 kHz | ||||||
Algo2(2048/1024) | ||||||
Algo2(2048/768) | ||||||
Algo2(2048/512) | ||||||
Algo2(2048/384) | ||||||
Algo2(2048/256) | ||||||
Algo2(2048/192) | ||||||
Algo2(2048/128) | ||||||
Algo2(2048/64) | ||||||
Algo2(1024/512) | ||||||
Algo2(1024/384) | ||||||
Algo2(1024/256) | ||||||
Algo2(1024/192) | ||||||
Algo2(1024/128) | ||||||
Algo2(1024/64) | ||||||
Algo3(2048/2046) | ||||||
Algo3(1024/1022) | ||||||
Algo3(512/510) | ||||||
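The win_size/hop labels in the table above are given in samples, so the same label corresponds to a different time step at a different sampling rate. The sketch below converts a few of them to milliseconds and to window overlap. It assumes that "44 kHz" means the common 44,100 Hz rate and that the 22 kHz material uses 22,050 Hz. Neither exact value is stated on this page, and the helper names are illustrative.

```python
def hop_duration_ms(hop_size: int, sampling_rate_hz: int) -> float:
    """Time advanced per analysis frame, in milliseconds."""
    return 1000.0 * hop_size / sampling_rate_hz


def overlap_ratio(win_size: int, hop_size: int) -> float:
    """Fraction of each window that is shared with the next window."""
    return 1.0 - hop_size / win_size


# Assumption: 44,100 Hz and 22,050 Hz are the exact rates behind "44 kHz" and "22 kHz".
settings = [
    (22_050, 1024, 64),
    (44_100, 1024, 64),
    (44_100, 2048, 128),
    (44_100, 2048, 2046),
]

for sr, win, hop in settings:
    print(
        f"{win}/{hop} at {sr} Hz: "
        f"hop = {hop_duration_ms(hop, sr):.2f} ms, "
        f"overlap = {overlap_ratio(win, hop):.2%}"
    )
```

This only restates the labels in time units. It does not by itself explain why a given setting sounds better or worse.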
MLS samples
We evaluate multiple languages from the MLS dataset.
The original OPUS files have a 48 kHz sampling rate. However, the spectrograms show that the recorded content lies below 17 kHz.
We use a 16 kHz sampling rate to truncate the higher frequencies.
Setting | English | Spanish | German | French | Dutch | Polish | Italian | Portuguese |
---|---|---|---|---|---|---|---|---|
GT | ||||||||
algo1(1024/256) | ||||||||
algo1(1024/128) | ||||||||
algo1(1024/64) | ||||||||
algo2(1024/256) | ||||||||
algo2(1024/128) | ||||||||
algo2(1024/64) | ||||||||
algo3(1024/1022) | ||||||||
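Moving from 48 kHz to 16 kHz keeps only content below the new Nyquist frequency of 8 kHz. The page does not describe the resampling procedure used for the MLS samples, so the following is just one common way to do it, sketched with NumPy: a windowed-sinc low-pass filter followed by keeping every third sample. The function name, filter length, and test tones are all assumptions made for illustration.

```python
import numpy as np


def downsample_by_integer(x, fs_in, factor, num_taps=101):
    """Low-pass with a windowed-sinc FIR, then keep every `factor`-th sample."""
    fs_out = fs_in // factor
    cutoff_hz = 0.9 * (fs_out / 2)   # stay a little below the new Nyquist frequency
    fc = cutoff_hz / fs_in           # cutoff in cycles per input sample
    n = np.arange(num_taps) - (num_taps - 1) / 2
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()               # unity gain at DC
    filtered = np.convolve(x, taps, mode="same")
    return filtered[::factor], fs_out


fs = 48_000
t = np.arange(fs) / fs                  # one second of audio
low = np.sin(2 * np.pi * 1_000 * t)     # 1 kHz tone: below the new Nyquist (8 kHz)
high = np.sin(2 * np.pi * 12_000 * t)   # 12 kHz tone: above it

low_16k, fs_out = downsample_by_integer(low, fs, 3)
high_16k, _ = downsample_by_integer(high, fs, 3)

rms_low = float(np.sqrt(np.mean(low_16k ** 2)))
rms_high = float(np.sqrt(np.mean(high_16k ** 2)))
print(fs_out, len(low_16k))
print("1 kHz tone RMS after downsampling:", rms_low)
print("12 kHz tone RMS after downsampling:", rms_high)
```

The 1 kHz tone passes through almost unchanged, while the 12 kHz tone is strongly attenuated instead of aliasing into the retained band.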