Mathematical Vocoder Algorithm: Modified Spectral Inversion for Efficient Neural Speech Synthesis

Abstract:

In this work, we propose a new mathematical vocoder algorithm (a new type of spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of the proposed method is that it removes the neural vocoder training stage from the end-to-end speech synthesis model without sacrificing sound quality. Moreover, the vocoder no longer limits synthesis speed or sound quality at inference time. Our implementation synthesizes high-fidelity speech at approximately 20 MHz on CPU and 59.6 MHz on GPU, which is 909 and 2,702 times faster than real time, respectively. Because the proposed method is not data-driven, it applies to unseen voices and multiple languages without any additional work. We expect the method to be useful for research on neural network models capable of synthesizing speech at studio-recording quality.
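The real-time factors quoted above come from dividing the synthesis rate (samples generated per second) by the audio sampling rate. The following is a minimal sketch of that arithmetic. It assumes the 22,050 Hz sampling rate of LJSpeech, since the page does not state the rate used for the benchmark; the function name is illustrative.

```python
# Assumption: the benchmark audio is sampled at 22,050 Hz (the LJSpeech rate).
SAMPLE_RATE_HZ = 22_050


def real_time_factor(samples_per_second: float,
                     sample_rate_hz: float = SAMPLE_RATE_HZ) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    return samples_per_second / sample_rate_hz


cpu_rtf = real_time_factor(20.0e6)  # "approximately 20 MHz" on CPU
gpu_rtf = real_time_factor(59.6e6)  # 59.6 MHz on GPU
print(f"CPU: {cpu_rtf:.1f}x real time, GPU: {gpu_rtf:.1f}x real time")
```

Under this assumption the results land close to the quoted 909 and 2,702; a small gap on the CPU figure is expected because the 20 MHz rate is itself stated as approximate.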

Contents

CAUTION: Sample files are not preloaded, so playback may take a moment to start after you press the play button, depending on your internet connection.

Quality Comparison

For the sound quality comparison, all neural vocoder samples other than those from our algorithm were taken from the respective neural vocoder demo sites.

[Hifi-GAN] [GAN vocoder] [WaveFlow] [DiffWave] [WaveGlow] [SqueezeWave]



LJSpeech samples from [Hifi-GAN]


1 (LJ050-0270) 2 (LJ017-0033) 3 (LJ004-0233) 4 (LJ011-0009) 5 (LJ042-0161)
Ground Truth
WaveNet (MoL)
WaveGlow
MelGAN
HiFi-GAN V1
HiFi-GAN V2
HiFi-GAN V3
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [GAN vocoder]


1 (LJ003-0307) 2 (LJ034-0083) 3 (LJ005-0101) 4 (LJ036-0216) 5 (LJ007-0217)
Ground Truth
HifiGAN
HifiGAN(MultiRes)
pWaveGAN
uMelGAN
VocGAN
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [WaveFlow]


1 (LJ001-0001) 2 (LJ001-0003) 3 (LJ001-0005) 4 (LJ001-0015) 5 (LJ001-0016)
Ground-truth (recorded speech)
WaveGlow (96-layer, res. ch = 256)
WaveFlow (64-layer, res. ch = 256)
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [DiffWave]


1 (LJ001-0001) 2 (LJ001-0002) 3 (LJ001-0003) 4 (LJ001-0004) 5 (LJ001-0005)
Ground-truth
WaveNet (C = 128)
WaveFlow (C = 128)
DiffWave (C = 128, T = 200)
WaveFlow (C = 64)
DiffWave (C = 64, T = 50)
ClariNet (C = 64)
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [WaveGlow] and [SqueezeWave]


1 (LJ001-0015) 2 (LJ001-0051) 3 (LJ001-0063) 4 (LJ001-0072) 5 (LJ001-0079)
Ground Truth
Griffin-Lim(WG)
WaveNet(WG)
WaveGlow
WaveGlow(SW)
SqueezeWave 128L
SqueezeWave 128S
SqueezeWave 64L
SqueezeWave 64S
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

Ablation Studies

Comparison of half and full spectrum bins for sample LJ001-0001

A smaller hop size gives better quality (lower MCD).
With Algorithm 1 (half), we observed three types of artifacts. With Algorithm 2 (full), these artifacts disappeared.


        Half Frequency Bin (MCD)        Full Frequency Bin (MCD)
hop     win_size 1024   win_size 512    win_size 1024   win_size 512
512     7.1201          -               6.342           -
384     7.0498          -               6.190           -
256     6.9279          7.0491          6.061           6.578
192     6.9282          6.9884          6.011           6.163
128     6.6855          6.9730          5.948           6.051
96      6.558           6.7327          5.825           5.951
64      6.4737          6.7075          5.798           5.858
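The page does not say how the MCD values above were computed. For reference, the sketch below uses the commonly cited definition of mel-cepstral distortion, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name, the assumption that frames are already time-aligned, and the exclusion of the 0th (energy) coefficient are illustrative choices, not details taken from this work.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over paired frames of mel-cepstral coefficients.

    Assumes both inputs are already time-aligned, have the same shape
    (frames x coefficients), and exclude the 0th (energy) coefficient.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("inputs must be non-empty and equally long")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        sq_dist = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(sq_dist)
    return total / len(ref_frames)


# Toy coefficients for illustration only (not data from this page).
reference = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.1]]
synthesized = [[0.9, 0.6, -0.2], [0.8, 0.3, -0.1]]
print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

Lower values mean the synthesized cepstra are closer to the reference, which is why the smaller hop sizes in the table score better.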

Denoising effect with positive threshold

In Algorithm 2, raising the clipping threshold slightly above zero, instead of clipping at zero, produced a noise-removal effect.
Light noise in silent sections was removed at a level comparable to a commercial denoising tool, without any extra processing.
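The page does not specify which quantity the threshold is applied to, so the following is only a generic sketch of threshold clipping. It assumes that "zero clipping" means setting values that do not exceed the threshold to zero; the function name `clip_below` and the sample values are hypothetical.

```python
def clip_below(values, threshold=0.0):
    """Set every value that does not exceed `threshold` to zero.

    Assumption: this mirrors "zero clipping" when threshold == 0 and the
    positive-threshold variant otherwise; the actual operation used in
    Algorithm 2 is not described on this page.
    """
    return [v if v > threshold else 0.0 for v in values]


# Made-up values: small positive entries stand in for low-level noise.
frame = [-0.20, 0.03, 0.08, 0.50]
for threshold in (0.0, 0.05, 0.1):
    print(threshold, clip_below(frame, threshold))
```

With a threshold of 0 only the negative value is removed, while the positive thresholds also suppress the small entries, which is the kind of behavior the denoising samples below compare.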


LJ001-0001 LJ001-0015 LJ002-0097 LJ003-0048 LJ005-0240
Ground Truth
denoising tool mild
denoising tool strong
zero clipping threshold=0
zero clipping threshold=0.05
zero clipping threshold=0.1

44 kHz samples

We evaluate 44 kHz audio samples of speech, music (pop), and instrumental performance (piano).
In Algorithm 2, speech at higher sampling rates needs a finer hop size. For music, however, the hop size can be increased because we could not perceive the ghost effect in loud passages.
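To make the hop-size trade-off concrete, the sketch below computes the window overlap and frame rate for a few of the configurations listed in the table that follows. It assumes the labels are written as window size / hop size and that "44 kHz" means the common 44,100 Hz rate; neither is stated explicitly on this page, and the helper names are illustrative.

```python
# Assumption: "44 kHz" refers to the common 44,100 Hz sampling rate.
SAMPLE_RATE_HZ = 44_100


def overlap_ratio(win_size: int, hop_size: int) -> float:
    """Fraction of each analysis window shared with the next window."""
    return 1.0 - hop_size / win_size


def frames_per_second(hop_size: int,
                      sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Number of analysis frames produced per second of audio."""
    return sample_rate_hz / hop_size


# Assumption: each label reads as (window size, hop size).
configs = [(2048, 1024), (2048, 64), (1024, 512), (1024, 64), (2048, 2046)]
for win, hop in configs:
    print(f"{win}/{hop}: overlap {overlap_ratio(win, hop):.1%}, "
          f"{frames_per_second(hop):.1f} frames/s")
```

A finer hop raises both the overlap and the number of frames per second, so it costs more computation per second of audio.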


Speech POP K-POP Drum Piano Piano
GT (44 kHz)
Resampled (22 kHz)
Algo2(2048/1024)
Algo2(2048/768)
Algo2(2048/512)
Algo2(2048/384)
Algo2(2048/256)
Algo2(2048/192)
Algo2(2048/128)
Algo2(2048/64)
Algo2(1024/512)
Algo2(1024/384)
Algo2(1024/256)
Algo2(1024/192)
Algo2(1024/128)
Algo2(1024/64)
Algo3(2048/2046)
Algo3(1024/1022)
Algo3(512/510)

MLS samples

We evaluate multiple languages from the MLS dataset.
The original OPUS files have a 48 kHz sampling rate. However, the spectrograms show that the recordings contain content only below 17 kHz.
We use a 16 kHz sampling rate to truncate the higher frequencies.
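As a quick check on what resampling to 16 kHz retains, the sketch below computes the Nyquist frequency of the target rate and the integer up/down ratio a polyphase resampler would need for 48 kHz input. The helper names are illustrative, and the page does not say which resampling tool was used.

```python
from fractions import Fraction


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def resample_ratio(source_rate_hz: int, target_rate_hz: int) -> Fraction:
    """Reduced up/down ratio for converting source rate to target rate."""
    return Fraction(target_rate_hz, source_rate_hz)


ratio = resample_ratio(48_000, 16_000)
print(f"up/down = {ratio.numerator}/{ratio.denominator}")
print(f"retained bandwidth: {nyquist_hz(16_000):.0f} Hz")
```

At a 16 kHz sampling rate, content above 8 kHz is removed by the resampler's anti-aliasing filter.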


English Spanish German French Dutch Polish Italian Portuguese
GT
algo1(1024/256)
algo1(1024/128)
algo1(1024/64)
algo2(1024/256)
algo2(1024/128)
algo2(1024/64)
algo3(1024/1022)