Mathematical Vocoder Algorithm: Modified Spectral Inversion for Efficient Neural Speech Synthesis

Abstract:

In this work, we propose a new mathematical vocoder algorithm (a new type of spectral inversion) that generates a waveform from acoustic features without phase estimation. The main benefit of the proposed method is that it removes the neural vocoder training stage from the end-to-end speech synthesis model without sacrificing sound quality. Moreover, the vocoder no longer limits synthesis speed or sound quality at inference time. Our implementation can synthesize high-fidelity speech at approximately 20 MHz on CPU and 59.6 MHz on GPU, which is 909 and 2,702 times faster than real time, respectively. Since the proposed method is not data-driven, it is applicable to unseen voices and multiple languages without any additional work. We expect the proposed method to be useful for research on neural network models capable of synthesizing speech at studio-recording quality.
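
The speed figures above are given in audio samples generated per second. As a rough consistency check, the sketch below converts the real-time factors into throughput. It assumes a 22,050 Hz sampling rate (the rate of the LJ Speech dataset); the abstract does not state which rate the benchmark used, and the function name `throughput_mhz` is only illustrative.

```python
# Assumption: benchmark audio is sampled at 22,050 Hz (the LJ Speech rate).
SAMPLE_RATE_HZ = 22_050


def throughput_mhz(real_time_factor: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Samples generated per second, in MHz, for a given real-time factor."""
    return real_time_factor * sample_rate_hz / 1e6


cpu_mhz = throughput_mhz(909)    # CPU real-time factor from the abstract
gpu_mhz = throughput_mhz(2702)   # GPU real-time factor from the abstract
print(f"CPU: {cpu_mhz:.1f} MHz, GPU: {gpu_mhz:.1f} MHz")
```

Under that assumption, both factors agree with the quoted throughput to within rounding.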

Contents

Comparison with [HiFi-GAN] (LJ Speech Dataset)
Comparison with [GAN vocoder] (LJ Speech Dataset)
Comparison with [WaveFlow] (LJ Speech Dataset)
Comparison with [DiffWave] (LJ Speech Dataset)
Comparison with [WaveGlow] (LJ Speech Dataset)
Comparison with [SqueezeWave] (LJ Speech Dataset)
Ablation Studies 1. Various hop sizes (LJ Speech Dataset)
Ablation Studies 2. Denoising effects (LJ Speech Dataset)

CAUTION: Sample files are not preloaded, so playback may take a moment to start after you press the play button, depending on your internet connection.

Quality Comparison

For the sound quality comparison, all neural vocoder samples other than those from our algorithm were taken from the respective vocoder's demo site.

[Hifi-GAN] [GAN vocoder] [WaveFlow] [DiffWave] [WaveGlow] [SqueezeWave]

LJSpeech samples from [Hifi-GAN]

	1 (LJ050-0270)	2 (LJ017-0033)	3 (LJ004-0233)	4 (LJ011-0009)	5 (LJ042-0161)
Ground Truth
WaveNet (MoL)
WaveGlow
MelGAN
HiFi-GAN V1
HiFi-GAN V2
HiFi-GAN V3
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [GAN vocoder]

	1 (LJ003-0307)	2 (LJ034-0083)	3 (LJ005-0101)	4 (LJ036-0216)	5 (LJ007-0217)
Ground Truth
HifiGAN
HifiGAN(MultiRes)
pWaveGAN
uMelGAN
VocGAN
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [WaveFlow]

	1 (LJ001-0001)	2 (LJ001-0003)	3 (LJ001-0005)	4 (LJ001-0015)	5 (LJ001-0016)
Ground-truth (recorded speech)
WaveGlow (96-layer, res. ch = 256)
WaveFlow (64-layer, res. ch = 256)
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [DiffWave]

	1 (LJ001-0001)	2 (LJ001-0002)	3 (LJ001-0003)	4 (LJ001-0004)	5 (LJ001-0005)
Ground-truth
WaveNet (C = 128)
WaveFlow (C = 128)
DiffWave (C = 128, T = 200)
WaveFlow (C = 64)
DiffWave (C = 64, T = 50)
ClariNet (C = 64)
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

LJSpeech samples from [WaveGlow] and [SqueezeWave]

	1 (LJ001-0015)	2 (LJ001-0051)	3 (LJ001-0063)	4 (LJ001-0072)	5 (LJ001-0079)
Ground Truth
Griffin-Lim(WG)
WaveNet(WG)
WaveGlow
WaveGlow(SW)
SqueezeWave 128L
SqueezeWave 128S
SqueezeWave 64L
SqueezeWave 64S
Ours(Algo1 half zeroclip 1024/64)
Ours(Algo2 full zeroclip 1024/64)
Ours(Algo3 full signed 512/384)

Ablation Studies

Comparison of half and full spectrum bins for sample LJ001-0001

A smaller hop size gives better quality.
With Algorithm 1 (half), we observed three types of artifacts. With Algorithm 2 (full), the artifacts disappeared.

MCD by hop size and window size
	Half Frequency Bin		Full Frequency Bin
hop	win_size: 1024	win_size: 512	win_size: 1024	win_size: 512
512	7.1201	-	6.342	-
384	7.0498	-	6.190	-
256	6.9279	7.0491	6.061	6.578
192	6.9282	6.9884	6.011	6.163
128	6.6855	6.9730	5.948	6.051
96	6.558	6.7327	5.825	5.951
64	6.4737	6.7075	5.798	5.858
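
The table reports mel-cepstral distortion (MCD), where lower is better. This page does not say how MCD was computed, so the following is only a minimal sketch of the commonly used definition, assuming time-aligned frames of mel-cepstral coefficients and skipping the 0th (energy) coefficient. The function name and the toy data are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients; the 0th
    (energy) coefficient is skipped, as is common practice.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need equal, non-zero numbers of frames")
    # Constant factor of the usual definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


# Toy data (hypothetical, not taken from the experiments above).
reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
synthesis = [[1.0, 0.4, -0.2], [0.8, 0.4, 0.0]]
mcd = mel_cepstral_distortion(reference, synthesis)
print(f"MCD: {mcd:.3f} dB")
```

Identical inputs give an MCD of zero, so the values in the table can be read as a distance from the ground-truth recording.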

Denoising effect with a positive threshold

In Algorithm 2, raising the clipping threshold slightly above zero, instead of clipping at zero, produced a noise-removal effect.
Light noise in silent sections was removed to a degree comparable to a commercial denoising tool, with no additional processing.

	LJ001-0001	LJ001-0015	LJ002-0097	LJ003-0048	LJ005-0240
Ground Truth
denoising tool mild
denoising tool strong
zero clipping threshold=0
zero clipping threshold=0.05
zero clipping threshold=0.1
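
One way to read the threshold rows above is as a clip that zeroes every value at or below the threshold. The sketch below illustrates that reading only. The name `clip_below`, the choice to zero values rather than shift them, and the sample values are all assumptions, since this page does not specify which quantity in Algorithm 2 is clipped.

```python
def clip_below(values, threshold=0.0):
    """Set every value that does not exceed `threshold` to zero.

    threshold=0.0 corresponds to plain zero clipping; a small positive
    threshold additionally removes low-level content.
    Assumption: an illustrative reading, not the authors' implementation.
    """
    return [v if v > threshold else 0.0 for v in values]


frame = [-0.30, 0.02, 0.04, 0.08, 0.50]   # hypothetical values
zero_clipped = clip_below(frame, 0.0)     # only negatives are removed
denoised = clip_below(frame, 0.05)        # small positive values go too
print(zero_clipped)
print(denoised)
```

Under this reading, a larger threshold trades a quieter noise floor for the risk of removing low-level signal.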

44 kHz samples

We evaluate 44 kHz audio samples of speech, music (pop), and instrumental performance (piano).
In Algorithm 2, speech at a higher sampling rate needs a finer hop size. For music, however, the hop size can be increased, because the ghost effect is not perceptible in loud sound.

	Speech	POP	K-POP	Drum	Piano	Piano
GT (44 kHz)
Resampled to 22 kHz
Algo2(2048/1024)
Algo2(2048/768)
Algo2(2048/512)
Algo2(2048/384)
Algo2(2048/256)
Algo2(2048/192)
Algo2(2048/128)
Algo2(2048/64)
Algo2(1024/512)
Algo2(1024/384)
Algo2(1024/256)
Algo2(1024/192)
Algo2(1024/128)
Algo2(1024/64)
Algo3(2048/2046)
Algo3(1024/1022)
Algo3(512/510)
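
To make the hop-size trade-off in the rows above concrete, the sketch below computes how much consecutive windows overlap and how many frames are produced per second of audio. It assumes the row labels read as win_size/hop and that "44 kHz" means 44,100 Hz; neither is stated explicitly on this page, and `frame_stats` is a hypothetical helper.

```python
def frame_stats(win_size: int, hop: int, sample_rate_hz: int):
    """Return (overlap fraction between consecutive windows, frames per second)."""
    if not 0 < hop <= win_size:
        raise ValueError("hop must be in (0, win_size]")
    overlap = 1.0 - hop / win_size
    frames_per_second = sample_rate_hz / hop
    return overlap, frames_per_second


# Assumption: labels read as win_size/hop, and "44 kHz" means 44,100 Hz.
for win_size, hop in [(2048, 1024), (2048, 64), (1024, 64), (2048, 2046)]:
    overlap, fps = frame_stats(win_size, hop, 44_100)
    print(f"{win_size}/{hop}: overlap {overlap:.1%}, {fps:.1f} frames/s")
```

A finer hop means more overlap and more frames per second, which is the cost paid for the quality gain reported for speech.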

MLS samples

We evaluate multiple languages from the MLS dataset.
The original OPUS files have a sampling rate of 48 kHz. However, the spectrograms show that the recorded audio contains content only below 17 kHz.
We use a 16 kHz sampling rate to truncate the higher frequencies.

	English	Spanish	German	French	Dutch	Polish	Italian	Portuguese
GT
algo1(1024/256)
algo1(1024/128)
algo1(1024/64)
algo2(1024/256)
algo2(1024/128)
algo2(1024/64)
algo3(1024/1022)