Improving WaveRNN with Heuristic Dynamic Blending for Fast and High-Quality GPU Vocoding
Auto-regressive vocoders are typically less efficient at inference because of their serial nature, which makes it difficult to fully utilize graphics processing units (GPUs). In this context, batched inference with upsampled feature folding can be used to speed up vocoding. However, the speech quality degradation caused by blending the folded waveform segments makes this approach hard to apply in production. To address this issue, we propose a novel blending approach called heuristic dynamic blending (HDB), which effectively addresses the voice trembling and diplopia (echo) issues of conventional static blending. We also propose a parallel HDB algorithm that runs on GPUs and significantly reduces the additional time overhead introduced by the naive HDB algorithm. Experimental results demonstrate that a multi-band WaveRNN with HDB can effectively improve parallelism for real-time GPU vocoding while maintaining high speech quality comparable to non-folding inference.
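To make the folding step concrete, here is a minimal sketch of how a long sequence might be split into overlapping segments that can then be vocoded as one batch. It is not the authors' implementation: the function name `fold_with_overlap`, the zero padding, and the convention that each segment spans `seg_len + overlap` items are assumptions chosen for illustration.

```python
def fold_with_overlap(seq, seg_len, overlap, pad=0.0):
    """Split seq into segments that share `overlap` items with their neighbor.

    Hypothetical helper: each segment has seg_len + overlap items and
    consecutive segments start seg_len items apart. The tail is padded
    with `pad` so that every segment has the same length.
    """
    if seg_len <= 0 or overlap < 0:
        raise ValueError("seg_len must be positive and overlap non-negative")
    width = seg_len + overlap
    # Ceiling division: number of segments needed to cover the sequence.
    n_segs = max(1, -(-max(len(seq) - overlap, 1) // seg_len))
    padded = list(seq) + [pad] * (n_segs * seg_len + overlap - len(seq))
    return [padded[i * seg_len : i * seg_len + width] for i in range(n_segs)]


if __name__ == "__main__":
    # Ten items, segment length 4, overlap 2: two segments sharing items 4 and 5.
    for segment in fold_with_overlap(list(range(10)), seg_len=4, overlap=2):
        print(segment)
```

Because all segments have equal length, they can be stacked along the batch dimension and generated in parallel. The shared overlap region is what a blending step later has to reconcile.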
Audio Samples
Dataset: The LJ Speech Dataset - Keith Ito
Non-Folding: Non-Folding inference.
Folding wo Blending: Folding inference without blending. Crackling Noise.
Folding w Static Blending: Folding inference with Static Blending. Voice Trembling and Echo.
Folding w HDB: Folding inference with the proposed Heuristic Dynamic Blending.
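For reference, the static-blending baseline can be pictured as a fixed cross-fade over each overlap region. The sketch below assumes a linear fade and a hypothetical helper named `static_blend`; the page does not say which window the baseline uses, and the sketch does not implement HDB.

```python
def static_blend(segments, overlap):
    """Join equal-length segments with a fixed linear cross-fade.

    Hypothetical helper: the first `overlap` samples of each segment are
    mixed with the last `overlap` samples of the output so far, using
    weights that depend only on the position inside the overlap.
    """
    out = [float(x) for x in segments[0]]
    for seg in segments[1:]:
        start = len(out) - overlap
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)  # fade-in weight of the new segment
            out[start + k] = (1.0 - w) * out[start + k] + w * seg[k]
        out.extend(float(x) for x in seg[overlap:])
    return out


if __name__ == "__main__":
    left = [0.0, 0.0, 0.0, 0.0]
    right = [3.0, 3.0, 3.0, 3.0]
    blended = static_blend([left, right], overlap=2)
    print(len(blended))  # 4 + (4 - 2) = 6 samples
```

Because the weights are fixed, the two overlapping waveforms are summed wherever the overlap happens to fall, even if they were generated independently and are out of phase. That is one plausible reading of why the static-blending samples exhibit trembling and echo.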
Audio Samples of \(\hat{SL}=100\), \(\hat{OL}=50\)
Name | Ground Truth | Non-Folding | Folding wo Blending | Folding w Static Blending | Folding w HDB |
---|---|---|---|---|---|
LJ009-0110 | |||||
LJ009-0207 | |||||
LJ010-0138 | |||||
LJ015-0159 | |||||
LJ016-0326 | |||||
LJ033-0003 | |||||
LJ039-0184 | |||||
LJ043-0080 | |||||
Audio Samples of \(\hat{SL}=500\), \(\hat{OL}=50\)
Name | Ground Truth | Non-Folding | Folding wo Blending | Folding w Static Blending | Folding w HDB |
---|---|---|---|---|---|
LJ009-0110 | |||||
LJ009-0207 | |||||
LJ010-0138 | |||||
LJ015-0159 | |||||
LJ016-0326 | |||||
LJ033-0003 | |||||
LJ039-0184 | |||||
LJ043-0080 | |||||
Audio Samples of \(\hat{SL}=1000\), \(\hat{OL}=50\)
Name | Ground Truth | Non-Folding | Folding wo Blending | Folding w Static Blending | Folding w HDB |
---|---|---|---|---|---|
LJ009-0110 | |||||
LJ009-0207 | |||||
LJ010-0138 | |||||
LJ015-0159 | |||||
LJ016-0326 | |||||
LJ033-0003 | |||||
LJ039-0184 | |||||
LJ043-0080 | |||||