Efficient Incremental Text-to-Speech on GPUs

Incremental text-to-speech, also known as streaming TTS, is increasingly used in online speech applications that require ultra-low response latency to provide a good user experience. However, most existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To address this issue, we present a highly efficient approach to real-time incremental TTS on GPUs based on Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method produces high-quality speech with a first-chunk latency below 80 ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms its non-incremental counterpart in both concurrency and latency. Our work demonstrates the effectiveness of high-performance incremental TTS on GPUs.

Architecture

Metrics


QPS: Queries per second. The number of TTS requests sent to the server per second.
FCL: First-chunk latency. The time between the request being sent and the first audio chunk being received.
LCL: Last-chunk latency. The time between the request being sent and the last audio chunk being received.
RTF: Real-time factor. LCL/Duration for incremental synthesis and Latency/Duration for non-incremental synthesis, where Duration is the length of the synthesized audio.
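As a quick check of the RTF definition, the following minimal sketch recomputes two short-text table entries at 20 QPS from the reported latencies. The function name `real_time_factor` is hypothetical, and the 2.64 s audio duration is the rounded short-text value reported in the Latency section, so recomputed values may differ from the tables in the last digit.

```python
def real_time_factor(latency_ms: float, audio_duration_s: float) -> float:
    """Return latency divided by audio duration, with both expressed in seconds."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return (latency_ms / 1000.0) / audio_duration_s


# Assumption: the short-text audio duration is the rounded 2.64 s from this page.
SHORT_TEXT_DURATION_S = 2.64

# Incremental pipeline (Overlap 4), short text, 20 QPS: LCL = 123.37 ms.
rtf_incremental = real_time_factor(123.37, SHORT_TEXT_DURATION_S)

# Non-incremental pipeline, short text, 20 QPS: latency = 128.45 ms.
rtf_non_incremental = real_time_factor(128.45, SHORT_TEXT_DURATION_S)

print(f"incremental RTF:     {rtf_incremental:.3f}")
print(f"non-incremental RTF: {rtf_non_incremental:.3f}")
```

An RTF below 1 means the audio is synthesized faster than it plays back; the rows with RTF above 1 in the tables correspond to the backlogged cases.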

Incremental Pipeline - Overlap 4
Short Text
QPS FCL(ms) LCL(ms) RTF
10 22.48 109.58 0.041
20 25.43 123.37 0.047
30 26.45 130.42 0.049
40 26.58 139.27 0.053
50 27.00 146.58 0.055
60 30.52 159.37 0.060
70 32.04 168.60 0.064
80 33.27 176.22 0.067
90 35.76 186.99 0.071
100 37.69 200.02 0.076
Medium Text
QPS FCL(ms) LCL(ms) RTF
10 24.92 167.39 0.041
20 26.27 184.02 0.045
30 27.17 197.99 0.048
40 28.41 218.33 0.053
50 33.41 249.08 0.061
60 35.28 270.97 0.066
70 38.16 295.36 0.072
80 43.03 330.87 0.081
90 48.46 378.39 0.093
100 56.19 438.77 0.107
Long Text
QPS FCL(ms) LCL(ms) RTF
10 27.20 286.47 0.041
20 27.23 321.10 0.046
30 30.83 383.15 0.055
40 35.12 477.44 0.068
50 45.67 585.71 0.084
60 57.95 745.58 0.107
70 77.05 1002.56 0.143
80* 936.99 2466.44 0.352
90* 15844.50 17401.40 2.484
100* 30566.30 32131.50 4.590
Mixed Text
QPS FCL(ms) LCL(ms) RTF
10 25.18 187.11 0.041
20 26.60 208.41 0.046
30 28.26 233.91 0.051
40 29.93 268.41 0.058
50 34.51 307.90 0.067
60 38.90 346.18 0.075
70 44.47 400.32 0.087
80 51.30 456.63 0.101
90 60.42 541.99 0.121
100 75.15 669.85 0.151
Incremental Pipeline - Overlap 8
Short Text
QPS FCL(ms) LCL(ms) RTF
10 22.25 108.24 0.041
20 25.03 120.98 0.046
30 26.19 129.40 0.049
40 26.24 137.21 0.052
50 26.99 146.90 0.055
60 30.87 161.18 0.061
70 32.26 170.41 0.064
80 33.70 177.68 0.067
90 36.10 189.02 0.071
100 38.63 205.13 0.077
Medium Text
QPS FCL(ms) LCL(ms) RTF
10 24.48 163.90 0.040
20 26.02 182.18 0.045
30 26.44 194.37 0.048
40 28.52 219.53 0.054
50 33.67 252.22 0.062
60 35.83 275.58 0.067
70 39.38 305.88 0.075
80 45.02 348.09 0.085
90 52.69 408.19 0.100
100 61.51 480.58 0.118
Long Text
QPS FCL(ms) LCL(ms) RTF
10 26.26 279.01 0.040
20 27.27 314.08 0.045
30 31.03 385.89 0.055
40 37.41 493.60 0.071
50 48.61 623.64 0.089
60 61.92 798.00 0.114
70 91.42 1190.83 0.170
80* 7593.05 9227.65 1.318
90* 23025.10 24670.90 3.523
100* 37970.60 39622.70 5.661
Mixed Text
QPS FCL(ms) LCL(ms) RTF
10 23.89 179.55 0.040
20 25.83 200.64 0.044
30 27.28 224.38 0.049
40 29.46 265.13 0.057
50 34.80 303.04 0.066
60 39.69 355.34 0.077
70 46.29 418.37 0.091
80 53.60 481.11 0.106
90 64.86 580.29 0.130
100 86.24 767.58 0.174

* The permitted max batch size is set to 128 for each module. If the number of retrieved items of a module exceeds the permitted max batch size, the excess will be skipped in the current iteration and put back into the pool to be processed in the following iterations, resulting in a backlog in the request pool.
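The batch-size cap described in this note can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_iteration` and `MAX_BATCH_SIZE`, the use of a `deque` as the request pool, and the first-in-first-out ordering are all assumptions made for illustration.

```python
from collections import deque

MAX_BATCH_SIZE = 128  # permitted max batch size per module, as stated in the note


def run_iteration(pool: deque, max_batch_size: int = MAX_BATCH_SIZE) -> list:
    """Retrieve all pending items, batch at most max_batch_size, put back the rest."""
    retrieved = list(pool)
    pool.clear()
    batch = retrieved[:max_batch_size]
    excess = retrieved[max_batch_size:]
    # Excess items are skipped in this iteration and returned to the pool,
    # keeping their original order (FIFO ordering is an assumption here).
    pool.extend(excess)
    return batch


# 300 pending items need three iterations to drain with a cap of 128.
pool = deque(range(300))
batch_sizes = []
while pool:
    batch_sizes.append(len(run_iteration(pool)))
print(batch_sizes)  # [128, 128, 44]
```

When new requests arrive faster than the capped batches can drain the pool, the leftover items accumulate from one iteration to the next, which is the backlog the note refers to.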

Non-Incremental Pipeline
Short Text
QPS Latency(ms) RTF
10 99.63 0.038
20 128.45 0.049
30 132.49 0.050
40 181.84 0.067
50 182.79 0.068
60 212.45 0.079
70 227.66 0.086
80 247.27 0.094
90 293.49 0.111
100 464.85 0.176
Medium Text
QPS Latency(ms) RTF
10 195.82 0.046
20 215.76 0.052
30 234.68 0.057
40 258.29 0.063
50 320.32 0.078
60 377.08 0.092
70 451.67 0.109
80 566.63 0.139
90 779.93 0.190
100 1517.05 0.371
Long Text
QPS Latency(ms) RTF
10 327.04 0.046
20 355.41 0.051
30 438.09 0.062
40 531.79 0.076
50 690.85 0.098
60 897.92 0.128
70 1405.51 0.202
80 7648.74 1.094
90 18913.50 2.706
100 52586.90 7.523
Mixed Text
QPS Latency(ms) RTF
10 222.80 0.050
20 268.58 0.062
30 311.24 0.072
40 376.21 0.090
50 516.92 0.127
60 865.47 0.217
70 1253.25 0.313
80 7521.03 1.842
90 18606.80 4.525
100 51348.20 12.708

Latency


Short Text (Audio Duration: 2.64 s) under Different QPS.
Medium Text (Audio Duration: 4.08 s) under Different QPS.
Long Text (Audio Duration: 7.00 s) under Different QPS.
Mixed Text (Audio Duration: 4.54 s) under Different QPS.

Real-Time Factor


For better visualization, some lines are not displayed by default. You can click on the legends to show/hide lines.

Samples


Dataset: Chinese Standard Mandarin Speech Corpus (10000 Sentences) from Databaker


室外建有葡萄园和停机坪。shì wài jiàn yǒu pú táo yuán hé tíng jī píng 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
一起在晚上仰望星空吗?yī qǐ zài wǎn shàng yǎng wàng xīng kōng ma ?
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
有望年底与观众见面。yǒu wàng nián dǐ yǔ guān zhòng jiàn miàn 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
太湖太浦闸开闸排水。tài hú tài pǔ zhá kāi zhá pái shuǐ 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
天气转好利于春运交通。tiān qì zhuǎn hǎo lì yú chūn yùn jiāo tōng 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
许多美妙的念头纷至沓来。xǔ duō měi miào de niàn tou fēn zhì tà lái 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
时值傍晚,烟囱开始排烟。shí zhí bàng wǎn , yān cōng kāi shǐ pái yān 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
出差也要照顾好自己。chū chāi yě yào zhào gù hǎo zì jǐ 。
Ground Truth
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8