Efficient Incremental Text-to-Speech on GPUs
Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech
applications that require ultra-low response latency to provide an optimal user experience. However, most
existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes limitations in
high-concurrency scenarios, especially when the pipeline is built with end-to-end neural network models. To
address this issue, we present a highly efficient approach to perform real-time incremental TTS on GPUs with
Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed
method is capable of producing high-quality speech with a first-chunk latency lower than 80ms under 100 QPS on a
single NVIDIA A10 GPU and significantly outperforms its non-incremental counterpart in both concurrency and latency.
Our work demonstrates the effectiveness of high-performance incremental TTS on GPUs.
Metrics
QPS: Queries-per-Second. The number of TTS requests sent to the server per second.
FCL: First Chunk Latency. The time between the request being sent and the first audio chunk being
received.
LCL: Last Chunk Latency. The time between the request being sent and the last audio chunk being
received.
RTF: Real-Time Factor. LCL/Duration for incremental synthesis and Latency/Duration for
non-incremental synthesis.
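The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' measurement code: the `RequestTiming` class and the function names are hypothetical, and the example timings are chosen to reproduce the short-text row at 100 QPS (FCL 37.69 ms, LCL 200.02 ms, audio duration 2.64 s). Reported RTF values may be averaged per request, so recomputing other rows from the mean duration can differ in the last digit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical client-side record for one TTS request (all times in seconds)."""
    sent_at: float
    chunk_arrivals: List[float]  # arrival time of each audio chunk, in order
    audio_duration: float        # length of the synthesized audio


def first_chunk_latency_ms(t: RequestTiming) -> float:
    # FCL: request sent -> first audio chunk received
    return (t.chunk_arrivals[0] - t.sent_at) * 1000.0


def last_chunk_latency_ms(t: RequestTiming) -> float:
    # LCL: request sent -> last audio chunk received
    return (t.chunk_arrivals[-1] - t.sent_at) * 1000.0


def real_time_factor(latency_ms: float, audio_duration_s: float) -> float:
    # RTF: LCL/Duration (incremental) or Latency/Duration (non-incremental)
    return (latency_ms / 1000.0) / audio_duration_s


# Illustrative timings for one request (assumed values, not measured here).
timing = RequestTiming(sent_at=0.0,
                       chunk_arrivals=[0.03769, 0.12, 0.20002],
                       audio_duration=2.64)
fcl = first_chunk_latency_ms(timing)
lcl = last_chunk_latency_ms(timing)
rtf = real_time_factor(lcl, timing.audio_duration)
print(f"FCL={fcl:.2f} ms, LCL={lcl:.2f} ms, RTF={rtf:.3f}")
```

An RTF below 1 means the audio is produced faster than it plays back; for the non-incremental pipeline the same `real_time_factor` function applies with the full response latency in place of LCL.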
Incremental Pipeline - Overlap 4
Short Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 22.48 | 109.58 | 0.041
20 | 25.43 | 123.37 | 0.047
30 | 26.45 | 130.42 | 0.049
40 | 26.58 | 139.27 | 0.053
50 | 27.00 | 146.58 | 0.055
60 | 30.52 | 159.37 | 0.060
70 | 32.04 | 168.60 | 0.064
80 | 33.27 | 176.22 | 0.067
90 | 35.76 | 186.99 | 0.071
100 | 37.69 | 200.02 | 0.076
Medium Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 24.92 | 167.39 | 0.041
20 | 26.27 | 184.02 | 0.045
30 | 27.17 | 197.99 | 0.048
40 | 28.41 | 218.33 | 0.053
50 | 33.41 | 249.08 | 0.061
60 | 35.28 | 270.97 | 0.066
70 | 38.16 | 295.36 | 0.072
80 | 43.03 | 330.87 | 0.081
90 | 48.46 | 378.39 | 0.093
100 | 56.19 | 438.77 | 0.107
Long Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 27.20 | 286.47 | 0.041
20 | 27.23 | 321.10 | 0.046
30 | 30.83 | 383.15 | 0.055
40 | 35.12 | 477.44 | 0.068
50 | 45.67 | 585.71 | 0.084
60 | 57.95 | 745.58 | 0.107
70 | 77.05 | 1002.56 | 0.143
80* | 936.99 | 2466.44 | 0.352
90* | 15844.50 | 17401.40 | 2.484
100* | 30566.30 | 32131.50 | 4.590
Mixed Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 25.18 | 187.11 | 0.041
20 | 26.60 | 208.41 | 0.046
30 | 28.26 | 233.91 | 0.051
40 | 29.93 | 268.41 | 0.058
50 | 34.51 | 307.90 | 0.067
60 | 38.90 | 346.18 | 0.075
70 | 44.47 | 400.32 | 0.087
80 | 51.30 | 456.63 | 0.101
90 | 60.42 | 541.99 | 0.121
100 | 75.15 | 669.85 | 0.151
Incremental Pipeline - Overlap 8
Short Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 22.25 | 108.24 | 0.041
20 | 25.03 | 120.98 | 0.046
30 | 26.19 | 129.40 | 0.049
40 | 26.24 | 137.21 | 0.052
50 | 26.99 | 146.90 | 0.055
60 | 30.87 | 161.18 | 0.061
70 | 32.26 | 170.41 | 0.064
80 | 33.70 | 177.68 | 0.067
90 | 36.10 | 189.02 | 0.071
100 | 38.63 | 205.13 | 0.077
Medium Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 24.48 | 163.90 | 0.040
20 | 26.02 | 182.18 | 0.045
30 | 26.44 | 194.37 | 0.048
40 | 28.52 | 219.53 | 0.054
50 | 33.67 | 252.22 | 0.062
60 | 35.83 | 275.58 | 0.067
70 | 39.38 | 305.88 | 0.075
80 | 45.02 | 348.09 | 0.085
90 | 52.69 | 408.19 | 0.100
100 | 61.51 | 480.58 | 0.118
Long Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 26.26 | 279.01 | 0.040
20 | 27.27 | 314.08 | 0.045
30 | 31.03 | 385.89 | 0.055
40 | 37.41 | 493.60 | 0.071
50 | 48.61 | 623.64 | 0.089
60 | 61.92 | 798.00 | 0.114
70 | 91.42 | 1190.83 | 0.170
80* | 7593.05 | 9227.65 | 1.318
90* | 23025.10 | 24670.90 | 3.523
100* | 37970.60 | 39622.70 | 5.661
Mixed Text
QPS | FCL(ms) | LCL(ms) | RTF
10 | 23.89 | 179.55 | 0.040
20 | 25.83 | 200.64 | 0.044
30 | 27.28 | 224.38 | 0.049
40 | 29.46 | 265.13 | 0.057
50 | 34.80 | 303.04 | 0.066
60 | 39.69 | 355.34 | 0.077
70 | 46.29 | 418.37 | 0.091
80 | 53.60 | 481.11 | 0.106
90 | 64.86 | 580.29 | 0.130
100 | 86.24 | 767.58 | 0.174
* The permitted max batch size is set to 128 for each module. If the
number of retrieved items of a module exceeds the permitted max batch size, the excess will be skipped in
the current iteration and put back into the pool to be processed in the following iterations, resulting in a
backlog in the request pool.
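The capping behaviour described in this note can be sketched as follows. This is a simplified, assumed model rather than the authors' implementation: the names `take_batch` and `simulate_backlog` are hypothetical, one module is modelled in isolation, and items beyond the cap are simply left in the pool, which is equivalent in effect to retrieving them and putting them back.

```python
from collections import deque
from typing import Deque, List

MAX_BATCH_SIZE = 128  # permitted max batch size per module, as stated in the note


def take_batch(pool: Deque[str], max_batch_size: int = MAX_BATCH_SIZE) -> List[str]:
    """Pop at most max_batch_size items from the front of the pool.

    Anything beyond the cap stays in the pool for a later iteration.
    """
    batch: List[str] = []
    while pool and len(batch) < max_batch_size:
        batch.append(pool.popleft())
    return batch


def simulate_backlog(arrivals_per_iter: int, iterations: int,
                     max_batch_size: int = MAX_BATCH_SIZE) -> List[int]:
    """Return the pool size after each iteration for a constant arrival rate."""
    pool: Deque[str] = deque()
    backlog: List[int] = []
    next_id = 0
    for _ in range(iterations):
        # New work items arrive for this module (assumed constant rate).
        for _ in range(arrivals_per_iter):
            pool.append(f"item-{next_id}")
            next_id += 1
        take_batch(pool, max_batch_size)  # one batched forward pass would run here
        backlog.append(len(pool))
    return backlog


print(simulate_backlog(100, 3))  # arrivals below the cap: pool drains every iteration
print(simulate_backlog(150, 3))  # arrivals above the cap: backlog grows by 22 per iteration
```

Once arrivals per iteration exceed the cap, the pool grows without bound, which is consistent with the sharp latency increase in the rows marked with an asterisk.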
Non-Incremental Pipeline
Short Text
QPS | Latency(ms) | RTF
10 | 99.63 | 0.038
20 | 128.45 | 0.049
30 | 132.49 | 0.050
40 | 181.84 | 0.067
50 | 182.79 | 0.068
60 | 212.45 | 0.079
70 | 227.66 | 0.086
80 | 247.27 | 0.094
90 | 293.49 | 0.111
100 | 464.85 | 0.176
Medium Text
QPS | Latency(ms) | RTF
10 | 195.82 | 0.046
20 | 215.76 | 0.052
30 | 234.68 | 0.057
40 | 258.29 | 0.063
50 | 320.32 | 0.078
60 | 377.08 | 0.092
70 | 451.67 | 0.109
80 | 566.63 | 0.139
90 | 779.93 | 0.190
100 | 1517.05 | 0.371
Long Text
QPS | Latency(ms) | RTF
10 | 327.04 | 0.046
20 | 355.41 | 0.051
30 | 438.09 | 0.062
40 | 531.79 | 0.076
50 | 690.85 | 0.098
60 | 897.92 | 0.128
70 | 1405.51 | 0.202
80 | 7648.74 | 1.094
90 | 18913.50 | 2.706
100 | 52586.90 | 7.523
Mixed Text
QPS | Latency(ms) | RTF
10 | 222.80 | 0.050
20 | 268.58 | 0.062
30 | 311.24 | 0.072
40 | 376.21 | 0.090
50 | 516.92 | 0.127
60 | 865.47 | 0.217
70 | 1253.25 | 0.313
80 | 7521.03 | 1.842
90 | 18606.80 | 4.525
100 | 51348.20 | 12.708
Latency
Short Text (Audio Duration: 2.64s) under Different QPS.
Medium Text (Audio Duration: 4.08s) under Different QPS.
Long Text (Audio Duration: 7.00s) under Different QPS.
Mixed Text (Audio Duration: 4.54s) under Different QPS.
Real-Time Factor
Samples
Dataset: Chinese Standard Mandarin Speech
Corpus (10000 Sentences) from Databaker
室外建有葡萄园和停机坪。shì wài jiàn yǒu pú táo yuán hé tíng jī píng 。 (English: "Outside, there is a vineyard and a helipad.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
一起在晚上仰望星空吗?yī qǐ zài wǎn shàng yǎng wàng xīng kōng ma ? (English: "Shall we look up at the starry sky together at night?")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
有望年底与观众见面。yǒu wàng nián dǐ yǔ guān zhòng jiàn miàn 。 (English: "It is expected to reach audiences by the end of the year.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
太湖太浦闸开闸排水。tài hú tài pǔ zhá kāi zhá pái shuǐ 。 (English: "The Taipu sluice gate on Lake Tai opens to discharge water.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
天气转好利于春运交通。tiān qì zhuǎn hǎo lì yú chūn yùn jiāo tōng 。 (English: "Improving weather benefits Spring Festival travel.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
许多美妙的念头纷至沓来。xǔ duō měi miào de niàn tou fēn zhì tà lái 。 (English: "Many wonderful ideas came thick and fast.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
时值傍晚,烟囱开始排烟。shí zhí bàng wǎn , yān cōng kāi shǐ pái yān 。 (English: "It was dusk, and the chimneys began to give off smoke.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8
出差也要照顾好自己。chū chāi yě yào zhào gù hǎo zì jǐ 。 (English: "Take good care of yourself, even on business trips.")
Non-Incremental Synthesis
Incremental Synthesis - Overlap 4
Incremental Synthesis - Overlap 8