Paper
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (accepted at INTERSPEECH 2025)
Abstract
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthesis quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than conventional FM- and score-based approaches, respectively.
Text-to-Speech Demo
Below, we provide TTS samples for each dataset:
- 1. Single speaker (LJSpeech)
- 2. Multi-speaker (VCTK)
- 3. Additional samples
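The NFE labels in the tables below give the number of function evaluations, that is, the number of solver steps used at synthesis time. As a rough illustration of why a straightened trajectory with a consistent velocity tolerates a low NFE, here is a minimal toy sketch in NumPy. It is not the RapFlow-TTS model: `euler_sample`, `straight_velocity`, `curved_velocity`, and the vectors `x0` and `x1` are hypothetical stand-ins chosen only for this example.

```python
import numpy as np

def euler_sample(velocity, x0, nfe):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 using `nfe` Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt  # left endpoint of the current step
        x = x + dt * velocity(x, t)
    return x

# Hypothetical noise sample and target (stand-ins for a mel-spectrogram frame).
x0 = np.zeros(3)
x1 = np.array([1.0, -2.0, 0.5])

def straight_velocity(x, t):
    # Constant velocity along the straight line from x0 to x1.
    return x1 - x0

def curved_velocity(x, t):
    # Time-varying velocity whose exact integral over [0, 1] is also x1 - x0.
    return 2.0 * t * (x1 - x0)

for nfe in (2, 10, 25):
    err_straight = np.abs(euler_sample(straight_velocity, x0, nfe) - x1).max()
    err_curved = np.abs(euler_sample(curved_velocity, x0, nfe) - x1).max()
    print(f"NFE={nfe:2d}  straight error={err_straight:.4f}  curved error={err_curved:.4f}")
```

With the constant velocity, the Euler result matches the target at every NFE up to floating-point error, while the time-varying velocity needs many more steps to get close. This only mirrors the intuition stated in the abstract; the actual model learns its velocity field from data.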
1. Single speaker (LJSpeech)
Text: RapFlow TTS is a TTS model using improved consistency flow matching, and it can synthesize high-quality speech with fewer steps.
Text: The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: And report at length upon the condition of the prisons of the country.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: One of the earliest of the big operators in fraudulent finance was Edward Beaumont Smith.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: Break apart from one another and pile on a plate, throwing a clean doily or a small napkin over them. Break open at table.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: From the Presidential airplane, the Vice President telephoned Attorney General Robert F. Kennedy.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: Might have been more alert in the Dallas motorcade if they had retired promptly in Fort Worth.
GT | RapFlow-TTS | RapFlow-TTS† | FastSpeech2 | Comospeech | Grad-TTS | Matcha-TTS | VoiceFlow |
---|---|---|---|---|---|---|---|
n/a | NFE - 2 | NFE - 2 | NFE - 1 | NFE - 2 | NFE - 2, NFE - 25 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
2. Multi-speaker (VCTK)
Text: It's really no great surprise, because the price difference is so much.
Speaker: p225
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: This will take in dividend policy and capital structure.
Speaker: s5
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.
Speaker: p248
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: The decision is an absolute disgrace.
Speaker: p299
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: The Government is keen to promote the growth of friendly societies.
Speaker: p232
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: He may be ready for first team action in March.
Speaker: p243
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: If that's the case, he will struggle.
Speaker: p256
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
Text: It was an easy decision to come here.
Speaker: p279
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 | NFE - 2, NFE - 10 |
3. Additional samples on the LJSpeech dataset
Text: The fire had not quite burnt out at twelve, in nearly four hours, that is to say.
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 |
Text: About this time Davidson and Gordon, the people above-mentioned,
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 |
Text: And in many directions, the intervention of that organized control which we call government.
GT | Baseline | RapFlow-TTS | RapFlow-TTS† |
---|---|---|---|
n/a | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 | NFE - 2, NFE - 10, NFE - 25 |
Thanks for your interest!