Contextual Expressive Text-to-Speech

Jianhong Tu1,2,*, Zeyu Cui1,*, Xiaohuan Zhou1, Siqi Zheng1, Kai Hu1, Ju Fan2, Chang Zhou1,†
1 DAMO Academy, Alibaba Group, China
2 Renmin University, China

0. Contents

  1. Abstract
  2. Demos -- Expressive speech synthesis on EmoV-DB
  3. Demos -- Zero-shot expressive speech synthesis on the novel "Love in the Time of Cholera"
  4. Demos -- Zero-shot expressive speech synthesis on handwritten context


1. Abstract

The goal of expressive text-to-speech (TTS) is to synthesize natural, highly expressive speech with the desired content, prosody, emotion, or timbre. Most previous studies attempt to generate speech from given style and emotion labels, which oversimplifies the problem by classifying styles and emotions into a fixed number of predefined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to use such context to guide the speech synthesis process instead of relying on explicit style and emotion labels. To accomplish this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech from the given context on both synthetic datasets and real-world scenarios.



2. Demos -- Expressive speech synthesis on EmoV-DB

Corresponding to Section 3.2 of our paper, the samples below were synthesized on the EmoV-DB dataset. We compare M-CTTS with M-TTS, M-LTTS, and M-CTTS-NT.

Short summary: The results show that M-CTTS can synthesize speech with accurate content and high expressiveness. M-TTS synthesizes speech with correct content but without emotion. M-LTTS can synthesize expressive speech when given accurate labels. Because M-CTTS-NT does not sufficiently understand the context, the speech it generates sometimes lacks emotion.

3. Demos -- Zero-shot expressive speech synthesis on the novel "Love in the Time of Cholera"

Here is a brief example of a conversation scenario:

4. Demos -- Zero-shot expressive speech synthesis on handwritten context

Short summary: Because M-CTTS can understand the context, it better captures the emotion the context conveys in out-of-domain scenes and synthesizes appropriately expressive speech. M-TTS loses emotion entirely and speaks in a neutral tone. M-LTTS and M-CTTS-NT are inconsistent because they lack an understanding of the context. We find that M-LTTS has difficulty recognizing positive emotions in the context and tends to synthesize a neutral or angry tone instead. M-CTTS-NT struggles to understand the context in the out-of-domain scenario, and the unfamiliar context interferes with synthesis, resulting in poor speech quality.