Contextual Expressive Text-to-Speech

Jianhong Tu^1,2,*, Zeyu Cui^1,*, Xiaohuan Zhou^1, Siqi Zheng^1, Kai Hu^1, Ju Fan^2, Chang Zhou^1,†
^1 DAMO Academy, Alibaba Group, China; ^2 Renmin University, China

0. Contents

Abstract
Demos -- Expressive speech synthesis on EmoV-DB test set
Demos -- Zero-shot expressive speech synthesis on the novel "Love in the Time of Cholera"
Demos -- Zero-shot expressive speech synthesis on handwritten context

1. Abstract

The goal of expressive text-to-speech (TTS) is to synthesize natural, highly expressive speech with the desired content, prosody, emotion, or timbre. Most previous studies generate speech from given style and emotion labels, which oversimplifies the problem by classifying styles and emotions into a fixed number of predefined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, and that this context can typically be represented as text. In the CTTS task, we therefore propose to use such context to guide the speech synthesis process instead of relying on explicit style and emotion labels. To address this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context, both on synthetic datasets and in real-world scenarios.
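To make the difference between the two task settings concrete, here is a minimal Python sketch of the inputs each one conditions on. The class names and the `describe_conditioning` helper are hypothetical, introduced only for illustration; they are not taken from the paper or from any released code. The example values are drawn from the demo samples on this page.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LabeledTTSInput:
    """Label-conditioned setting: style comes from a fixed category set."""
    content: str   # text to be spoken
    speaker: str   # speaker identity
    emotion: str   # one of a predefined set of emotion labels


@dataclass(frozen=True)
class ContextualTTSInput:
    """Contextual setting (CTTS): style is implied by free-form context text."""
    content: str   # text to be spoken
    speaker: str   # speaker identity
    context: str   # text describing the situation the speaker is in


def describe_conditioning(sample: Union[LabeledTTSInput, ContextualTTSInput]) -> str:
    """Return a short description of what guides the expressiveness."""
    if isinstance(sample, ContextualTTSInput):
        return f"context-conditioned: {sample.context!r}"
    return f"label-conditioned: {sample.emotion}"


labeled = LabeledTTSInput(
    content="Shall I carry you.", speaker="Bea", emotion="Amused"
)
contextual = ContextualTTSInput(
    content="Thank you for coming.",
    speaker="Jenie",
    context="with a very sweet smile, said to him",
)

print(describe_conditioning(labeled))
print(describe_conditioning(contextual))
```

In this sketch the only structural difference is the last field: a categorical `emotion` label in one case and an open-ended `context` string in the other.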

2. Demos -- Expressive speech synthesis on EmoV-DB

Corresponding to Section 3.2 of our paper, the samples below were synthesized on the EmoV-DB dataset. We compare M-CTTS with M-TTS, M-LTTS, and M-CTTS-NT.

Context	the art in the alley behind it is cool too !
Content	From that moment his friendship for Belize turns to hatred and jealousy.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this is the best yarn store in the metro area.
Content	Shall I carry you.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	extremely attentive and genuinely a good person.
Content	She had been thoroughly and efficiently mauled.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	they loved the rock climb.
Content	The questions may have come vaguely in his mind.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the wait staff is extremely attractive and friendly !
Content	I don't know why you're here at all.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	it 's good solid food.
Content	I do not blame you for anything; remember that.
Label	Emotion: Amused, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	it tasted like melted plastic and had the same tough consistency.
Content	For the twentieth time that evening the two men shook hands.
Label	Emotion: Anger, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	needless to say , i will not be returning to this place ever again.
Content	What was the object of your little sensation.
Label	Emotion: Anger, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	lost a long time customer !
Content	Red-Eye swung back and forth on the branch farther down.
Label	Emotion: Anger, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	however , the tech said nothing to me about this.
Content	It seems like a strange pointing of the hand of God.
Label	Emotion: Anger, Speaker: Bea
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the staff is awesome and location is right in the heart of old town !
Content	The men stared into each other's face.
Label	Emotion: Amused, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	had dinner here last night and it was great.
Content	Yes, it was a man who asked, a stranger.
Label	Emotion: Amused, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	fantastic place to see a show as every seat is a great seat !
Content	He had barely entered this when he saw the glow of a fire.
Label	Emotion: Amused, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this was the best i have ever had !
Content	Was it the rendezvous of those who were striving to work his ruin.
Label	Emotion: Amused, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	these two women are professionals.
Content	Philip began to feel that he had foolishly overestimated his strength.
Label	Emotion: Amused, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	um , yah , it does n't need replacing just yet.
Content	Since then some mysterious force has been fighting us at every step.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this is the worst walmart neighborhood market out of any of them.
Content	His face was streaming with blood.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	gammage itself however is not so amazing.
Content	She added, with genuine sympathy in her face and voice.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	we stood there in shock , because we never expected this.
Content	He was determined now to maintain a more certain hold upon himself.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	one was for my dog , and one was for my wife 's dog.
Content	For a time the exciting thrill of his adventure was gone.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the falafel 's looked like chicken nuggets , and were lacking flavor.
Content	Why, the average review is more nauseating than cod liver oil.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	not even real brown sauce.
Content	He cried in such genuine dismay that she broke into hearty laughter.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	had to returned one entree because too cold.
Content	He was the soul of devotion to his employers.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	uneven pieces and falling apart -- i paid for that.
Content	It is very plausible to such people, a most convincing hypothesis.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this is the worst walmart neighborhood market out of any of them.
Content	They are greatly delighted with anything that is bright or giveth a sound.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this room that he found also reeked of smoke !
Content	I'm sure going along with you all, Elijah.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	uneven pieces and falling apart -- i paid for that.
Content	There's too much of the schoolboy in me.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	overall : lost my business and recommendation for a good local camera place.
Content	Tom Spink has a harpoon.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	let me tell you , this place was far from busy !
Content	Both Johnny and his mother shuffled their feet as they walked.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	i 've eaten here many times , but none as bad as last night.
Content	His beady black eyes saw bargains where other men saw bankruptcy.
Label	Emotion: Anger, Speaker: Jenie
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	service was great as they continued to check on our table.
Content	God bless 'em, I hope I'll go on seeing them forever.
Label	Emotion: Amused, Speaker: Josh
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	our waitress was the best , very accommodating.
Content	The girl faced him, her eyes shining with sudden fear.
Label	Emotion: Amused, Speaker: Josh
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	thank you fiesta , lunch with you is always good.
Content	Philip began to feel that he had foolishly overestimated his strength.
Label	Emotion: Amused, Speaker: Josh
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	they made me feel like i was at home and their an extended family !
Content	Ah, I had forgotten, he exclaimed.
Label	Emotion: Amused, Speaker: Josh
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	nice for me to go and work and have a great breakfast !
Content	Gregson had left the outer door slightly ajar.
Label	Emotion: Amused, Speaker: Josh
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	and again , the food is incredibly delicious !
Content	They were three hundred yards apart.
Label	Emotion: Amused, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	she chose a great color that looks incredible with my skin , too.
Content	Perhaps she had already met her fate a little deeper in the forest.
Label	Emotion: Amused, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	always a fun and friendly atmosphere.
Content	Fresh cases, still able to walk, they clustered about the spokesman.
Label	Emotion: Amused, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	when it came we should have sent it back.
Content	I am going to surprise father, and you will go with Pierre.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the staff was nowhere to be found.
Content	And the air was growing chilly.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	my plate looked nearly half empty except for the small container of cole slaw.
Content	Don't you see, I'm chewing this thing in two.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	we sit down and we got some really slow and lazy service.
Content	They were following the shore of a lake.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	she did offer me a copy if i would like a soda while waiting.
Content	I had been sad too long already.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this room that he found also reeked of smoke !
Content	They are not regular oyster pirates, Nicholas continued.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	horrible , horrible , horrible service !
Content	He was pressing beyond the limits of his vocabulary.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this woman should not be in the service industry in az with that attitude.
Content	Broken-Tooth yelled with fright and pain.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the staff was nowhere to be found.
Content	He was a wise hyena.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	zero complaints with his work.
Content	So unexpected was my charge that I knocked him off his feet.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	you 'll have zero appetite after the first bite.
Content	Seventeen, no, eighteen days ago.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	even if i was insanely drunk , i could n't force this pizza down.
Content	These rumors may even originate with us.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	i was very disappointed with this place.
Content	This sound did not disturb the hush and awe of the place.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	do not sign a lease with these people.
Content	It lasted as a deterrent for two days.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the rooms are not that nice and the food is not that good either.
Content	The creative joy, I murmured.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the food here is bland and boring and bad.
Content	The river bared its bosom, and snorting steamboats challenged the wilderness.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	this is the worst walmart neighborhood market out of any of them.
Content	The lines were now very taut.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	it 's always busy and the restaurant is very dirty.
Content	Who the devil gave it to you to be judge and jury.
Label	Emotion: Anger, Speaker: Sam
Performance	Ref	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Short summary: The results show that M-CTTS synthesizes speech with accurate content and high expressiveness. M-TTS synthesizes speech with correct content but without emotion. M-LTTS synthesizes expressive speech when given accurate labels. Because M-CTTS-NT understands the context insufficiently, the speech it generates sometimes lacks emotion.

3. Demos -- Zero-shot expressive speech synthesis on the novel "Love in the Time of Cholera"

Context	roar with the laughter and say:
Content	Not at all: as if you were nobody.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	with a very sweet smile, said to him
Content	Thank you for coming.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	One day, at the height of desperation, she had shouted at him
Content	You don't understand how unhappy I am.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	his subordinates joking good-naturedly.
Content	You can't teach an old dog new tricks.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	roar with the laughter and say:
Content	Not at all: as if you were nobody.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	with a very sweet smile, said to him
Content	Thank you for coming.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	She aroused the cockatoo again with her joyous laughter.
Content	Not even Jonah's wife would swallow that story.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	despite all her self-control, she lost her temper with a historic cry.
Content	To hell with the Archbishop!
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	move her hand away, sat up, and said in a tremulous voice.
Content	Be careful, we have no rubbers.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	she was imperturbable.
Content	This is another one.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	still standing, she said to him in confusion.
Content	Well, you are here now.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Here is a brief example of a conversation scenario:

Narrator (Context): Florentino Ariza was left exhausted, incomplete, floating in a puddle of their perspiration, but with the impression of being no more than an instrument of pleasure. He would say:

Content: "You treat me as if I were just anybody."

Narrator (Context): She would roar with the laughter of a free female and say:

Content: "Not at all: as if you were nobody."
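The exchange above suggests one way that running narration could be turned into (context, content) pairs for contextual synthesis: each narrator line serves as the context for the quoted utterance that follows it. The sketch below illustrates this pairing in Python. The `pair_context_with_content` function and the `Role: text` line format are assumptions made for this illustration, not the preprocessing used in the paper.

```python
from typing import List, Optional, Tuple


def pair_context_with_content(lines: List[str]) -> List[Tuple[str, str]]:
    """Pair each narrator line with the quoted utterance that follows it."""
    pairs: List[Tuple[str, str]] = []
    pending_context: Optional[str] = None
    for line in lines:
        # Split on the first colon only, so colons inside the text survive.
        role, sep, text = line.partition(":")
        if not sep:
            continue  # not a "Role: text" line
        role = role.strip().lower()
        text = text.strip()
        if role.startswith("narrator"):
            pending_context = text
        elif role == "content" and pending_context is not None:
            # Drop the surrounding quotation marks from the utterance.
            pairs.append((pending_context, text.strip('"')))
            pending_context = None
    return pairs


script = [
    "Narrator (Context): Florentino Ariza was left exhausted, incomplete, "
    "floating in a puddle of their perspiration, but with the impression of "
    "being no more than an instrument of pleasure. He would say:",
    'Content: "You treat me as if I were just anybody."',
    "Narrator (Context): She would roar with the laughter of a free female and say:",
    'Content: "Not at all: as if you were nobody."',
]

for context, content in pair_context_with_content(script):
    print(f"context={context!r}")
    print(f"content={content!r}")
```

Each resulting pair could then be passed, together with a speaker identity, to a context-conditioned synthesizer.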

4. Demos -- Zero-shot expressive speech synthesis on handwritten context

Context	things are getting better,
Content	A dead man is of no use on a plantation.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	things are getting better,
Content	The girl faced him, her eyes shining with sudden fear.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	her eyes ablaze with anger,
Content	Philip began to feel that he had foolishly overestimated his strength.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	it is a disgusting situation.
Content	You want to go over and see his gang throw dirt.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	today is a good day. the sun is shining brightly.
Content	These rumors may even originate with us.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the speaker is shocked.
Content	What do you mean by this outrageous conduct.
Label	Emotion: Unknown, Speaker: Bea
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	Today is a good day. The sun is shining brightly.
Content	Yes, it was a man who asked, a stranger.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	her eyes ablaze with anger.
Content	Gregson had left the outer door slightly ajar.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the speaker is cheerful.
Content	Pearce's little eyes were fixed on him shrewdly.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	the speaker becomes serious.
Content	Bassett was a fastidious man.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Context	speaking with a great of joyness.
Content	His beady black eyes saw bargains where other men saw bankruptcy.
Label	Emotion: Unknown, Speaker: Jenie
Performance	M-TTS	M-CTTS	M-LTTS	M-CTTS-NT

Short summary: Because M-CTTS can understand the context, it better captures the emotion that the context conveys in out-of-domain scenes and synthesizes appropriately expressive speech. M-TTS loses emotion entirely and uses a neutral tone. M-LTTS and M-CTTS-NT are inconsistent because they lack an understanding of the context. We find that M-LTTS has difficulty capturing positive emotions in the context and tends to synthesize a neutral or angry tone. M-CTTS-NT has difficulty understanding the context in the out-of-domain scenario, and the unfamiliar context interferes with synthesis, resulting in poor speech quality.