Research Blog: TTS

Expressive Speech Synthesis with Tacotron

Tuesday, March 27, 2018

Posted by Yuxuan Wang, Research Scientist and RJ Skerry-Ryan, Software Engineer, on behalf of the Machine Perception, Google Brain and TTS Research teams

Paper: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.

Text: *Is* that Utah travel agency?

Reference prosody (Australian)

Synthesized without prosody embedding (American)

Synthesized with prosody embedding (American)

Reference Text: For the first time in her life she had been danced tired.

Synthesized Text: For the last time in his life he had been handily embarrassed.

Reference prosody (American)

Synthesized without prosody embedding (American)

Synthesized with prosody embedding (American)

Text: I've Swallowed a Pollywog.

Reference prosody (Unseen American Speaker)

Synthesized without prosody embedding (British)

Synthesized with prosody embedding (British)
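The role of the prosody embedding in the samples above can be pictured with a minimal Python sketch. It assumes the fixed-length prosody embedding is simply repeated and appended to every text-encoder timestep before decoding. The function name `condition_on_prosody` and the toy vectors are hypothetical, chosen only for illustration; for the actual architecture, please refer to the paper.

```python
def condition_on_prosody(text_encodings, prosody_embedding):
    """Append the same prosody vector to every text-encoder timestep.

    text_encodings: list of per-timestep feature vectors (lists of floats).
    prosody_embedding: one fixed-length vector summarizing the reference audio.
    Both are toy stand-ins; a real model would use learned tensors.
    """
    return [list(step) + list(prosody_embedding) for step in text_encodings]


if __name__ == "__main__":
    # Three timesteps of 2-dimensional text features (made-up values).
    text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    # A 1-dimensional prosody embedding (made-up value).
    prosody = [0.9]
    for step in condition_on_prosody(text, prosody):
        print(step)
```

Because the same vector is attached at every timestep, the text content stays unchanged while the decoder sees one global description of the reference prosody, which is consistent with the samples keeping the target speaker's accent.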

Paper: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Model architecture of Global Style Tokens. The prosody embedding is decomposed into “style tokens” to enable unsupervised style control and transfer. For technical details, please refer to the paper.

Text: United Airlines five six three from Los Angeles to New Orleans has Landed.

Style 1

Style 2

Style 3

Style 4

Style 5
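The decomposition of the prosody embedding into style tokens can be sketched in Python under some simplifying assumptions. Here a reference embedding attends over a small bank of token vectors with plain dot-product attention, and the style embedding is the weighted sum of the tokens. The function names, the token values and the choice of dot-product attention are illustrative assumptions, not the model's actual implementation; the paper has the technical details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embedding_from_weights(weights, tokens):
    """Weighted sum of style tokens; weights can also be set by hand."""
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def style_embedding(reference, tokens):
    """Attend from a reference embedding over the token bank.

    Uses plain dot-product attention as a stand-in for whatever
    attention mechanism the real model uses.
    """
    scores = [sum(r * t for r, t in zip(reference, tok)) for tok in tokens]
    weights = softmax(scores)
    return weights, embedding_from_weights(weights, tokens)


if __name__ == "__main__":
    # Two hypothetical 2-dimensional style tokens.
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    # Style transfer: derive weights from a reference embedding.
    weights, embedding = style_embedding([2.0, 0.0], tokens)
    print(weights, embedding)
    # Style control: pick a single token directly, as in "Style 1" to "Style 5".
    print(embedding_from_weights([1.0, 0.0], tokens))
```

Setting the weights by hand corresponds to style control with no reference audio, while computing them from a reference corresponds to style transfer.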

Acknowledgements

These projects were done jointly between multiple Google teams. Contributors include RJ Skerry-Ryan, Yuxuan Wang, Daisy Stanton, Eric Battenberg, Ying Xiao, Joel Shor, Rif A. Saurous, Yu Zhang, Ron J. Weiss, Rob Clark, Fei Ren and Ye Jia.

Evaluation of Speech for the Google Assistant

Thursday, December 21, 2017

Posted by Enrique Alfonseca, Staff Research Scientist, Google Assistant

Creating the Guidelines


Information Satisfaction: the content of the answer should meet the information needs of the user.

Length: when a displayed answer is too long, users can quickly scan it visually and locate the relevant information. For voice answers, that is not possible. It is much more important to ensure that we provide a helpful amount of information, hopefully not too much or too little. Some of our previous work is currently in use for identifying the most relevant fragments of answers.

Formulation: it is much easier to understand a badly formulated written answer than an ungrammatical spoken answer, so more care has to be placed in ensuring grammatical correctness.

Elocution: spoken answers must have proper pronunciation and prosody. Improvements in text-to-speech generation, such as WaveNet and Tacotron 2, are quickly reducing the gap with human performance.


Tacotron 2: Generating Human-like Speech from Text

Tuesday, December 19, 2017

Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams

Paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper.

Acknowledgements

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Sound Understanding team, TTS Research team, and TensorFlow team.

Text-to-Speech for low resource languages (episode 3): But can it say “Google”?

Friday, February 19, 2016

Posted by Martin Jansche, Software Engineer, Google Research for Low Resource Languages

This is the third episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the first episode, we described the crowdsourced acoustic data collection effort for Project Unison. In the second episode, we described how we built parametric voices based on that data. In this episode, we look at how we are compiling a pronunciation lexicon for a TTS system.

Resources: Bengali pronunciation dictionary (Creative Commons License, CC BY 4.0)

Text-to-Speech for low resource languages (episode 2): Building a parametric voice

Tuesday, December 15, 2015

Posted by Alexander Gutkin, Google Speech Team

This is the second episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the previous episode, we described the crowdsourced data collection effort for Project Unison. In this episode, we describe our work to construct a parametric voice based on that data.

Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)

Tuesday, September 08, 2015

Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages


Left: TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers. Right: Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice

As illustrated in the third image, speaker3 has a drop in energy above 13 kHz, which is visible in the graph and may be present in speech, distorting the speaker’s voice so that he sounds as if he were speaking through a tube.

