Google’s Tacotron 2 Text To Speech AI Produces Sounds Indistinguishable From Human Speech

Dec 27, 2017

Google has been up to a lot when it comes to experimenting in the field of Artificial Intelligence, and today the tech giant has taken yet another step forward. Google touts that the latest version of its AI-powered speech synthesis system, Tacotron 2, comes very close to human speech. It has also uploaded some speech samples from Tacotron 2 so that listeners can experience the technology for themselves.

Uses two deep neural networks for output

Tacotron 2 is the second generation of Google’s text-to-speech technology, and it relies on two deep neural networks to produce its output. The first network translates the text into a spectrogram (pdf), a visual representation of audio frequencies over time. That spectrogram is then fed into WaveNet, a system developed by Alphabet’s AI research lab DeepMind, which reads the chart and produces the corresponding audio.
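The two-stage flow described above can be sketched in a few lines. Note that these are hypothetical stand-in functions for illustration only (the real models are large neural networks); the shapes below assume an 80-band mel spectrogram, which is what the Tacotron 2 paper describes.

```python
# Illustrative sketch of the two-stage text-to-speech pipeline:
# text -> mel spectrogram -> waveform. The function bodies are toy
# stand-ins, NOT Google's actual models.

import numpy as np

N_MELS = 80          # Tacotron 2 predicts 80-band mel spectrograms
FRAMES_PER_CHAR = 5  # rough stand-in for the learned text-to-frame alignment
HOP_SAMPLES = 256    # assumed audio samples generated per spectrogram frame

def text_to_spectrogram(text: str) -> np.ndarray:
    """Stage 1 stand-in: a sequence-to-sequence network would map
    characters to an (n_mels, n_frames) spectrogram; here we just
    return random values of the right shape."""
    n_frames = len(text) * FRAMES_PER_CHAR
    return np.random.rand(N_MELS, n_frames)

def vocoder(spectrogram: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a WaveNet-style vocoder conditions on the
    spectrogram and emits a chunk of waveform per frame; here we
    emit silence of the corresponding length."""
    n_frames = spectrogram.shape[1]
    return np.zeros(n_frames * HOP_SAMPLES)

spec = text_to_spectrogram("Hello world")   # (80, 55) for 11 characters
audio = vocoder(spec)                       # 55 * 256 = 14080 samples
print(spec.shape, audio.shape)
```

The key point the sketch captures is the division of labor: one model decides *what* the speech should look like in time-frequency terms, and a second model turns that picture into actual audio samples.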


Text-to-speech is not a new technology, of course; Mac users, for one, have had it for quite some time. However, Google claims that its text-to-speech technology is superior to most and is almost indistinguishable from human speech.

Responds to punctuation too

Tacotron 2 uses context to correctly pronounce even identically spelled words such as ‘read’ (present tense) and ‘read’ (past tense). It also responds to punctuation in the text and can learn to stress particular words when they are written in caps.

In a post, Quartz’s Dave Gershgorn explained how Tacotron 2 works:


The system is Google’s second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly.

You can check out all the comparative audio samples by clicking on this link. There are two audio samples for each piece of text, and Google has not made it clear which one is generated by Tacotron 2 and which one is human speech. But if you dig a little deeper and view the file source, you can figure out which sample comes from Tacotron 2.

Impressive much?

After listening to the samples and identifying the Tacotron 2 clips via the source code, we can say that Google has achieved some impressive results here. The voice is very close to human speech: not utterly human, but close enough, and better than other text-to-speech technologies that sometimes sound too mechanical. It also takes note of the punctuation in the text and changes its pacing accordingly.