DeepMind’s WaveNet Technology Makes Google Assistant’s New Male and Female Voices Sound More Realistic

Oct 9

Google recently rolled out male and female voice options for Google Assistant in English, a welcome alternative for users who have a preference about how their virtual assistant sounds. The new voices sound more realistic thanks to WaveNet, a deep neural network for sound synthesis developed by Alphabet’s DeepMind division.

In 2016, the Alphabet lab introduced WaveNet, a deep neural network for “generating raw audio waveforms that is capable of producing better and more realistic-sounding speech than existing techniques.”


Within 12 months, the team took this “computationally intensive” research prototype into consumer products, the first being the Google Assistant voices for US English and Japanese. The new model can produce waveforms 1,000 times faster than the original, with better resolution and fidelity.

Computational Approach

Alphabet’s computational approach to text-to-speech is a big leap forward compared with previous methods, in which voice artists recorded a huge database of sound fragments that were then stitched together. That older approach could produce unnatural-sounding voices that were difficult to modify, because the whole database had to be re-recorded whenever changes such as new intonations or emotions were introduced. The updated WaveNet model also takes far less time to generate sound than the original prototype did.


DeepMind’s computational approach, introduced in 2016, included a “deep generative model that can create individual waveforms from scratch.”


It made it possible to generate more natural-sounding speech, with realistic accents and intonation and even incidental mouth sounds such as “lip smacks.”

In its blog post, DeepMind explains:

It was built using a convolutional neural network, which was trained on a large dataset of speech samples. During this training phase, the network determined the underlying structure of the speech, such as which tones followed each other and what waveforms were realistic (and which were not). The trained network then synthesised a voice one sample at a time, with each generated sample taking into account the properties of the previous sample.

The resulting voice contained natural intonation and other features such as lip smacks. Its “accent” depended on the voices it had trained on, opening up the possibility of creating any number of unique voices from blended datasets. As with all text-to-speech systems, WaveNet used a text input to tell it which words it should generate in response to a query.
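To make the "one sample at a time" idea concrete, here is a minimal Python sketch of autoregressive generation with a small stack of causal, dilated convolutions. It is a toy under stated assumptions, not DeepMind's implementation: the weights in LAYERS are arbitrary and untrained, the names (causal_dilated_layer, predict_next, generate) are hypothetical, there is no text conditioning, and Gaussian noise around a predicted value stands in for sampling from a learned distribution over quantized audio levels.

```python
import math
import random

# Hypothetical, untrained weights: ((weight_for_past, weight_for_current), dilation).
LAYERS = [((0.5, 0.5), 1), ((0.4, 0.6), 2), ((0.3, 0.7), 4), ((0.2, 0.8), 8)]


def causal_dilated_layer(signal, weights, dilation):
    """Kernel-size-2 causal convolution: out[t] uses only signal[t] and signal[t - dilation]."""
    w_past, w_now = weights
    out = []
    for t in range(len(signal)):
        # Positions before the start of the signal are treated as silence (zero padding).
        past = signal[t - dilation] if t - dilation >= 0 else 0.0
        out.append(math.tanh(w_past * past + w_now * signal[t]))
    return out


def receptive_field(dilations):
    """Number of past samples (including the current one) that can influence one output."""
    return 1 + sum(dilations)


def predict_next(history, layers):
    """Run the layer stack over the history and read the prediction at the last position."""
    x = history
    for weights, dilation in layers:
        x = causal_dilated_layer(x, weights, dilation)
    return x[-1]


def generate(n_samples, layers, seed=0):
    """Generate audio one sample at a time, feeding each new sample back in as input."""
    rng = random.Random(seed)
    field = receptive_field([d for _, d in layers])
    audio = [0.0]  # start from a single sample of silence
    for _ in range(n_samples):
        context = audio[-field:]  # only the receptive field can affect the prediction
        mean = predict_next(context, layers)
        sample = mean + rng.gauss(0.0, 0.05)  # assumed stand-in for sampling a learned distribution
        audio.append(max(-1.0, min(1.0, sample)))
    return audio[1:]


if __name__ == "__main__":
    samples = generate(200, LAYERS)
    print(len(samples), "samples generated; receptive field =",
          receptive_field([d for _, d in LAYERS]))
```

Doubling the dilation at each layer lets a short stack see further back in time (16 samples here with four layers), which is one reason dilated convolutions suit raw audio. The loop in generate also shows why this style of synthesis is slow in its naive form: every new sample requires another pass through the network.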

You can check out DeepMind’s latest blog post on the new approach for male and female voices on Google Assistant.