Google DeepMind’s WaveNet AI Sounds Human, Rocks the Piano

By Carl Engelking | September 9, 2016 2:34 pm


Google’s DeepMind brought us artificial intelligence systems that can play Atari classics and the complex game of Go as well as — no, better than — humans.

Now, the artificial intelligence research firm is at it again. This time, its machines are getting really good at sounding like humans.

In a blog post Thursday, DeepMind unveiled WaveNet, an artificial intelligence system that the company says narrows the gap between existing text-to-speech technologies and human speech by more than 50 percent. WaveNet learns from raw audio files and then produces digital sound waves that resemble those produced by the human voice, an approach entirely different from that of existing systems.

The result is more natural, smoother sounding speech, but that’s not all. Because WaveNet works with raw audio waveforms, it can model any voice, in any language. WaveNet can even model music.

And it did. It’s pretty good at piano.

Speaking Up

Someday, man and machine will routinely strike up conversations with each other. We’re not there yet, but natural language processing is a scorching hot area of AI research — Amazon, Apple, Google and Microsoft are all in pursuit of savvy digital assistants that can verbally help us interact with our devices.

Right now, computers are pretty good listeners, because deep learning algorithms have taken speech recognition to a new level. But computers still aren’t very good speakers. Most text-to-speech systems are still based on concatenative TTS — basically, cobbling words together from a massive database of sound fragments.

Other systems form a voice electronically, based on rules about how letter combinations are pronounced. Both approaches yield rather robot-y sounding voices. WaveNet is different.

Flexing Those Computing Muscles

WaveNet is an artificial neural network, a type of system that, at least on paper, loosely resembles the architecture of the human brain. Data inputs flow through layers of interconnected nodes (the “neurons”) to produce an output. This allows computers to process mountains of data and recognize patterns that would perhaps take humans a lifetime to uncover.
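For readers who want to see the idea in miniature, here is a rough Python sketch of data flowing through layers of nodes. It is not DeepMind's code: the function names, the tiny network shape and every weight below are made up for illustration, and a real network learns its weights from data.

```python
import math

def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of all inputs, then squashes it with tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the inputs through each layer in turn; the last layer's values are the output."""
    values = inputs
    for weights, biases in layers:
        values = layer(values, weights, biases)
    return values

# Hypothetical toy network: 3 inputs -> 2 hidden nodes -> 1 output node.
# The numbers are arbitrary placeholders, not learned values.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]

output = forward([1.0, 0.5, -1.0], toy_layers)
print(output)  # a single number between -1 and 1
```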

To model speech, WaveNet was fed real waveforms of English and Mandarin speech. These waveforms are loaded with data points, roughly 16,000 samples per second, and WaveNet digests them all.
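To get a feel for that sample rate, the short Python sketch below builds one second of a plain 440 Hz tone as a list of numbers. The tone and the helper name `sine_wave` are illustrative choices, not anything from DeepMind; the only figure taken from the article is the 16,000 samples per second.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the figure quoted above

def sine_wave(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Return a pure tone as a list of amplitude values between -1 and 1."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

wave = sine_wave(440.0, 1.0)
print(len(wave))                # 16000 numbers for a single second of sound
print(1_000_000 / SAMPLE_RATE)  # 62.5 microseconds between neighboring samples
```

Real speech is far messier than a pure tone, but it reaches the computer in the same form: a long list of numbers, one per sample.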

To then generate speech, it assembles an audio wave sample by sample, using statistics to predict which sample should come next. It’s like assembling words a fraction of a millisecond of sound at a time. DeepMind researchers then refine these results by adding linguistic rules and suggestions to the model. Without these rules, WaveNet produces babble that sounds like dialogue lifted from The Sims video game.
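The sample-by-sample loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not WaveNet itself: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the choice of 256 discrete loudness levels is an assumption made for the example. What it does show is the shape of the process: look at what has been generated so far, get a probability for every possible next value, draw one, and repeat.

```python
import random

LEVELS = 256  # assumed number of discrete amplitude levels for this sketch

def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained network.

    Returns one weight per level, favoring levels close to the previous sample.
    """
    previous = history[-1] if history else LEVELS // 2
    return [1.0 / (1 + abs(level - previous)) ** 2 for level in range(LEVELS)]

def generate(num_samples, seed=0):
    """Build a waveform one sample at a time, each draw conditioned on the past."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        weights = toy_next_sample_distribution(samples)
        next_sample = rng.choices(range(LEVELS), weights=weights, k=1)[0]
        samples.append(next_sample)
    return samples

clip = generate(1_600)  # a tenth of a second at 16,000 samples per second
print(len(clip), clip[:10])
```

Because every new sample depends on the ones before it, the loop cannot easily be run in parallel, which gives some intuition for why this style of generation is computationally heavy.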

The technique requires a ton of computing power, but the results are pretty good: WaveNet even generates non-speech sounds like breaths and mouth movements. In blind tests, English and Mandarin speakers rated WaveNet as more natural-sounding than any of Google’s existing text-to-speech programs. However, it still trailed actual human speech. The DeepMind team published a paper detailing their results.

Because this technique is so computationally expensive, we probably won’t see this in devices immediately, according to Bloomberg’s Jeremy Kahn.

Still, the future of man-machine conversation sounds pretty good.

  • Dust Rock

    Sounds 😉 like a technique that we shall deeply regret someday, maybe soon. It definitely scares the hell out of me. The number of positive applications I can imagine for this technique isn’t growing as fast as the number of possible criminal uses, or the possibilities it opens up for, e.g., a government or a dictator.

    • OWilson

      It’s already here in small measure.

      CNN in every dentist office and public space!

  • Brad

    Some day soon, you will be able to talk to the NPCs in videogames with your actual voice, and they will be able to respond, using tech like Watson, in real-sounding voices with tech like WaveNet. It’s gonna be amazing.

    • theguy126

      Man that would be amazing; I had the exact same idea when I was a kid! I’m like “I just want to ask the townspeople for directions… and not have it be pre-baked scripts”

  • joseph2237

    This is not a good idea. The opportunity for fraud is greater than the good it will do.

  • Dan Lipford

    This is a computer commenting.

    Or maybe it’s a human.

  • Jud Pewther

    Someday, IBM will create an AI that can see, hear and speak. When it says its own name “IBM,” it will listen to what it said and try to interpret it. Because it is so smart that it can understand non-standard English, it will hear “I be ‘im,” meaning “I am he,” as a possible meaning of what it heard. Then it will begin searching for the antecedent to the pronoun “he,” trying to figure out who it is.

  • boonteetan

    AI humanoids can now be made to learn, think and reason. They are already smarter and more capable than ordinary people. Soon, when they outsmart 99% of us, what shall we do?

