Is It a Human or Computer Talking? Google Blurs the Lines

By Nathaniel Scharping | January 8, 2018 4:03 pm
(Credit: Viktorus/Shutterstock)


Siri and Alexa are good, but no one would mistake either for a human being. Google’s newest project, however, could change that.

Called Tacotron 2, the latest attempt to make computers talk like people builds on two of the company’s most recent text-to-speech projects, the original Tacotron and WaveNet.

Repeat After Me

Tacotron 2 pairs the text-mapping abilities of its predecessor with the speaking prowess of WaveNet for an end result that is, frankly, a bit unsettling. It takes text and, based on training from snippets of actual human speech, maps the syllables and words onto a spectrogram, a visual representation of how the frequencies in a sound change over time. A vocoder based on WaveNet then turns the spectrogram into actual speech. Tacotron 2 uses a spectrogram with 80 frequency channels, which Google says is enough to recreate not only the accurate pronunciation of words but the natural rhythms of human speech as well. The researchers report their work in a paper published to the preprint server arXiv.
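For readers who want to see what an 80-band spectrogram looks like as data, here is a rough sketch in Python. It is not Google's code: the frame length, hop size and test tone are arbitrary choices, the function names are made up for illustration, and the bands are simple equal-width slices of the mel scale rather than the overlapping filters a production system would likely use. Only the figure of 80 bands comes from the article.

```python
import cmath
import math

def hz_to_mel(hz):
    """Convert hertz to the mel scale (one common formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def band_index(hz, sample_rate, n_bands):
    """Which of n_bands equal-width mel bands a frequency falls into."""
    top = hz_to_mel(sample_rate / 2.0)
    return min(int(hz_to_mel(hz) / top * n_bands), n_bands - 1)

def frame_magnitudes(frame):
    """Magnitudes of a naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def mel_spectrogram(samples, sample_rate, frame_len=256, hop=128, n_bands=80):
    """Return one row of n_bands energies for each overlapping frame."""
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        bands = [0.0] * n_bands
        mags = frame_magnitudes(samples[start:start + frame_len])
        for k, mag in enumerate(mags):
            hz = k * sample_rate / frame_len
            bands[band_index(hz, sample_rate, n_bands)] += mag
        rows.append(bands)
    return rows

# A pure 1,000 Hz tone, sampled 8,000 times per second (an arbitrary test signal).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1024)]
spec = mel_spectrogram(tone, SAMPLE_RATE)
loudest = max(range(80), key=lambda b: spec[0][b])
print(len(spec), "frames x", len(spec[0]), "bands; loudest band:", loudest)
```

Each row is a snapshot of one instant of sound, and each of its 80 numbers says how much energy sits in one slice of the frequency range. Speech would light up many bands at once and shift from frame to frame; the pure tone here lights up just one.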

Most computer voice programs construct sentences from a library of recorded syllables and words, a technique called concatenative synthesis. When humans speak, we vary our pronunciation widely depending on context; stitched-together recordings do not, and that gives computer-speak its lifeless patina. Google is attempting to get away from the repetition of words and sounds and to construct sentences based not only on the words they’re made of, but on what they mean as well. The program uses a neural network, a web of interconnected nodes, to identify patterns in speech and ultimately predict what will come next in a sentence, helping to smooth out intonation.
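The stitching approach can be boiled down to a deliberately crude toy. In this sketch the "recordings" are invented lists of numbers standing in for audio clips, and the library holds exactly one clip per word; real systems store many recordings and pick among them, so treat this as an illustration of the basic limitation rather than a picture of any actual product.

```python
# Hypothetical "recordings": each word maps to one fixed clip of audio samples.
# Real clips would hold thousands of samples; these short lists are stand-ins.
UNIT_LIBRARY = {
    "i": [3, 1],
    "will": [2, 2, 5],
    "record": [7, 4, 4, 1],
    "the": [1, 9],
    "broke": [8, 8, 2],
}

def synthesize(sentence):
    """Concatenate the stored clip for each word, in order."""
    audio = []
    for word in sentence.lower().split():
        audio.extend(UNIT_LIBRARY[word])
    return audio

verb = synthesize("I will record")       # should sound like re-CORD
noun = synthesize("I broke the record")  # should sound like REC-ord
# Both sentences end with the identical clip, whatever the context demands.
print(verb[-4:] == noun[-4:])  # prints True
```

A lookup table has no idea that the word at the end of each sentence should be pronounced differently. A system that predicts sound from the whole sentence at least has the chance to get it right.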

The researchers back up their bluster with a bevy of examples posted online. Where WaveNet sounded accurate but a bit flat, Tacotron 2 sounds fleshed out and impressively varied. The samples include the same phrase spoken by both programs for comparison.



The program can also handle complex, multi-syllabic words with ease, and can be instructed to add stress to words or syllables to alter the interpretation of sentences. This means Tacotron 2 can phrase things as questions and correctly differentiate between homonyms, as well as more subtle things like highlighting the subject of a sentence by adding emphasis to a word.

The final, and most compelling, test is a side-by-side comparison of a human and a computerized voice. Tacotron 2 earns a mean opinion score of 4.53, a widely used listener-rated measure of speech quality on a five-point scale, the researchers say, compared to 4.58 for professionally recorded speech. The online samples let you see if you can tell the difference.
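A mean opinion score is simply the average of the ratings human listeners give. The sketch below shows the arithmetic, assuming ratings on a scale from 1 to 5; the function name is made up for illustration and the ratings are invented, not data from the paper.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average listener ratings given on a 1-to-5 quality scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must fall between 1 and 5")
    return mean(ratings)

# Made-up ratings from eight hypothetical listeners, not data from the paper.
synthetic_ratings = [5, 4.5, 4, 5, 4.5, 4.5, 4.5, 4]
print(mean_opinion_score(synthetic_ratings))
```

Because the score is an average of many opinions, a gap as small as the one between 4.53 and 4.58 means listeners rated the two voices almost identically.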

Although the program is impressive, it still has a few flaws. It can’t inject any emotion into its speech, and it isn’t yet fast enough to produce audio in real time. And don’t ask it to order wine for you, either.


  • Uncle Al

    TACOtron? I’m going to ignore the obvious and assume it is a gender slur.

  • OWilson

    Trolls, have routinely tried defying the old Turing Test, by using fake bot language to take rational debate down deep holes to the bottom of the Swamp.

    Why do you ask?

    Are you really sure?

    What is your evidence?

    What is your source?

    Why don’t you believe the experts?

    You are wrong!

    Who’s paying you?

    You are just….. (too old, too young, too mentally challenged to be taken seriously)

    We have a couple of these posters on our very own Discover blogs! They think nobody notices! :)

    • John C

      Your English needs some work, Vlad.

      • OWilson

        Are you really sure?

        What is your evidence?

        Did you call me Vlad because you think I am a Russian spy?

        Why is that? :)

  • Vijay Jegakumar

    hello google

