Google Duplex – I’m Pretty Amazed!

One of the things that we often see in sci-fi movies, but rarely experience in real life, is the ability to have a natural conversation with a computer. Rapid advances in AI and deep learning in recent years have brought us Amazon Echo and Google Assistant, but these devices mostly handle one request at a time: the computer takes a single request and returns the most likely response. This is not what I would call an interactive conversation.

Earlier this week, Google’s AI team demonstrated what I consider to be a major advance toward the goal of natural human-computer interaction, and I am frankly pretty amazed at the result. Google’s new Duplex product is able to make telephone calls and interact with the other party in ways that are remarkably similar to how a human would interact. Duplex both sounds very human (with natural changes in inflection and interspersed “umm”s and “uh”s) and responds in a human-like way to the natural flow of a conversation.

Google uses a recurrent neural network (RNN) as the basis for understanding the current context of a conversation and generating the sequence of words to say next. The network is trained to perform specific tasks (such as booking an appointment or making a reservation) using traditional deep learning techniques. Each task is trained separately, but the entire collection of recorded conversations across all tasks serves as the shared training corpus for the task-specific networks. Once the RNN has generated the words to say next, Google’s standard Text-to-Speech (TTS) system converts the phrase into speech.
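To make the "context in, next words out" idea a little more concrete, here is a toy sketch of a recurrent step in plain Python. It is emphatically not Google's model: the vocabulary, the hidden size, the class name `TinyRNN`, and the `speak` stand-in for a TTS system are all made up for illustration, and the weights are random rather than trained, so the generated words are meaningless. What it does show is the mechanics: a hidden state that is updated word by word (the conversation context) and then used to pick the next words greedily.

```python
import math
import random

# Hypothetical toy vocabulary; a real system would have a far larger one.
VOCAB = ["<eos>", "hi", "a", "table", "for", "four", "at", "seven", "please", "thanks"]
HIDDEN = 8  # size of the hidden (context) vector, chosen arbitrarily


def make_matrix(rows, cols, rng):
    """Random weights; a trained network would learn these from data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyRNN:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_xh = make_matrix(HIDDEN, len(VOCAB), rng)  # input word -> hidden
        self.w_hh = make_matrix(HIDDEN, HIDDEN, rng)      # previous hidden -> hidden
        self.w_hy = make_matrix(len(VOCAB), HIDDEN, rng)  # hidden -> word scores

    def step(self, h, word):
        """Fold one word into the context: h' = tanh(W_xh x + W_hh h)."""
        x = VOCAB.index(word)  # one-hot input, so W_xh x is just column x
        return [
            math.tanh(self.w_xh[i][x] + sum(self.w_hh[i][j] * h[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)
        ]

    def encode(self, h, words):
        """Update the context with everything the other party just said."""
        for word in words:
            h = self.step(h, word)
        return h

    def generate(self, h, max_words=5):
        """Greedily pick the highest-scoring next word until <eos> or the limit."""
        reply = []
        for _ in range(max_words):
            scores = [sum(self.w_hy[k][j] * h[j] for j in range(HIDDEN)) for k in range(len(VOCAB))]
            word = VOCAB[scores.index(max(scores))]
            if word == "<eos>":
                break
            reply.append(word)
            h = self.step(h, word)  # our own words become part of the context too
        return reply, h


def speak(words):
    """Stand-in for a TTS system: here we just join the words into a string."""
    return " ".join(words)


rnn = TinyRNN()
context = rnn.encode([0.0] * HIDDEN, ["hi", "a", "table", "for", "four"])
reply, context = rnn.generate(context)
print(speak(reply))
```

The point of returning the updated context from `generate` is that the next turn of the conversation starts from it, rather than from scratch, which is what separates this from the one-request-one-response pattern described earlier.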

Latency is an important aspect of natural conversation. Humans don’t generally expect long delays between phrases, and Duplex tries to keep latency low (less than 100 ms, typically) using several techniques, including falling back to faster, low-confidence models when that is necessary to meet the latency demands. Conversely, when responding to a complex phrase, the system is smart enough to add more latency than it needs, to match the approximate time a human might take to respond to a complex utterance.
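The two latency ideas above (fall back to a faster, less confident model when time is short, and deliberately pause after a complex utterance) can be sketched as a small policy. This is only my illustration of the trade-off, not how Duplex actually decides: the `Model` record, the latency and confidence figures, the eight-word "complex" threshold, and the 60 ms-per-word pacing are all invented numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    expected_latency_ms: float  # hypothetical typical time to produce a result
    confidence: float           # hypothetical quality score between 0 and 1


def pick_model(models, budget_ms):
    """Use the most confident model that fits the latency budget.

    If nothing fits, fall back to the fastest model available.
    """
    within_budget = [m for m in models if m.expected_latency_ms <= budget_ms]
    if within_budget:
        return max(within_budget, key=lambda m: m.confidence)
    return min(models, key=lambda m: m.expected_latency_ms)


def extra_pause_ms(compute_ms, utterance_words, complex_after_words=8, ms_per_word=60.0):
    """Extra delay to add so the reply to a long utterance is not unnaturally fast.

    Short utterances get no padding; long ones are padded up to a
    human-like target that grows with the length of what was said.
    """
    if utterance_words <= complex_after_words:
        return 0.0
    target_ms = ms_per_word * utterance_words
    return max(0.0, target_ms - compute_ms)


models = [
    Model("full", expected_latency_ms=300.0, confidence=0.95),
    Model("fast", expected_latency_ms=60.0, confidence=0.80),
]

# A quick "hello?" needs a near-instant answer, so the fast model wins.
print(pick_model(models, budget_ms=100.0).name)
# A long, complicated request: answer was ready in 200 ms, so pad the rest.
print(extra_pause_ms(compute_ms=200.0, utterance_words=20))
```

Note that the padding never goes negative: if computing the reply already took longer than the human-like target, no extra pause is added.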

You can read more about Google Duplex, including recorded samples of interactive speech, on Google’s AI blog:

About David Calloway

Hi! I'm David Calloway, the author of this blog on deep learning and artificial intelligence. I first started working with neural networks in the mid-80s, before the "dark winter" of neural networking technologies. I graduated from the U.S. Air Force Academy in 1979 with B.S. degrees in Physics and Electrical Engineering. In 1982, I received an M.S. degree in Electrical Engineering from Purdue University, where I worked on early attempts at speech recognition. In 2005, I obtained another M.S. degree, this time in Biology, from the University of Central Florida. My interest in neural networks and deep learning was rekindled recently when I got involved in a project at Nova Technologies, where I am using deep learning and TensorFlow to recognize and classify objects in satellite imagery.