One thing we often see in sci-fi movies, but rarely experience in real life, is a natural conversation with a computer. Rapid advances in AI and deep learning in recent years have brought us Amazon Echo and Google Assistant, but these devices mostly handle one request at a time: the computer takes a single phrase and returns the most likely response. This is not what I would call an interactive conversation.
Earlier this week, Google’s AI team demonstrated what I consider to be a major advance toward the goal of natural human-computer interaction, and I am frankly pretty amazed at the result. Google’s new Duplex product can make telephone calls and interact with the other party in ways that are remarkably similar to how a human would. Duplex both sounds very human (with natural changes in inflection and interspersed “umm”s and “uh”s) and responds in a human-like way to the natural flow of a conversation.
Google uses a Recurrent Neural Network (RNN) as the basis for understanding the current context of a conversation and generating the sequence of words to say next. The network is trained to perform specific tasks (such as booking an appointment or making a reservation) using traditional deep learning techniques. While each task is trained separately, the entire collection of recorded conversations for all tasks was used as the corpus for training the various task-specific networks. Once the RNN has generated the next sequence of words, Google’s standard Text-to-Speech (TTS) system turns the desired phrase into speech.
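To make the idea of a network that "understands the current context" a little more concrete, here is a minimal sketch of a generic recurrent update, in which a hidden state is carried from one conversational turn to the next. This is not Duplex's architecture. The function names, the toy weights, and the three-number encoding of each turn are all assumptions made purely for illustration.

```python
import math

# Toy weights, chosen by hand for illustration only (not trained values).
W_HH = [[0.5, -0.3], [0.2, 0.4]]            # hidden -> hidden
W_XH = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]  # input  -> hidden
BIAS = [0.0, 0.1]


def rnn_step(hidden, inputs, w_hh=W_HH, w_xh=W_XH, bias=BIAS):
    """One Elman-style recurrent update: h' = tanh(W_hh h + W_xh x + b)."""
    new_hidden = []
    for i in range(len(hidden)):
        total = bias[i]
        total += sum(w_hh[i][j] * hidden[j] for j in range(len(hidden)))
        total += sum(w_xh[i][k] * inputs[k] for k in range(len(inputs)))
        new_hidden.append(math.tanh(total))
    return new_hidden


def encode_conversation(turns):
    """Fold a list of per-turn feature vectors into one context vector."""
    hidden = [0.0] * len(BIAS)  # start with an empty context
    for features in turns:
        hidden = rnn_step(hidden, features)
    return hidden


# Hypothetical feature vectors standing in for two utterances.
greeting = [1.0, 0.0, 0.0]
request = [0.0, 1.0, 0.0]

print(encode_conversation([greeting, request]))
print(encode_conversation([request, greeting]))
```

The two printed vectors differ even though the same two turns were fed in, because the hidden state depends on the order in which they arrived. That order sensitivity is what lets a recurrent model keep track of where a conversation currently stands.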
Latency is an important aspect of natural conversation. Humans don’t generally expect long delays between the phrases of a conversation, and Duplex attempts to keep latency low (less than 100 ms, typically) using several different techniques, including falling back on faster, low-confidence models when that is needed to meet the latency target. When responding to a complex phrase, the system deliberately adds more latency than it needs, to match the approximate time a human might take to respond to a complex utterance.
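The two latency ideas above, picking a faster but less confident model when time is short and padding the reply when the utterance is complex, can be sketched roughly as follows. Everything here is hypothetical: the `Model` record, the latency and confidence numbers, and the word-count heuristic for "complexity" are my own illustrative assumptions, not details Google has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    expected_latency_ms: float
    confidence: float  # higher means more reliable output


def pick_model(models, budget_ms):
    """Return the most confident model that fits the budget.

    If nothing fits, fall back to the fastest model available.
    """
    within_budget = [m for m in models if m.expected_latency_ms <= budget_ms]
    if within_budget:
        return max(within_budget, key=lambda m: m.confidence)
    return min(models, key=lambda m: m.expected_latency_ms)


def response_delay_ms(compute_ms, utterance_words, ms_per_word=40.0):
    """Delay before speaking: never less than the compute time, and padded
    for long utterances (word count is an assumed stand-in for complexity)."""
    human_like_pause = utterance_words * ms_per_word
    return max(compute_ms, human_like_pause)


models = [
    Model("fast-approximate", expected_latency_ms=50, confidence=0.70),
    Model("full", expected_latency_ms=400, confidence=0.95),
]

print(pick_model(models, budget_ms=100).name)   # tight budget
print(pick_model(models, budget_ms=1000).name)  # relaxed budget
print(response_delay_ms(compute_ms=300, utterance_words=20))
```

With a 100 ms budget the sketch selects the fast model, and with a relaxed budget it selects the more confident one. A 20-word utterance that took 300 ms to process is held back until the longer, human-like pause has elapsed.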
You can read more about Google Duplex, including recorded samples of interactive speech, on Google’s AI blog: https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html