Google Duplex – I’m Pretty Amazed!

One of the things that we often see in Sci-Fi movies, but rarely experience in real life, is the ability to have a natural conversation with a computer. Rapid advances in AI and deep learning in recent years have brought us Amazon Echo and Google Assistant, but these devices mostly handle single-phrase requests: the computer takes a single request and returns the most likely response. This is not what I would call an interactive conversation.

Earlier this week, Google’s AI team demonstrated what I consider to be a major advance toward the goal of natural human-computer interaction, and I am frankly pretty amazed at the result. Google’s new Duplex product is able to make telephone calls and interact with the other party in ways that are remarkably similar to how a human would interact. Duplex is able to both sound very human (with natural changes in inflection and interspersed “umm”s and “uh”s) and respond in human-like ways to the natural flow of a conversation.

Google uses a Recurrent Neural Network (RNN) as the basis for understanding the current context of a conversation and generating the sequence of words to say next. The network is trained to perform specific tasks (such as booking an appointment or making a reservation) using traditional deep learning techniques. While each task is trained separately, the entire collection of recorded conversations for all tasks was used as the corpus for training all the various task-specific networks. Once the RNN has generated the sequence of words to say next in the conversation, Google’s standard Text-to-Speech (TTS) system is used to generate the sounds for the phrase to be spoken.

Latency is an important aspect of natural conversations. Humans don’t generally expect long delays between phrases of a conversation, and Duplex attempts to keep the latency low (less than 100 ms, typically) using several different techniques, including relying on low-confidence models when that is necessary to meet the latency demands. When responding to a complex phrase, the system is actually smart enough to add more latency than it requires, to match the approximate time a human might take to respond to a complex utterance.

You can read more about Google Duplex, including recorded samples of interactive speech, on Google’s AI blog:

Posted in Uncategorized | Leave a comment

Nvidia Quadro GV100 – A Deep Learning Supercomputer on a Card

Quadro GV 100 front

At this year’s GPU Technology Conference (Mar 28-30, 2018), Nvidia announced a new GPU specifically designed for deep learning.  The Quadro GV100, based on Nvidia’s latest Volta architecture, sports 5120 CUDA cores, 640 tensor cores, and 32GB of VRAM – producing 14.8 TFLOPS of single-precision floating point performance, 7.4 TFLOPS of double-precision performance, and an incredible 118.5 TFLOPS of tensor performance to speed up deep learning inference and training.  The GV100 interfaces via PCIe and consumes only 250W of power, and two cards can be linked together to provide 64GB of shared memory.  The $9,000 price point for a single card makes this something that hobbyists will likely do without, though serious deep learning researchers may still find it attractive.  Reportedly, Nvidia spent $3 billion developing the GV100, which puts the $9,000 per-card price point in perspective.

The 640 tensor cores Nvidia incorporated into the GV100 architecture are a new type of processor specifically designed to perform 4×4 matrix multiplications in support of deep learning operations.  Each tensor core performs the multiplication of two 4×4 matrices and adds the result to a third 4×4 matrix – exactly the type of operation that consumes the vast majority of processing in a deep learning training or inference scenario, resulting in the equivalent of a 120 TFLOP super-computer on a card.  When looked at from that point of view, the $9K price of the card actually looks like a bargain.  It will be interesting to see where these cards get used and what new breakthroughs in AI and deep learning we might see in the future as researchers apply this technology to this fast-moving domain.
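
The multiply-and-add operation described above can be written out in plain Python. This is only a sketch of the arithmetic (D = A × B + C on 4×4 matrices); the function and variable names are made up for illustration and are not part of any Nvidia API, and a real tensor core does all of this in hardware.

```python
# Illustrative model of one tensor-core operation: D = A x B + C on 4x4 tiles.
# Names here are hypothetical; this is not an Nvidia API.

N = 4  # tensor cores operate on 4x4 matrices


def tensor_core_fma(a, b, c):
    """Return d = a @ b + c for 4x4 matrices given as lists of rows."""
    d = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = c[i][j]                  # start from the accumulator matrix C
            for k in range(N):
                acc += a[i][k] * b[k][j]   # one multiply-add per (i, j, k)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
a = [[float(i * N + j) for j in range(N)] for i in range(N)]
ones = [[1.0] * N for _ in range(N)]

d = tensor_core_fma(a, identity, ones)   # a @ I + 1 is simply a + 1
multiply_adds = N * N * N                # 64 multiply-adds per operation
print(d[0], multiply_adds)
```

Each such operation is 64 multiply-adds, which is why a card with hundreds of these units can reach throughput figures far beyond its ordinary floating point rate.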


Linux Foundation Launches Deep Learning Foundation

The Linux Foundation recently launched the LF Deep Learning Foundation. The goal of this foundation is to support open source innovations in artificial intelligence, machine learning, and deep learning.

Jim Zemlin, the Linux Foundation’s Executive Director, stated, “We are excited to offer a deep learning foundation that can drive long-term strategy and support for a host of projects in the AI, machine learning, and deep learning ecosystems.”

The first project to be launched by the foundation is Acumos: a comprehensive platform for the discovery, development, and sharing of AI models and workflows. AT&T and Tech Mahindra contributed the initial code for the Acumos AI Project. The Linux Foundation will host the Acumos AI platform and the Acumos Marketplace. Code is now available for download at .

In addition to the Acumos AI Project, LF Deep Learning anticipates future project contributions from Baidu and Tencent, among others.

  • Baidu’s EDL project enhances Kubernetes with elastic scheduling and uses PaddlePaddle’s fault-tolerance feature to significantly improve the overall utilization of Kubernetes clusters.
  • Tencent’s Angel project, a high-performance distributed machine-learning platform jointly developed by Tencent and Peking University, is tuned for big data/models. It is capable of supporting over a billion parameters.

Keras Cheat Sheet

If you use Keras (if you don’t, you should!) in your deep learning projects, there is a nice cheat sheet that is worth printing out and keeping near your computer. It summarizes many of the key things you will want to do to create and update a Keras project.


Here’s a link to the full pdf:


word2vec Embeddings

What is an “embedding” for a dictionary of words?  That is the question this article addresses.

When one uses neural networks for linguistic processing, one of the first things to consider is “how should a word be represented?”  In other words, every word needs to have a numeric representation in order for the network to be able to perform any processing on it.  “Embedding” is the term used for assigning a number, or vector (set of numbers), to represent a particular word from a dictionary.  An obvious but, as it turns out, not very practical embedding would be the traditional one-hot code:  take the dictionary of all possible words in the vocabulary, define a vector matching the length of the dictionary, and set the element of the vector matching the position of a word to 1 and all others to zero.  For example, if the dictionary of all words you will work with in your network is simply [‘man’, ‘woman’, ‘red’, ‘blue’, ‘green’], then your vector would have 5 dimensions, the word ‘man’ would be represented by the vector [1, 0, 0, 0, 0], ‘woman’ would be assigned the vector [0, 1, 0, 0, 0], etc.  There are several problems with this choice of embedding:

  1. The size of each vector is defined by the size of the dictionary.  For typical “real-world” dictionaries, each word would be represented by a vector that might be more than a million numbers in length.
  2. The vectors themselves are sparse (all but one of the 1,000,000+ numbers in any particular vector have a value of 0.0).
  3. Words embedded in this vector space have no particular relationship to one another based on their location in the vector space.
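
To make the one-hot scheme and its drawbacks concrete, here is a small sketch using the five-word dictionary from the example above. The helper names (`one_hot`, `dot`) are mine, chosen for illustration.

```python
# One-hot encoding for the toy five-word dictionary used in the text.
vocabulary = ["man", "woman", "red", "blue", "green"]


def one_hot(word, vocab):
    """Return a vector as long as the vocabulary with a single 1 in it."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product, used here as a crude similarity measure."""
    return sum(x * y for x, y in zip(u, v))


man = one_hot("man", vocabulary)
woman = one_hot("woman", vocabulary)
red = one_hot("red", vocabulary)

print(man)     # [1, 0, 0, 0, 0]
print(woman)   # [0, 1, 0, 0, 0]

# Every pair of distinct words is equally unrelated: the dot product is
# always 0, so 'man' is no closer to 'woman' than it is to 'red'.
print(dot(man, woman), dot(man, red))   # 0 0
```

The vector length grows with the dictionary, and the geometry carries no information about meaning, which is exactly what the three problems listed above describe.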

So, the question arises:  Can a vector space be found that can be used to embed (represent) words from a dictionary such that the vectors themselves are small (say, only 100-300 elements, rather than 1,000,000+ elements for a one-hot encoding) and would have the desirable feature that words that are near one another in this multi-dimensional space have similar meanings?  The nice thing about the last of these desired attributes is that rather than representing a word by an arbitrary number or vector, the chosen vectors (one for each word) would somehow represent the syntactic or semantic characteristics of each word in the dictionary.

Rather than using the simple one-hot encoding, or embedding, Tomas Mikolov in 2013 (working at Google at the time) came up with the idea of training a simple neural network (one with only linear activations – no sigmoid, relu, or other non-linear activations) to learn embeddings for words based on the context in which they are used.  Here’s a link to Mikolov’s word2vec paper, titled “Efficient Estimation of Word Representations in Vector Space” (it’s a very readable paper!).  Mikolov trained a network using a dictionary of the million most-frequently-used words in a training corpus consisting of the 6 billion words from Google News.  An example vector space consisted of a set of 300 floating point values (there could be any number of these vector spaces; one is created each time a word2vec network architecture is defined (via hyperparameters) and trained).

The interesting thing about Mikolov’s 300-dimensional vector space is that words with similar syntactic and semantic meanings tend to group near one another in various dimensions of the space.  And, surprisingly, one can do simple vector math on the learned embedding vectors to explore these relationships.  For example, if you train a network to learn a set of word embeddings, and you take the vector representing the word “King”, subtract the vector for the word “Man” and add the vector for the word “Woman”, you will find that the word closest to the resulting vector in the word2vec vector space turns out to be “Queen”.  This was a rather amazing result, and to my knowledge, Mikolov’s group was the first to find this type of multi-dimensional vector encoding that could perform semantic math!  In fact, after training their network on the Google News corpus, Mikolov’s group performed a test using some 20,000 similar semantic and syntactic tests (analogies) and found that their learned embeddings placed 65% of the syntactic vectors and 34% of the semantic vectors in the “correct” location in the vector space – where “correct” meant that the embedding (vector code) for any particular target word was closer to the expected word than the million other words in the dictionary.
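
The King/Man/Woman arithmetic can be tried out on a toy example. The sketch below uses hand-made 3-dimensional vectors (roughly “royalty”, “maleness”, “colour”) that I picked purely for illustration; they are not learned embeddings and have nothing to do with Mikolov’s actual 300-dimensional vectors. Closeness is measured with cosine similarity, which is an assumption on my part about how “closest” is judged.

```python
import math

# Hypothetical, hand-picked vectors: [royalty, maleness, colour].
embeddings = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "red":   [0.0,  0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman", embeddings)
print(result)   # queen
```

With real word2vec embeddings the result vector rarely lands exactly on a word, so the nearest-neighbour search over the whole dictionary is what does the work.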

Mikolov’s group trained their network using two different methods:  Continuous Bag of Words (CBOW) and the skip-gram model.  The CBOW method tries to guess a particular word based on the four words preceding the target word and the four words following it (the window size of four words is arbitrary, another hyperparameter, but Mikolov found that a +/- 4 window size produced good results).  Positive training examples were produced from the Google News corpus; negative examples were generated from random sets of preceding and following words.  Here’s a pictorial representation of the CBOW model (from the Mikolov paper):

As you can see, the input in the CBOW model in this example consists of the embeddings for the two words preceding the target word and the two words following the target word, and the desired output is then the predicted target word.  This network is trained to produce high activations for examples from the corpus and low activations for random sequences of input words.

The skip-gram model turns this network on its head:  rather than predicting the word that would lie between a set of preceding and succeeding words, the skip-gram model takes as input a single word and predicts, as output, the most likely set of words to precede and succeed the input word, as shown in the second diagram, below:

It turns out that the skip-gram model actually does the better job at determining the “best” set of embeddings for a dictionary of words, where “best” means that the vector encodings for words in the dictionary end up being located in the 300-dimensional vector space such that words with similar syntactic and semantic meanings are located near one another in this vector space.
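
The difference between the two training setups is easiest to see in the training examples each one consumes. The sketch below only builds those examples from a tokenized sentence; it does not train a network, and the function names and the example sentence are my own, not taken from the paper.

```python
def cbow_examples(tokens, window):
    """Build (context_words, target_word) pairs: context in, target out."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        context = left + right
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Build (input_word, context_word) pairs: one word in, a neighbour out."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, neighbour) for neighbour in context)
    return pairs


sentence = "the quick brown fox jumps".split()
cbow = cbow_examples(sentence, window=2)
skip = skipgram_examples(sentence, window=2)

print(cbow[2])    # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skip[:3])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```

CBOW gets one example per position in the text, while skip-gram gets one example per (word, neighbour) pair, so the same corpus yields several times more skip-gram training examples.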


Kaggle-MNIST Walkthrough Using Keras and Python 3

I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills.  So, I thought I would report on my experiences – starting with an easy competition:  recognize digits from the MNIST training set.  This is a classic intro problem for Deep Learning and computer vision, and the competition is open through Jul 2020, so it seems like a good place to start.  Here’s a link to the Kaggle page for this “Getting Started” exercise:

I have some previous experience using Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as my core set of tools in my first introduction to Kaggle.  I’m starting off with a PC I haven’t used for Deep Learning yet – it’s a Dell 8910 with 32 GB of memory, an Nvidia 1070 video card, and Ubuntu 16.04.2 (“xenial”).  I won’t go into the details of getting the video card recognized, installing the Nvidia drivers and CNN toolkit, etc., since I went through that process a few months ago and it was so painful that I don’t really want to have to do it again.  I don’t know why this is such a difficult thing to do…you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace they would make this a seamless process.  But if your experience is anything like mine, it is not close to being seamless.  I found several different web sites that will attempt to walk you through this process, and in the end, the recommendations in this one seemed to work the best:

As an alternative, I know that Amazon AWS now has GPU-ready virtual machine environments that are ready to go, with the necessary drivers and tools (like Python, Keras, Tensorflow, etc.) pre-installed and ready to use.  I relied on an Amazon AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience.  So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point – particularly if you have any difficulties getting a desktop Ubuntu environment set up and working.  Or, start with just CPU-based training and save the GPU configuration stuff for later!

To get started, let’s begin by setting up a programming environment with the tools we’ll be using:  Python 3, Keras, etc.

Step 1:  Set Up a Programming Environment and Install Keras

The first thing I want to do is group all my work into a Python virtual environment, so all the tools I need are accessible to my project, but won’t cause version-inconsistency problems with any other projects I might have now or in the future.  For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for this Kaggle MNIST competition.

Let’s verify that the environment will be using python3:
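
One way to check, as a minimal sketch run from inside the activated environment, is to ask the interpreter itself. The variable name `is_python3` is just for illustration.

```python
import sys

# Show which interpreter the active environment resolves to, and its version.
print(sys.executable)
print("Python %d.%d.%d" % sys.version_info[:3])

# True when the environment is running some Python 3.x release.
is_python3 = sys.version_info.major == 3
print(is_python3)
```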

In response, on my machine, I see:

So, things look good so far.

The next thing I want to do is install the Keras Deep Learning Toolkit,  a GPU-aware version of TensorFlow, and some other utilities that I will need to complete this task (using pip3, since I’ve decided to use Python 3):

Step 2:  Get the MNIST Data Sets from Kaggle

The data for this competition is hosted on the Kaggle site, at the following URL:

There are two files of interest here:  train.csv and test.csv.  I clicked on each file and pressed the “Download” button on the Kaggle site to get the two files, which I moved to a “/data” folder in my kaggle-mnist programming environment.

While I was at it, I also downloaded the sample-submission.csv file, which shows the format of the file I will later need to create and submit to Kaggle to have my submission evaluated.  This is a simple file that identifies the digit for each of the 28,000 samples in the test set (test.csv).  It looks something like this:
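
As a rough sketch of that layout, the snippet below formats a list of predicted digits as submission text. It assumes the two-column header `ImageId,Label` with image ids counted from 1, which is how I understand the competition’s sample file to be laid out; the function name and the three example predictions are made up.

```python
import csv
import io


def format_submission(predicted_digits):
    """Return CSV text with one 'ImageId,Label' row per test image."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])            # assumed header row
    for image_id, digit in enumerate(predicted_digits, start=1):
        writer.writerow([image_id, int(digit)])      # ids start at 1
    return buffer.getvalue()


# Three made-up predictions; the real file has one row per test image.
text = format_submission([2, 0, 9])
print(text)
```

Writing `text` to a .csv file gives something that can be uploaded on the competition page.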

Step 3:  Examine the Dataset

The first step to take when working with any dataset is to take a look at the data, and see how it is distributed.  I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam.  The following (short) Python script will read in the datasets and display the number of training set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set.  The final few lines check to see if any of the data is null/missing:
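
A minimal sketch of this kind of check is shown here. To keep it free of extra dependencies it uses the standard csv module and a Counter rather than pandas and isnull(), so it is a stand-in for the approach described, not the notebook’s code. The column name `label` and the tiny in-memory sample are assumptions for illustration; the function name `summarize` is mine.

```python
import csv
import io
from collections import Counter


def summarize(handle, label_column="label"):
    """Return (label counts, True if any cell is empty) for a CSV stream."""
    counts = Counter()
    any_missing = False
    for row in csv.DictReader(handle):
        counts[row[label_column]] += 1
        if any(value is None or value == "" for value in row.values()):
            any_missing = True
    return counts, any_missing


# Tiny stand-in for train.csv: a label column followed by pixel columns.
sample = io.StringIO(
    "label,pixel0,pixel1\n"
    "1,0,255\n"
    "7,0,0\n"
    "1,12,0\n"
)

counts, any_missing = summarize(sample)
for digit, n in counts.most_common():      # most common digit first
    print(digit, n)
print("missing values:", any_missing)

# With the real data, something like:
#   with open("data/train.csv", newline="") as f:
#       counts, any_missing = summarize(f)
```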

Executing this python script on my PC resulted in the following output:

As you can see, the training set data is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False).  Good to know!

Step 4:  Create the CNN with Keras, Train, and Predict

Reference the following (rather long) implementation of the remaining tasks:

  1. Lines 1-65:  These are the data prep actions discussed above.
  2. Lines 66-121:  Define the CNN network, with 3 main convolution layers and a couple of fully-connected layers to a softmax output.
  3. Lines 122-136:  Configure optimizer and learning rate adjustments.
  4. Lines 137-152:  Perform some data augmentation to add variety to the training set.
  5. Lines 153-161: Train the network.
  6. Lines 162-212:  Analyze various errors when the model is applied to the validation set.
  7. Lines 213-225:  Apply the trained network to the test set and output the submission file.
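
The learning rate adjustment mentioned in item 3 can be illustrated without any deep learning library. The sketch below shows one common scheme, cutting the rate when validation accuracy stops improving, in the spirit of a reduce-on-plateau callback. The function name, the parameter values, and the accuracy history are all hypothetical; they are not the settings used in the script.

```python
def reduce_on_plateau(val_accuracies, initial_lr=0.001, factor=0.5,
                      patience=3, min_lr=0.00001):
    """Return the learning rate in effect for each epoch.

    The rate is multiplied by `factor` once validation accuracy has failed
    to improve for `patience` consecutive epochs; it never drops below
    `min_lr`.
    """
    lr = initial_lr
    best = float("-inf")
    stale_epochs = 0
    schedule = []
    for accuracy in val_accuracies:
        schedule.append(lr)                 # rate used during this epoch
        if accuracy > best:
            best = accuracy
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr = max(lr * factor, min_lr)
                stale_epochs = 0
    return schedule


# Made-up validation accuracies for six epochs.
history = [0.90, 0.95, 0.95, 0.94, 0.95, 0.96]
rates = reduce_on_plateau(history, initial_lr=0.001, factor=0.5, patience=3)
print(rates)
```

Here accuracy stalls at 0.95 for three epochs, so the sixth epoch runs at half the original rate.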

Step 5:  Submit the .csv File to Kaggle for Grading

Running the script from step 4 for 250 epochs resulted in a 99.64% accuracy on the validation set.  Submitting the resulting output file to Kaggle resulted in a 99.699% accuracy on Kaggle’s private data set, placing this submission at rank 72 out of 1879 total competitors.


Generative Adversarial Networks (GANs)

Two interesting papers were published this week demonstrating some remarkable progress in the use of GANs to produce and enhance imagery that is surprisingly realistic.  The first paper, from Nvidia (a company that produces GPUs that are often re-purposed to train deep learning networks), describes a GAN that was trained to generate images of celebrities that look very realistic, as the following image demonstrates.  These are not pictures of real people, but rather were generated by Nvidia’s implementation of a GAN network.


The second paper, by a group at the Max Planck Institute, demonstrates a machine learning technique to take a blurry image and produce a sharpened image that ends up being very similar to the original from which the blurry image was made.  The first image, below, is a low-resolution image of a bird on a branch.  The middle image is the one produced by the authors’ GAN network.  Compare the generated image to the original high-resolution image on the right.  They are, indeed, very similar.


GANs are composed of two adversarial networks:  the first network (the Generator) produces images that attempt to “fool” the second network into believing they are “real”.  The second network (the Discriminator) tries its best to discriminate between real images and those produced by the Generator.  Through multiple rounds of back-and-forth (like an arms race in AI land), the generator gets more and more capable of generating images that fool the discriminator and, as a result, generates images that appear to be very real.  Traditional GANs start with an image consisting of random noise, and iteration by iteration, the generator begins to morph these noisy images into something that shares characteristics with the dataset that the discriminator uses to distinguish between real and artificially-generated images.
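
The tug-of-war between the two networks shows up directly in their loss functions. The sketch below uses the standard cross-entropy formulation of GAN training, operating on the probabilities the discriminator assigns to a batch of images. It is a generic illustration with made-up numbers and function names; I am not claiming that either paper uses exactly these losses.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss: real images should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """The generator is rewarded when its fakes are scored as real (near 1)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Made-up discriminator outputs: probability that each image is real.
sharp = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.9, 0.8])

print(sharp < fooled)                                        # True
print(generator_loss([0.9, 0.8]) < generator_loss([0.1, 0.2]))  # True
```

The same discriminator outputs on generated images push the two losses in opposite directions, which is the arms race described above.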

The Nvidia team used this traditional method, training its GAN’s discriminator on a database of celebrity images.  The Max Planck team, by contrast, started with a blurry version of a sample from the discriminator training set; the generator, using this blurry image as the seed, iteratively refined the image until it was virtually indistinguishable from the original.  It is conceivable that a large training set of domain-specific imagery might be capable of similarly improving the resolution of new images that were not in the original training set, but this is a topic for future research.

Applications of this technology might include, for instance, the processing of old imagery or movies shot in low resolution and re-issuing them to today’s audience as a high-def facsimile.  Or, adding capabilities to tools such as Adobe Photoshop that photographers could use to improve the quality of their own images.
