Keras Cheat Sheet

If you use Keras (if you don’t, you should!) in your deep learning projects, there is a nice cheat sheet that is worth printing out and keeping near your computer. It summarizes many of the key things you will want to do to create and update a Keras project.

 

Here’s a link to the full pdf:

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf

Posted in Uncategorized | Leave a comment

word2vec Embeddings

What is an “embedding” for a dictionary of words?  That is the question this article addresses.

When one uses neural networks for linguistic processing, one of the first things to consider is how a word should be represented. In other words, every word needs to have a numeric representation in order for the network to be able to perform any processing on it. “Embedding” is the term used for assigning a number, or a vector (set of numbers), to represent a particular word from a dictionary. An obvious but, as it turns out, not very practical embedding would be the traditional one-hot code: take the dictionary of all possible words in the vocabulary, define a vector matching the length of the dictionary, and set the element of the vector matching the position of a word to 1 and all others to zero. For example, if the dictionary of all words you will work with in your network is simply [‘man’, ‘woman’, ‘red’, ‘blue’, ‘green’], then your vector would have 5 dimensions, the word ‘man’ would be represented by the vector [1, 0, 0, 0, 0], ‘woman’ would be assigned the vector [0, 1, 0, 0, 0], etc. There are several problems with this choice of embedding:

  1.  The size of each vector is defined by the size of the dictionary.  For typical “real-world” dictionaries, each word would be represented by a vector that might be more than a million numbers in length.
  2. The vectors themselves are sparse (all but one of the 1,000,000+ numbers in any particular vector have a value of 0.0).
  3. Words embedded in this vector space have no particular relationship to one another based on their location in the vector space.
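To make the one-hot idea concrete, here is a minimal sketch using the five-word dictionary from the example above. The helper name one_hot is just an illustration, not part of any library. The last two lines illustrate problem 3: every pair of distinct words ends up exactly the same distance apart, so position in the space says nothing about meaning.

```python
import numpy as np

# The five-word dictionary from the example above.
vocab = ['man', 'woman', 'red', 'blue', 'green']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot('man'))    # [1 0 0 0 0]
print(one_hot('woman'))  # [0 1 0 0 0]

# Every pair of distinct words is equally far apart (sqrt(2)).
print(np.linalg.norm(one_hot('man') - one_hot('woman')))
print(np.linalg.norm(one_hot('man') - one_hot('green')))
```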

So, the question arises:  Can a vector space be found that can be used to embed (represent) words from a dictionary such that the vectors themselves are small (say, only 100-300 elements, rather than 1,000,000+ elements for a one-hot encoding) and would have the desirable feature that words that are near one another in this multi-dimensional space have similar meanings?  The nice thing about the last of these desired attributes is that rather than representing a word by an arbitrary number or vector, the chosen vectors (one for each word) would somehow represent the syntactic or semantic characteristics of each word in the dictionary.

Rather than using the simple one-hot encoding, Tomas Mikolov in 2013 (working at Google at the time) came up with the idea of training a simple neural network (one whose hidden layer is linear – no sigmoid, relu, or other non-linear hidden activations) to learn embeddings for words based on the context in which they are used. Mikolov’s word2vec paper is titled “Efficient Estimation of Word Representations in Vector Space” (arXiv:1301.3781; it’s a very readable paper!). Mikolov trained a network using a dictionary of the million most-frequently-used words in a training corpus consisting of the 6 billion words from Google News. An example vector space consisted of 300 floating point values per word (there could be any number of these vector spaces; one is created each time a word2vec network architecture is defined (via hyperparameters) and trained).

The interesting thing about Mikolov’s 300-dimensional vector space is that words with similar syntactic and semantic meanings tend to group near one another in the space. And, surprisingly, one can do simple vector math on the learned embedding vectors to explore these relationships. For example, if you train a network to learn a set of word embeddings, and you take the vector representing the word “King”, subtract the vector for the word “Man” and add the vector for the word “Woman”, you will find that the word closest to the resulting vector in the word2vec vector space turns out to be “Queen”. This was a rather amazing result, and to my knowledge, Mikolov’s group was the first to find this type of multi-dimensional vector encoding that could perform semantic math! In fact, after training their network on the Google News corpus, Mikolov’s group ran some 20,000 semantic and syntactic tests (analogies) and found that their learned embeddings placed 65% of the syntactic vectors and 34% of the semantic vectors in the “correct” location in the vector space – where “correct” meant that the vector computed for a particular analogy was closer to the expected word than to any of the other words in the million-word dictionary.
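The vector arithmetic can be sketched with a few hand-made toy vectors. To be clear, these 2-D values are invented for illustration (one axis loosely standing for “royalty”, the other for “gender”) and are not learned word2vec embeddings; the function names are also my own. The search excludes the three input words, which is the usual convention for analogy tests.

```python
import numpy as np

# Hand-made 2-D toy vectors (NOT learned embeddings):
# axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy('king', 'man', 'woman', toy))  # queen
```

With real embeddings the result is rarely an exact match as it is here; the answer is simply whichever dictionary word lies nearest to the computed vector.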

Mikolov’s group trained their network using two different methods: Continuous Bag of Words (CBOW) and the skip-gram model. The CBOW method tries to guess a particular word based on the four words preceding the target word and the four words following it (the window size of 4 words is arbitrary, another hyperparameter, but Mikolov found that a +/- 4 window size produced good results). Positive training examples were produced from the Google News corpus; negative examples were generated from random sets of preceding and following words. Here’s a pictorial representation of the CBOW model (from the Mikolov paper):

As you can see, the input in the CBOW model in this example consists of the embeddings for the two words preceding the target word and the two words following the target word, and the desired output is then the predicted target word.  This network is trained to produce high activations for examples from the corpus and low activations for random sequences of input words.

The skip-gram model turns this network on its head:  rather than predicting the word that would lie between a set of preceding and succeeding words, the skip-gram model takes as input a single word and predicts, as output, the most likely set of words to precede and succeed the input word, as shown in the second diagram, below:

It turns out that the skip-gram model actually does the better job at determining the “best” set of embeddings for a dictionary of words, where “best” means that the vector encodings for words in the dictionary end up being located in the 300-dimensional vector space such that words with similar syntactic and semantic meanings are located near one another in this vector space.
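The difference between the two training setups comes down to how (input, label) examples are cut out of the text. Here is a small sketch of that step only (no network, no training). The function names and the sample sentence are my own, and the window of 2 matches the diagrams rather than the +/- 4 window mentioned above.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the target word is the label.
    return [(context, target)
            for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the target word is the input, each context word is a label.
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

sentence = "the quick brown fox jumps".split()
print(cbow_examples(sentence)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples(sentence)[:2])
# [('the', 'quick'), ('the', 'brown')]
```

Note that skip-gram produces several training pairs per word position, while CBOW produces one.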


Kaggle-MNIST Walkthrough Using Keras and Python 3

I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills. So, I thought I would report on my experiences, starting with an easy competition: recognizing digits from the MNIST training set. This is a classic intro problem for Deep Learning and computer vision, and the competition is open through Jul 2020, so it seems like a good place to start. Here’s a link to the Kaggle page for this “Getting Started” exercise:  https://www.kaggle.com/c/digit-recognizer

I have some previous experience using Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as the core set of tools in my first introduction to Kaggle. I’m starting off with a PC I haven’t used for Deep Learning yet – it’s a Dell 8910 with 32 GB of memory, an Nvidia 1070 video card, and Ubuntu 16.04.2 (“xenial”). I won’t go into the details of getting the video card recognized, installing the Nvidia drivers and CNN toolkit, etc., since I went through that process a few months ago and it was so painful that I don’t really want to have to do it again. I don’t know why this is such a difficult thing to do…you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace they would make this a seamless process. But if your experience is anything like mine, it is not close to being seamless. I found several different web sites that attempt to walk you through this process, and in the end, the recommendations in this one seemed to work the best:

https://gist.github.com/ksopyla/813a62d6afc4307755e5832a3b62f432

As an alternative, I know that Amazon AWS now has GPU-ready virtual machine environments that are ready to go, with the necessary drivers and tools (like Python, Keras, Tensorflow, etc.) pre-installed and ready to use.  I relied on an Amazon AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience.  So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point – particularly if you have any difficulties getting a desktop Ubuntu environment set up and working.  Or, start with just CPU-based training and save the GPU configuration stuff for later!

To get started, let’s begin by setting up a programming environment with the tools we’ll be using:  Python 3, Keras, etc.

Step 1:  Set Up a Programming Environment and Install Keras

The first thing I want to do is group all my work into a python virtual environment, so  all the tools I need are accessible to my project, but won’t cause version-inconsistency problems with any other projects I might have now or in the future.  For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for this Kaggle MNIST competition.

$ sudo apt install virtualenv
$ virtualenv -p /usr/bin/python3 kaggle-mnist
$ source kaggle-mnist/bin/activate

Let’s verify that the environment will be using python3:

$ python

In response, on my machine, I see:

Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

So, things look good so far.

The next thing I want to do is install the Keras Deep Learning Toolkit,  a GPU-aware version of TensorFlow, and some other utilities that I will need to complete this task (using pip3, since I’ve decided to use Python 3):

$ pip3 install keras
$ pip3 install tensorflow-gpu
$ pip3 install tensorflow
$ pip3 install pandas
$ pip3 install scikit-learn
$ pip3 install matplotlib
$ pip3 install seaborn
$ sudo apt-get install python3-tk
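Before moving on, it can be worth confirming that everything actually landed in the virtualenv. The following is a small sketch of one way to do that using only the standard library; the REQUIRED list and the check_packages helper are my own names, and the list simply mirrors the install commands above (note that scikit-learn is imported under the name sklearn, and python3-tk under the name tkinter).

```python
import importlib.util

# Import names of the packages installed above.
# Note: the scikit-learn package is imported as "sklearn",
# and python3-tk provides the "tkinter" module.
REQUIRED = ["keras", "tensorflow", "pandas", "sklearn",
            "matplotlib", "seaborn", "tkinter"]

def check_packages(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in sorted(check_packages(REQUIRED).items()):
        print("{:12s} {}".format(name, "ok" if found else "MISSING"))
```

This only checks that each module can be located; it does not verify that TensorFlow can see the GPU.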

Step 2:  Get the MNIST Data Sets from Kaggle

The data for this competition is hosted on the Kaggle site, at the following URL:

https://www.kaggle.com/c/digit-recognizer/data

There are two files of interest here:  train.csv and test.csv.  I clicked on each file and pressed the “Download” button on the Kaggle site to get the two files, which I moved to a “data” folder in my kaggle-mnist programming environment.

While I was at it, I also downloaded the sample-submission.csv file, which shows the format of the file I will later need to create and submit to Kaggle to have my submission evaluated.  This is a simple file that identifies the digit for each of the 28,000 samples in the test set (test.csv).  It looks something like this:

ImageId,Label
1,3
2,7
3,8
(27997 more lines)
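As a sketch of that format, the following writes a list of predicted labels in the ImageId,Label layout using only the standard library. The write_submission helper is a hypothetical name of my own, and the three labels are made up to match the rows shown above; a real submission would have 28,000 rows.

```python
import csv
import io

def write_submission(labels, out):
    """Write predicted labels in the ImageId,Label format; ImageId starts at 1."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])
    for image_id, label in enumerate(labels, start=1):
        writer.writerow([image_id, label])

# Three made-up predictions, matching the first rows shown above.
buffer = io.StringIO()
write_submission([3, 7, 8], buffer)
print(buffer.getvalue())
```

Passing an open file object instead of the StringIO buffer would write the same content to disk.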

Step 3:  Examine the Dataset

The first step to take when working with any dataset is to take a look at the data and see how it is distributed.  I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam.  The following (short) python script will read in the datasets and display the number of training set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set.  The final few lines check to see if any of the data is null/missing:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(x=Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())

Executing this python script on my PC resulted in the following output:

python kaggle_mnist.py
Using TensorFlow backend.
2017-11-13 14:35:53.915831: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
1 4684
7 4401
3 4351
9 4188
2 4177
6 4137
0 4132
4 4072
8 4063
5 3795
Name: label, dtype: int64
Checking for missing values in the training set:
count 784
unique 1
top False
freq 784
dtype: object
and, in the test set:
count 784
unique 1
top False
freq 784
dtype: object

As you can see, the training set data is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False).  Good to know!
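To put a number on “fairly evenly”, here is a quick calculation using the counts copied from the output above. It needs nothing beyond plain Python; the variable names are my own.

```python
# Per-digit counts copied from the script output above.
counts = {1: 4684, 7: 4401, 3: 4351, 9: 4188, 2: 4177,
          6: 4137, 0: 4132, 4: 4072, 8: 4063, 5: 3795}

total = sum(counts.values())
shares = {digit: n / total for digit, n in counts.items()}
imbalance = max(counts.values()) / min(counts.values())

print(total)                 # 42000
for digit in sorted(shares):
    print(digit, "{:.1%}".format(shares[digit]))
print(round(imbalance, 2))   # 1.23
```

So the training set has 42,000 images, and the most common digit (1) appears about 1.23 times as often as the least common (5).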

Step 4:  Create the CNN with Keras, Train, and Predict

Refer to the following (rather long) implementation of the remaining tasks:

  1. Lines 1-65:  These are the data prep actions discussed above.
  2. Lines 66-121:  Define the CNN network, with 3 main convolution layers and a couple of fully-connected layers leading to a softmax output.
  3. Lines 122-136:  Configure optimizer and learning rate adjustments.
  4. Lines 137-152:  Perform some data augmentation to add variety to the training set.
  5. Lines 153-161: Train the network.
  6. Lines 162-212:  Analyze various errors when the model is applied to the validation set.
  7. Lines 213-225:  Apply the trained network to the test set and output the submission file.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(x=Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())
# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# Reshape image in 3 dimensions (height = 28px, width = 28px , channels = 1)
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)
# Encode labels to one hot vectors (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)
# Set the random seed
random_seed = 2
# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)
# Some examples
g = plt.imshow(X_train[0][:,:,0])
#=====================================================================
# Set the CNN model
# my CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*3 -> Flatten -> [Dense -> Dropout]*2 -> Out
model = Sequential()
# Layer 1: 32 3x3 convolutions (x2)
model.add(Conv2D(filters = 32,
kernel_size = (3,3),
padding = 'Same',
activation ='relu',
input_shape = (28,28,1)))
model.add(Conv2D(filters = 32,
kernel_size = (3,3),
padding = 'Same',
activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.20))
# Layer 2 CNN 64 3x3 convolutions (X2)
model.add(Conv2D(filters = 64,
kernel_size = (3,3),
padding = 'Same',
activation ='relu'))
model.add(Conv2D(filters = 64,
kernel_size = (3,3),
padding = 'Same',
activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 3 CNN
model.add(Conv2D(filters = 128,
kernel_size = (3,3),
padding = 'Same',
activation ='relu'))
model.add(Conv2D(filters = 128,
kernel_size = (3,3),
padding = 'Same',
activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 4: FC, Softmax output
model.add(Flatten())
model.add(Dense(512, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(10, activation = "softmax"))
# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor = 'val_acc',
patience = 3,
verbose = 1,
factor = 0.8,
min_lr = 0.000001)
epochs = 250
batch_size = 512
datagen = ImageDataGenerator(
featurewise_center = False, # set input mean to 0 over the dataset
samplewise_center = False, # set each sample mean to 0
featurewise_std_normalization = False, # divide inputs by std of the dataset
samplewise_std_normalization = False, # divide each input by its std
zca_whitening = False, # apply ZCA whitening
rotation_range = 12, # randomly rotate images in the range (degrees, 0 to 180)
zoom_range = 0.10, # Randomly zoom image
width_shift_range = 0.15, # randomly shift images horizontally (fraction of total width)
height_shift_range = 0.15, # randomly shift images vertically (fraction of total height)
horizontal_flip = False, # randomly flip images
vertical_flip = False) # randomly flip images
datagen.fit(X_train)
# Fit the model
history = model.fit_generator(datagen.flow(X_train,Y_train, batch_size=batch_size),
epochs = epochs,
validation_data = (X_val,Y_val),
verbose = 1,
steps_per_epoch = X_train.shape[0] // batch_size,
callbacks = [learning_rate_reduction])
#==========================================================================================
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1)
# Convert one-hot validation labels back to class indices
Y_true = np.argmax(Y_val,axis = 1)
# Display some error results
# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)
Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]
def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1
# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)
# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
# Indices that sort the delta prob errors in ascending order
sorted_delta_errors = np.argsort(delta_pred_true_errors)
# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]
# Show the top 6 errors
#display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
# ================================================================================================================
# predict results
results = model.predict(test)
# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)
submission.to_csv("my_kaggle_mnist_submission.csv",index=False)
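It can help to trace how the image size changes as it moves through the three convolution blocks in the script above. The following is a back-of-the-envelope sketch in plain Python (no Keras needed); the helper names are my own, and it assumes the layer settings exactly as written in the script: 'Same' padding on every Conv2D and 2x2 max pooling with stride 2.

```python
def same_conv(size):
    # 'Same' padding with stride 1 keeps the spatial size unchanged.
    return size

def max_pool(size, pool=2, stride=2):
    # Pooling without padding: floor((size - pool) / stride) + 1.
    return (size - pool) // stride + 1

size = 28
for filters in (32, 64, 128):
    size = same_conv(same_conv(size))   # two Conv2D layers per block
    size = max_pool(size)               # one MaxPool2D per block
    print("after block with {} filters: {}x{}x{}".format(
        filters, size, size, filters))

flattened = size * size * 128
print(flattened)  # 1152
```

So the 28x28 input shrinks to 14x14, then 7x7, then 3x3, and the Flatten layer hands 1,152 values to the first Dense layer. Calling model.summary() in Keras should report the same shapes.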

Step 5:  Submit the .csv File to Kaggle for Grading

Running the script from step 4 for 250 epochs resulted in a 99.64% accuracy on the validation set.  Submitting the resulting output file to Kaggle resulted in a 99.699% accuracy on the Kaggle-private data set, placing this submission at rank 72 out of 1879 total competitors.


Generative Adversarial Networks (GANs)

Two interesting papers were published this week demonstrating some remarkable progress in the use of GANs to produce and enhance imagery that is surprisingly realistic.  The first paper (http://research.nvidia.com/publication/2017-10_Progressive-Growing-of) comes from Nvidia, a company that produces GPUs that are often re-purposed to train deep learning networks.  Its network was trained to generate images of celebrities that look very realistic, as the following image demonstrates.  These are not pictures of real people, but rather were generated by Nvidia’s implementation of a GAN.

[Image: celebrity faces generated by Nvidia’s GAN]

The second paper, by a group at the Max Planck Institute (http://www.is.mpg.de/16376353/EnhanceNET-PAT-Mehdi), demonstrates a machine learning technique to take a blurry image and produce a sharpened image that ends up being very similar to the original from which the blurry image was made.  The first image, below, is a low-resolution image of a bird on a branch.  The middle image is the one produced by the authors’ GAN.  Compare the generated image to the original high-resolution image on the right.  They are, indeed, very similar.

[Image: low-resolution bird image, GAN-generated image, and original high-resolution image]

GANs are composed of two adversarial networks.  The first network (the Generator) produces images that attempt to “fool” the second network into believing they are “real”.  The second network (the Discriminator) tries its best to discriminate between real images and those produced by the Generator.  Through multiple rounds of back-and-forth (like an arms race in AI land), the generator gets more and more capable of generating images that fool the discriminator and, as a result, generates images that appear to be very real.  Traditional GANs start with random noise as input, and iteration by iteration, the generator learns to morph this noise into images that share characteristics with the dataset that the discriminator uses to distinguish between real and artificially-generated images.
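The “arms race” can be expressed as two loss functions that pull in opposite directions. The sketch below uses the standard binary cross-entropy GAN losses (with the commonly used non-saturating generator loss); this is an assumption made for illustration and is not necessarily the loss used in either of the papers above. The discriminator outputs are made-up numbers, each standing for the probability the discriminator assigns to an image being real.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated images 0."""
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored close to 1."""
    return float(-np.mean(np.log(d_fake)))

# Made-up discriminator outputs (probability that an image is real).
confident = discriminator_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
fooled = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

print(confident < fooled)   # True: a fooled discriminator has a higher loss
print(generator_loss(np.array([0.1, 0.2])) >
      generator_loss(np.array([0.5, 0.5])))   # True: fooling it lowers the generator loss
```

Training alternates between updating the discriminator to lower its loss and updating the generator to lower its own, which is what drives the back-and-forth described above.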

The Nvidia team used this traditional method, training its GAN’s discriminator on a database of celebrity images.  The Max Planck team, by contrast, started with a blurry version of a sample from the discriminator training set, and the generator, using this blurry image as the seed, iteratively refined the image until it was virtually indistinguishable from the original.  It is conceivable that a large training set of domain-specific imagery might be capable of similarly improving the resolution of new images that were not in the original training set, but this is a topic for future research.

Applications of this technology might include, for instance, processing old imagery or movies shot in low resolution and re-issuing them to today’s audience as a high-def facsimile, or adding capabilities to tools such as Adobe Photoshop that photographers could use to improve the quality of their own images.


AlphaGo Zero: A Stunning New Breakthrough at DeepMind

Image: DeepMind

A paper published in Nature this week documents a major advance in deep learning. DeepMind, the Alphabet/Google deep learning group, reports that they have built a new version of their Go-playing AI program that represents a major improvement over the previous state of the art.

The earlier version surprised the AI community and the Go-playing world last year by demonstrating that a computer was capable of beating the best human Go players in the world – a feat that many thought the world would not see for another decade or more.  Lee Sedol, the human grandmaster who was defeated 4 games to 1 in the 2016 match, was surprised at the beauty and depth of mastery displayed by the original AlphaGo program.  That initial version was trained by feeding it millions of positions from some 160,000 games played by humans and having the neural network find from these examples the basic features that led to winning moves.

The new breakthrough, though, is even more impressive.  Rather than use examples from human play as the initial knowledge base, the DeepMind team started from scratch and the only knowledge given to the computer was the basic rules of the game.  From these simple rules, the deep learning network was given the freedom to start playing games against itself and learn on its own what worked and what didn’t.  At the end of the first three hours of training, the program was playing like a typical beginner, greedily capturing stones at every opportunity with no sense of any long-term strategy.

After only 19 hours of training, the program had advanced well beyond the skills of typical beginners and was displaying the sense that it was mastering several typical strategies known to experienced human Go players.  The real surprise came after 70 hours of training, though, when the program started displaying super-human performance.  In fact, after only 3 days of training, the program was already exceeding the abilities of the original AlphaGo that beat Lee Sedol.  Three weeks later, it had learned enough to be the best Go player in the world, human or computer.  In fact, in a match against the original AlphaGo, AlphaGo Zero beat the original 100 games to 0.  A stunning result.

The most amazing part of all this, IMO, is that the new AlphaGo Zero gained this amount of knowledge in a very short time from just the very basic rules of the game.  The other interesting fact is that the original program was run on a network of 48 Google Tensor Processing Units (TPUs), while the new AlphaGo Zero learned to play at super-human level on a much smaller 4-TPU system.   The techniques used to achieve these results have immediate application in other domains, such as protein-folding for drug discovery, medical diagnostics, investment advisers, etc.  The rapid advancements displayed by AlphaGo Zero are in line with the exponential march to the Singularity, when computers will out-match humans in every domain.


Convolution Filters and Feature Maps

I’ve always been a little confused about how the size of convolution filters, input image sizes, stride, padding, etc. relate to the final size of feature maps in a Convolutional Neural Network (CNN).  So, here are some notes that I’ve gathered that help explain things a little:

The following diagram, from the “Caffe in a Day Tutorial” (https://docs.google.com/presentation/d/1HxGdeq8MPktHaPb-rlmYYQ723iWzq9ur6Gjo71YiG0Y/edit#slide=id.gc2fcdcce7_216_0), provides a good overview of what happens when a single convolution filter (or kernel) is applied to an input image.

[Image: Caffe in a Day Tutorial]

In this example, the input image is 32×32 pixels with 3 separate RGB color channels and the filter is a 5×5 kernel.  In fact, it is actually 3 separate 5×5 kernels, one for each of the 3 color channels.  Each of these 3 separate 5×5 kernels is independent of the others and is learned during training.  The filter slides over the input image, and the value of each pixel under the filter is multiplied by the corresponding kernel value.  The results of the individual multiplications (3x5x5 = 75 in all) are added together, and the resulting sum produces a single pixel in the feature map.  The first (top-left corner) feature of the feature map is produced from the sum of the 75 multiplications when the filter sits over the top-left corner of the input image.  After this first feature is calculated, the filter is shifted to the right one pixel (or, more precisely, the number of pixels represented by the “stride”) and the process is repeated to obtain the 2nd feature.  This is repeated, sliding to the right until the filter reaches the right edge, sliding down one pixel to the next row, and so on until the filter sits at the bottom-right corner of the input image and the bottom-right feature of the feature map has been calculated.  In this particular example, the “stride” was set to one and no padding was used, so the feature map is 28×28.  In practice, the input image is generally padded with additional 0-value pixels so that the feature map ends up being the same size as the input image.  For a 3×3 kernel, a padding of 1 (on both sides, top, and bottom) would result in a feature map matching the input image size.  For a 5×5 kernel, a padding of 2 pixels on each side/top/bottom would produce the correct size feature map.
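The sliding-window procedure described above can be written out directly with a pair of loops. This is a deliberately naive sketch for illustration (real frameworks use much faster implementations); the function name conv2d_single is my own, and the image and kernel are random stand-ins rather than real data or learned weights.

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0, stride=1):
    """Slide one (C, k, k) kernel over a (C, H, W) image with no padding."""
    channels, height, width = image.shape
    _, k, _ = kernel.shape
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            r, c = row * stride, col * stride
            patch = image[:, r:r + k, c:c + k]
            # For a 3x5x5 kernel this is 75 multiplications,
            # summed into a single feature-map value.
            feature_map[row, col] = np.sum(patch * kernel) + bias
    return feature_map

rng = np.random.RandomState(0)
image = rng.rand(3, 32, 32)     # random stand-in for an RGB image
kernel = rng.rand(3, 5, 5)      # random stand-in for learned weights

print(conv2d_single(image, kernel).shape)  # (28, 28)

# Zero-padding by 2 pixels on every side keeps the output at 32x32.
padded = np.pad(image, ((0, 0), (2, 2), (2, 2)), mode="constant")
print(conv2d_single(padded, kernel).shape)  # (32, 32)
```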

Note, in particular, that the same set of weights and bias value are used as this filter scans across the image, resulting in a much-reduced number of network parameters compared to a Fully Connected (FC) neural network layer.  In this particular example, the 5×5 kernel (3x5x5, actually, when the 3 separate filters, one for each RGB channel, are considered) results in 75 (76 counting the bias term) unique weights and biases to be learned, compared to 3,072 (3,073 counting the bias term) for each output unit of a fully-connected layer.  The difference in number of parameters is even more dramatic with larger input image sizes, since the number of parameters per unit in a fully-connected layer is 3xHxW (where H and W are the height and width of the input imagery), while the number of parameters in a convolutional layer is proportional only to the size of the filter and is independent of the input image size.  So, for example, a 3x256x256 image from the ImageNet data set would require nearly 200,000 FC parameters per output unit to be learned, while the convolution filter for this same image would still need only 76 parameters.
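The parameter counts above are simple enough to check by hand, but here is the arithmetic as a short sketch. The two helper functions are my own names; the fully-connected count is for a single output unit connected to every input value.

```python
def conv_params(channels, kernel_size, filters=1):
    # Each filter has channels * k * k weights plus one bias.
    return filters * (channels * kernel_size * kernel_size + 1)

def fc_params_per_unit(channels, height, width):
    # One weight per input value plus one bias, for a single output unit.
    return channels * height * width + 1

print(conv_params(3, 5))                # 76
print(fc_params_per_unit(3, 32, 32))    # 3073
print(fc_params_per_unit(3, 256, 256))  # 196609
```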

Rather than a single set of kernels in a layer, though, CNNs generally provide more modeling “capacity” by including numerous separate kernels that can be learned by the network.  Each separate kernel produces its own feature map, and these are generally shown as a 3D volume of feature maps, where the depth of the feature map volume indicates the number of filters to be learned in that layer of the network.  The next diagram, from the same “Caffe in a Day Tutorial”, illustrates this for a particular example where 6 sets of 3x5x5 kernels (18 kernels total) produce a set of 6 feature maps.

[Image: Caffe in a Day Tutorial (1)]
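To see how several filters stack into a feature-map volume, here is a naive sketch that applies 6 filters to one image and collects the results along a new depth axis. As before, the function name is my own and the image and weights are random stand-ins; the point is only the shapes and the parameter count.

```python
import numpy as np

def conv2d_layer(image, kernels, biases):
    """Apply F kernels of shape (C, k, k) to a (C, H, W) image; stride 1, no padding."""
    num_filters, _, k, _ = kernels.shape
    _, height, width = image.shape
    out = np.zeros((num_filters, height - k + 1, width - k + 1))
    for f in range(num_filters):
        for row in range(out.shape[1]):
            for col in range(out.shape[2]):
                patch = image[:, row:row + k, col:col + k]
                out[f, row, col] = np.sum(patch * kernels[f]) + biases[f]
    return out

rng = np.random.RandomState(0)
image = rng.rand(3, 32, 32)        # random stand-in for an RGB image
kernels = rng.rand(6, 3, 5, 5)     # 6 filters, each a set of three 5x5 kernels
biases = np.zeros(6)

feature_maps = conv2d_layer(image, kernels, biases)
print(feature_maps.shape)            # (6, 28, 28)
print(kernels.size + biases.size)    # 456
```

The layer has 6 x 76 = 456 learnable parameters, still far fewer than a single fully-connected unit on the same image.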

Finally, note that choosing a stride other than 1 would reduce the size of the feature maps.  For example, a stride of 2 (sliding the 5×5 filter two pixels right (or down) after each convolution is computed) would result in feature maps that are 14×14 in size, rather than 28×28 with a stride of 1, thus reducing the computational requirements on subsequent layers of the network.
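All of the sizes mentioned in these notes follow from one formula: output size = floor((n + 2p - k) / s) + 1, where n is the input width (or height), k the kernel size, p the padding on each side, and s the stride. A small sketch (the function name is my own):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Feature-map width (or height): floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, 5))               # 28  (no padding, stride 1)
print(conv_output_size(32, 5, padding=2))    # 32  (5x5 kernel, padding 2)
print(conv_output_size(32, 3, padding=1))    # 32  (3x3 kernel, padding 1)
print(conv_output_size(32, 5, stride=2))     # 14  (no padding, stride 2)
```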
