Kaggle-MNIST Walkthrough Using Keras and Python 3

I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills. So, I thought I would report on my experiences, starting with an easy competition: recognizing digits from the MNIST training set. This is a classic intro problem for Deep Learning and computer vision, and the competition is open through July 2020, so it seems like a good place to start. Here’s a link to the Kaggle page for this “Getting Started” exercise: https://www.kaggle.com/c/digit-recognizer

I have some previous experience using Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as the core set of tools for my first introduction to Kaggle. I’m starting off with a PC I haven’t used for Deep Learning yet: a Dell 8910 with 32 GB of memory, an Nvidia 1070 video card, and Ubuntu 16.04.2 (“xenial”). I won’t go into the details of getting the video card recognized, installing the Nvidia drivers, the CUDA/cuDNN toolkits, etc., since I went through that process a few months ago and it was painful enough that I don’t want to repeat it. I don’t know why this is such a difficult thing to do; you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace they would make this a seamless process. But if your experience is anything like mine, it is not close to seamless. I found several web sites that attempt to walk you through the process, and in the end the recommendations in this one worked best for me:

https://gist.github.com/ksopyla/813a62d6afc4307755e5832a3b62f432

As an alternative, Amazon AWS now offers GPU-ready virtual machine environments with the necessary drivers and tools (Python, Keras, TensorFlow, etc.) pre-installed. I relied on an AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience. So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point, particularly if you have any difficulties getting a desktop Ubuntu environment set up and working. Or, start with just CPU-based training and save the GPU configuration for later!

To get started, let’s set up a programming environment with the tools we’ll be using: Python 3, Keras, etc.

Step 1:  Set Up a Programming Environment and Install Keras

The first thing I want to do is group all my work into a Python virtual environment, so all the tools I need are accessible to my project but won’t cause version-inconsistency problems with any other projects I might have now or in the future. For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for this Kaggle MNIST competition.

$ sudo apt install virtualenv
$ virtualenv -p /usr/bin/python3 kaggle-mnist
$ source kaggle-mnist/bin/activate

Let’s verify that the environment will be using python3:

$ python

In response, on my machine, I see:

Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

So, things look good so far.

The next thing I want to do is install the Keras deep learning library, a GPU-aware version of TensorFlow, and some other utilities I’ll need for this task (using pip3, since I’ve decided to use Python 3):

$ pip3 install keras
$ pip3 install tensorflow-gpu
$ pip3 install tensorflow       # probably redundant; the CPU-only package can shadow the GPU build, so consider skipping this
$ pip3 install pandas
$ pip3 install sklearn
$ pip3 install matplotlib
$ pip3 install seaborn
$ sudo apt-get install python3-tk
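
Before moving on, it’s worth a quick check that the GPU build of TensorFlow actually sees the video card. Here’s one minimal way to do it from the Python prompt, using TensorFlow’s device_lib utility (the exact device names printed vary by TensorFlow version); the list should include a GPU entry alongside the CPU:

$ python
>>> from tensorflow.python.client import device_lib
>>> [d.name for d in device_lib.list_local_devices()]

If only a CPU device shows up, TensorFlow isn’t seeing the GPU, and it’s back to the driver-installation notes linked above.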

Step 2:  Get the MNIST Data Sets from Kaggle

The data for this competition is hosted on the Kaggle site, at the following URL:

https://www.kaggle.com/c/digit-recognizer/data

There are two files of interest here: train.csv and test.csv. I clicked on each file and pressed the “Download” button on the Kaggle site, then moved the two files to a “data” folder in my kaggle-mnist programming environment.

While I was at it, I also downloaded the sample_submission.csv file, which shows the format of the file I will later need to create and submit to Kaggle to have my entry evaluated. This is a simple file that identifies the predicted digit for each of the 28,000 samples in the test set (test.csv). It looks something like this:

ImageId,Label
1,3
2,7
3,8
(27997 more lines)

Step 3:  Examine the Dataset

The first step when working with any dataset is to take a look at the data and see how it is distributed. I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam. This (short) Python script reads in the datasets and displays the number of training-set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set. The final few lines check to see if any of the data is null/missing:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())

Executing this Python script on my PC resulted in the following output:

$ python kaggle_mnist.py
Using TensorFlow backend.
2017-11-13 14:35:53.915831: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
1 4684
7 4401
3 4351
9 4188
2 4177
6 4137
0 4132
4 4072
8 4063
5 3795
Name: label, dtype: int64
Checking for missing values in the training set:
count 784
unique 1
top False
freq 784
dtype: object
and, in the test set:
count 784
unique 1
top False
freq 784
dtype: object

As you can see, the training set is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False). Good to know!

Step 4:  Create the CNN with Keras, Train, and Predict

The following (rather long) script implements the remaining tasks (the line-number ranges below refer to the script as originally hosted):

  1. Lines 1-65:  The data prep steps discussed above.
  2. Lines 66-121:  Define the CNN, with 3 main convolution blocks and a couple of fully-connected layers feeding a softmax output.
  3. Lines 122-136:  Configure the optimizer and learning-rate adjustments.
  4. Lines 137-152:  Perform some data augmentation to add variety to the training set.
  5. Lines 153-161:  Train the network.
  6. Lines 162-212:  Analyze the errors the model makes on the validation set.
  7. Lines 213-225:  Apply the trained network to the test set and write out the submission file.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

sns.set(style='white', context='notebook', palette='deep')

# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]

# Drop 'label' column
X_train = train.drop(labels=["label"], axis=1)

# free some space
del train

g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())

# Normalize the data
X_train = X_train / 255.0
test = test / 255.0

# Reshape image in 3 dimensions (height = 28px, width = 28px, channels = 1)
X_train = X_train.values.reshape(-1, 28, 28, 1)
test = test.values.reshape(-1, 28, 28, 1)

# Encode labels to one hot vectors (ex: 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes=10)

# Set the random seed
random_seed = 2

# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.1, random_state=random_seed)

# Some examples
g = plt.imshow(X_train[0][:, :, 0])

# =====================================================================
# Set the CNN model
# my CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out
model = Sequential()

# Layer 1: 32 3x3 convolutions (x2)
model.add(Conv2D(filters=32, kernel_size=(3, 3), padding='Same',
                 activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(filters=32, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.20))

# Layer 2: 64 3x3 convolutions (x2)
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Dropout(0.20))

# Layer 3: 128 3x3 convolutions (x2)
model.add(Conv2D(filters=128, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(Conv2D(filters=128, kernel_size=(3, 3), padding='Same',
                 activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Dropout(0.20))

# Layer 4: fully-connected layers, softmax output
model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.25))
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.25))
model.add(Dense(10, activation="softmax"))

# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# Compile the model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc',
                                            patience=3,
                                            verbose=1,
                                            factor=0.8,
                                            min_lr=0.000001)

epochs = 250
batch_size = 512

# Data augmentation
datagen = ImageDataGenerator(
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=12,                    # randomly rotate images (degrees)
    zoom_range=0.10,                      # randomly zoom images
    width_shift_range=0.15,               # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.15,              # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images
    vertical_flip=False)                  # randomly flip images
datagen.fit(X_train)

# Fit the model
history = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                              epochs=epochs,
                              validation_data=(X_val, Y_val),
                              verbose=1,
                              steps_per_epoch=X_train.shape[0] // batch_size,
                              callbacks=[learning_rate_reduction])

# ==========================================================================================
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert prediction probabilities to class labels
Y_pred_classes = np.argmax(Y_pred, axis=1)
# Convert validation observations from one hot vectors back to class labels
Y_true = np.argmax(Y_val, axis=1)

# Display some error results
# Errors are differences between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)
Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index, img_errors, pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows, ncols, sharex=True, sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row, col].imshow((img_errors[error]).reshape((28, 28)))
            ax[row, col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error], obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors, axis=1)
# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)
# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]
# Show the top 6 errors
# display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)

# ================================================================================================================
# Predict results for the test set
results = model.predict(test)
# select the index with the maximum probability
results = np.argmax(results, axis=1)
results = pd.Series(results, name="Label")
submission = pd.concat([pd.Series(range(1, 28001), name="ImageId"), results], axis=1)
submission.to_csv("my_kaggle_mnist_submission.csv", index=False)
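
One practical tweak I’d suggest if you try this yourself (it’s not in the script above): 250 training epochs take a while, so it’s worth saving the trained model right after the fit so the error analysis and submission steps can be re-run later without retraining. A minimal sketch using the standard Keras save/load calls; the filename is just an example, and saving to the HDF5 format requires the h5py package:

# hypothetical addition, placed right after model.fit_generator(...) finishes:
model.save("kaggle_mnist_cnn.h5")  # example filename; saves architecture, weights, and optimizer state

# later, in a separate session, the model can be restored without retraining:
from keras.models import load_model
model = load_model("kaggle_mnist_cnn.h5")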

Step 5:  Submit the .csv File to Kaggle for Grading

Running the script from Step 4 for 250 epochs resulted in 99.64% accuracy on the validation set. Submitting the resulting output file to Kaggle produced 99.699% accuracy on Kaggle’s private test data, placing this submission at rank 72 out of 1,879 total competitors.
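
If you’d rather not use the web upload form on the competition page, the official Kaggle API can submit the file from the command line as well. A quick sketch, assuming the kaggle package is installed in the virtualenv and a Kaggle API token has been configured:

$ pip3 install kaggle
$ kaggle competitions submit -c digit-recognizer -f my_kaggle_mnist_submission.csv -m "Keras CNN, 250 epochs"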
