I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills, so I thought I would report on my experiences, starting with an easy competition: recognizing digits from the MNIST training set. This is a classic introductory problem for Deep Learning and computer vision, and the competition is open through Jul 2020, so it seems like a good place to start. Here’s a link to the Kaggle page for this “Getting Started” exercise: https://www.kaggle.com/c/digit-recognizer
I have some previous experience with Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as the core set of tools for my first introduction to Kaggle. I’m starting off with a PC I haven’t used for Deep Learning yet: a Dell 8910 with 32 GB of memory, an Nvidia GTX 1070 video card, and Ubuntu 16.04.2 (“xenial”). I won’t go into the details of getting the video card recognized (installing the Nvidia drivers, the CUDA/cuDNN toolkit, etc.), since I went through that process a few months ago and it was so painful that I don’t really want to have to do it again. I don’t know why this is such a difficult thing to do; you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace they would make this a seamless process, but if your experience is anything like mine, it is not close to being seamless. I found several websites that attempt to walk you through the process, and in the end, the recommendations in this one seemed to work the best:
https://gist.github.com/ksopyla/813a62d6afc4307755e5832a3b62f432
As an alternative, I know that Amazon AWS now offers GPU-ready virtual machine environments with the necessary drivers and tools (Python, Keras, TensorFlow, etc.) pre-installed. I relied on an AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience. So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point, particularly if you have any difficulties getting a desktop Ubuntu environment set up and working. Or, start with CPU-based training and save the GPU configuration for later!
Let’s begin by setting up a programming environment with the tools we’ll be using: Python 3, Keras, etc.
Step 1: Set Up a Programming Environment and Install Keras
The first thing I want to do is keep all my work in a Python virtual environment, so that all the tools I need are accessible to my project but won’t cause version-inconsistency problems with any other projects I might have now or in the future. For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for this Kaggle MNIST competition.
$ sudo apt install virtualenv
$ virtualenv -p /usr/bin/python3 kaggle-mnist
$ source kaggle-mnist/bin/activate
Let’s verify that the environment will be using python3:
$ python
In response, on my machine, I see:
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
So, things look good so far.
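As one more sanity check, you can confirm that the python binary on your PATH now points into the virtualenv (the exact path will of course differ on your machine):

$ which python
/home/<your-user>/kaggle-mnist/bin/python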
The next thing I want to do is install the Keras deep learning library, a GPU-aware version of TensorFlow, and some other utilities I’ll need for this task (using pip3, since I’ve decided to use Python 3):
$ pip3 install keras
$ pip3 install tensorflow-gpu
$ pip3 install tensorflow
$ pip3 install pandas
$ pip3 install sklearn
$ pip3 install matplotlib
$ pip3 install seaborn
$ sudo apt-get install python3-tk
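One note: with TensorFlow 1.x, tensorflow and tensorflow-gpu are separate packages that both provide the tensorflow module, so installing both into the same environment is redundant; if your CUDA setup is working, tensorflow-gpu alone should be enough. Once everything is installed, a quick way to confirm that TensorFlow actually sees the GPU is to list the local devices from a Python prompt. This is just an optional sanity check (device_lib is part of TensorFlow 1.x; on a CPU-only install you’ll only see a /cpu:0 entry):

from tensorflow.python.client import device_lib
# List every compute device TensorFlow can use; a working GPU setup should include a GPU entry
print([d.name for d in device_lib.list_local_devices()])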
Step 2: Get the MNIST Data Sets from Kaggle
The data for this competition is hosted on the Kaggle site, at the following URL:
https://www.kaggle.com/c/digit-recognizer/data
There are two files of interest here: train.csv and test.csv. I clicked on each file and pressed the “Download” button on the Kaggle site to get the two files, which I then moved into a “data” folder inside my kaggle-mnist project directory (the scripts below expect to find them at data/train.csv and data/test.csv), as shown below.
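If you download the files through the browser, they typically land in ~/Downloads; something along these lines will move them into place (adjust the paths to wherever your browser actually saved them):

$ mkdir -p data
$ mv ~/Downloads/train.csv ~/Downloads/test.csv data/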
While I was at it, I also downloaded the sample-submission.csv file, which shows the format of the file I will later need to create and submit to Kaggle to have my entry evaluated. It is a simple CSV that lists a predicted digit for each of the 28,000 samples in the test set (test.csv). It looks something like this:
ImageId,Label
1,3
2,7
3,8
(27997 more lines)
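For reference, a file in this format is easy to produce with pandas. Here is a minimal sketch; predicted_labels is a hypothetical placeholder for whatever array of 28,000 predicted digits your model produces (the real version of this appears at the end of the Step 4 script):

import numpy as np
import pandas as pd
# Placeholder predictions; in practice these come from the trained model (see Step 4)
predicted_labels = np.zeros(28000, dtype=int)
submission = pd.DataFrame({"ImageId": np.arange(1, len(predicted_labels) + 1),
                           "Label": predicted_labels})
submission.to_csv("submission.csv", index=False)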
Step 3: Examine the Dataset
The first step when working with any dataset is to look at the data and see how it is distributed. I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam. This (short) Python script reads in the datasets and displays the number of training set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set. The final few lines check whether any of the data is null/missing:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())
Executing this Python script on my PC resulted in the following output:
python kaggle_mnist.py
Using TensorFlow backend.
2017-11-13 14:35:53.915831: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64
Checking for missing values in the training set:
count       784
unique        1
top       False
freq        784
dtype: object
and, in the test set:
count       784
unique        1
top       False
freq        784
dtype: object
As you can see, the training set data is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False). Good to know!
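If you’d rather see the class balance as percentages of the 42,000 training images than as raw counts, the same value_counts() call can normalize for you (this assumes Y_train from the script above, before it has been one-hot encoded):

# Class frequencies as percentages of the training set
print((Y_train.value_counts(normalize=True) * 100).round(2))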
Step 4: Create the CNN with Keras, Train, and Predict
The remaining tasks are implemented in the following (rather long) script; the line ranges below refer to the script’s own numbering:
- Lines 1-65: The data loading and prep steps discussed above.
- Lines 66-121: Define the CNN, with three convolutional blocks (two Conv2D layers each) followed by two fully-connected layers and a softmax output.
- Lines 122-136: Configure the optimizer and learning-rate schedule.
- Lines 137-152: Perform some data augmentation to add variety to the training set.
- Lines 153-161: Train the network.
- Lines 162-212: Analyze various errors when the model is applied to the validation set.
- Lines 213-225: Apply the trained network to the test set and output the submission file.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())
# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# Reshape image in 3 dimensions (height = 28px, width = 28px , channels = 1)
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)
# Encode labels to one hot vectors (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)
# Set the random seed
random_seed = 2
# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)
# Some examples
g = plt.imshow(X_train[0][:,:,0])
#=====================================================================
# Set the CNN model
# my CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*3 -> Flatten -> [Dense -> Dropout]*2 -> Dense(softmax) -> Out
model = Sequential()
# Layer 1: 32 3x3 convolutions (x2)
model.add(Conv2D(filters = 32,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu',
                 input_shape = (28,28,1)))
model.add(Conv2D(filters = 32,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.20))
# Layer 2 CNN 64 3x3 convolutions (X2)
model.add(Conv2D(filters = 64,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(Conv2D(filters = 64,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 3 CNN
model.add(Conv2D(filters = 128,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(Conv2D(filters = 128,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 4: FC, Softmax output
model.add(Flatten())
model.add(Dense(512, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(10, activation = "softmax"))
# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor = 'val_acc',
                                            patience = 3,
                                            verbose = 1,
                                            factor = 0.8,
                                            min_lr = 0.000001)
epochs = 250
batch_size = 512
datagen = ImageDataGenerator(
    featurewise_center = False,  # set input mean to 0 over the dataset
    samplewise_center = False,  # set each sample mean to 0
    featurewise_std_normalization = False,  # divide inputs by std of the dataset
    samplewise_std_normalization = False,  # divide each input by its std
    zca_whitening = False,  # apply ZCA whitening
    rotation_range = 12,  # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range = 0.10,  # Randomly zoom image
    width_shift_range = 0.15,  # randomly shift images horizontally (fraction of total width)
    height_shift_range = 0.15,  # randomly shift images vertically (fraction of total height)
    horizontal_flip = False,  # randomly flip images
    vertical_flip = False)  # randomly flip images
datagen.fit(X_train)
# Fit the model
history = model.fit_generator(datagen.flow(X_train,Y_train, batch_size=batch_size),
                              epochs = epochs,
                              validation_data = (X_val,Y_val),
                              verbose = 1,
                              steps_per_epoch = X_train.shape[0] // batch_size,
                              callbacks = [learning_rate_reduction])
#==========================================================================================
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert prediction probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1)
# Convert one-hot validation labels back to class indices
Y_true = np.argmax(Y_val,axis = 1)
# Display some error results
# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)
Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]
def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1
# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)
# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)
# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]
# Show the top 6 errors
#display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
# ================================================================================================================
# predict results
results = model.predict(test)
# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)
submission.to_csv("my_kaggle_mnist_submission.csv",index=False)
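Before uploading anything, it’s worth a quick check that the generated file has the shape Kaggle expects: 28,000 rows with ImageId and Label columns. A short sanity check like this (assuming the filename used above) will catch an obvious formatting mistake:

import pandas as pd
sub = pd.read_csv("my_kaggle_mnist_submission.csv")
# The digit-recognizer submission needs exactly these two columns and one row per test image
assert list(sub.columns) == ["ImageId", "Label"]
assert len(sub) == 28000
print(sub.head())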
Step 5: Submit the .csv File to Kaggle for Grading
Running the Step 4 script for 250 epochs resulted in 99.64% accuracy on the validation set. Submitting the resulting output file to Kaggle gave an accuracy of 99.699% on Kaggle’s private data set, placing this submission at rank 72 out of 1879 total competitors.