I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills, so I thought I would report on my experiences, starting with an easy competition: recognizing digits from the MNIST training set. This is a classic introductory problem for Deep Learning and computer vision, and the competition is open through Jul 2020, so it seems like a good place to start. Here’s a link to the Kaggle page for this “Getting Started” exercise: https://www.kaggle.com/c/digit-recognizer
I have some previous experience with Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as the core set of tools for my first introduction to Kaggle. I’m starting off with a PC I haven’t used for Deep Learning yet: a Dell 8910 with 32 GB of memory, an Nvidia GTX 1070 video card, and Ubuntu 16.04.2 (“xenial”). I won’t go into the details of getting the video card recognized (installing the Nvidia drivers, the CUDA/cuDNN toolkit, etc.), since I went through that process a few months ago and it was so painful that I don’t really want to have to do it again. I don’t know why this is such a difficult thing to do; you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace they would make this a seamless process, but if your experience is anything like mine, it is not close to being seamless. I found several websites that attempt to walk you through the process, and in the end, the recommendations in this one seemed to work the best:
https://gist.github.com/ksopyla/813a62d6afc4307755e5832a3b62f432
As an alternative, I know that Amazon AWS now offers GPU-ready virtual machine environments with the necessary drivers and tools (Python, Keras, TensorFlow, etc.) pre-installed. I relied on an AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience. So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point, particularly if you have any difficulties getting a desktop Ubuntu environment set up and working. Or, start with CPU-based training and save the GPU configuration for later!
Let’s begin by setting up a programming environment with the tools we’ll be using: Python 3, Keras, etc.
Step 1: Set Up a Programming Environment and Install Keras
The first thing I want to do is keep all my work in a Python virtual environment, so that all the tools I need are accessible to my project but won’t cause version-inconsistency problems with any other projects I might have now or in the future. For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for this Kaggle MNIST competition.
$ sudo apt install virtualenv
$ virtualenv -p /usr/bin/python3 kaggle-mnist
$ source kaggle-mnist/bin/activate
Let’s verify that the environment will be using python3:
$ python
In response, on my machine, I see:
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
So, things look good so far.
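As one more sanity check, you can confirm that the python binary on your PATH now points into the virtualenv (the exact path will of course differ on your machine):

$ which python
/home/<your-user>/kaggle-mnist/bin/python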
The next thing I want to do is install the Keras deep learning library, a GPU-aware version of TensorFlow, and some other utilities I’ll need for this task (using pip3, since I’ve decided to use Python 3):
$ pip3 install keras
$ pip3 install tensorflow-gpu
$ pip3 install tensorflow
$ pip3 install pandas
$ pip3 install sklearn
$ pip3 install matplotlib
$ pip3 install seaborn
$ sudo apt-get install python3-tk
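One note: with TensorFlow 1.x, tensorflow and tensorflow-gpu are separate packages that both provide the tensorflow module, so installing both into the same environment is redundant; if your CUDA setup is working, tensorflow-gpu alone should be enough. Once everything is installed, a quick way to confirm that TensorFlow actually sees the GPU is to list the local devices from a Python prompt. This is just an optional sanity check (device_lib is part of TensorFlow 1.x; on a CPU-only install you’ll only see a /cpu:0 entry):

from tensorflow.python.client import device_lib
# List every compute device TensorFlow can use; a working GPU setup should include a GPU entry
print([d.name for d in device_lib.list_local_devices()])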
Step 2: Get the MNIST Data Sets from Kaggle
The data for this competition is hosted on the Kaggle site, at the following URL:
https://www.kaggle.com/c/digit-recognizer/data
There are two files of interest here: train.csv and test.csv. I clicked on each file and pressed the “Download” button on the Kaggle site to get the two files, which I then moved into a “data” folder inside my kaggle-mnist project directory (the scripts below expect to find them at data/train.csv and data/test.csv), as shown below.
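If you download the files through the browser, they typically land in ~/Downloads; something along these lines will move them into place (adjust the paths to wherever your browser actually saved them):

$ mkdir -p data
$ mv ~/Downloads/train.csv ~/Downloads/test.csv data/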
While I was at it, I also downloaded the sample-submission.csv file, which shows the format of the file I will later need to create and submit to Kaggle to have my entry evaluated. It is a simple CSV that lists a predicted digit for each of the 28,000 samples in the test set (test.csv). It looks something like this:
ImageId,Label
1,3
2,7
3,8
(27997 more lines)
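For reference, a file in this format is easy to produce with pandas. Here is a minimal sketch; predicted_labels is a hypothetical placeholder for whatever array of 28,000 predicted digits your model produces (the real version of this appears at the end of the Step 4 script):

import numpy as np
import pandas as pd
# Placeholder predictions; in practice these come from the trained model (see Step 4)
predicted_labels = np.zeros(28000, dtype=int)
submission = pd.DataFrame({"ImageId": np.arange(1, len(predicted_labels) + 1),
                           "Label": predicted_labels})
submission.to_csv("submission.csv", index=False)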
Step 3: Examine the Dataset
The first step when working with any dataset is to look at the data and see how it is distributed. I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam. This (short) Python script reads in the datasets and displays the number of training set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set. The final few lines check whether any of the data is null/missing:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())
Executing this Python script on my PC resulted in the following output:
python kaggle_mnist.py
Using TensorFlow backend.
2017-11-13 14:35:53.915831: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64
Checking for missing values in the training set:
count       784
unique        1
top       False
freq        784
dtype: object
and, in the test set:
count       784
unique        1
top       False
freq        784
dtype: object
As you can see, the training set data is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False). Good to know!
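If you’d rather see the class balance as percentages of the 42,000 training images than as raw counts, the same value_counts() call can normalize for you (this assumes Y_train from the script above, before it has been one-hot encoded):

# Class frequencies as percentages of the training set
print((Y_train.value_counts(normalize=True) * 100).round(2))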
Step 4: Create the CNN with Keras, Train, and Predict
The remaining tasks are implemented in the following (rather long) script; the line ranges below refer to the script’s own numbering:
- Lines 1-65: The data loading and prep steps discussed above.
- Lines 66-121: Define the CNN, with three convolutional blocks (two Conv2D layers each) followed by two fully-connected layers and a softmax output.
- Lines 122-136: Configure the optimizer and learning-rate schedule.
- Lines 137-152: Perform some data augmentation to add variety to the training set.
- Lines 153-161: Train the network.
- Lines 162-212: Analyze various errors when the model is applied to the validation set.
- Lines 213-225: Apply the trained network to the test set and output the submission file.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
np.random.seed(2)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils.np_utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
sns.set(style='white', context='notebook', palette='deep')
# Load the data
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
Y_train = train["label"]
# Drop 'label' column
X_train = train.drop(labels = ["label"],axis = 1)
# free some space
del train
g = sns.countplot(Y_train)
print(Y_train.value_counts())
print()
print("Checking for missing values in the training set:")
print(X_train.isnull().any().describe())
print()
print("and, in the test set:")
print(test.isnull().any().describe())
# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# Reshape image in 3 dimensions (height = 28px, width = 28px , channels = 1)
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)
# Encode labels to one hot vectors (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)
# Set the random seed
random_seed = 2
# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)
# Some examples
g = plt.imshow(X_train[0][:,:,0])
#=====================================================================
# Set the CNN model
# my CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*3 -> Flatten -> [Dense -> Dropout]*2 -> Dense(softmax) -> Out
model = Sequential()
# Layer 1: 32 3x3 convolutions (x2)
model.add(Conv2D(filters = 32,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu',
                 input_shape = (28,28,1)))
model.add(Conv2D(filters = 32,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.20))
# Layer 2 CNN 64 3x3 convolutions (X2)
model.add(Conv2D(filters = 64,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(Conv2D(filters = 64,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 3 CNN
model.add(Conv2D(filters = 128,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(Conv2D(filters = 128,
                 kernel_size = (3,3),
                 padding = 'Same',
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.20))
# Layer 4: FC, Softmax output
model.add(Flatten())
model.add(Dense(512, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.25))
model.add(Dense(10, activation = "softmax"))
# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor = 'val_acc',
                                            patience = 3,
                                            verbose = 1,
                                            factor = 0.8,
                                            min_lr = 0.000001)
epochs = 250
batch_size = 512
datagen = ImageDataGenerator(
    featurewise_center = False,  # set input mean to 0 over the dataset
    samplewise_center = False,  # set each sample mean to 0
    featurewise_std_normalization = False,  # divide inputs by std of the dataset
    samplewise_std_normalization = False,  # divide each input by its std
    zca_whitening = False,  # apply ZCA whitening
    rotation_range = 12,  # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range = 0.10,  # Randomly zoom image
    width_shift_range = 0.15,  # randomly shift images horizontally (fraction of total width)
    height_shift_range = 0.15,  # randomly shift images vertically (fraction of total height)
    horizontal_flip = False,  # randomly flip images
    vertical_flip = False)  # randomly flip images
datagen.fit(X_train)
# Fit the model
history = model.fit_generator(datagen.flow(X_train,Y_train, batch_size=batch_size),
                              epochs = epochs,
                              validation_data = (X_val,Y_val),
                              verbose = 1,
                              steps_per_epoch = X_train.shape[0] // batch_size,
                              callbacks = [learning_rate_reduction])
#==========================================================================================
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert prediction probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1)
# Convert one-hot validation labels back to class indices
Y_true = np.argmax(Y_val,axis = 1)
# Display some error results
# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)
Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]
def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1
# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)
# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))
# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors
# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)
# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]
# Show the top 6 errors
#display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
# ================================================================================================================
# predict results
results = model.predict(test)
# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")
submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)
submission.to_csv("my_kaggle_mnist_submission.csv",index=False)
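Before uploading anything, it’s worth a quick check that the generated file has the shape Kaggle expects: 28,000 rows with ImageId and Label columns. A short sanity check like this (assuming the filename used above) will catch an obvious formatting mistake:

import pandas as pd
sub = pd.read_csv("my_kaggle_mnist_submission.csv")
# The digit-recognizer submission needs exactly these two columns and one row per test image
assert list(sub.columns) == ["ImageId", "Label"]
assert len(sub) == 28000
print(sub.head())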
Step 5: Submit the .csv File to Kaggle for Grading
Running the Step 4 script for 250 epochs resulted in 99.64% accuracy on the validation set. Submitting the resulting output file to Kaggle gave an accuracy of 99.699% on Kaggle’s private data set, placing this submission at rank 72 out of 1879 total competitors.