I’ve been interested in joining some Kaggle competitions to sharpen my Data Science and Deep Learning skills, so I thought I would report on my experiences – starting with an easy competition: recognizing digits from the MNIST dataset. This is a classic introductory problem for Deep Learning and computer vision, and the competition is open through Jul 2020, so it seems like a good place to start. Here’s a link to the Kaggle page for this “Getting Started” exercise: https://www.kaggle.com/c/digit-recognizer
I have some previous experience using Keras, Python 3, and Convolutional Neural Networks (CNNs), so I plan to use these as the core set of tools for my first introduction to Kaggle. I’m starting off with a PC I haven’t used for Deep Learning yet – a Dell 8910 with 32 GB of memory, an Nvidia GTX 1070 video card, and Ubuntu 16.04.2 (“xenial”). I won’t go into the details of getting the video card recognized, installing the Nvidia drivers, the cuDNN toolkit, etc., since I went through that process a few months ago and it was painful enough that I don’t want to do it again. I don’t know why this is such a difficult thing to do… you’d think that with Nvidia’s leading role in the AI and Deep Learning marketplace, they would make this a seamless process. But if your experience is anything like mine, it is not close to seamless. I found several web sites that attempt to walk you through this process, and in the end, the recommendations in this one seemed to work best:
As an alternative, I know that Amazon AWS now offers GPU-ready virtual machine environments with the necessary drivers and tools (like Python, Keras, TensorFlow, etc.) pre-installed and ready to use. I relied on an AWS environment to complete some of my Udacity Deep Learning Foundations coursework and found it to be a good, pain-free experience. So, if this is your first time getting into Deep Learning, I might recommend AWS as a starting point – particularly if you have any difficulties getting a desktop Ubuntu environment set up and working. Or, start with CPU-based training and save the GPU configuration for later!
To get started, let’s begin by setting up a programming environment with the tools we’ll be using: Python 3, Keras, etc.
Step 1: Setup a Programming Environment and Install Keras
The first thing I want to do is group all my work into a Python virtual environment, so all the tools I need are accessible to my project but won’t cause version-inconsistency problems with any other projects I might have now or in the future. For simplicity, I’ll call my virtualenv “kaggle-mnist” to make it clear that this is the environment I’ll be using for the Kaggle MNIST competition.
Let’s verify that the environment will be using python3:
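A minimal sketch of these two steps, using Python’s stdlib venv module (the virtualenv tool works similarly):

```shell
# Create a virtual environment named "kaggle-mnist" with a Python 3
# interpreter (the stdlib venv module; virtualenv behaves the same way)
python3 -m venv kaggle-mnist

# Activate it so that "python" and "pip" resolve inside the environment
source kaggle-mnist/bin/activate

# Verify that the environment is using Python 3
python --version
which python
```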
In response, on my machine, I see:
So, things look good so far.
The next thing I want to do is install the Keras Deep Learning library, a GPU-aware version of TensorFlow, and some other utilities I’ll need to complete this task (using pip3, since I’ve decided to use Python 3):
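A pip3 invocation along these lines would cover it – the exact package list is an assumption based on the tools used later in this post:

```shell
# Install Keras, the GPU-enabled TensorFlow backend, and common
# data-science utilities into the active virtual environment.
# (Package list is illustrative; tensorflow-gpu was the GPU build
# of TensorFlow at the time this post was written.)
pip3 install tensorflow-gpu keras numpy pandas matplotlib seaborn scikit-learn
```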
Step 2: Get the MNIST Data Sets from Kaggle
The data for this competition is hosted on the Kaggle site, at the following URL:
There are two files of interest here: train.csv and test.csv. I clicked on each file and pressed the “Download” button on the Kaggle site to get the two files, which I moved to a “/data” folder in my kaggle-mnist programming environment.
While I was at it, I also downloaded the sample-submission.csv file, which shows the format of the submission I will later need to create and send to Kaggle to have my entry evaluated. It’s a simple file that identifies the predicted digit for each of the 28,000 samples in the test set (test.csv). It looks something like this:
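Roughly like this – the sample file pairs each ImageId with a placeholder Label (the values shown here are illustrative):

```
ImageId,Label
1,0
2,0
3,0
...
28000,0
```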
Step 3: Examine the Dataset
The first step when working with any dataset is to take a look at the data and see how it is distributed. I looked at some of the Kaggle kernels provided by other competitors in this competition, and the following code is based on the “Introduction to CNN Keras” Jupyter notebook contributed by Yassine Ghouzam. The following (short) Python script reads in the datasets and displays the number of training-set images assigned to each of the 10 possible categories, sorted from the most common to the least common digit in the training set. The final few lines check to see if any of the data is null/missing:
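A sketch of such a script (variable names are assumptions; a small synthetic frame with the same layout stands in for the real CSVs here so the snippet is self-contained – in the real workflow you would load them with pd.read_csv("data/train.csv") and pd.read_csv("data/test.csv")):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: a "label" column (digit 0-9)
# followed by 784 pixel columns, one per 28x28 pixel
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.integers(0, 255, size=(100, 784)),
                     columns=[f"pixel{i}" for i in range(784)])
train.insert(0, "label", rng.integers(0, 10, size=100))

# Split off the labels from the pixel data
y_train = train["label"]
x_train = train.drop(columns=["label"])

# Count how many training images fall into each class,
# sorted from the most common to the least common digit
print(y_train.value_counts())

# Check whether any column contains null/missing values
print(x_train.isnull().any().describe())
```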
Executing this python script on my PC resulted in the following output:
As you can see, the training-set data is fairly evenly split across the 10 possible classes, and there doesn’t appear to be any missing data in the training or test datasets (all the isnull() entries are False). Good to know!
Step 4: Create the CNN with Keras, Train, and Predict
The following (rather long) script implements the remaining tasks:
- Lines 1-65: The data preparation steps discussed above.
- Lines 66-121: Define the CNN, with 3 main convolution layers and a couple of fully-connected layers leading to a softmax output.
- Lines 122-136: Configure optimizer and learning rate adjustments.
- Lines 137-152: Perform some data augmentation to add variety to the training set.
- Lines 153-161: Train the network.
- Lines 162-212: Analyze various errors when the model is applied to the validation set.
- Lines 213-225: Apply the trained network to the test set and output the submission file.
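As a rough illustration of the network defined in lines 66-121, here is a minimal Keras sketch. The layer structure follows the description above (3 convolution layers, fully-connected layers, softmax output), but the filter counts, dropout rates, and the RMSprop optimizer are assumptions modeled on Yassine Ghouzam’s kernel, not the exact code used here:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    # Three convolutional feature-extraction stages
    Conv2D(32, (5, 5), activation="relu", padding="same",
           input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu", padding="same"),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu", padding="same"),
    Dropout(0.25),

    # Fully-connected classifier ending in a 10-way softmax
    Flatten(),
    Dense(256, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here, training would proceed with model.fit() on the prepared training data, optionally fed through Keras’s ImageDataGenerator for the augmentation step described above.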
Step 5: Submit the .csv File to Kaggle for Grading
Running the script from Step 4 for 250 epochs resulted in 99.64% accuracy on the validation set. Submitting the resulting output file to Kaggle resulted in 99.699% accuracy on Kaggle’s private data set, placing this submission at rank 72 out of 1,879 total competitors.