I’ve always been a little confused about how the size of convolution filters, input image size, stride, padding, etc., relate to the final size of feature maps in a Convolutional Neural Network (CNN). So, here are some notes I’ve gathered that help explain things a little:
The following diagram, from the “Caffe in a Day Tutorial” (https://docs.google.com/presentation/d/1HxGdeq8MPktHaPb-rlmYYQ723iWzq9ur6Gjo71YiG0Y/edit#slide=id.gc2fcdcce7_216_0), provides a good overview of what happens when a single convolution filter (or kernel) is applied to an input image.
In this example, the input image is 32×32 pixels with 3 separate RGB color channels, and the filter is a 5×5 kernel. In fact, it is actually 3 separate 5×5 kernels, one for each of the 3 color channels; each of these kernels is independent of the others and learned during training. The filter slides over the input image, and the value of each pixel under a kernel location is multiplied by the corresponding kernel value. The results of the individual multiplications are added together (3×5×5 = 75 in all), and the resulting sum produces a single pixel in the feature map. The first (top-left corner) feature of the feature map is produced from the sum of the 75 multiplications when the filter sits over the top-left corner of the input image. After this first feature is calculated, the filter is shifted to the right one pixel (or, more precisely, the number of pixels given by the “stride”) and the process is repeated to obtain the 2nd feature. This continues, sliding to the right until the filter reaches the right edge, then down one pixel to the next row, and so on until the filter sits at the bottom-right corner of the input image and the bottom-right feature of the feature map has been calculated. In this particular example, the stride was set to 1 and no padding was used, so the resulting 28×28 feature map is slightly smaller than the 32×32 input. In practice, the input image is generally padded with additional zero-value pixels so that the feature map ends up the same size as the input image. For a 3×3 kernel, a padding of 1 (on both sides, top, and bottom) results in a feature map matching the input image size; for a 5×5 kernel, a padding of 2 pixels on each side/top/bottom produces the correct size.
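The sliding-window arithmetic above boils down to one formula: for input size W, kernel size K, padding P, and stride S, the output dimension along one axis is ⌊(W − K + 2P)/S⌋ + 1. A minimal sketch in Python (the function name here is my own, not from any library):

```python
def feature_map_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one spatial dimension of a convolution."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# 32x32 input, 5x5 kernel, stride 1, no padding -> 28x28 feature map
print(feature_map_size(32, 5))             # 28
# a padding of 2 keeps the 5x5 output the same size as the input
print(feature_map_size(32, 5, padding=2))  # 32
# a 3x3 kernel needs only a padding of 1 to preserve the size
print(feature_map_size(32, 3, padding=1))  # 32
```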
Note, in particular, that the same set of weights and bias value is used as this filter scans across the image, resulting in a much-reduced number of network parameters compared to a Fully Connected (FC) neural network layer. In this particular example, the 5×5 kernel (3×5×5, actually, when the 3 separate per-channel filters are considered) results in 75 unique weights (76 parameters counting the bias term) to be learned, compared to 3,072 weights (3,073 parameters counting the bias term) for a single unit of a fully-connected layer. The difference in number of parameters is even more dramatic with larger input image sizes, since the number of parameters in a fully-connected layer is proportional to 3×H×W (where H and W are the height and width of the input imagery), while the number of parameters in a convolutional layer is proportional only to the size of the filter and is independent of the input image size. So, for example, a 3×256×256 image from the ImageNet data set would require nearly 200,000 FC parameters per output unit to be learned, while a convolution filter over this same image would still need only 76 parameters.
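The parameter counts above are easy to verify with a couple of lines of arithmetic (helper names are mine, for illustration only):

```python
# Parameter count for one 5x5 conv filter over 3 channels, versus one
# fully-connected output unit over the whole image (bias included).
def conv_params(channels, k):
    return channels * k * k + 1  # shared kernel weights + one bias

def fc_params(channels, h, w):
    return channels * h * w + 1  # one weight per input pixel + one bias

print(conv_params(3, 5))        # 76
print(fc_params(3, 32, 32))     # 3073
print(fc_params(3, 256, 256))   # 196609 -- the conv filter still needs 76
```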
Rather than a single set of kernels in a layer, though, CNNs generally provide more modeling “capacity” by including numerous separate kernels that can be learned by the network. Each separate kernel produces its own feature map; these are generally shown as a 3D volume of feature maps, where the depth of the volume indicates the number of filters to be learned in that layer of the network. The next diagram, from the same “Caffe in a Day Tutorial”, illustrates this for a particular example where 6 sets of 3×5×5 kernels (18 kernels total) produce a set of 6 feature maps.
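A naive NumPy sketch makes the resulting volume concrete: with 6 sets of 3×5×5 kernels on a 32×32 RGB input (stride 1, no padding), each output pixel is still a sum of 75 products, and the stacked feature maps form a 6×28×28 volume. This is just an illustrative loop, not how any real framework implements convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 32, 32))     # 3-channel 32x32 input
filters = rng.standard_normal((6, 3, 5, 5))  # 6 sets of 3x5x5 kernels

out = np.zeros((6, 28, 28))                  # stride 1, no padding
for f in range(6):
    for i in range(28):
        for j in range(28):
            # each output pixel: sum of 3*5*5 = 75 products
            out[f, i, j] = np.sum(image[:, i:i+5, j:j+5] * filters[f])

print(out.shape)  # (6, 28, 28): depth equals the number of filters
```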
Finally, note that choosing a stride other than 1 reduces the size of the feature maps. For example, a stride of 2 (sliding the 5×5 filter two pixels right, or down, after each convolution is computed) would result in 14×14 feature maps, reducing the computational requirements on subsequent layers of the network.
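The stride-2 result follows from the same size formula, since only every other filter position is evaluated:

```python
# With stride 2, a 5x5 filter on a 32x32 input visits
# floor((32 - 5) / 2) + 1 = 14 positions per row and per column.
size = (32 - 5) // 2 + 1
print(size)  # 14
```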