Convolutional Networks continued

I’ve made a good deal of progress in understanding the Convolutional Network concept described by Turaga and Murray and blogged about here previously. This is in part thanks to Turaga himself, who kindly responded to my email with some questions, and also thanks to the help of mathemagician @a.

The main thing that was tripping me up was the size of the different images/objects involved. The magic, it turns out, is all in the difference between “valid” and “full” convolution (and cross-correlation). Both of these processes rely on sliding one image across another, so when you’re processing a corner pixel, the kernel spills over the edges of the base image. You can handle this by wrap-around, zero-padding or symmetry. A “full” convolution returns every position where the kernel overlaps the image at all, so the full convolution of an NxNxN image with an mxmxm kernel has edge length N+m-1 (the variant that keeps the output at NxNxN is usually called “same”). A “valid” convolution returns only those pixels which do not rely on padding, so you lose m-1 from each dimension each time you convolve.
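To see the sizes concretely, here is a minimal sketch in one dimension using NumPy's `np.convolve`, whose `mode` argument uses the same names. The lengths `N = 10` and `m = 3` are arbitrary values picked for illustration; the same arithmetic applies along each axis of a 3-D image.

```python
import numpy as np

N, m = 10, 3                       # illustrative sizes, not from the paper
image = np.arange(N, dtype=float)  # stand-in for one row of pixels
kernel = np.ones(m) / m            # simple averaging kernel

# 'full': every position where kernel and image overlap at all (zero-padded).
full = np.convolve(image, kernel, mode="full")
# 'same': output cropped back to the size of the image.
same = np.convolve(image, kernel, mode="same")
# 'valid': only positions where the kernel fits entirely inside the image.
valid = np.convolve(image, kernel, mode="valid")

print(len(full), len(same), len(valid))  # N+m-1, N, N-m+1 -> 12 10 8
```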

Thus the end image, I_O, that you come out with after 3 rounds of valid convolution has edge length N-3(m-1) = N-3m+3. Then when you do backpropagation to calculate S_b, you do “full” cross-correlation, and when you compute the weight kernel gradients, you do “valid” again. My original confusion was about the line:
ΔW_ab = η I_a (*) S_b

Because it seemed to me that I_a and S_b are both NxNxN. In fact S_b is m-1 pixels smaller, on each dimension, than I_a. Therefore when you “valid” cross-correlate them, there are only mxmxm places for S_b to move around inside I_a. So the output of this cross-correlation is actually an mxmxm array, the same size as W_ab.
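The size bookkeeping above can be checked numerically. The sketch below is only an illustration with made-up sizes (`n = 20`, `m = 5`) and random arrays standing in for I_a and S_b. `valid_edge` and `valid_xcorr` are hypothetical helpers written here on top of NumPy's `sliding_window_view`; they are not functions from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def valid_edge(n, m, layers=1):
    """Edge length after `layers` valid convolutions with an m-wide kernel."""
    return n - layers * (m - 1)


def valid_xcorr(big, small):
    """'Valid' cross-correlation: slide `small` over every position where it
    fits entirely inside `big`, summing the elementwise products."""
    # windows has shape (output positions..., small.shape...)
    windows = sliding_window_view(big, small.shape)
    # Contract the window axes against `small`, leaving one value per position.
    return np.tensordot(windows, small, axes=small.ndim)


rng = np.random.default_rng(0)
n, m = 20, 5                            # illustrative sizes only
I_a = rng.standard_normal((n, n, n))    # stand-in input feature map
k = valid_edge(n, m)                    # S_b is m-1 smaller on each axis
S_b = rng.standard_normal((k, k, k))    # stand-in sensitivity map

grad = valid_xcorr(I_a, S_b)
print(valid_edge(n, m, layers=3))  # edge after three valid convolutions: 8
print(grad.shape)                  # (5, 5, 5), the same shape as W_ab
```

Multiplying `grad` by the learning rate η would then give an update with exactly the shape of the kernel.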

So if I understood that correctly I think it solves my major confusion. I also spent a bit of time reading the LeCun papers that were cited, and one upshot is that the randomly initialized weight kernels are not just initialized with random() i.e. a uniform distribution from 0 to 1. You want to give them a normal distribution with mean 0 (see “Efficient Backprop” tip #16, p. 13) and standard deviation as follows:

We initialize all the elements of the filters w_ab and the biases h_a randomly from a normal distribution of standard deviation σ = 1/√(|w|·|b|), where |w| is the number of elements in each filter (here, 5³) and |b| is the number of input feature maps incident on a particular output feature map. This choice mirrors a suggestion in LeCun, Bottou, Orr, & Muller (1998) that weight vectors have unit norm.
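As a rough sketch of that initialization in NumPy: the code below assumes the standard deviation is 1/sqrt(|w|·|b|), which is the value the unit-norm remark implies (the squared norm of |w|·|b| independent weights then has expectation 1). The function name `init_filters` and the choice of 4 input feature maps are hypothetical, for illustration only.

```python
import numpy as np


def init_filters(n_inputs, filter_edge=5, seed=None):
    """Draw one output map's filters and bias from a zero-mean normal.

    Assumes sigma = 1 / sqrt(|w| * |b|), with |w| = filter_edge**3 elements
    per filter and |b| = n_inputs incoming feature maps.
    """
    rng = np.random.default_rng(seed)
    n_elements = filter_edge ** 3
    sigma = 1.0 / np.sqrt(n_elements * n_inputs)
    filters = rng.normal(0.0, sigma, size=(n_inputs,) + (filter_edge,) * 3)
    bias = rng.normal(0.0, sigma)
    return filters, bias, sigma


filters, bias, sigma = init_filters(n_inputs=4, seed=0)
print(filters.shape)            # (4, 5, 5, 5)
print(np.linalg.norm(filters))  # should come out close to 1
```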

LeCun’s Efficient Backprop also has tips on transforming the inputs (p. 9). It’s clear one should normalize images to a mean of zero; I’m less clear what to do with the other tips, since my “input variables” are pixels in an image:

Transforming the Inputs

The average of each input variable over the training set should be close to zero

Scale input variables so that their covariances are about the same

Input variables should be uncorrelated if possible
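For the first tip, here is a minimal sketch of zero-mean normalization, assuming the training set is a NumPy array of shape (n_images, depth, height, width). The mean and standard deviation are pooled over every pixel of the training set, which is just one possible reading of the tips for image data rather than something Efficient Backprop prescribes. `normalize_images` is a hypothetical helper name.

```python
import numpy as np


def normalize_images(train, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation, using
    statistics pooled over every pixel in the training set."""
    mean = train.mean()
    std = train.std()
    return (train - mean) / (std + eps), mean, std


rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(3, 8, 8, 8))  # fake volumes in 8-bit range
normalized, mean, std = normalize_images(train)
print(normalized.mean(), normalized.std())      # close to 0 and 1
```

The same `mean` and `std` would then be reused on any images seen after training, so that they are transformed the same way.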

But in any case I now feel I may actually understand enough of this to try my hand at an implementation of it in Python. More fun and games to follow.