Backprop through Convolutions

Juan Vera

October 2024

Backprop for CNNs

Backprop for CNNs is similar to backpropagation for fully connected feedforward neural networks, as both use the multivariable chain rule. However, because a conv layer shares weights, we instead compute the gradient for an individual weight $W_i^l$ in the kernel $\mathcal{K}$ at layer $l$ as the sum of the gradients of that $W_i^l$ across all possible positions of $\mathcal{K}$ during the convolution $\mathcal{K} * A^{l-1}$, where $A^{l-1}$ is the input to the $l$th layer, i.e., the activation of the previous layer.

Let's assume the $l$th layer is a convolutional layer for now.

Prior to doing so, as in a fully connected model, you'd need $\partial Z^l$ in order to backpropagate, which in the case of a convnet is simply a transposed convolution between the (flipped) kernel at the $(l+1)$th layer, $\mathcal{K}^{l+1}$, and the gradient of the loss w.r.t. $Z$ at the $l+1$ layer, $\partial Z^{(l+1)}$.

The transposed convolution (sometimes known as the deconvolution, purely for the following reason) maps the elements of a convolution's output back to their original spatial positions prior to the convolution.

IMPORTANT

" * " denotes a convolution

" ⊙ " denotes the hadamard product, or element wise multiplication

" ^\hat{*} " denotes a tranposed convolution

" • " denotes a dot product or matmul

For example:

$$W * X = Z$$

is the convolution of the kernel, $W$, slid over $X$ to yield $Z$.

The transposed convolution computes:

$$W \,\hat{*}\, Z = \hat{X}$$

where $\hat{X}$ is the same shape as $X$. The transposed convolution doesn't yield the same values as the original $X$, just the same spatial dimensions, expanding the values of $Z$ back onto their original positions in $\hat{X}$.
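To make the spatial bookkeeping concrete, here's a minimal NumPy sketch of a stride-1, no-padding convolution and its transposed counterpart. The function names (`conv2d`, `transposed_conv2d`) and the shapes are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def conv2d(W, X):
    """Valid 2D convolution of kernel W slid over X (implemented as a cross-correlation)."""
    kh, kw = W.shape
    out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Z = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Z[i, j] = np.sum(W * X[i:i+kh, j:j+kw])
    return Z

def transposed_conv2d(W, Z, input_shape):
    """Scatter each Z[i, j], weighted by W, back over the window of X it was computed from."""
    X_hat = np.zeros(input_shape)
    kh, kw = W.shape
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            X_hat[i:i+kh, j:j+kw] += W * Z[i, j]
    return X_hat

X = np.random.randn(5, 5)
W = np.random.randn(3, 3)
Z = conv2d(W, X)                           # shape (3, 3)
X_hat = transposed_conv2d(W, Z, X.shape)   # shape (5, 5): same shape as X, not the same values
```

Note that `X_hat` matches `X` in shape only; the scatter-add simply routes each value of `Z` back over the window it came from.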

So, going back, $\partial Z^{l}$ is simply the backpropagated gradient w.r.t. the weighted sum at layer $l$, attained via the transposed convolution of $(\mathcal{K}^{l+1})^{\text{F}}$ and $\partial Z^{l+1}$, where $\text{F}$ denotes the flipping of $\mathcal{K}$.

We flip $\mathcal{K}$ because that routes the gradients back to the correct positions, which then lets us properly update $W^l$.

So where in a fully connected layer you have:

$$\partial Z^l = \left(\frac{\partial L}{\partial A^{l+1}}\right)\left(\frac{\partial A^{l+1}}{\partial Z^{l+1}}\right)\left(\frac{\partial Z^{l+1}}{\partial A^l}\right)\left(\frac{\partial A^l}{\partial Z^l}\right) = (W^{(l+1)})^T \partial Z^{(l+1)} \odot \partial\text{act-func}(Z^{l})$$

through a convolutional layer, we have:

$$\partial Z^{l} = \left(\frac{\partial L}{\partial A^{l+1}}\right)\left(\frac{\partial A^{l+1}}{\partial Z^{l+1}}\right)\left(\frac{\partial Z^{l+1}}{\partial A^l}\right)\left(\frac{\partial A^l}{\partial Z^l}\right) = \left((\mathcal{K}^{l+1})^{\text{F}} \,\hat{*}\, \partial Z^{(l+1)}\right) \odot \partial\text{act-func}(Z^{l})$$

where:

$(\mathcal{K}^{l+1})^{\text{F}} \,\hat{*}\, \partial Z^{(l+1)}$ has the same dimensions as $Z^l$.

$\partial\text{act-func}(Z^{l})$ is the derivative of the activation w.r.t. $Z^l$, $\frac{\partial A}{\partial Z^l}$.

Ultimately, we have $\partial Z^l$, the gradient w.r.t. the output weighted sum of the $l$th convolutional layer.
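As a rough sketch of the formula above, assuming a single-channel feature map, a ReLU activation at layer $l$, and illustrative names and shapes throughout, $\partial Z^l$ could be computed like this:

```python
import numpy as np

def transposed_conv2d(W, dZ, input_shape):
    """Scatter each dZ[i, j], weighted by W, back onto the positions of the layer input."""
    dX = np.zeros(input_shape)
    kh, kw = W.shape
    for i in range(dZ.shape[0]):
        for j in range(dZ.shape[1]):
            dX[i:i+kh, j:j+kw] += W * dZ[i, j]
    return dX

def relu_deriv(Z):
    return (Z > 0).astype(Z.dtype)

Z_l = np.random.randn(5, 5)      # pre-activation of layer l, cached from the forward pass
K_next = np.random.randn(3, 3)   # kernel of layer l+1
dZ_next = np.random.randn(3, 3)  # gradient w.r.t. Z^{l+1}, already computed upstream

K_flipped = np.flip(K_next)      # (K^{l+1})^F: a 180-degree flip of the kernel
dZ_l = transposed_conv2d(K_flipped, dZ_next, Z_l.shape) * relu_deriv(Z_l)
```

One caveat: the flip here follows the text's convention of $*$ as a true convolution. If the forward pass is implemented as a cross-correlation (as most deep learning frameworks' "conv" layers are), the flip is already absorbed and the unflipped kernel would be scattered instead.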

Once we have $\partial Z^l$, we can compute $\partial W^l$.

To get $\partial W^l$, given $\partial Z^l$, the kernel $\mathcal{K}^l$, and the input to the current layer, $A^{l-1}$, we essentially slide the kernel $\mathcal{K}^l$ over $A^{l-1}$ to extract a patch of $A^{l-1}$ at position $i, j$ (denoted $A_{patch}^{l-1}$), as is done in a convolution. But rather than computing a weighted sum, we simply element-wise multiply the patch with the element of $\partial Z^l$ at that $i, j$ location to get a portion of our gradient at each step, call it $\mathcal{O}_{ij}$.

Once we have every possible $\mathcal{O}_{ij}$, we element-wise sum them all to get our gradient for $\mathcal{K}^l$, as $\partial W^l$.

$$\mathcal{O}_{ij} = A^{l-1}_{patch_{ij}} \odot \partial Z_{ij}^l$$
$$\partial W^l = \sum_n^N \mathcal{O}_{n}$$

where $N$ is the total number of patches.

Ultimately, we can express this as:

$$\partial W^l = \sum_n^N \left(A^{l-1}_{patch_{ij}^n} \odot \partial Z_{ij}^l\right), \quad \forall\, i, j$$

where $i$ denotes the row position and $j$ denotes the column position of the current $n$th patch.
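A minimal sketch of this patch-wise accumulation, again assuming a single channel, stride 1, and illustrative names:

```python
import numpy as np

def conv_weight_grad(A_prev, dZ, kernel_shape):
    """dW^l: accumulate every kernel-sized patch of A^{l-1}, scaled by the matching scalar in dZ^l."""
    kh, kw = kernel_shape
    dW = np.zeros(kernel_shape)
    for i in range(dZ.shape[0]):
        for j in range(dZ.shape[1]):
            patch = A_prev[i:i+kh, j:j+kw]   # A^{l-1}_patch at position (i, j)
            dW += patch * dZ[i, j]           # O_ij, summed element-wise into dW^l
    return dW

A_prev = np.random.randn(5, 5)   # activation of layer l-1, cached from the forward pass
dZ_l = np.random.randn(3, 3)     # gradient w.r.t. Z^l, computed as above
dW_l = conv_weight_grad(A_prev, dZ_l, (3, 3))
```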

We can also compute $\partial B^l$ from $\partial Z^l$ as:

$$\partial B^l = \sum_{i = 0}^{I} \sum_{j = 0}^{J} \partial Z_{ij}^l$$

Simply summing over all spatial positions of $\partial Z_{ij}^l$.

Thereby, if we have a batch size greater than $1$, we can average all our gradients over the batch size, $\beta$, as:

$$\partial B = \frac{1}{\beta} \sum_{s=0}^{\beta} \partial B_s \qquad \partial W = \frac{1}{\beta} \sum_{s=0}^{\beta} \partial W_s$$
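For completeness, a tiny sketch of the bias gradient and the batch averaging, with an assumed batch size of 4 and stacked per-sample gradients (purely illustrative):

```python
import numpy as np

beta = 4                                  # assumed batch size
dZ_l = np.random.randn(beta, 3, 3)        # per-sample gradients w.r.t. Z^l
dW_samples = np.random.randn(beta, 3, 3)  # per-sample dW^l, e.g. from conv_weight_grad above

dB_l = dZ_l.sum(axis=(1, 2)).mean()       # sum over spatial positions, then average over the batch
dW_l = dW_samples.mean(axis=0)            # average the per-sample weight gradients
```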

Backpropagation through Max-Pooling

Gradients for max pooling layers can be a tricky headache at times, especially in scenarios where you have a max pooling layer right before a fully connected layer, and you want to get the gradients for both the max-pooling layer and the convolutional layer prior to it.

Regarding the gradients, $\partial Z^l$, assuming the current layer, $l$, is the max-pooling layer and we're propagating back from a fully connected layer, we can simply compute them as:

$$\partial Z^l = (W^{l+1})^T \cdot \partial Z^{l+1}$$

or if propagating back from a convolutional layer,

$$\partial Z^{l} = (\mathcal{K}^{l+1})^{\text{F}} \,\hat{*}\, \partial Z^{(l+1)}$$

without the need for the Hadamard product with $\partial\text{act-func}(Z^l)$ that would've been needed if we were computing a $\partial Z$ within a fully connected or convolutional layer. This is because the output of the max-pooling layer, $Z^l$, makes no use of an activation function, so there is no activation gradient to compute during backprop.
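In the common case where the pooled feature map was flattened before the fully connected layer, the only extra step is reshaping the propagated gradient back to the pooled shape. A small sketch, with every shape and name assumed for illustration:

```python
import numpy as np

pool_out_shape = (2, 2)          # assumed shape of the max-pooling layer's output
W_next = np.random.randn(3, 4)   # FC weights of layer l+1: (units, flattened pooled inputs)
dZ_next = np.random.randn(3)     # gradient w.r.t. the FC layer's pre-activation Z^{l+1}

# (W^{l+1})^T . dZ^{l+1}, then restore the pooling layer's spatial shape; no activation
# derivative is applied here, since the pooling layer has no activation function.
dZ_pool = (W_next.T @ dZ_next).reshape(pool_out_shape)
```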

Now that we have $\partial Z^{l}$, the gradient w.r.t. the output of the max-pooling layer, things get a little trickier when we want the gradients w.r.t. the inputs of the pooling layer, or equivalently, w.r.t. the outputs of the prior convolutional layer, $l - 1$.

Given that max pooling downsamples the input to a smaller output ($\mathbb{R}^{m\times n} \rightarrow \mathbb{R}^{\hat{m} \times \hat{n}}$), we need a way to upsample, or "undo", the max-pooling operation within backpropagation.

To upsample, we need to remember the indices of the maximum values within each pooling region of a given input $Z$, as selected by the kernel of the max-pooling layer, $\mathcal{K}^{\text{MP}}$.

Then, when we perform the max unpooling, we create an empty mask ($\partial Z_{mask}^{l-1}$) with the same shape as the input of the pooling layer, $Z$. With the cached indices, we identify the respective positions in this empty mask and insert each value from the derivative of the max-pooling layer, $\partial Z^l$, into its corresponding position in $\partial Z_{mask}^{l-1}$. Then $\partial Z_{mask}^{l-1} \rightarrow \partial Z^{l-1}$.

This is essentially max-unpooling with our partial derivatives.

Then afterward, to follow the chain rule, as would've been done for the gradient in a fully connected layer, $\partial Z^l = (W^{l+1})^T \partial Z^{l+1} \odot \partial\text{act-func}(Z^{l})$, we compute the Hadamard product with $\partial\text{act-func}(Z^{l-1})$.

$$\partial Z^{l-1} = \text{max-unpool}(Z, \text{idxs}, \partial Z^{l}) \odot \partial\text{act-func}(Z^{l-1})$$
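Putting the caching and the unpooling together, here's a compact sketch assuming a 2x2, stride-2 max pool over a ReLU conv output; the function names, the ReLU choice, and the shapes are assumptions for illustration:

```python
import numpy as np

def maxpool_forward(Z, k=2, stride=2):
    """Max pool; caches the (row, col) index of each regional max for the backward pass."""
    out_h = (Z.shape[0] - k) // stride + 1
    out_w = (Z.shape[1] - k) // stride + 1
    out, idxs = np.zeros((out_h, out_w)), {}
    for i in range(out_h):
        for j in range(out_w):
            region = Z[i*stride:i*stride+k, j*stride:j*stride+k]
            r, c = np.unravel_index(np.argmax(region), region.shape)
            idxs[(i, j)] = (i*stride + r, j*stride + c)
            out[i, j] = region[r, c]
    return out, idxs

def max_unpool(dZ_pool, idxs, input_shape):
    """Scatter each pooled gradient back to the position of the max it came from."""
    dZ_mask = np.zeros(input_shape)
    for (i, j), (r, c) in idxs.items():
        dZ_mask[r, c] += dZ_pool[i, j]
    return dZ_mask

def relu_deriv(Z):
    return (Z > 0).astype(Z.dtype)

Z_prev = np.random.randn(4, 4)            # pre-activation of the conv layer l-1
A_prev = np.maximum(Z_prev, 0)            # its ReLU activation, fed into the pooling layer
pooled, idxs = maxpool_forward(A_prev)    # forward pass, caching the argmax indices
dZ_pool = np.random.randn(*pooled.shape)  # gradient w.r.t. the pooling layer's output

dZ_prev = max_unpool(dZ_pool, idxs, A_prev.shape) * relu_deriv(Z_prev)
```

If two pooling windows happen to select the same input position (possible with overlapping windows), the `+=` in `max_unpool` accumulates both gradients.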

The individual gradient values of $\partial Z^{l-1}$ are just the values of $\partial Z^{l}$ itself, as in the forward pass, when we compute $\text{max}(x \in X_{n:\mathcal{K}_{\text{row-size}},\, m:\mathcal{K}_{\text{col-size}}}), \forall\, m, n$, we're essentially returning the maximum value of the region of $X$ covered by the kernel, $\mathcal{K}$, with no weights applied to the input.

This can be seen as equivalent to having weights $= 1$ for the max-pooled values and weights $= 0$ for the eliminated values, such that the original formulation for what would've been the gradient,

$\partial Z^{l-1} = W^l \cdot \text{max-unpool}(Z, \text{idxs}, \partial Z^l)$, becomes

$\partial Z^{l-1} = 1 \cdot \text{max-unpool}(Z, \text{idxs}, \partial Z^l)$ for the non-zero gradients.

Backpropagation through Average-Pooling

Backprop through average-pooling is similar to, yet simpler than, backpropagating gradients through max-pooling layers.

Just as before, to get the gradients, $\partial Z^l$, assuming the current layer $l$ is the average-pooling layer, we can compute them as:

$$\partial Z^l = (W^{l+1})^T \partial Z^{l+1}$$

or if propagating back from a convolutional layer,

$$\partial Z^{l} = (\mathcal{K}^{l+1})^{\text{F}} \,\hat{*}\, \partial Z^{(l+1)}$$

Now, to propagate the gradient, $\partial Z^l$, back to a prior layer, say a convolutional layer, we have to upsample, just as was done when propagating gradients back through the max-pooling layer.

Though fortunately, this is simpler than max-unpooling, as we don't have to store the indices of the max-pooled values with respect to the original input in the cache.

Instead, given the kernel for the average-pooling layer, $\mathcal{K}^l$, and the gradients, $\partial Z^l$, we create an empty mask of the same dimensions as $Z^{l-1}$ (just as was done for the gradients through max pooling), call it $\partial Z_{mask}^{l-1}$. We slide $\mathcal{K}^l$ over $\partial Z_{mask}^{l-1}$, and for each region $r$ in the mask, $\partial Z_{mask_{r}}^{l-1}$, we extract the $r$th value in $\partial Z^l$ at position $i, j$ and evenly spread it out over the region.

If $\partial Z_{mask_{r}}^{l-1}$ had dimensions $2 \times 2$, or size $4$, and the $r$th value in $\partial Z^l$ was $8$, we'd divide $8$ by $4$, yielding $2$, and insert the $2$ into all $4$ slots of the current region.

If our stride is less than a given dimension of the kernel (whether height or width), at some point we will have overlapping regions into which to insert values of $\partial Z^l$. In that case, for the overlapping portions, we add the newly dispersed gradients, $(\frac{\partial Z^l}{r_{size}})^r$, to the previously dispersed gradients at the previous $r$, $(\frac{\partial Z^l}{r_{size}})^{r-1}$.

This is done for all possible regions, $r$, in $\partial Z_{mask}^{l-1}$; then $\partial Z_{mask}^{l-1} \rightarrow \partial Z^{l-1}$.

$$\partial Z^{l-1} = \partial Z_{mask_{r}}^{l-1} + \frac{\partial Z^{l}_{i, j}}{r_{size}}, \quad \forall\, (r, (i, j))$$
(this is an element-wise sum, where $\frac{\partial Z^{l}_{i, j}}{r_{size}}$ is broadcast over the region)
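A short sketch of this even spreading, assuming a 2x2 kernel with stride 2 (so there happens to be no overlap here); the names and shapes are illustrative. The first region reproduces the worked example above: $8 / 4 = 2$ spread over all four slots.

```python
import numpy as np

def avgpool_backward(dZ_pool, input_shape, k=2, stride=2):
    """Spread each pooled gradient evenly over its region; overlapping regions accumulate via +=."""
    dZ_mask = np.zeros(input_shape)
    region_size = k * k
    for i in range(dZ_pool.shape[0]):
        for j in range(dZ_pool.shape[1]):
            dZ_mask[i*stride:i*stride+k, j*stride:j*stride+k] += dZ_pool[i, j] / region_size
    return dZ_mask

dZ_pool = np.array([[8.0, 4.0],
                    [2.0, 6.0]])             # gradient w.r.t. the pooling layer's output
dZ_prev = avgpool_backward(dZ_pool, (4, 4))  # top-left 2x2 region is filled with 8 / 4 = 2
```

With a stride smaller than the kernel, the `+=` handles the overlapping regions by accumulating the newly dispersed gradients onto the previously dispersed ones, as described above.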

A Helpful Resource $\rightarrow$ LINK