Residuals

Juan Vera

November 2024

Abstract

Here's how we always have a gradient of at least $\frac{∂L}{∂x_{l+2}}$ for the $l$th layer when there is a residual connection at every other $l$, and why we need the identity transformation to maintain this important feature of residual networks, for deep networks.

Feel free to use this as a resource for understanding "Identity Mappings in Deep Residual Networks".

Forward

$$
x_2 = \text{ReLU}(F_1(x_1))
$$

$$
x_3 = \text{ReLU}(F_2(x_2) + x_1) \tag{1}
$$

$$
x_4 = \text{ReLU}(F_3(x_3))
$$

$$
x_5 = \text{ReLU}(F_4(x_4) + x_3) \tag{2}
$$

$$
L = \text{CrsEnt}(p \cdot \log(x_5))
$$
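As a minimal sketch of this forward pass (assuming PyTorch, a 4-unit width, a plain linear map for each $F_l$, and reading $\text{CrsEnt}(p \cdot \log(x_5))$ as $-\sum p \log(x_5)$; all of these are illustrative choices, not part of the derivation):

```python
# Minimal sketch of the forward pass above.
# Width, the linear form of each F_l, and the cross-entropy expression
# are illustrative assumptions.
import torch

torch.manual_seed(0)
d = 4
W = [torch.randn(d, d) * 0.1 for _ in range(4)]   # weights of F_1 .. F_4
F = lambda i, x: W[i] @ x                         # F_{i+1}(x) as a plain linear map

x1 = torch.randn(d, requires_grad=True)
p = torch.softmax(torch.randn(d), dim=0)          # target distribution

x2 = torch.relu(F(0, x1))
x3 = torch.relu(F(1, x2) + x1)                    # (1): skip from x1
x4 = torch.relu(F(2, x3))
x5 = torch.relu(F(3, x4) + x3)                    # (2): skip from x3
L = -(p * torch.log(x5 + 1e-8)).sum()             # assumed form of CrsEnt(p · log x5)
```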

Backward

$$
\frac{∂L}{∂x_5} = x_5 - p
$$

$$
\frac{∂L}{∂x_4} = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})
$$

$$
\frac{∂L}{∂x_3} = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3}) + (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_3}) = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3}) + (\frac{∂L}{∂x_5})(\frac{∂F_4(x_4)}{∂x_3} + 1) \tag{∂2}
$$

$$
\frac{∂L}{∂x_2} = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3})(\frac{∂x_3}{∂x_2})
$$

$$
\frac{∂L}{∂x_1} = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3})(\frac{∂x_3}{∂x_2})(\frac{∂x_2}{∂x_1}) + (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3})(\frac{∂x_3}{∂x_1}) = (\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3})(\frac{∂x_3}{∂x_2})(\frac{∂x_2}{∂x_1}) + (\frac{∂L}{∂x_3})(\frac{∂F_2(x_2)}{∂x_1} + 1) \tag{∂1}
$$

Notice that:

$$
(\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_3}) = (\frac{∂L}{∂x_5})(\frac{∂F_4(x_4)}{∂x_3} + 1)
$$

and

$$
(\frac{∂L}{∂x_5})(\frac{∂x_5}{∂x_4})(\frac{∂x_4}{∂x_3})(\frac{∂x_3}{∂x_1}) = (\frac{∂L}{∂x_3})(\frac{∂F_2(x_2)}{∂x_1} + 1)
$$

as when we take the gradient w.r.t. $x_1$ or $x_3$ propagating through the residual connection, we end up taking the gradient of $x_1$ and $x_3$ with respect to itself, which is equal to $1$, so the gradient becomes $\frac{∂F_{l+1}(x_{l+1})}{∂x_l} + 1$.

When we multiply with the other gradient that forms the chain rule, $\frac{∂L}{∂x_5}$, $\frac{∂L}{∂x_3}$, or more generally $\frac{∂L}{∂x_{l+2}}$, we're able to propagate back a gradient of at least $\frac{∂L}{∂x_{l+2}}$.
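A quick numerical check of that claim, treating a single block as $y = F(x) + x$ without the outer ReLU (dropping the post-addition nonlinearity is an assumption made so the identity term survives cleanly; the linear $F$ and the squared loss are also arbitrary):

```python
# Sketch: the gradient arriving at x through y = F(x) + x is the upstream
# gradient dL/dy plus a term through F -- so we always get back at least dL/dy.
import torch

torch.manual_seed(0)
d = 4
Wf = torch.randn(d, d) * 0.1                      # assumed linear F(x) = Wf x
x = torch.randn(d, requires_grad=True)

y = Wf @ x + x                                    # residual block with identity skip
L = (y ** 2).sum()                                # any scalar loss

dL_dy, = torch.autograd.grad(L, y, retain_graph=True)
dL_dx, = torch.autograd.grad(L, x)

through_F = Wf.t() @ dL_dy                        # chain-rule term through F alone
assert torch.allclose(dL_dx, through_F + dL_dy)   # the skip contributes dL/dy exactly
```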

Simpler example:

$$
\frac{∂y}{∂x_l} = \frac{∂}{∂x_l}(F_{l+1}(x_{l+1}) + x_l) = \frac{∂F_{l+1}(x_{l+1})}{∂x_l} + \frac{∂x_l}{∂x_l} = \frac{∂F_{l+1}(x_{l+1})}{∂x_l} + 1
$$
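The same identity, checked with autograd on a full Jacobian (the elementwise $F$ and the 3-dimensional input are arbitrary choices for illustration):

```python
# Sketch: the Jacobian of y = F(x) + x equals the Jacobian of F plus the identity.
import torch
from torch.autograd.functional import jacobian

F = lambda x: torch.tanh(x) ** 2                  # any differentiable F
x = torch.randn(3)

J_block = jacobian(lambda v: F(v) + v, x)         # ∂y/∂x for the residual block
J_F = jacobian(F, x)                              # ∂F/∂x alone
assert torch.allclose(J_block, J_F + torch.eye(3))
```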

This exposes the inner mechanics behind computing gradients for $x_3$ and $x_1$ -- where above we had $\frac{∂y}{∂x_l}$, in the context of the former example, at equation $(∂1)$, it's equivalent to $\frac{∂x_3}{∂x_1}$, where $y = x_3$. Correspondingly, for equation $(∂2)$, $\frac{∂y}{∂x_l} = \frac{∂x_5}{∂x_3}$.

Hence, for residual connections, $F_{l+1}(x_{l+1}) + x_l$, the gradient with respect to $x_l$ will always be $\frac{∂F_{l+1}(x_{l+1})}{∂x_l} + 1$, that is, if we have the $I$ transformation for the residual connection, and then the overall gradient w.r.t. $x_l$ will always be $≥ \frac{∂L}{∂x_{l+2}}$. (Worst case scenario, it is equivalent to $\frac{∂L}{∂x_{l+2}}$ if $\frac{∂F_{l+1}(x_{l+1})}{∂x_l} = 0$.)

Thereby, for the layers that include a residual $I$ connection, in this case those which involve $F_4$ and $F_2$, we'll always have a gradient $≥ \frac{∂L}{∂x_{l+2}}$, such that we end up mitigating the vanishing gradient problem for those layers, and perhaps earlier layers, though this depends on how large your residual block is.

You can view this as a gradient "highway".

The deeper your residual block is, the more layers the gradients will backpropagate through without the alleviating residual connection, such that your set of gradients can still vanish until they meet a residual connection.
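A small sketch of that effect (the width, depth, weight scale, and tanh branch are all arbitrary assumptions): stack the same layers with and without an identity skip around every pair, and compare the gradient norm that reaches the input.

```python
# Sketch: gradient norm at the input of a deep plain stack vs. the same stack
# with an identity skip around every pair of layers.
import torch

torch.manual_seed(0)
d, depth = 8, 40
Ws = [torch.randn(d, d) * 0.3 / d ** 0.5 for _ in range(depth)]

def loss(x, residual):
    for i in range(0, depth, 2):
        y = Ws[i + 1] @ torch.tanh(Ws[i] @ x)     # two-layer block F
        x = y + x if residual else y              # identity skip vs. plain stack
    return x.sum()

for residual in (False, True):
    x = torch.randn(d, requires_grad=True)
    loss(x, residual).backward()
    print(f"residual={residual}: |dL/dx_1| = {x.grad.norm():.3e}")
```

With these particular (assumed) scales, the plain stack's gradient collapses by several orders of magnitude, while the residual version stays on the order of the upstream gradient.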

Now consider:

$$
x_{l+1} = \lambda_l x_l + F(x_l)
$$

where $\lambda_l$ is any scalar that changes the magnitude of the residual connection.

Note that a residual connection can be recursively defined for any layer $L$ as:

$$
x_L = x_l + \sum_{i = l}^{L - 1} F(x_i, \mathcal{W}_i)
$$
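A quick check of this unrolled form against the layer-by-layer recursion $x_{i+1} = x_i + F(x_i, \mathcal{W}_i)$ (pre-activation blocks, i.e. no ReLU after the addition, and a ReLU-on-a-linear-map $F$ are assumed here):

```python
# Sketch: the unrolled form x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i) matches the
# recursion x_{i+1} = x_i + F(x_i, W_i).
import torch

torch.manual_seed(0)
d, n_blocks = 4, 3
W = [torch.randn(d, d) * 0.1 for _ in range(n_blocks)]
F = lambda x, Wi: torch.relu(Wi @ x)              # assumed residual branch

x_l = torch.randn(d)
x, acc = x_l, torch.zeros(d)
for Wi in W:
    acc = acc + F(x, Wi)                          # accumulate Σ F(x_i, W_i)
    x = x + F(x, Wi)                              # recursion x_{i+1} = x_i + F(x_i, W_i)

assert torch.allclose(x, x_l + acc)               # unrolled form holds
```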

Now, introducing $\lambda$ and following the recursive expression of the residual connections up to layer $L$:

$$
x_L = (\prod_{i = l}^{L - 1} \lambda_i) x_l + \sum_{i = l}^{L - 1} \hat{F}(x_i, \mathcal{W}_i)
$$

where $\hat{F}$ absorbs the $\lambda$'s, backpropagation is defined as:

$$
\frac{∂L}{∂x_l} = \frac{∂L}{∂x_L}((\prod_{i=l}^{L-1}\lambda_i) + \frac{∂}{∂x_l}\sum_{i = l}^{L - 1}\hat{F}(x_i, \mathcal{W}_i))
$$

Note that the scalar $1$ has turned into the scalar $\prod_{i=l}^{L-1}\lambda_i$, which can be arbitrarily big or small.
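As a sketch of the backpropagation expression above with the residual branch zeroed out ($F \equiv 0$ is an assumption made purely to isolate the identity path), autograd reproduces the $\prod_{i=l}^{L-1} \lambda_i$ factor exactly:

```python
# Sketch: with F ≡ 0, dL/dx_l = (Π λ_i) · dL/dx_L exactly.
import torch

lambdas = [0.9, 1.1, 0.8, 1.05]                   # arbitrary per-layer scalars
x_l = torch.randn(4, requires_grad=True)

x = x_l
for lam in lambdas:
    x = lam * x                                   # x_{i+1} = λ_i x_i + F(x_i), with F ≡ 0
L = x.sum()                                       # so dL/dx_L is all ones
L.backward()

prod = torch.tensor(lambdas).prod()
assert torch.allclose(x_l.grad, prod * torch.ones(4))
```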

From now on:

$$
\prod_{i = l}^{L - 1} \lambda_i = \lambda
$$

$I$ = Identity

For a smaller $\lambda$, the backpropagated gradient, $\frac{∂L}{∂x_L}$, will be scaled down into a smaller value, such that vanishing gradients can become an issue.

We lose the property of the gradient being $≥ \frac{∂L}{∂x_L}$.

For a larger $\lambda$, the backpropagated gradient can become exponentially larger the more layers we backpropagate through, such that exploding gradients diminish the quality of the model.

If $\lambda$ remains $1$, as it is in the $I$ transformation, the gradient will remain unscaled as it is backpropagated through the network, such that $I$ residual connections become extremely important.
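A quick numeric illustration of these three regimes (the depth of $50$ and the sample $\lambda$ values below are arbitrary):

```python
# Sketch: the coefficient Π λ_i on the identity path over 50 layers.
for lam in (0.9, 1.0, 1.1):
    print(f"λ = {lam}: identity term scaled by {lam ** 50:.3e}")
# λ = 0.9 → ~5.2e-03 (vanishing), λ = 1.0 → 1.0 (preserved), λ = 1.1 → ~1.2e+02 (exploding)
```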

This is why having residual connections be a transformation, such as $1 \times 1$ convolutions or some other transformation, is damaging to backpropagation. You reduce the expressiveness of a deep network via Shattered Gradients.