Short Note on Matrix Differentials and Backpropagation

Mathematical notation is the convention we all use to express a concept in a concise formula, yet there is often more than one way to write the same thing. For example, we can use Leibniz’s notation $\frac{dy}{dx}$ to denote a derivative, while in physics we write $\dot{y}, \ddot{y}$ to abbreviate time derivatives. Similarly, to solve differential equations we can use the Laplace transform $F(s) = \int f(t) e^{-st}dt$, but instead of working from the definition, we can move to the frequency-domain representation and solve the differential equation with basic algebra. In this post, I’ll cover a notation for matrix differentials and show how to use differentials to derive backpropagation functions easily.

Differentials

Let’s first define the differential. Let a vector function $f$ be differentiable at $c$; its first-order Taylor approximation is

$$ f(c + u) = f(c) + f'(c)u + r(u) $$

where $r$ denotes the remainder. We denote by $\mathsf{d}f(c;u) = f'(c)u$ the differential of $f$ at $c$ with increment $u$. This can, of course, also be written in terms of partial derivatives [1]:

$$ \mathsf{d}f(c; u) = (\mathsf{D}f(c))u $$

where $\mathsf{D}_j f_i(c)$ denotes the partial derivative of $f_i$ with respect to the $j$-th coordinate at $c$, so $\mathsf{D}f(c)$ is the matrix whose $(i,j)$ entry is $\mathsf{D}_j f_i(c)$. This matrix is the Jacobian of $f$ at $c$, and it is the transpose of the gradient:

$$ \nabla f(c) = (\mathsf{D}f(c))^T $$
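To make this concrete, here is a minimal numpy sketch (the particular $f$, the point $c$, and the increment $u$ are my own picks for illustration). It checks that $(\mathsf{D}f(c))u$ matches $f(c+u) - f(c)$ up to a remainder much smaller than $\|u\|$.

```python
import numpy as np

def f(x):
    # An arbitrary vector function f: R^2 -> R^2, chosen for illustration
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def Df(x):
    # Hand-derived Jacobian: row i holds the partial derivatives of f_i
    return np.array([
        [x[1], x[0]],
        [np.cos(x[0]), 2.0 * x[1]],
    ])

c = np.array([0.3, -1.2])
u = 1e-4 * np.array([1.0, 2.0])      # small increment
df = Df(c) @ u                       # differential df(c; u) = (Df(c)) u
r = f(c + u) - f(c) - df             # remainder r(u)
print(np.linalg.norm(r))             # ~1e-8, much smaller than ||u|| ~ 1e-4
```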

Chain Rule

Let $h = g \circ f$. Then the differential of $h$ at $c$ is

\begin{align} \mathsf{d}h(c; u) & = (\mathsf{D}h(c))u \\ & = (\mathsf{D}g(f(c)))(\mathsf{D}f(c))u = (\mathsf{D}g(f(c)))\mathsf{d}f(c;u) \\ & = \mathsf{d}g(f(c); \mathsf{d}f(c;u)) \end{align}

We can further simplify the notation by writing $\mathsf{d}f$ instead of $\mathsf{d}f(c;u)$ whenever the point and the increment are clear from context.
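As a quick sanity check of the chain rule (again an illustrative sketch with hand-picked $f$ and $g$, nothing specific to the derivation above), the product of Jacobians $\mathsf{D}g(f(c))\,\mathsf{D}f(c)$ should agree with a finite-difference Jacobian of $h = g \circ f$:

```python
import numpy as np

def f(x):
    return np.array([x[0] ** 2, x[0] * x[1]])

def g(y):
    return np.array([np.exp(y[0]) + y[1], y[0] - y[1], y[0] * y[1]])

def Df(x):
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]]])

def Dg(y):
    return np.array([[np.exp(y[0]), 1.0],
                     [1.0, -1.0],
                     [y[1], y[0]]])

c = np.array([0.7, -0.4])
Dh_chain = Dg(f(c)) @ Df(c)          # chain rule: Dh(c) = Dg(f(c)) Df(c)

# Central-difference Jacobian of h = g ∘ f, built column by column
eps = 1e-6
Dh_fd = np.column_stack([
    (g(f(c + eps * e)) - g(f(c - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(np.max(np.abs(Dh_chain - Dh_fd)))  # ~1e-9
```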

Matrix Function

Now let’s extend this to matrix functions. Let $F: S \rightarrow \mathbb{R}^{m\times p}$ be a matrix function defined on $S \subseteq \mathbb{R}^{n \times q}$. The first-order Taylor approximation becomes

$$ \text{vec}F(C+U) = \text{vec} F(C) + F'(C) \text{vec}U + \text{vec}R(U) $$

We can denote the differential of $F$ at $C$ as

$$ \text{vec}\; \mathsf{d}F(C;U) = F'(C) \text{vec}U $$
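For a concrete instance (the map $F(X) = AXB$ and the shapes below are my own choices, not part of the derivation), the differential is $\mathsf{d}F(C;U) = AUB$ and the derivative in vec form is $F'(C) = B^T \otimes A$, which a few lines of numpy can confirm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((4, 5))
U = rng.standard_normal((4, 5))

def F(X):
    return A @ X @ B                         # linear matrix function F(X) = A X B

vec = lambda M: M.reshape(-1, order="F")     # column-major vec, matching the math

dF = F(C + U) - F(C)                         # equals A U B exactly, since F is linear
lhs = vec(dF)                                # vec dF(C; U)
rhs = np.kron(B.T, A) @ vec(U)               # F'(C) vec U with F'(C) = B^T ⊗ A
print(np.max(np.abs(lhs - rhs)))             # ~1e-14
```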

Matrix Backpropagation

Backpropagation passes the gradient of a scalar loss backwards through a composition of functions, and differentials make it easy to derive the backward rule for each layer. To set up the notation, define the inner product $A : B = \text{Tr}(A^TB) = \text{vec}(A)^T \text{vec}(B)$, i.e. the sum of all elements of $A \circ B$. If we let $Y = f(X)$ and $L$ be the final loss, then

\begin{align} \mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}f \\ & = \mathsf{D} L : \mathcal{L}(\mathsf{d}X) \\ & = \mathcal{L}^*(\mathsf{D} L) : \mathsf{d}X \end{align}

where we write $\mathsf{d}Y = \mathcal{L}(\mathsf{d}X)$ for the linear map defined by the differential and $\mathsf{D}L = \frac{\partial L}{\partial Y}$. So, given the gradient from the layer above, the gradient with respect to $X$ can easily be computed by finding the function $\mathcal{L}^*$, the adjoint of $\mathcal{L}$.

Example 1: Linear Function

Let $Y = f(X) = AX + b$. Then

\begin{align} \mathsf{d}(L \circ f) & = \mathsf{D} L : \mathsf{d}(AX + b) \\ & = \mathsf{D} L : A\mathsf{d}X = \text{Tr}(\mathsf{D}L (A \mathsf{d}X)^T) \\ & = \text{Tr}(\mathsf{D}L \mathsf{d}X^T A^T) = \text{Tr}(A^T \mathsf{D}L \mathsf{d}X^T) \\ & = A^T \mathsf{D}L : \mathsf{d}X \end{align}

Thus $\mathcal{L}^*(G) = A^TG$, and the gradient with respect to $X$ is $\frac{\partial L}{\partial X} = A^T \mathsf{D}L$.
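Here is a short numerical check (the Frobenius-norm loss, the target $T$, and the shapes are my own picks for illustration): backpropagating through $Y = AX + b$ with $\frac{\partial L}{\partial X} = A^T \mathsf{D}L$ should agree with a finite-difference derivative of the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))              # broadcast across the columns of AX
X = rng.standard_normal((4, 2))
T = rng.standard_normal((3, 2))              # arbitrary target for a toy loss

def loss(X):
    Y = A @ X + b
    return 0.5 * np.sum((Y - T) ** 2)        # L = 0.5 ||Y - T||_F^2

# Backward pass: DL = dL/dY = Y - T, then dL/dX = A^T DL
Y = A @ X + b
DL = Y - T
grad_X = A.T @ DL

# Finite-difference check of one entry of the gradient
eps = 1e-6
E = np.zeros_like(X)
E[2, 1] = 1.0
fd = (loss(X + eps * E) - loss(X - eps * E)) / (2 * eps)
print(grad_X[2, 1], fd)                      # should agree to ~6 digits
```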

Example 2: Constrained Optimization

We would like to solve the following constrained optimization problem.

\begin{equation*} \begin{aligned} & \underset{x}{\text{minimize}} & & f(x) \\ & \text{subject to} & & Ax = b. \end{aligned} \end{equation*}

The Lagrangian and the primal and dual feasibility equations are

$$ \mathcal{L}(x, \nu) = f(x) + \nu^T(Ax - b) \\ Ax^* = b, \;\; \nabla f(x^*) + A^T \nu^* = 0 $$

If we take the first-order approximation of the primal and dual feasibility equations, allowing the problem data $A$ and $b$ to vary as well,

\begin{align} Ax + \mathsf{d}(Ax) & = b + \mathsf{d}b\\ Ax + \mathsf{d}Ax + A\mathsf{d}x & = b + \mathsf{d}b\\ \nabla f(x) + A^T \nu + \mathsf{d}(\nabla f(x) + A^T \nu) & = 0 \\ \nabla f(x) + A^T \nu + \nabla^2 f(x) \mathsf{d}x + \mathsf{d}A^T \nu + A^T \mathsf{d}\nu & = 0 \end{align}

Or more concisely,

$$ \begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \mathsf{d}x \\ \mathsf{d}\nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^T \nu + \mathsf{d}A^T\nu \\ Ax - b + \mathsf{d}Ax - \mathsf{d}b \end{bmatrix} $$

With the problem data fixed ($\mathsf{d}A = 0$, $\mathsf{d}b = 0$), this is the same linear system solved in the infeasible start Newton method [2].
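To see this concretely, here is a small sketch for an equality-constrained quadratic objective (the problem data are my own, and I fix $\mathsf{d}A = 0$, $\mathsf{d}b = 0$). It takes one Newton step by solving the KKT system above; since the optimality conditions are linear in this case, a single step already drives both residuals to (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 2
P = rng.standard_normal((n, n))
P = P @ P.T + n * np.eye(n)                  # f(x) = 0.5 x^T P x + q^T x, so ∇²f = P
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

grad_f = lambda x: P @ x + q

# Start from an infeasible primal point and an arbitrary dual guess
x = rng.standard_normal(n)
nu = np.zeros(m)

# One Newton step: solve the linearized optimality conditions (dA = 0, db = 0)
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = -np.concatenate([grad_f(x) + A.T @ nu, A @ x - b])
step = np.linalg.solve(KKT, rhs)
x, nu = x + step[:n], nu + step[n:]

print(np.linalg.norm(A @ x - b))             # primal residual ~ 0 after one step
print(np.linalg.norm(grad_f(x) + A.T @ nu))  # dual residual ~ 0 (f is quadratic)
```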

Conclusion

In this post, we covered the notation for matrix differentials and matrix backpropagation. Simple notation can ease the burden of derivation and also lead to fewer mistakes.

References
