# Short Note on Matrix Differentials and Backpropagation

Mathematical notation is the convention we all use to express a concept in a concise mathematical formulation, yet there is often more than one way to write the same equation. For example, we can use Leibniz's notation $\frac{dy}{dx}$ to denote a derivative, while in physics $\dot{y}, \ddot{y}$ are used to abbreviate time derivatives. Similarly, to solve differential equations we can use the Laplace transform $F(s) = \int f(t) e^{-st}dt$; instead of working from the definition, we can move to the frequency-domain representation and solve differential equations with basic algebra. In this post, I'll cover a matrix differential notation and how to use differentials to derive backpropagation functions easily.

## Differentials

Let’s first define the differential. Let a vector function $f$ be differentiable at $c$; its first-order Taylor approximation is

$$f(c + u) = f(c) + f'(c)u + r(u)$$

where $r$ denotes the remainder. We write $\mathsf{d}f(c;u) = f'(c)u$, the differential of $f$ at $c$ with increment $u$. This can of course also be written using partial derivatives.

$$\mathsf{d}f(c; u) = (\mathsf{D}f(c))u$$

where $\mathsf{D}_j f_i(c)$ denotes the partial derivative of $f_i$ with respect to the $j$-th coordinate at $c$. The matrix $\mathsf{D}f(c)$ is the Jacobian matrix and it’s the transpose of the gradient of $f$ at $c$.

$$\nabla f(c) = (\mathsf{D}f(c))^T$$
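As a quick sanity check of the definition above, the following sketch (hypothetical helper names, not from the post) compares $\mathsf{d}f(c;u) = (\mathsf{D}f(c))u$ against the actual increment $f(c+u) - f(c)$ for $f(x_1, x_2) = (x_1 x_2,\; x_1 + x_2)$, whose Jacobian at $c$ is $[[c_2, c_1], [1, 1]]$.

```python
# Hypothetical sketch: check that df(c; u) = (Df(c)) u approximates
# f(c + u) - f(c) for a small increment u.

def f(x):
    return [x[0] * x[1], x[0] + x[1]]

def jacobian(c):
    # Analytic Jacobian of f at c.
    return [[c[1], c[0]], [1.0, 1.0]]

def differential(c, u):
    # df(c; u) = (Df(c)) u
    J = jacobian(c)
    return [sum(J[i][j] * u[j] for j in range(2)) for i in range(2)]

c = [2.0, 3.0]
u = [1e-6, -2e-6]

lhs = [f([c[0] + u[0], c[1] + u[1]])[i] - f(c)[i] for i in range(2)]
rhs = differential(c, u)

# The remainder r(u) is o(|u|), so lhs and rhs agree to ~|u|^2.
assert all(abs(lhs[i] - rhs[i]) < 1e-10 for i in range(2))
```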

### Chain Rule

Let $h = g \circ f$, the differential of $h$ at $c$ is

\begin{align} \mathsf{d}h(c; u) & = (\mathsf{D}h(c))u \\ & = (\mathsf{D}g(f(c)))(\mathsf{D}f(c))u = (\mathsf{D}g(f(c)))\,\mathsf{d}f(c;u) \\ & = \mathsf{d}g(f(c); \mathsf{d}f(c;u)) \end{align}

We can further simplify the notation by writing $\mathsf{d}f$ for $\mathsf{d}f(c;u)$ when the point and increment are unambiguous from context.
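The chain rule above can be checked numerically in the scalar case, where $\mathsf{d}h(c;u) = g'(f(c))\,\mathsf{d}f(c;u)$. A minimal sketch, with hypothetical function choices:

```python
import math

# Hypothetical sketch: verify the chain rule for differentials with
# f(x) = x^2 and g(y) = sin(y), so h = g o f.

def f(x):       return x * x
def df(c, u):   return 2.0 * c * u          # f'(c) u

def g(y):       return math.sin(y)
def dg(c, u):   return math.cos(c) * u      # g'(c) u

def h(x):       return g(f(x))

c, u = 0.7, 1e-7
dh = dg(f(c), df(c, u))                      # dg(f(c); df(c; u))

# Compare against the actual increment h(c + u) - h(c).
assert abs((h(c + u) - h(c)) - dh) < 1e-12
```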

### Matrix Function

Now let’s extend this to a matrix function. Let $F: S \rightarrow \mathbb{R}^{m\times p}$ be a matrix function defined on a set $S \subseteq \mathbb{R}^{n \times q}$. Its first-order Taylor approximation, in vectorized form, is

$$\text{vec}F(C+U) = \text{vec} F(C) + F'(C) \text{vec}U + \text{vec}R(U)$$

We can denote the differential of $F$ at $C$ as

$$\text{vec}\; \mathsf{d}F(C;U) = F'(C) \text{vec}U$$
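As a concrete instance, for $F(X) = XX$ the product rule gives $\mathsf{d}F(C;U) = CU + UC$ (the order matters since matrices do not commute). The sketch below (hypothetical helpers, plain $2\times 2$ lists) checks this against $F(C+U) - F(C)$:

```python
# Hypothetical sketch: for F(X) = X X, check dF(C; U) = C U + U C
# against the actual increment F(C + U) - F(C).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def madd(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

C = [[1.0, 2.0], [3.0, 4.0]]
U = [[1e-6, -1e-6], [2e-6, 1e-6]]

dF = madd(matmul(C, U), matmul(U, C))
increment = [[matmul(madd(C, U), madd(C, U))[i][j] - matmul(C, C)[i][j]
              for j in range(2)] for i in range(2)]

# The remainder R(U) = U U is O(|U|^2), so the two agree to ~1e-12.
assert all(abs(increment[i][j] - dF[i][j]) < 1e-10
           for i in range(2) for j in range(2))
```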

## Matrix Backpropagation

To apply differentials to backpropagation, we first need an inner product between matrices. Let $A : B = \text{Tr}(A^TB) = \text{vec}(A)^T \text{vec}(B)$, the sum of all elements of the Hadamard product $A \circ B$. If we let $Y = f(X)$ and $L$ be the final loss,

\begin{align} \mathsf{d} L \circ f & = \mathsf{D} L : \mathsf{d}f \\ & = \mathsf{D} L : \mathcal{L}(\mathsf{d}X) \\ & = \mathcal{L}^*(\mathsf{D} L) : \mathsf{d}X \end{align}

where we denote $\mathsf{d}Y = \mathcal{L}(\mathsf{d}X)$ and $\mathsf{D}L = \frac{\partial L}{\partial Y}$. So given gradients from the upper layer, the gradient with respect to $X$ can easily be computed by finding the function $\mathcal{L}^*$, the adjoint of $\mathcal{L}$.

### Example 1: Linear Function

Let $Y = f(X) = AX + b$, then,

\begin{align} \mathsf{d} L \circ f & = \mathsf{D} L : \mathsf{d}(AX + b) \\ & = \mathsf{D} L : A\mathsf{d}X = \text{Tr}(\mathsf{D}L (A \mathsf{d}X)^T) \\ & = \text{Tr}(\mathsf{D}L \mathsf{d}X^T A^T) = \text{Tr}(A^T \mathsf{D}L \mathsf{d}X^T) \\ & = A^T \mathsf{D}L : \mathsf{d}X \end{align}

Thus $\mathcal{L}^*(Y) = A^TY$.
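To make this concrete, here is a small numerical sketch (hypothetical matrices, plain-list arithmetic) for $Y = AX + b$ with $L$ the sum of the entries of $Y$. Then $\mathsf{D}L$ is the all-ones matrix, and the adjoint formula says the gradient with respect to $X$ is $A^T \mathsf{D}L$; we check one entry by finite differences.

```python
# Hypothetical sketch: backprop through Y = A X + b with L = sum(Y).
# The adjoint formula gives dL/dX = A^T (dL/dY), and dL/dY is all ones.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def loss(X):
    A = [[1.0, 2.0], [3.0, 4.0]]
    b = [[0.5, 0.5], [0.5, 0.5]]
    Y = matmul(A, X)
    return sum(Y[i][j] + b[i][j] for i in range(2) for j in range(2))

X = [[1.0, -1.0], [2.0, 0.0]]

# Adjoint formula: grad = A^T ones, so grad[0][0] = A[0][0] + A[1][0] = 4.
eps = 1e-6
Xp = [row[:] for row in X]
Xp[0][0] += eps
fd = (loss(Xp) - loss(X)) / eps   # finite-difference estimate
assert abs(fd - 4.0) < 1e-6
```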

### Example 2: Constrained Optimization

We would like to solve the following constrained optimization problem.

\begin{equation*} \begin{aligned} & \underset{x}{\text{minimize}} & & f(x) \\ & \text{subject to} & & Ax = b. \end{aligned} \end{equation*}

The Lagrangian and the primal and dual feasibility equations are

$$\mathcal{L}(x, \nu) = f(x) + \nu^T(Ax - b) \\ Ax^* = b, \;\; \nabla f(x^*) + A^T \nu^* = 0$$

If we take the first order approximation of the primal and dual feasibility equations,

\begin{align} Ax + \mathsf{d}(Ax) & = b + \mathsf{d}b\\ Ax + \mathsf{d}A\,x + A\,\mathsf{d}x & = b + \mathsf{d}b\\ \nabla f(x) + A^T \nu + \mathsf{d}(\nabla f(x) + A^T \nu) & = 0 \\ \nabla f(x) + A^T \nu + \nabla^2 f(x)\, \mathsf{d}x + \mathsf{d}A^T \nu + A^T \mathsf{d}\nu & = 0 \end{align}

Or more concisely,

$$\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \mathsf{d}x \\ \mathsf{d}\nu \end{bmatrix} = - \begin{bmatrix} \nabla f(x) + A^T \nu + \mathsf{d}A^T\nu \\ Ax - b + \mathsf{d}A\,x - \mathsf{d}b \end{bmatrix}$$

This is the same system that is solved at each step of the infeasible start Newton method.
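For a quadratic objective the KKT system above is exact, so a single solve reaches the optimum. A minimal sketch (hypothetical problem, tiny hand-rolled solver): minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$, so the Hessian is $2I$, $A = [1\; 1]$, and the solution is $x^* = (0.5, 0.5)$ with $\nu^* = -1$.

```python
# Hypothetical sketch: one KKT solve for an equality-constrained
# quadratic, starting from the infeasible point x = 0, nu = 0.

def solve3(M, rhs):
    # Gaussian elimination with partial pivoting for a 3x3 system.
    M = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for j in range(col, 4):
                M[r][j] -= factor * M[col][j]
    x = [0.0] * 3
    for r in range(2, -1, -1):
        x[r] = (M[r][3] - sum(M[r][j] * x[j] for j in range(r + 1, 3))) / M[r][r]
    return x

# KKT matrix [[H, A^T], [A, 0]]; at x = 0, nu = 0 the residual
# reduces to (0, 0, b) with b = 1.
KKT = [[2.0, 0.0, 1.0],
       [0.0, 2.0, 1.0],
       [1.0, 1.0, 0.0]]
x1, x2, nu = solve3(KKT, [0.0, 0.0, 1.0])
assert abs(x1 - 0.5) < 1e-12 and abs(x2 - 0.5) < 1e-12 and abs(nu + 1.0) < 1e-12
```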

## Conclusion

In this post, we covered the notation for matrix differentials and matrix backpropagation. Simple notation can ease the burden of derivation and also lead to fewer mistakes.
