# Expectation Maximization and Variational Inference (Part 1)

Statistical inference involves finding the right model and parameters that represent the distribution of observations well. Let $\mathbf{x}$ be the observations and $\theta$ be the unknown parameters of a ML model. In maximum likelihood estimation, we try to find the $\theta_{ML}$ that maximizes the probability of the observations using the ML model with the parameters:

$$ \hat{\theta}_{ML} = \underset{\theta}{\arg\!\max} \; p(\mathbf{x}; \theta) $$Typically, the problem requires few assumptions to solve the above optimization efficiently. One trick is to introduce latent variables $\mathbf{z}$ that break down the problem into smaller subproblems. For instance, in the Gaussian Mixture Model, we can introduce the cluster membership assignment as random variables $z_i$ for each datum $x_i$, which greatly simplifies the model ($p(x_i | z_i=k) \sim \mathcal{N}(\mu_k, \sigma_k)$).

$$ p(\mathbf{x};\theta) = \int p(\mathbf{x}, \mathbf{z}; \theta) d\mathbf{z} $$However, the above integration is, in many cases, intractable and can be either approximated using stochastic sampling (Monte Carlo methods) or we can simply bypass the computation using few assumptions. The second method is called variational inference, coined after the calculus of variations, which we will go over in this post.

## Evidence Lower Bound (ELBO)

There are many great tutorials for variational inference, but I found the tutorial by Tzikas et al.^{1} to be the most helpful. It follows the steps of Bishop et al.^{2} and Neal et al.^{3} and starts the introduction by formulating the inference as the Expectation Maximization. Here, we will summarize the steps in Tzikas et al.^{1} and elaborate some steps missing in the paper. Let $q(z)$ be a probability distribution on $z$. Then,

where $F(q, \theta)$ known as the evidence lower bound or ELBO, or the negative of the variational free energy. $KL(\cdot || \cdot)$ is the Kullback-Leibler divergence. Since the KL-divergence is non-negative,

$$ \ln p(x; \theta) \ge F(q, \theta) $$The ELBO provides a lower bound for the marginal likelihood. Instead of maximizing the marginal likelihood directly, the Expectation Maximization (EM) and variational inference maximize the variational lower bound.

## Expectation Maximization

Let’s assume that we can find $p(z | x; \theta^{OLD})$ analytically (for the Gaussian Mixture Model, this is just a softmax). Then, we can simply substitute $q(z) = p(z | x; \theta^{OLD})$. The ELBO becomes

$$ \begin{align*} F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\ & = \int q(z) \ln p(x, z; \theta) dz - \int q(z) \ln q(z) dz \\ & = \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz \\ & \quad - \int p(z | x; \theta^{OLD}) \ln p(z | x; \theta^{OLD}) dz \\ & = Q(\theta, \theta^{OLD}) + H(q) \end{align*} $$The second term $H(z|x)$ is the entropy of $z$ given $x$ and is a function of $\theta^{OLD}$. It is a constant with respect to $\theta$ and we do not take the term into account while maximizing the ELBO.

The EM algorithm can be succinctly summarized as ${\arg\!\max}_\theta Q(\theta, \theta^{OLD})$.

- E-step: compute $p(z | x; \theta^{OLD})$
- M-step: evaluate ${\arg\!\max}_\theta \int p(z | x; \theta^{OLD}) \ln p(x, z; \theta) dz$

For example, the EM for the Gaussian Mixture Model consists of an expectation step where you compute the soft assignment of each datum to K clusters, and a maximization step which computes the parameters of each cluster using the assignment. However, for complex models, we cannot use the EM algorithm.

## Variational Expectation Maximization

For a simple model, an analytical solution for $p(z | x; \theta^{OLD})$ exists and thus computing $q(z) = p(z | x; \theta^{OLD})$ is tractable. However, it is not possible in general as the model gets more complex. Instead, we approximate the posterior probability using a simpler model. For example, we assume that a set of latent variables is independent of the rest of the latent variables given $x$. Such independence reduces complexity and allows us to deduce the analytic form of the EM.

We can even enforce full independence among all latent variables given $x$, i.e., $z_i,\perp z_j$ for $i \neq j$. This assumption, known as the mean field approximation, allows us to compute the update rules for each latent variable in isolation and has been successful in many problems. We will go over variational inference using the mean field approximation, but the following technique can be used for models with more complex dependency.

Let $q(z) = \prod_i q(z_i)$. Then, the ELBO can be factorized into $z_j$ and the rest of the latent variables.

$$ \begin{align*} F(q, \theta) & = \int q(z) \ln \Big( \frac{p(x, z; \theta)}{q(z)} \Big) dz \\ & = \int \prod_i q(z_i) \ln p(x, z; \theta) dz - \sum_i \int q(z_i) \ln q(z_i) dz_i \\ & = \int q(z_j) \int \Big( \prod_{i \neq j} q(z_i) \ln p(x, z; \theta) \Big) \prod_{i \neq j} dz_i dz_j \\ & \quad - \int q(z_j) \ln q(z_j) dz_j - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\ & = \int q(z_j) \ln \Big( \frac{\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})}{q(z_j)} \Big) dz_j \\ & \quad - \sum_{i \neq j} \int q(z_i) \ln q(z_i) dz_i \\ & = \int q(z_j) \ln \Big( \frac{\tilde{p}_{i\neq j}}{q(z_j)} \Big) dz_j + H(z_{i\neq j}) + c\\ & = - KL(q_j || \tilde{p}_{i\neq j}) + H(z_{i\neq j}) + c \end{align*} $$where $\langle \cdot \rangle_i$ indicates the expectation over the latent variable $z_i$. Since $\exp(\langle \ln p(x, z; \theta)\rangle_{i \neq j})$ is not a proper pdf, the constant $c$ is added to adjust it to become a proper pdf. Since the KL-divergence is non-negative, the ELBO is maximized when $KL(\cdot || \cdot) = 0$ which happens when $q(z_j) = \tilde{p}_{i\neq j} = \frac{1}{Z} \exp \langle \ln p(x, z; \theta)\rangle_{i \neq j}$.

Similarly, in the variation EM,

- E-step: evaluate $q^*(z_j) = \frac{1}{Z} \exp \langle \ln p(x, z; \theta)\rangle_{i \neq j}$ for all $j$,
- $q^{NEW} = \prod_i q_i^*$

- M-step: find $\theta = {\arg\!\max}_\theta F(q^{NEW}, \theta)$

In practice, $q^*$ is the optimal probability that maximizes $F(q, \theta)$. And $q^*$ has the form of known probability distribution functions. Thus, $\theta^{NEW}$ would simply be the parameters of the probability distribution function after factorizing the other probability distribution. However, if the function $q^*$ cannot be simplified into a known form, solving the KKT condition and setting the derivative of the ELBO would give you a solution.

## Conclusion

The variational EM gives us a way to bypass computing the partition function and allows us to infer the parameters of a complex model using a deterministic optimization step. In the next post, I will give a concrete example with a simple Gaussian Mixture Model.

# References

D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, The Variational Approximation for Bayesian Inference, IEEE Signal Processing Magazine, Nov 2008 ↩ ↩

^{2}C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006 ↩

R.M. Neal and G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse and other variants, Learning in Graphical Models, 1998 ↩

## Leave a Comment