
Variational Autoencoders

Variational autoencoders are unsupervised generative models.

  • By “unsupervised”, we mean that they train with unlabeled data.
  • By “generative”, we mean that variational autoencoders are capable of generating new data which are not in their training set.

The basic steps to construct a generative model are:

  1. Collect an unlabeled training set $x_1, x_2, \ldots, x_N$.
  2. Create a pdf which assigns high probability to the region of the space from which the elements of the training set come. This naturally lowers the probability of other regions.
  3. Sample from this pdf to generate new data. This new data will be similar to the data in the training set but will not be one of them.
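As a toy illustration of these three steps (the data source and the choice of a diagonal gaussian as the pdf family are made-up assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a hypothetical unlabeled training set of 2-D points.
train = rng.normal(loc=[3.0, -1.0], scale=[0.5, 2.0], size=(1000, 2))

# Step 2: fit a pdf (here: a diagonal gaussian) by maximum likelihood,
# so that high probability is assigned where the training data lives.
mu = train.mean(axis=0)      # ML estimate of the mean
sigma = train.std(axis=0)    # ML estimate of the standard deviation

# Step 3: sample the fitted pdf to generate new, similar data points.
new_points = mu + sigma * rng.normal(size=(5, 2))
print(new_points.shape)      # (5, 2)
```

The generated points resemble the training data but are not members of it, which is exactly the sense of “generative” above.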

Before explaining variational autoencoders any further, we first have to explain latent variable models.

Latent variable models

Latent variable models assume the existence of a lower dimensional space, called the “latent space”, which summarizes the data set. In other words, there exists a pdf $p(x|z)$, where $x$ is a point in the data set space and $z$ is a point in the latent space. The relation between $x$ and $z$ is generally constructed by using a neural network.

An example for latent variable models is a pdf of the weights of some primary school teachers and their students: whether a person is a teacher or a student is a natural latent variable for such data.

In image processing, latent variables correspond to a description of the image. They are more efficient to work with than raw pixels.

Conditioned on the latent variables, we can make parts of the observation space independent of other parts.

Generative modelling with latent space

To generate new data (i.e., images), we need to sample $p(x)$, where $x$ is a data point. If we have a latent space, this is a two-step process:

  • Sample $z \sim p(z)$.
    $p(z)$ is assumed to be fully known. It is generally taken to be the standard normal, i.e., $p(z)=\mathscr{N}(z; 0, 1)$. $p(z)$ is chosen to be a peaked distribution, as we want to push the points in the latent space close together. The reason for this will be explained later.
    Sometimes $p(z)$ is also optimized/learned. This approach is called “learning the prior”. We will not discuss this approach.
  • Sample $x \sim p_{\theta}(x|z)$
    True to the form of parametric estimation theory, the functional form of $p_{\theta}(x|z)$ is assumed to be known, but the parameters $\theta$ are not known and have to be estimated.
    In nearly all situations, $p_{\theta}(x|z)$ is implemented as follows:
    — The mapping from $z$ to $x$ is achieved by a neural network with weights $\theta$, $x=NN_{\theta}(z)$.
    — The probability of $x$ given $z$ is computed by using a gaussian $p_{\theta}(x|z) = \mathscr{N}(x; \mu, I)$ with mean $\mu = NN_{\theta}(z)$. The variance of this gaussian can be chosen as $I$. We will discuss the choice of the mean $\mu$ later.
    — Or, alternatively, the neural net may have a sigmoid last layer which outputs probabilities, and we flip a coin for each output to sample from these probabilities.
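The two sampling steps can be sketched as follows (the decoder here is an untrained single tanh layer with made-up dimensions, standing in for $NN_{\theta}$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoder: a fixed random single layer mapping z -> x.
W = rng.normal(size=(2, 8))      # latent dim 2 -> data dim 8
b = rng.normal(size=8)

def decoder(z):
    # Stands in for NN_theta(z); the weights play the role of theta.
    return np.tanh(z @ W + b)

# Step 1: sample z ~ p(z) = N(0, I).
z = rng.normal(size=2)

# Step 2: sample x ~ p_theta(x|z) = N(NN_theta(z), I).
x = decoder(z) + rng.normal(size=8)
print(x.shape)   # (8,)
```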

Training is done via the maximum likelihood (ML) principle: we consider the data points $x_i$ to be independent of each other, and we arrange the parameters $\theta$ such that the probability of the training set is maximized

\begin{eqnarray}
\underset{\theta}{\mathrm{max}} \,\, p_{\theta}(x_1)p_{\theta}(x_2) \ldots p_{\theta}(x_N) = \underset{\theta}{\mathrm{max}} \, \prod_{i=1}^N p_{\theta}(x_i)
\end{eqnarray}

Because we prefer to work with sums, we take the logarithm

\begin{eqnarray}
\underset{\theta}{\mathrm{max}} \, \log \left( \prod_{i=1}^N p_{\theta}(x_i) \right) = \underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log p_{\theta}(x_i)
\end{eqnarray}
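Beyond convenience, working with logarithms is numerically necessary: the product of many small likelihood values underflows in floating point, while the sum of logs stays finite. A quick sketch with made-up likelihood values:

```python
import numpy as np

# 1000 hypothetical per-point likelihood values, each fairly small.
probs = np.full(1000, 1e-3)

product = np.prod(probs)           # underflows to exactly 0.0
log_sum = np.sum(np.log(probs))    # finite: 1000 * log(1e-3)

print(product, log_sum)
```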

Recalling the law of total probability,

\begin{eqnarray}
p_{\theta}(x_i)  &=&  \sum_z p_{\theta}(x_i | z) p(z)
\end{eqnarray}

we arrive at the main object for optimization

\begin{eqnarray}
\underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \sum_z p_{\theta}(x_i|z)p(z)
\end{eqnarray}

The objective above can be evaluated in two ways:

  1. If $z$ can take only a small set of values, the exact objective can be calculated by summing over all of them.
  2. Otherwise, we resort to sampling.
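
Alternative (1) is straightforward; a minimal sketch with a made-up three-valued latent variable:

```python
import numpy as np

# Hypothetical discrete latent variable with 3 values and known prior p(z).
p_z = np.array([0.5, 0.3, 0.2])

# p(x|z) for one fixed data point x, one entry per latent value (made up).
p_x_given_z = np.array([0.10, 0.40, 0.05])

# Exact evaluation of p(x) = sum_z p(x|z) p(z) -- no sampling needed.
p_x = np.sum(p_x_given_z * p_z)
print(p_x)   # 0.05 + 0.12 + 0.01 = 0.18
```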

We will investigate alternative (2) below. While we are at it, we will also review the theory of importance sampling.

Importance Weighted Autoencoder (IWAE)

To evaluate the above formula, we sample $p(z)$ $K$ times for each $x_i$ to generate samples $z_i^{(k)} \sim p(z)$ and evaluate

\begin{eqnarray}
\underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \sum_z p_{\theta}(x_i|z)p(z) \approx \underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \frac{1}{K} \sum_{k=1}^K p_{\theta}(x_i|z_i^{(k)})\quad,\quad z_i^{(k)} \sim p(z)
\end{eqnarray}

The problem with this is that $p(z)$ may not be the best distribution to sample from when evaluating $p(x|z)$: the $z$s we sample may not be the ones that could have produced $x$. This results in slow convergence. So we will instead sample from another distribution, $z_i^{(k)} \sim q(z)$, whose properties we do not know yet, which we hope will give faster convergence.

\begin{eqnarray}
\underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \sum_z p_{\theta}(x_i|z)p(z) &=&  \underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \sum_z \frac{q(z)}{q(z)} p_{\theta}(x_i|z)p(z) \\
&\approx& \underset{\theta}{\mathrm{max}} \sum_{i=1}^N \log \frac{1}{K} \sum_{k=1}^K \frac{p(z_i^{(k)})}{q(z_i^{(k)})}p_{\theta}(x_i|z_i^{(k)})\quad,\quad z_i^{(k)} \sim q(z)
\end{eqnarray}
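To see why the choice of $q(z)$ matters, here is a toy 1-D check (the model and all constants are made-up choices): sampling the prior gives a high-variance estimate of $p(x)$, while importance sampling from the posterior, which is available in closed form for this gaussian model, is essentially exact.

```python
import numpy as np

rng = np.random.default_rng(2)
x, s = 3.0, 0.1          # observed point; likelihood std (toy choices)

def norm_pdf(v, m, sd):
    return np.exp(-0.5 * ((v - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Model: p(z) = N(0,1), p(x|z) = N(x; z, s^2), so p(x) = N(x; 0, 1 + s^2).
true_px = norm_pdf(x, 0.0, np.sqrt(1.0 + s**2))

K = 1000
# (a) Sample the prior p(z): most z's are far from x, so the estimate
# of p(x) = E_{z~p(z)}[p(x|z)] is dominated by a few lucky samples.
z = rng.normal(size=K)
est_prior = np.mean(norm_pdf(x, z, s))

# (b) Importance sampling with q(z) = the posterior p(z|x), which this
# conjugate gaussian toy model gives in closed form.
post_var = 1.0 / (1.0 + 1.0 / s**2)
post_mu = (x / s**2) * post_var
zq = rng.normal(post_mu, np.sqrt(post_var), size=K)
w = norm_pdf(zq, 0.0, 1.0) / norm_pdf(zq, post_mu, np.sqrt(post_var))
est_is = np.mean(w * norm_pdf(x, zq, s))

print(true_px, est_prior, est_is)
```

With $q$ equal to the posterior, every weighted sample equals $p(x)$ exactly, so the estimator has zero variance; this is the ideal case the theory below aims at.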

How do we choose the best $q(z)$ for a given $x_i$? The theory of importance sampling says that $q(z)$ works best when it is proportional to $p(z)p_{\theta}(x_i|z)$, or, equivalently, to $p_{\theta}(z|x_i)$:

\begin{eqnarray}
q_i(z) &\propto& p(z)p_{\theta}(x_i|z) \\
&=& p_{\theta}(z|x_i)p_{\theta}(x_i)  \\
&\propto& p_{\theta}(z|x_i)
\end{eqnarray}

(When passing from line 2 to line 3 above, we can ignore $p_{\theta}(x_i)$, as it is not a function of $z$ and hence only acts as a scaling factor.)

Therefore we have to choose $q(z)$ to be close to the distribution $p_{\theta}(z|x_i)$. But this implies that for each $x_i$ we must have a different $q(z)$. Notationally, we indicate the $q(z)$ corresponding to $x_i$ with $q_i(z)$. Now the samples $z_i^{(k)}$ are drawn from $q_i(z)$. Note that if we consider the $x_i$ as images, this means that each image will have a separate $q_i(z)$ trained for itself.

The closeness between $q_i(z)$ and $p_{\theta}(z|x_i)$ can be established by minimizing the KL distance $\mathrm{KL}(q_i(z)||p_{\theta}(z|x_i))$. One way of making this distance small is:

  1. Choose $q_i(z)$ to be a parametrized distribution which is easy to work with, say a gaussian.
  2. Fit a different $q_i(z)$ to each $x_i$ by minimizing the distance $\mathrm{KL}(q_i(z)||p_{\theta}(z|x_i))$. If $q_i(z)$ is chosen to be a gaussian, we fit the mean and the variance of that gaussian.
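
As a sketch of step 2, suppose the target $p_{\theta}(z|x_i)$ happened to be a 1-D gaussian (a made-up stand-in; the real target has no closed form). The KL between two gaussians is then available in closed form, and gradient descent on the mean and standard deviation of $q_i(z)$ drives it onto the target:

```python
import numpy as np

# Target "posterior" p(z) = N(2, 0.5^2); we fit q(z) = N(m, s^2) to it
# by gradient descent on the closed-form KL(q || p). All numbers are
# toy choices.
m0, s0 = 2.0, 0.5
m, s = 0.0, 1.0        # initial guess for q

def kl(m, s):
    return np.log(s0 / s) + (s**2 + (m - m0) ** 2) / (2 * s0**2) - 0.5

lr = 0.05
for _ in range(200):
    grad_m = (m - m0) / s0**2          # dKL/dm
    grad_s = -1.0 / s + s / s0**2      # dKL/ds
    m -= lr * grad_m
    s -= lr * grad_s

print(m, s, kl(m, s))   # approaches 2, 0.5, 0
```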

The above discussion for IWAE suggests the following algorithm:

  1. For each $x_i$, consider $q_i(z)$ as a gaussian and pick arbitrary initial $\mu_i$, $\sigma_i$. Also, pick an arbitrary initial $\theta$ for $p_{\theta}(z|x_i)$ and $p_{\theta}(x_i|z)$. Consider $p(z)$ as a gaussian with mean zero and variance 1.
  2. For each $x_i$, sample $q_i(z)$ $K$ times, to obtain $z_i^{(k)}$.
  3. Use $x_i$ and $z_i^{(k)}$ to evaluate
    \begin{eqnarray}
    \sum_{i=1}^N \log \frac{1}{K} \sum_{k=1}^K \frac{p(z_i^{(k)})}{q_i(z_i^{(k)})}p_{\theta}(x_i|z_i^{(k)})\quad,\quad z_i^{(k)} \sim q_i(z)
    \end{eqnarray}
  4. Do a gradient-ascent step on the $\theta$’s.
  5. With the newly found $\theta$’s, do a gradient-descent step on $\mathrm{KL}(q_i(z)||p_{\theta}(z|x_i))$ to find new $\mu_i$, $\sigma_i$ for $q_i(z)$.
  6. Go to step 2 until convergence occurs.
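
The loop above can be sketched on a toy conjugate model where every quantity is checkable: $p(z)=\mathscr{N}(0,1)$ and $p_{\theta}(x|z)=\mathscr{N}(\theta z, 1)$ with a scalar $\theta$. In this gaussian model the KL-minimizing $q_i(z)$ of step 5 is the exact posterior, available in closed form, and the gradient of step 4 is taken by finite differences to keep the sketch short (a real implementation would use backpropagation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: p(z) = N(0,1), p_theta(x|z) = N(theta*z, 1), scalar theta.
theta_true = 2.0
N, K = 200, 50
data = theta_true * rng.normal(size=N) + rng.normal(size=N)

def log_norm(v, m, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - m) ** 2 / var

def objective(theta):
    # Steps 2-3: per data point, sample q_i K times and importance-sample
    # the objective. In this conjugate toy model the KL-minimizing
    # gaussian q_i is the exact posterior, in closed form (step 5 is
    # exact here).
    var_q = 1.0 / (1.0 + theta**2)
    total = 0.0
    for x in data:
        mu_q = theta * x * var_q
        z = mu_q + np.sqrt(var_q) * rng.normal(size=K)   # z ~ q_i(z)
        log_w = (log_norm(z, 0.0, 1.0)                   # log p(z)
                 + log_norm(x, theta * z, 1.0)           # log p(x|z)
                 - log_norm(z, mu_q, var_q))             # - log q_i(z)
        total += np.log(np.mean(np.exp(log_w)))
    return total

# Step 4: gradient ascent on theta (finite differences for brevity).
theta, lr, eps = 0.5, 1e-3, 1e-4
for _ in range(100):
    grad = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
    theta += lr * grad

print(theta)   # moves from 0.5 toward theta_true = 2
```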

Amortized inference for IWAE

As mentioned before, sampling from $p(z)$ is inefficient, as it will result in $z$’s which may not be related to the $x_i$ at hand. We want our $z$ samples to be related to the $x_i$ at hand. Hence we have concluded that it is much better to sample from a distribution close to $p_{\theta}(z|x_i)$ rather than from $p(z)$, which is what the theory of importance sampling says.

Unfortunately, $p_{\theta}(z|x_i)$ is not known (it is the inverse of $p_{\theta}(x_i|z)$, which is a gaussian + neural net). Hence we will try to approximate it with another gaussian + neural net, $q_{\phi}(z|x)$. The mean and variance of this gaussian are calculated by a neural net.

\begin{eqnarray}
q_{\phi}(z|x) = \mathscr{N}( z; \mu_{\phi}(x), \sigma_{\phi}(x) )
\end{eqnarray}

where $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ are calculated by a neural network with parameters $\phi$. The input of this neural network is $x$; its outputs are $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$, which we then use to sample a $z$.
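
A minimal sketch of such an encoder (the layer sizes and the choice of predicting $\log \sigma$ are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder: one hidden layer; the weights stand in for phi.
# Input: data point x (dim 8). Output: mu(x) and log sigma(x), latent dim 2.
W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)) * 0.1, np.zeros(4)

def encoder(x):
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2
    mu, log_sigma = out[:2], out[2:]
    return mu, np.exp(log_sigma)     # predicting log sigma keeps sigma > 0

x = rng.normal(size=8)               # some data point
mu, sigma = encoder(x)
z = mu + sigma * rng.normal(size=2)  # one sample z ~ q_phi(z|x)
print(z.shape)   # (2,)
```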

We want $q_{\phi}(z|x)$ to be as close as possible to $p_{\theta}(z|x)$. Taking $K=1$ above, averaging over $z \sim q_{\phi}(z|x)$, and considering a single data point $x$ (i.e., $N=1$), one can show that

\begin{eqnarray}
\log p_{\theta}(x) = \underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \left[ \log p_{\theta}(x|z) + \log p(z) - \log q_{\phi}(z|x) \right] + \mathrm{KL}(q_{\phi}(z|x) || p_{\theta}(z|x) )
\end{eqnarray}

The first term on the right-hand side of the equality is known as the ELBO (evidence lower bound) or VLB (variational lower bound). Hence we can rewrite the equation as

\begin{eqnarray}
\log p_{\theta}(x) = \mathrm{ELBO}(x,\theta,\phi) + \mathrm{KL}(q_{\phi}(z|x) || p_{\theta}(z|x) )
\end{eqnarray}
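
For completeness, this identity follows from Bayes’ law, $p_{\theta}(z|x) = p_{\theta}(x|z)p(z)/p_{\theta}(x)$, in a few lines; note that it holds for any $q_{\phi}$:

\begin{eqnarray}
\log p_{\theta}(x) &=& \underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \log p_{\theta}(x) \\
&=& \underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \log \left( \frac{p_{\theta}(x|z) p(z)}{p_{\theta}(z|x)} \right) \\
&=& \underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \log \left( \frac{p_{\theta}(x|z) p(z)}{q_{\phi}(z|x)} \right) + \underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \log \left( \frac{q_{\phi}(z|x)}{p_{\theta}(z|x)} \right) \\
&=& \mathrm{ELBO}(x,\theta,\phi) + \mathrm{KL}(q_{\phi}(z|x) || p_{\theta}(z|x))
\end{eqnarray}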

  • Consider increasing the ELBO by changing $\phi$. $\log p_{\theta}(x)$ is not a function of $\phi$, hence it will remain constant. This means that the KL distance between $q_{\phi}(z|x)$ and $p_{\theta}(z|x)$ will decrease, and we will generate better samples.
  • Consider increasing the ELBO by changing $\theta$. In this case either $\log p_{\theta}(x)$ will increase, or the KL distance will decrease, or both.


How do we find the best mean and variance? We want to minimize the KL distance between $q_{x_i}(z)$ and $p_{\theta}(z|x_i)$:

\begin{eqnarray}
\underset{q_{x_i}(z)}{\mathrm{min}} \; \mathrm{KL} ( q_{x_i}(z) || p_{\theta}(z|x_i) ) &=& \underset{q_{x_i}(z)}{\mathrm{min}} \;\; \underset{z \sim q_{x_i}(z)}{\mathrm{E}} \log \left( \frac{q_{x_i}(z)}{p_{\theta}(z|x_i)} \right)  \\
&=& \underset{q_{x_i}(z)}{\mathrm{min}} \;\; \underset{z \sim q_{x_i}(z)}{\mathrm{E}} \left[ \log q_{x_i}(z) - \log p(z) - \log p_{\theta}(x_i|z) \right] + \log p_{\theta}(x_i)
\end{eqnarray}

The negative of the expectation term in the above equation is defined as the ELBO

\begin{eqnarray}
\mathrm{ELBO} = -\underset{z \sim q_{x_i}(z)}{\mathrm{E}} \left[ \log q_{x_i}(z) - \log p(z) - \log p_{\theta}(x_i|z) \right]
\end{eqnarray}

Rearranging

\begin{eqnarray}
\log p_{\theta}(x_i) = \mathrm{ELBO} + \mathrm{KL} ( q_{x_i}(z) || p_{\theta}(z|x_i) )
\end{eqnarray}

Note that $\mathrm{KL} ( q_{x_i}(z) || p_{\theta}(z|x_i) )$ is by definition nonnegative, and $\log p_{\theta}(x_i)$ does not depend on $q_{x_i}(z)$. Hence maximizing the ELBO over $q_{x_i}(z)$ will necessarily minimize the KL distance, and make $q_{x_i}(z)$ a better choice for sampling. But also note that the KL distance will in general never reach zero: we choose $q_{x_i}(z)$ to be gaussian, while $p_{\theta}(z|x_i)$ is not necessarily gaussian, so there will always be some distance between the two. This is not very important, as importance sampling will work for any $q(z)$.

All terms in the ELBO are known and easy to calculate: $q_{x_i}(z)$ and $p(z)$ are chosen to be gaussians, and $p_{\theta}(x_i|z)$ is a neural network + a gaussian. SGD on $\mu(x)$ and $\sigma(x)$ can be run on the ELBO because it is an expectation.

Amortized inference

In the formulation above, for each $x_i$ we have to optimize a separate $q_{x_i}(z)$. If the $x_i$ are images, this means that we have to find a separate $q_{x_i}(z)$ for each image. We can amortize this process with a neural network. In more concrete terms, rather than optimizing

\begin{eqnarray}
\mathrm{KL} ( q_{x_i}(z) || p_{\theta}(z|x) )
\end{eqnarray}

we will try to optimize

\begin{eqnarray}
\mathrm{KL} ( q_{\phi}(z|x) || p_{\theta}(z|x) )
\end{eqnarray}

$q_{\phi}(z|x)$ is a neural network which will generate $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ for a given image $x$. So we have

\begin{eqnarray}
p_{\theta}(x|z) = \mathscr{N}( x; NN_{\theta}(z), I) \\
q_{\phi}(z|x) = \mathscr{N}( z; \mu_{\phi}(x), \sigma_{\phi}(x) )
\end{eqnarray}

Sampling from $q_{\phi}(z|x)$ will be done by $z=\mu_{\phi}(x)+\epsilon \sigma_{\phi}(x)$ where $\epsilon \sim \mathscr{N}(0,1)$; this is known as the reparameterization trick. There will be no need to sample from $p_{\theta}(x|z)$. Therefore, we modify our formula for the ELBO above as
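
The point of the reparameterization trick is that $z$ becomes a deterministic function of $\mu$, $\sigma$ and parameter-free noise, so gradients can flow through $\mu$ and $\sigma$. A quick numerical sanity check that the reparameterized samples have the intended distribution (the values of $\mu$ and $\sigma$ are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

mu, sigma = 1.5, 0.4              # toy encoder outputs for one x
eps = rng.normal(size=100_000)    # eps ~ N(0,1), independent of mu, sigma
z = mu + sigma * eps              # reparameterized samples of q(z|x)

print(z.mean(), z.std())          # close to 1.5 and 0.4
```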

\begin{eqnarray}
\mathrm{ELBO} = -\underset{z \sim q_{\phi}(z|x)}{\mathrm{E}} \left[ \log q_{\phi}(z|x) - \log p(z) - \log p_{\theta}(x|z) \right]
\end{eqnarray}


To recap the roles of the three distributions involved:

  • $p(x|z)$ is essentially a deterministic pdf which corresponds to the decoder neural network: $p(x|z)=\delta(x-d(z))$.
  • $p(z)$ determines how the input to the decoder is distributed. We arbitrarily set this to the standard normal distribution, i.e., $z \sim \mathscr{N}(0,1)$, as we want to push the latent space representations of objects together. After this choice, all $z$’s will, with high probability, come from within a narrow sphere.
  • $p(z|x)$ is the inverse function of the neural network. As the neural net is not guaranteed to be bijective, this relation is not deterministic in general. In most cases, it is not possible to get an analytic expression for it.



Minimizing $\mathrm{KL}(q_i(z)||p_{\theta}(z|x_i))$ is the same as maximizing $-\mathrm{KL}(q_i(z)||p_{\theta}(z|x_i))$. Hence we can combine the two optimizations into a single objective:

\begin{eqnarray}
\underset{\theta,q_i(z)}{\mathrm{max}} \left[ \sum_{i=1}^N \log \frac{1}{K} \sum_{k=1}^K \frac{p(z_i^{(k)})}{q_i(z_i^{(k)})}p_{\theta}(x_i|z_i^{(k)})\; – \; \mathrm{KL}(q_i(z)||p_{\theta}(z|x_i)) \right]  \quad,\quad z_i^{(k)} \sim q_i(z)
\end{eqnarray}

If $q_i(z)$ is gaussian, the objective is maximized by gradient ascent on its mean and variance. As $q_i(z)$ gets closer to $p_{\theta}(z|x_i)$, sampling becomes more efficient. Note that this method will work for any $q_i(z)$, but it will require too many samples for convergence if $q_i(z)$ is far from optimal.