Variational Autoencoders (VAEs) are a popular and effective tool to generate new data from training data. As their name implies, they are closely related to (ordinary) Autoencoders (AEs).

In this post, we will recap how AEs and VAEs work and why one of them can be used for data generation, and then examine an example VAE trained on the FashionMNIST dataset and how its generated samples behave.

Introduction (Recap AEs vs. VAEs)

An autoencoder is a neural network that compresses usually high-dimensional data into a lower-dimensional space and then decompresses it back into the original space, such that all or most of the relevant information in the original sample is reconstructed.

An obvious potential use for AEs is compression to save storage space, although the compression will usually not be lossless, which can be undesirable in many applications. More relevant uses include dimensionality reduction, as the latent space to which the input data is compressed is a lower-dimensional representation of the higher-dimensional input data. An important advantage that AEs have over more traditional dimensionality reduction methods, such as PCA, is that they remain viable for large numbers of input samples, as they can be trained in mini-batches [1].

An AE consists of two neural networks: the encoder and the decoder. The encoder takes the high-dimensional original samples as input and outputs a lower-dimensional representation. The decoder, which is often a mirrored version of the encoder, takes the lower-dimensional representation as input and outputs a sample of the same dimensionality as the original. The loss of this forward pass is calculated as the reconstruction loss (often mean-squared-error or log-loss) between the original sample and the reconstructed sample, and is then back-propagated through the decoder and encoder.
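To make this concrete, here is a minimal AE sketch in PyTorch; the layer sizes and the flattened 784-dimensional input are illustrative assumptions, not taken from the repository discussed below.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch; layer sizes are illustrative only.
class AE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: compress the input to a low-dimensional latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: a mirrored version of the encoder, mapping back to input space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
x = torch.rand(32, 784)                     # a mini-batch of flattened images
loss = nn.functional.mse_loss(model(x), x)  # reconstruction loss
loss.backward()                             # back-propagate through decoder and encoder
```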

When it comes to reconstructing samples from known latent vectors, AEs can be very efficient. However, their latent space is not guaranteed to be continuous or bounded, so it is not possible to determine a region of the latent space from which samples could be drawn that yield meaningful samples after decompression.

VAEs solve this problem and guarantee a continuous and at least somewhat bounded latent space. They belong to the family of Bayesian methods and, making use of a prior, encode each input sample into a probability density function instead of a single point. Sampling from that density can then be expected to yield meaningful results after decompression. VAEs define their latent space through the prior distribution, for example a Gaussian, which is then used to find an approximation of the posterior probability density in the latent space.

Example VAE — FashionMNIST

To see a simple application of VAEs in action, please refer to this repository.

What follows are explanations of how the VAE in this repository was built, which decisions were made along the way, and why.

Networks

The encoder consists of two convolutional blocks with three convolutional layers each, each of which is followed by a LeakyReLU activation function to avoid dying-ReLU issues (units whose gradient becomes permanently zero). After each block, batch normalization and max pooling are applied.

To return the required mean and standard deviation, the network splits after the convolutional layers. The 2D output of the convolutional part of the network is flattened and passed through two separate fully connected layers. The fully connected layer producing the mean is followed by a LeakyReLU activation, as in the convolutional blocks, while the layer producing the standard deviation is followed by the exponential function, so it effectively models the log-standard-deviation, which tends to be more stable than modeling the standard deviation directly.

The number of input channels for the FashionMNIST dataset is one, as the images are grayscale. The first convolutional layer immediately increases the number of channels to 16, and it is kept constant throughout the convolutional blocks. Likewise, the kernel size is set to 3 for every layer.
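A sketch of such an encoder could look as follows; the latent dimensionality and the padding are assumptions, the latter chosen so that two pooling steps reduce the 28×28 inputs to 7×7 feature maps.

```python
import torch
import torch.nn as nn

def conv_block(in_channels):
    # Three 3x3 conv layers with 16 channels, each followed by LeakyReLU,
    # then batch normalization and 2x2 max pooling.
    layers = []
    for ch in (in_channels, 16, 16):
        layers += [nn.Conv2d(ch, 16, kernel_size=3, padding=1), nn.LeakyReLU()]
    layers += [nn.BatchNorm2d(16), nn.MaxPool2d(2)]
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    def __init__(self, latent_dim=8):  # latent_dim is an assumption
        super().__init__()
        self.blocks = nn.Sequential(conv_block(1), conv_block(16))  # 28 -> 14 -> 7
        self.fc_loc = nn.Linear(16 * 7 * 7, latent_dim)        # mean head
        self.fc_log_scale = nn.Linear(16 * 7 * 7, latent_dim)  # log-std head

    def forward(self, x):
        h = self.blocks(x).flatten(start_dim=1)
        z_loc = nn.functional.leaky_relu(self.fc_loc(h))
        # exp of the log-std keeps the standard deviation strictly positive
        z_scale = torch.exp(self.fc_log_scale(h))
        return z_loc, z_scale
```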

The decoder mirrors the encoder quite closely. The samples from the latent space are first passed through a fully connected layer before going through two blocks of three convolutional layers with 16 channels and kernel size 3. To balance out the max pooling of the encoder, upsampling and batch normalization are applied to the data before each convolutional block. Each layer is followed by a LeakyReLU activation function.

The input shape for the encoder network is (1, 28, 28), excluding the batch dimension. The output of the encoder consists of two tensors (mean and standard deviation) of length equal to the dimensionality of the latent space.

The input for the decoder is a tensor of the same dimensionality as the latent space. The output has the same shape as the input images.
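A matching decoder sketch, under the same assumptions as the encoder above; the final sigmoid is an additional assumption that keeps the pixel means in (0, 1) for the Continuous Bernoulli likelihood discussed below.

```python
import torch
import torch.nn as nn

def up_block():
    # Upsampling and batch norm before three conv+LeakyReLU layers.
    layers = [nn.Upsample(scale_factor=2), nn.BatchNorm2d(16)]
    for _ in range(3):
        layers += [nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.LeakyReLU()]
    return nn.Sequential(*layers)

class Decoder(nn.Module):
    def __init__(self, latent_dim=8):  # latent_dim is an assumption
        super().__init__()
        self.fc = nn.Linear(latent_dim, 16 * 7 * 7)
        self.blocks = nn.Sequential(up_block(), up_block())  # 7 -> 14 -> 28
        # Map back to one channel, matching the input image shape.
        self.out = nn.Conv2d(16, 1, kernel_size=3, padding=1)

    def forward(self, z):
        h = self.fc(z).view(-1, 16, 7, 7)
        # Sigmoid (an assumption) keeps pixel means in (0, 1).
        return torch.sigmoid(self.out(self.blocks(h)))
```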

The VAE

The prior of the latent space is a normal distribution with mean 0 and standard deviation 1. A normal distribution defines the latent space quite intuitively, although other distributions are also possible, such as a Beta distribution if the latent space is to be strictly bounded.

The actual pixel intensity is sampled from a Continuous Bernoulli distribution, which is conveniently a single-parameter distribution defined between 0 and 1. Here too, other distributions would be possible, such as the Beta or Gamma distribution, although the second parameter of either distribution would then need to be fixed or sampled from a second latent space, which is of course less convenient.

The chosen optimizer is Pyro's ClippedAdam, a wrapper around the corresponding PyTorch optimizer. A learning rate of 5e-4 was chosen, together with moderate weight regularization. Different trials showed that a larger learning rate, unclipped gradients, or unregularized weights led to a less stable training process and could even cause exploding gradients.

The loss to be minimized is the negative ELBO, computed with Pyro's Trace_ELBO. This criterion allows the loss to be backpropagated not only through the model weights but also through the distribution parameters.

The default batch size in the training script is 128. This is mainly a computational consideration, as this size should be safe for smaller GPUs. Users are free to experiment with different batch sizes, though.
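Putting these choices together, a training setup along these lines could look as follows; `vae` and `train_loader` are assumed to exist (a VAE exposing the model and guide methods discussed below, and a DataLoader with batch size 128), and the exact clipping and weight-decay values are illustrative.

```python
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import ClippedAdam

# ClippedAdam combines Adam with gradient clipping; the clip_norm and
# weight_decay values here are illustrative assumptions.
optimizer = ClippedAdam({"lr": 5e-4, "clip_norm": 10.0, "weight_decay": 1e-4})
svi = SVI(vae.model, vae.guide, optimizer, loss=Trace_ELBO())

for epoch in range(10):
    epoch_loss = 0.0
    for x, _ in train_loader:      # DataLoader with batch_size=128
        epoch_loss += svi.step(x)  # one gradient step on the negative ELBO
    print(f"epoch {epoch}: loss {epoch_loss / len(train_loader.dataset):.3f}")
```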

Generative Model

Inside the model method, the generative model (the decoder network) is first registered with Pyro's parameter store. The name chosen in the repository is "decoder", reflecting its role of reconstructing a meaningful image from a sample drawn from the latent space. Inside the plate context, the parameters of the latent-space prior are defined as a zero tensor and a one tensor of shape (n_samples, n_latent_features). Together they specify the prior distribution of the latent space.

z_loc and z_scale are then used to sample from the latent space. During actual training and inference, z is sampled from the distribution at the sample site of the same name in the guide (see next section).

Next, the samples z from the latent space are decoded using the decoder network, resulting in mean pixel outputs. These means are then used as the parameters of a Continuous Bernoulli distribution to model the randomness of pixel intensities around the expected intensity.

Lastly, the decoder outputs are returned to be used during inference or image generation.
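A sketch of such a model method, with names chosen to match the description above; the exact code in the repository may differ.

```python
import torch
import pyro
import pyro.distributions as dist

class VAE(torch.nn.Module):
    # Sketch: self.decoder and self.n_latent are assumed to be the decoder
    # network and the latent dimensionality from the sections above.
    def model(self, x):
        pyro.module("decoder", self.decoder)  # register with the param store
        with pyro.plate("data", x.shape[0]):
            # Standard-normal prior over the latent space.
            z_loc = torch.zeros(x.shape[0], self.n_latent)
            z_scale = torch.ones(x.shape[0], self.n_latent)
            z = pyro.sample("latent_space",
                            dist.Normal(z_loc, z_scale).to_event(1))
            # Decode the latent sample into mean pixel intensities in (0, 1).
            loc_img = self.decoder(z)
            # Pixel intensities are modelled with a Continuous Bernoulli.
            pyro.sample("obs",
                        dist.ContinuousBernoulli(loc_img).to_event(3), obs=x)
            return loc_img
```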

Inference Model ("Guide")

In the guide method, the inference model is registered with Pyro's parameter store, in this case under the name "encoder" to reflect its role in the model.

Another Pyro plate context is created, with the same name as the associated plate in the model method.

Inside the plate, z_loc and z_scale are inferred by passing the input x through the encoder. They are then used as the parameters of the guide's latent_space distribution.

Pyro internally knows to sample the latent space according to the specification in the guide rather than the one in the model method. While the model provides the prior over the latent space, the guide therefore supplies the variational distribution, which approximates the true posterior distribution.
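A matching guide sketch, continuing the class from the previous section; note that the plate and sample-site names must be identical to those in the model so that Pyro can pair them up.

```python
import torch
import pyro
import pyro.distributions as dist

class VAE(torch.nn.Module):
    # ... model method as sketched in the previous section ...
    def guide(self, x):
        pyro.module("encoder", self.encoder)  # register with the param store
        # Same plate name as in the model.
        with pyro.plate("data", x.shape[0]):
            # Infer the variational parameters from the input.
            z_loc, z_scale = self.encoder(x)
            # Same sample-site name as in the model, so this distribution
            # replaces the prior during training and inference.
            pyro.sample("latent_space",
                        dist.Normal(z_loc, z_scale).to_event(1))
```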

Results

After training, the following images can be sampled from the latent space of the model:


Figure 1: Samples from VAE model.

As one can see in Figure 1, the shapes of the clothing items are reproduced very well by these samples. The model also attempts to reproduce the patterns present in the original data, but the resulting details seem "smudged" or "washed out". In the context of fashion design that might not even be a disadvantage, but it is doubtful that this model would generalize well to more complex datasets.


Figure 2: Sampling a shoe with increasing noise

In Figure 2, it can be observed how an increasing standard deviation (increasing from top to bottom and from left to right) influences the sampling process. For this figure, the latent space was sampled repeatedly while keeping the mean of the normal distribution constant and increasing its standard deviation. As one can see, the samples are at first very similar and become increasingly noisy until the original shoe is unrecognisable.
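A hypothetical way to recreate this experiment, assuming a trained `vae` (with the encoder and decoder sketched above) and a single `shoe_image` tensor of shape (1, 28, 28) from the dataset:

```python
import torch
import pyro.distributions as dist

# Keep the encoded mean fixed and sample with increasing standard deviation.
with torch.no_grad():
    z_loc, _ = vae.encoder(shoe_image.unsqueeze(0))
    samples = []
    for scale in torch.linspace(0.1, 3.0, 25):  # illustrative range
        z = dist.Normal(z_loc, scale * torch.ones_like(z_loc)).sample()
        samples.append(vae.decoder(z))  # increasingly noisy reconstructions
```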

Figure 3: Morphing from T-shirt to sneaker


Figure 4: Samples from Conditional VAE

Figure 4 shows samples from an alternative VAE architecture, specifically one in which the prior is not the same across all classes of samples but is instead conditioned on the sample itself, as well as its class. The shapes of the items are reproduced about as well as in the original model, while the details and patterns seem more refined, though still not clearly visible. This may indicate that this alternative architecture is better able to generalize to more diverse and complex datasets.

All of the above images were generated by encoding a set of 25 original images into their variational parameters, sampling from the latent space, and decompressing the samples.

The generated images show that this simplistic architecture can indeed generate interesting samples for the FashionMNIST dataset. The samples of the conditional model (Figure 4) in particular show that even a decent level of detail can be achieved in the generated images.

Figure 3 indicates that the latent space is also at least somewhat well populated, as it is possible to traverse it while drawing only a few samples that are not clearly attributable to any class of images. However, it was not possible to sample randomly and unconditionally from the latent space and generate meaningful images.

In conclusion, this model seems viable for the simplistic case of the FashionMNIST dataset, but the posterior approximation already appears to hit its limit in terms of flexibility and will not be sufficient for a more complex or diverse dataset.

There are different ways to improve the flexibility of the posterior estimation. As outlined by Kingma and Welling (2019), auxiliary latent variables, normalizing flows, and inverse autoregressive flows have all been shown to allow for a more flexible posterior approximation while retaining computational efficiency [2].

Further Reading and Next Steps

While the current results are already quite interesting, the field of VAEs provides plenty more opportunities to generate better and far more complex data. This project was heavily inspired by the tutorial on Variational Autoencoders from the Pyro documentation. The briefly shown alternative architecture, which led to better results, was inspired by another Pyro tutorial: Conditional Variational Auto-encoder. Beyond that, the work of Kingma and Welling [2] provides a great introduction to VAEs as well as a comprehensive overview of many active areas of research and application around VAEs.

Following this project, it would be interesting to apply some of the concepts from Kingma and Welling [2] to obtain better generated samples for the FashionMNIST dataset. Once this dataset is sufficiently explored, it may be worthwhile to move on to more complex datasets, like CIFAR-10 and beyond.

For a more general introduction to Pyro with an example implementation of a VAE for the MNIST Digits dataset, including exercises and solutions, please see here.

References

  1. Michelucci, Umberto. "An introduction to autoencoders." arXiv preprint arXiv:2201.03898 (2022).
  2. Kingma, Diederik P., and Max Welling. "An introduction to variational autoencoders." Foundations and Trends® in Machine Learning 12, no. 4 (2019): 307–392.