In my previous blog post, I talked about how Google DeepMind's AlphaGo works, and explained that convolutional networks play a crucial role in the system.
This time I am going to look at how computer programs can automatically generate artistic images of high perceptual quality. The system I am going to describe has been documented in two papers written by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. This team also set up a company, called DeepArt, with an online demo of their system.
Source. The Neckarfront in Tübingen, Germany, combined with several well-known paintings. The DeepArt system matches the content representation of the photograph of Tübingen and the style representation of the painting.
Playing the game of Go and creating art pieces that are interesting to the human eye are both tasks that are hard to teach to a computer. Yet both AlphaGo and DeepArt's system proved to be good at the task they were taught to do. Coincidentally, both systems rely on convolutional networks, which is one of the reasons I decided to write a new post on the topic.
Mixing the content of one picture with the style of another
What the founders of DeepArt built is a system that takes two images as input: an artistically pleasing picture such as a famous painting, and any photograph. The system creates an output image that combines the two inputs in an interesting way: the content of the original photograph, as well as the artistic style of the painting are preserved. In the words of DeepArt's team:
The images are synthesised by finding an image that simultaneously matches the content representation of the photograph and the style representation of the respective piece of art [...]. While the global arrangement of the original photograph is preserved, the colours and local structures that compose the global scenery are provided by the artwork. Effectively, this renders the photograph in the style of the artwork, such that the appearance of the synthesised image resembles the work of art, even though it shows the same content as the photograph.
I will now start to explain how DeepArt's team achieved this. I will have to take a brief detour and provide some explanations regarding convolutional networks.
Convolutional neural networks (or just convolutional networks) are a sub-type of artificial neural network that is especially well adapted to taking images as input. The output of a convolutional network can be either a classification result (e.g. "This image contains a robot. It does not contain a banana.") or one or more images (e.g. the output image could be a de-noised or de-blurred version of the original input image).
The result of an image being de-blurred using multi-layer perceptrons (MLPs), another type of neural network. More info here.
Convolutional networks, like other artificial neural networks, contain a number of parameters which can be trained (i.e. their value tuned) in such a way that the network achieves a desired goal (such as object recognition: being able to tell if the input image contains robots or bananas) as best as possible.
Also like other artificial neural networks, convolutional networks operate layer-wise, in a feed-forward manner. In other words, an input image is sent sequentially through the layers of the network, and the resulting image is output by the last layer of the network. Each layer has a set of parameters that are calculated (trained) during the training process of the network. These trainable parameters define the function of the filters applied at each layer. An image going through the filters of one layer will generate a number of filtered images which are called feature maps. These feature maps are images themselves. Each of them is a representation of some feature of the original input image (edges, corners, contours, object parts).
Usually, convolutional networks contain more feature maps in the deeper layers than in the first layers, but the feature maps become smaller and smaller with increasing depth in the network.
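To make the idea of feature maps concrete, here is a minimal sketch of a single convolutional "layer" in plain numpy. The two hand-crafted 3x3 filters (a vertical and a horizontal edge detector) are illustrative stand-ins for filters a real network would learn during training:

```python
import numpy as np

# A toy "layer": convolve a 6x6 image with two 3x3 filters,
# producing two 4x4 feature maps (valid convolution, no padding).
# The filter values are hand-picked for illustration, not learned.

def conv2d_valid(image, kernel):
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # an image with a vertical edge

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)
horizontal_edge = vertical_edge.T

feature_maps = [conv2d_valid(image, k)
                for k in (vertical_edge, horizontal_edge)]
# The first map responds strongly where the vertical edge is,
# the second map does not respond at all.
```

Note how each feature map is smaller than the input (4x4 instead of 6x6) and represents one specific feature of the image, exactly as described above.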
The feature maps therefore contain certain features of the input image. It is important to understand that there is no strict definition or description of the role of each filter. Similarly, it is difficult to say what feature of an input image a specific feature map captures.
Usually these features become more and more abstract with increasing depth in the network. For example, when convolutional networks are trained for object recognition, the feature maps in the deeper layers (applied later in the process) of the network care more about the presence of specific objects in the input image than about the specific pixel values of the input image. In comparison, the feature maps in the first layer often detect relatively simple features such as edges.
Also, the size of the features that can be detected by the later layers tends to be larger than in the first layers. This is due to the filtering operation itself: the first layer can only detect features that are at most as large as the size of the filters (which tend to be just a few pixels large). But the second layer can already detect features that are almost twice as large. This size is known as the receptive field size. We will come back to this a bit later.
Over the years, the results achieved in object detection and localisation using convolutional networks have become better and better. The challenges have also become more and more difficult, with the ImageNet challenge asking participants to discriminate between 1000 different object classes. One of the top performers of the ImageNet 2014 challenge (it won the localisation task) was a deep convolutional network with 19 layers and relatively small filters. This network is known as the VGG network and forms the basis of the DeepArt system.
Source: ImageNet. Images annotated with object classes. The goal is to be able to annotate images automatically.
Texture synthesis with convolutional networks
In one paper, the DeepArt team members describe how convolutional networks trained for object recognition can be used for a different task: generating artificial textures. A photograph is provided as a source, and an artificial image resembling it is synthesized. Here are two examples of synthetically generated textures:
Source: Bethgelab. The images on the left were generated automatically, given the image on the right. The images look very similar, but the global arrangement of the image is lost.
Many more very impressive examples of synthetically generated textures and their corresponding source images can be found at Bethgelab. A texture image usually represents a pattern that can be repeated, like close-up pictures of sand, paper, clouds, wood, or concrete. The global arrangement of a texture image is not as important as in a photograph of e.g. a famous landmark.
The synthesis procedure takes as input an image containing only random noise and proceeds by iteratively updating this noise image until it sufficiently resembles the source image, according to some metric.
The goal of this metric is to capture "resemblance" in such a way that colors and local structures are preserved, but not the global arrangement of the image. DeepArt's team say it best:
Textures are per definition stationary, so a texture model needs to be agnostic to spatial information.
The trick the team comes up with is to take correlations between feature maps in the VGG network (trained for object detection), and use these correlations as a metric:
A summary statistic that discards the spatial information in the feature maps is given by the correlations between the responses of different features.
We want to remove spatial information but preserve the texture. The problem is that a feature map is a (non-linearly) filtered version of the input image, so it still contains spatial information about the input image. What can we do to remove this spatial information? Calculating correlations between feature maps does exactly that: the correlation between two feature maps is a single value, so spatial information is necessarily lost. The authors take the correlations between all pairs of feature maps in a layer, which gives them a Gram matrix of size N x N, where N is the number of feature maps in that layer.
Hence, the iterative update procedure modifies the original noise image until its Gram matrix is close to the one generated by the source image. It turns out that this iterative update procedure can be conveniently done via backpropagation, which is the standard method for training neural networks, except that in this case, only the input image is updated (and not the parameters of the network). It also turns out that this procedure can generate textures that are strikingly similar to the source image.
Source. Top row: Original images. Middle row: Reconstructions using only the first layer. The reconstructions have little structure due to the small receptive field size in the first layer. Bottom row: Reconstructions using more (deeper) layers. Spatial information is preserved over larger regions due to larger receptive field sizes.
One question remains: Which layer or layers should be chosen to compute the Gram matrix? It turns out that using the first (shallowest) layers does not lead to very interesting results. The best results are achieved when using several layers, including deeper layers. This effect can be explained by two factors:
- The deeper layers extract more and more abstract (and therefore interesting) features about the input image, and
- the deeper layers are able to extract larger and larger features, thanks to ever increasing receptive field sizes.
I was extremely impressed by the quality of these textures, and their striking resemblance to the source image.
One of the authors kindly made source code available for the generation of textures, so you can even try for yourself.
Combining content and style with convolutional networks
In a follow-up paper, the same team extends the previously described texture synthesis method. For texture synthesis, the goal is to approximate the Gram matrix produced by the feature maps in the VGG network, with the result being an image with similar style but different content.
The goal is now to create an image which has similar style to one image, but similar content to a different image. Approximating image content is done similarly to the texture synthesis approach, except that the metric is now different: the goal is to approximate the values in the deep layers of the VGG network directly. The algorithm can be summarized as follows.
- Feed the artistic image through the VGG net and compute and save the Gram matrix G.
- Feed the photograph through the VGG net and save the feature maps F.
- Generate a white noise image. Through backpropagation, iteratively update this image until it has a feature map and a Gram matrix that are close to F and G, respectively.
Since it is usually impossible to find an image that perfectly matches both G and F, a trade-off is necessary: Is it more important to be close to G, or close to F? If more emphasis is put on the Gram matrix G, the result will be an image that more closely matches the style of the artistic image. If more emphasis is put on the feature maps F, the result will be an image that more closely matches the content of the photograph. The example below illustrates this trade-off.
Source. Left: More emphasis on style. Right: More emphasis on content.
This nicely demonstrates that style and content are separable.
Why does it work?
Source. The same face, under different lighting conditions (yes, it's really the same face!). The content is the same in all four images, but the appearance/style is different. This shows the importance of being able to discriminate between content and appearance/style in object recognition.
As I said earlier, the exact function of each layer in a neural network is hard to define: the network is trained as a whole, and depending on the learned parameters, different layers take on different functions. It appears that convolutional networks trained to recognize objects internally represent images in a way that allows for the separation of image content and image style: the Gram matrix represents the image style, and the feature map values represent the image content. The networks learn this separable representation automatically during training. This is why it was possible to manipulate the style of an image without affecting its content too severely.
This is a surprising result, and one might wonder why this is the case. The authors provide a possible explanation:
All in all it is truly fascinating that a neural system, which is trained to perform one of the core computational tasks of biological vision, automatically learns image representations that allow the separation of image content from style. The explanation could be that when learning object recognition, the network has to become invariant to all image variation that preserves object identity. Representations that factorise the variation in the content of an image and the variation in its appearance would be extremely practical for this task. Thus, our ability to abstract content from style and therefore our ability to create and enjoy art might be primarily a preeminent signature of the powerful inference capabilities of our visual system.
Let me explain further: There are many variations (such as changes in lighting) that change the appearance of an image dramatically, without affecting its content (the same object is still being represented). Any system that is good at recognizing objects in images is therefore forced to be able to discriminate content from appearance. It does not matter too much if the system is a computer system (e.g. a convolutional network) or a biological system (e.g. the human brain). Quite an interesting insight!
Some fun: The system in reverse
Using DeepArt's online demo, I first reproduced the results in the paper: I used a photograph of Tübingen's Neckarfront, and combined it with "The Starry Night" by Vincent van Gogh. The result resembles the one presented by DeepArt's team. Then, I used the system in reverse: I used the photograph of the Neckarfront as the artistic image, and Van Gogh's painting as the photograph. The result is an image that preserves the content representation of the painting, using the colors and small structures found in the photograph of the Neckarfront. We see that the system works as expected in this situation, too.
Starting from the top-left, clockwise: A photograph of Tübingen's Neckarfront. Vincent van Gogh's "The Starry Night". DeepArt's combination, using van Gogh's painting as the piece of art. DeepArt's combination, using the photograph of Tübingen as the piece of art.
Matthias Bethge informed me that Leon Gatys became one of Adobe's new research fellows and that a collaboration with Adobe is under way to decompose style into different specific aspects of style. Looking forward to seeing more results!