Style Transfer: a Summary
Tuesday 21st August, 2018
Machine learning, Style transfer

Here's an answer to the question "What is Style Transfer?" that I posted recently in a discussion forum. It has the same structure as yesterday's post on the topic, but is shorter and so might be easier to understand.

In its most general sense, style transfer means rendering one artefact in the style of another. Most recent research has probably been on style transfer in visual art, particularly on rendering photographs in the styles of various famous painters. Here are some examples:

[ Image: via "Artistic Style Transfer" by Firdaouss Doukkali in Towards Data Science, credited to @DmitryUlyanovML ]

[ Image: from "Convolutional neural networks for artistic style transfer" by Harish Narayanan in his blog ]

[ Image: via "Artistic Style Transfer with Deep Neural Networks" by Shafeen Tejani in the From Bits to Brains blog, from "Image Style Transfer Using Convolutional Neural Networks" by Leon A. Gatys, Alexander S. Ecker, Matthias Bethge ]

The concept of style transfer doesn’t need to be restricted to visual art. For example, there’s a paper on style transfer in cooking described in Lucy Black’s “Style Transfer Applied To Cooking - The Case Of The French Sukiyaki” (14 May 2017). The work she describes is by Masahiro Kazama, Minami Sugimoto, Chizuru Hosokawa, Keisuke Matsushima, Lav R. Varshney and Yoshiki Ishikawa, one of whom she says is a professional chef. It’s written up in an arXiv paper, “A neural network system for transformation of regional cuisine style” (arXiv:1705.03487).

In visual art, the concept doesn’t need to be restricted to photographs and paintings. Prutha Date, Ashwinkumar Ganesan and Tim Oates have written a paper, “Neural Style Transfer to Design Clothes”, also available on the arXiv. (This is one reason I’m taking an interest.)

But now let’s talk about photographs and paintings. As I understand it, the “modern era” in style transfer began with a paper by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge, https://www.cv-foundation.org/op... . There had been work on the topic before, but it tended to use only low-level image features to define style, and could cope only with a restricted range of objects in images, e.g. faces. The advance of Gatys et al. was to use so-called “convolutional neural networks” (a.k.a. “convnets”). These can be trained to recognise a vast range of common objects, making it possible to separate the “content” of an image (i.e. the things it depicts) from its style. They are also able to recognise higher-level aspects of style than previous work could.

The key points as I understand them are:

1) Style (in most of this work) refers to the technique of a single artist in a single painting, though there is now research on inferring style from multiple paintings by the same artist, e.g. all of Monet’s Impressionist paintings.

2) Style means things like the thickness and smoothness of lines, how wiggly they are, the density and distribution of colour, and the surface texture of brush strokes in oil paintings.

3) Style can occur at a range of scales.

[ Image: incorporates Wikipedia's The Starry Night ]

In the above images, the circles represent what a vision scientist might call “receptive fields”. Each one “sees” the properties of the image portion within it. The first set of circles “see” how the boundaries of the church and its spire are painted. The second set “see” the swirls in van Gogh’s sky. Each one sees an individual swirl, and can perceive bulk properties such as its width and curvature. The third set of circles also “see” the swirls, but in much more detail. They see a swirl as a bundle of strokes, and can work out how far these are apart, how their shapes match, and so on.
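
As an aside for the more technically minded: the reason a convnet can capture style at several scales is that the “receptive field” of its units grows with depth. Here’s a small Python sketch of that growth. The layer stack is made up purely for illustration, not taken from any particular network; the formula is the standard one for stacked convolutions and pooling.

```python
def receptive_field(layers):
    """Receptive-field width (in input pixels) after a stack of conv/pool layers,
    each given as a (kernel_size, stride) pair."""
    width, jump = 1, 1
    for kernel, stride in layers:
        width += (kernel - 1) * jump   # each layer widens the field...
        jump *= stride                 # ...and strided layers widen it faster thereafter
    return width

# A made-up stack: three 3x3 convolutions, a 2x2 pooling step, three more convolutions.
stack = [(3, 1), (3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (3, 1)]
for depth in range(1, len(stack) + 1):
    print(f"after layer {depth}: sees a {receptive_field(stack[:depth])}-pixel-wide patch")
```

Units early in the stack see only a few pixels, like the small circles above; units deep in the stack see wide patches, like the large ones.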

4) Style is distributed across the entire image. The above should make that clear. Or think of other examples, such as a cartoonist’s cross-hatching. So estimating style entails working out correlations between different parts of the image.

5) Given two images, it’s possible to work out how close they are in style. In other words, style can be quantified. This is important, because if we write a program that wants to render (say) a portrait of Quora’s founder in the style of The Starry Night, the program has to know when its output actually looks van-Gogh-ish.
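
For the curious, here’s a minimal Python sketch of one way to quantify this, following the Gram-matrix idea that Gatys et al. use: the Gram matrix records exactly the correlations mentioned in point 4, and the style distance is the mean squared difference between two images’ Gram matrices. The random arrays are just stand-ins for the activations a real pretrained convnet would produce, and the function names are my own.

```python
import numpy as np

def gram_matrix(features):
    """Correlations between feature channels, summed over all image positions."""
    channels, height, width = features.shape
    flat = features.reshape(channels, height * width)
    return flat @ flat.T / (channels * height * width)

def style_distance(features_a, features_b):
    """Mean squared difference between the two images' Gram matrices."""
    ga, gb = gram_matrix(features_a), gram_matrix(features_b)
    return float(np.mean((ga - gb) ** 2))

# Pretend these came from the same convnet layer, run on two different images.
rng = np.random.default_rng(0)
feats_o = rng.standard_normal((64, 32, 32))   # hypothetical activations for one image
feats_s = rng.standard_normal((64, 32, 32))   # hypothetical activations for the other
print(style_distance(feats_o, feats_s))
```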

6) As already said, images have “content” as well as style. The content of an image is the objects it depicts.

7) The detection of content actually rests on a vast amount of work done to train learning programs (these “convnets”) to recognise objects. This work in turn has relied on (a) a huge database of images called ImageNet; (b) a huge number of volunteers to label the images therein with their descriptions; (c) a huge database of words and concepts called WordNet. The last of these gives the volunteers a consistent framework, so that if two volunteers label a picture of a crocodile (say), they’ll do so in the same way.

8) Unlike style, content is local. Objects occupy a fixed part of an image, and the recogniser need only be interested in that. Unlike style, most objects don’t contain long-range repeated patterns. (Indeed, if they do, I think we’d often regard that as texture, and the choice of how to depict the texture as style.)

9) Given two images, it’s possible to work out how close they are in content. In other words, content can be quantified. This is important, because if we write a program that wants to render (say) a portrait of Quora’s founder in the style of The Starry Night, the program has to know when its output actually looks like said founder.
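
Again a minimal sketch, with random arrays standing in for convnet activations: for content, Gatys et al. compare the feature maps themselves, position by position, rather than their Gram matrices.

```python
import numpy as np

def content_distance(features_a, features_b):
    """Mean squared difference between feature maps at matching positions."""
    return float(np.mean((features_a - features_b) ** 2))

rng = np.random.default_rng(0)
feats_o = rng.standard_normal((64, 32, 32))   # hypothetical activations for the output image
feats_q = rng.standard_normal((64, 32, 32))   # hypothetical activations for the content image
print(content_distance(feats_o, feats_q))
```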

10) OK, now we have everything we need. So suppose I have two images. One, Q, is the “content” image: the photo of Mr. Quora. The other, S, is the “style” image: a photo of The Starry Night. Now I generate a third image, O, at random. This is going to become my output: Mr. Quora retouched in the style of The Starry Night.

By point 5, I can measure how close O and S are in style. By point 9, I can measure how close O and Q are in content. And I can continue doing this no matter how I change O.

So I now start optimising. If O is far in style from S, I tweak it so as to bring it closer. If O is far in content from Q, I tweak it to bring it closer. And I keep doing so until the result is quite close to S in style, and quite close to Q in content.

The important point here is that my tweaker doesn’t itself need to know how to measure style and content. It just needs to feed O into the “how close am I to S in style” meter and the “how close am I to Q in content” meter mentioned two paragraphs ago, then keep tweaking O until both meters give a high reading. And, just to recap, by “meter”, I mean the “convnets” that Gatys et al. used.

How does the optimisation know how to tweak an image? After all, if it just kept on doing so at random, perhaps millions of years would pass before it got an acceptable result. The answer is something called “gradient descent”. This is like a man standing on top of a hill, and taking a few strides along each path down it, in order to estimate which path will get him to the base most quickly.
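
Putting points 5, 9 and 10 together, here’s a toy end-to-end sketch in Python, using PyTorch to supply the gradients. A single fixed, random convolution stands in for the pretrained convnet “meters”; in the real method of Gatys et al. these are layers of a network such as VGG, with several layers used for style, so treat everything here, weights and all, as illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
weight = torch.randn(16, 3, 3, 3)       # frozen, random stand-in for a pretrained convnet layer

def features(image):
    return F.conv2d(image, weight, padding=1)

def gram(feats):
    channels = feats.shape[1]
    flat = feats.reshape(channels, -1)
    return flat @ flat.T / flat.numel()

Q = torch.rand(1, 3, 64, 64)            # content image (Mr. Quora)
S = torch.rand(1, 3, 64, 64)            # style image (The Starry Night)
O = torch.rand(1, 3, 64, 64, requires_grad=True)   # output, generated at random

target_content = features(Q).detach()
target_style = gram(features(S)).detach()

learning_rate = 0.1
for step in range(300):
    content_gap = F.mse_loss(features(O), target_content)    # the "content meter"
    style_gap = F.mse_loss(gram(features(O)), target_style)  # the "style meter"
    loss = content_gap + 1000.0 * style_gap   # style weight picked arbitrarily
    loss.backward()                           # which way is downhill for O's pixels?
    with torch.no_grad():
        O -= learning_rate * O.grad           # one gradient-descent stride
    O.grad = None
```

Each pass of the loop asks the meters how far O is from S in style and from Q in content, then nudges O’s pixels a small stride in the direction that reduces both gaps.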

So that’s a non-technical summary of “modern era” work on visual style transfer. I’ve written it up in more detail as a blog post, “Style Transfer”, on Chromophilia. That post has references to the paper by Gatys et al., and to various review and blog articles.

Style transfer research will continue, I’m sure, in many directions. I can foresee researchers wanting to make it more “semantic”. When you see this word, it’s a give-away that the author wants their programs to become less superficial, looking at “meanings” rather than surface features. And although current work already produces some stunning images, it by no means has a perfect conception of style.

For example, Analytical Cubism refers to the early cubism of Braque and others, where paintings were made from a variety of fragmentary images, each showing the same object from a different viewpoint. These paintings have a distinctive texture which many will recognise:

[ Image: via https://www.flickr.com/photos/renaud-camus/14572053322 , Le Jour ni l’Heure 3854 by Georges Braque ]

I suspect that if you were to ask the program Gatys et al. describe to render an object in this style, it would faithfully emulate the texture. But I bet that it would not break the object into fragments and paint each from a different viewpoint. That’s one of the main characteristics of the style, but I don’t think the current generation of style-transfer programs could know this.