Colormind II

In last Monday's post, I wrote about Colormind, a program which extracts colour palettes from photos. And on Friday, I turned to pix2pix, a program which can be trained to transform images, producing effects such as these:
Screenshot of the pix2pix page, showing a sketch for a cat, and the picture generated from it.
Screenshot of the pix2pix page, showing my sketch for a handbag, and the picture generated from it. [Images: (1) in tweet 19 Feb 2017 by Christopher Hesse; (2) Chromophilia ]

As it happens, Colormind is two different programs. On Monday, I discussed one of these, the extractor. But there's also an ab initio generator. Colormind's author Jack Qiao describes it in his blog entry "Generating Color Palettes with Deep Learning". Here, he trained pix2pix to generate complete palettes from partial ones. He did this by giving it a database of pairs of images. In each pair, the "output" image was a complete palette from Adobe Color, and the "input" image was the same palette with some colours missing. So in effect, he was training pix2pix to "fill in" missing colours.
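
In code, preparing one such training pair might look like this. It's a minimal sketch: I'm representing a palette as a one-pixel-tall strip of swatches, and the white mask colour, swatch width and function names are my own choices, not details from Jack Qiao's writeup.

```python
import random

MASK = (255, 255, 255)  # white stands in for a "missing" colour

def render_strip(palette, swatch_px=4):
    """Render a palette as a one-pixel-tall strip, swatch_px pixels per colour."""
    return [colour for colour in palette for _ in range(swatch_px)]

def make_training_pair(palette, n_missing=2, rng=random):
    """Return an (input, output) pair of strip images: the output is the
    complete palette, the input is the same strip with n_missing of its
    swatches blanked out."""
    partial = list(palette)
    for i in rng.sample(range(len(palette)), n_missing):
        partial[i] = MASK
    return render_strip(partial), render_strip(palette)

palette = [(188, 44, 41), (222, 158, 140), (240, 216, 205),
           (90, 39, 60), (64, 41, 33)]
inp, out = make_training_pair(palette, rng=random.Random(0))
```

Feed thousands of such pairs to pix2pix and, in effect, it learns to paint plausible colours into the blanked-out swatches.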

One could regard this as analogous to what I showed on Friday, where pix2pix was being trained to "fill in" handbags, shoes or cats from their sketches. (For the technically minded, the original authors of pix2pix note under "Color palette completion" in "Image-to-Image Translation with Conditional Adversarial Nets" that this "stretches the definition of what counts as 'image-to-image translation' in an exciting way"; it may not be the best choice of representation.)

I'm not clear from Jack Qiao's writeups how closely the ab initio generated palettes resemble those created by people. In describing the palette extractor, he says it submits the palettes it generates to a gatekeeper, which rejects those that don't look like human-created ones. The ab initio generator doesn't have a gatekeeper: its knowledge comes from complete palettes from Adobe Color. Do these have the same kind of high-level structure that human-created palettes do? I don't know.

To experiment with the ab initio generator, go to . You'll see a strip of five colours. Each box in it has either three or four controls under it. These are represented by icons for: a padlock; sliders; and a left arrow or a right arrow or both. Clicking on the sliders icon gives you controls for changing the colour. Clicking on the padlock locks in your choice. And clicking on the arrow(s) exchanges your colour with the one on its left or right. Clicking "Generate" will generate a new palette from the locked-in colours.
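
Colormind can also be driven programmatically. As I understand its api-access page (so treat the endpoint, the model name and the "N" placeholder below as assumptions that may have changed), you POST a JSON body in which locked-in colours appear as [r, g, b] triples and free slots as the string "N":

```python
import json

def colormind_request(locked):
    """Build the JSON body for Colormind's palette API: `locked` maps a
    slot index (0-4) to an [r, g, b] colour; every other slot is sent as
    the placeholder string "N", meaning "please fill this in"."""
    slots = [locked.get(i, "N") for i in range(5)]
    return json.dumps({"model": "default", "input": slots})

# Lock the first and last slots, leave the middle three to the generator.
body = colormind_request({0: [44, 62, 80], 4: [236, 240, 241]})
# POST `body` to http://colormind.io/api/ (urllib.request will do) and
# read the completed five-colour palette from the response's "result" field.
```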

Designing Handbags with pix2pix

I just designed a handbag!
Screenshot of a handbag designed by Christopher Hesse's pix2pix page.

To be fair, there's very little about the bag that's mine, apart from its outline. I made the image by sketching a handbag in the "Input" box in the "edges2handbags" section of "Image-to-Image Demo - Interactive Image Translation with pix2pix-tensorflow", by Christopher Hesse. Once I'd done so and pressed "Process", his software did the rest:
Screenshot of the pix2pix page, showing the generated handbag above, and my sketch for it.

As well as handbags, Christopher Hesse's page allows you to generate shoes from sketches, cats from sketches (with gruesome results if you get it wrong), and buildings from facade plans. It's all based on Hesse's re-implementation of pix2pix, a rather wonderful piece of machine-learning software, which can be trained to carry out a variety of general-purpose — and hard — image transformations.

To train pix2pix, it must be fed with a database of pairs of images. With the handbags, shoes, and cats, the "output" image of each pair was a photo of a handbag, shoe, or cat. The other image in the pair, the "input", was a black-and-white "sketch" thereof, automatically generated by software that detects the edges of objects. Once pix2pix has been trained, it can take new inputs and generate outputs from them.
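
To get a feel for what those automatically generated "sketches" are, here is a toy edge detector in Python. Real pipelines use far cleverer detectors such as Canny or HED; the simple brightness-jump rule and the little synthetic "photo" below are mine.

```python
def edge_sketch(img, threshold=40):
    """Mark a pixel wherever brightness jumps sharply to its right or
    below it -- a crude stand-in for real edge detectors."""
    h, w = len(img), len(img[0])
    sketch = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = abs(img[y][x] - img[y][x + 1]) if x + 1 < w else 0
            below = abs(img[y][x] - img[y + 1][x]) if y + 1 < h else 0
            if max(right, below) > threshold:
                sketch[y][x] = 1  # this pixel is part of an "edge"
    return sketch

# A 4x4 grey-scale "photo": a dark blob on a light background.
photo = [[200, 200, 200, 200],
         [200,  30,  30, 200],
         [200,  30,  30, 200],
         [200, 200, 200, 200]]
sketch = edge_sketch(photo)  # 1s trace the outline of the dark blob
```

Pair each photo with its sketch, and you have exactly the kind of input/output database pix2pix trains on.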

You can try this for yourself, at various levels. To try Christopher Hesse's generators, go to his page. He recommends using it in Chrome. I tried it in Firefox, and found that the browser kept popping up messages saying "A script is slowing down the page: do you want to kill it or wait?". (Obviously, one should then wait, not kill.) Typically, this would happen three or four times during each run. But the runs do eventually end, and then you get a new handbag you can admire, or a new cat you can run away from screaming.

Training pix2pix on new sets of images would be fun. At the moment, I think this still requires knowledge of programming: that is, there aren't yet systems that will allow you to (for example) click on loads of handbag photos, automatically turn them into sketches, feed the sketch-photo pairs to a learning program, leave it to train on them, and then embed the result into a web page or app you can use to generate new pictures from sketches. No doubt someone will eventually build one, but in the meantime, the pages above plus "Pix2Pix" by Machine Learning for Artists contain enough information for a reasonably skilled programmer to get started.

And at an even deeper level, one can research improved learning programs for fashion design, as in this recent paper: "DeSIGN: Design Inspiration from Generative Networks" by Othman Sbai, Mohamed Elhoseiny, Antoine Bordes, Yann LeCun, and Camille Couprie. That requires a deep knowledge of machine-learning-related things such as loss functions, as well as the visual language of clothing. But let's return to something simpler, the handbags. Here are some more of my runs:
Screenshots of more handbags generated by Hesse's pix2pix from my sketches.

It's notable how sensitive the output is to minute changes in input. See how the texture and colour of the right-hand face of bags 1 to 4 change when I add small details to the sketch. Or the way the colouring of bag 5 changes when I add a handle.

Why? Christopher Hesse says that he trained the handbag generator on a database of about 137,000 handbag pictures collected from Amazon. But bags vary hugely in surface detailing: one bag could be made from indigo ruched satin, while another with almost the same outline could be navy viscose/polyamide netted with black lace. A not-too-clever edge detector might output very similar sketches for both. So the mapping from sketch to bag is, as mathematicians like to say, "not well behaved": moving from one point to the next, you feel like a chamois leaping around a million-dimensional version of the Brenner Pass. One infinitesimal step in one direction, and you plummet down a precipice in some other direction that you can't define and never wanted to go.

In addition, the edge detection isn't perfect, so if you sketch a handbag using unbroken even lines, your drawing won't be using the same "notation" that the inputs do.

And, according to a remark by Jack Qiao on

Pix2pix is great for texture generation but bad at creating structure, like in the photo->map example straight lines and arcs tend to come out looking "liquified".

Here for comparison is a real bag: an evening bag that I bought from Unicorn to use as a purse. It has lots of structure.
Black velvet evening bag. It's the size of a large purse, rectangular, and decorated with sequins, small plastic beads, and leaves and spirals made from metal segments.

Primark and the Spectrum Suckers IV: Brown Needs Purple?

Here's an interesting sidelight on human-designed colour palettes. I tried running my photo from "Primark and the Spectrum Suckers" through Colormind. The photo is predominantly brown, and every single palette Colormind made from it contained some kind of purple, not too different from the one on the left below.
Colour palette from Colormind for the photo of Primark used in 'Primark and the Spectrum Suckers'.

Colormind, I explained in my post about it, extracts the main colours from a photo, produces random variations on them, and then sends these for scrutiny by a gatekeeper: a machine-learning program trained on palettes that Jack Qiao, Colormind's author, thought were good looking. I wonder whether Jack didn't use enough palettes to teach it that brown doesn't always have to go with purple. Google "colour palette brown" and you'll see that there are other choices.


Colormind

While I was looking for photos of green-and-purple clothing, I came across a colour-scheme generator named Colormind. There are lots of generators on the web. What distinguishes Colormind is that it tries to make its schemes acceptable to humans.

This is difficult, says Colormind's author, Jack Qiao. In his blog post "Extracting Colors from Photos and Video", he writes that:

Human-designed color palettes typically have some high-level structure — a gradient running left to right, similar hues grouped together etc., and have some minimum amount of contrast between each color. Automatically created palettes [ones automatically created from an image] look more haphazard, with colors distributed according to how they were used in the original image.

There's a short discussion about this on the YCombinator Hacker News group, where Jack proposes an experiment to demonstrate the difference between randomly generated palettes and ones designed by experts. Go to and click on one of the color rules. Adobe will generate a random palette based on that rule. Then compare it with a palette uploaded by users on or .

Given that there is this difference, how can one make a machine generate human-style palettes? Jack's answer is to use the results of machine learning. Here's his diagram for the process:
Diagram of how Colormind generates a palette from an image. It starts
with an extracted palette labelled 'MMCQ'. This is followed by four slightly
different palettes labelled 'Random Variations'. These lead to another four
palettes labelled 'Shuffle'. They are all fed into a 'Classifier'. The output
from the classifier is the same as the 'Shuffle' palettes, except that each is
annotated with a number. Finally, there is an 'Output' palette. In the diagram, this is the one
with the highest number. [ Image: from "Extracting Colors from Photos and Video" by Jack Qiao in his blog. ]

The first stage is colour quantisation. And now you know why I devoted a post to this last Friday. In the diagram above, that's represented by the first sub-image, the one labelled MMCQ. That's an abbreviation for the name of a particular colour-quantisation algorithm, Modified Median Cut Quantization. The second stage is to produce a few random variations on the extracted palette, shown in the row below. The third stage is labelled 'Shuffle'. From Jack's diagram, this appears to mean that it shuffles the order of colours within each palette. The fourth stage feeds all the shuffled palettes to a "classifier", which rates them for acceptability. And the fifth stage rejects unacceptable palettes.
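
Those five stages can be sketched in a few lines of Python. Everything here is a stand-in of my own devising, most obviously the contrast heuristic that takes the place of Colormind's trained classifier; it is meant only to show the shape of the pipeline.

```python
import random

def luminance(colour):
    r, g, b = colour
    return 0.299 * r + 0.587 * g + 0.114 * b

def score(palette):
    """Stand-in for stage 4's trained classifier: reward palettes whose
    neighbouring colours differ in brightness (a crude notion of contrast)."""
    return sum(abs(luminance(a) - luminance(b))
               for a, b in zip(palette, palette[1:]))

def jitter(palette, rng, amount=25):
    """Stage 2: a random variation on the extracted palette."""
    return [tuple(min(255, max(0, v + rng.randint(-amount, amount)))
                  for v in colour) for colour in palette]

def generate(extracted, rng, n_variants=4):
    # Stage 1 (MMCQ quantisation) is assumed to have produced `extracted`.
    candidates = []
    for _ in range(n_variants):
        variant = jitter(extracted, rng)  # stage 2: random variations
        rng.shuffle(variant)              # stage 3: shuffle the order
        candidates.append(variant)
    # Stages 4 and 5: rate every candidate, keep the best-scoring one.
    return max(candidates, key=score)

extracted = [(92, 61, 46), (141, 110, 99), (188, 152, 126),
             (210, 180, 140), (64, 41, 33)]
best = generate(extracted, random.Random(1))
```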

The classifier is where machine learning comes in. Jack trained this on palettes that he'd chosen as "good looking". As he says, "In the end [after some experiments] I built a self-contained classifier and trained it on a hand-picked list of examples. Good color palettes generally have good color contrast and an overarching theme, and bad ones look random and/or has bad inter-color contrast." Once trained, Jack's classifier acts as a gatekeeper, letting through only palettes that it thinks are good looking.

So to summarise, Colormind reduces a photo to a palette consisting of a small number of colours. It then generates random variations on this, and then rejects those that, to a gatekeeper trained on appealing palettes designed by humans, look bad. I was curious to see how this would apply to my red silk top, which as I mentioned in "Visualising Clothing Colours as a 3D Cloud of Points II", is an intense red with little white. Here are three palettes Colormind generated from it:
Three palettes generated by Colormind for my red silk Chinese top. Each has an intense red, two pale brick-reddish-pinks, a very pale whitish red, and a dark maroony-aubergine.

Each has an intense red, two pale brick-reddish-pinks, a very pale whitish red, and a dark maroony-aubergine. For comparison, here's the TinEye palette. It has a very different distribution, which hasn't balanced the darks with a pale:
Colour palette from TinEye for my red Chinese silk top, with a copy of the photo reduced to that set of colours. The palette contains one intense red (Cinnabar), two much browner reds (Guardsman Red and Monarch), a blackish brown (Aubergine), and a pink (Sea Pink).

Here's one other example, from the photo of the blue, green, and plum shirts together. The first image is from Colormind, and the second from TinEye. I don't know why Colormind hasn't given me the colour labels this time.
Colour palette from Colormind for my sage-green, ice-blue, and plum velvet Moroccan shirts.
Colour palette from TinEye for my sage-green, ice-blue, and plum velvet Moroccan shirts.

To see how Colormind does on other images, try it yourself. Should you want to use my photos, I've made them available in this zip file.

Style Transfer: Does Deep Learning Understand Art Deeply?

How deeply does style transfer understand the styles it transfers? Not very, I suspect. Consider analytical cubism. This was the early cubism of Braque and others, who composed paintings from fragmentary images, each showing the same object from a different viewpoint. These paintings have a distinctive texture:
A selection of analytical cubist paintings [ Image: Google Image Search for "analytical cubism" ]

According to Google's usage-rights tool, most of the analytical cubist paintings are not public domain, so I'm restricted in what I can show. As the search results above are practically unusable, I hope I'm OK in claiming "fair use" for them. But one that is public domain is Juan Legua by Juan Gris (1911):
The painting 'Juan Legua' by Juan Gris [ Image: from the Metropolitan Museum of Art ] This demonstrates the multiplicity of viewpoints and how these get pieced together, while the smaller pictures give an impression of the overall texture thus created.

What has this to do with style transfer? Were I to ask a program of the kind described in my Style Transfer post to render a photo in the style of one of the above paintings, I am sure it would do so. But it would do so in the way someone would whose only experience with analytical cubism comes from one of those tiny pictures.

In other words, it would have a superficial model of the target style, namely that the distribution of colour and tone is fairly even, that the colours are not bright, that there are a lot of short dark lines, that these tend to be straight or slightly curved, that they tend also to be fairly evenly spaced, and so on. What it would not have is a more profound model of the analytical cubist's intentions, as stated, for example, in this quote by Peter Vergo:

What the Cubists had done was to create a new image of reality, influenced to some extent by the radical theories of the French philosopher Henri Bergson. Rejecting any conception of painting as a kind of 'window on the world', they broke decisively with the post-Renaissance convention of depicting objects as if seen from a single viewpoint, employing instead what Metzinger called 'mobile perspective' — moving round objects, simultaneously recording not only different images of the same object, but also the near and the far, the seen and the remembered. The more radical also analysed, probed, destroyed objects in order to reconstruct them, enhancing the emphasis given to the surface plane of the picture while at the same time progressively blurring the separation between the motif (figure, object, etc.) and its environment.

Why does this matter? I don't know whether it does matter to style transfer for clothing design. But it's good for programs to have as deep an understanding as possible of their task, and this seems to be a case where style transfer is currently lacking. I'll give another example, probably more relevant to fashion, in a later post. Meanwhile, I'd like to issue a challenge to the style-transfer researchers: build a program that does cubism properly!

From Peter Vergo's Introduction to Abstraction: Towards a New Art. Painting 1910-20, Tate Gallery, 1980

Using Style Transfer to Design Clothes

Here's a follow-up to my style-transfer post, itself inspired by this flowery but earth-toned kimono, this not-at-all earth-toned Yves Saint-Laurent dress, and the onset of this year's very-definitely earth-toned autumn. Put these together, and I'm sure you can see what I'm aiming at.

Briefly — I want a kimono with colours as vivid, bright and bold as that dress. I probably am not allowed to pay someone to make one with exactly that dress's designs, because it would violate Yves Saint-Laurent's copyright. I can't afford to pay an artist to invent a new design that's equally good but just different enough to avoid violation. But suppose I could run a computer program that either (a) inputs designs from all the "Homage to Pablo Picasso" dresses and invents one in the same spirit, or (b) inputs one design and mutates it to produce an equally good variation.

That seems to be what is described in the paper "Fashioning with Networks: Neural Style Transfer to Design Clothes" by Prutha Date, Ashwinkumar Ganesan and Tim Oates, 31 July 2017, posted on the arXiv. The method is similar to what I described in my style-transfer post, which is why I went into so much detail.

For our purposes, the differences seem to be that, first, the garment and its parts are the "content": that is, the object. The style is the colouring and texture. Transferring the style from one shape of garment to another automatically makes the colouring and texturing follow the second shape:

Image from the paper cited above, showing: a content image (an elaborately shaped top); a style image (picture of one other top, different in shape from the content image); and a generated style image (a top shaped like the content image, but coloured with designs from the style image)
[ Image: by Prutha Date, Ashwinkumar Ganesan and Tim Oates, from the paper cited above ]

Actually, I'm not quite sure whether I have that right, because it doesn't always seem to happen in style transfer from paintings to photos: see for example the Tübingen examples in my style-transfer post. In the one for Munch's The Scream some of Munch's sky has migrated into the frontage of a house. On the other hand, I did note that van Gogh's stars stay strictly within Tübingen's sky. I think the truth is that there are no rules in the style-transfer code that make style follow shape. However, style still tends to follow shape, because the optimisation process that I described in my post tries to preserve as much content as possible. But content is the objects in the image, and they will be lost if restyling erases or blurs too much of their boundaries. So there is an implicit bias in favour of retaining their shape, and therefore in favour of making the styling flow round it.

At any rate, the second difference between painting-to-photo style-transfer (in the research I've written about) and the garment-to-garment style transfer paper is that the latter can merge styles from several garments before transferring. This is demonstrated in the following image from the paper:
Image from the paper cited above, showing: a content image (an elaborately shaped top); four style images (pictures of four other tops, different in shape from each other and from the content image); and a generated style image (a top shaped like the content image, but coloured with designs from the four style images)
[ Image: by Prutha Date, Ashwinkumar Ganesan and Tim Oates, from the paper cited above ]

So is that all I need? Not quite. I want a garment, not a picture of a garment. Amazon is said to be developing factories that could automatically make clothes given their specifications: see "Amazon won a patent for an on-demand clothing manufacturing warehouse" by Jason Del Rey, recode, 18 April 2017. The style-transfer work would have to generate that specification, however — that is, some kind of sewing pattern — and as far as I know, it can't yet. But who knows what research is being done that hasn't yet been reported?

Style Transfer: a Summary

Here's an answer to the question "What is Style Transfer?" that I posted recently in a discussion forum. It has the same structure as yesterday's post on the topic, but is shorter and so might be easier to understand.

In its most general sense, style transfer means rendering one artefact in the style of another. Most research recently has probably been on style transfer in visual art: particularly on rendering photographs in the styles of various famous painters. Here are some examples:

[ Image: via "Artistic Style Transfer" by Firdaouss Doukkali in Towards Data Science, credited to @DmitryUlyanovML ]

[ Image: from "Convolutional neural networks for artistic style transfer" by Harish Narayanan in his blog ]

[ Image: via "Artistic Style Transfer with Deep Neural Networks" by Shafeen Tejani in the From Bits to Brains blog, from "Image Style Transfer Using Convolutional Neural Networks" by Leon A. Gatys, Alexander S. Ecker, Matthias Bethge ]

The concept of style transfer doesn’t need to be restricted to visual art. For example, there’s a paper on style transfer in cooking described in Lucy Black’s “Style Transfer Applied To Cooking - The Case Of The French Sukiyaki” (14 May 2017). The work she describes is by Masahiro Kazama, Minami Sugimoto, Chizuru Hosokawa, Keisuke Matsushima, Lav R. Varshney and Yoshiki Ishikawa, one of whom she says is a professional chef. It’s written up in an arXiv paper at [1705.03487] A neural network system for transformation of regional cuisine style .

In visual art, the concept doesn’t need to be restricted to photographs and paintings. Prutha Date, Ashwinkumar Ganesan and Tim Oates have written a paper on Neural Style Transfer to Design Clothes, also available in the arXiv. (This is one reason I’m taking an interest.)

But now let’s talk about photographs and paintings. As I understand it, the “modern era” in style transfer began with a paper by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge, . There had been work on the topic before, but it tended to use only low-level image features to define style, and could cope only with a restricted range of objects in images, e.g. faces. The advance of Gatys et al. was to use so-called “convolutional neural networks” (a.k.a. “convnets”). These can be trained to recognise a vast range of common objects, making it possible to separate the “content” of an image (i.e. the things it depicts) from its style. They are also able to recognise higher-level aspects of style than could previous work.

The key points as I understand them are:

1) Style (in most of this work) refers to the technique of a single artist in a single painting, though there is now research on inferring style from multiple paintings by the same artist, e.g. all Impressionist paintings by Monet.

2) Style means things like the thickness and smoothness of lines, how wiggly they are, the density and distribution of colour, and the surface texture of brush strokes in oil paintings.

3) Style can occur at a range of scales.

[ Image: incorporates Wikipedia's The Starry Night ]

In the above images, the circles represent what a vision scientist might call “receptive fields”. Each one “sees” the properties of the image portion within it. The first set of circles “see” how the boundaries of the church and its spire are painted. The second set “see” the swirls in van Gogh’s sky. Each one sees an individual swirl, and can perceive bulk properties such as its width and curvature. The third set of circles also “see” the swirls, but in much more detail. They see a swirl as a bundle of strokes, and can work out how far these are apart, how their shapes match, and so on.

4) Style is distributed across the entire image. The above should make that clear. Or think of other examples, such as a cartoonist’s cross-hatching. So estimating style entails working out correlations between different parts of the image.

5) Given two images, it’s possible to work out how close they are in style. In other words, style can be quantified. This is important, because if we write a program that wants to render (say) a portrait of Quora’s founder in the style of The Starry Night, the program has to know when its output actually looks van-Gogh-ish.

6) As already said, images have “content” as well as style. The content of an image is the objects it depicts.

7) The detection of content actually rests on a vast amount of work done to train learning programs (these “convnets”) to recognise objects. This work in turn has relied on (a) a huge database of images called ImageNet; (b) a huge number of volunteers to label the images therein with their descriptions; (c) a huge database of words and concepts called WordNet. The last of these gives the volunteers a consistent framework, so that if two volunteers label a picture of a crocodile (say), they’ll do so in the same way.

8) Unlike style, content is local. Objects occupy a fixed part of an image, and the recogniser need only be interested in that. Unlike style, most objects don’t contain long repeated patterns. (Indeed, if they do, I think we’d often regard that as texture, and the choice of how to depict the texture as style.)

9) Given two images, it’s possible to work out how close they are in content. In other words, content can be quantified. This is important, because if we write a program that wants to render (say) a portrait of Quora’s founder in the style of The Starry Night, the program has to know when its output actually looks like said founder.

10) OK, now we have everything we need. So suppose I have two images. One, Q, is the “content” image: the photo of Mr. Quora. One, S, is the “style” image: a photo of The Starry Night. Now I generate a third image, O, at random. This is going to become my output: Mr. Quora retouched in the style of The Starry Night.

By point 5, I can measure how close O and S are in style. By point 9, I can measure how close O and S are in content. And I can continue doing this no matter how I change O.

So I now start optimising. If O is far in style from S, I tweak it so as to bring it closer. If O is far in content from Q, I tweak it to bring it closer. And I keep doing so until the result is quite close to S in style, and quite close to Q in content.

The important point here is that my tweaker doesn’t itself need to know how to measure style and content. It just needs to feed O into the “how close am I to S in style meter” and the “how close am I to Q in content meter” mentioned two paragraphs ago, then keep tweaking O until both meters give a high reading. And, just to recap, by “meter”, I mean the “convnets” that Gatys et al. used.

How does the optimisation know how to tweak an image? After all, if it just kept on doing so at random, perhaps millions of years would pass before it got an acceptable result. The answer is something called “gradient descent”. This is like a man standing on top of a hill, and taking a few strides along each path down it, in order to estimate which path will get him to the base most quickly.
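
Here is the whole loop in miniature, run on a four-pixel "image". The toy style statistic (the average jump between neighbouring pixels) and the loss weights are my own stand-ins for the convnet-based meters, but the shape of the computation is the same: measure both distances, estimate the downhill direction numerically, take a small step, repeat.

```python
def style(v):
    """Toy style statistic: average jump between neighbouring pixels."""
    return sum(abs(a - b) for a, b in zip(v, v[1:])) / (len(v) - 1)

def loss(o, q, s, alpha=1.0, beta=5.0):
    """How far O is from Q in content plus how far it is from S in style."""
    content = sum((a - b) ** 2 for a, b in zip(o, q)) / len(o)
    style_gap = (style(o) - style(s)) ** 2
    return alpha * content + beta * style_gap

def tweak(o, q, s, step=0.01, eps=1e-4):
    """One gradient-descent step: nudge each pixel, see which way the
    loss falls, and move a little in that direction."""
    base = loss(o, q, s)
    grad = [(loss(o[:i] + [o[i] + eps] + o[i + 1:], q, s) - base) / eps
            for i in range(len(o))]
    return [v - step * g for v, g in zip(o, grad)]

q = [0.2, 0.2, 0.8, 0.8]  # the "content" image Q
s = [0.0, 1.0, 0.0, 1.0]  # the "style" image S: a very rough texture
o = [0.5, 0.5, 0.5, 0.5]  # the output O, started off flat and boring
for _ in range(200):
    o = tweak(o, q, s)
```

After a couple of hundred tweaks the loss has fallen substantially: O has picked up some of S's pixel-to-pixel roughness while still being pulled towards Q's values.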

So that’s a non-technical summary of “modern era” work on visual style transfer. I’ve written it up in more detail as a blog post, “Style Transfer - Chromophilia” . This post has references in it to the paper by Gatys et al., and to various review and blog articles.

Style transfer research will continue, I’m sure, in many directions. I can foresee researchers wanting to make it more “semantic”. When you see this word, it’s a give-away that the author wants their programs to become less superficial, looking at “meanings” rather than surface features. And although current work already produces some stunning images, it by no means has a perfect conception of style.

For example, Analytical Cubism refers to the early cubism of Braque and others, where paintings were made from a variety of fragmentary images, each showing the same object from a different viewpoint. These paintings have a distinctive texture which many will recognise:

[ Image: via , Le Jour ni l’Heure 3854 by Georges Braque ]

I suspect that if you were to ask the program Gatys et al. describe to render an object in this style, it would faithfully emulate the texture. But I bet that it would not break the object into fragments and paint each from a different viewpoint. That’s one of the main characteristics of the style, but I don’t think the current generation of style-transfer programs could know this.


Style Transfer

I've been blogging about my admiration for Yves Saint-Laurent's "Homage to Pablo Picasso" dress, and my desire for similar designs on my own clothing. I'm going to write more about this, but I want first to introduce a discipline called "style transfer": using computers to render one picture in the style of another.

Some examples

A now famous example of style transfer is redrawing photographs to resemble van Gogh's The Starry Night. There are loads of illustrations of this on the web. Here's one, showing a dog photo thus treated:
Photo of dog, followed by van Gogh's 'The Starry Night', followed by dog photo rendered in the same style [ Image: via "Artistic Style Transfer" by Firdaouss Doukkali in Towards Data Science, credited to @DmitryUlyanovML ]

Here are more, one of which also takes its style from The Starry Night:
Various artworks, each followed by photo of child rendered in the same style [ Image: from "Convolutional neural networks for artistic style transfer" by Harish Narayanan in his blog ]

And here are yet more, showing a photo of the Neckarfront in Tübingen rendered in the styles of The Starry Night, Turner's The Shipwreck of the Minotaur, and Munch's The Scream.
Various artworks, each followed by the Tübingen photo rendered in the same style [ Image: via "Artistic Style Transfer with Deep Neural Networks" by Shafeen Tejani in the From Bits to Brains blog, from "Image Style Transfer Using Convolutional Neural Networks" by Leon A. Gatys, Alexander S. Ecker, Matthias Bethge ]

Where did these techniques come from, and how do they work? The main principles seem to be these:

Defining style

1. "Style", as the term is used in this research, means things like the thickness and smoothness of lines, how wiggly they are, the density and distribution of colour, and the surface texture of brush strokes in oil paintings.

Style and scale

2. Style can happen at many scales. I'll demonstrate with the images below. They show a lattice of circles overlaying The Starry Night. The circles are what vision scientists would call "receptive fields": each is sensitive to the visual properties of what's inside it, and summarises these for use in higher-level analyses of the scene.

The first lot of circles are scaled so that they "see" the thickness of the lines painted round the church. The second lot of circles see the swirls in the sky: each circle is big enough to see a swirl in its entirety, so can concentrate on bulk properties such as its width and curvature. The third lot of circles also see swirls. But I've made them much smaller, so that they see the strokes making up a swirl, and how these strokes resemble one another.
Circular receptive fields overlaying 'The Starry Night', analysing the lines painted
round the church
Circular receptive fields overlaying 'The Starry Night', analysing neighbouring swirls
in the sky
Circular receptive fields overlaying 'The Starry Night', analysing the internal makeup of
the swirls [ Image: incorporates Wikipedia's The Starry Night ]

Style is distributed

3. Style is spatially distributed. What's important when characterising it is the correlations between different regions.

So what's important about the boundary lines around the church, for example, is that they are all similar. What's important about the swirls is that their constituent strokes are bundled, being roughly the same distance apart, the same width, and the same wiggliness. And in the second image above, we look not at the correlation within a swirl, but at the correlation between swirls, noting how each one resembles the one next to it.
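The standard way to capture such spatially distributed correlations in this line of research is the Gram matrix of a layer's feature maps. Here's a minimal sketch (using random arrays as stand-ins for convnet features, which in the real method come from a trained network):

```python
import numpy as np

def gram_matrix(features):
    """Correlation of every feature channel with every other, summed over
    all spatial positions. `features` has shape (channels, height, width).
    Because the sum runs over the whole image, the result records *that*
    features co-occur, not *where* -- which is what makes it a measure of
    spatially distributed style rather than content."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)    # one row per feature channel
    return flat @ flat.T / (h * w)       # (c, c) matrix of correlations

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 4))   # toy: 3 channels over a 4x4 grid
G = gram_matrix(feats)
print(G.shape)                           # (3, 3)
```

The key point is that the spatial dimensions are averaged away: two images can have the same Gram matrix while depicting completely different things.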

The distance between two styles

4. There are programs that, given two images I and J, can calculate how close J's style is to I's. If I were The Starry Night, such a program would be able to work out that the restyled dog photo in my first picture is close to it in style, while the original dog photo is much further away.

How do these programs work? They look for spatially distributed regularities, such as those I mentioned in point 3. In principle, a skilled programmer with a knowledge of drawing and painting techniques could write such a program. But it would take a long time, and lots of trial and error, to precisely specify the style used by any particular artist in any particular picture; and then the programmer would have to start all over again for the next picture. So instead, programs have been developed that, given examples of a style, can learn the features that distinguish it from other styles. That's why I've categorised this post as "machine learning" as well as "style transfer".
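Given features that such a learned system has already extracted, one simple style distance — a sketch of the kind used in this research, with plain arrays standing in for real convnet activations — is the mean squared difference between the two images' Gram matrices:

```python
import numpy as np

def gram(features):
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def style_distance(feats_i, feats_j):
    """Mean squared difference between the two images' feature correlations.
    In the real method the features come from several layers of a trained
    convnet; here they are just arrays handed in by the caller."""
    d = gram(feats_i) - gram(feats_j)
    return float(np.mean(d ** 2))

rng = np.random.default_rng(1)
a = rng.standard_normal((3, 8, 8))
print(style_distance(a, a))    # 0.0: an image's style is zero distance from itself
```

Feature maps with very different correlation structure would score a large distance, however similar their content.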

Style versus content

5. As well as style, pictures have "content". This is the objects they depict: such as the child in Harish Narayanan's examples above. When you change the style of a picture, you want something else to remain unchanged: that's the content.

Extracting content

6. Programs can be written to detect the objects in an image. This was long considered an impossible task for computer scientists. When you think of all the possible objects in the world — on the Internet, we see mainly cats, but there are also catkins, catalogues, catamarans, catwalks and catteries, not to mention cattle, cathedrals and Catholics — how on earth could one hope to code a description of the features that distinguish them one from another and from non-objects?

As with style, the answer is machine learning — but to a much greater degree.

Before going further into machine learning, I want to mention a project called ImageNet: a large visual database designed for use in research on the software recognition of objects in images. It was described in Harish Narayanan's blog post, "Convolutional neural networks for artistic style transfer". ImageNet contains over 14 million images, divided into groups covering about 21,000 concepts, with around 1,000 images per concept. For example, there are 1,188 pictures of the African crocodile, Crocodylus niloticus:
A page of pictures of crocodiles from ImageNet [ Image: screenshot of an ImageNet page ]

How does ImageNet know what the pictures are of? Because volunteers have labelled each image with details of what it contains. To ensure that the labelling is consistent, they've followed the conventions of a database of concepts and words called WordNet. This provides a framework within which each volunteer can label in a way consistent with all the other volunteers.

Now back to machine learning. With a database showing crocodiles in all possible orientations, surroundings and lighting conditions, and of all possible colourings and ages; with all the crocodiles consistently labelled; and with all other objects (such as alligators, logs, boats, and newts) also consistently labelled, it's possible to train a suitable machine-learning program to pick out crocodiles and distinguish them from those other objects. The content-extracting programs used in this research have been pre-trained in this way.
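To give a flavour of what "training on labelled examples" means, here's a vastly simplified toy of my own — logistic regression on a hundred made-up points, where real ImageNet training uses deep convnets on millions of photos. The principle is the same: nudge the parameters so the labels come out right.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for labelled training data: "crocodile" features cluster
# around +1, "not crocodile" (alligators, logs, boats, newts...) around -1.
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(+1, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(crocodile)
    w -= 0.5 * (X.T @ (p - y)) / len(y)     # gradient step on the log loss
    b -= 0.5 * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print((preds == y).mean())                  # training accuracy, close to 1.0
```

The consistent labelling is what makes this possible: the program only learns to separate the classes because every example tells it which side it belongs on.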

The differences between looking for style and looking for content

7. As it happens, the programs used to learn style are very similar to those used to learn objects. One big difference, I suppose, is the huge pre-training on objects that I mentioned just above.

Content is local

8. Another difference is that whereas style information is distributed across potentially the entire image, object information is "local": that is, confined to a fixed region. A glimpse of this can be gained from the diagram below:
The features recognised at various different layers of a convnet object classifier [ Image: via "Convolutional neural networks for artistic style transfer" by Harish Narayanan in his blog, from Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville ]

This diagram shows a "convnet" or "convolutional neural net", the learning system used in the style-transfer research I'm covering here. I've taken it from the section titled "How do convnets work (so well)?" in Narayanan's post.

The diagram shows that the learning system contains a first layer of receptive fields similar to those I drew over The Starry Night. This layer recognises basic visual features such as edges. It passes its summaries up to a higher layer, which recognises higher-level features, such as corners and contours. And this passes its summaries up to a yet higher layer, which recognises even higher-level features such as simple parts of objects.
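Here's a tiny hand-made sketch of that first, edge-recognising layer — my own toy, not a trained network — implemented as a small convolution followed by the "keep only positive responses" step (ReLU) that convnets use:

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution (strictly, cross-correlation, as convnets use)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out

img = np.zeros((5, 6))
img[:, 3:] = 1.0                           # dark on the left, bright on the right
edge = np.array([[-1.0, 1.0]])             # responds to left-to-right brightening
layer1 = np.maximum(conv2d(img, edge), 0)  # ReLU: keep positive responses only
print(layer1)                              # a column of 1s marking the edge
```

Stacking a second convolution on top of layer1 would make each unit in it depend on a wider patch of the original image — which is how the higher layers come to see corners, contours, and eventually parts of objects.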

As an aside, it's interesting to look at the pictures below of one of my Moroccan shirts. Would the decoration count as style or content, and what features might be used in recognising it?
Plum velvet Moroccan shirt, showing embroidery down front

Plum velvet Moroccan shirt, showing embroidery down front [ Image: Chromophilia ]

Demonstrating content recognition

9. There's a wonderful demo program recommended by Narayanan, "Basic Convnet for MNIST". This demonstrates how a convnet recognises numerals. Go to the page, and click on "Basic Convnet" in the menu on the left. Then draw a numeral on the white canvas. To erase, click the cross next to "Clear". The second "Activation" layer shows (if I understand correctly) receptive fields which respond to parts of the numerals. You can see from the dark regions which part of the numeral each field regards as salient.

It's probably worth saying that the features recognised by each layer are not arbitrary, but are (again if I understand correctly), those needed to best distinguish the objects on which the program was trained. So if the objects were the numerals, the upper left vertical in a "5" would probably be important, because it distinguishes it from a "3". Likewise, the crossbar on a "7" written in the European style would be significant, because it's a key feature distinguishing it from a "1".

The distance between two contents

10. Given two images I and J, we can calculate how close the objects in J are to those in I. That is, how close J's content is to I's. If I were the original dog photo in my first picture, such a program would be able to work out that the restyled dog photo is close to it in content, while The Starry Night is much further away. This is the content counterpart of my point 4.
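A minimal sketch of such a content distance — again with plain arrays standing in for features a convnet would extract — compares feature maps position by position, so that, unlike the Gram-based style distance, spatial layout matters:

```python
import numpy as np

def content_distance(feats_i, feats_j):
    """Mean squared difference between feature maps, position by position.
    Unlike the Gram-based style distance, this keeps spatial layout:
    the same features must occur in (roughly) the same places."""
    return float(np.mean((feats_i - feats_j) ** 2))

rng = np.random.default_rng(2)
a = rng.standard_normal((3, 8, 8))
shifted = np.roll(a, 2, axis=2)          # same features, moved sideways
print(content_distance(a, a))            # 0.0
print(content_distance(a, shifted) > 0)  # True: position matters here
```

Shifting the features sideways leaves the style distance of point 4 untouched but makes the content distance jump — exactly the local-versus-distributed difference of point 8.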

Generating images with the right balance of style and content

11. So suppose that we have an image I which is The Starry Night, and an image J which is the unretouched dog photo I showed in my introduction. And we generate a third image at random, K. From point 4, we can see how close its style is to The Starry Night's, and from point 10, how close its content is to "a dog as in my photo".

Now we change K a tiny bit. This will make it either more or less like I in style (The Starry Night), and either more or less like J in content (my dog). And we keep changing it until we achieve an optimum balance of style and content.

But there has to be a tradeoff, because we don't want the style to be too perfect at the expense of the content, and we don't want the content to be too perfect at the expense of the style. (This, of course, is a familiar situation in art. The medium and the tools impose restrictions which have to be worked around. For example, it's difficult to depict clouds and blond hair when working in pen and ink.)

At any rate, this is the final piece in the jigsaw. The style-transfer programs covered here contain an optimiser, which repeatedly tweaks K until it is optimally near I in style (but not at the expense of its content) and optimally near J in content (but not at the expense of its style). The technique used is called "gradient descent", and resembles a walker at the top of a hill taking a few strides along each path down, and estimating which one will get him to the bottom most quickly.
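The whole loop can be sketched in miniature. In this toy of my own the convnet is replaced by the identity mapping, so the "features" are just pixel values; in the real method the two distances are measured on convnet features and the gradients flow back through the network. The weights alpha and beta set the tradeoff between content and style.

```python
import numpy as np

def gram(f):
    c, h, w = f.shape
    flat = f.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def total_loss(K, style_f, content_f, alpha=1.0, beta=1.0):
    content = np.mean((K - content_f) ** 2)          # point 10's distance
    style = np.mean((gram(K) - gram(style_f)) ** 2)  # point 4's distance
    return alpha * content + beta * style

def total_grad(K, style_f, content_f, alpha=1.0, beta=1.0):
    c, h, w = K.shape
    g_content = 2.0 * (K - content_f) / K.size
    dG = gram(K) - gram(style_f)
    g_style = (4.0 / (c * c * h * w)) * (dG @ K.reshape(c, h * w)).reshape(K.shape)
    return alpha * g_content + beta * g_style

rng = np.random.default_rng(3)
style_f = rng.standard_normal((3, 8, 8))    # stands in for The Starry Night
content_f = rng.standard_normal((3, 8, 8))  # stands in for the dog photo
K = rng.standard_normal((3, 8, 8))          # the randomly generated image

before = total_loss(K, style_f, content_f)
for _ in range(300):                        # plain gradient descent:
    K = K - 1.0 * total_grad(K, style_f, content_f)  # small step downhill
after = total_loss(K, style_f, content_f)
print(after < before)                       # True: K moved towards both targets
```

Each step is one stride of the walker: compute which direction reduces the combined distance fastest, and move a little that way.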


I've included some references below, for computer scientists who want to follow this work up — or for artists who want to persuade a computer-scientist collaborator to do so. The first reference is the paper that ushered in the "modern era" of style transfer, inspiring the methods I've described above. It and the next two references point back at the history of the topic; the third reference, "Supercharging Style Transfer", also describes work on inferring style from more than one artwork — for example, from a collection of Impressionist paintings. The next reference is the one I've referred to throughout this article, by Harish Narayanan; the one after it is another, briefer but in the same vein. It shows some more nice examples. And the final one is a technical paper on extracting style, referred to by Narayanan: "This is not trivial at all, and I refer you to a paper that attempts to explain the idea."

For more pictures generated by style transfer, just point your favourite search engine's image search at the words "artistic style transfer".


"Image Style Transfer Using Convolutional Neural Networks" by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge, Open Access version also published on IEEE Xplore
16 May 2016.

"Neural Style Transfer: A Review" by Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu and Mingli Song
17 June 2018.

"Supercharging Style Transfer" by Vincent Dumoulin, Jonathon Shlens and Manjunath Kudlur, Google AI Blog
26 October 2016.

"Convolutional neural networks for artistic style transfer" by Harish Narayanan in his blog
31 March 2017.

"Artistic Style Transfer with Deep Neural Networks" by Shafeen Tejani, From Bits to Brains blog
27 December 2016.

"Incorporating long-range consistency in CNN-based texture generation" by Guillaume Berger and Roland Memisevic
5 November 2016.