Generative adversarial networks have received much media attention due to the rise of deepfakes. These algorithms are finding unique applications in the arts and helping us make giant strides in understanding artificial intelligence.
“All art is but imitation of nature.” — Seneca
In October 2018, a painting titled Portrait of Edmond Belamy was expected to sell at auction for $7,000 to $10,000, but to the surprise of auction house Christie’s it ended up fetching the whopping price of $432,500.1 The gilt-framed painting is a portrait of a black-clad man with indistinct facial features. The corners of the portrait are unfinished, but the most distinctive part of the painting, and perhaps the reason for its high price, is the mathematical formula that can be seen in the bottom right corner, where the artist’s signature normally would be found. The painting was created not by a human but by an algorithm. Specifically, it was generated by a class of machine learning algorithms known as generative adversarial networks (GANs), developed by Ian Goodfellow, a renowned artificial intelligence (AI) researcher currently working at Apple.
GANs have received a lot of media attention recently due to the rise of deepfakes: videos created by superimposing celebrities’ and politicians’ faces on other people’s bodies, often those of impersonators. These deepfakes, which are powered by GANs, are eerily realistic and capable of convincing viewers that they feature real celebrities. Unsurprisingly, GANs have found applications in all kinds of visual content editing, from auto-generating anime characters to changing photos of fashion models to show different poses to increasing the resolution of blurry photographs. The video game design industry is on the verge of a revolution thanks to this technology, which is being used to create more-realistic computer graphics and virtual environments. Some consumer-facing applications, like FaceApp, also employ GANs, showing users how they would look if they aged a certain number of years. Even astronomers are using GANs to fill in parts of the sky with missing data and generate realistic renderings of deep space for further research.
But GANs’ true potential lies in how the algorithms could advance the field of AI from narrow applications to more general ones. Ever since Alan Turing published his famous paper asking whether machines can think, there has been steady progress toward developing a machine that can.2 In the past few decades, AI research has increasingly adopted statistical modeling techniques like machine learning, in which systems learn by looking for patterns in data and making inferences with minimal human intervention. One such modeling technique, called a neural network, has driven much progress in recent years, leveraging growing computational power and access to massive datasets. GANs are the latest in the line of such models and take a uniquely creative approach using neural networks to train machines. So groundbreaking is this idea that Yann LeCun, one of the modern pioneers in artificial intelligence, has described GANs as the “coolest idea in machine learning in the last 20 years.”3
To understand the game-changing potential of GANs, we need to first look at the concepts of discriminative modeling and generative modeling. In machine learning, researchers have been trying to develop algorithms that can ingest large volumes of training data to learn and understand the world. But until recently, most of the noteworthy progress in the field revolved around the idea of discriminative modeling. This refers to tasks like identifying whether a photo contains a dog or whether a given painting was created by van Gogh. Here the algorithms learn from training data in which each observation is labeled. Mathematically speaking, discriminative modeling tries to estimate the probability that an observation x belongs to a category y. Since the launch of the ImageNet database in 2009, the ImageNet Large Scale Visual Recognition Challenge that followed and the development of the deep convolutional neural network (CNN), such image classification tasks have become easier, with many considering the challenge a solved problem.
Generative modeling, on the other hand, is not merely about identifying whether a photo shows a dog. It learns from a training dataset of images of dogs to figure out the rules about their appearance and generate or synthesize new canine images. Importantly, this model should be probabilistic and not deterministic. A deterministic model always produces the same result, given a set of starting conditions or initial parameters. The generative model should therefore include a random element so that the new, synthesized image is different every time. Assume there is some unknown probabilistic distribution that describes why certain images are likely to be found in the training dataset and other images are not. The generative model should closely resemble this distribution and sample from it to output a group of pixels that look like they could have been part of the original training dataset.
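The idea of learning a distribution and then sampling from it can be sketched in a few lines. This is a deliberately simplified illustration, not how a GAN works internally: it assumes the "training data" are single numbers rather than images and that a Gaussian is an adequate model for them. The function names fit_gaussian and sample_new are hypothetical.

```python
import numpy as np

def fit_gaussian(samples):
    """Estimate the mean and standard deviation of the training data."""
    return float(np.mean(samples)), float(np.std(samples))

def sample_new(mean, std, n, rng):
    """Draw n new observations from the fitted distribution."""
    return rng.normal(mean, std, size=n)

rng = np.random.default_rng(0)
# Stand-in "dataset": 10,000 numbers from a distribution the model does not know.
training = rng.normal(loc=5.0, scale=2.0, size=10_000)

mean, std = fit_gaussian(training)
# Two calls give different outputs because sampling includes a random element.
first = sample_new(mean, std, 5, rng)
second = sample_new(mean, std, 5, rng)
```

The fitted parameters play the role of the "rules" learned from the data, and each call to sample_new produces fresh observations that plausibly could have belonged to the original dataset.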
A GAN comprises two neural networks, one based on each of the preceding models, that are trained with opposing objectives: a generative network and a discriminator, or adversarial, network. The generative network is trained to take random noise as input and output a synthetic candidate. To create a painting, a GAN would be trained on numerous samples of paintings and then generate an artificial one. To generate artificial faces, it would study a huge dataset of real photos of people.
The adversarial network, on the other hand, is trained to discriminate between a synthetic candidate and a real one. That is, this discriminator is expected to “catch” or classify a generated painting or an artificial face as being fake. When trained in a cyclical fashion, the generative network becomes progressively better at what it does — generating synthetic candidates very close to the real ones. And the discriminator network gets better at its job of catching fakes and picking out the synthetic candidates.
Think of the generative network as a forger producing imitations of great artworks and the adversarial network as an appraiser evaluating the authenticity of those works. The two are engaged in a constant tug of war. The forger wants the fake to be misclassified as real; the appraiser stands in the forger’s way because he can spot the fakes. The forger makes a large number of attempts and learns from what the appraiser allows to go through. The appraiser, for his part, learns from all the tricks the forger plays and in doing so becomes better and better at distinguishing a fake from a real work of art. This process helps both networks understand the nuances of what makes a painting real. How do we know when the training is complete? When a human eye cannot tell whether the painting was created by an algorithm or by an actual artist.
Mathematically, the GAN system can be represented by the following value function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
D and G denote, respectively, the discriminative and generative models. D(x) represents the probability that x came from the real data rather than the generator’s distribution. G(z) is a function that generates output when a noise z is introduced. By that logic, it can be seen that D(G(z)) estimates the probability that a synthesized data instance is real. E stands for the expected value over the respective probability distributions.
The first term on the right-hand side of the formula represents the likelihood of the real sample passing through the discriminator; the second term is the likelihood of the synthetic sample not passing through. The aim of the discriminator is to maximize this function so that in the ideal case all real samples will pass through and synthetic samples won’t. The generator’s job is exactly the opposite: to minimize the function. The two networks engage in this zero-sum game until the model reaches an equilibrium. In fact, the signature in the Edmond Belamy painting is a version of this formula.
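To make the two terms concrete, the following sketch estimates the value function from a handful of discriminator outputs. It assumes the standard GAN objective, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], with the expectations replaced by sample averages. The function name gan_value and the probabilities passed to it are made up for illustration.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Estimate V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] by sample means.

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on synthetic samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that separates real from fake almost perfectly.
strong = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A discriminator that cannot tell the difference and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The value is close to zero when the discriminator is winning and falls to -log 4 when it is reduced to guessing, which is the direction the generator is pushing.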
Walking a Tightrope
For generative adversarial networks, the most crucial challenge lies in the training process. This is typically done in a cyclical manner so both networks have an opportunity to learn from each other’s progress. In one step, the generator learns from how the discriminator classified the previously generated samples. If some were more likely to get classified as real than others, the generator learns to produce more samples similar to them. The discriminator is frozen until the generator has learned as much as possible from the current state of its adversary. Once that has happened, the generator is frozen and the discriminator is allowed to learn what made some of the samples almost get classified as real in the previous iteration; this helps the discriminator spot these near-misses going forward. This cycle is repeated again and again, improving both networks.
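The alternating schedule described above can be sketched with a toy GAN on one-dimensional data. Everything here is an assumption made for illustration: the real data are numbers drawn around 4.0, the generator is the linear map G(z) = a*z + b, the discriminator is D(x) = sigmoid(w*x + c), and the gradients are written out by hand. The sketch shows the freeze-and-alternate structure only and makes no claim about convergence.

```python
import numpy as np

def sigmoid(t):
    # Clip the logit so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def train_toy_gan(rounds=200, steps=5, batch=64, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(rounds):
        # Generator phase: the discriminator (w, c) is frozen.
        for _ in range(steps):
            z = rng.normal(size=batch)
            d_fake = sigmoid(w * (a * z + b) + c)
            # Gradient of mean(log(1 - D(G(z)))) with respect to a and b.
            grad_a = np.mean(-d_fake * w * z)
            grad_b = np.mean(-d_fake * w)
            a -= lr * grad_a   # the generator minimizes the value function
            b -= lr * grad_b
        # Discriminator phase: the generator (a, b) is frozen.
        for _ in range(steps):
            x_real = rng.normal(loc=4.0, scale=1.0, size=batch)
            x_fake = a * rng.normal(size=batch) + b
            d_real = sigmoid(w * x_real + c)
            d_fake = sigmoid(w * x_fake + c)
            # Gradient of mean(log D(x)) + mean(log(1 - D(G(z)))) w.r.t. w and c.
            grad_w = np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)
            grad_c = np.mean(1 - d_real) - np.mean(d_fake)
            w += lr * grad_w   # the discriminator maximizes the value function
            c += lr * grad_c
    return {"a": float(a), "b": float(b), "w": float(w), "c": float(c)}

params = train_toy_gan()
```

The steps argument controls how long one network learns while the other is frozen, which is one simple lever for keeping the two from getting too far ahead of each other.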
It’s not ideal for one of the networks to advance too quickly. If either network gets too good before the other can catch up, the training will plateau and the overall result will be suboptimal.4 A useful analogy is that of two chess students playing each other to improve their respective games. They both have to learn and improve at roughly the same pace. If one student improves her game significantly more than the other, they will both end up with a suboptimal level of expertise: The better player will not be challenged enough, and the lesser player will keep losing without learning anything significant.
When trained well, GANs can be tools to generate information in any scenario where we have a certain understanding of what to expect and where we have a system to tell if the generated information meets our expectations.
Consider the simple but all too common goal of increasing the resolution of a photograph, or “upscaling.” Starting with a low-resolution image, a GAN’s generator will create thousands of random high-resolution images as candidates to be the upscaled version of the original. In other words, these are candidates for high-resolution images that could produce our original input image if their resolutions were reduced, or downsampled. The discriminator will then go through these high-resolution images and try to classify them based on the most likely and reasonable possibilities, given its training over many high-resolution images. Together the generator and the discriminator will generate an upscaled image from a low-resolution one that will be closest to a real high-resolution image, if it had existed.5 Essentially, the GAN tries to make the best guess based on its training, even though it may not initially have all the information. GANs can also be used to remove unwanted objects or undesirable elements from an image — for example, watermarks or lampposts and trash bins. This is done by deleting the unwanted elements and letting the GAN fill the space with the most expected information, as in the process of upscaling described above.6
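The condition that a high-resolution candidate should reproduce the original image when downsampled can be checked directly. The sketch below covers only that consistency check and does not model the discriminator’s judgment of realism. It assumes grayscale images stored as 2-D arrays and downsampling by block averaging; the helper names are hypothetical.

```python
import numpy as np

def downsample(image, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def consistency_error(candidate, low_res, factor):
    """Mean squared difference between the downsampled candidate and the input."""
    return float(np.mean((downsample(candidate, factor) - low_res) ** 2))

rng = np.random.default_rng(0)
truth = rng.random((8, 8))        # stand-in for the unseen high-resolution image
low_res = downsample(truth, 2)    # the low-resolution input we actually have

# Three candidates: the true image and two increasingly noisy versions of it.
candidates = [truth + rng.normal(scale=s, size=truth.shape) for s in (0.0, 0.1, 0.5)]
errors = [consistency_error(c, low_res, 2) for c in candidates]
best = int(np.argmin(errors))     # index of the most consistent candidate
```

Many different high-resolution images downsample to the same input, so this check alone cannot pick a winner; that is where a trained discriminator’s sense of what real high-resolution images look like comes in.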
What if we have no input image to start with, but only a verbal description? Let’s say we have the words “a blue bird sitting on a tree branch, facing left.” Theoretically, a GAN should be able to create an image from just the words describing the image. In the standard process, the generator will create thousands of images and the discriminator will look through all of them, allowing only those that match the description. After many iterations, the GAN will generate an image of a blue bird sitting on a branch facing left, and it will be an entirely new creation because it was generated from a model involving a random element.
The ability to generate near-realistic data comes in handy in other areas of machine learning research, such as reinforcement learning, which involves the optimization of a goal through trial and error. Trial-and-error experiments can be complicated to conduct in certain environments. Consider the case of teaching a self-driving car to navigate a rocky terrain with cliffs and pits. If an algorithm could simulate the environment using a GAN, the testing could be done in a virtual setting and the learning could be accelerated.
The State of the Art
Some of the more straightforward applications of GANs include upscaling, removing objects from images and converting audio to images. But the real fun begins when GANs are combined with other technologies, such as convolutional neural networks that specialize in image processing and object recognition tasks. CNNs consist of layers, or filters, that extract a certain feature from an image and produce differently filtered versions of the input image. These are capable of transforming images into representations that capture the high-level content (what objects are in the image, how they are arranged) without worrying about the exact pixel values. They can also produce a representation of the style — a texturized version of the input image that captures the color and localized structures.
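How a single filter produces a filtered version of an image can be shown with a small hand-written example. This is a minimal sketch rather than a real CNN layer: the filter weights are fixed by hand instead of learned, the image is a tiny grayscale array, and apply_filter is a hypothetical helper.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slide the kernel over the image and sum the elementwise products.

    No padding is used, so the output is smaller than the input.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x6 image that is dark on the left half and bright on the right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# A filter that responds to dark-to-bright transitions from left to right.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
response = apply_filter(image, vertical_edge)
```

The response is zero over the flat regions and large where the window straddles the boundary, so the filtered image records where the edge is rather than the raw pixel values.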
In their landmark 2015 paper, computer scientists Leon Gatys, Alexander Ecker and Matthias Bethge made the breakthrough of separating content and style representations.7 They then demonstrated the idea of style transfer: By mixing an input photograph with famous artworks, they were able to synthesize new renderings of the photographs in those very artistic styles. The new rendering showed the same content as the photograph, but the style resembled the artwork. For example, combining a photograph of the Neckarfront (a tourist attraction in Tübingen, Germany) and van Gogh’s painting The Starry Night as the style reference image, the algorithm was able to create a new, artistic version of the photo, complete with post-Impressionistic flourishes resembling the painting.
Although the 2015 paper was groundbreaking, it relied on a single image as the reference for the style. Subsequent research has taken this idea further by training GANs to learn from a domain of images, such as the complete works of a specific painter or art from a certain time period. This is precisely what the Paris-based art collective Obvious did to create the painting Portrait of Edmond Belamy. It trained the GAN on a dataset of 15,000 portraits painted between the 14th and 20th centuries. The generator in the GAN was tasked with synthesizing new images based on this dataset, while the discriminator tried to catch the images that were not human-made.
As GANs have evolved, new, unforeseen uses for them have been discovered. One of these involves deepfakes and has caught widespread media attention. Deepfakes use GANs to superimpose content onto a source image or video and seamlessly alter the original content. They can be used for fun and harmless applications like impersonating celebrities and transferring professional dance moves onto the bodies of amateurs, but in the wrong hands the technology can be weaponized for harassment, social engineering, political misinformation campaigns and propaganda. The deepfake video of President Barack Obama superimposed on comedian Jordan Peele was a timely warning about the possible dangers of misinformation using the technology. We have reached a stage where anyone with access to a reasonable dataset and computing power could create such videos to mislead the public, disrupt financial markets or even cause national security incidents.
Research is already underway into catching these deepfakes through forensic techniques that model subtle mannerisms and facial expressions specific to an individual’s speaking patterns. One observation made in the early stages of deepfake generation was that the eyes of artificially created faces didn’t blink like natural eyes would. This was because the training data did not include images with the person’s eyes closed, so the algorithms had no way to learn about the concept of blinking. But once this flaw was noticed, the next generation of deepfakes accounted for blinking and easily bypassed detection techniques. Subsequently, deepfake pioneer Hao Li conducted a study that revealed certain “soft biometrics” (distinct movements of the face, head and upper body) that could help distinguish real videos of people from deepfakes.8 But Li thinks it will soon become impossible to detect fakes, with new forensic techniques and countermeasures battling each other, improving each side, not unlike the two networks competing in a GAN system.9
In his famous 1950 test, Alan Turing proposed that a machine can be said to exhibit intelligent behavior if a human evaluator is unable to distinguish its responses from those of a human in a text-based conversation. Since then, the field of AI has grown dramatically and given rise to a number of applications, most of which are restricted to narrow tasks in specific domains. Think of intelligent systems like Google Translate, Siri, Alexa and common facial recognition software. These demonstrate high levels of intelligence for specific functions and in some cases are superior to human capability. But these systems are not of much use when applied to tasks other than their specialty. In contrast, a hypothetical form of artificial general intelligence would be able to extend learning across different functions and could tackle more-complicated problems, react to unfamiliar environments and make decisions on its own.
The arrival of GANs has added much excitement to this growing field of research. It has allowed machine learning techniques to progress beyond merely being able to understand and label the data supplied to them. The techniques are now getting better at figuring out how the data was generated in the first place. To achieve true intelligence, machines should not only be able to figure out whether a photo contains a dog or a cat but also be able to understand what it means for the photo to be of a dog or a cat. The latest applications of GANs in visual content generation, especially in creating artworks, seem to suggest that we are heading in the right direction.
Tejesh Kinariwala is a Vice President, Portfolio Management, at WorldQuant and has a bachelor’s degree in electrical engineering from the Indian Institute of Technology, Delhi, and an MBA from the Indian Institute of Management, Ahmedabad.