Vision, both looking and seeing, turns out to be one of the hardest of all “easy” things.
Convolutional neural networks (ConvNets) are the driving force behind today’s deep-learning revolution in computer vision, and in other areas as well.
Object Recognition
One prerequisite to describing visual input is object recognition, that is, recognizing a particular group of pixels in an image as a particular object category, such as “woman”, “dog”, “balloon”, or “laptop computer.” Object recognition is typically so immediate and effortless for us humans that it didn’t seem as though it would be a particularly hard problem for computers, until AI researchers actually tried to get computers to do it.
David Hubel and Torsten Wiesel were awarded a Nobel Prize for their discoveries of hierarchical organization in the visual systems of cats and primates (including humans) and for their explanation of how the visual system transforms light striking the retina into information about what is in the scene.
The visual cortex is roughly organized as a hierarchical series of layers of neurons, like the stacked layers of a wedding cake, where the neurons in each layer communicate their activations to neurons in the succeeding layer. Hubel and Wiesel found evidence that neurons in different layers of this hierarchy act as “detectors” that respond to increasingly complex features appearing in the visual scene: neurons at initial layers become active in response to edges; their activation feeds into layers of neurons that respond to simple shapes made up of these edges; and so on, up through more complex shapes and finally entire objects and specific faces.
Hubel and Wiesel discovered that neurons in the lower layers of the visual cortex are physically arranged so that they form a rough grid, with each neuron in the grid responding to a corresponding small area of the visual field. Each important visual feature has its own separate neural map. The combination of these maps is a key part of what gives rise to our perception of a scene.
Like neurons in the visual cortex, the units in a ConvNet act as detectors for important visual features, each unit looking for its designated feature in a specific part of the visual field. And (very roughly) like the visual cortex, each layer in a ConvNet consists of several grids of these units, with each grid forming an activation map for a specific visual feature.
A ConvNet (like the brain) represents the visual scene as a collection of maps, reflecting the specific “interests” of a set of detectors. In ConvNets the network itself learns what its interests should be; these depend on the specific task it is trained for.
A key to the ConvNet’s success is that, inspired by the brain, these maps are hierarchical: the inputs to the units at layer 2 are the activation maps of layer 1, the inputs to the units at layer 3 are the activation maps of layer 2, and so on up the layers.
In the real world, a ConvNet can have many layers, sometimes hundreds, each with a different number of activation maps. Determining these and many other aspects of a ConvNet’s structure is part of the art of getting these complex networks to work for a given task.
Our hypothetical ConvNet consists of edge detectors at its first layer, but in real-world ConvNets edge detectors aren’t built in. Instead, ConvNets learn from training examples what features should be detected at each layer, as well as how to set the weights in the classification module so as to produce a high confidence for the correct answer. And, just as in traditional neural networks, all the weights can be learned from data via the same back-propagation algorithm. What’s really interesting is that, even though ConvNets are not constrained by a programmer to learn to detect any particular feature, when trained on large sets of real-world photographs, they indeed seem to learn a hierarchy of detectors similar to what Hubel and Wiesel found in the brain’s visual system.
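To make this concrete, here is a minimal sketch in PyTorch, offered purely as an illustration rather than code from any particular system: two convolutional layers whose activation maps feed one into the next, topped by a small classification layer, followed by a single back-propagation step on a fake labeled batch. Nothing in the code designates any layer as an “edge detector”; every weight starts out random and is shaped entirely by the training data. The layer sizes, image size, and ten output categories are arbitrary choices for illustration.

```python
# A minimal, illustrative ConvNet: each layer produces a stack of activation
# maps that becomes the input to the next layer; a linear classifier sits on top.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_categories=10):
        super().__init__()
        self.layer1 = nn.Sequential(  # 16 activation maps of low-level features
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.layer2 = nn.Sequential(  # 32 maps built from layer 1's maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Linear(32 * 8 * 8, num_categories)

    def forward(self, x):                  # x: a batch of 32x32 RGB images
        maps1 = self.layer1(x)             # shape (batch, 16, 16, 16)
        maps2 = self.layer2(maps1)         # shape (batch, 32, 8, 8)
        return self.classifier(maps2.flatten(1))

# One back-propagation step on a fake labeled batch, just to show the mechanics:
# the loss compares the network's outputs with human-provided labels, and every
# weight in every layer is nudged to reduce that loss.
model = TinyConvNet()
images = torch.randn(8, 3, 32, 32)         # stand-in for training photos
labels = torch.randint(0, 10, (8,))        # stand-in for human-provided labels
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                            # gradients for every weight
torch.optim.SGD(model.parameters(), lr=0.01).step()
```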
ConvNets and ImageNet
LeCun, Hinton, and other neural network loyalists believed that improved, larger versions of ConvNets and other deep networks would conquer computer vision if only they could be trained with enough data. In 2012, the torch carried by ConvNet researchers suddenly lit the vision world afire by winning a computer-vision competition on an image data set called ImageNet.
WordNet is a project, led by the psychologist George Miller, to create a database of English words arranged in a hierarchy moving from most specific to most general, with groupings among synonyms. For example, WordNet contains the following information about the term “cappuccino”: cappuccino –> coffee –> beverage –> food –> substance –> physical entity –> entity, where “–>” means “is a kind of.” WordNet had been used extensively in research by psychologists and linguists as well as in AI natural-language processing systems.
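A quick way to see such a chain for yourself is through NLTK’s interface to WordNet. The sketch below is an illustration that assumes NLTK and its WordNet corpus are installed (via a one-time nltk.download('wordnet')); the exact chain printed may differ slightly depending on the WordNet version and which hypernym branch is followed at each step.

```python
# Walk the "is a kind of" (hypernym) chain for "cappuccino" in WordNet.
from nltk.corpus import wordnet as wn

synset = wn.synsets("cappuccino")[0]      # the WordNet entry for "cappuccino"
chain = [synset]
while chain[-1].hypernyms():              # follow "is a kind of" links upward
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.lemmas()[0].name() for s in chain))
```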
Fei-Fei Li had a new idea: create an image database that is structured according to the nouns in WordNet, where each noun is linked to a large number of images containing examples of that noun. Thus the idea for ImageNet was born.
According to Amazon, its Mechanical Turk service is “a marketplace for work that requires human intelligence.” The service connects requesters, people who need a task accomplished that is hard for computers, with workers, people who are willing to lend their human intelligence to a requester’s task for a small fee (for example, labeling the objects in a photo for ten cents per photo). The name Mechanical Turk comes from a famous eighteenth-century AI hoax: the original Mechanical Turk was a chess-playing “intelligent machine,” which secretly hid a human who controlled a puppet (the “Turk,” dressed like an Ottoman sultan) that made the moves. Evidently, it fooled many prominent people of the time, including Napoleon Bonaparte. Mechanical Turk is the embodiment of Marvin Minsky’s “Easy things are hard” dictum: the human workers are hired to perform the “easy” tasks that are currently too hard for computers. Amazon’s service, while not meant to fool anyone, is, like the original Mechanical Turk, “Artificial Artificial Intelligence.”
Fei-Fei Li realized that if her group paid tens of thousands of workers on Mechanical Turk to sort out irrelevant images for each of the WordNet terms, the whole data set could be completed within a few years at a relatively low cost. In a mere two years, more than three million images were labeled with corresponding WordNet nouns to form the ImageNet data set. For the ImageNet project, Mechanical Turk was “a godsend.” The service continues to be widely used by AI researchers for creating data sets; nowadays, academic grant proposals in AI commonly include a line item for “Mechanical Turk workers.”
In 2010, the ImageNet project launched the first ImageNet Large Scale Visual Recognition Challenge, in order to spur progress toward more general object-recognition algorithms. The highest-scoring program in 2010 used a so-called support vector machine, the predominant object-recognition algorithm of the day, which employed sophisticated mathematics to learn how to assign a category to each input image. Notably, there were no neural networks among the top-scoring programs.
In the 2012 ImageNet competition, the winning entry achieved an amazing 85 percent correct. Such a jump in accuracy was a shocking development. What’s more, the winning entry did not use support vector machines or any of the other dominant computer-vision methods of the day. Instead, it was a convolutional neural network. This particular ConvNet has come to be known as AlexNet, named after its main creator, Alex Krizhevsky, then a graduate student at the University of Toronto, supervised by the eminent neural network researcher Geoffrey Hinton. AlexNet had eight layers, with about sixty million weights whose values were learned via back-propagation from the million-plus training images. The Toronto group came up with some clever methods for making the network training work better, and it took a cluster of powerful computers about a week to train AlexNet.
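Readers who want to check the rough numbers can do so with torchvision’s AlexNet implementation, a close descendant of the original architecture rather than Krizhevsky’s exact code. The sketch below assumes a recent torchvision release, in which the constructor takes a weights argument.

```python
# Count the learned weights in torchvision's AlexNet (untrained, no download).
import torchvision.models as models

alexnet = models.alexnet(weights=None)
num_weights = sum(p.numel() for p in alexnet.parameters())
print(f"AlexNet parameters: {num_weights:,}")   # roughly sixty million
```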
At almost the same time, Geoffrey Hinton’s group was also demonstrating that deep neural networks, trained on huge amounts of labeled data, were significantly better than the then-current state of the art in speech recognition.
The annual ImageNet competition began to see wider coverage in the media, and it quickly morphed from a friendly academic contest into a high-profile sparring match for tech companies commercializing computer vision. Winning at ImageNet would guarantee coveted respect from the vision community, along with free publicity, which might translate into product sales and higher stock prices. The pressure to produce programs that outperformed competitors was most notably manifest in a 2015 cheating incident involving the giant Chinese internet company Baidu. The cheating involved a subtle example of what people in machine learning call data snooping. A cardinal rule in machine learning is “Don’t train on the test data.” It seems obvious: if you include test data in any part of training your program, you won’t get a good measure of the program’s generalization abilities.
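Here is a minimal sketch of that rule in code, using scikit-learn on synthetic data; it illustrates the general principle rather than reconstructing the Baidu incident. The held-out test set is split off first and is never allowed to influence training.

```python
# "Don't train on the test data": hold out a test set and never let it shape
# the model. The data and model here are synthetic, for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.random.randn(1000, 20)
y = (X[:, 0] + 0.5 * np.random.randn(1000) > 0).astype(int)

# Split FIRST; the held-out portion plays no role in training or tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("estimate of generalization:", model.score(X_test, y_test))
# Data snooping is any leak in the other direction: feature selection, tuning,
# or repeated test-set queries that quietly let test data shape the model.
```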
What was it that enabled ConvNets, which seemed to be at a dead end in the 1990s, to suddenly dominate the ImageNet competition, and subsequently most of computer vision, over the last half decade? It turns out that the recent success of deep learning is due less to new breakthroughs in AI than to the availability of huge amounts of data and very fast parallel computer hardware. These factors, along with improvements in training methods, allow hundred-plus-layer networks to be trained on millions of images in just a few days.
Have ConvNets surpassed humans at object recognition? How close are they to rivaling our own object-recognition abilities? The truth is that object recognition is not yet close to being solved by artificial intelligence.
Beyond Object Recognition
If the goal of computer vision is to “get a machine to describe what it sees,” then machines will need to recognize not only objects but also their relationships to one another and how they interact with the world. If the “objects” in question are living beings, the machines will need to know something about their actions, goals, emotions, likely next steps, and all the other aspects that figure into telling the story of a visual scene. Moreover, if we really want the machines to describe what they see, they will need to use language. AI researchers are actively working on getting machines to do these things, but as usual these “easy” things are very hard.
Why are we still so far from this goal? It seems that visual intelligence isn’t easily separable from the rest of intelligence, especially general knowledge, abstraction, and language, abilities that, interestingly, involve parts of the brain that have many feedback connections to the visual cortex. Additionally, it could be that the knowledge needed for humanlike visual intelligence can’t be learned from millions of pictures downloaded from the web, but has to be experienced in some way in the real world.
Learning on One’s Own
It is inaccurate to say that today’s successful ConvNets learn “on their own.” In order for a ConvNet to learn to perform a task, a huge amount of human effort is required to collect, curate, and label the data, as well as to design the many aspects of the ConvNet’s architecture. While ConvNets use back-propagation to learn their “parameters” (that is, weights) from training examples, this learning is enabled by a collection of what are called “hyperparameters,” an umbrella term that refers to all the aspects of the network that need to be set up by humans to allow learning to even begin. Examples of hyperparameters include the number of layers in the network, the size of the units’ receptive fields at each layer, how large a change in each weight should be during learning (called the learning rate), and many other technical details of the training process. This part of setting up a ConvNet is called tuning the hyperparameters. There are many values to set as well as complex design decisions to be made, and these settings and designs interact with one another in complex ways to affect the ultimate performance of the network.
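A minimal sketch of what such a hyperparameter specification might look like appears below; the particular names and values are hypothetical, chosen only for illustration. Every entry must be decided by the humans setting up the network before learning can begin; only the weights themselves are then learned by back-propagation.

```python
# Illustrative hyperparameters: human choices made before any learning starts.
# "Tuning the hyperparameters" means varying these, retraining, and comparing.
hyperparameters = {
    "num_layers": 4,                      # depth of the network
    "maps_per_layer": [16, 32, 64, 128],  # activation maps at each layer
    "receptive_field_size": 3,            # each unit looks at a 3x3 patch
    "learning_rate": 0.01,                # how large a change in each weight per update
    "batch_size": 64,                     # examples processed per weight update
    "num_epochs": 30,                     # passes through the full training set
}
```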
The most successful ConvNets learn via a supervised-learning procedure: they gradually change their weights as they process the examples in the training set again and again, over many epochs, learning to classify each input as one of a fixed set of possible output categories. In contrast, even the youngest children learn an open-ended set of categories and can recognize instances of most categories after seeing only a few examples. Moreover, children don’t learn passively: they ask questions, they demand information on the things they are curious about, they infer abstractions of and connections between concepts, and, above all, they actively explore the world.
Big Data and the Long Tail
When we use services provided by tech companies such as Google, Amazon, and Facebook, we are directly providing these companies with examples, in the form of our images, videos, text, or speech, that they can utilize to better train their AI programs. And these improved programs attract more users (and thus more data), helping advertisers to target their ads more effectively. Moreover, the training examples we provide them can be used to train and offer services such as computer vision and natural-language processing to businesses for a fee.
Deep learning requires a profusion of training examples. This reliance on extensive collections of labeled training data is one more way in which deep learning differs from human learning.
The supervised-learning approach, using large data sets and armies of human annotators, works well for at least some visual abilities. But what about the rest of life? Virtually everyone working in the AI field agrees that supervised learning is not a viable path to general-purpose AI.
The issue is compounded by the so-called long-tail problem: the vast range of possible unexpected situations an AI system could be faced with. The term long-tail comes from statistics. The long list of very unlikely (but possible) situations is called the “tail” of the distribution. The situations in the tail are sometimes called edge cases. Most real-world domains for AI exhibit this kind of long-tail phenomenon: events in the real world are usually predictable, but there remains a long tail of low-probability, unexpected occurrences. This is a problem if we rely solely on supervised learning to provide our AI system with its knowledge of the world; the situations in the tail don’t show up in the training data often enough, if at all, so the system is more likely to make errors when faced with such unexpected cases.
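A small sketch makes the shape of the problem concrete; the distribution and numbers below are illustrative assumptions, not drawn from any real data set. Category frequencies follow a Zipf-like 1/rank curve: a handful of categories dominate, while a long tail of categories shows up rarely, or not at all, in a finite training set.

```python
# A toy long-tail distribution over 10,000 hypothetical visual categories.
import numpy as np

rng = np.random.default_rng(0)
num_categories = 10_000
ranks = np.arange(1, num_categories + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()          # p(category) ~ 1/rank

training_set = rng.choice(num_categories, size=100_000, p=probs)
counts = np.bincount(training_set, minlength=num_categories)

print("share of examples in the 10 most common categories:",
      counts[:10].sum() / counts.sum())
print("categories seen fewer than 5 times:", int((counts < 5).sum()))
print("categories never seen at all:", int((counts == 0).sum()))
```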
A commonly proposed solution is for AI systems to use supervised learning on small amounts of labeled data and learn everything else via unsupervised learning. The term unsupervised learning refers to a broad group of methods for learning categories or actions without labeled data. Examples include methods for clustering examples based on their similarity or learning a new category via analogy to known categories. Perceiving abstract similarity and analogies is something at which humans excel, but to date there are no very successful AI methods for this kind of unsupervised learning. For general AI, almost all learning will have to be unsupervised, but no one has yet come up with the kinds of algorithms needed to perform successful unsupervised learning.
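As one concrete, if simple, illustration of the unsupervised flavor of learning, the sketch below uses scikit-learn’s k-means to group unlabeled points purely by similarity; the data are synthetic blobs chosen for illustration, and no labels are ever shown to the algorithm.

```python
# Clustering unlabeled points by similarity with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels discarded
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters[:20])   # each point assigned to one of three discovered groups
```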
How do humans handle long-tail cases? Humans have a fundamental competence lacking in all current AI systems: common sense. We humans use common sense, usually subconsciously, in every facet of life. Many people believe that until AI systems have common sense as humans do, we won’t be able to trust them to be fully autonomous in complex real-world situations.
Overfitting
The machine learns what it observes in the data rather than what you (the human) might observe. If there are statistical associations in the training data, even if irrelevant to the task at hand, the machine will happily learn those instead of what you wanted it to learn. If the machine is tested on new data with the same statistical associations, it will appear to have successfully learned to solve the task. In machine learning jargon, the network overfitted to its specific training set, and thus can’t do a good job of applying what it learned to new data that differ from those it was trained on.
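Here is a minimal sketch of that failure mode on synthetic data; the features and model are illustrative assumptions. One feature is weakly informative, while a second, spurious feature happens to track the label perfectly in the training set but not in new data. The classifier happily learns the shortcut, looks excellent on data that share the spurious association, and falls apart when the association disappears.

```python
# Overfitting to a spurious association present only in the training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
signal = y + 0.8 * rng.standard_normal(n)             # weakly informative feature
spurious_train = y.astype(float)                      # accidentally tracks the label
spurious_new = rng.integers(0, 2, n).astype(float)    # association gone in new data

X_train = np.column_stack([signal, spurious_train])
X_new = np.column_stack([signal, spurious_new])

model = LogisticRegression().fit(X_train, y)
print("accuracy on data with the spurious association:", model.score(X_train, y))
print("accuracy on new data without it:", model.score(X_new, y))
```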
Biased AI
The biases in AI training data reflect biases in our society, but the spread of real-world AI systems trained on biased data can magnify these biases and do real damage.
These biases can be mitigated in individual data sets by having humans make sure that the data are balanced in their representation of, say, racial or gender groups. But this requires awareness and effort on the part of the humans curating the data. Moreover, it is often hard to tease out subtle biases and their effects.
The problem of bias in applications of AI has been getting a lot of attention recently, with many articles, workshops, and even academic research institutes devoted to this topic. Should the data sets being used to train AI accurately mirror our own biased society, as they often do now, or should they be tinkered with specifically to achieve social reform aims? And who should be allowed to specify the aims or do the tinkering?
Show Your Work
“Showing their work” is something that deep neural networks cannot easily do. You can generally trust that people know what they are doing if they can explain to you how they arrived at an answer or a decision. The fear is that if we don’t understand how AI systems work, we can’t really trust them or predict the circumstances under which they will make errors.
Humans can’t always explain their thought processes either, and you generally can’t look “under the hood” into other people’s brains (or into their “gut feelings”) to figure out how they came to any particular decision. But humans tend to trust that other humans have correctly mastered basic cognitive tasks such as object recognition and language comprehension. In part, you trust other people when you believe that their thinking is like your own. You assume, most often, that other humans you encounter have had sufficiently similar life experiences to your own, and thus you assume they are using the same basic background knowledge, beliefs, and values that you do in perceiving, describing, and making decisions about the world. In short, where other people are concerned, you have what psychologists call a theory of mind, a model of the other person’s knowledge and goals in particular situations. None of us have a similar “theory of mind” for AI systems such as deep networks, which makes it harder to trust them.
It shouldn’t come as a surprise, then, that one of the hottest new areas of AI is variously called “explainable AI,” “transparent AI,” or “interpretable machine learning.” These terms refer to research on getting AI systems, particularly deep networks, to explain their decisions in a way that humans can understand. Explainable AI is a field that is progressing quickly, but a deep-learning system that can successfully explain itself in human terms remains elusive.
Fooling Deep Neural Networks
There is yet another dimension to the question of AI trustworthiness: researchers have discovered that it is surprisingly easy for humans to trick deep neural networks into making errors. That is, if you want to deliberately fool such a system, there turn out to be an alarming number of ways to do so.
You could take an ImageNet photo that AlexNet classified correctly with high confidence (for example, “School Bus”) and distort it by making very small, specific changes to its pixels so that the distorted image looked completely unchanged to humans but was now classified with high confidence by AlexNet as something completely different (for example, “Ostrich”). The distorted image is called an adversarial example.
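One widely studied recipe for constructing adversarial examples, not necessarily the exact method behind the school-bus experiment, is the fast gradient sign method: nudge every pixel a tiny amount in the direction that most increases the network’s loss. The sketch below shows the core computation; it uses an untrained torchvision AlexNet and a random stand-in image purely for illustration (with a trained network and a real photo, the same few lines can flip the predicted label while leaving the image visually unchanged to humans).

```python
# Fast gradient sign method (FGSM): a tiny, targeted perturbation of the input.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.alexnet(weights=None).eval()               # untrained, for illustration
image = torch.rand(1, 3, 224, 224, requires_grad=True)    # stand-in for a photo
original_label = model(image).argmax(dim=1)               # network's current answer

loss = F.cross_entropy(model(image), original_label)
loss.backward()                                            # gradient of loss w.r.t. pixels

epsilon = 0.007                                            # imperceptibly small change
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1)
print(model(adversarial).argmax(dim=1), "vs original", original_label)
```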
If deep-learning systems, so successful at computer vision and other tasks, can easily be fooled by manipulations to which humans are not susceptible, how can we say that these networks “learn like humans” or “equal or surpass humans” in their abilities? It’s clear that something very different from human perception is going on here. And if these networks are going to be used for computer vision in the real world, we’d better be sure that they are safeguarded from hackers using these kinds of manipulations to fool them.
Computer vision isn’t the only domain in which networks can be fooled; researchers have also designed attacks that fool deep neural networks that deal with language, including speech recognition and text analysis.
Understanding and defending against such attacks is a major area of research right now, but while researchers have found solutions for specific kinds of attacks, there is still no general defense method. As in any domain of computer security, progress so far has a “whack-a-mole” quality: one security hole is detected and defended, but others are discovered that require new defenses.
What, precisely, are these networks learning? In particular, what are they learning that allows them to be so easily fooled? Or perhaps more important, are we fooling ourselves when we think these networks have actually learned the concepts we are trying to teach them?
Clever Hans was a horse in early-twentieth-century Germany who appeared to answer arithmetic questions by tapping his hoof but was in fact reacting to involuntary cues from the people questioning him. He has since become a metaphor for any individual (or program) that gives the appearance of understanding but is actually responding to unintentional cues given by a trainer. Does deep learning exhibit “true understanding,” or is it instead a computational Clever Hans responding to superficial cues in the data?
On the one hand, deep neural networks, trained via supervised learning, perform remarkably well (though still far from perfectly) on many problems in computer vision, as well as in other domains such as speech recognition and language translation. Because of their impressive abilities, these networks are rapidly being taken from research settings and employed in real-world applications such as web search, self-driving cars, face recognition, virtual assistants, and recommendation systems, and it’s getting hard to imagine life without these AI tools. On the other hand, it’s misleading to say that deep networks “learn on their own” or that their training is “similar to human learning.” Recognition of the success of these networks must be tempered with a realization that they can fail in unexpected ways because of overfitting to their training data, long-tail effects, and vulnerability to hacking. Moreover, the reasons for decisions made by deep neural networks are often hard to understand, which makes their failures hard to predict or fix. Researchers are actively working on making deep neural networks more reliable and transparent, but the question remains: Will the fact that these systems lack humanlike understanding inevitably render them fragile, unreliable, and vulnerable to attacks? And how should this factor into our decisions about applying AI systems in the real world?
Trustworthy and Ethical AI
The Great AI Trade-Off: should we embrace the abilities of AI systems, which can improve our lives and even help save lives, and allow these systems to be employed ever more extensively? Or should we be more cautious, given current AI’s unpredictable errors, susceptibility to bias, vulnerability to hacking, and lack of transparency in decision-making?
To what extent should AI research and development be regulated, and who should do the regulating?
The problems surrounding AI (trustworthiness, explainability, bias, vulnerability to attack, and the morality of its use) are social and political issues as much as they are technical ones. Thus, it is essential that the discussion around these issues include people with different perspectives and backgrounds.
The regulation of AI should be modeled on the regulation of other technologies, particularly those in biological and medical sciences, such as genetic engineering. In those fields, regulation, such as quality assurance and the analysis of risks and benefits of technologies, occurs via cooperation among government agencies, companies, nonprofit organizations, and universities. Moreover, there are now established fields of bioethics and medical ethics, which have considerable influence on decisions about the development and application of technologies. AI research and its applications very much need a well-thought-out regulatory and ethics infrastructure.
One stumbling block is that there is no general agreement in the field on what the priorities for developing regulation and ethics should be. Should the immediate focus be on algorithms that can explain their reasoning? On data privacy? On robustness of AI systems to malicious attacks? On bias in AI systems? On the potential “existential risk” from superintelligent AI?
Could machines themselves have their own sense of morality, complete enough for us to allow them to make ethical decisions on their own, without humans having to oversee them?
People have been thinking about “machine morality” for as long as they’ve been thinking about AI. Probably the best-known discussion of machine morality comes from Isaac Asimov’s science fiction stories, in which he proposed the three “fundamental Rules of Robotics”:
1. A robot may not injure a human being, or, through inaction, allow a human being to come to harm.
2. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law.
3. A robot must protect its own existence, as long as such protection does not conflict with the First or Second Law.
These laws have become famous, but in truth, Asimov’s purpose was to show how such a set of rules would inevitably fail.
The value alignment problem in AI: the challenge for AI programmers to ensure that their systems’ values align with those of humans. But what are the values of humans? Does it even make sense to assume that there are universal values that society shares? Before we can put our values into machines, we have to figure out how to make our values clear and consistent. This seems to be harder than we might have thought.
Progress on giving computers moral intelligence cannot be separated from progress on other kinds of intelligence: the true challenge is to create machines that can actually understand the situations that they confront. Reasoning about morality requires one to recognize cause-and-effect relationships, to imagine different possible futures, to have a sense of the beliefs and goals of others, and to predict the likely outcomes of one’s actions in whatever situation one finds oneself. In other words, a prerequisite to trustworthy moral reasoning is general common sense, which, as we’ve seen, is missing in even the best of today’s AI systems.