Has deep learning in computer vision hit a bottleneck?

Image credit: Visual China.

Titanium Media note: this article is from the public WeChat account QbitAI (authors: Li Zi and Qian Ming); Titanium Media is authorized to reprint it.

Behind the boom, deep learning in computer vision has run into a bottleneck.

That is the view not of an outsider, but of Alan Yuille, one of the founders of computer vision, a professor at Johns Hopkins University and a former doctoral student of Stephen Hawking.

"Right now it is hard to publish AI work that does not mention neural networks," he said. "That is not a healthy trend."

If researchers only follow the neural-network trend and abandon all the older methods, and if they only chase leaderboard rankings without thinking about how to deal with the limitations of deep networks, the field may struggle to make real progress.

Facing the three major bottlenecks of deep learning, Professor Yuille offers two remedies: develop generalization ability by building compositional models, and probe potential failures by testing on compositional data.

The article struck a chord after publication: the Reddit thread quickly passed 200 upvotes, and AI researchers in academia shared it on Twitter.

Reddit users commented that, given his background, Professor Yuille understands better than most how thoroughly deep learning dominates computer vision today, and why the bottlenecks have appeared.

Three bottlenecks of deep learning

Yuille points out that while deep learning outperforms other techniques on many tasks, it is not universally applicable. Over the years, three main bottlenecks have emerged:

Needs large amounts of annotated data

Deep learning presupposes large amounts of labeled data, which pushes computer vision researchers toward problems where data is plentiful rather than toward problems that matter.

There are methods that reduce the dependence on data, such as transfer learning, few-shot learning, unsupervised learning, and weakly supervised learning. But so far their performance does not match that of supervised learning.

Overfits benchmark datasets

Deep neural networks perform well on benchmark datasets, but on real-world images outside those datasets the results are often unsatisfactory. The figure below shows a failure case.

A deep neural network trained on ImageNet to recognize sofas fails when a sofa is photographed from an unusual angle, because such angles are rare in the ImageNet dataset.

In practical applications, such deviations in a deep network can have very serious consequences.

After all, the datasets used to train autonomous-driving systems almost never contain a person sitting in the middle of the road.

Oversensitive to image changes

Deep neural networks are sensitive to standard adversarial attacks: perturbations imperceptible to humans can change how the network perceives an object.
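To make this concrete, here is a minimal sketch of the fast gradient sign method (FGSM), a standard adversarial attack, applied to a toy linear classifier. The weights, the "image", and the perturbation budget are all invented for illustration; real attacks target deep networks and real images.

```python
import math

# Toy linear "image classifier": score = w . x + b; sigmoid > 0.5 means "sofa".
# Weights and inputs are made up for this sketch.
w = [2.0, -1.5, 3.0, 1.0]
b = 0.1

def predict(x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))

x = [0.3, 0.1, 0.2, 0.1]   # a correctly classified "image"
eps = 0.2                   # perturbation budget per pixel

# FGSM: for a linear model the gradient of the score w.r.t. x is just w,
# so the attack nudges each pixel by eps in the direction that lowers the score.
sign = lambda v: (v > 0) - (v < 0)
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(predict(x))      # above 0.5: confidently classified as "sofa"
print(predict(x_adv))  # below 0.5: the small perturbation flips the prediction
```

The same idea scales to deep networks by backpropagating to the input to obtain the gradient sign.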

Moreover, neural networks are also oversensitive to changes of context. In the picture below, when a guitar and other objects are pasted onto a photo of a monkey in the jungle, the network identifies the monkey as a human and the guitar as a bird.

The reason is that, in the training data, humans are more likely than monkeys to carry guitars, and birds are more likely than guitars to appear in a jungle.

This oversensitivity to context stems from the limitations of the dataset.

For any object, the dataset contains only a limited number of contexts, and in practice the network ends up biased toward those contexts.

For data-driven methods such as deep neural networks, it is difficult to capture the full variety of scenes and nuisance factors.

For a deep neural network to handle every case, it would seemingly need an infinitely large dataset, which poses a huge challenge for both training and test sets.

Why can datasets never be big enough?

These three problems have not yet killed deep learning, but they are warning signs.

Behind the bottlenecks, says Yuille, lies a concept called "combinatorial explosion":

In vision, the space of real-world images is combinatorially large. No dataset, however big, can capture the complexity of reality.

So what does "combinatorial" mean here?

Imagine building a visual scene: you have a dictionary of objects, and you choose various objects from it and place them in different positions.

That sounds simple, but everyone chooses and places objects differently, so the number of possible scenes grows exponentially.

Even with a single object the scene space explodes, because the object can be occluded in odd ways and its possible backgrounds are endless.
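A quick back-of-the-envelope calculation shows how fast this blows up. The numbers below are hypothetical (a dictionary of 100 object types, a 20-cell grid of positions), but the growth pattern is the point:

```python
from math import comb

# Hypothetical toy numbers: a "dictionary" of 100 object types and a
# 20-cell grid of positions; a scene places k distinct cells, one object each.
objects, cells = 100, 20

def scenes_with_k_objects(k):
    # choose which k cells are occupied, then pick an object type per cell
    return comb(cells, k) * objects ** k

for k in range(1, 6):
    print(k, scenes_with_k_objects(k))
# Even k = 5 already yields about 1.6e14 distinct scenes, before counting
# pose, lighting, occlusion, or background variation.
```

No feasible dataset can sample such a space densely, which is the heart of Yuille's argument.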

Humans adapt naturally to changes of background, but deep neural networks are far more sensitive to such changes and more error-prone.

Not all visual tasks suffer from combinatorial explosion.

Medical imaging, for instance, suits deep networks well because the background varies little: the pancreas, for example, is always close to the duodenum.

But such applications are the exception; complex, variable situations are the norm in reality. Without a dataset that is exponentially large, it is hard to approximate the real situation.

Models trained and tested only on a limited dataset may have little practical value, because the dataset is not large enough to represent the true data distribution.

This raises two new questions:

1. How can we train on a limited dataset so that AI performs well in the complex real world?

2. How can we test algorithms efficiently on a limited dataset to ensure they can withstand the huge variety of real data?

How to deal with combinatorial explosion?

Datasets cannot grow exponentially, so we must look for a way out elsewhere.

You can train compositional models to develop generalization ability, and you can test models on compositional data to uncover likely failure modes.

In short, compositionality is the key.

Training compositional models

Compositionality means that the meaning of a complex expression is determined by the meanings of its components.

The key assumption here is that a structure is composed hierarchically of more basic substructures, governed by a set of grammar rules.

This means AI can learn the substructures and the grammar from limited data, and generalize them to a wide variety of scenes.
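The idea of substructures plus grammar rules can be sketched as a tiny toy scene grammar. The grammar below is entirely hypothetical (the symbols and rules are invented), but it shows how a handful of reusable parts and rules generate many distinct structures:

```python
import random

# A hypothetical toy "scene grammar": each rule rewrites a symbol into
# sub-parts. Symbols absent from the grammar are primitive parts; the same
# substructures recombine to describe scenes never seen during training.
GRAMMAR = {
    "scene":      [["object", "background"]],
    "object":     [["face"], ["chair"]],
    "face":       [["eyes", "nose", "mouth"]],
    "chair":      [["seat", "legs", "back"]],
    "background": [["indoor"], ["outdoor"]],
}

def expand(symbol, rng):
    rules = GRAMMAR.get(symbol)
    if rules is None:                      # primitive part: stop expanding
        return symbol
    parts = rng.choice(rules)              # pick one rewrite rule
    return {symbol: [expand(p, rng) for p in parts]}

rng = random.Random(0)
print(expand("scene", rng))                # one sampled hierarchical structure
```

A compositional model learns such parts and rules from data, whereas a plain deep network has no explicit handle on them.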

Unlike deep networks, compositional models require structured representations that make the structures and substructures explicit.

The inference ability of compositional models extends beyond the data AI has seen: reasoning, intervention, diagnosis, and answering new questions based on the existing knowledge structure.

In Stuart Geman's words:

The world is compositional, or God exists.

Deep neural networks are compositional in a weak sense, with high-level features built from low-level ones; but in the sense used here, they are not compositional.

The advantages of compositional models have been demonstrated on many visual tasks, such as a 2017 model, published in Science, that recognizes CAPTCHA codes.

They also have theoretical advantages: they are interpretable and can generate samples, so researchers can more easily locate errors, unlike a deep neural network, a black box whose inner workings no one fully understands.

But learning compositional models is not easy, because all the components and the grammar must be learned.

Moreover, analysis by synthesis requires generative models that can generate objects and scene structures.
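A minimal illustration of analysis by synthesis, with an invented one-dimensional "world": the generative model renders an object at a candidate pose, and inference searches for the pose whose rendering best matches the observed image.

```python
# Hypothetical toy setup: a 12-pixel 1-D "image" and an "object" that is
# a bright bar of width 3. render() is the generative model; infer()
# analyzes an image by synthesizing candidates and comparing.
WIDTH, BAR = 12, 3

def render(pos):
    # generative model: a bar of 1s starting at `pos` on a background of 0s
    return [1 if pos <= i < pos + BAR else 0 for i in range(WIDTH)]

def infer(image):
    # analysis by synthesis: pick the pose whose rendering fits best
    def mismatch(pos):
        return sum((a - b) ** 2 for a, b in zip(render(pos), image))
    return min(range(WIDTH - BAR + 1), key=mismatch)

observed = render(5)
print(infer(observed))  # recovers position 5
```

Real systems replace the one-parameter search with inference over 3D shape, pose, lighting, and occlusion, which is what makes the approach hard.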

For image recognition, apart from a few regular patterns such as faces and letters, generative models still struggle with most other objects.

Fundamentally, solving the combinatorial explosion requires learning causal models (Causal Models) of the 3D world and of how those models generate images.

Studies of human infants suggest that they learn by building causal models that predict the structure of their environment.

An understanding of causality lets knowledge learned from limited data extend effectively to new scenarios.

Testing models on compositional data

After training comes testing.

As noted above, the world is so complex that we can only ever test algorithms on limited data.

For handling compositional data, game theory offers an important approach: focus on the worst case (Worst Case) rather than the average case (Average Case).

As discussed earlier, if a dataset does not cover the combinatorial complexity of the problem, average-case results may be meaningless.

Focusing on the worst case makes sense in many settings, such as algorithms for self-driving cars or for cancer diagnosis, where a failure can have serious consequences.

If the failure modes (Failure Modes) can be captured in a low-dimensional space of hazard factors (Hazard Factors), as with stereo vision, these failures can be studied using computer graphics and grid search.
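A sketch of what such a study might look like, with an invented two-factor hazard space and a stand-in accuracy function (a real study would render images under each factor setting and evaluate an actual model):

```python
# Hypothetical sketch: if failure modes depend on a few low-dimensional
# "hazard factors" (here: occlusion fraction and lighting level), a grid
# search can map where a model's accuracy collapses, and the worst case
# can be reported instead of the average.

def toy_accuracy(occlusion, lighting):
    # stand-in for evaluating a real model on images rendered with these
    # hazard factors; accuracy drops as occlusion grows and as lighting
    # departs from a nominal level of 0.5
    return max(0.0, 1.0 - 1.5 * occlusion - abs(lighting - 0.5))

grid = [(o / 10, l / 10) for o in range(0, 6) for l in range(0, 11)]
scores = {hf: toy_accuracy(*hf) for hf in grid}

average = sum(scores.values()) / len(scores)
worst_hf = min(scores, key=scores.get)

print(round(average, 3))                     # average-case looks reassuring
print(worst_hf, scores[worst_hf])            # the hazard setting to study
```

The gap between the two numbers is exactly why worst-case reporting matters for safety-critical systems.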

But for most visual tasks, especially those involving compositional data, there is usually no simple way to identify and isolate a few risk factors for separate study.

Adversarial attack: a slight change in texture that affects only the AI's recognition, not a human's.

One strategy is to extend the notion of adversarial attacks (Adversarial Attacks) to include non-local structure, allowing complex operations that change the image or scene, such as occlusion or altering the physical properties of an object's surface, without significantly changing human perception.

Applying such methods to visual algorithms remains very challenging.

However, if an algorithm is written in a compositional way, its explicit structure may greatly help in detecting its faults.

About Alan Yuille

Alan Yuille is currently a Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University.

He received a bachelor's degree in mathematics from Cambridge University in 1976, then studied under Stephen Hawking and earned a doctorate in theoretical physics in 1981.

After graduating he turned to computer vision, working at MIT's Artificial Intelligence Laboratory, Harvard's computer science department, and other academic institutions.

He joined UCLA in 2002 and later served as director of the Center for Visual Recognition and Machine Learning, while also holding visiting professorships in the departments of Psychology, Computer Science, and Psychiatry and Biobehavioral Sciences.

In 2016, he joined Johns Hopkins University.

He has received a Best Paper Award at ICCV, and in 2012 he chaired CVPR, the top computer vision conference. He is regarded as one of the founders of the field.

Alan Yuille has also directly influenced the development of AI in China: his doctoral student Zhu Long returned to China after graduating to found the AI company Yitu Technology, now one of the best-known startups in China's computer vision field.

This article is based on a paper Yuille published in May 2018, updated in January by Chenxi Liu, his PhD student.

Paper portal:

Deep Nets: What have they ever done for Vision?
