

Next challenge: teaching machines to understand images and actions


File photo: On December 4, 2017, visitors experienced artificial intelligence robots at the "Light of the Internet" expo of the fourth World Internet Conference in Wuzhen, Zhejiang. Photo by Du Yang, China News Service.

 

 

At present, artificial intelligence (AI) performs well at image and speech recognition, but scientists believe this is far from enough. According to a recent report in MIT Technology Review, understanding dynamic behavior in video is a key direction for AI's development: it is crucial both for AI to understand how the world works and for its wide application in fields such as medicine, entertainment, and education.

Understanding images vs. understanding actions

AI systems that interpret video, including those in self-driving cars, often rely on identifying objects in static frames rather than interpreting actions. Google recently released a tool that recognizes objects in video; it is part of the company's cloud platform, which bundles AI tools for processing images, audio, and text.

But for AI, the ability to understand why a cat might ride a Roomba robot vacuum around the kitchen while playing with a duck would be a real highlight of its abilities.

The next challenge for scientists, therefore, may be to teach machines not only to recognize what a video contains, but also to understand what is happening on screen. This could bring practical benefits, such as powerful new ways to search, annotate, and mine video clips, and could also allow robots or self-driving cars to better understand how the world around them works.

Training computers with video

Scientists are already using video datasets to train machines to better understand actions in the real world, and the Massachusetts Institute of Technology (MIT) and IBM are now collaborating on the effort.

Last September, IBM and MIT announced the establishment of the IBM-MIT Laboratory for Brain-inspired Multimedia Machine Comprehension, in which the two sides will jointly develop AI with advanced audio-visual capabilities.

Not long ago, MIT and IBM released a huge collection of video clips: a dataset called "Moments in Time," made up of three-second clips of actions ranging from fishing to break dancing. Aude Oliva, a principal research scientist at MIT, said that many things in the world change quickly; if you want to understand why something happens, motion gives you a great deal of information.

The clips are three seconds long because that is roughly how long people usually need to observe and understand an action, such as wind blowing through a tree or an object falling off a table.
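To make this concrete, the following is a minimal sketch, written in PyTorch, of what training an action classifier on short, fixed-length clips of this kind might look like. It is not the MIT/IBM code: the clip shape, the class count, and the synthetic random data are all illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_CLASSES = 4   # e.g. fishing, break dancing, ... (illustrative labels)
FRAMES = 16       # frames sampled from a 3-second clip (assumption)
H = W = 64        # spatial resolution (assumption)

# Synthetic stand-in for a labeled clip dataset: (clip, channels, time, H, W).
clips = torch.randn(32, 3, FRAMES, H, W)
labels = torch.randint(0, NUM_CLASSES, (32,))
loader = DataLoader(TensorDataset(clips, labels), batch_size=8, shuffle=True)

# A tiny 3D CNN: 3D convolutions see motion across frames, not just the
# appearance of a single frame, which is the point of training on video.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")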

Google, for its part, last year released YouTube-8M, a video dataset made up of 8 million tagged YouTube videos, and Facebook is developing an annotated video dataset of scenes, actions, and objects.

Olga Russakovsky, an assistant professor at Princeton University, specializes in computer vision. She noted that useful video datasets had previously been considered hard to build because they require far more storage and computing power than still images. "I'm glad to be able to use these new datasets," she said. "Three-second clips are great: they provide temporal context while keeping storage and computing requirements low."

Other institutions are exploring more creative approaches. Twenty Billion Neurons, a start-up based in Toronto and Berlin, has created a custom dataset, and according to co-founder Roland Memisevic, the company also uses a neural network that specializes in processing temporal visual information. "With other datasets, AI can tell you whether a video shows a football match or a party," he said. "A neural network trained on our custom dataset can tell you whether someone has just entered the room."

Transfer learning and the future of AI

According to IBM, a human can watch a short video, easily describe its content, and even predict what will happen next, which machines still cannot do. What IBM and MIT aim to do, therefore, is to solve the technical problems of machine recognition and prediction and to develop a cognitive system on that basis.

IBM's Danny Gutfreund said that effective recognition requires a machine to learn an action and then apply that knowledge in a new situation where the same action is being carried out. Progress in this area, known as transfer learning, is very important for the future of AI. The technology is also useful in practice: "You can use it to help improve care for the elderly and the disabled, for example by telling caregivers whether an elderly person has fallen, or whether they have taken their medicine."
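As an illustration of the transfer-learning idea Gutfreund describes, here is a hedged sketch that reuses a video network pretrained on one action-recognition corpus and retrains only its final layer for a new task. The two-class fall-detection head is a hypothetical example, and the code assumes torchvision's r3d_18 model (pretrained on Kinetics-400); none of this is IBM's actual system.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Load a backbone pretrained on the Kinetics-400 action dataset
# (downloads weights on first use).
model = r3d_18(weights="DEFAULT")

# Freeze the pretrained backbone so only the new head learns.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head: 2 classes, e.g. "fell" vs. "did not fall"
# (a hypothetical task for illustration, not something the article specifies).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on a synthetic clip batch: (batch, channels, frames, H, W).
x = torch.randn(2, 3, 16, 112, 112)
y = torch.tensor([0, 1])
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

Because the backbone's motion features were learned on a large, general action corpus, only the small new head needs training on the new task, which is the practical payoff of transfer learning.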

MIT and IBM also say that once machines can understand video, cognitive computing systems with such high-level visual abilities could be applied across many industries: not only medicine, but also education, entertainment, and other fields, including the maintenance and repair of complex machines. (Reporter: Liu Xia)
 

 
