Computer Vision System Combines Image Recognition and Generation | Massachusetts Institute of Technology News

Machine Learning


Computers possess two remarkable capabilities with respect to images: they can both identify images and generate new ones. Historically, these functions have been treated as separate skills, like a chef who excels at cooking (generation) and a connoisseur who excels at tasting food (recognition).

But what would it take to unite these two signature abilities harmoniously? Both the chef and the connoisseur share a common understanding of how food tastes. Similarly, a unified vision system requires a deep understanding of the visual world.

Now, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have trained a system to infer the missing parts of an image, a task that requires a deep understanding of the image’s content. Known as the Masked Generative Encoder (MAGE), the system accomplishes two goals at once: accurately identifying images, and generating new images that closely resemble reality by filling in the blanks.

This dual-purpose system enables myriad potential applications, including identifying and classifying objects within images, learning quickly from minimal examples, generating images under specific conditions such as text prompts or class labels, and enhancing existing images.

Unlike other techniques, MAGE does not work with raw pixels. Instead, it converts images into what are called “semantic tokens”: compact yet abstracted versions of image sections. Think of these tokens as mini jigsaw-puzzle pieces, each representing a 16×16 patch of the original image. Just as words form sentences, these tokens preserve the information of the original image in an abstracted form that can be used for complex processing tasks. Because this tokenization step can be trained within a self-supervised framework, it can be pre-trained on large unlabeled image datasets.
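To make the patch-to-token idea concrete, here is a toy sketch, not MAGE’s actual tokenizer (which is a learned neural network with a codebook of embeddings): it splits an image into non-overlapping 16×16 patches and hashes each patch into a discrete token id. All function names and the hash-based “codebook” are hypothetical illustrations.

```python
PATCH = 16  # each token stands for a 16x16 patch of the image

def image_to_patches(image, patch=PATCH):
    """Split an H x W image (list of lists of pixel values) into
    non-overlapping patch x patch blocks, in row-major order."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(block)
    return patches

def patch_to_token(block, codebook_size=1024):
    """Toy quantizer: hash the patch contents into one of codebook_size ids.
    A learned tokenizer would instead pick the nearest codebook embedding."""
    flat = tuple(value for row in block for value in row)
    return hash(flat) % codebook_size

# A 64x64 toy image yields (64/16) * (64/16) = 16 semantic tokens.
image = [[(x + y) % 256 for x in range(64)] for y in range(64)]
tokens = [patch_to_token(p) for p in image_to_patches(image)]
```

The resulting token sequence is far shorter than the raw pixel grid, which is what makes the downstream modeling tractable.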

The magic begins when MAGE applies “masked token modeling.” It randomly hides some of these tokens, creating an incomplete puzzle, and then trains a neural network to fill in the gaps. In this way, the system learns both to understand the patterns in an image (image recognition) and to generate new ones (image generation).
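The masking step itself can be sketched in a few lines. This is a minimal illustration of the idea, assuming a flat list of token ids; the neural network that predicts the hidden tokens is omitted, and the function name and sentinel value are hypothetical.

```python
import random

MASK = -1  # sentinel id standing in for a hidden token

def mask_tokens(tokens, ratio, seed=0):
    """Hide a random fraction of tokens; the model is trained to predict them.
    MAGE varies the ratio widely during pre-training: high ratios resemble
    generation (most of the image is missing), low ratios resemble
    recognition-style representation learning."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in hidden}  # what the network must recover
    return corrupted, targets

tokens = list(range(100, 116))  # 16 token ids, e.g. from the tokenizer
corrupted, targets = mask_tokens(tokens, ratio=0.75)
```

Training then amounts to minimizing the error between the network’s predictions at the masked positions and the `targets` dictionary.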

“One of the highlights of MAGE is its variable masking strategy during pre-training, which allows it to be trained for either image generation or recognition tasks within the same system,” says Tianhong Li, a PhD student in electrical engineering and computer science at MIT, a CSAIL affiliate, and lead author of a paper on the research. “MAGE’s ability to operate in ‘token space’ rather than ‘pixel space’ enables sharp, detailed, high-quality image generation as well as semantically rich image representations. This could pave the way for highly integrated and improved computer vision models.”

Apart from its ability to generate realistic images from scratch, MAGE also allows conditional image generation: users specify certain criteria for the image they want, and the tool generates an appropriate image. It can also perform image-editing tasks, such as removing elements from an image while maintaining a realistic appearance.

MAGE is well suited to recognition tasks, too. Because it can be pre-trained on large unlabeled datasets, it can classify images using only its learned representations. Moreover, it excels at few-shot learning, achieving impressive results on large image datasets like ImageNet with only a handful of labeled examples.
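To illustrate what classifying “using only learned representations” means, here is a deliberately simplified sketch: a nearest-centroid classifier over frozen feature vectors. The paper evaluates linear probing, which fits a linear classifier instead; this stand-in only conveys the idea that a few labeled examples per class suffice once the representations are good. All names and the toy 2-D features are hypothetical.

```python
def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def few_shot_classify(query, support):
    """support maps class label -> a few frozen feature vectors (the 'shots').
    Assign the query to the class whose centroid is nearest."""
    cents = {label: centroid(vecs) for label, vecs in support.items()}
    return min(cents, key=lambda label: sq_dist(query, cents[label]))

# Toy 2-D "features"; real ones would come from the pre-trained encoder.
support = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.9]],
}
print(few_shot_classify([0.95, 0.15], support))  # prints: cat
```

The key point is that the encoder is never updated here; only the tiny classifier on top sees the labels.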

MAGE’s performance in validation was impressive. On one hand, it set new records in generating new images, significantly outperforming previous models. On the other hand, MAGE also came out on top in recognition tasks, achieving 80.9 percent accuracy in linear probing and 71.9 percent 10-shot accuracy on ImageNet (meaning it correctly identified images in 71.9 percent of cases where each class had only 10 labeled examples).

Despite its strengths, the research team acknowledges that MAGE is a work in progress. Some information is inevitably lost in the process of converting an image into tokens, and in future work they are keen to explore ways of compressing images without losing important details. The team also plans to train and test MAGE on even larger unlabeled datasets, which could further improve its performance.

“It has been a long-standing dream to achieve image generation and image recognition in a single system. It’s a great study,” says Huisheng Wang, a senior staff software engineer in the Research and Machine Intelligence division at Google, who was not involved in this work. “This innovative system has a wide range of applications and may inspire many future studies in the field of computer vision.”

Li co-authored the paper with Dina Katabi, the Thuan (1990) and Nicole Pham Professor of Electrical Engineering and Computer Science at MIT and a CSAIL principal investigator; Huiwen Chang, a senior research scientist at Google; Shlok Kumar Mishra, a University of Maryland PhD student and Google Research intern; Han Zhang, a senior research scientist at Google; and Dilip Krishnan, a staff research scientist at Google. Computational resources were provided by Google Cloud Platform and the MIT-IBM Watson Research Collaboration. The team’s research was presented at the 2023 Conference on Computer Vision and Pattern Recognition.
