
Humans interact with the environment through many channels, including sight and language. Each has distinct advantages for expressing and communicating particular ideas about the world and for building a deeper understanding of it. A major goal of artificial intelligence research is to develop flexible assistants that can successfully execute multimodal visual and verbal instructions reflecting human intentions; such an assistant could perform a wide range of activities in the real world. GPT-4 has proven to be very good at multimodal conversation with humans.
Although GPT-4's remarkable potential has been demonstrated, its underlying mechanisms remain a mystery. Studies such as MiniGPT-4 and LLaVA have attempted to reproduce this performance by aligning visual representations with the LLM's input space and exploiting the LLM's original self-attention to process visual information. However, because of the large number of image tokens, feeding such models comprehensive or spatio-temporal visual information can be computationally expensive. In addition, both models use Vicuna, an open-source chatbot built by fine-tuning LLaMA on user-shared ChatGPT conversations, so those studies omitted the language-instruction tuning step.
The researchers want to improve OpenFlamingo for more human-like conversation using a large database of image and text instructions. Researchers from the Shanghai AI Laboratory, the University of Hong Kong, and Tianjin University address these issues with the open-source OpenFlamingo framework, a multimodal pre-trained model that injects visual information through cross-attention layers gated on image-text interactions and uses a perceiver resampler to extract visual features from the vision encoder. Because the model is pre-trained on a large dataset of image-text pairs, it has strong few-shot visual comprehension capabilities. However, it cannot engage in zero-shot, multi-turn image-and-text conversations.
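The gated cross-attention idea can be illustrated with a minimal single-head NumPy sketch (a hypothetical simplification, not the actual OpenFlamingo implementation, which uses learned multi-head layers interleaved with the frozen LLM). The key property is the tanh gate, initialized at zero so that the pretrained language model's behavior is unchanged at the start of training:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_tokens, visual_tokens, alpha=0.0):
    """Single-head cross-attention from text queries to visual keys/values,
    scaled by a tanh gate. With alpha=0 the gate is 0, so the layer passes
    the text hidden states through unchanged (illustrative sketch only)."""
    d = text_tokens.shape[-1]
    scores = text_tokens @ visual_tokens.T / np.sqrt(d)   # (n_text, n_vis)
    attended = softmax(scores) @ visual_tokens            # (n_text, d)
    return text_tokens + np.tanh(alpha) * attended
```

As `alpha` is learned during training, the model gradually mixes visual context into the text stream instead of disrupting the pretrained LLM from step one.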
They hope to leverage OpenFlamingo's fundamental strengths to bridge the gap between the model's current capabilities and the more accurate, human-like interaction expected in multimodal conversation. The resulting multimodal chatbot is called MultiModal-GPT. They employ a common template for verbal and visual instructions during model training: to train MultiModal-GPT, they first create an instruction template covering both language and image data. They found that the training data is critical to MultiModal-GPT's effectiveness.
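A unified template of this kind can be sketched as a small formatting function. The exact prompt wording below is a hypothetical reconstruction in the Alpaca-style format such instruction-tuning work commonly builds on, not the authors' verbatim template:

```python
def build_prompt(instruction, has_image=False, response=""):
    """Format one training example with a single shared template for
    language-only and vision-language instructions (wording illustrative)."""
    header = ("Below is an instruction that describes a task. "
              "Write a response that appropriately completes the request.\n\n")
    image_part = "### Image:\n<image>\n\n" if has_image else ""
    return (header + image_part
            + f"### Instruction:\n{instruction}\n\n### Response:\n{response}")
```

Using one template for both modalities means the model sees a consistent structure whether or not an image is present, which is what allows joint training on language-only and vision-language data.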
Some datasets, such as VQA v2.0, OK-VQA, GQA, CLEVR, and NLVR, contain only one or two words per response (e.g., yes/no), which leads to poor conversational performance from MultiModal-GPT: when these datasets are included in training, the model tends to answer with only one or two words, and such brevity hurts usability. The researchers therefore also collect language-only data and use the common instruction template to train MultiModal-GPT jointly, improving its ability to converse with humans. Training on language-only instructions together with combined visual-and-language instructions improves model performance. They provide various demos demonstrating MultiModal-GPT's ability to hold continuous conversations with people, and they publish the codebase on GitHub.
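The dataset-filtering step described above can be sketched in a few lines. The field name `"answer"` and the word threshold are assumptions for illustration, not the authors' actual preprocessing code:

```python
def filter_short_answers(examples, min_words=3):
    """Drop QA pairs whose answers are only one or two words (e.g. a bare
    'yes'/'no'), since such data was found to make the model overly terse."""
    return [ex for ex in examples if len(ex["answer"].split()) >= min_words]
```

A filter like this keeps examples whose answers are full sentences, which better matches the conversational responses the chatbot is meant to produce.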
Check out the paper and the GitHub repo for more details.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his Bachelor of Science in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.
