The pursuit of general-purpose AI systems has produced capable end-to-end trainable models, many of which aim to offer users a simple natural language interface. The dominant recipe for building these systems is large-scale unsupervised pre-training followed by supervised multitask training. Ultimately, we want such systems to handle an endless long tail of difficult tasks, but this recipe requires a carefully curated dataset for each task. This research instead tackles the long tail of complex tasks by decomposing a task described in natural language into simpler steps, each handled by a specialized end-to-end pre-trained model or another program, using a large language model to perform the decomposition.
Consider telling a computer vision system, "Please tag this image with the seven main characters from the TV show The Big Bang Theory." The system must first understand the intent of the instruction and then carry out a sequence of steps: detect faces, retrieve a list of The Big Bang Theory's main characters from a knowledge base, classify each face against that list, and tag the image with the recognized characters' names. Although existing vision and language systems can each perform one of these steps, executing the full task specified in natural language is beyond the scope of any single end-to-end trained system.
Researchers at the Allen Institute for AI propose VISPROG, a system that takes visual input (a single image or a collection of images) together with natural language instructions and generates a sequence of steps, a so-called visual program, which it then executes to produce the desired result. Each line of a visual program invokes one of the many modules the system currently supports. Modules can wrap pretrained language models, off-the-shelf computer vision models, OpenCV image-processing subroutines, or arithmetic and logical operators. Each module consumes inputs produced by earlier lines of the program and emits intermediate outputs that later lines can use.
In the preceding example, a visual program generated by VISPROG uses a face detector, GPT-3 as the knowledge retrieval system, and CLIP as an open-vocabulary image classifier to produce the required output (see Figure 1). VISPROG powers both the generation and the execution of programs for vision applications. It relates to Neural Module Networks (NMNs), which compose specialized differentiable neural modules into question-specific, end-to-end trainable networks for visual question answering (VQA). NMN approaches either train a layout generator with weak answer supervision via REINFORCE, or use a brittle off-the-shelf semantic parser to generate module layouts deterministically.
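To make the idea concrete, here is a minimal, hypothetical sketch of how a visual program of this kind could be interpreted line by line. The program syntax, module names (`FaceDet`, `ListQuery`), and stub implementations are illustrative assumptions, not VISPROG's actual code; in the real system each module would wrap a pretrained model such as a face detector or GPT-3.

```python
import re

def face_det(image):
    # Stub: a real module would run a face detector and return face regions.
    return [{"box": (10, 10, 50, 50)}, {"box": (60, 10, 100, 50)}]

def list_query(query, n):
    # Stub: a real module would ask a language model for a list of entities.
    return [f"character_{i}" for i in range(n)]

MODULES = {"FaceDet": face_det, "ListQuery": list_query}

def execute(program, env):
    """Run a visual program line by line, storing each step's output in env."""
    for line in program.strip().splitlines():
        target, call = line.split("=", 1)
        name, argstr = re.match(r"(\w+)\((.*)\)", call.strip()).groups()
        kwargs = {}
        for pair in argstr.split(","):
            key, val = pair.split("=")
            key, val = key.strip(), val.strip()
            # Arguments are either references to earlier outputs or literals.
            kwargs[key] = env[val] if val in env else eval(val)
        env[target.strip()] = MODULES[name](**kwargs)
    return env

program = """
FACES=FaceDet(image=IMAGE)
NAMES=ListQuery(query='Big Bang Theory main characters', n=7)
"""
result = execute(program, {"IMAGE": "input.jpg"})
print(len(result["FACES"]), len(result["NAMES"]))
```

Because every step's output lands in `env`, the intermediate results remain available for inspection, which is what makes this style of execution interpretable.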
In contrast, VISPROG allows users to build complex programs without any training, using a powerful language model (GPT-3) and a small number of in-context examples. VISPROG programs also invoke trained state-of-the-art models and non-neural Python subroutines at a higher level of abstraction than NMNs. These advantages make VISPROG a fast, effective, and versatile neurosymbolic system. VISPROG is also highly interpretable. First, it generates easy-to-understand programs whose logical correctness users can verify. Second, because the prediction is broken into manageable steps, users can examine intermediate results to identify flaws and correct the logic if necessary.
The executed program, with the outputs of intermediate steps (text, bounding boxes, segmentation masks, generated images, etc.) linked to show the flow of information, serves as a visual rationale for the prediction. The authors demonstrate VISPROG's versatility on four different tasks. These tasks require general skills such as image analysis, but also specialized reasoning and visual manipulation skills:
- Compositional visual question answering.
- Zero-shot natural language visual reasoning (NLVR) on image pairs.
- Factual knowledge object tagging from natural language instructions.
- Language-guided image editing.
The authors emphasize that neither the modules nor the language model is modified in any way; adapting VISPROG to a new task requires only a few in-context examples pairing natural language instructions with suitable programs. VISPROG is easy to use and delivers strong results: a 2.7-point improvement over the base VQA model on compositional VQA, a good zero-shot accuracy of 62.4% on NLVR, and satisfying qualitative and quantitative results on the knowledge tagging and image editing tasks.
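A rough sketch of how such an in-context prompt might be assembled is shown below. The prompt format and the example program are assumptions for illustration; the paper defines its own example format and module vocabulary.

```python
# Hypothetical instruction/program pairs used as in-context examples.
IN_CONTEXT_EXAMPLES = [
    ("Tag the image with the 7 main characters of The Big Bang Theory.",
     "FACES=FaceDet(image=IMAGE)\n"
     "NAMES=ListQuery(query='Big Bang Theory main characters', n=7)\n"
     "IDS=Classify(regions=FACES, classes=NAMES)\n"
     "RESULT=Tag(image=IMAGE, labels=IDS)"),
]

def build_prompt(instruction):
    """Concatenate example pairs, then the new instruction for the LLM to complete."""
    parts = []
    for inst, prog in IN_CONTEXT_EXAMPLES:
        parts.append(f"Instruction: {inst}\nProgram:\n{prog}\n")
    # The language model is expected to continue with a new program.
    parts.append(f"Instruction: {instruction}\nProgram:\n")
    return "\n".join(parts)

prompt = build_prompt("Replace the red car with a blue bus.")
print(prompt)
```

In an actual deployment, `prompt` would be sent to a language model such as GPT-3, and the returned program text would then be executed by the interpreter; no model weights are updated at any point.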
Check out the Paper, GitHub, and project page for more details.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his Bachelor of Science in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is in image processing and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.
