Google AI proposes LANISTR: an attention-based machine learning framework that learns from language, images, and structured data

Image source: https://arxiv.org/abs/2305.16556

Google Cloud AI researchers introduced LANISTR to address the challenge of handling unstructured and structured data effectively and efficiently within a single framework. Processing multimodal data that combines language, images, and structured data is becoming increasingly important in machine learning. A key challenge is missing modalities in large-scale, unlabeled datasets that include structured data such as tables and time series. Traditional methods often struggle when one or more types of data are missing, which results in suboptimal model performance.

Current methods for pre-training on multimodal data typically assume that all modalities are available during training and inference, which is often not feasible in real-world scenarios. These methods include various forms of early and late fusion techniques, where data from different modalities are combined at either the feature or decision level. However, these approaches are not suitable for situations where some modalities may be completely missing or incomplete.
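To make the distinction concrete, here is a minimal Python sketch of feature-level (early) and decision-level (late) fusion. The modality names, the `None` convention for a missing modality, and both function names are assumptions made for illustration; they are not taken from the paper. The sketch only shows why a pipeline that expects every modality has nothing valid to compute when one is absent.

```python
from typing import Dict, List, Optional

# Hypothetical modality names; None marks a missing modality.
MODALITIES = ["text", "image", "tabular"]


def early_fusion(features: Dict[str, Optional[List[float]]]) -> List[float]:
    """Feature-level fusion: concatenate every modality vector into one input."""
    fused: List[float] = []
    for name in MODALITIES:
        vec = features.get(name)
        if vec is None:
            # A downstream model expects a fixed-size input, so a missing
            # modality leaves no valid fused vector to build.
            raise ValueError(f"early fusion needs modality '{name}'")
        fused.extend(vec)
    return fused


def late_fusion(scores: Dict[str, Optional[float]]) -> float:
    """Decision-level fusion: average the per-modality prediction scores."""
    values: List[float] = []
    for name in MODALITIES:
        score = scores.get(name)
        if score is None:
            # This strict variant mirrors the assumption that all
            # modalities are available at training and inference time.
            raise ValueError(f"late fusion needs a score for '{name}'")
        values.append(score)
    return sum(values) / len(values)
```

With all three modalities present, both functions return a result. Setting any entry to `None` raises an error, which is the failure mode that motivates a pre-training objective designed around missing modalities.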

Google's LANISTR (Language, Image, and Structured Data Transformer) is a novel pre-training framework that leverages unimodal and multimodal masking strategies to create a robust pre-training objective that can effectively handle missing modalities. The framework is based on a similarity-based multimodal masking objective, which allows the model to reason about missing modalities while learning from the available data. The framework aims to improve the adaptability and generalizability of multimodal models, especially in scenarios where labeled data is limited.

The LANISTR framework employs unimodal masking, where parts of the data in each modality are masked during training. This forces the model to learn contextual relationships within the modality. For example, in text data, certain words may be masked and the model learns to predict these based on the surrounding words. In images, certain patches may be masked and the model learns to infer these from the visible parts.
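The masking step can be sketched in a few lines of Python. This is an illustrative sketch, not the LANISTR implementation: the `[MASK]` placeholder, the `mask_tokens` name, and the masking ratio are assumptions. The function hides a random subset of positions and records the original values as the prediction targets.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # placeholder symbol; the name is an assumption


def mask_tokens(
    tokens: List[str], ratio: float, rng: random.Random
) -> Tuple[List[str], Dict[int, str]]:
    """Hide a random subset of positions and return (masked input, targets)."""
    if not tokens:
        return [], {}
    # Mask at least one position and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model should reconstruct
        masked[pos] = MASK          # what the model actually sees
    return masked, targets


words = "the battery lasts all day".split()
masked_words, targets = mask_tokens(words, ratio=0.4, rng=random.Random(0))
```

The same routine applies to a list of image patch identifiers instead of words. In both cases a model would be trained to predict the entries in `targets` from the unmasked context.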

Multimodal masking extends this concept by masking across modalities. For example, in a dataset containing text, images, and structured data, one or two modalities may be randomly and completely masked during training. The model is then trained to predict the masked modalities from the available ones. This is where the similarity-based objective comes into play: the model is guided by a similarity measure to ensure that the representations generated for the missing modalities are consistent with the available data. The effectiveness of LANISTR was evaluated on two real-world datasets: the Amazon Product Reviews dataset from the retail industry and the MIMIC-IV dataset from the healthcare industry.
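The multimodal masking idea described above can be sketched as follows. Everything here is an assumption made for illustration: the paper may compute the objective differently, the mean-pooling `fuse` function is a toy stand-in for a real multimodal encoder, and negative cosine similarity is just one plausible similarity measure. The sketch drops one or two whole modalities, embeds both the masked and the full sample, and scores how closely the two embeddings agree.

```python
import math
import random
from typing import Dict, List, Optional

Sample = Dict[str, Optional[List[float]]]  # None marks a hidden modality


def mask_modalities(sample: Sample, rng: random.Random) -> Sample:
    """Completely hide one or two present modalities, keeping at least one."""
    present = [name for name, value in sample.items() if value is not None]
    if len(present) < 2:
        return dict(sample)  # nothing can be dropped safely
    n_drop = rng.randint(1, min(2, len(present) - 1))
    dropped = set(rng.sample(present, n_drop))
    return {k: (None if k in dropped else v) for k, v in sample.items()}


def fuse(sample: Sample) -> List[float]:
    """Toy encoder (assumption): mean of the available modality embeddings.

    Assumes at least one modality is present and all share one dimension.
    """
    vectors = [v for v in sample.values() if v is not None]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def similarity_loss(masked_emb: List[float], full_emb: List[float]) -> float:
    """Negative cosine similarity: lower means the embeddings agree more."""
    dot = sum(a * b for a, b in zip(masked_emb, full_emb))
    norm = math.sqrt(sum(a * a for a in masked_emb)) * math.sqrt(
        sum(b * b for b in full_emb)
    )
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return -dot / norm


sample = {"text": [1.0, 0.0], "image": [0.8, 0.2], "tabular": [0.6, 0.4]}
masked = mask_modalities(sample, random.Random(0))
loss = similarity_loss(fuse(masked), fuse(sample))
```

Minimizing a loss of this shape pushes the embedding of the incomplete sample toward the embedding of the complete one, which is the intuition behind learning representations that stay consistent when modalities are missing.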

LANISTR has demonstrated effectiveness in out-of-distribution scenarios, where the model encounters data distributions not seen during training. This robustness is crucial in real-world applications where data variability is a common challenge. LANISTR has achieved significant improvements in accuracy and generalization, even when labeled data is scarce.

In conclusion, LANISTR addresses a critical problem in the field of multimodal machine learning: the challenge of missing modalities in large-scale unlabeled datasets. By employing a novel combination of unimodal and multimodal masking strategies and a similarity-based multimodal masking objective, LANISTR enables robust and efficient pre-training. Evaluation experiments demonstrate that LANISTR can effectively learn from incomplete data and generalize well to new and unknown data distributions, making it a valuable tool for advancing multimodal learning.

Check out the paper and blog. All credit for this research goes to the researchers of this project.


Pragati Jhunjhunwala is a Consulting Intern at MarktechPost. She is currently pursuing her B.Tech degree at the Indian Institute of Technology (IIT) Kharagpur. She is a technology enthusiast with a keen interest in the range of applications of software and data science. She is constantly reading about developments in various areas of AI and ML.
