Google AI introduces an efficient machine learning technique to scale Transformer-based large language models (LLMs) to infinitely long inputs

Machine Learning


https://arxiv.org/abs/2404.07143

Memory is important to intelligence because it lets us recall past experiences and apply them to current situations. However, due to the way the attention mechanism works, both traditional Transformer models and Transformer-based large language models (LLMs) have limited context-dependent memory: the attention mechanism's memory consumption and computation time both grow quadratically with sequence length.
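To see why, consider standard scaled dot-product attention: the score matrix has one entry per pair of positions, so its size grows quadratically with sequence length. A small NumPy illustration (shapes and dimensions here are purely illustrative):

```python
import numpy as np

def attention_scores(n, d=64, seed=0):
    """Standard scaled dot-product attention scores over a length-n sequence.

    The score matrix Q @ K^T has shape (n, n), so both its memory and the
    compute needed to fill it grow quadratically with sequence length n.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d))  # queries
    k = rng.standard_normal((n, d))  # keys
    scores = q @ k.T / np.sqrt(d)    # shape (n, n): the quadratic term
    return scores

# Doubling the sequence length quadruples the score matrix.
print(attention_scores(512).size)   # 262144 entries
print(attention_scores(1024).size)  # 1048576 entries
```

This is the cost that compressed-memory approaches aim to avoid.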

Compressed memory systems offer a more efficient and scalable alternative for managing very long sequences. In contrast to classical attention mechanisms, which require memory that grows with the length of the input sequence, compressed memory systems maintain a constant number of parameters for storing and retrieving information, reducing both storage and compute costs.

The goal of this system's parameter adjustment process is to assimilate new information into memory while keeping it retrievable later. However, existing LLMs have yet to adopt an effective compressed memory method that balances simplicity with quality.

To overcome these limitations, a team of researchers at Google proposed a unique solution that allows Transformer LLMs to process inputs of arbitrary length with a bounded memory footprint and bounded computation. The key component of their approach is an attention mechanism known as Infini-attention. It incorporates compressed memory into the traditional attention process and combines long-term linear attention and masked local attention within a single Transformer block.

The main breakthrough of Infini-attention is its ability to manage memory efficiently while processing long sequences. By using compressed memory, the model can store and recall information with a fixed set of parameters, eliminating the need for memory to grow with the length of the input sequence. This keeps computational costs within reasonable limits and bounds memory consumption.
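The paper describes this compressed memory as a fixed-size associative matrix updated with a linear-attention-style rule. The sketch below illustrates the idea, assuming an ELU+1 feature map; it omits the paper's learned projections, delta-rule update variant, and gating, and all names and shapes are illustrative rather than the authors' exact parameterization:

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: a positive feature map commonly used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

class CompressiveMemory:
    """Minimal sketch of a fixed-size associative memory in the spirit of
    Infini-attention: the state is a (d_k, d_v) matrix plus a (d_k,)
    normalizer, no matter how many tokens have been absorbed."""

    def __init__(self, d_k, d_v):
        self.M = np.zeros((d_k, d_v))  # associative matrix memory
        self.z = np.zeros(d_k)         # normalization term

    def update(self, K, V):
        # Bind a segment's keys (s, d_k) to its values (s, d_v).
        sK = elu_plus_one(K)
        self.M += sK.T @ V
        self.z += sK.sum(axis=0)

    def retrieve(self, Q):
        # Read out values for queries (s, d_k) -> (s, d_v).
        sQ = elu_plus_one(Q)
        return (sQ @ self.M) / (sQ @ self.z + 1e-6)[:, None]
```

Because `M` and `z` never change shape, absorbing ten segments costs the same memory as absorbing one, which is the property the article describes.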

The team demonstrated the method's effectiveness on several benchmarks, including book summarization with input sequences of 500,000 tokens, passkey retrieval from context blocks in sequences up to 1 million tokens long, and long-context language modeling. LLMs ranging from 1 billion to 8 billion parameters were used to solve these tasks.

One of the main advantages of this approach is its minimally bounded memory parameterization: the model's memory requirements are constrained and predictable. The proposed approach also enables fast streaming inference for LLMs, allowing efficient analysis of sequential inputs in real-time or near-real-time settings.

The team summarizes their main contributions as follows:

  1. The research team introduced Infini-attention, a novel attention mechanism that combines local causal attention with long-term compressed memory. The method is practical and effective because it represents context dependencies over both short and long ranges.
  2. Infini-attention requires only slight modifications to the standard scaled dot-product attention mechanism. This enables plug-and-play continual pre-training and long-context adaptation, making it easy to integrate into existing Transformer architectures.
  3. The approach allows Transformer-based LLMs to handle infinitely long contexts with bounded memory and computational resources. By processing very long inputs in streaming mode, it ensures efficient use of resources and allows LLMs to perform well in real-world applications involving large amounts of data.
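The streaming behavior in the contributions above can be sketched as a segment-by-segment loop in which only a fixed-size state matrix is carried between segments. This is an illustrative simplification (unnormalized attention, no learned projections, feature maps, or gating), not the paper's exact algorithm:

```python
import numpy as np

def stream_process(token_stream, d=16, segment_len=32):
    """Sketch of streaming inference with bounded state: process an
    arbitrarily long stream segment by segment, carrying only a fixed-size
    (d, d) memory matrix between segments."""
    M = np.zeros((d, d))  # fixed-size state carried across segments
    outputs = []
    for start in range(0, len(token_stream), segment_len):
        seg = token_stream[start:start + segment_len]       # (s, d)
        # Local causal attention within the segment (unnormalized sketch):
        scores = np.tril(seg @ seg.T / np.sqrt(d))
        local = scores @ seg
        # Read from, then update, the compressed memory:
        mem_out = seg @ M
        M = M + seg.T @ seg
        outputs.append(local + mem_out)
    return np.concatenate(outputs), M

stream = np.random.default_rng(0).standard_normal((256, 16))
out, M = stream_process(stream)
# out has one row per input token; M stayed (16, 16) throughout.
```

No matter how long the stream, the state carried between segments stays `(d, d)`, which is what makes real-time processing of large inputs tractable.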

In conclusion, this work represents a major advance for LLMs, allowing them to process very long inputs efficiently in terms of both computation and memory usage.


Check out the paper. All credit for this study goes to the researchers of this project.


Tanya Malhotra is a final-year student at the University of Petroleum and Energy Research, Dehradun, pursuing a Bachelor's degree in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, and a keen interest in learning new skills, leading groups, and managing work in an organized manner.

šŸ Join the fastest growing AI research newsletter from researchers at Google + NVIDIA + Meta + Stanford + MIT + Microsoft and more…




