New ways to improve the functionality of large language models | MIT News, Massachusetts Institute of Technology



Most languages use word order and sentence structure to convey meaning. For example, “The cat sat on the box” does not mean the same thing as “The box was on the cat.” In longer texts, such as financial documents or novels, the relationships between words can shift as the text unfolds.

Similarly, a person might track variables in code or follow instructions that include conditional actions. These are examples of state changes and sequential reasoning that state-of-the-art artificial intelligence systems are expected to excel at. However, existing attention mechanisms within transformers, the architecture primarily used in large language models (LLMs) to determine the importance of words, have theoretical and empirical limitations when it comes to this kind of capability.

The attention mechanism lets an LLM look back at earlier parts of a query or document and decide, based on its training, which words and details matter most. On its own, however, this mechanism has no sense of word order: it “sees” all input words, or tokens, at once. To process tokens in the order they appear, researchers have developed techniques to encode positional information, which is important for highly structured domains like language. But a common method, rotary positional encoding (RoPE), considers only the relative distance between tokens in a sequence and does not depend on the input data. Words that are four positions apart, such as “cat” and “box” in the example above, always undergo the same fixed mathematical rotation determined by that relative distance, regardless of what lies between them.
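To make that distance-only behavior concrete, here is a minimal NumPy sketch of a RoPE-style rotation (the function name and default base are illustrative, not taken from the paper): the score between a rotated query and key depends only on how far apart the two positions are.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate pairs of dimensions of x by angles proportional to its position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Two token pairs, both four positions apart, receive identical attention
# scores no matter where they sit in the sequence or what lies between them.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))
score_near = rope_rotate(q, 6) @ rope_rotate(k, 2)
score_far = rope_rotate(q, 106) @ rope_rotate(k, 102)
print(np.isclose(score_near, score_far))  # True: only relative distance matters
```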

Now, research led by MIT and the MIT-IBM Watson AI Lab has developed an encoding technique known as “PaTH Attention” that makes positional information adaptive and context-aware, rather than static as in RoPE.

“Transformers enable accurate and scalable modeling of many domains, but they have limitations when it comes to state tracking, a type of phenomenon thought to underlie critical capabilities required of AI systems. The key question, then, is how to enable state tracking while maintaining the scalability and efficiency of transformers,” says the paper’s senior author Yoon Kim, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher at the MIT-IBM Watson AI Lab.

A paper on this research was presented at the Conference on Neural Information Processing Systems (NeurIPS) earlier this month. Kim’s co-authors include lead author Songlin Yang, an EECS graduate student and former intern in the MIT-IBM Watson AI Lab summer program; Kaiyue Wen of Stanford University; Liliang Ren of Microsoft; and Yikang Shen, Shawn Tan, Mayank Mishra, and Rameswar Panda of IBM Research and the MIT-IBM Watson AI Lab.

A path to understanding

Rather than assigning every pair of words a fixed rotation based on their relative distance, as RoPE does, PaTH attention is flexible: it treats the span between two words as a path made up of small, data-dependent transformations. Each transformation is based on a mathematical operation called a Householder reflection, which acts like a small mirror that adjusts depending on the content of the token it passes through. Each step in the sequence can therefore affect how the model interprets information later on. The cumulative effect lets the system model not only the distance between words, but also how meaning changes along the path between them. This approach allows transformers to track how entities and relationships change over time, giving the model a kind of positional memory. Think of it as walking down a street, where what you experience along the way shapes how you understand where you end up.

The team also developed a hardware-efficient algorithm that computes attention scores between all pairs of tokens more efficiently. It compresses the cumulative transformations from PaTH attention and splits them into smaller computations that map well onto fast GPU processing.
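As a rough illustration of the idea (not the paper’s exact formulation, and certainly not its efficient algorithm; the helper names and shapes below are assumptions), a naive version of these path-dependent scores might look like the following, where each token contributes a small content-dependent reflection to the transform applied between a query and every earlier key:

```python
import numpy as np

def householder(w):
    """Householder reflection I - 2*w*w^T / (w^T w), built from a token-dependent vector."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - 2.0 * np.outer(w, w)

def path_attention_scores(q, k, w):
    """Naive O(T^2 * d^2) sketch of path-style attention logits.

    q, k, w: arrays of shape (T, d). w[t] is a content-dependent vector for
    token t that defines its reflection. The score between query i and key j
    uses the product of the reflections of the tokens on the path between
    them, so the transform depends on what lies between the words, not just
    how far apart they are.
    """
    T, d = q.shape
    scores = np.full((T, T), -np.inf)        # causal mask by default
    for i in range(T):
        transform = np.eye(d)
        for j in range(i, -1, -1):
            scores[i, j] = q[i] @ transform @ k[j]
            transform = transform @ householder(w[j])   # token j joins the path
    return scores

# Toy usage: pairs at the same distance can get different transforms,
# because the intervening tokens (via w) differ.
rng = np.random.default_rng(0)
T, d = 6, 8
q, k, w = rng.normal(size=(3, T, d))
print(np.round(path_attention_scores(q, k, w), 2))
```

Because the reflections are built from the tokens themselves, two word pairs at the same distance can receive different transforms, which is exactly what RoPE’s fixed rotations cannot do. The quadratic loop above is only for clarity; the paper’s hardware-efficient algorithm avoids materializing these cumulative products directly.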

The MIT-IBM researchers then investigated PaTH attention’s performance on synthetic and real-world tasks, including reasoning, long-context benchmarks, and full LLM training, to see whether the model’s ability to track information held up. The team tested the ability to follow the most recent “write” command despite many distracting steps, along with a multi-step recall test, tasks that are difficult with standard positional encoding techniques like RoPE. The researchers also trained a medium-sized LLM and compared it against other methods: PaTH attention improved perplexity and outperformed the alternatives on reasoning benchmarks it was not trained on. They further evaluated retrieval, reasoning, and stability with inputs of tens of thousands of tokens, where PaTH attention consistently kept track of the relevant content.

“We found that our new approach can outperform existing attention mechanisms while maintaining efficiency, both in diagnostic tasks designed to test the limits of transformers and in real-world language modeling tasks,” says Kim. “We are looking forward to seeing whether this kind of data-dependent positional encoding, like PaTH, improves the performance of transformers in structured domains such as biology, [analyzing] proteins and DNA.”

Think bigger and more efficiently

The researchers next investigated whether PaTH attention could better mimic the human cognitive ability to ignore old or irrelevant information when making decisions. To do this, they combined PaTH attention with another positional encoding scheme known as the Forgetting Transformer (FoX), which allows the model to selectively “forget.” The resulting PaTH-FoX system adds a data-dependent way to down-weight information, and it achieved strong results across reasoning, long-context understanding, and language modeling benchmarks. In this way, PaTH attention extends the expressive power of transformer architectures.
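A minimal sketch of the forgetting idea, under the assumption that FoX-style gates act as a bias on the attention logits (the function and variable names are illustrative, not taken from the papers): each token emits a gate between 0 and 1, and the logit between two positions is discounted by the accumulated log-gates of the tokens in between, so content the model chooses to forget fades in a data-dependent way.

```python
import numpy as np

def fox_decay(scores, forget_gates):
    """Apply a Forgetting-Transformer-style decay to causal attention logits.

    scores: (T, T) attention logits (e.g., from path_attention_scores above).
    forget_gates: (T,) data-dependent gates in (0, 1), one per token.
    The logit for query i attending to key j is reduced by the sum of
    log-gates over the tokens between j and i, before the softmax.
    """
    log_f = np.log(forget_gates)
    cum = np.cumsum(log_f)                  # cumulative log-forget up to each position
    decay = cum[:, None] - cum[None, :]     # sum of log-gates over positions (j, i]
    return scores + decay
```

Combining the two, for example fox_decay(path_attention_scores(q, k, w), gates) with the toy code above, captures the spirit of PaTH-FoX: the path transform makes positional information content-aware, while the gates let the model softly discard what no longer matters.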

Kim says such research is part of a broader effort to develop the “next big thing” in AI. He explains that a key driver of both deep learning and the generative AI revolution has been the creation of “universal building blocks that can be applied across a wide range of domains,” such as convolutional layers and RNNs [recurrent neural networks]. Looking to the future, Kim points out that considerations such as accuracy, expressiveness, flexibility, and hardware scalability have been, and will continue to be, essential. In his words, “The core business of modern architecture research is trying to devise these new primitives that are scalable while still maintaining or increasing expressiveness.”

This research was supported in part by the MIT-IBM Watson AI Lab and Schmidt Sciences' AI2050 program.


