Stanford University Researchers Introduce Gisting: A New Technique for Efficient Prompt Compression in Language Models

Machine Learning


Source: https://arxiv.org/pdf/2304.08467.pdf

Model specialization involves adapting a pretrained machine learning model to a specific task or domain. For language models (LMs), specialization is important for improving performance on tasks such as summarization, question answering, translation, and language generation. The two main processes for specializing a language model to a particular task are instruction fine-tuning (adapting a pretrained model to a new task or set of tasks) and model distillation (transferring knowledge from a pretrained “teacher” model to a smaller, specialized “student” model). Prompting provides a way to steer the model toward a particular behavior, makes more efficient use of limited training data, and is essential for achieving state-of-the-art performance, which makes it a key concept in LM specialization. Prompt compression is a technique being researched in the hope that it can save substantial compute, memory, and storage without significantly reducing the overall performance or quality of the output.

This paper, published by researchers at Stanford University, proposes a new prompt compression technique called gisting, which trains LMs to compress prompts into sets of smaller “gist” tokens. To reduce the cost of prompts, techniques such as fine-tuning and distillation can be used to train a model that behaves like the original model without the prompt, but in that case the model must be retrained for every new prompt, which is far from ideal. The idea behind gisting, by contrast, is to use a meta-learning approach to predict gist tokens from prompts. This eliminates the need to retrain the model for each task and allows generalization to unseen instructions without additional training. It reduces computational cost and allows prompts to be compressed, cached, and reused for computational efficiency. It also lets users fit more content into a limited context window.

The authors experimented with a simple way of realizing such a model: they used the LM itself (leveraging its existing knowledge) to predict gist tokens during instruction fine-tuning while modifying the Transformer’s attention mask. Given a (task, input) pair, gist tokens are inserted between the task and the input, and the attention mask is set so that input tokens after the gist tokens cannot attend to any of the prompt tokens before the gist tokens (they can still attend to the gist tokens). Because the input and output cannot attend to the prompt, the model is forced to compress the information from the prompt into the gist tokens in between.
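To make the masking scheme concrete, here is a minimal sketch (not the authors’ code) of how such a mask could be constructed; the function name and the example sequence lengths are illustrative assumptions.

```python
import torch

def gist_attention_mask(task_len: int, num_gist: int, input_len: int) -> torch.Tensor:
    """Causal attention mask for the gisting setup: positions after the gist
    tokens may not attend to the task (prompt) tokens that precede them, so
    the prompt's information must flow through the gist tokens.
    1 = may attend, 0 = masked."""
    total = task_len + num_gist + input_len
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.long))
    gist_end = task_len + num_gist
    # Tokens at or after the first input position cannot see the task tokens.
    mask[gist_end:, :task_len] = 0
    return mask

# Example: a 4-token task, 1 gist token, and a 3-token input.
print(gist_attention_mask(task_len=4, num_gist=1, input_len=3))
```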
To train a gist model, the researchers needed a dataset covering a diverse set of tasks, so they created a dataset called Alpaca+. It combines data from two existing instruction-tuning datasets (Stanford Alpaca and Self-Instruct), totaling more than 130,000 examples. They then held out three validation splits so the model could be evaluated after training: seen prompts, unseen prompts, and hand-crafted human prompts. This made it possible to test generalization to unseen instructions, with the human split posing an even stronger generalization challenge. They trained gist models on multiple LM architectures (LLaMA-7B, a decoder-only GPT-style model, and FLAN-T5-XXL) with varying numbers of gist tokens (1, 2, 5, or 10). The results show that models are generally insensitive to the number of gist tokens and that, in some cases, a larger number of tokens actually hurts performance, so a single gist token was used for the rest of the experiments.


To assess the quality of prompt compression, the researchers calibrated performance against a positive control, effectively standard instruction fine-tuning with full access to the prompt, which provides an upper bound on performance, and a negative control in which the model has no access to the instruction at all; here, random gist tokens were used to provide a lower bound on performance. To compare a model’s output with the positive control and measure a win rate against it, they asked ChatGPT to choose which response was better and to explain why. They also used ROUGE-L, a simple lexical overlap statistic that measures the similarity between generated text and the human-written references in open-ended instruction fine-tuning. A 50% win rate indicates that the model is of comparable quality to a model that does no prompt compression.
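For readers unfamiliar with the metric, ROUGE-L scores overlap via the longest common subsequence between a generated answer and a reference. The snippet below is a minimal illustration of that idea, not the evaluation script used in the paper (real evaluations typically rely on an existing ROUGE library with proper tokenization).

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-L F1 over whitespace tokens, based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```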

On seen instructions, the gist models performed very close to the positive-control models, with win rates of 48.6% (LLaMA) and 50.8% (FLAN-T5). More importantly, the gist models generalized competitively to unseen prompts, with win rates of 49.7% (LLaMA) and 46.2% (FLAN-T5). Only on the hardest human split did the win rates drop slightly, while remaining competitive, to 45.8% (LLaMA) and 42.5% (FLAN-T5). FLAN-T5’s slightly worse performance and its specific failure cases led to further hypotheses to test in future work.

The researchers also explored the potential efficiency gains from gisting, which were the primary motivation for this study. The results are very encouraging: gist caching reduces FLOPs by 40% and wall-clock time by 4-7% compared with an unoptimized model. These improvements were found to be smaller for decoder-only language models, but the researchers also showed that gist models enable 26x compression of unseen prompts, freeing up considerable additional space in the input context window.
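The caching idea behind those gains is that the activations for an instruction’s gist tokens can be computed once and reused across many inputs, so the full instruction never has to be re-encoded. The sketch below is a toy illustration of that pattern under that assumption; the two helper functions are stand-ins for a real model call (for example, one that returns and reuses Transformer key/value caches), not an actual API.

```python
# Cache of precomputed gist activations, keyed by the instruction text.
gist_cache = {}

def encode_gist_kv(instruction: str) -> tuple:
    """Toy stand-in: pretend to compress the instruction into gist activations."""
    return ("gist-activations-for", instruction[:20])

def generate_with_prefix_kv(gist_kv: tuple, user_input: str) -> str:
    """Toy stand-in: pretend to decode conditioned on cached gist activations."""
    return f"[answer to {user_input!r} conditioned on {gist_kv[1]!r}...]"

def answer(instruction: str, user_input: str) -> str:
    # Encode and cache the gist activations only the first time a prompt is seen.
    if instruction not in gist_cache:
        gist_cache[instruction] = encode_gist_kv(instruction)
    # Later requests pay only for the short gist prefix, not the full prompt.
    return generate_with_prefix_kv(gist_cache[instruction], user_input)

print(answer("Translate the input into French.", "Good morning!"))
print(answer("Translate the input into French.", "See you tomorrow."))  # cache hit
```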

Overall, these findings demonstrate the great potential of gisting to enhance both the effectiveness and the efficiency of specialized language models. The authors also suggest promising directions for follow-up work on gisting. For example, the greatest compute and efficiency gains from gisting come from compressing longer prompts, and they suggest that “gist pretraining”, first learning to compress arbitrary spans of natural language before learning to compress prompts, could further improve compression performance.




Nathalie Crevoisier holds a Bachelor’s and Master’s degree in Physics from Imperial College London. She studied Applied Data Science, Machine Learning, and Internet Analytics at the École Polytechnique Fédérale de Lausanne (EPFL) for one year as part of her degree. While in school, she developed a strong interest in AI, and after graduating she joined Meta (formerly Facebook) as a data scientist. During her four-year tenure at the company, Nathalie worked on various teams, including Ads, Integrity, and Workplace, applying cutting-edge data science and ML tools to solve complex problems affecting billions of users. Seeking independence and time to stay on top of the latest AI discoveries, she recently decided to transition to a freelance career.



