Recursive Introspection (RISE): A machine learning approach to fine-tuning LLMs to improve their responses over multiple turns

Machine Learning


https://arxiv.org/abs/2407.18219

Large language models (LLMs) have attracted significant attention as powerful tools for a wide variety of tasks, but their potential as general-purpose decision-making agents comes with its own challenges. To function effectively as agents, LLMs must do more than generate plausible text completions: they must exhibit interactive, goal-directed behavior to accomplish a specified task. This requires two key abilities: actively seeking information about the task, and improving decisions by “thinking” and verifying at inference time. Current methods struggle to achieve these capabilities, especially on complex tasks that require logical reasoning. LLMs often possess the necessary knowledge but fail to apply it effectively when asked to correct their own mistakes over successive turns. This limitation highlights the need for a more robust approach that allows LLM agents to self-improve at test time.

Researchers have tried a variety of approaches to enhance the reasoning capabilities of base models for downstream applications. These methods primarily focus on prompting techniques: multi-turn interaction with external tools, reflection and verbalization of intermediate thoughts, refinement of predictions through self-critique and correction, or the use of separate models to critique responses. While some of these approaches show promise in improving responses, they often rely on detailed error traces or external feedback to succeed.

Prompting techniques, while useful, have limitations. Research has shown that intrinsic self-correction, guided solely by the LLM itself, is often infeasible for off-the-shelf models, even when they possess the knowledge needed to answer the prompt. Fine-tuning LLMs to gain self-improvement capabilities has also been explored, using strategies such as training on self-generated responses, learned verifiers, search algorithms, contrastive prompting for negative data, and iterative supervised or reinforcement learning.

However, these existing methods primarily target single-turn performance and do not instill the ability to improve across consecutive turns of interaction. Some research directly fine-tunes LLMs for multi-turn interaction via reinforcement learning, but that line of work addresses a different challenge from the one here: solving a single-turn problem over multiple turns.

Researchers from Carnegie Mellon University, UC Berkeley, and MultiOn have introduced RISE (Recursive Introspection), a novel approach for enhancing LLMs' self-improvement capabilities. The method employs an iterative fine-tuning procedure that frames single-turn prompts as multi-turn Markov decision processes. Drawing on principles from online imitation learning and reinforcement learning, RISE derives multi-turn data-collection and training strategies. This enables LLMs to recursively detect and correct their mistakes in subsequent iterations, a capability previously considered difficult to achieve. Unlike traditional methods that focus on single-turn performance, RISE instills dynamic self-improvement in LLMs, aiming to transform their problem-solving in complex scenarios.

RISE presents an innovative way to fine-tune base models for self-improvement over multiple turns. The method first converts a single-turn problem into a multi-turn Markov decision process (MDP): the prompt becomes the initial state, and each model response is an action. The next state is formed by concatenating the current state, the model's action, and a fixed introspection prompt, and rewards are based on the correctness of the answer. Within this MDP, RISE collects data by generating improved responses, either through distillation from a more capable model or through self-distillation. Finally, RISE trains the model with reward-weighted supervised learning, teaching it to improve its predictions over successive attempts.
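To make the construction concrete, here is a minimal Python sketch of how a single-turn problem can be unrolled into the multi-turn MDP described above, with reward-weighted data collection. This is an illustration under stated assumptions, not the paper's code: the `generate`, `improved_response`, and `check_answer` helpers and the introspection wording are hypothetical stand-ins for the policy, the supervision source, and the answer checker.

```python
# Minimal sketch (not the paper's implementation) of RISE's
# single-turn -> multi-turn MDP construction and data collection.

INTROSPECTION_PROMPT = (  # illustrative wording, not the paper's exact prompt
    "Your previous answer may be wrong. Reconsider and give a corrected answer."
)

def generate(state: str) -> str:
    """Stand-in for sampling a response a_t from the current policy pi(.|s_t)."""
    return "candidate solution for: " + state.splitlines()[0]

def improved_response(state: str) -> str:
    """Stand-in for the supervision source: a more capable teacher model
    (distillation) or best-of-N sampling from the model itself
    (self-distillation)."""
    return "improved solution ending in 42"

def check_answer(response: str, target: str) -> float:
    """Binary reward based on correctness of the final answer."""
    return 1.0 if target in response else 0.0

def collect_rollout(prompt: str, target: str, num_turns: int = 5) -> list[dict]:
    """Unroll one problem into a multi-turn trajectory.

    s_0 is the prompt; a_t is the model's response; s_{t+1} concatenates
    s_t, a_t, and a fixed introspection prompt. Each turn yields a
    (state, improved action, reward) triple for reward-weighted
    supervised fine-tuning.
    """
    data = []
    state = prompt  # s_0: the original question
    for _ in range(num_turns):
        action = generate(state)               # what the model actually said
        better = improved_response(state)      # improved target response
        reward = check_answer(better, target)  # reward for the target
        data.append({"state": state, "action": better, "weight": reward})
        # s_{t+1} = concat(s_t, a_t, introspection prompt)
        state = "\n".join([state, action, INTROSPECTION_PROMPT])
    return data

if __name__ == "__main__":
    rollout = collect_rollout("What is 6 * 7?", target="42")
    print(f"collected {len(rollout)} weighted training examples")
```

In the reward-weighted supervised step, each example's contribution to the fine-tuning loss is scaled by its weight, so responses that lead to correct answers dominate training while unrewarded ones contribute little or nothing.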

RISE shows significant performance gains across multiple benchmarks. On GSM8K, RISE improves the 5-turn performance of the Llama2 base model by 15.1% and 17.7% after one and two iterations of training, respectively, without any oracle. On MATH, the corresponding gains are 3.4% and 4.6%. These improvements exceed those achieved by alternatives such as prompting-only self-refinement and standard fine-tuning on oracle data. Notably, RISE also outperforms sampling multiple responses in parallel, demonstrating that it genuinely corrects mistakes over successive turns rather than merely resampling. The method's effectiveness persists across base models: Mistral-7B + RISE outperforms Eurus-7B-SFT, a model specifically fine-tuned for mathematical reasoning. A self-distilled variant of RISE is also promising, improving 5-turn performance even with fully self-generated data and supervision.
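For intuition, the oracle-free sequential evaluation described above can be sketched as follows, reusing the hypothetical `generate` and `INTROSPECTION_PROMPT` stubs from the earlier sketch; `extract_answer` and the majority-vote aggregation are likewise illustrative assumptions rather than the paper's exact protocol.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Stand-in for parsing the final answer out of a model response."""
    return response.split()[-1]

def run_multi_turn(prompt: str, num_turns: int = 5) -> str:
    """Sequential self-improvement at inference time, without an oracle.

    The model answers, is prompted to introspect, answers again, and so on;
    the final prediction aggregates the per-turn answers by majority vote.
    (With an oracle, one would instead stop as soon as an answer is correct.)
    """
    answers = []
    state = prompt
    for _ in range(num_turns):
        response = generate(state)
        answers.append(extract_answer(response))
        state = "\n".join([state, response, INTROSPECTION_PROMPT])
    return Counter(answers).most_common(1)[0][0]
```

The comparison in the paper is against spending the same sample budget on parallel, independent samples of the first turn; RISE's advantage there is what indicates genuine sequential error correction rather than lucky resampling.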

RISE introduces a distinctive approach to fine-tuning large language models so that they improve their responses over multiple turns. By casting a single-turn problem as a multi-turn Markov decision process, RISE applies iterative reinforcement learning to on-policy rollout data with expert or self-generated supervision. The method significantly enhances the self-improvement ability of 7B models on reasoning tasks, outperforming prior approaches, and its consistent gains across base models and tasks demonstrate genuine sequential error correction. Computational constraints currently limit the number of training iterations, especially with self-generated supervision, but RISE points to a promising direction for strengthening the self-improvement capabilities of LLMs.



Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is an avid advocate of Machine Learning and Deep Learning and is constantly exploring applications of Machine Learning in healthcare.
