Q*: A General-Purpose AI Approach to Improve LLM Performance in Reasoning Tasks



https://arxiv.org/abs/2406.14283

Large language models (LLMs) have demonstrated strong capabilities on a variety of reasoning tasks expressed in natural language, including mathematical word problems, code generation, and planning. However, as the complexity of reasoning tasks increases, even the most advanced LLMs suffer from errors, hallucinations, and inconsistencies due to their autoregressive nature. This challenge is particularly pronounced in tasks that require multiple reasoning steps, where the “System 1” thinking of LLMs (fast and instinctive, but less accurate) is insufficient. More careful and logical “System 2” thinking becomes essential to solve complex reasoning problems accurately and consistently.

Several attempts have been made to overcome the challenges LLMs face in complex reasoning tasks. Supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) aim to align LLM outputs with human expectations. Direct preference optimization (DPO) and Aligner methods have also been developed to improve alignment. To give LLMs planning capabilities, Tree of Thoughts (ToT), A* search, and Monte Carlo tree search (MCTS) have been applied. For mathematical reasoning and code generation, techniques such as prompt engineering, fine-tuning on task-specific corpora, and training reward models have been explored. However, these methods often require extensive expertise, significant computational resources, or task-specific modifications, which limits their generality and efficiency.

Researchers from Skywork AI and Nanyang Technological University have introduced Q*, a robust framework designed to enhance the multi-step reasoning capabilities of LLMs through deliberative planning. The approach formalizes LLM reasoning as a Markov decision process (MDP), where states combine the input prompt with the previous reasoning steps, actions represent the next reasoning step, and rewards measure the success of the task. Q* introduces general methods to estimate optimal Q-values for state-action pairs, including offline reinforcement learning, selecting the best sequence from rollouts, and completion with a more powerful LLM. By framing multi-step reasoning as a heuristic search problem, Q* uses a plug-and-play Q-value model as the heuristic function within the A* search framework to guide LLMs toward the most promising next steps efficiently.
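The MDP framing described above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' code: the names `State`, `transition`, and `outcome_reward` are hypothetical, and the reward shown here (a simple check that the final step ends with the ground-truth answer) is a stand-in for whatever task-success measure is actually used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    """A state: the input prompt plus the reasoning steps produced so far."""
    prompt: str
    steps: Tuple[str, ...] = ()


def transition(state: State, action: str) -> State:
    """An action is the next reasoning step; taking it appends it to the state."""
    return State(state.prompt, state.steps + (action,))


def outcome_reward(state: State, ground_truth: str) -> float:
    """Hypothetical reward: 1.0 if the last step ends with the ground-truth answer."""
    if not state.steps:
        return 0.0
    return 1.0 if state.steps[-1].strip().endswith(ground_truth) else 0.0


s0 = State("Tom has 3 apples and buys 4 more. How many apples does he have?")
s1 = transition(s0, "Tom starts with 3 apples and gains 4, so we add: 3 + 4.")
s2 = transition(s1, "The answer is 7")
print(len(s2.steps), outcome_reward(s2, "7"))
```

In this sketch the transition is deterministic (plain concatenation), so the only source of uncertainty is which next step the LLM proposes.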

The Q* framework treats multi-step reasoning in LLMs as a heuristic search problem and solves it with the A* search algorithm. Each state is associated with an f-value, computed as a weighted sum of the aggregated utility and a heuristic value. The aggregated utility is calculated using a process-based reward function, and the heuristic value is estimated using the optimal Q-value of the state. Q* introduces three methods for estimating the optimal Q-value: offline reinforcement learning, learning from rollouts, and approximation using a more powerful LLM. These methods allow the framework to learn from the training data without task-specific modifications. The deliberative planning process follows the A* search algorithm. Two sets of states are maintained: unvisited and visited. The algorithm iteratively selects the state with the highest f-value from the unvisited set, expands it using the LLM policy, and updates both sets accordingly. This process continues until a terminal state (a complete trajectory) is reached, at which point the answer is extracted from that state.
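The search loop described above can be sketched as a best-first search over partial reasoning traces. This is a minimal sketch under assumptions rather than the paper's implementation: the function name `best_first_reasoning`, its callback parameters (`propose`, `step_reward`, `q_heuristic`, `is_terminal`), the weight `lam`, and the toy problem at the bottom are all hypothetical. A real system would back `propose` with an LLM policy and `q_heuristic` with a learned Q-value model.

```python
import heapq
from itertools import count
from typing import Callable, List, Tuple

Steps = Tuple[str, ...]  # a state is represented by its reasoning steps so far


def best_first_reasoning(
    propose: Callable[[Steps], List[str]],   # candidate next steps (LLM policy)
    step_reward: Callable[[Steps], float],   # process-based reward of the newest step
    q_heuristic: Callable[[Steps], float],   # estimated optimal Q-value of a state
    is_terminal: Callable[[Steps], bool],
    lam: float = 1.0,                        # weight on the heuristic (assumed name)
    max_expansions: int = 1000,
) -> Steps:
    tie = count()  # tie-breaker so the heap never compares states directly
    # heapq is a min-heap, so store -f to pop the highest f-value first.
    unvisited = [(0.0, next(tie), 0.0, ())]
    visited = set()
    expansions = 0
    while unvisited and expansions < max_expansions:
        _neg_f, _order, g, state = heapq.heappop(unvisited)
        if is_terminal(state):
            return state  # complete trajectory; the answer is read from it
        if state in visited:
            continue
        visited.add(state)
        expansions += 1
        for action in propose(state):
            child = state + (action,)
            child_g = g + step_reward(child)          # aggregated utility
            f = child_g + lam * q_heuristic(child)    # f = g + lam * h
            heapq.heappush(unvisited, (-f, next(tie), child_g, child))
    return ()  # no terminal state found within the budget


# Toy problem: two-step traces over {"a", "b"}; the heuristic prefers "b".
result = best_first_reasoning(
    propose=lambda s: ["a", "b"],
    step_reward=lambda s: 0.0,
    q_heuristic=lambda s: float(s.count("b")),
    is_terminal=lambda s: len(s) == 2,
)
print(result)
```

With the toy heuristic, the search expands the trace beginning with "b" first and returns `("b", "b")` without ever expanding the "a" branch, which illustrates how a good Q-value estimate keeps the number of LLM calls small.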

Q* showed significant performance gains across a range of reasoning tasks. On the GSM8K dataset, it enhanced Llama-2-7b to achieve 80.8% accuracy, outperforming ChatGPT-turbo. On the MATH dataset, Q* improved Llama-2-7b and DeepSeekMath-7b to reach 55.4% accuracy, outperforming models such as Gemini Ultra (4-shot). In code generation, Q* improved the accuracy of CodeQwen1.5-7b-Chat to 77.0% on the MBPP dataset. These results consistently demonstrate the effectiveness of Q* in improving LLM performance across mathematical reasoning and code generation tasks, outperforming traditional methods and some closed-source models.

Q* offers an effective way to overcome the challenges of multi-step reasoning in LLMs by introducing a robust deliberation framework. This approach enhances LLMs' ability to solve complex problems that require detailed and logical thinking beyond simple autoregressive token generation. Unlike previous methods that rely on task-specific utility functions, Q* uses a generic Q-value model trained solely on ground-truth data, making it easily adaptable to different reasoning tasks without modification. Because the Q-value model is a plug-and-play heuristic, it guides LLMs without task-specific fine-tuning and so preserves their performance across a range of tasks. Q* also stays lightweight because it considers only a single step at a time, in contrast to more computationally intensive methods such as MCTS. Extensive experiments in mathematical reasoning and code generation demonstrate the strong performance of Q*, highlighting its potential to significantly improve LLMs' complex problem-solving capabilities.


Check out the paper. All credit for this research goes to the researchers of this project.


Asjad is an Intern Consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering from Indian Institute of Technology Kharagpur. Asjad is an avid advocate of Machine Learning and Deep Learning and is constantly exploring the application of Machine Learning in Healthcare.
