Introducing HuggingGPT: A framework that uses an LLM to connect AI models from machine learning communities (such as Hugging Face) to solve AI tasks



Large language models (LLMs) like ChatGPT have attracted a great deal of interest from researchers and industry because of their impressive results on a wide range of NLP tasks. Through extensive pre-training on vast text corpora and reinforcement learning from human feedback (RLHF), LLMs acquire strong language understanding, generation, interaction, and reasoning capabilities. This potential has opened many new research areas and created broad opportunities to develop state-of-the-art AI systems.

To reach their full potential and tackle challenging AI tasks, LLMs must collaborate with other models. Choosing the right middleware to establish a communication channel between the LLM and those AI models is therefore crucial. The researchers observe that each AI model can be described in language by summarizing what it does, and they propose that the LLM use language as a generic interface to link different AI models. Because prompts can contain model descriptions, the LLM can act as the central nervous system that plans, schedules, and coordinates AI models, calling third-party models to complete AI tasks. A second problem arises when many different AI models are to be incorporated: covering many AI tasks requires collecting many high-quality model descriptions, which demands intensive prompt engineering. Fortunately, many public ML communities already offer a wide range of good models for specific AI tasks in language, vision, and speech, and these models come with clear and concise descriptions.

To connect an LLM (e.g. ChatGPT) with an ML community (e.g. Hugging Face), a research team has proposed HuggingGPT, which can process inputs from multiple modalities and solve numerous complex AI problems. To establish the connection with ChatGPT, the researchers take the description of each AI model from the Hugging Face library and fuse it into the prompt. The LLM (e.g. ChatGPT) then becomes the “brain” of the system that answers user queries.


The Hugging Face Hub lets researchers and developers collaborate on machine learning models and datasets. It also offers a simple user interface for finding and downloading ready-to-use models for a variety of applications.

HuggingGPT Stages

HuggingGPT works in four stages:

  • Task planning: ChatGPT interprets the user request and, guided by prompts, breaks it down into discrete, actionable tasks.
  • Model selection: Based on the model descriptions, ChatGPT selects an expert model hosted on Hugging Face for each task.
  • Task execution: Each selected model is called and run, and its results are returned to ChatGPT.
  • Response generation: ChatGPT integrates the predictions of all models and generates the final answer for the user.
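The four stages above can be pictured as a simple control loop. The following is only a minimal sketch, not the project's actual code: `run_pipeline` and the four callables passed to it are hypothetical names, and the toy stand-ins replace ChatGPT and the Hugging Face models so the example runs offline.

```python
from typing import Callable, List


def run_pipeline(
    request: str,
    plan: Callable[[str], List[dict]],
    select: Callable[[dict], str],
    execute: Callable[[str, dict], str],
    respond: Callable[[str, List[dict]], str],
) -> str:
    """Hypothetical controller loop; every callable is an assumed stand-in."""
    tasks = plan(request)                        # 1. task planning (LLM)
    records = []
    for task in tasks:
        model_id = select(task)                  # 2. model selection (LLM)
        result = execute(model_id, task)         # 3. task execution (expert model)
        records.append({"task": task, "model": model_id, "result": result})
    return respond(request, records)             # 4. response generation (LLM)


# Toy stand-ins so the sketch runs without any network access.
def toy_plan(request: str) -> List[dict]:
    return [{"task": "image-classification", "args": {"image": "cat.png"}}]


def toy_select(task: dict) -> str:
    return "toy/" + task["task"]


def toy_execute(model_id: str, task: dict) -> str:
    return f"{model_id} -> label=cat"


def toy_respond(request: str, records: List[dict]) -> str:
    return f"{len(records)} task(s) run: " + "; ".join(r["result"] for r in records)


answer = run_pipeline("What is in cat.png?", toy_plan, toy_select, toy_execute, toy_respond)
print(answer)  # 1 task(s) run: toy/image-classification -> label=cat
```

In a real system the `plan`, `select`, and `respond` callables would each wrap a call to the LLM, which is why the LLM is consulted several times per request.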

A closer look:

HuggingGPT starts by having a large language model decompose the user’s request into individual tasks. For complex requests, the large language model must also establish the relationships between tasks and the order in which they run. To guide the large language model toward effective task planning, HuggingGPT uses a prompt design that combines specification-based instruction with demonstration-based parsing. The remaining stages are described in the following paragraphs.
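To make task relationships and ordering concrete, here is a minimal sketch of how a planner reply could be parsed and ordered. It assumes, purely for illustration, that the LLM replies with a JSON list whose entries carry `id`, `task`, `dep`, and `args` fields; the field names and the helper `parse_plan` are hypothetical, not taken from the source.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parse_plan(llm_reply: str) -> list:
    """Parse an assumed JSON task list and return it in dependency order."""
    tasks = json.loads(llm_reply)
    by_id = {t["id"]: t for t in tasks}
    for t in tasks:
        missing = [d for d in t["dep"] if d not in by_id]
        if missing:
            raise ValueError(f"task {t['id']} depends on unknown ids {missing}")
    # Map each task id to the set of ids it depends on, then sort.
    graph = {t["id"]: set(t["dep"]) for t in tasks}
    order = TopologicalSorter(graph).static_order()  # raises CycleError on cycles
    return [by_id[i] for i in order]


# A made-up reply in which the dependent task happens to be listed first.
reply = """[
  {"id": 1, "task": "image-to-text", "dep": [0], "args": {"image": "<output of 0>"}},
  {"id": 0, "task": "text-to-image", "dep": [], "args": {"text": "a cat"}}
]"""

ordered = parse_plan(reply)
print([t["id"] for t in ordered])  # [0, 1]
```

Validating the reply before acting on it also guards against plans that refer to tasks that were never defined.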

After parsing the list of tasks, HuggingGPT must choose an appropriate model for each one. The researchers do this by retrieving expert model descriptions from the Hugging Face Hub and using an in-context task-model assignment mechanism to select dynamically which model to apply to a particular task. This approach is more open and adaptable, because supporting a new expert model only requires providing its description, so models can be added over time.
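One plausible way to picture in-context model assignment is to shortlist candidate models for a task and place their descriptions in the prompt, leaving the final choice to the LLM. The sketch below is an assumption-laden illustration: the `ModelCard` class, the download-count ranking, the prompt wording, and the `example/...` model ids are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelCard:
    """Hypothetical, simplified view of a model's Hub metadata."""
    model_id: str
    task: str
    downloads: int
    description: str


def shortlist(cards: List[ModelCard], task: str, k: int = 2) -> List[ModelCard]:
    # Keep only models for this task, then take the k most downloaded
    # (ranking by downloads is an assumption made for this sketch).
    matching = [c for c in cards if c.task == task]
    return sorted(matching, key=lambda c: c.downloads, reverse=True)[:k]


def build_selection_prompt(task: str, candidates: List[ModelCard]) -> str:
    # The descriptions go into the prompt so the LLM can choose in context.
    lines = [f"Task: {task}", "Candidate models:"]
    lines += [f"- {c.model_id}: {c.description}" for c in candidates]
    lines.append("Reply with the id of the most suitable model.")
    return "\n".join(lines)


cards = [
    ModelCard("example/detector-a", "object-detection", 500, "General-purpose detector."),
    ModelCard("example/detector-b", "object-detection", 900, "Detector tuned for street scenes."),
    ModelCard("example/detector-c", "object-detection", 100, "Small, fast detector."),
    ModelCard("example/tts-a", "text-to-speech", 700, "English speech synthesis."),
]

candidates = shortlist(cards, "object-detection", k=2)
prompt = build_selection_prompt("object-detection", candidates)
print(prompt)
```

Shortlisting keeps the prompt short, which matters because every candidate description consumes part of the LLM's limited context.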

Once a model has been assigned to a task, the next step is to execute it, that is, to run model inference. HuggingGPT uses hybrid inference endpoints to speed up these models and keep them computationally stable. Each model takes the task arguments as input, performs the necessary computation, and returns the inference results to the large language model. Models that have no resource dependencies on one another can be parallelized for more efficient inference, so multiple tasks whose dependencies are all satisfied can be started simultaneously.
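The idea of starting every task whose dependencies are already satisfied can be sketched with a thread pool that runs tasks in waves. This is an illustrative sketch only: `run_tasks`, the task dictionary layout, and the toy `fake_infer` function are hypothetical, and a real system would call an inference endpoint instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, execute):
    """Run tasks in waves: each wave holds all tasks whose deps are done."""
    results = {}
    pending = {t["id"]: t for t in tasks}
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [t for t in pending.values()
                     if all(d in results for d in t["dep"])]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Submit the whole wave at once so independent tasks overlap.
            futures = {
                t["id"]: pool.submit(execute, t, {d: results[d] for d in t["dep"]})
                for t in ready
            }
            for task_id, future in futures.items():
                results[task_id] = future.result()
                del pending[task_id]
    return results


def fake_infer(task, inputs):
    # Stand-in for a model call; it just records which inputs it received.
    used = ", ".join(inputs[d] for d in task["dep"])
    return f"{task['task']}({used})"


tasks = [
    {"id": 0, "task": "pose-detection", "dep": []},
    {"id": 1, "task": "image-captioning", "dep": []},
    {"id": 2, "task": "image-generation", "dep": [0, 1]},
]

results = run_tasks(tasks, fake_infer)
print(results[2])  # image-generation(pose-detection(), image-captioning())
```

Here tasks 0 and 1 run concurrently in the first wave, and task 2 starts only after both have finished.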

Once all tasks have been performed, HuggingGPT moves to the response generation stage. It combines the results of the previous three stages (task planning, model selection, and task execution) into one cohesive report. This report details the planned tasks, the models chosen for those tasks, and the inference results those models produced.


Contributions

  • The researchers propose an inter-model collaboration protocol that combines the complementary strengths of large language models and expert models. Separating the large language model, which acts as the brain for planning and decision-making, from the smaller models, which carry out each given task, opens a new approach to building general AI systems.
  • By integrating more than 400 task-specific models from the Hugging Face Hub around ChatGPT, the researchers built HuggingGPT to tackle a broad class of AI problems. Thanks to this open collaboration of models, HuggingGPT offers users a reliable multimodal chat service.
  • Extensive experiments on challenging AI tasks in language, vision, speech, and cross-modality show that HuggingGPT can understand and solve complex tasks across multiple modalities and domains.


Advantages

  • By design, HuggingGPT can use external models, which lets it perform a variety of complex AI tasks and integrate multimodal perception skills.
  • Thanks to this pipeline, HuggingGPT can also keep absorbing knowledge from domain-specific experts, enabling extensible and scalable AI capabilities.
  • HuggingGPT integrates hundreds of Hugging Face models around ChatGPT, spanning 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video. Experimental results show that HuggingGPT can handle complex AI tasks and multimodal data.


Limitations

  • HuggingGPT still has limitations. Efficiency is the foremost concern, as it is a potential barrier to success.
  • Large language model inference is the main efficiency bottleneck. HuggingGPT has to call the large language model several times for each round of a user request: during task planning, model selection, and response generation. These interactions significantly increase response times and degrade the quality of service for end users.
  • The second limitation is the maximum context length, which is bounded by the maximum number of tokens the LLM accepts. To mitigate this, the researchers use a conversation window and track the conversation context only in the task planning stage.
  • A third concern is overall system reliability. During inference, large language models can deviate from the instructions, and the output format can surprise developers. An LLM that occasionally refuses to follow its instructions during inference is one example.
  • Finally, the state of expert models on Hugging Face inference endpoints is hard to control. An expert model may fail during the task execution stage because of network latency or the state of the service.
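The context-length limitation can be illustrated with a simple conversation window that keeps only the most recent turns that fit within a token budget. This is a rough sketch under stated assumptions: the function `window` is hypothetical, and token counts are approximated by whitespace-separated words, whereas a real system would use the LLM's own tokenizer.

```python
def window(turns, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the newest turns whose combined (approximate) token count fits."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break                         # older turns no longer fit
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order


history = [
    "draw a cat",                 # 3 words
    "now make it blue please",    # 5 words
    "add a hat",                  # 3 words
]

print(window(history, max_tokens=8))
# ['now make it blue please', 'add a hat']
```

Dropping the oldest turns first preserves the most recent instructions, at the cost of forgetting early context in long dialogs.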

The source code is available in a repository called “JARVIS”.

Conclusion

Advancing AI requires solving difficult problems across different domains and modalities. Many AI models exist, but on their own they are not powerful enough to handle complex AI tasks. LLMs can serve as controllers that manage existing AI models to carry out such tasks, and language works as a generic interface because LLMs have shown excellent language understanding, generation, interaction, and reasoning capabilities. Following this idea, the researchers present HuggingGPT, a framework that uses an LLM (such as ChatGPT) to link various AI models from machine learning communities (such as Hugging Face) to complete AI tasks. More specifically, HuggingGPT uses ChatGPT to plan tasks after receiving a user request, selects models based on their function descriptions in Hugging Face, executes each subtask with the selected AI model, and summarizes the response from the execution results. By combining ChatGPT’s strong language capabilities with Hugging Face’s rich collection of AI models, HuggingGPT paves the way for advanced AI across language, vision, speech, and more.

Check out the paper and the GitHub repository. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer with a keen interest in AI applications and strong experience in FinTech companies covering the domains of Finance, Cards & Payments and Banking. She is passionate about exploring new technologies and advancements in today’s evolving world to make life easier for everyone.


