Peking University Researchers Develop FastServe: A Distributed Inference Serving System for Large Language Models (LLMs)



https://arxiv.org/abs/2305.05920

Improvements in large language models (LLMs) have created opportunities in a variety of areas and spawned a new wave of interactive AI applications. Most notably, ChatGPT lets people converse with an AI agent to solve problems ranging from software engineering to language translation. Thanks to its remarkable capabilities, ChatGPT is one of the fastest-growing applications of all time. Many companies have followed the trend by releasing LLMs and ChatGPT-like products, including Microsoft’s New Bing, Google’s Bard, Meta’s LLaMA, Stanford’s Alpaca, Databricks’ Dolly, and UC Berkeley’s Vicuna.

LLM inference has special properties that distinguish it from the inference of other deep neural network (DNN) models such as ResNet. Interactive AI applications built on LLMs rely on inference to function, and their interactive nature demands a low job completion time (JCT) for LLM inference to provide an engaging user experience. For example, a user who submits a prompt to ChatGPT expects an immediate response. However, the size and complexity of LLMs place a heavy burden on the inference-serving infrastructure, so companies set up expensive clusters with accelerators such as GPUs and TPUs to process LLM inference jobs.

DNN inference jobs are typically deterministic and highly predictable: their run time is largely determined by the model and the hardware. For example, different input images take only slightly different times to run through the same ResNet model on a given GPU. In contrast, LLM inference has a distinctive autoregressive pattern. An LLM inference job runs through multiple iterations; each iteration produces one output token, which is appended to the input to generate the next token in the following iteration. The output length, unknown at the outset, determines both the execution time and the final input length. Existing inference-serving systems such as Clockwork and Shepherd target deterministic model inference jobs like those of ResNet.
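The autoregressive pattern described above can be sketched as a simple loop. This is a toy illustration, not FastServe code: `next_token` is a hypothetical stand-in for a full LLM forward pass, and the token ids are fabricated for demonstration.

```python
EOS = 0  # hypothetical end-of-sequence token id

def next_token(tokens):
    """Toy stand-in for an LLM forward pass over the whole sequence."""
    t = (sum(tokens) * 7) % 11  # fabricated; a real model predicts a token
    return EOS if t == 0 else t

def generate(prompt_tokens, max_new_tokens=16):
    """Each iteration produces one token, which is appended to the input
    and fed back in. The output length is unknown until EOS appears, so
    the job's total run time cannot be predicted from the input alone."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        t = next_token(tokens)
        if t == EOS:
            break
        tokens.append(t)
    return tokens[len(prompt_tokens):]
```

Because the loop's trip count depends on where EOS lands, two jobs with identical input lengths can have very different run times, which is exactly what breaks profiling-based schedulers.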


These systems base scheduling decisions on accurate execution-time profiling, which is ineffective for LLM inference with its variable execution times. The state-of-the-art system for LLM inference, Orca, proposes iteration-level scheduling: after each iteration, newly arrived jobs can join the current processing batch and completed jobs can leave it. However, Orca processes inference jobs first-come, first-served (FCFS): once scheduled, a job runs until it completes. The processing batch cannot be grown with an arbitrary number of incoming jobs, because GPU memory capacity is limited and inference jobs have tight JCT requirements, and run-to-completion processing is well known to suffer from head-of-line blocking.
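A tiny simulation makes the head-of-line blocking concrete. This is an illustrative sketch, not Orca or FastServe code; job lengths are measured in iterations (output tokens), with one iteration taking one time unit.

```python
def fcfs_jcts(jobs):
    """jobs: list of (arrival_time, num_iterations), in arrival order.
    Returns each job's completion time under run-to-completion FCFS."""
    clock, jcts = 0, []
    for arrival, length in jobs:
        clock = max(clock, arrival) + length  # run the job to completion
        jcts.append(clock - arrival)          # JCT = finish - arrival
    return jcts

# A long job arriving first blocks the short job right behind it:
print(fcfs_jcts([(0, 100), (1, 2)]))  # -> [100, 101]
```

The 2-iteration job spends 99 of its 101 time units just waiting, which is the pathology FastServe's preemptive scheduling targets.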

The problem is especially acute for LLM inference jobs because LLMs are very large and have long absolute execution times. Large LLM inference jobs, particularly those with long outputs, take a long time to complete and block the short jobs behind them. Researchers at Peking University have developed FastServe, a distributed inference serving system for LLMs. FastServe exploits iteration-level scheduling and the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token: after generating a token, FastServe can either let the scheduled job continue or preempt it in favor of another job in the queue. This preemptive scheduling allows FastServe to reduce JCT and head-of-line blocking.

A novel skip-join multi-level feedback queue (MLFQ) scheduler forms the foundation of FastServe. MLFQ is a classic method for minimizing average JCT when job sizes are unknown: each job starts in the highest-priority queue and is demoted to the next queue if it does not finish within that queue’s time quantum. LLM inference, however, is semi-information-agnostic: the output length is not known in advance, but the input length is. This is the key difference from the traditional setting. The input length determines the execution time needed to generate the first output token, which, due to the autoregressive pattern of LLM inference, can be much longer than the execution time of each subsequent token.
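The classic MLFQ demotion rule can be sketched as follows. This is a generic textbook MLFQ, not FastServe's scheduler; the quanta and job lengths are hypothetical, and a "job" is just a remaining-service time.

```python
from collections import deque

def mlfq_run(jobs, num_levels=3, base_quantum=2):
    """jobs: list of total service times. Every job starts in the
    highest-priority queue (level 0) and is demoted when it exhausts
    that level's quantum. Returns job indices in completion order."""
    queues = [deque() for _ in range(num_levels)]
    for i, length in enumerate(jobs):
        queues[0].append((i, length))  # all arrivals enter the top queue
    done = []
    while any(queues):
        level = next(l for l, q in enumerate(queues) if q)  # highest non-empty
        i, rem = queues[level].popleft()
        quantum = base_quantum * (2 ** level)  # quantum doubles per level
        if rem <= quantum:
            done.append(i)  # finished within this level's quantum
        else:
            # demote to the next-lower queue (or stay at the bottom)
            queues[min(level + 1, num_levels - 1)].append((i, rem - quantum))
    return done
```

For jobs of length 1, 10, and 3, the two short jobs finish before the long one, which is how MLFQ approximates shortest-job-first without knowing job sizes up front.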

When the input is long and the output short, generating the first output token dominates the job’s run time. FastServe exploits this property by adding skip-join to traditional MLFQ: instead of always entering the highest-priority queue, each arriving job joins the appropriate queue by comparing the execution time of its first output token against the queues’ demotion thresholds, skipping the higher-priority queues to minimize demotions. Preemptive scheduling with MLFQ introduces extra memory overhead for keeping the intermediate state of started-but-unfinished jobs. An LLM maintains a key-value cache for each Transformer layer to store this intermediate state. Under FCFS, the cache only needs to hold the intermediate state of the scheduled jobs, bounded by the batch size. With MLFQ, however, additional jobs may have been started and then demoted to lower-priority queues, and the cache must hold the intermediate state of every started-but-unfinished job. Given the size of LLMs and the limited memory of GPUs, the cache can overflow. When the cache is full, the scheduler could simply delay starting new jobs, but that would reintroduce head-of-line blocking.
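The skip-join rule itself reduces to a small lookup, sketched here under assumed per-level quanta (the thresholds below are hypothetical, not FastServe's actual values): an arriving job enters the first queue whose quantum covers its first-token execution time, which is predictable from the known input length.

```python
def skip_join_level(first_token_time, thresholds):
    """thresholds[i] = demotion quantum of queue i (0 = highest priority).
    Returns the index of the queue a newly arrived job should join."""
    for level, quantum in enumerate(thresholds):
        if first_token_time <= quantum:
            return level
    return len(thresholds) - 1  # longest jobs start at the lowest priority

thresholds = [2, 4, 8, 16]  # hypothetical quanta, doubling per level

assert skip_join_level(1.5, thresholds) == 0  # short prompt: top queue
assert skip_join_level(6.0, thresholds) == 2  # long prompt skips two levels
```

A job with a 6-unit first token would have burned through the level-0 and level-1 quanta anyway; placing it directly at level 2 avoids those wasted demotions without penalizing genuinely short jobs.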

Instead, the researchers developed a proactive GPU memory management mechanism that uploads the state of jobs in low-priority queues back to the GPU when they are scheduled and offloads their state when the cache is nearly full, employing pipelining and asynchronous memory operations for efficiency. FastServe also uses parallelization techniques such as tensor parallelism and pipeline parallelism to serve distributed inference across many GPUs for models too large to fit on a single GPU. The scheduler runs multiple batches of jobs concurrently to reduce pipeline bubbles, and a distributed key-value cache manager organizes the cache and coordinates memory swapping between GPU and host memory. The researchers implemented a FastServe prototype based on NVIDIA FasterTransformer. The results show that FastServe improves average and tail JCT by up to 5.1x and 6.4x, respectively, compared with the state-of-the-art system Orca.
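The offload/upload policy can be sketched at a high level. This is an assumed simplification, not FastServe's actual manager: cache entries are abstract sizes, "priority" is the job's queue level (larger number = lower priority), and the high-water mark is a hypothetical parameter.

```python
class KVCacheManager:
    """Toy sketch: offload the lowest-priority started job's key-value
    cache to host memory when GPU occupancy crosses a high-water mark,
    and upload a job's state back before it runs."""

    def __init__(self, capacity, high_water=0.9):
        self.capacity = capacity
        self.high_water = high_water
        self.on_gpu = {}   # job_id -> (queue_level, size)
        self.on_host = {}  # offloaded state

    def used(self):
        return sum(size for _, size in self.on_gpu.values())

    def admit(self, job_id, level, size):
        self.on_gpu[job_id] = (level, size)
        # proactively offload low-priority state when nearly full
        while self.used() > self.high_water * self.capacity and len(self.on_gpu) > 1:
            victim = max(self.on_gpu, key=lambda j: self.on_gpu[j][0])
            if victim == job_id:
                break
            self.on_host[victim] = self.on_gpu.pop(victim)

    def schedule(self, job_id):
        # upload state back to the GPU before the job is scheduled
        if job_id in self.on_host:
            self.on_gpu[job_id] = self.on_host.pop(job_id)
```

The real system hides the transfer latency by overlapping these swaps with computation (pipelined, asynchronous copies); this sketch only captures the eviction-choice logic.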


Please check out the paper. Don’t forget to join our 21,000+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the article above or missed something, feel free to email us at Asif@marktechpost.com


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his Bachelor of Science in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.



