HetCCL demonstrates scaling of multivendor GPU clusters for large language models

Machine Learning


Researchers are grappling with a critical bottleneck in developing increasingly large language models: inefficient use of diverse GPU hardware. Heehoon Kim, Jaehwan Lee, and Taejeoung Kim, affiliated with Seoul National University and Samsung Research Institute, collaborated with Jongwon Park, Jinpyo Kim, Pyongwon Suh, and others to develop HetCCL, a new collective communication library designed to overcome the limitations of current deep learning frameworks. HetCCL enables fast RDMA-based communication between GPUs from different vendors without requiring any driver changes. This provides a path to cost-effective, high-performance training across heterogeneous GPU clusters and makes practical large-scale language model development possible on off-the-shelf hardware.

The work addresses a significant inefficiency: current deep learning frameworks do not support collective communication between GPUs from different vendors, which increases cost and reduces performance.

The research team achieved cross-vendor communication by leveraging the optimized vendor libraries NVIDIA NCCL and AMD RCCL through two mechanisms integrated within HetCCL. Evaluations conducted on a multivendor GPU cluster demonstrate that HetCCL not only matches the performance of NCCL and RCCL in a homogeneous environment, but also, uniquely, continues to scale in a heterogeneous setup.
This study reveals a practical solution for high-performance training leveraging both NVIDIA and AMD GPUs without requiring any changes to existing deep learning applications. The rapid emergence of trillion-scale deep learning models requires advanced computational power, often enabled by heterogeneous cluster systems with various hardware accelerators.

Currently, GPU-based platforms, especially those leveraging NVIDIA or AMD GPUs, dominate deep learning, but parallel training between GPUs from different vendors remains a major challenge due to incompatibility of communication backends. HetCCL overcomes this limitation by enabling seamless communication, thereby unlocking the potential of heterogeneous GPU clusters for large-scale model training.
The study establishes a method for direct point-to-point communication between GPUs from different vendors using RDMA. This is achieved through a carefully designed implementation of heterogeneous GPU collective communication operations, abstraction of platform-specific APIs, and consolidation of vendor-optimized operations into a unified framework.
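To illustrate the kind of platform-API abstraction described above, the following is a minimal sketch assuming a build-time switch between the CUDA and HIP runtimes; the function name hetMalloc and the USE_HIP flag are illustrative, not part of HetCCL's published interface.

```c
/* Minimal sketch: one allocation entry point that resolves to the vendor
 * runtime selected at build time. hetMalloc and USE_HIP are illustrative
 * names, not HetCCL's actual API. */
#include <stddef.h>

#ifdef USE_HIP
#include <hip/hip_runtime.h>
static int hetMalloc(void **ptr, size_t bytes) {
    /* AMD path: device memory usable by RCCL and RDMA-capable NICs */
    return hipMalloc(ptr, bytes) == hipSuccess ? 0 : -1;
}
#else
#include <cuda_runtime.h>
static int hetMalloc(void **ptr, size_t bytes) {
    /* NVIDIA path: device memory usable by NCCL and RDMA-capable NICs */
    return cudaMalloc(ptr, bytes) == cudaSuccess ? 0 : -1;
}
#endif
```

A collective library built this way can expose a single interface while each process links against the vendor runtime that matches its local GPU.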

Experiments show that HetCCL significantly accelerates the training of large language models on multivendor GPU clusters, outperforming homogeneous setups while avoiding dispersion effects and preserving model accuracy. By replacing existing communication backends with HetCCL, researchers can take advantage of both NVIDIA and AMD GPUs within existing parallel training code written in frameworks such as PyTorch, without any code changes.

This work is the first demonstration of transparently leveraging all of the GPUs in a heterogeneous, multivendor cluster, supporting NVIDIA and AMD GPUs, which dominate the accelerator market with approximately 88% and 12% market share, respectively. The team’s contributions pave the way for building the scalable, cost-effective AI infrastructure essential to advancing the next generation of distributed machine learning systems.

Heterogeneous GPU collective communication over direct RDMA interconnect enables scalable multi-GPU training

Scientists developed HetCCL, a collective communication library designed to integrate vendor-specific backends and facilitate RDMA-based communication across GPUs without changing drivers. This work addresses inefficiencies caused by the lack of cross-vendor collective communication support in current deep learning frameworks, particularly in growing GPU clusters.

HetCCL lets practitioners leverage both NVIDIA and AMD GPUs simultaneously for practical, high-performance training without changing existing deep learning applications. The researchers designed a way to use RDMA for direct point-to-point communication between GPUs from different vendors. The experiments used standard InfiniBand and RoCE networks, with the network interface card bypassing the CPU and accessing GPU memory directly.

This approach avoids the bandwidth limitations of host-memory staging, a common bottleneck in inter-node communication, as shown in Figures 1a and 1b. The research also consolidates vendor-optimized operations, integrating NVIDIA’s NCCL and AMD’s RCCL into a unified framework.

The team implemented heterogeneous GPU collective communication operations and abstracted platform-specific APIs to create a seamless interface. This includes registering device memory allocated with the vendor-specific APIs cudaMalloc and hipMalloc, which allows RDMA-capable NICs to directly access GPU memory through the IB Verbs API.
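As a concrete illustration of that registration step, here is a minimal sketch for the NVIDIA side, assuming a GPUDirect-RDMA-capable driver stack and using the first RDMA device on the host; error handling is trimmed, and the code is not taken from HetCCL itself.

```c
/* Sketch: register GPU memory allocated with cudaMalloc so an RDMA NIC can
 * read and write it directly through IB Verbs (requires GPUDirect RDMA /
 * peer-memory support in the driver stack). Error handling is minimal. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t len = 1 << 20;                 /* 1 MiB communication buffer */
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);            /* device memory, not host memory */

    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* The NIC maps the GPU address range, so transfers skip host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered GPU buffer, rkey=0x%x\n", mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```

On an AMD node the same pattern would apply with hipMalloc in place of cudaMalloc.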

As shown in Figure 2b, by replacing the original communication backend with HetCCL, existing parallel training code written in frameworks such as PyTorch can take advantage of GPUs from both vendors. Evaluation on a multivendor GPU cluster demonstrates that HetCCL achieves performance comparable to NCCL and RCCL in a homogeneous setup, while uniquely continuing to scale in a heterogeneous environment.

Heterogeneous GPU performance via RDMA and unified communications libraries enables scalable multi-GPU training

Scientists have developed HetCCL, a new collective communication library that integrates vendor-specific backends and facilitates RDMA-based communication across GPUs without changing drivers. This study addresses the inefficiencies and increased costs associated with scaling GPU clusters using hardware from multiple vendors.

Experiments show that HetCCL achieves comparable performance to NVIDIA NCCL and AMD RCCL on a homogeneous GPU setup. Importantly, HetCCL is uniquely scalable in heterogeneous environments, enabling practical, high-performance training on both NVIDIA and AMD GPUs without requiring changes to existing deep learning applications.

The team demonstrated a key innovation in HetCCL: direct point-to-point communication via RDMA between GPUs from different vendors. This feature bypasses the CPU, significantly reduces memory-copy overhead, and takes advantage of the higher bandwidth of the interconnect network. HetCCL supports NVIDIA and AMD GPUs, which together dominate the accelerator market, accounting for approximately 88% and 12% market share, respectively.
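To make that data path concrete, the sketch below posts a one-sided RDMA write from a registered local GPU buffer directly to a remote GPU buffer; it assumes the queue pair is already connected and the memory keys were exchanged out of band, and the helper name is illustrative rather than HetCCL's internal code.

```c
/* Sketch: one-sided RDMA write from a registered GPU buffer to a remote
 * GPU buffer. Assumes the queue pair (qp) is connected and lkey/rkey were
 * exchanged beforehand; illustrative only, not HetCCL's internal code. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_gpu_rdma_write(struct ibv_qp *qp, void *local_gpu_buf, uint32_t lkey,
                        uint64_t remote_gpu_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uint64_t)(uintptr_t)local_gpu_buf;  /* GPU memory address */
    sge.length = len;
    sge.lkey   = lkey;

    struct ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* no CPU work on the remote side */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_gpu_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);       /* returns 0 on success */
}
```

Because the write is one-sided and the buffers are GPU-resident, data moves NIC-to-NIC without staging through either host's memory.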

By replacing native communication backends such as NCCL and RCCL with HetCCL, the researchers enabled existing parallel training code to seamlessly utilize GPUs from both vendors. This breakthrough provides a unified framework for heterogeneous GPU clusters, abstracting platform-specific APIs and integrating vendor-optimized operations.

Evaluation on multivendor GPU clusters showed significant performance improvements for training large language models. Tests show that HetCCL avoids dispersion effects, maintains model accuracy, and achieves faster training speeds than homogeneous setups. According to the authors, HetCCL is the first cross-vendor CCL that enables training and inference of deep learning models on heterogeneous clusters without any source-code changes. This effort establishes a critical foundation for building a scalable and cost-effective AI infrastructure.

Heterogeneous GPU cluster training with unified communications and RDMA integration enables scalable deep learning

Scientists developed HetCCL, a collective communication library designed to improve the efficiency of deep learning training across GPU clusters that mix hardware from different vendors. Current deep learning frameworks often struggle with communication between GPUs from different vendors, leading to performance bottlenecks and increased costs.

HetCCL addresses this issue by integrating vendor-specific backends and enabling RDMA-based communication without changing existing GPU drivers. The library introduces two key mechanisms to facilitate cross-vendor communication while leveraging optimized vendor libraries, specifically NVIDIA NCCL and AMD RCCL.

Evaluations conducted on a multi-vendor GPU cluster demonstrate that HetCCL achieves comparable performance to NCCL and RCCL in a homogeneous environment. Importantly, HetCCL scales effectively in heterogeneous environments, enabling high-performance training with GPUs from multiple vendors without requiring changes to existing deep learning applications.

The relative error of the final loss values for all comparisons is less than 7 × 10⁻³, which remains within numerical tolerance. This research significantly expands the possibilities for machine learning practitioners by allowing more flexible use of available accelerators, facilitating larger batch sizes and higher training throughput.
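As a worked example of that tolerance check, the snippet below computes the relative error between two hypothetical final-loss values (the numbers are illustrative, not from the paper) and compares it against 7 × 10⁻³.

```c
/* Worked example of the reported tolerance check: relative error of the
 * final loss must stay below 7e-3. Loss values here are hypothetical. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double loss_het = 2.113;   /* hypothetical heterogeneous-cluster final loss */
    double loss_hom = 2.105;   /* hypothetical homogeneous-cluster final loss */
    double rel_err = fabs(loss_het - loss_hom) / fabs(loss_hom);
    printf("relative error = %.4f (%s tolerance of 7e-3)\n",
           rel_err, rel_err < 7e-3 ? "within" : "outside");
    return 0;
}
```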

HetCCL removes a significant barrier to the use of heterogeneous training infrastructures, which are becoming increasingly common. The authors acknowledge that the system’s ability to reach its full potential is limited by its dependence on the underlying RDMA network infrastructure. Future research may consider further optimization of specific network topologies and investigate the application of HetCCL to a broader range of deep learning workloads.


