Greening the AI/ML Data Center



David Kuo

Vice President, Product Marketing and Business Development

Point 2 Technology

May 31, 2024



Next-gen AEC accelerates data center performance while reducing energy consumption and operational costs.

Over the next few years, AI/ML data centers will need to overcome three simultaneous challenges: increasing performance to keep up with growing demand, containing costs as scale and complexity grow, and continuing to improve energy efficiency. Solving these challenges will require technological advances in nearly every part of the data center, including AI accelerators, switches, pooled memory systems, smart NICs, and the cables that interconnect servers.

Data centers have quickly become an essential part of everyday life and the backbone of the digital revolution. They power the next generation of AI/ML applications that are revolutionizing the economy and enabling modern conveniences like streaming real-time video to your smartphone. From cloud service providers' growing need for large language model (LLM) training and inference processing to consumers' never-ending appetite for social media, video streaming, video conferencing, online gaming, and other digital services, the demand for a full range of data center services continues to grow.

To meet growing demand, hyperscalers and enterprises also need their data centers to be as energy efficient as possible to keep energy costs down and reduce their carbon footprint. According to the U.S. Department of Energy, data centers account for approximately 2% of total electricity usage in the United States and consume 10 to 50 times more energy per unit of floor space than a typical commercial office building. Data centers and data transmission networks already account for 1% of energy-related GHG emissions, a figure that is projected to increase to 8% by 2030.

IT equipment and cooling systems consume up to 90% of a data center's energy, so naturally the focus is on reducing power consumption of the network devices in the data center racks, including servers, switches, accelerators, storage systems, etc. However, there is a related, yet lesser known, device that is becoming increasingly important for reducing data center energy consumption while meeting performance requirements and minimizing operational costs and carbon footprint: Active Electrical Cabling (AEC).

Today's data centers rely primarily on 400 Gigabit (400G) Ethernet network devices, which are not fast enough to handle future AI/ML workloads. As a result, data center operators are increasing network speeds from 400G to 800G, and soon to 1.6T Ethernet.

Passive copper cables (direct attach cables, or DACs) commonly used for 400G cannot support the future requirements of intra-rack interconnect. As speeds increase to support 800G transmission, signal loss in the copper cabling of the DAC becomes significant, reducing the cable lengths that can be supported. Cable lengths can be extended by using heavier copper gauges, but then the overall DAC becomes too thick. Passive copper has reached its practical limits: at usable gauges, it is too thick and bulky to support the cable lengths required for intra-rack interconnect use cases.

AEC offers a viable alternative to address the interconnect needs of next-generation, high-performance data centers. AEC incorporates a silicon system-on-chip (SoC) within the cable assembly to recover high-speed signal losses due to copper wires, improving performance and reliability of data transmission. In addition to providing greater intelligence than passive DACs, AEC extends cable distances up to 7 meters for intra-rack and adjacent rack connections. AEC also allows for the use of finer copper wire gauges to reduce cable bulk and weight. Reducing cable bulk also reduces cable congestion within racks, improving airflow and reducing energy required for cooling. Given these benefits, AEC is critical to the future of reliable, cost-effective data center operations.

While an AEC with an embedded SoC can increase cable reach and reduce cable volume, the additional electronics in the cable consume more power. Employing the right SoC for the lowest-power AEC can yield savings in several areas. First and most obviously, a low-power AEC reduces the power consumption of the cable devices themselves, directly cutting the electricity bill at the facility's per-kilowatt-hour rate.

For example, modern AI/ML GPU rack designs can have 60 active cables connecting GPU accelerators to the switch fabric and memory subsystems. As these rack designs move to 800G speeds, the active cables become a significant consumer of system power. A typical 800G AEC incorporating PAM4 DSP consumes around 20W, while a dedicated mixed-signal interconnect SoC consumes only 11W. This 45% (9W) reduction significantly lowers the power requirements of an AI/ML GPU rack.
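The per-rack impact of these figures can be sketched with a quick calculation. The 60-cable rack, the 20W DSP-based AEC, and the 11W mixed-signal SoC AEC are the example numbers above; this is a back-of-the-envelope illustration, not a vendor power model:

```python
# Per-rack AEC power savings, using the example figures from the text.
CABLES_PER_RACK = 60        # active cables per AI/ML GPU rack
DSP_AEC_W = 20.0            # typical 800G AEC with a PAM4 DSP
MIXED_SIGNAL_AEC_W = 11.0   # 800G AEC with a dedicated mixed-signal SoC

saving_per_cable_w = DSP_AEC_W - MIXED_SIGNAL_AEC_W   # 9 W per cable
reduction_pct = 100 * saving_per_cable_w / DSP_AEC_W  # 45% reduction
rack_saving_w = CABLES_PER_RACK * saving_per_cable_w  # 540 W per rack

print(f"Saving per cable: {saving_per_cable_w:.0f} W ({reduction_pct:.0f}%)")
print(f"Saving per rack:  {rack_saving_w:.0f} W")
```

At 60 cables, the 9W per-cable difference compounds to more than half a kilowatt per rack before any infrastructure effects are counted.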

An added benefit of a lower-power AEC is reduced demand on the supporting infrastructure: the cooling systems, power distribution units, lighting, and other equipment required to keep IT equipment running. Every watt saved in IT equipment such as the AEC also saves energy in that infrastructure. These savings can be estimated using the Power Usage Effectiveness (PUE) metric used by data center operators: PUE is the facility's total energy use divided by the energy use of the IT equipment alone.

According to the Uptime Institute, the average PUE for a data center in 2023 was 1.58. This means that for every 1W used by IT equipment, an additional 0.58W of infrastructure energy is required. And for every watt saved in IT equipment power, the same 0.58W of infrastructure energy is saved. Consider the 800G AEC example above. In addition to the 9W of power saved by using a dedicated SoC, an additional 5.2W (9W x 0.58) of infrastructure energy is saved.
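Folding PUE into the per-cable saving follows directly from the definition. This minimal sketch uses the 1.58 average PUE and the 9W IT saving cited above:

```python
# Facility-level energy saving implied by a given PUE.
PUE = 1.58          # Uptime Institute average data center PUE, 2023
it_saving_w = 9.0   # per-cable IT saving from the AEC example above

# Each IT watt implies (PUE - 1) watts of infrastructure overhead,
# so saving IT power saves that overhead as well.
infra_saving_w = it_saving_w * (PUE - 1)  # ~5.2 W of infrastructure energy
total_saving_w = it_saving_w * PUE        # ~14.2 W total facility saving

print(f"Infrastructure saving: {infra_saving_w:.2f} W")
print(f"Total facility saving: {total_saving_w:.2f} W")
```

In other words, each 9W cable-level saving is worth roughly 14.2W at the facility level once cooling and power distribution are included.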

In AI/ML data centers, active electrical cables connecting GPU accelerators, network switches, and memory systems in the same and adjacent racks are becoming a critical part of the power equation. AECs based on dedicated mixed-signal SoCs are not only thinner, lighter, and longer-reaching than DACs, but also more energy efficient and cost effective than AECs built with typical PAM4 DSPs.

As AEC becomes critical infrastructure, selecting the lowest-power SoCs will enable hyperscalers and enterprises to scale performance while improving energy efficiency and reliability. Lower operational costs will improve the sustainability of AI/ML data centers and provide a greener footprint for years to come.

For more information, visit point2tech.com.

David Kuo is a veteran semiconductor product marketing and business development executive with over 20 years of experience in the network, consumer and mobile markets. He has a proven track record of delivering innovative product marketing and management strategies that drive growth, revenue and customer experience. He is a technical expert with experience in mixed-signal SoCs, connectivity ICs, AI/ML processors/accelerators, software and tools.
