NCSA builds Delta supercomputer with AI enhancements

The University of Illinois’ National Center for Supercomputing Applications, which brought the Delta system online in April 2022, is now receiving $10 million from the National Science Foundation to augment the machine with an AI partition, called DeltaAI, based on Nvidia’s “Hopper” H100 GPU accelerator.

There are thousands of fairly modestly sized academic HPC centers around the world, which together probably account for something like two-thirds of the world’s HPC capacity. (We have never seen data on this, so that is a wild guess.) The Top500 supercomputer rankings include a number of machines that do not run HPC as their day (or night) job, and the list is not long enough to capture the details of all the machines at these academic research centers.

With that in mind, we’ve been thinking lately about what $10 million can buy you in terms of capacity.

The original Delta machine, which also cost $10 million, used a mix of Hewlett Packard Enterprise’s Apollo 2500 CPU nodes and Apollo 6500 CPU-GPU nodes, all linked by the “Rosetta” Slingshot Ethernet interconnect developed by Cray and now sold by HPE. It had 124 Apollo 2500 nodes with a pair of 64-core AMD “Milan” Epyc 7763 CPUs, 100 Apollo 6500 nodes with one of those Milan CPUs and four 40 GB Nvidia “Ampere” A100 accelerators, and another 100 Apollo 6500 nodes with four Nvidia Ampere A40 accelerators, which are suited to rendering, graphics, and AI inference. All of these machines have 256 GB of memory, which is a small amount for AI work, but the 2 GB per core ratio is reasonable, if a bit light. (3 GB per core is better, and 4 GB per core is better still.) The Delta system also had a testbed partition based on the Apollo 6500 enclosure, with eight 40 GB Nvidia A100 SXM4 GPUs cross-coupled by NVSwitch, a pair of Milan CPUs, a hookup to the fabric, and access to 2 TB of main memory. This partition was clearly aimed at AI workloads.
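That memory-per-core ratio is easy to sanity check with the figures quoted above; this quick sketch just divides node memory by core count (the GB-per-core thresholds are the article's rules of thumb, not a vendor spec):

```python
# Memory-per-core ratio on Delta's CPU-only Apollo 2500 nodes,
# using the figures quoted in the article.
node_memory_gb = 256
cores_per_node = 2 * 64           # two 64-core AMD Epyc 7763 CPUs per node

gb_per_core = node_memory_gb / cores_per_node
print(gb_per_core)                # 2.0 GB per core
```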

Adding up all the compute across the CPUs and GPUs in the Delta system yields a total of 6 petaflops of FP64 performance on the vector engines where HPC workloads run, and 131.1 petaflops of FP16 performance across the CPU vector engines and GPU matrix math engines, without sparsity enabled for AI workloads.

NCSA has not had a machine on the Top500 ranking of supercomputers since Cray built its hybrid CPU-GPU “Blue Waters” system in 2012. The Blue Waters machine peaked at 13.1 petaflops at FP64 double precision, cost a whopping $188 million, and had 49,000 Opteron processors and 3,000 Nvidia GPUs. Delta delivers less than half of Blue Waters’ FP64 performance, but roughly ten times its FP16 performance. (The older machine had no native FP16 support, so that comparison assumes packing its vectors with quarter-width data.) And obviously Delta consumes a lot less power and takes up a lot less space.

Little is known about the $10 million DeltaAI upgrade, but it appears that a number of Apollo 6500 nodes will be equipped with the new “Hopper” H100 SXM5 GPU accelerators and connected to the Slingshot network. The award was announced here by NSF and there by NCSA, but neither posting has many details. So we pulled out the Excel spreadsheet again.

The DeltaAI award abstract says: “DeltaAI’s computing elements offer over 300 next-generation Nvidia graphics processors capable of over 600 petaflops of half-precision floating point computing, distributed across a high-performance network interconnect with advanced features for application communication and access to an innovative flash memory-based storage subsystem.”

Doing the math: 38 servers with eight H100 SXM5 GPU accelerators each works out to 304 GPUs, and with sparsity turned on for the H100 tensor cores, that peaks at 601.6 petaflops at FP16. This is the only sensible configuration that hits the data points in the DeltaAI description. The budget works out, too, assuming NCSA got about 30 percent off on the GPUs and spent only about 10 percent of the $10 million on networking. Incidentally, the CPUs and GPUs in those 38 DeltaAI nodes add another 10.6 petaflops of FP64 vector performance, so the combined Delta plus DeltaAI machine peaks at 16.6 petaflops at FP64 precision and 732.8 petaflops at the FP16 precision used for AI work.
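The arithmetic behind that inference can be sketched as follows. The per-GPU peak of roughly 1,979 teraflops for FP16 tensor math with sparsity is Nvidia’s published H100 SXM5 figure, and the 38-node, eight-GPU-per-node layout is our guess rather than anything NCSA has confirmed:

```python
# Back-of-the-envelope check of the inferred DeltaAI configuration.
# Assumption: Nvidia's published H100 SXM5 peak for FP16 tensor math
# with sparsity enabled, ~1,979 teraflops per GPU.
H100_FP16_SPARSE_TFLOPS = 1979
GPUS_PER_NODE = 8                 # typical HGX-style H100 SXM5 node

nodes = 38
gpus = nodes * GPUS_PER_NODE
fp16_petaflops = gpus * H100_FP16_SPARSE_TFLOPS / 1000

print(gpus, round(fp16_petaflops, 1))   # 304 GPUs, 601.6 petaflops
```

Any other node count either overshoots the “over 300 GPUs” figure badly or falls short of the 600 petaflops claim, which is why 38 nodes of eight GPUs looks like the only fit.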


