Sad: the number of H100s in the lab is zero! PhDs in the same lab have to fight over GPUs – PassionateGeekz

Machine Learning


[New Intelligence Introduction] A machine learning PhD at a top-5 US university posted that there are zero H100s in his lab! The post sparked a global discussion in the ML community. Clearly, compared with GPU-rich institutions like Princeton and Harvard, which can casually field 300-400 H100s, GPU-poor labs are far more common, and PhDs in the same lab often have to compete for GPUs.

A machine learning PhD at one of the top five universities in the US, and his lab doesn't even have a single H100?

Recently, this netizen posted on Reddit, and the thread immediately sparked a community discussion:

Everyone discovered that GPU powerhouses like Princeton and Harvard have 300-400 H100s each, while most ML PhDs can't get access to even one…

The GPU “rich-poor” gap between different schools and institutions has actually reached such a huge level?

Most labs are worse off than Stanford

Two months ago, AI godmother Fei-Fei Li said in an interview that Stanford’s natural language processing (NLP) group only had 64 A100 GPUs.

Faced with the lack of AI resources in academia, Fei-Fei Li is deeply distressed.

The netizen who posted this also said that computing resources were a major bottleneck when he was pursuing his doctorate (at a top five school in the United States).

With more high-performance GPUs, computation time would be significantly shorter and research would progress much faster.

So, how many H100s does his lab have? The answer is zero.

He asked netizens: How many GPUs do you have in your lab? Can you get additional computing power sponsorship from Amazon or Nvidia?

Young researchers then shared the GPU situation at their own schools and companies, and the answers surprised everyone.

1 2080Ti + 1 3090, that’s all

A netizen who seems to be from Asia said that although his research direction is computer vision (CV) rather than machine learning, he could only use a 2080 Ti graphics card at the beginning of 2019.

In 2021, he got the opportunity to use a server equipped with a V100 and an RTX 8000 graphics card.

In 2022, he gained access to a 3090 graphics card.

In 2023, he was able to use servers in another lab, which together held 12 2080 Ti, 5 3090, and 8 A100 graphics cards. That same year, he also received a compute grant giving him three months on A100s.

In 2024, the school purchased a server equipped with 8 H100 graphics cards and allowed a one-month trial.

In addition, between 2021 and 2023 he could rent GPUs by the hour from a local academic provider.

With the exception of the 2080 Ti and 3090 graphics cards, most of these resources are shared.

Question: does the “a” here literally mean just one card?

The netizen replied: yes, it really is that hard…

Some said they were in an even worse situation: no graphics cards and no cloud credits. Since their university could not provide any help, they had to ask the company where they interned to get them some compute.

A PhD student who graduated at the end of 2022 also revealed that his lab's dedicated servers held nearly 30 GPUs, with four cards per server. (Performance varied because the servers were purchased at different times.)

Even so, people in the same lab still had to compete for GPUs from time to time.

On this, some netizens concluded that having zero GPUs is actually very common.

The reasoning is simple: you don't need a Ferrari to learn how to drive. Besides, the foundations of machine learning are linear algebra, statistics, and programming; hardware-level optimization comes later.

The severe shortage of GPUs is also common in laboratories of Chinese universities.

One blogger even posted that a certain university’s courses actually required students to bring their own computing equipment.

A group of five students needs at least two 3090/4090s or one 40GB A100 to complete the LLM training assignments required by the course.

So the question is, why can’t universities purchase more GPUs themselves?

Zhihu user “Net Addiction Uncle” said that it is not cost-effective for universities to purchase GPUs directly. As LLM parameter counts grow, training requires multiple machines and multiple GPUs, plus a high-speed network to interconnect the cards.

On top of the learning curve there are maintenance costs, which add up to a huge investment for a university. The more common approach is therefore to rent servers.

Sun Heng, a doctoral student in the Department of Computer Science at Tsinghua University, raised a related point: the cards can be bought, but where would you put them?

Of course, some people are moving forward with heavy burdens, while others are enjoying a peaceful life.

For example, the schools below are much more “rich” in comparison.

“H100s? We have a few hundred of them.”

Some netizens revealed that the Princeton Language and Intelligence Institute (PLI) and the Harvard Kempner Institute have the largest computing clusters, equipped with 300 and 400 H100 GPUs respectively.

This information was also confirmed by a Princeton researcher:

At Princeton, there are three types of clusters available.

– Small group clusters vary, but for 10 people, 32 GPUs is a reasonable allocation

– Department clusters have more resources, but it also depends on the specific department

– University cluster Della has (128×2) + (48×4) A100s and (96×8) H100s

In short, Princeton and Harvard can be said to be big users of graphics cards.

In addition, some netizens revealed that UT Austin owns 600 H100s.

A PhD student at the University of Montreal said his lab has about 500 GPUs, mainly 40GB and 80GB A100s.

A netizen from RWTH Aachen University in Germany said that the school provided a computing cluster with 52 GPU nodes, each equipped with 4 H100 GPUs.

These resources are of course shared by all departments and can also be used by some other institutions.

Every student is allocated a certain amount of cluster time each month, and anyone who needs more can apply for dedicated compute projects of various sizes.

“I love this system, and being able to use it is an opportunity to change the course of research for me.”

The original poster expressed great envy at such abundant computing power.

Another European netizen said that his lab has about 16 dedicated A100 GPUs and can access more through several additional clusters.

The exact size of these clusters is difficult to estimate because they have so many users, but each cluster provides approximately 120,000 GPU hours of computing time per year.

However, workloads that need more than 80GB of GPU memory are a bottleneck; at the moment only about 5 H100s are available.

Similarly, the laboratory where this netizen works is also quite wealthy:

“We have eight H100s and eight L40Ss in our lab, which are provided free of charge to five doctoral students and three postdocs.”

Finally, here are some excerpts from the humble-bragging (“Versailles”) netizens.

For example, one netizen who works for a cloud computing provider found the post very interesting, because he had no idea H100s were so scarce.

Or, if you can’t get a graphics card from your company, just buy one yourself. 😂

Why is the H100 so important?

Recently, Nvidia’s market value exceeded US$3.3 trillion, ranking first in the world.

The most dazzling star behind this is undoubtedly its H100 GPU.

Unlike ordinary chips, the 80 billion transistors in the H100 are arranged in cores that are tuned to process data at high speeds rather than generate graphics.

Founded in 1993, Nvidia bet that the ability to work in parallel would one day make its chips valuable beyond gaming, and they were right.

The H100 is four times faster than the previous generation A100 at training LLMs and 30 times faster at responding to user prompts. This performance advantage is critical for companies eager to train LLMs to perform new tasks.

That is why the wave of generative AI around the world is being converted into actual revenue for Nvidia. The demand for H100 is so great that many customers have to wait six months to receive it.

Igor, IaaS Technical Product Manager at Nebius AI, explores the differences between the most popular chips: H100, L4, L40, A100, V100, and identifies the workloads where each GPU model performs best.

Before talking about the differences between the chips, it is important to highlight some relevant properties of Transformer neural networks and numerical precision.

The role of numerical precision

Nvidia’s H100, L4, and L40 would not have been as successful without hardware support for FP8 precision, which is especially important for Transformer models.

But what makes support for FP8 so important? Let’s take a closer look.

FP stands for “floating point”, which refers to the precision of numbers that the model stores in RAM and uses in its operations.

Most importantly, these numbers determine the quality of the model output.

Here are some key number formats –

FP64, or double-precision floating point format, is a format where each number occupies 64 bits of memory.

While this format is rarely used in machine learning, it has its place in scientific computing.

FP32 and FP16: FP32 has long been the de facto standard for all deep learning computations.

However, data scientists later discovered that converting model parameters to FP16 format reduced memory consumption and accelerated computation, seemingly without compromising quality.

As a result, FP16 has become the new gold standard.
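As a rough illustration of what converting a workload to FP16 looks like in practice, here is a minimal mixed-precision training sketch in PyTorch; the tiny linear model, batch size, and learning rate are placeholder assumptions, not anything from the original discussion:

```python
import torch

# Minimal FP16 mixed-precision training loop with PyTorch AMP.
# The model, data, and hyperparameters below are toy placeholders.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards against FP16 underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):  # run eligible ops in FP16
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscale gradients, then take the optimizer step
    scaler.update()                # adapt the scale factor for the next iteration
```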

TF32 is another crucial format.

Before FP32 values enter computations on the Tensor Cores, they can be automatically converted to TF32 format at the driver level, without any code changes.

TF32 is slightly less precise but computes much faster: it keeps FP32's dynamic range while using a shorter mantissa, so the tensor cores can transparently run what the model treats as FP32 math in TF32.
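For what it's worth, in PyTorch this conversion is controlled by two global flags; the small sketch below (an illustrative assumption about a typical setup, not something from the article) shows how TF32 is enabled without touching the model code:

```python
import torch

# Allow FP32 matmuls and cuDNN convolutions to run as TF32 on the tensor cores.
# Inputs and outputs remain ordinary FP32 tensors; no model changes are needed.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed as TF32 on Ampere/Hopper GPUs, accumulated in FP32
```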

INT8: This is an integer format and does not involve floating point numbers.

After training, model parameters can be converted to other types that take up less memory, such as INT8. This technique is called post-training quantization and can reduce memory requirements and speed up inference. It works wonders for many model architectures, but Transformer is an exception.

Transformers cannot be converted after training to reduce hardware requirements for inference. Innovations such as quantization-aware training do provide a workaround during training, but retraining existing models can be costly and challenging.
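As a concrete (if simplified) picture of post-training quantization, here is a sketch using PyTorch's dynamic quantization on a toy non-Transformer network; the layer sizes are made-up assumptions, and this particular API targets CPU inference:

```python
import torch

# A toy FP32 model standing in for an already-trained network.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

# Post-training dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(model_int8(x).shape)  # inference now runs with INT8 weights
```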

FP8: This format solves the above problems, especially for Transformer models.

You can take a pre-trained Transformer model, convert its parameters to FP8 format, and then switch from A100 to H100.

We can even do it without conversion and still gain performance, just because the H100 is faster.

With FP8, only about a quarter of the graphics cards are needed to infer the same model with the same performance and load.

It is also nice to use FP8 for mixed precision training – the process will complete faster, require less RAM, and no conversion will be required later in the inference phase since the model’s parameters may already be FP8 parameters.
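To make this concrete, here is a hypothetical sketch of FP8 execution using NVIDIA's Transformer Engine library on an H100; the layer size, input dtype, and recipe settings are illustrative assumptions rather than a tuned configuration:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A Transformer Engine layer whose matmuls can run in FP8 on Hopper tensor cores.
layer = te.Linear(1024, 1024, bias=True).cuda()

# Delayed-scaling recipe: HYBRID uses E4M3 for forward tensors, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the forward matmul executes in FP8
```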

Key GPU specifications and performance benchmarks for ML, HPC, and graphics

Let’s discuss the evolution of GPU specifications and their prominent features.

Pay special attention to the first two rows of the table above: the amount of RAM and its bandwidth.

An ML model has to fit into the memory of the GPUs available to the runtime environment; otherwise, training requires multiple GPUs. For inference, it is often possible to fit everything on a single chip.
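A quick back-of-the-envelope check makes this concrete. The sketch below estimates memory from parameter count alone, ignoring activations, KV cache, and framework overhead, so treat the numbers as rough assumptions:

```python
def fits_on_one_gpu(num_params: float, bytes_per_param: int, gpu_mem_gb: float) -> bool:
    """Rough estimate: weights only, no activations or framework overhead."""
    required_gb = num_params * bytes_per_param / 1e9
    print(f"~{required_gb:.0f} GB of weights vs {gpu_mem_gb:.0f} GB of GPU memory")
    return required_gb <= gpu_mem_gb

fits_on_one_gpu(70e9, 2, 80)  # a 70B-parameter model in FP16: ~140 GB, needs multiple GPUs
fits_on_one_gpu(70e9, 1, 80)  # the same model in FP8/INT8: ~70 GB, fits on one 80 GB card
```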

Note the difference between the SXM and PCIe interfaces. In practice, the choice between them comes down to the servers we already have, or the ones our cloud provider offers.

If the setup consists of a standard server with PCIe slots and you don't want to spend money on a dedicated machine where the GPUs connect directly to the motherboard (SXM), the H100 PCIe is the best choice.

Sure, its specifications are somewhat lower than the SXM version's, but it is fully compatible with standard compact servers.

However, if we want to build a top-of-the-line cluster from scratch and can afford it, the H100 SXM5 is clearly a better choice.

The training and inference performance of the various GPUs can be compared in the following figure:

The chart is from Tim Dettmers' well-known article “Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning”.

The H100 SXM indicator is used as the 100% benchmark, and all other indicators are normalized relative to it.

The chart shows that 8-bit inference on the H100 GPU is 37% faster than 16-bit inference on the same GPU model. This is due to the hardware support for FP8 precision calculations.

By “hardware support” I mean the entire low-level pipeline that moves data from RAM to the tensor core for computation, where various caches come into play.

On the A100, by contrast, 8-bit inference is not faster, because FP8 is not supported at the hardware level: the pipeline from RAM through the caches simply processes these numbers at the same speed as FP16.

A more detailed chart is as follows:

You’ve no doubt noticed that some RTX graphics cards also do pretty well in AI tasks. Usually, they have less memory than datacenter-specific cards and don’t support clustering, but they’re obviously a lot cheaper.

So, if you are planning to use local infrastructure for internal experiments, you can also consider this type of RTX graphics card.

However, the GeForce driver EULA directly prohibits the use of such cards in data centers, so no cloud provider has the right to use them in their services.

Now, let’s compare GPUs in graphics and video processing related tasks. Here are the key specifications relevant to such use cases:

Again, we need to look at RAM size and bandwidth. Also, note the unique performance metrics for the RT cores, as well as the decoder and encoder counts, which are the dedicated chips responsible for compressing and decompressing the video feed.

The Graphics Mode line indicates whether the GPU can be switched to graphics-oriented mode (WDDM).

The H100 lacks this feature entirely; the A100 has it, but in a limited form, so it is not always practical.

In stark contrast, the L4 and L40 are equipped with this mode, so they are positioned as versatile cards suitable for a variety of tasks, including graphics and training.

In some materials, Nvidia even markets them primarily as graphics cards. However, they are also well suited to machine learning training and inference; at the very least, there is no hard technical barrier.

For users, these numbers mean that neither the H100 variants nor the A100 are suitable for graphics-centric tasks.

The V100 could potentially serve as a GPU for virtual workstations handling graphics workloads.

The L40 is the undisputed champion for the most resource-intensive 4K gaming experience, while the L4 supports 1080p gaming. Both cards can also render video at their respective resolutions.

Summary

This gives us the following table, which summarizes the characteristics of the different graphics cards according to their intended purposes.

There are two main use case categories in the table: tasks that focus purely on computation (“computation”) and tasks that include visualization (“graphics”).

We already know that the A100 and H100 are not at all suitable for graphics, while the L4 and L40 are tailor-made for it.

At first glance, you might think that the inference capabilities of the A100 or L40 are equally good. However, there are some nuances to consider.

The “HPC” column shows whether multiple hosts can be combined into a single cluster.

In inference, clustering is rarely needed – but it depends on the size of the model. The key is to make sure the model fits in the memory of all GPUs on the host.

If the model exceeds this boundary, or the host cannot accommodate enough GPUs for its combined RAM, then a GPU cluster is required.

The scalability of L40 and L4 is limited by the capabilities of a single host, while H100 and A100 do not have this limitation.

Which GPU should we choose for ML workloads? The recommendations are as follows:

L4: An affordable general-purpose GPU for a wide range of use cases. It is an entry-level model and a gateway to the world of GPU-accelerated computing.

L40: Optimized for generative AI inference and visual computing workloads.

A100: Provides excellent price/performance for single-node training of traditional CNN networks.

H100: The best choice for big NLP workloads, LLMs, and Transformers. It is also well suited to distributed training and inference.

Graphics scenarios can be divided into three groups: streaming, virtual desktops, and render farms. If the model's output never has to be shown on a display, it is not a graphics scenario: it is inference, and such tasks are better described as AI video.

These cards can handle encoded video feeds; the A100, for example, is equipped with hardware video decoders for such tasks. The decoders convert the feed into raw frames, a neural network enhances them, and the result is passed back.

During this entire process, no visual content appears on a display, so while the H100 and A100 can adeptly train and run models that work with video or images, they do not actually render anything to a screen.

That’s another story.

References:

  • https://www.reddit.com/r/MachineLearning/comments/1dlsogx/d_academic_ml_labs_how_many_gpus/

  • https://medium.com/nebius/nvidia-h100-and-other-gpus-which-are-relevant-for-your-ml-workload-15af0b26b919

This article comes from WeChat public account:New Intelligence (ID: AI_era)
