Machine learning infrastructure, AI requirements, and examples

Applications of AI


IT exists as a discipline because companies seek to leverage data for competitive advantage. Today, organizations are awash in data, but the technologies that process and analyze it often struggle to keep up with large volumes of real-time data. The challenge is not only the sheer volume of data but also the wide variety of data types.

For example, the explosion of unstructured data is proving particularly challenging for information systems traditionally built around structured databases. This has driven the development of new algorithms based on machine learning (ML) and deep learning, and in turn a need for organizations to purchase or build systems and infrastructure for their ML, deep learning and AI workloads.

Interest in ML and deep learning has been growing for several years, but new technologies such as ChatGPT and Microsoft Copilot are driving interest in enterprise AI applications. IDC predicts that by 2025, 40% of Global 2000 companies' IT budgets will be spent on AI-related initiatives, as AI powers innovation.

Enterprises are undoubtedly building many of their AI and ML-based applications in the cloud, using high-level ML and deep learning services such as Amazon Comprehend and Azure OpenAI Service. However, the large amounts of data required to train and feed AI algorithms, the prohibitive costs of moving and storing that data in the cloud, and the need for real-time (or near-real-time) results lead many enterprises to deploy their AI systems on private, dedicated infrastructure.

Many such systems reside in corporate data centers, but AI systems also exist at the edge, close to the systems that generate the data the organization needs to analyze.

To prepare for an AI-enhanced future, IT must grapple with many architecture and deployment choices. Chief among them is the design and specification of AI-accelerated hardware clusters. One promising option is hyperconverged infrastructure (HCI), thanks to its density, scalability and flexibility. Although many elements of AI-optimized hardware are highly specialized, the overall design closely resembles more general hyperconverged hardware. In fact, there are HCI reference architectures created for use with ML and AI.

AI requirements and core hardware elements

Machine learning and deep learning algorithms feed on data. Data selection, collection and preprocessing (filtering, classification, feature extraction and so on) are the main factors contributing to model accuracy and predictive value. Therefore, data aggregation (combining data from multiple sources) and storage are key elements of AI applications that influence hardware design.
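
To make the preprocessing stage concrete, here is a minimal, hypothetical sketch in Python using scikit-learn; the toy dataset, pipeline stages and parameter choices are illustrative assumptions, not a reference to any particular product:

    # Hypothetical preprocessing sketch: normalize aggregated data and
    # extract the most informative features before feeding a model.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy stand-in for data aggregated from multiple sources.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))          # 1,000 samples, 20 raw features
    y = (X[:, 0] + X[:, 3] > 0).astype(int)  # synthetic labels

    prep = Pipeline([
        ("scale", StandardScaler()),              # filtering/normalization
        ("select", SelectKBest(f_classif, k=8)),  # feature extraction
    ])
    X_ready = prep.fit_transform(X, y)
    print(X_ready.shape)  # (1000, 8): data ready to feed a model

Pipelines like this are typically the storage- and I/O-heavy stage of an AI workload, which is why data aggregation weighs so heavily on hardware design.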

The resources required for data storage and AI computation typically don't scale at the same rate. Therefore, most system designs separate the two and make local storage within the AI compute nodes large and fast enough to feed the algorithms.

Machine learning and deep learning algorithms require a huge number of floating-point matrix multiply-accumulate operations. These matrix calculations can be performed in parallel, which makes ML and deep learning similar to graphics computations such as pixel shading and ray tracing that GPUs greatly accelerate.
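
To make that concrete, a single dense-layer forward pass is essentially one large matrix multiply-accumulate, the operation GPUs parallelize so well. A minimal NumPy sketch (the layer shapes are illustrative assumptions):

    # One dense layer: y = max(xW + b, 0). The matmul alone is roughly
    # 64 x 1,024 x 512 x 2 ~= 67 million floating-point operations.
    import numpy as np

    batch, n_in, n_out = 64, 1024, 512
    x = np.random.rand(batch, n_in).astype(np.float32)  # input activations
    W = np.random.rand(n_in, n_out).astype(np.float32)  # learned weights
    b = np.zeros(n_out, dtype=np.float32)               # bias

    y = x @ W + b          # matrix multiply-accumulate
    y = np.maximum(y, 0)   # ReLU nonlinearity
    print(y.shape)         # (64, 512)

Every element of the output can be computed independently, which is exactly the kind of parallelism GPUs are built for.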

However, unlike graphics and imaging workloads, ML and deep learning calculations often do not require double-precision (64-bit) or even single-precision (32-bit) arithmetic. Reducing the number of floating-point bits used in calculations further improves performance. Over the past decade, early deep learning research ran on off-the-shelf GPU accelerator cards; companies such as Nvidia now offer separate lines of data center GPUs tailored to scientific and AI workloads.
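
A rough way to see why reduced precision is tolerable is to run the same matrix product in FP32 and FP16 and compare the results. This NumPy sketch is illustrative only; production frameworks manage mixed precision automatically:

    # Compare a matrix product computed in FP16 against an FP32 reference.
    import numpy as np

    a = np.random.rand(256, 256).astype(np.float32)
    b = np.random.rand(256, 256).astype(np.float32)

    full = a @ b  # FP32 reference
    half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    rel_err = np.abs(full - half).max() / np.abs(full).max()
    print(f"max relative error: {rel_err:.4%}")  # small for ML purposes

The half-precision result differs only slightly from the reference, yet it halves memory traffic and lets the hardware pack twice as many operands per cycle.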

Most recently, Nvidia announced a new line of GPUs specifically designed to improve the performance of generative AI on desktops and laptops. The company also introduced a series of purpose-built AI supercomputers.

System requirements and components

The system components most critical to AI performance are:

  • CPU. Responsible for running the VM or container subsystem, dispatching code to the GPUs and handling I/O. Current products use the popular 5th-generation Xeon Scalable Platinum or Gold processors, but systems with 4th-generation (Genoa) AMD Epyc CPUs are also becoming more popular. Current-generation CPUs add features that significantly accelerate ML and deep learning inference, making them suitable for production AI workloads that use models previously trained on GPUs.
  • GPU. Handles ML and deep learning training, as well as inference: applying a trained model to automatically classify new data. Nvidia offers dedicated high-speed servers through its EGX product line, and the company's Grace CPU is designed with AI in mind, optimizing communication between the CPU and GPU.
  • Memory. Because AI operations run out of GPU memory, system memory is not usually a bottleneck, and servers typically have at least 512 GB of DRAM. GPUs pair their compute units, which Nvidia calls streaming multiprocessors (SMs), with built-in high-bandwidth memory (HBM); see the probe sketch after this list. According to Nvidia, “The Nvidia A100 GPU includes up to 2039 GB/s of bandwidth with 108 SMs, 40 MB of L2 cache, and 80 GB of HBM2 memory.”
  • Communication network. AI systems are often clustered to scale performance, so systems tend to have multiple 10 GbE or 40 GbE ports.
  • Storage IOPS. Moving data between the storage and compute subsystems can also become a performance bottleneck for AI workloads, so most systems use local NVMe drives instead of SATA SSDs.
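
As a quick way to verify the GPU resources described in the list above, the following sketch queries device properties through PyTorch. It assumes a host with PyTorch installed and at least one Nvidia GPU visible; the figures reported depend on the installed hardware:

    # Probe GPU name, SM count and on-board memory via PyTorch.
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU:        {props.name}")
        print(f"SM count:   {props.multi_processor_count}")
        print(f"GPU memory: {props.total_memory / 2**30:.1f} GiB")
    else:
        print("No CUDA-capable GPU visible to PyTorch")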
[Figure: internal logical and physical design of a typical AI server]

GPUs are the workhorse of most AI workloads, and Nvidia has significantly improved deep learning performance through features such as Tensor Cores, multi-instance GPU (the ability to partition one GPU to run multiple processes in parallel) and the NVLink GPU interconnect.
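
For example, recent PyTorch releases can route ordinary FP32 matrix math through Tensor Cores at TF32 precision on Ampere-class or newer GPUs. A minimal sketch, assuming PyTorch built with CUDA support and a compatible GPU:

    # Opt in to Tensor Core TF32 math for FP32 matmuls and convolutions.
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    if torch.cuda.is_available():
        a = torch.randn(4096, 4096, device="cuda")
        b = torch.randn(4096, 4096, device="cuda")
        c = a @ b  # now eligible to run on Tensor Cores at TF32 precision

The trade-off mirrors the precision discussion above: TF32 keeps FP32's numeric range while trimming mantissa bits, which is usually acceptable for training and inference.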

Enterprises can use any HCI or high-density system for AI by choosing the appropriate configuration and system components. However, many vendors offer products targeting ML and deep learning workloads. Below is a representative overview of key ML and deep learning system parameters from major vendors.

[Figure: key AI and ML system parameters by vendor]

Editor's note: This article about machine learning infrastructure and AI requirements was originally written by Kurt Marko in 2020 and was updated and expanded by Brien Posey in 2024. It has been updated with timely information about ML, AI and system requirements, and with new vendor information based on the parameters of leading ML and deep learning systems from major companies.

Kurt Marko, a longtime contributor to TechTarget, passed away in January 2022. He was an experienced IT analyst and consultant who brought broad and deep knowledge of enterprise IT architecture to his role. You can explore all the articles he wrote for TechTarget on his contributor page.

Brien Posey is a 15-time Microsoft MVP with 20 years of IT experience. He has served as a chief network engineer for the US Department of Defense and as a network administrator for some of the nation's largest insurance companies.


