Announcing a new cluster creation experience for Amazon SageMaker HyperPod

Today, Amazon SageMaker HyperPod is announcing a new one-click, validated cluster creation experience that accelerates setup and prevents common misconfigurations. You can launch distributed training and inference clusters complete with Slurm or Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, Amazon Virtual Private Cloud (Amazon VPC) networking, and the other prerequisite resources, all configured with default values.

SageMaker HyperPod lets you efficiently scale tasks such as generative AI training, fine-tuning, or inference across clusters containing hundreds or thousands of AI accelerators. The system continuously checks for hardware issues and resolves them automatically, so that workloads recover without manual intervention.

Previously, customers had to configure other AWS resources, such as a VPC, an Amazon Simple Storage Service (Amazon S3) bucket, and AWS Identity and Access Management (IAM) roles, as prerequisites before creating a SageMaker HyperPod cluster. This multi-step process introduced manual touchpoints that could lead to misconfiguration.

With the new cluster creation experience, you can create a SageMaker HyperPod cluster together with its required prerequisite AWS resources in a single step, with prescriptive defaults applied automatically. In this post, we explore the new cluster creation experience for Amazon SageMaker HyperPod.

Solution overview

SageMaker HyperPod offers two new deployment options in the AWS Management Console for creating clusters orchestrated by Slurm or Amazon EKS: quick setup and custom setup. Both options are available in the Amazon SageMaker AI console.

When you create a cluster, SageMaker HyperPod creates an AWS CloudFormation stack to deploy the cluster and its supporting resources in the configuration you specified.

AWS CloudFormation lets you use infrastructure as code (IaC) to declaratively represent the desired state of your cloud architecture. This allows even complex configurations with multiple managed services, such as SageMaker HyperPod clusters and their prerequisite resources, to be deployed consistently across multiple environments.

The following sections describe the quick setup and custom setup options in detail, along with screenshots of key configurations.

Quick setup

With quick setup, SageMaker HyperPod applies sensible defaults for instance groups, networking, orchestration, lifecycle configuration, permissions, and storage. You can also see which configurations remain editable after the cluster is created and which would require recreating the corresponding AWS resources. If you want to change one of the latter, use custom setup instead. Quick setup enables automatic instance recovery for unhealthy or unresponsive instances.

For networking, quick setup creates a new VPC with subnets spread across the Availability Zones in your AWS Region. Within each Availability Zone, it creates a public /24 subnet for internet access through a NAT gateway, a private /24 subnet to facilitate communication with the EKS control plane, and a private /16 subnet to host the accelerated instance group capacity. It also configures new security groups with the rules required to allow Elastic Fabric Adapter (EFA) and Amazon FSx for Lustre network traffic.

Using a /16 private subnet as the default for SageMaker HyperPod instances supports over 65,000 private IP addresses. This is important for accommodating large clusters of accelerated instances, which consume multiple IP addresses per host.
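The subnet sizing above can be checked with a little arithmetic. The following is a minimal sketch using Python's standard `ipaddress` module. The CIDR blocks and the `ips_per_host` value are hypothetical and chosen only for illustration; the one AWS-specific fact it relies on is that every VPC subnet reserves five addresses (the first four and the last).

```python
import ipaddress

# AWS reserves five addresses in every VPC subnet: the first four and the last.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses in a subnet that instances can use."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def max_hosts(cidr: str, ips_per_host: int) -> int:
    """Return how many hosts fit if each one consumes ips_per_host addresses."""
    return usable_ips(cidr) // ips_per_host


# Hypothetical CIDR blocks; the console picks the actual ranges.
print(usable_ips("10.1.0.0/16"))  # 65531
print(usable_ips("10.0.1.0/24"))  # 251

# Hypothetical per-host consumption of 32 addresses.
print(max_hosts("10.1.0.0/16", ips_per_host=32))  # 2047
```

The same calculation shows why a /24 subnet, with 251 usable addresses, would be too small for a large accelerated instance group.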

For Amazon EKS orchestration, quick setup uses the latest supported Kubernetes version and enables the available operators: the EFA, Neuron, and NVIDIA device plugins, the Health Monitoring Agent (HMA), the Kubeflow Training Operator, and the SageMaker HyperPod Inference Operator.

Quick setup also creates a new S3 bucket that stores the default lifecycle scripts used for setup and configuration, a new IAM role with the permissions required by the SageMaker HyperPod cluster, and a new FSx for Lustre file system for high-performance data storage and retrieval.

Custom setup

Custom setup gives you the flexibility to choose how your SageMaker HyperPod cluster is configured, at a more granular level, across the same dimensions.

For Amazon EKS orchestration, custom setup recommends automatic node recovery, which reboots or replaces a failed node when a problem is detected. You can optionally disable this feature if you need more control over the recovery process, for example to intervene manually for troubleshooting or testing. When continuous provisioning mode is enabled, SageMaker HyperPod can start multiple operations concurrently, such as scaling up, scaling down, and AMI updates within a single instance group, and can create the cluster even if not all requested instances are immediately available. This option provides more flexibility and faster operations by applying multiple changes at the same time, reducing overall deployment and update times.

Custom setup provides the option to create a new VPC with a custom CIDR range and to target specific Availability Zones for subnet creation, based on where your accelerated compute capacity is located. You can also select an existing VPC and existing security groups for the SageMaker HyperPod cluster deployment. This is useful when you use an existing EKS cluster for orchestration or attach an existing FSx for Lustre file system.

For Amazon EKS orchestration, you can create a new EKS cluster and select the supported Kubernetes version, along with two or more private subnets that Amazon EKS uses to provision two elastic network interfaces (ENIs). If you are using an existing EKS cluster, you can select it by name in custom setup.

Custom setup also gives you granular control over which optional operators are installed in your EKS cluster through the default Helm chart, based on the specific requirements of your workload. Some of these components are required and must be installed for the SageMaker HyperPod cluster to work properly.

With custom setup, you can address advanced configuration needs, such as installing custom machine learning (ML) frameworks or specific versions of dependencies, deploying your own software or tools, or applying specific network optimizations, by using custom lifecycle scripts from an existing S3 bucket. You can also assign an existing IAM role to the SageMaker HyperPod cluster to meet specific authorization requirements. For storage, you have the flexibility to attach an existing FSx for Lustre file system, provision a new file system with a range of throughput and storage capacity options, or skip file system provisioning if you don't need it yet.

Add an instance group

Both the quick setup and custom setup options let you add new instance groups to your SageMaker HyperPod cluster in the SageMaker AI console.

You can choose between standard instance groups, which provide a general-purpose compute environment without additional security restrictions, and restricted instance groups (RIGs), which provide a specialized, isolated environment within SageMaker HyperPod for customizing Amazon Nova models.

For capacity, you can use on-demand capacity for one-time workloads and tests, or flexible training plans, which give you predictable access to accelerated compute resources for large planned training jobs within your timeline and budget. Flexible training plans let you reserve capacity on the latest P6-B200 instance types and P6e-GB200 UltraServers with NVIDIA Blackwell Tensor Core GPUs. If you need to provision an instance group for long-term use, you can contact AWS to reserve capacity for longer periods of time.

With Amazon EKS orchestration, you can enable deep stress and connectivity health checks for each instance group you add. These deep health checks run in addition to the basic, orchestrator-agnostic health checks that also apply to Slurm-orchestrated clusters. Stress checks test hardware components under load to identify potential issues with GPUs, memory, and other hardware. Connectivity checks test the network connections between nodes to verify proper communication for distributed training.

In the advanced configuration, you can select the number of threads that run on each CPU core of your Amazon Elastic Compute Cloud (Amazon EC2) instances. Selecting one thread per core disables multithreading, so each core runs a single thread. This lets applications that benefit from dedicated core resources, such as certain high-performance computing workloads, deliver more predictable performance. Selecting two threads per core enables multithreading, so each physical core runs two threads simultaneously, potentially increasing the throughput of multithreaded applications at the cost of some per-thread performance.
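For readers who script cluster creation rather than use the console, the instance group options described above can be sketched as a plain Python dictionary. This is an illustrative sketch, not the console's implementation. The field names (`ThreadsPerCore`, `OnStartDeepHealthChecks`, `LifeCycleConfig`, and so on) reflect our understanding of the SageMaker `CreateCluster` API and should be verified against the API reference. The helper `build_instance_group`, the role ARN, the S3 URI, and the script name are all hypothetical.

```python
def build_instance_group(
    name: str,
    instance_type: str,
    count: int,
    execution_role_arn: str,
    lifecycle_s3_uri: str,
    threads_per_core: int = 1,
    deep_health_checks: bool = True,
) -> dict:
    """Build one instance group definition (hypothetical helper).

    Field names are assumed to match the SageMaker CreateCluster API.
    """
    if threads_per_core not in (1, 2):
        raise ValueError("threads_per_core must be 1 or 2")

    group = {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "ExecutionRole": execution_role_arn,
        "LifeCycleConfig": {
            "SourceS3Uri": lifecycle_s3_uri,
            "OnCreate": "on_create.sh",  # hypothetical lifecycle script name
        },
        # 1 disables multithreading; 2 runs two threads per physical core.
        "ThreadsPerCore": threads_per_core,
    }
    if deep_health_checks:
        # Stress and connectivity checks, as described for EKS orchestration.
        group["OnStartDeepHealthChecks"] = [
            "InstanceStress",
            "InstanceConnectivity",
        ]
    return group


# Placeholder values for illustration only.
group = build_instance_group(
    name="accelerated-group-1",
    instance_type="ml.p5.48xlarge",
    count=4,
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleHyperPodRole",
    lifecycle_s3_uri="s3://example-bucket/lifecycle-scripts/",
)
print(group["ThreadsPerCore"], group["OnStartDeepHealthChecks"])
```

A list of such dictionaries would typically be passed as the instance groups argument when creating a cluster programmatically; the console collects the same choices through its forms.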

Download CloudFormation template parameters

For further customization and reuse, you can download a copy of the CloudFormation template parameters from the SageMaker AI console, pre-populated with the configuration you chose. With these, you can use continuous delivery tools such as AWS CodePipeline to automatically build and test changes before promoting them to your production stack. CodePipeline lets you specify parameter overrides or enter custom values in a template configuration file when creating or updating stacks across development, test, and production environments.

Conclusion

SageMaker HyperPod now offers an enhanced one-click deployment experience for setting up purpose-built, resilient infrastructure for training and deploying large-scale ML models. The quick setup option lets you take advantage of prescriptive defaults. The custom setup option provides the flexibility to tailor your distributed training environment to specialized requirements. Using IaC through AWS CloudFormation gives you a declarative representation of your SageMaker HyperPod cluster environment that can be version controlled, further customized, and integrated into a continuous delivery pipeline.

Get started today by visiting the SageMaker AI console and creating a new SageMaker HyperPod cluster.

About the authors

Giuseppe Angelo Porceli is a Principal Machine Learning Specialist Solutions Architect at Amazon Web Services. With several years of software engineering and ML background, he works with customers of all sizes to understand their business needs and design AI and ML solutions that make the most of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in a variety of domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Cindy Zao is a Software Development Engineer based in Seattle. She focuses on building large-scale ML infrastructure with Amazon SageMaker HyperPod, helping customers set up secure and reliable clusters for foundation model training. Outside of work, she enjoys traveling and spending time with her cat.

Nathan Arnold is a Senior AI/ML Specialist Solutions Architect at AWS based in Austin, Texas. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. When he's not working with customers, he enjoys hiking, trail running, and playing with his dogs.