Use Amazon SageMaker HyperPod task governance to schedule topology-aware workloads

Today, we are pleased to announce a new capability in Amazon SageMaker HyperPod task governance that helps you optimize training efficiency and network latency for your AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient use of compute resources across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation, enforce task priority policies, and improve resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market rather than coordinating resource allocation and replanning tasks. For more information, see Best practices for Amazon SageMaker HyperPod task governance.

Generative AI workloads typically require extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, so network bandwidth affects both workload runtime and processing latency. The network latency of these communications depends on the physical placement of the instances within the hierarchical infrastructure of the data center. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. For example, instances within the same organizational unit experience faster processing times than instances in different units. In other words, fewer network hops between instances result in lower communication latency.

To optimize the placement of generative AI workloads in your SageMaker HyperPod clusters, you can take the physical and logical placement of resources into account by using EC2 network topology information during job submission. The topology of an EC2 instance is described by a set of nodes, with one node in each layer of the network. For more information about how an Amazon EC2 instance topology works, see How Amazon EC2 instance topology works. Network topology labels offer the following important benefits:

Reduced latency by minimizing network hops and routing traffic to nearby instances
Improved training efficiency by optimizing workload placement across network resources

With topology-aware scheduling in SageMaker HyperPod task governance, you can use network topology labels to schedule jobs with optimized network communication, improving task efficiency and resource utilization for your AI workloads.

This post introduces topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. It provides details on how to use SageMaker HyperPod task governance to optimize job efficiency.

Solution overview

Data scientists interact with SageMaker HyperPod clusters and are responsible for training, fine-tuning, and deploying models on accelerated compute instances. It is important to make sure that data scientists have the required capacity and permissions when they interact with GPU clusters.

To implement topology-aware scheduling, you first check the topology information for all nodes in the cluster, then run a script that shows which instances are on the same network nodes, and finally schedule a topology-aware training task on the cluster. This workflow provides greater visibility into, and control over, the placement of your training instances.

In this post, you view node topology information and submit topology-aware tasks to your cluster. For reference, NetworkNodes describes the network node set of an instance. In each network node set, three layers form a hierarchical view of the topology for each instance. Instances that are closest to each other share the same layer 3 network node. If two instances have no network node in common at the bottom layer (layer 3), check whether they have one in common at layer 2.
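The layer-by-layer comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not part of SageMaker HyperPod: the function name `deepest_shared_layer` and the sample label values are hypothetical, and the label keys are the ones queried with kubectl later in this post.

```python
from typing import Dict, Optional

# Label keys used in this post, ordered from the top layer (1) to the bottom layer (3).
LAYER_LABELS = [
    "topology.k8s.aws/network-node-layer-1",
    "topology.k8s.aws/network-node-layer-2",
    "topology.k8s.aws/network-node-layer-3",
]


def deepest_shared_layer(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> Optional[int]:
    """Return the deepest layer (3, 2, or 1) at which two nodes share a network node.

    Returns None when the two nodes have no network node in common.
    """
    # Check the bottom layer first, then walk up the hierarchy.
    for layer in (3, 2, 1):
        key = LAYER_LABELS[layer - 1]
        value_a = labels_a.get(key)
        if value_a is not None and value_a == labels_b.get(key):
            return layer
    return None


# Hypothetical label values for three nodes.
node_a = {LAYER_LABELS[0]: "nn-1a", LAYER_LABELS[1]: "nn-2a", LAYER_LABELS[2]: "nn-3a"}
node_b = {LAYER_LABELS[0]: "nn-1a", LAYER_LABELS[1]: "nn-2a", LAYER_LABELS[2]: "nn-3a"}
node_c = {LAYER_LABELS[0]: "nn-1a", LAYER_LABELS[1]: "nn-2a", LAYER_LABELS[2]: "nn-3b"}

print(deepest_shared_layer(node_a, node_b))  # same layer 3 network node
print(deepest_shared_layer(node_a, node_c))  # commonality only at layer 2
```

A higher return value means the two instances are closer together in the hierarchy, so traffic between them crosses fewer network hops.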

Prerequisites

To get started with topology-aware scheduling, you must have the following prerequisites:

An EKS cluster
A SageMaker HyperPod cluster with instances enabled for topology information
The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later)
kubectl installed
(Optional) The SageMaker HyperPod CLI installed

Get node topology information

Run the following commands to display the node labels in your cluster. Each command adds a column with the network node of every instance at the given layer.

kubectl get nodes -L topology.k8s.aws/network-node-layer-1
kubectl get nodes -L topology.k8s.aws/network-node-layer-2
kubectl get nodes -L topology.k8s.aws/network-node-layer-3

Instances that share the same layer 3 network node are as close to each other as possible in the EC2 topology hierarchy. The node labels look similar to the following:

topology.k8s.aws/network-node-layer-3: nn-33333example

Run the following script to show which nodes in your cluster are on the same layer 1, 2, and 3 network nodes:

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance 
chmod +x visualize_topology.sh
bash visualize_topology.sh

The script visualizes the node topology of the cluster by printing a flowchart definition that you can render in a flow diagram editor such as Mermaid.js.org. The following diagram shows an example cluster topology for a seven-instance cluster.
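To illustrate the idea behind this kind of visualization, the following Python sketch turns node labels into a Mermaid flowchart. It is not the script from the repository: the function names and sample values are hypothetical, and the input is assumed to be the parsed JSON output of kubectl get nodes -o json.

```python
import re
from typing import Dict, List

LAYER_KEYS = [
    "topology.k8s.aws/network-node-layer-1",
    "topology.k8s.aws/network-node-layer-2",
    "topology.k8s.aws/network-node-layer-3",
]


def _node_id(text: str) -> str:
    """Turn arbitrary text into an identifier that is safe to use in Mermaid."""
    return re.sub(r"[^A-Za-z0-9]", "_", text)


def topology_to_mermaid(node_list: Dict) -> str:
    """Build a Mermaid flowchart from parsed `kubectl get nodes -o json` output."""
    edges: List[str] = []
    for item in node_list.get("items", []):
        name = item["metadata"]["name"]
        labels = item["metadata"].get("labels", {})
        # Path from the top layer down to the instance.
        path = [labels.get(key) for key in LAYER_KEYS]
        if None in path:
            continue  # skip nodes that lack topology labels
        path.append(name)
        for parent, child in zip(path, path[1:]):
            edge = f'    {_node_id(parent)}["{parent}"] --> {_node_id(child)}["{child}"]'
            if edge not in edges:  # shared network nodes produce repeated edges
                edges.append(edge)
    return "\n".join(["flowchart TD"] + edges)


# Hypothetical sample: two instances under the same layer 3 network node.
shared = {LAYER_KEYS[0]: "nn-1", LAYER_KEYS[1]: "nn-2", LAYER_KEYS[2]: "nn-3"}
sample = {
    "items": [
        {"metadata": {"name": "node-a", "labels": dict(shared)}},
        {"metadata": {"name": "node-b", "labels": dict(shared)}},
    ]
}
print(topology_to_mermaid(sample))
```

Against a real cluster, the input could be loaded by running kubectl get nodes -o json and passing its standard output to json.loads. Instances that hang off the same layer 3 box in the rendered chart are the best candidates for tightly coupled jobs.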

Submit the task

SageMaker HyperPod task governance offers two ways to submit tasks using topology awareness. This section discusses these two options and a third alternative to task governance.

Modify the Kubernetes manifest file

First, you can modify an existing Kubernetes manifest file to include one of two annotation options:

kueue.x-k8s.io/podset-required-topology – Use this option if all pods must be scheduled on nodes in the same network node layer for the job to start
kueue.x-k8s.io/podset-preferred-topology – Use this option if you would ideally like all pods to be scheduled on nodes in the same network node layer, but you have flexibility

The following code is an example that uses the kueue.x-k8s.io/podset-required-topology annotation to schedule pods that share the same layer 3 network node:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-a
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: dummy-job
          image: public.ecr.aws/docker/library/alpine:latest
          command: ["sleep", "3600s"]
          resources:
            requests:
              cpu: "1"
      restartPolicy: Never
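If you generate manifests programmatically, switching between the two annotations is a small change to the pod template. The following Python sketch assumes the manifest is already loaded as a dictionary; the function name `with_topology` and the trimmed sample job are hypothetical. It prints JSON, which kubectl apply -f also accepts.

```python
import copy
import json
from typing import Dict

# The two annotation options described in this post.
ANNOTATION_BY_MODE = {
    "required": "kueue.x-k8s.io/podset-required-topology",
    "preferred": "kueue.x-k8s.io/podset-preferred-topology",
}


def with_topology(job: Dict, mode: str, layer_label: str) -> Dict:
    """Return a copy of a Job manifest with one topology annotation on its pod template."""
    if mode not in ANNOTATION_BY_MODE:
        raise ValueError(f"mode must be one of {sorted(ANNOTATION_BY_MODE)}")
    updated = copy.deepcopy(job)  # leave the caller's manifest untouched
    metadata = updated["spec"]["template"].setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    # The post describes choosing one of the two options, so clear both first.
    for key in ANNOTATION_BY_MODE.values():
        annotations.pop(key, None)
    annotations[ANNOTATION_BY_MODE[mode]] = layer_label
    return updated


# Trimmed, hypothetical version of the Job shown above.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "test-tas-job", "namespace": "hyperpod-ns-team-a"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kueue.x-k8s.io/podset-required-topology": "topology.k8s.aws/network-node-layer-3"
                }
            },
            "spec": {"containers": [], "restartPolicy": "Never"},
        }
    },
}

flexible_job = with_topology(job, "preferred", "topology.k8s.aws/network-node-layer-3")
print(json.dumps(flexible_job["spec"]["template"]["metadata"], indent=2))
```

The annotation belongs on the pod template (spec.template.metadata), as in the manifest above, not on the Job's own metadata.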

To see which nodes the pods are running on, use the following command to view the node for each pod:

kubectl get pods -n hyperpod-ns-team-a -o wide
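To confirm that the placement matches the requested topology, you can combine the pod and node information. The following Python sketch is one way to do that, assuming the inputs are the parsed JSON outputs of kubectl get pods -n hyperpod-ns-team-a -o json and kubectl get nodes -o json. The function name `network_nodes_used` and the sample data are hypothetical.

```python
from typing import Dict, Optional, Set

LAYER3 = "topology.k8s.aws/network-node-layer-3"


def network_nodes_used(pod_list: Dict, node_list: Dict, layer_label: str) -> Set[Optional[str]]:
    """Return the set of network nodes (at one layer) that host the scheduled pods.

    A None entry means a pod runs on a node without the requested label.
    """
    node_to_value = {
        item["metadata"]["name"]: item["metadata"].get("labels", {}).get(layer_label)
        for item in node_list.get("items", [])
    }
    used: Set[Optional[str]] = set()
    for pod in pod_list.get("items", []):
        node_name = pod["spec"].get("nodeName")
        if node_name is None:
            continue  # pod has not been scheduled yet
        used.add(node_to_value.get(node_name))
    return used


# Hypothetical sample: two pods on different instances under one layer 3 network node.
nodes = {
    "items": [
        {"metadata": {"name": "node-a", "labels": {LAYER3: "nn-3a"}}},
        {"metadata": {"name": "node-b", "labels": {LAYER3: "nn-3a"}}},
        {"metadata": {"name": "node-c", "labels": {LAYER3: "nn-3b"}}},
    ]
}
pods = {
    "items": [
        {"metadata": {"name": "pod-0"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"name": "pod-1"}, "spec": {"nodeName": "node-b"}},
        {"metadata": {"name": "pod-2"}, "spec": {}},  # still pending
    ]
}

used = network_nodes_used(pods, nodes, LAYER3)
print(used)
print("co-located" if len(used) == 1 and None not in used else "spread out")
```

With the required annotation, the scheduled pods are expected to map to a single network node at the requested layer. With the preferred annotation, more than one value can appear.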

Use the SageMaker HyperPod CLI

The second way to submit a job is through the SageMaker HyperPod CLI. To use topology-aware scheduling, make sure that you have the latest version installed (pending version). To request topology-aware scheduling with the SageMaker HyperPod CLI, include either the --preferred-topology parameter or the --required-topology parameter in your create job command.

The following code is an example command for starting a topology-aware MNIST training job using the SageMaker HyperPod CLI. Replace XXXXXXXXXXXX with your AWS account ID.

hyp create hyp-pytorch-job \
--job-name test-pytorch-job-cli \
--image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--preferred-topology topology.k8s.aws/network-node-layer-3

Clean up

If you deployed new resources while following this post, refer to the Clean Up section of the SageMaker HyperPod EKS workshop to make sure that you don't incur unwanted charges.

Conclusion

During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, which requires frequent data exchange between those instances. In this post, we discussed how SageMaker HyperPod task governance helps you schedule workloads to improve job efficiency by optimizing throughput and latency. We also walked through how to schedule jobs using SageMaker HyperPod network topology information to reduce network communication latency for your AI tasks.

We encourage you to try this solution and share your feedback in the comments section.

About the authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large-scale distributed training and inference on AWS. Before her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Siamak Nariman is a Senior Product Manager at AWS. He focuses on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience in automating processes and deploying a wide range of technologies.

Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes efficiency and productivity for their engineering teams.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large enterprises, focusing primarily on silicon and system architecture for AI infrastructure.