Multi-account support for Amazon SageMaker HyperPod task governance



GPUs are valuable resources: they are in shorter supply and far more costly than traditional CPUs, and they are adaptable to many different use cases. Organizations use GPUs to build or adopt generative AI, run simulations, serve inference (for both internal and external use), build agentic workloads, and run data science experiments. Workloads range from short-lived, single-GPU experiments run by scientists to long, multi-node continuous pretraining runs. Many organizations need to share a centralized, high-performance GPU computing infrastructure across different teams, business units, or accounts within the organization. This shared infrastructure maximizes the use of expensive accelerated computing resources such as GPUs, rather than relying on siloed infrastructure that may be under-utilized. Organizations also commonly use multiple AWS accounts for their users: large companies often separate business units, teams, or environments (production, staging, development) into different AWS accounts. This provides finer-grained control and isolation between the different parts of the organization, and it makes it easier to track and assign cloud costs to the right team or business unit for better financial monitoring.

The specific reasons and setup vary with the size, structure, and requirements of the company, but in general, multi-account strategies increase the flexibility, security, and manageability of large-scale cloud deployments. This post explains how enterprises with multiple accounts can access a shared Amazon SageMaker HyperPod cluster and run heterogeneous workloads, using SageMaker HyperPod task governance to enable this capability.

Solution overview

SageMaker HyperPod task governance streamlines resource allocation and lets you set up policies to maximize compute utilization within the cluster. With task governance, you can create distinct teams, each with its own namespace, compute quota, and borrowing limits. In a multi-account setting, you can then use role-based access control to restrict which team's compute quota each account can access.

This post describes the setup required for multi-account access to a SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS), and how to use SageMaker HyperPod task governance to allocate accelerated compute to multiple teams in different accounts.

The following diagram illustrates the solution architecture.

Multi-account architecture: an EKS cluster with EKS Pod Identity for pods accessing S3 buckets through access points

In this architecture, one organization splits its resources across several accounts. Account A hosts the SageMaker HyperPod cluster. Account B is where the data scientists operate. Account C is where data is prepared and stored for training. The following sections walk through how to set up multi-account access so that data scientists in Account B can train models on the SageMaker HyperPod cluster on Amazon EKS in Account A, using preprocessed data stored in Account C.

Cross-account access for data scientists

When you create a compute allocation with SageMaker HyperPod task governance, the EKS cluster creates a unique Kubernetes namespace for each team. For this walkthrough, we create an AWS Identity and Access Management (IAM) role for each team, called a cluster access role, that is scoped to access only that team's task governance-generated namespace in the shared EKS cluster. This role-based access control is how we ensure that data science members of Team A cannot submit tasks on behalf of Team B.

To access the EKS cluster in Account A as a user in Account B, you need to assume the cluster access role in Account A. The cluster access role should have only the minimum permissions a data scientist needs to access the EKS cluster. For an example of an IAM role for data scientists using SageMaker HyperPod, see IAM users for data scientists.

Next, you need to be able to assume the cluster access role from a role in Account B, and the cluster access role in Account A must have a trust policy that allows the data scientist role in Account B to assume it. The following is an example identity-based policy for the data scientist role in Account B, granting it permission to assume the cluster access role in Account A:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole"
    }
  ]
}

The following is an example trust policy for the cluster access role that allows the data scientist role to assume it:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::XXXXXXXXXXBBB:role/DataScientistRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
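With both policies in place, a data scientist in Account B can assume the cluster access role and point kubectl at the shared cluster. The following is a minimal sketch; the cluster name, Region, and session name are illustrative, and the role ARN matches the example above:

```shell
# From Account B: assume the cluster access role in Account A
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole \
  --role-session-name team-a-session \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# Export the temporary credentials for subsequent AWS CLI calls
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | cut -f1)
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | cut -f2)
export AWS_SESSION_TOKEN=$(echo "$CREDS" | cut -f3)

# Update the local kubeconfig for the shared EKS cluster in Account A
aws eks update-kubeconfig --name my-hyperpod-eks-cluster --region us-east-1
```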

The final step is to create an access entry in the EKS cluster for the team's cluster access role. The access entry also needs an associated access policy, such as AmazonEKSEditPolicy, scoped to the team's namespace. This ensures that Team A users in Account B cannot launch tasks outside their assigned namespace. You can also optionally configure custom role-based access control. For more information, see Setting up role-based access control for Kubernetes.
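As a sketch, the access entry and the namespace-scoped policy association can be created with the AWS CLI; the cluster name is illustrative, and the role ARN and namespace follow the examples in this post:

```shell
# Register the cluster access role as an access entry on the EKS cluster
aws eks create-access-entry \
  --cluster-name my-hyperpod-eks-cluster \
  --principal-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole

# Associate the edit policy, scoped to the team's namespace only
aws eks associate-access-policy \
  --cluster-name my-hyperpod-eks-cluster \
  --principal-arn arn:aws:iam::XXXXXXXXXXAAA:role/ClusterAccessRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSEditPolicy \
  --access-scope type=namespace,namespaces=hyperpod-ns-team-a
```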

For users in Account B, repeat the same setup for each team, creating a unique cluster access role per team that maps to the team's associated namespace. To summarize, we use two different IAM roles:

  • Data scientist role – The role in Account B that data scientists use; it must have permission to assume the cluster access role in Account A.
  • Cluster access role – The role in Account A that grants access to the EKS cluster. For an example, see IAM roles for SageMaker HyperPod.
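After assuming their team's cluster access role, a data scientist can confirm the namespace scoping from their kubeconfig context. A quick check (the hyperpod-ns-team-b namespace is a hypothetical second team):

```shell
# Should be allowed: the role is scoped to this team's namespace
kubectl auth can-i create pods -n hyperpod-ns-team-a

# Should be denied: another team's namespace is out of scope
kubectl auth can-i create pods -n hyperpod-ns-team-b
```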

Cross-account access to prepared data

This section shows how to configure EKS Pod Identity and S3 access points so that pods running training tasks in the EKS cluster in Account A can access data stored in Account C. EKS Pod Identity lets you map an IAM role to a service account in a namespace. If a pod uses a service account that has such an association, Amazon EKS sets environment variables in the pod's containers that the AWS SDKs use to obtain credentials for the role.

S3 access points are named network endpoints that simplify data access for shared datasets in an S3 bucket. They provide fine-grained access control for specific users or applications that need a shared dataset, without requiring those users or applications to have full access to the entire bucket. Permissions are granted through an access point policy, so each S3 access point can carry a policy specific to a use case or application. Because the HyperPod cluster in this post can be used by multiple teams, each team can have its own S3 access point and access point policy.

Before following these steps, make sure you have the EKS Pod Identity add-on installed in your EKS cluster.

  1. In Account A, create an IAM role with S3 permissions (s3:ListBucket and s3:GetObject on the access point resources) and a trust relationship with EKS Pod Identity. This will be your data access role. The following is an example trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
  2. In Account C, create an S3 access point by following the steps in Creating access points in the Amazon S3 User Guide.
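As a sketch, the access point can also be created with the AWS CLI; the account ID, access point name, and bucket name below are placeholders:

```shell
# In Account C: create an access point on the bucket holding the training data
aws s3control create-access-point \
  --account-id <account-c-id> \
  --name team-a-data-ap \
  --bucket <bucket-name>
```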
  3. Configure the S3 access point to grant access to the data access role you created in step 1. The following is an example access point policy for the access point in Account C:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-a-id>:role/<data-access-role-name>"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>",
        "arn:aws:s3:<region>:<account-c-id>:accesspoint/<access-point-name>/object/*"
      ]
    }
  ]
}
  4. Update the S3 bucket policy in Account C so that access is delegated to the access points owned by the account. The following is an example bucket policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Condition": {
        "StringEquals": {
          "s3:DataAccessPointAccount": "<account-c-id>"
        }
      }
    }
  ]
}
  5. In Account A, use the AWS CLI to create a Pod Identity association for your EKS cluster:
aws eks create-pod-identity-association \
  --cluster-name <cluster-name> \
  --role-arn arn:aws:iam::<account-a-id>:role/<data-access-role-name> \
  --namespace hyperpod-ns-team-a \
  --service-account my-service-account
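You can verify that the association was created (the cluster name is a placeholder):

```shell
# List Pod Identity associations on the cluster
aws eks list-pod-identity-associations \
  --cluster-name <cluster-name>
```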

  6. Reference the service account name in the pod specification so that pods can access the S3 bucket across accounts.
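For example, a minimal test pod that references the service account could look like the following sketch; the image and sleep duration are illustrative, and the pod name matches the test command used later in this post:

```shell
kubectl apply -n hyperpod-ns-team-a -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: aws-test
spec:
  # Service account with the Pod Identity association from step 5
  serviceAccountName: my-service-account
  containers:
    - name: awscli
      image: amazon/aws-cli:latest
      command: ["sleep", "3600"]
EOF
```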

You can test cross-account data access by spinning up a test pod and running an Amazon S3 command from within it:

kubectl exec -it aws-test -n hyperpod-ns-team-a -- aws s3 ls s3://<access-point-alias>

This example demonstrates creating a single data access role for a single team. For multiple teams, use a namespace-specific service account for each team, each with its own data access role, to help prevent teams from accessing each other's resources. You can also configure cross-account Amazon S3 access for an Amazon FSx for Lustre file system in Account A. Note that FSx for Lustre and the linked Amazon S3 bucket must be in the same AWS Region, and the FSx for Lustre file system must be in the same Availability Zone as the SageMaker HyperPod cluster.

Conclusion

In this post, we provided guidance on setting up cross-account access for data scientists to a centralized SageMaker HyperPod cluster orchestrated by Amazon EKS. We also explained how to provide Amazon S3 data access from one account to an EKS cluster in another account. SageMaker HyperPod task governance lets you restrict access and compute allocations to specific teams. This architecture can be applied at scale by organizations that want to share large compute clusters across accounts. To get started with SageMaker HyperPod task governance, see the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the SageMaker HyperPod task governance documentation.


About the authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large-scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging generative AI startups develop models from idea to production.

Anoop Saha is a Senior GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large enterprises, primarily focusing on silicon and system architecture for AI infrastructure.

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on compute optimization and cost governance. Prior to this, he led embedded analytics and developer experience at Amazon QuickSight, and before that was a product manager for AWS Marketplace and Amazon retail. Kareem started his career as a developer, then worked on call center technology, localization, and advertising at Expedia, and as a management consultant at McKinsey.

Rajesh Ramchander is a Principal ML Engineer in AWS Professional Services. He helps customers at various stages of their AI/ML and generative AI journeys.


