Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod


Amazon SageMaker HyperPod now offers a comprehensive, out-of-the-box dashboard that provides insight into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards optimized for FM development, with deep coverage of hardware health, resource utilization, and task-level performance.

With a one-click installation of the Amazon Elastic Kubernetes Service (Amazon EKS) add-on for SageMaker HyperPod observability, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter (EFA), and Kubernetes operators, among other sources. This unified view lets you track the performance of model development tasks against cluster resources by aggregating resource metrics at the task level. The solution also abstracts the management of collector agents and scrapers across the cluster, automatically scaling collectors across nodes as the cluster grows. The dashboards feature intuitive navigation across metrics and visualizations, helping users diagnose problems and take action faster. They are also fully customizable, supporting additional PromQL metric imports and custom Grafana layouts.

These features save teams valuable time and resources during FM development, accelerating time to market and reducing the cost of generative AI innovation. Instead of spending time configuring, collecting, and analyzing cluster telemetry systems, data scientists and machine learning (ML) engineers can now quickly identify disruptions in training, tuning, and inference, underutilized GPU resources, and hardware performance issues. The pre-built, actionable insights of SageMaker HyperPod observability can be used in several common scenarios when working with FM workloads, such as:

  • Data scientists can monitor the resource utilization of submitted training and inference tasks at the per-GPU level, with insight into GPU memory and FLOPs
  • AI researchers can troubleshoot suboptimal time to first token (TTFT) for inference workloads by correlating deployment metrics with the corresponding resource bottlenecks
  • Cluster administrators can configure customizable alerts that send notifications to multiple destinations, including Amazon Simple Notification Service (Amazon SNS), PagerDuty, and Slack, when hardware falls outside recommended health thresholds
  • Cluster administrators can quickly identify inefficient resource queuing patterns across teams or namespaces and reconfigure allocation and prioritization policies

This post explains how to install and use the unified dashboards of the out-of-the-box observability feature in SageMaker HyperPod. It covers the one-click installation from the Amazon SageMaker AI console, walks through the integrated dashboards and metrics, and touches on advanced topics such as setting up custom alerts. If you have a running SageMaker HyperPod EKS cluster, this post will help you understand how to quickly visualize key health and performance telemetry data and derive actionable insights.

Prerequisites

To get started with SageMaker HyperPod observability, you must first enable AWS IAM Identity Center, which is required to use Amazon Managed Grafana. If IAM Identity Center is not already enabled for your account, see Enabling IAM Identity Center. Additionally, create at least one user in IAM Identity Center.

SageMaker HyperPod observability is available for SageMaker HyperPod clusters that use the Amazon EKS orchestrator. If you don't already have a SageMaker HyperPod cluster with the Amazon EKS orchestrator, refer to the Amazon SageMaker HyperPod quickstart workshop for instructions to create one.
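Both prerequisites can also be checked from a script. The following is a minimal sketch, not an official procedure. It assumes boto3-style clients are passed in: `list_instances` on the `sso-admin` client and `describe_cluster` on the `sagemaker` client are existing API calls, while the helper names and the cluster name in the usage comment are hypothetical. IAM Identity Center instances are listed per Region, so the result depends on the Region the client is configured for.

```python
def identity_center_enabled(sso_admin_client):
    """Return True if at least one IAM Identity Center instance is visible."""
    response = sso_admin_client.list_instances()
    return len(response.get("Instances", [])) > 0


def uses_eks_orchestrator(sagemaker_client, cluster_name):
    """Return True if the HyperPod cluster reports an Amazon EKS orchestrator."""
    response = sagemaker_client.describe_cluster(ClusterName=cluster_name)
    return "Eks" in response.get("Orchestrator", {})


def check_prerequisites(sso_admin_client, sagemaker_client, cluster_name):
    """Summarize both prerequisite checks in one dictionary."""
    return {
        "identity_center_enabled": identity_center_enabled(sso_admin_client),
        "eks_orchestrator": uses_eks_orchestrator(sagemaker_client, cluster_name),
    }


# Example usage (requires AWS credentials; the cluster name is hypothetical):
# import boto3
# print(check_prerequisites(
#     boto3.client("sso-admin"),
#     boto3.client("sagemaker"),
#     "my-hyperpod-cluster",
# ))
```

Passing the clients in as arguments keeps the helpers easy to exercise with stand-in objects before pointing them at a real account.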

Enable observability for SageMaker HyperPod

To enable observability for your SageMaker HyperPod cluster, follow these steps:

  1. On the SageMaker AI console, choose Cluster Management in the navigation pane.
  2. Open the cluster details page from the SageMaker HyperPod clusters list.
  3. On the Dashboard tab, in the HyperPod Observability section, choose Quick install.

SageMaker AI creates a new Prometheus workspace and a new Grafana workspace, and installs the SageMaker HyperPod observability add-on on your EKS cluster. The installation typically completes within a few minutes.
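If you prefer to confirm the add-on installation from a script rather than the console, the check could look like the following minimal sketch. `list_addons` and `describe_addon` are existing Amazon EKS API calls in boto3. The add-on name is an assumption that this post does not state, so verify it against your own cluster. The function name and the cluster name in the usage comment are hypothetical.

```python
# Assumed add-on name; confirm it in the Amazon EKS console for your cluster.
ADDON_NAME = "amazon-sagemaker-hyperpod-observability"


def list_all_addons(eks_client, eks_cluster_name):
    """Collect add-on names across all pages of list_addons."""
    names, token = [], None
    while True:
        kwargs = {"clusterName": eks_cluster_name}
        if token:
            kwargs["nextToken"] = token
        page = eks_client.list_addons(**kwargs)
        names.extend(page.get("addons", []))
        token = page.get("nextToken")
        if not token:
            return names


def observability_addon_status(eks_client, eks_cluster_name, addon_name=ADDON_NAME):
    """Return the add-on status string, or None if the add-on is not installed."""
    if addon_name not in list_all_addons(eks_client, eks_cluster_name):
        return None
    detail = eks_client.describe_addon(
        clusterName=eks_cluster_name, addonName=addon_name
    )
    return detail["addon"]["status"]


# Example usage (requires AWS credentials; the cluster name is hypothetical):
# import boto3
# print(observability_addon_status(boto3.client("eks"), "my-eks-cluster"))
```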

Screenshot before installation

Once the installation is complete, you can view the add-on details and the metrics that are available.

  1. Choose Manage users to assign users to the Grafana workspace.
  2. Choose Open dashboard in Grafana to open the Grafana dashboard.

Screenshot after installation

  3. When prompted, sign in through IAM Identity Center with the user you configured as a prerequisite.

Grafana Sign-in Screen

After you sign in, Grafana displays the SageMaker HyperPod observability dashboard.

SageMaker HyperPod Observability Dashboard

You can choose from multiple dashboards, including Cluster, Task, Inference, Training, and File System.

The Cluster dashboard displays cluster-level metrics such as Total Nodes and Total GPUs, as well as node-level metrics such as GPU utilization and available file system space. By default, the dashboard displays metrics for the entire cluster, but you can apply filters to view metrics only for a specific host name or a specific GPU ID.

Cluster Dashboard

The Task dashboard is useful when you want to look at resource allocation and utilization metrics at the task level (PyTorchJob, ReplicaSet, and so on). For example, you can compare GPU utilization across multiple tasks running in a cluster and identify which tasks need improvement.

You can also select an aggregation level from multiple options (Namespace, Task Name, Task Pod) and apply filters (Namespace, Task Type, Task Name, Pod, GPU ID). These aggregation and filtering features let you view metrics at the right granularity and drill down to the specific problem you are investigating.
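To make the idea of aggregation levels and filters concrete, the following minimal sketch builds a PromQL query string for average GPU utilization. It does not reproduce the queries the dashboards actually run, which this post does not show. `DCGM_FI_DEV_GPU_UTIL` is the GPU utilization metric published by the NVIDIA DCGM exporter. The label names (`namespace`, `pod`, `gpu`), the level names, and the function name are assumptions for illustration.

```python
# Hypothetical mapping from an aggregation level to the labels to group by.
AGGREGATION_LABELS = {
    "namespace": ["namespace"],
    "task_pod": ["namespace", "pod"],
}


def gpu_utilization_query(aggregate_by, filters=None, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Build a PromQL query averaging GPU utilization at the chosen level.

    filters is an optional dict of label name to exact-match value.
    """
    group_labels = ", ".join(AGGREGATION_LABELS[aggregate_by])
    selector = ""
    if filters:
        # Sort the matchers so the generated query is deterministic.
        matchers = ", ".join(f'{name}="{value}"' for name, value in sorted(filters.items()))
        selector = "{" + matchers + "}"
    return f"avg by ({group_labels}) ({metric}{selector})"


if __name__ == "__main__":
    print(gpu_utilization_query("namespace"))
    print(gpu_utilization_query("task_pod", {"namespace": "team-a", "gpu": "0"}))
```

A query built this way could be pasted into a custom Grafana panel to check whether the label names match what your workspace actually exposes.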

Task Dashboard

The Inference dashboard displays metrics specific to inference applications, such as Incoming Requests, Latency, and Time to First Byte (TTFB). It is especially useful when you run inference on a SageMaker HyperPod cluster and need to monitor request traffic and model performance.

Inference Dashboard

Advanced Installation

The Quick install option creates new workspaces for Prometheus and Grafana and selects a default set of metrics. If you want to reuse an existing workspace, select additional metrics, or enable Pod logging to Amazon CloudWatch Logs, use the Custom install option. For more information, see Amazon SageMaker HyperPod.

Set up alerts

Amazon Managed Grafana includes access to an updated alerting system that centralizes alert information in a single searchable view (in the navigation pane, choose Alerts to create an alert). Alerting is useful when you want to receive timely notifications, such as when GPU utilization drops unexpectedly, when disk usage on a shared file system exceeds 90%, or when multiple instances become unavailable at the same time. The HyperPod observability dashboard in Amazon Managed Grafana comes with preconfigured alerts for several of these key metrics. You can create additional alert rules based on metrics or queries and configure multiple notification channels, such as email and Slack messages. For instructions on configuring alerts with Slack messages, see the Setting Up Slack Alerts for Amazon Managed Grafana GitHub page.

The number of alerts is limited to 100 per Grafana workspace. If you need a more scalable solution, check out the alerting options in Amazon Managed Service for Prometheus.
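As an illustration of what a Prometheus-side alert could look like, the following minimal sketch generates a rule group in the standard Prometheus alerting rule format for the 90% disk usage case mentioned above. The expression uses the node exporter metrics `node_filesystem_avail_bytes` and `node_filesystem_size_bytes`. Whether these are the metrics the add-on exposes for your shared file system is an assumption to verify, and the group name, alert name, and function name are hypothetical.

```python
def file_system_usage_rule(threshold_percent=90, hold_for="10m"):
    """Return a Prometheus rule group (YAML text) for high file system usage."""
    # Used percentage = (1 - available / size) * 100, compared to the threshold.
    expr = (
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 "
        f"> {threshold_percent}"
    )
    lines = [
        "groups:",
        "  - name: hyperpod-example-alerts",
        "    rules:",
        "      - alert: FileSystemUsageHigh",
        f"        expr: {expr}",
        f"        for: {hold_for}",
        "        labels:",
        "          severity: warning",
        "        annotations:",
        f"          summary: File system usage is above {threshold_percent} percent",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(file_system_usage_rule())
```

The generated text could then be uploaded to a workspace as a rule groups namespace, for example with the boto3 `amp` client's `create_rule_groups_namespace` call, which takes the workspace ID, a namespace name, and the rule file as bytes.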

High-level overview

The following diagram illustrates the architecture of the new HyperPod observability feature.

Architecture diagram

Clean up

If you want to uninstall the SageMaker HyperPod observability feature (for example, to reconfigure it), clean up the resources in the following order:

  1. Remove the SageMaker HyperPod observability add-on using either the SageMaker AI console or the Amazon EKS console.
  2. Delete the Grafana workspace on the Amazon Managed Grafana console.
  3. Delete the Prometheus workspace on the Amazon Managed Service for Prometheus console.
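The same cleanup order can be scripted. The following is a minimal sketch under stated assumptions rather than an official procedure. `delete_addon` (Amazon EKS), `delete_workspace` (Amazon Managed Grafana), and `delete_workspace` (Amazon Managed Service for Prometheus) are existing boto3 calls. The add-on name is an assumption that this post does not state, and the function name and the identifiers in the usage comment are hypothetical.

```python
# Assumed add-on name; confirm it in the Amazon EKS console for your cluster.
ADDON_NAME = "amazon-sagemaker-hyperpod-observability"


def cleanup_observability(
    eks_client,
    grafana_client,
    amp_client,
    eks_cluster_name,
    grafana_workspace_id,
    prometheus_workspace_id,
    addon_name=ADDON_NAME,
):
    """Request deletion of the add-on, then Grafana, then Prometheus workspaces.

    Each call only starts the deletion; this sketch does not wait for completion.
    """
    steps = []
    eks_client.delete_addon(clusterName=eks_cluster_name, addonName=addon_name)
    steps.append("addon")
    grafana_client.delete_workspace(workspaceId=grafana_workspace_id)
    steps.append("grafana")
    amp_client.delete_workspace(workspaceId=prometheus_workspace_id)
    steps.append("prometheus")
    return steps


# Example usage (requires AWS credentials; all identifiers are hypothetical):
# import boto3
# cleanup_observability(
#     boto3.client("eks"),
#     boto3.client("grafana"),
#     boto3.client("amp"),
#     eks_cluster_name="my-eks-cluster",
#     grafana_workspace_id="g-0123456789",
#     prometheus_workspace_id="ws-01234567-89ab-cdef-0123-456789abcdef",
# )
```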

Conclusion

This post provided an overview of, and usage instructions for, SageMaker HyperPod observability, the newly released observability feature of SageMaker HyperPod. This feature reduces the heavy lifting involved in setting up cluster observability and provides unified visibility into cluster health and performance metrics.

For more information about SageMaker HyperPod observability, see Amazon SageMaker HyperPod. Leave your feedback about this post in the comments section.


About the authors

Tomonori Shimomura is a Principal Solutions Architect on the Amazon SageMaker AI team, where he provides in-depth technical consultation to SageMaker AI customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and he now applies his in-depth skills to cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Matt Nightingale is a Solutions Architect Manager on the AWS WSO Frameworks team, focusing on generative AI training and inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelor's degree from the University of Virginia and is based in Boston, Massachusetts.

Eric Sale is a Senior GenAI Specialist at AWS, focusing on foundation model training and inference. He partners with top foundation model builders and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions with strategic customers. Before joining AWS, Eric led product teams building enterprise AI/ML solutions. He holds a master's degree in Business Analytics from UCLA Anderson.

Piyush Kadam is a Senior Product Manager on the Amazon SageMaker AI team, specializing in LLMOps products that help both startups and enterprise customers rapidly experiment with and efficiently govern foundation models. With a master's degree in Computer Science from the University of California, Irvine, specializing in distributed systems and artificial intelligence, Piyush brings deep technical expertise to his role in shaping the future of cloud AI products.

Aman Shambag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners deploy ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Bhaskar Pratap is a Senior Software Engineer on the Amazon SageMaker AI team. He is passionate about designing and building elegant systems that bring machine learning to people's fingertips. He also has extensive experience building scalable cloud storage services.

Gopi Sekar is an engineering leader on the Amazon SageMaker AI team. He is dedicated to developing products that simplify machine learning adoption for customers and address real-world customer challenges.


