As AI models become increasingly sophisticated and specialized, the ability to quickly train and customize models can mean the difference between industry leadership and falling behind. That's why organizations are scaling and advancing AI model development with Amazon SageMaker AI's fully managed infrastructure, tools, and workflows. Since launching in 2017, SageMaker AI has transformed how organizations approach AI model development by reducing complexity while maximizing performance. We have continued to innovate relentlessly, adding more than 420 new features since launch to give customers the best tools to build, train, and deploy AI models quickly and efficiently. Today, we are pleased to announce new innovations that build on the rich capabilities of SageMaker AI to accelerate how you build and train AI models.
Amazon SageMaker HyperPod: Purpose-built infrastructure for AI model development
AWS launched Amazon SageMaker HyperPod in 2023 to reduce the complexity of building AI models while maximizing performance and efficiency. SageMaker HyperPod lets you quickly scale generative AI model development across thousands of AI accelerators, reducing foundation model (FM) training and fine-tuning costs by up to 40%. Many of today's top models are trained on SageMaker HyperPod, including models from Hugging Face, Luma AI, Perplexity AI, Salesforce, Thomson Reuters, Writer, and Amazon. By training Amazon Nova FMs on SageMaker HyperPod, Amazon saved months of work and increased compute resource utilization to over 90%.
To further streamline workflows and speed up model development and deployment, the new SageMaker HyperPod command line interface (CLI) and software development kit (SDK) simplify infrastructure management, unify job submission across training and inference, and provide a single, consistent interface that supports both recipe-based and custom workflows with integrated monitoring and control. Today, we are adding two capabilities to SageMaker HyperPod that help you reduce training costs and accelerate AI model development.
Reduce the time to troubleshoot performance issues from days to minutes with SageMaker HyperPod observability
To bring new AI innovations to market as quickly as possible, organizations need visibility across their AI model development tasks and compute resources so they can optimize training efficiency and detect and resolve interruptions or performance bottlenecks quickly. For example, to investigate whether a training or fine-tuning job failure was the result of a hardware issue, data scientists and machine learning (ML) engineers want to quickly filter monitoring data to the specific GPUs that ran the job and correlate their health with the failure, rather than manually browsing hardware resources across the cluster.
The new observability capability in SageMaker HyperPod transforms how you can monitor and optimize your model development workloads. Monitoring data is automatically published to an Amazon Managed Service for Prometheus workspace and visualized in a unified, preconfigured dashboard in Amazon Managed Grafana, so you can see generative AI task performance metrics, resource utilization, and cluster health in a single view. Teams can now quickly spot bottlenecks, prevent costly delays, and optimize compute resources. You can also define automated alerts, specify use case-specific task metrics and events, and publish them to the unified dashboard in just a few clicks.
By reducing troubleshooting time from days to minutes, this capability can help you accelerate your path to production and maximize the return on your AI investments.
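To make the troubleshooting pattern above concrete, here is a minimal sketch of the kind of per-GPU utilization check such a dashboard alert expresses. The data shape and threshold are illustrative assumptions; in SageMaker HyperPod the metrics live in your Amazon Managed Service for Prometheus workspace and would be queried with PromQL from the Grafana dashboard rather than computed by hand.

```python
# Illustrative sketch: flag GPUs whose average utilization is suspiciously
# low, the kind of signal used to correlate hardware issues with job failures.

def underutilized_gpus(samples, threshold=10.0):
    """Return GPU ids whose average utilization (%) falls below threshold.

    samples: dict mapping gpu_id -> list of utilization percentage samples.
    """
    flagged = []
    for gpu_id, values in samples.items():
        if values and sum(values) / len(values) < threshold:
            flagged.append(gpu_id)
    return sorted(flagged)

metrics = {
    "gpu-0": [95.0, 92.5, 97.0],  # healthy, busy with the training job
    "gpu-1": [2.0, 0.0, 1.5],     # likely idle or failed mid-job
}
print(underutilized_gpus(metrics))  # ['gpu-1']
```

In practice you would attach an automated alert to the equivalent PromQL expression so the dashboard notifies you instead of requiring a manual sweep of the cluster.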
DatologyAI builds tools that automatically select the best data for training deep learning models.
“We are excited to use Amazon SageMaker HyperPod's one-click observability solution. Our senior staff members needed insight into how we are utilizing our GPU resources, and the prebuilt Grafana dashboards give us exactly what we need. As people who appreciate the power of the Prometheus Query Language, we like that we can write our own queries and analyze custom metrics without worrying about infrastructure problems.”

– Josh Wills, member of the technical staff at DatologyAI
Articul8 helps companies build sophisticated enterprise generative AI applications.
“SageMaker HyperPod observability lets us deploy our metric collection and visualization systems in a single click, saving days of manual setup and enhancing our cluster observability workflows and insights. Our data scientists can quickly monitor task performance metrics, such as latency, and identify hardware issues without manual configuration. This advances our mission of delivering accessible, reliable, AI-powered innovation to our customers.”

– Renato Nascimento, Head of Technology at Articul8
Deploy Amazon SageMaker JumpStart models on SageMaker HyperPod for fast, scalable inference
After developing generative AI models on SageMaker HyperPod, many customers import these models into Amazon Bedrock, a fully managed service for building and scaling generative AI applications. However, some customers want to use their SageMaker HyperPod compute resources to accelerate evaluation and move models into production faster.
Now you can deploy open-weights models from Amazon SageMaker JumpStart on SageMaker HyperPod within minutes, with no manual infrastructure setup required. Data scientists can run inference on SageMaker JumpStart models with a single click, simplifying and accelerating model evaluation. This straightforward, one-time provisioning reduces manual infrastructure setup and provides a reliable, scalable inference environment with minimal effort. Large model downloads are reduced from hours to minutes, accelerating model deployment and speeding time to market.
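As a rough sketch of what the one-click flow captures, a HyperPod inference deployment boils down to a model, a target cluster, and the accelerator capacity to serve it. The function and field names below are purely illustrative assumptions for exposition, not an actual AWS API; the real flow sets these for you from the SageMaker JumpStart console.

```python
# Hypothetical sketch only: the information a JumpStart-to-HyperPod
# deployment needs to capture. Field names are illustrative, not AWS API.

def build_deployment_request(model_id, cluster_name, instance_type, replicas=1):
    """Assemble an example deployment description for an open-weights model."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return {
        "ModelId": model_id,          # open-weights model chosen in JumpStart
        "ClusterName": cluster_name,  # target SageMaker HyperPod cluster
        "InstanceType": instance_type,  # accelerator instance type
        "Replicas": replicas,         # scale out for higher throughput
    }

request = build_deployment_request(
    "example-open-weights-model", "research-cluster", "ml.p5.48xlarge", replicas=2)
print(request["Replicas"])  # 2
```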

H.AI exists to push the boundaries of superintelligence with agentic AI.
“With Amazon SageMaker HyperPod, we used the same high-performance compute to build and deploy the foundation models behind our agentic AI platform. This seamless transition from training to inference streamlined our workflow, reduced time to production, and delivered consistent performance in live environments.”

– Laurent Sifre, Co-founder & CTO at H.AI
Seamlessly access the powerful compute resources of SageMaker AI from local development environments
Today, many customers use the broad set of fully managed integrated development environments (IDEs) available in SageMaker AI for model development, including JupyterLab, the Code Editor based on Code-OSS, and RStudio. While these IDEs provide a secure and efficient setup, some developers prefer to use local IDEs on their personal computers for their debugging capabilities and extensive customization options. Until now, however, customers using local IDEs such as Visual Studio Code could not easily run model development tasks on SageMaker AI.
With new remote connections to SageMaker AI, developers and data scientists can quickly and seamlessly connect to SageMaker AI from their local VS Code, maintaining access to the custom tools and familiar workflows that help them work most efficiently. Developers can build and train AI models using their local IDE while SageMaker AI manages remote execution, so they can work in their preferred environment while benefiting from the performance, scalability, and security of SageMaker AI. You can now choose your preferred IDE, whether a fully managed cloud IDE or VS Code, and accelerate AI model development using the powerful infrastructure and seamless scalability of SageMaker AI.

CyberArk is a leader in identity security, offering a comprehensive approach centered on privileged controls to protect against advanced cyber threats.
“With remote connections to SageMaker AI, our data scientists have the flexibility to choose the IDE that makes them most productive. Our teams can leverage their customized local setups while accessing the infrastructure and security controls of SageMaker AI.”

– Nir Feldman, Senior Vice President of Engineering at CyberArk
Build generative AI models and applications faster with fully managed MLflow 3.0
As customers across industries accelerate their generative AI development, they need capabilities to track experiments, observe behavior, and evaluate the performance of models and AI applications. Customers such as Cisco, Sonrai, and Xometry already use managed MLflow on SageMaker AI to efficiently manage ML model experiments at scale. With fully managed MLflow 3.0 on SageMaker AI, you can accelerate generative AI development by tracking experiments, monitoring training progress, and gaining deeper insights into the behavior of models and AI applications, all from a single tool.
Conclusion
In this post, we shared some of the new innovations in SageMaker AI that accelerate how you build and train AI models.
For more information about these new features, SageMaker AI, and how companies are using this service, refer to the following resources:
About the author
Ankur Mehrotra joined Amazon in 2008 and is currently the General Manager of Amazon SageMaker AI. Before Amazon SageMaker AI, he worked on building Amazon.com's advertising systems and automated pricing technology.
