AWS is looking to extend its market position with updates to SageMaker, its machine learning, AI model training and inference platform, adding new observability capabilities, connected coding environments and GPU cluster performance management.
AWS continues to face competition from Google and Microsoft, however, which also offer many features to help accelerate AI training and inference.
SageMaker, which unified data sources and became an integrated hub for accessing machine learning tools in 2024, is adding features that give AWS customers more insight into why model performance slows and more control over the compute allocated to model development.
Other new features include connecting a local integrated development environment (IDE) to SageMaker, allowing AI projects written locally to be deployed on the platform.
Ankur Mehrotra, general manager of SageMaker, told VentureBeat that many of these new updates originated with customers themselves.
“One of the challenges we saw customers face while developing gen AI models is that it's really hard to figure out what's going on in which layer of the stack when something goes wrong or doesn't work as expected,” said Mehrotra.
SageMaker HyperPod's observability features let engineers examine the different layers of the stack, such as the compute and networking layers. If something goes wrong or a model slows down, SageMaker alerts them and publishes metrics to a dashboard.
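The alerting behavior described here can be sketched as a simple threshold check over per-GPU metrics. The metric names and thresholds below are illustrative assumptions for the sake of the sketch, not SageMaker HyperPod's actual API or defaults.

```python
# Illustrative sketch of threshold-based GPU health alerting, loosely
# modeled on the behavior described above. All metric names and
# thresholds here are hypothetical, not HyperPod's actual interface.

GPU_TEMP_LIMIT_C = 85.0   # assumed alert threshold for GPU temperature
STEP_TIME_LIMIT_S = 2.5   # assumed threshold for per-step training latency

def check_metrics(samples):
    """Return a list of alert strings for any out-of-range metric."""
    alerts = []
    for s in samples:
        if s["gpu_temp_c"] > GPU_TEMP_LIMIT_C:
            alerts.append(f"GPU {s['gpu']} overheating: {s['gpu_temp_c']:.1f}C")
        if s["step_time_s"] > STEP_TIME_LIMIT_S:
            alerts.append(f"GPU {s['gpu']} slow step: {s['step_time_s']:.2f}s")
    return alerts

samples = [
    {"gpu": 0, "gpu_temp_c": 78.0, "step_time_s": 1.9},
    {"gpu": 1, "gpu_temp_c": 91.5, "step_time_s": 3.1},  # stressed GPU
]
print(check_metrics(samples))
```

A real observability stack would stream these metrics continuously and surface them on a dashboard; the point here is only that surfacing the out-of-range signal directly is what saves the weeks of manual digging Mehrotra describes below.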
Mehrotra pointed to a real problem his own team faced. While training a new model, the team's training code began stressing the GPUs, causing temperature fluctuations. Without the latest tools, he said, it would have taken developers weeks to identify the source of the problem and then fix it.
Connected IDEs
SageMaker already offers two ways for AI developers to train and run models. The first is through fully managed IDEs such as JupyterLab and Code Editor, which execute model training code seamlessly through SageMaker. AWS also lets customers run code on their own machines, understanding that some engineers prefer to use a local IDE with all the extensions they have installed.
However, Mehrotra pointed out that this posed a key challenge when developers wanted to scale, because models coded locally could only run locally.
AWS has added a new secure remote execution capability that lets customers keep working in their preferred IDE, local or managed, while connecting to SageMaker.
“So this capability now gives them the best of both worlds: they can develop locally in a local IDE if they want, but when it comes to running the actual jobs, they can benefit from the scalability of SageMaker,” he said.
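The pattern being described, writing a function in a local IDE and handing it off to managed infrastructure for execution, can be sketched in plain Python. The executor below is a purely local stand-in for a remote cluster; it is not SageMaker's actual API, and `train_step` is a hypothetical placeholder function.

```python
# Purely local sketch of the "develop locally, execute remotely" pattern
# described above. ThreadPoolExecutor stands in for a managed cluster;
# this is NOT SageMaker's actual remote execution API.
from concurrent.futures import ThreadPoolExecutor

def train_step(learning_rate, epochs):
    """A training function written in a local IDE (placeholder logic)."""
    return {"learning_rate": learning_rate, "epochs": epochs, "status": "completed"}

# In SageMaker's version of this pattern, submitting the function would
# serialize it and run it on managed infrastructure instead of a thread.
with ThreadPoolExecutor(max_workers=1) as cluster:
    future = cluster.submit(train_step, 3e-4, epochs=10)
    result = future.result()

print(result["status"])  # completed
```

The design point is that the code the developer iterates on locally is the same code that runs at scale; only the execution target changes.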
Improved compute flexibility
AWS launched SageMaker HyperPod in December 2023 as a way to help customers manage clusters of servers for training models. Like providers such as CoreWeave, HyperPod lets SageMaker customers direct unused compute power wherever they prefer. HyperPod knows when to schedule GPU usage based on demand patterns, helping organizations balance resources and costs effectively.
However, AWS said many customers wanted the same service for inference. Many inference tasks happen during the day, when people are using models and applications, while training is usually scheduled during off-peak hours.
Mehrotra noted that the new capability also lets developers prioritize which inference tasks HyperPod should focus its compute on.
Laurent Sifre, co-founder and CTO of AI agent company H, said in an AWS blog post that the company used SageMaker HyperPod when building its agent platform.
“This seamless transition from training to inference streamlined workflows, reduced time to production and provided consistent performance in a live environment,” said Sifre.
AWS's competition
Amazon may not offer the flashiest foundation models among its cloud provider rivals, Google and Microsoft. Still, AWS has focused on providing the infrastructure backbone for enterprises building AI models, applications or agents.
In addition to SageMaker, AWS also offers Bedrock, a platform designed specifically for building applications and agents.
SageMaker has been around for years, initially serving as a way to connect disparate machine learning tools to data lakes. As the generative AI boom began, AI engineers started using SageMaker to help train language models. However, Microsoft is pushing hard with its Fabric ecosystem, which 70% of Fortune 500 companies have adopted, making it a strong contender in the data and AI space. Google, through Vertex AI, has quietly made inroads into enterprise AI adoption.
Of course, AWS has the advantage of being the most widely used cloud provider, and updates that make its many AI infrastructure platforms easier to use will always be an advantage.
