The Weather Company Powers MLOps with Amazon SageMaker, AWS CloudFormation, and Amazon CloudWatch

This blog post was co-written with Qaish Kanchwala of The Weather Company.

As industries begin to adopt processes that rely on machine learning (ML) technology, it is important to establish machine learning operations (MLOps) that can scale to support the growth and use of this technology. MLOps practitioners have many options for establishing an MLOps platform. One of them is a cloud-based, integrated platform that scales with data science teams. AWS provides a full stack of services for establishing an MLOps platform in the cloud that you can customize to your needs while still getting all the benefits of running ML in the cloud.

In this post, we discuss how The Weather Company (TWCo) enhanced their MLOps platform using services such as Amazon SageMaker, AWS CloudFormation, and Amazon CloudWatch. TWCo data scientists and ML engineers leveraged automation, detailed experiment tracking, and an integrated training and deployment pipeline to effectively scale MLOps. TWCo reduced infrastructure management time by 90% and model deployment time by 20%.

The Need for MLOps at TWCo

TWCo aims to help consumers and businesses make more informed and confident weather-based decisions. For decades, the organization has used ML in its weather forecasting process to turn billions of weather data points into actionable predictions and insights, but it is always striving to innovate and adopt cutting-edge technology in other ways too. TWCo's data science team aimed to create a predictive, privacy-conscious ML model that would show how weather conditions affect certain health symptoms and create user segments to improve the user experience.

TWCo wanted to scale its ML operations with greater transparency and less complexity, so that ML workflows would remain manageable as its data science team grew. Running ML workflows in the cloud posed notable challenges. TWCo's existing cloud environment lacked transparency for ML jobs and monitoring, and it had no feature store, which made it difficult for users to collaborate. Managers lacked the visibility they needed to continuously monitor ML workflows. To address these pain points, TWCo worked with the AWS Machine Learning Solutions Lab (MLSL) to migrate these ML workflows to Amazon SageMaker and the AWS Cloud. The MLSL team worked with TWCo to design an MLOps platform that would meet the needs of the data science team, taking into account current and future growth.

The business objectives TWCo set for this collaboration include:

Achieve faster time to market and faster ML development cycles
Accelerate TWCo's migration of ML workloads to SageMaker
Improve the end-user experience by adopting managed services
Reduce the time engineers spend on maintenance and upkeep of the ML infrastructure

The following functional goals were established to measure the impact on MLOps platform users:

Increase the efficiency of the data science team's model training tasks
Reduce the number of steps required to introduce a new model
Reduce run time for end-to-end model pipelines

Solution overview

This solution uses the following AWS services:

AWS CloudFormation – An infrastructure as code (IaC) service that provisions most of the templates and assets.
AWS CloudTrail – Monitors and records account activity across the AWS infrastructure.
Amazon CloudWatch – Collects and visualizes real-time logs that serve as the basis for automation.
AWS CodeBuild – A fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software. It is used to deploy the training and inference code.
AWS CodeCommit – A managed source control repository that stores the MLOps infrastructure code and IaC code.
AWS CodePipeline – A fully managed continuous delivery service that automates release pipelines.
Amazon SageMaker – A fully managed ML platform that covers the ML workflow from data exploration to training and model deployment.
AWS Service Catalog – Centrally manages cloud resources, such as the IaC templates used for MLOps projects.
Amazon Simple Storage Service (Amazon S3) – Cloud object storage for the training and testing data.

The following diagram shows the solution architecture:

The architecture consists of two main pipelines:

Training Pipeline – The training pipeline is designed to work with features and labels stored as CSV-formatted files in Amazon S3. It includes multiple components such as preprocessing, training, and evaluation. After the model is trained, its artifacts are registered in the Amazon SageMaker Model Registry through the Register Model component. The data quality check part of the pipeline creates baseline statistics for the monitoring task of the inference pipeline.
Inference Pipeline – The inference pipeline handles on-demand batch inference and monitoring tasks. It includes a SageMaker on-demand data quality monitoring step that detects drift in the input data. The monitoring results are stored in Amazon S3 and published as CloudWatch metrics, which can be used to set alarms. The alarms can then trigger training at a later time, send automated emails, or take any other desired action.
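To make the alarm step more concrete, the following is a minimal sketch of how a drift metric could be wired to a CloudWatch alarm through Boto3's `put_metric_alarm` call. It is an illustration under stated assumptions, not the configuration TWCo used: the namespace, metric name, threshold, and SNS topic ARN are all hypothetical placeholders.

```python
"""Sketch: alarm on a data quality drift metric. All names are hypothetical."""


def build_drift_alarm(namespace, metric_name, threshold, sns_topic_arn):
    """Return keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"{metric_name}-drift-alarm",
        "Namespace": namespace,            # assumed custom namespace
        "MetricName": metric_name,         # assumed drift metric name
        "Statistic": "Maximum",
        "Period": 3600,                    # evaluate one-hour windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,            # assumed drift tolerance
        "ComparisonOperator": "GreaterThanThreshold",
        # Batch inference runs on demand, so gaps in the metric are expected.
        "TreatMissingData": "notBreaching",
        # An SNS topic can fan out to email or to a function that starts retraining.
        "AlarmActions": [sns_topic_arn],
    }


def create_drift_alarm(cloudwatch_client, alarm_params):
    """Create or update the alarm and return its name."""
    cloudwatch_client.put_metric_alarm(**alarm_params)
    return alarm_params["AlarmName"]
```

In practice, `cloudwatch_client` would be `boto3.client("cloudwatch")`, and the SNS topic would have an email subscription or a subscriber that starts the training pipeline.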

The proposed MLOps architecture is flexible enough to support different use cases and collaboration between team personas such as data scientists and ML engineers. It reduces friction between cross-functional teams as they move models to production.

ML model experimentation is one of the subcomponents of the MLOps architecture. It improves data scientists' productivity and the model development process. Model experimentation on MLOps-related SageMaker services relies on features such as Amazon SageMaker Pipelines, Amazon SageMaker Feature Store, and SageMaker Model Registry, accessed through the SageMaker SDK and the AWS Boto3 library.
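As one example of using these features from Boto3, the sketch below looks up the most recently approved model version in a SageMaker Model Registry model group with `list_model_packages`. The function name and the model group name in the usage note are hypothetical, and this is only one way such a lookup might be written.

```python
"""Sketch: find the latest approved model package in a model group."""


def latest_approved_model_package(sagemaker_client, model_group_name):
    """Return the ARN of the newest approved model package, or None."""
    response = sagemaker_client.list_model_packages(
        ModelPackageGroupName=model_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",   # newest first
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    if not summaries:
        return None               # nothing has been approved yet
    return summaries[0]["ModelPackageArn"]
```

With a real client, this would be called as `latest_approved_model_package(boto3.client("sagemaker"), "example-model-group")`, and the returned ARN could then feed a batch inference step.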

Configuring a pipeline creates the resources needed throughout the pipeline's lifecycle. In addition, each pipeline may generate its own resources.

The pipeline configuration resources are:

Training Pipeline:
- SageMaker Pipelines
- SageMaker Model Registry Model Groups
- CloudWatch Namespace
Inference Pipeline:

The pipeline execution resources are:

When your pipeline expires or is no longer needed, you should delete these resources.
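A cleanup along these lines could be scripted with Boto3. The sketch below covers only two of the SageMaker resources named above, a pipeline and a Model Registry model group, and assumes hypothetical resource names. A model group must be emptied before it can be deleted, so the model packages are removed first.

```python
"""Sketch: delete a SageMaker pipeline and its model group (names are hypothetical)."""
import uuid


def delete_training_pipeline_resources(sagemaker_client, pipeline_name, model_group_name):
    """Delete the pipeline, then the model packages, then the model group.

    Returns the identifiers of everything that was deleted, in order.
    """
    deleted = []

    # DeletePipeline takes an idempotency token of at least 32 characters.
    sagemaker_client.delete_pipeline(
        PipelineName=pipeline_name,
        ClientRequestToken=str(uuid.uuid4()),
    )
    deleted.append(pipeline_name)

    # Collect every model package ARN in the group, following pagination.
    package_arns = []
    list_kwargs = {"ModelPackageGroupName": model_group_name}
    while True:
        page = sagemaker_client.list_model_packages(**list_kwargs)
        for summary in page.get("ModelPackageSummaryList", []):
            package_arns.append(summary["ModelPackageArn"])
        next_token = page.get("NextToken")
        if not next_token:
            break
        list_kwargs["NextToken"] = next_token

    for arn in package_arns:
        sagemaker_client.delete_model_package(ModelPackageName=arn)
        deleted.append(arn)

    sagemaker_client.delete_model_package_group(ModelPackageGroupName=model_group_name)
    deleted.append(model_group_name)
    return deleted
```

Passing the client in as a parameter keeps the function easy to dry-run against a stub before pointing it at a real account.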

SageMaker project template

In this section, we walk through manual provisioning of a pipeline using a sample notebook, and automatic provisioning of a SageMaker pipeline using a Service Catalog product and a SageMaker project.

With Amazon SageMaker Projects and its template-based approach, organizations can establish a standardized, scalable infrastructure for ML development. Teams can then focus on building and iterating on ML models and spend less time on complex setup and management.

The following diagram shows the required components of a SageMaker project template. Use Service Catalog to register a SageMaker project CloudFormation template in your organization's Service Catalog portfolio.

To start the ML workflow, the project template serves as the foundation by defining a continuous integration and delivery (CI/CD) pipeline. The pipeline begins by retrieving the ML seed code from a CodeCommit repository. Then the BuildProject component takes over and orchestrates the provisioning of the SageMaker training and inference pipelines. This automation reduces manual intervention and speeds up the deployment process.
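Once a template is registered as a Service Catalog product, a project can be provisioned from it programmatically. The following is a minimal sketch using the SageMaker `create_project` API through Boto3. The project name, product ID, and provisioning artifact ID are hypothetical placeholders that would come from your own Service Catalog portfolio.

```python
"""Sketch: provision a SageMaker project from a Service Catalog product."""


def build_project_request(project_name, product_id, provisioning_artifact_id, description=None):
    """Return keyword arguments for the SageMaker CreateProject API."""
    request = {
        "ProjectName": project_name,
        "ServiceCatalogProvisioningDetails": {
            "ProductId": product_id,                             # the registered template
            "ProvisioningArtifactId": provisioning_artifact_id,  # the template version
        },
    }
    if description:
        request["ProjectDescription"] = description
    return request


def create_project(sagemaker_client, request):
    """Start provisioning and return the new project's ARN."""
    response = sagemaker_client.create_project(**request)
    return response["ProjectArn"]
```

Separating the request builder from the API call makes it possible to review or log the request before anything is provisioned.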

Dependencies

The solution has the following dependencies:

Amazon SageMaker SDK – The Amazon SageMaker Python SDK is an open source library for training and deploying ML models on SageMaker. In this proof of concept, the pipelines were set up using this SDK.
Boto3 SDK – The AWS SDK for Python (Boto3) provides a Python API for AWS infrastructure services. We use the SDK for Python to create roles and provision SageMaker resources.
SageMaker Projects – SageMaker Projects provides standardized infrastructure and templates for MLOps, enabling rapid iteration across multiple ML use cases.
Service Catalog – Service Catalog simplifies and accelerates the process of provisioning resources at scale. It provides a self-service portal, a standardized service catalog, versioning and lifecycle management, and access control.
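As an illustration of the role creation mentioned for Boto3, the sketch below creates an IAM role that the SageMaker service can assume. The role name is a hypothetical placeholder, and the sketch only sets the trust policy. The permission policies that grant access to Amazon S3, CloudWatch, and other services would still need to be attached separately.

```python
"""Sketch: create an IAM role that SageMaker can assume (role name is hypothetical)."""
import json

# Trust policy that lets the SageMaker service assume the role.
SAGEMAKER_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_sagemaker_role(iam_client, role_name):
    """Create the role and return its ARN."""
    response = iam_client.create_role(
        RoleName=role_name,
        # IAM expects the trust policy as a JSON string.
        AssumeRolePolicyDocument=json.dumps(SAGEMAKER_TRUST_POLICY),
    )
    return response["Role"]["Arn"]
```

Here `iam_client` would be `boto3.client("iam")`.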

Conclusion

In this post, we showed how TWCo uses SageMaker, CloudWatch, CodePipeline, and CodeBuild for its MLOps platform. With these services, TWCo expanded the capabilities of its data science team while also improving how data scientists manage their ML workflows. These ML models ultimately helped TWCo create predictive, privacy-conscious experiences that improve the user experience and explain how weather conditions affect consumers' daily plans and business operations. We also reviewed an architecture design that keeps responsibilities modular across different users. Typically, data scientists are concerned only with the scientific aspects of the ML workflow, while DevOps and ML engineers focus on the production environment. TWCo reduced infrastructure management time by 90% and model deployment time by 20%.

This is just one of the many ways AWS enables developers to deliver great solutions – get started with Amazon SageMaker today!

About the Authors

Qaish Kanchwala is an ML Engineering Manager and ML Architect at The Weather Company. Qaish works on all steps of the machine learning lifecycle and designs systems that enable AI use cases. Outside of work, Qaish likes to cook new dishes and watch movies.

Shesal Kamaraj is a Senior Solutions Architect in the High Tech division at Amazon Web Services. He works with enterprise customers to help them accelerate and optimize their workload migration to the AWS Cloud. He is passionate about cloud management and governance, helping customers set up a landing zone for long-term success. In his spare time, he enjoys woodworking, listening to music, and trying new recipes.

Anila Joshi has over 10 years of experience building AI solutions. Anila is an Applied Science Manager at the AWS Generative AI Innovation Center, where she pioneers innovative AI applications that push the boundaries of what's possible and helps customers strategically chart a course for the future of AI.

Kamran Raj is a Machine Learning Engineer at the Amazon Generative AI Innovation Center. Passionate about creating use-case-driven solutions, Kamran helps customers harness the full potential of AWS AI/ML services to address real-world business challenges. With 10 years of experience as a software developer, he has honed his expertise in various domains, including embedded systems, cybersecurity solutions, and industrial control systems. Kamran holds a PhD in Electrical Engineering from Queen's University.

Shuja Sohrawardy is a Senior Manager at the AWS Generative AI Innovation Center. For over 20 years, Shuja has used his technology and financial services acumen to transform financial services companies to meet the challenges of a highly competitive and regulated industry. In his last 4 years at AWS, Shuja has used his deep knowledge of machine learning, resiliency, and cloud adoption strategies to pave the way for numerous customer successes. Shuja holds a BA in Computer Science and Economics from New York University and an MSc in Executive Technology Management from Columbia University.

Francisco Calderon is a Data Scientist at the Generative AI Innovation Center (GAIIC). As a member of GAIIC, he works with AWS customers to explore possibilities using Generative AI technology. In his spare time, he enjoys playing music and guitar, playing soccer with his daughters, and spending time with his family.