Running machine learning (ML) models in production requires more than infrastructure resiliency and scaling efficiency. You need near-continuous visibility into performance and resource utilization. When latency increases, calls fail, or resources become limited, you need instant insight to diagnose and resolve issues before they impact your customers.
Previously, Amazon SageMaker AI offered Amazon CloudWatch metrics that provided useful high-level visibility, but these were aggregated metrics across all instances and containers. While these aggregated metrics are useful for monitoring overall health, they obscure the details of individual instances and containers, making it difficult to identify bottlenecks, improve resource utilization, or effectively troubleshoot.
SageMaker AI endpoints now support enhanced metrics with a configurable publishing frequency. This release provides the detailed visibility you need to monitor, troubleshoot, and optimize your production endpoints. Enhanced metrics for SageMaker AI endpoints allow you to drill down to container-level and instance-level metrics, providing capabilities such as the following:
- View metrics for specific model copies. When multiple copies of a model are deployed on SageMaker AI endpoints using inference components, viewing metrics for each model copy, such as concurrent requests, GPU utilization, and CPU utilization, helps you diagnose issues and visualize traffic patterns in your production workloads.
- Calculate per-model costs. Calculating the actual cost per model is complicated when multiple models share the same infrastructure. Enhanced metrics let you calculate and attribute per-model costs by tracking GPU allocation at the inference component level.
What's new
Enhanced metrics introduce two categories of metrics, each with multiple levels of granularity.
- EC2 resource utilization metrics: Track CPU, GPU, and memory consumption at the instance and container level.
- Call metrics: Monitor request patterns, errors, latency, and concurrency with precise dimensions.
Each category provides different levels of visibility depending on your endpoint configuration.
Instance-level metrics: Available for all endpoints
All SageMaker AI endpoints now have access to instance-level metrics, giving you visibility into what’s happening on each of your endpoint’s Amazon Elastic Compute Cloud (Amazon EC2) instances.
Resource utilization (CloudWatch namespace: /aws/sagemaker/Endpoints)
Track CPU utilization, memory consumption, and per-GPU utilization and memory usage for each host. When a problem occurs, you can quickly identify which specific instance requires attention. For accelerator-based instances, you see utilization metrics for each individual accelerator.
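Because the exact set of instance-level metrics and dimensions depends on your instance type and endpoint configuration, a quick way to see what is being published is to list the metrics in the namespace. The following is a minimal boto3 sketch; the endpoint name is a placeholder.
import boto3

cloudwatch = boto3.client('cloudwatch')

# Enumerate the instance-level metrics published for one endpoint,
# including their dimension sets (for example, per-instance and per-GPU dimensions)
paginator = cloudwatch.get_paginator('list_metrics')
for page in paginator.paginate(
        Namespace='/aws/sagemaker/Endpoints',
        Dimensions=[{'Name': 'EndpointName', 'Value': 'my-endpoint'}]):  # placeholder endpoint name
    for metric in page['Metrics']:
        dims = ', '.join(f"{d['Name']}={d['Value']}" for d in metric['Dimensions'])
        print(f"{metric['MetricName']}: {dims}")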
Call metrics (CloudWatch namespace: AWS/SageMaker)
Drill down to the instance level to track request patterns, errors, and latency. Monitor invocations, 4XX/5XX errors, model latencies, and overhead latencies with precise dimensions to help pinpoint which instances are experiencing problems. These metrics can help diagnose uneven traffic distribution, identify error-prone instances, and correlate performance issues with specific resources.
Container-level metrics: For inference components
When you use inference components to host multiple models on a single endpoint, you get container-level visibility.
Resource utilization (CloudWatch namespace: /aws/sagemaker/InferenceComponents)
Monitor resource consumption per container. Check the CPU, memory, GPU usage, and GPU memory usage for each model copy. This visibility helps you understand which inference component model copies are consuming resources, maintain fair allocation in multi-tenant scenarios, and identify containers experiencing performance issues. These detailed metrics include the following dimensions: InferenceComponentName and ContainerId.
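For example, you can read per-container CPU utilization with a standard GetMetricData query against this namespace. The following is a minimal sketch; the inference component name and ContainerId value are placeholders, the metric name follows the pattern used in the instance-level namespace, and you can discover the exact dimension sets that are published with list_metrics, as shown earlier.
from datetime import datetime, timedelta, timezone

end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(minutes=30)

# Average CPU utilization for one model copy's container (placeholder dimension values;
# adjust the dimension set to match what list_metrics returns for your endpoint)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        'Id': 'cpu',
        'MetricStat': {
            'Metric': {
                'Namespace': '/aws/sagemaker/InferenceComponents',
                'MetricName': 'CPUUtilizationNormalized',
                'Dimensions': [
                    {'Name': 'InferenceComponentName', 'Value': 'IC-my-model'},
                    {'Name': 'ContainerId', 'Value': 'container-0'}
                ]
            },
            'Period': 60,
            'Stat': 'Average'
        }
    }],
    StartTime=start_time,
    EndTime=end_time
)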
Call metrics (CloudWatch namespace: AWS/SageMaker)
Track request patterns, errors, and latency at the container level. Monitor invocations, 4XX/5XX errors, model latencies, and overhead latencies with precise dimensions to help pinpoint where problems occur.
Configuring enhanced metrics
Enable enhanced metrics by adding the MetricsConfig parameter when you create an endpoint configuration.
import boto3

sagemaker_client = boto3.client('sagemaker')

response = sagemaker_client.create_endpoint_config(
    EndpointConfigName="my-config",
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g6.12xlarge',
        'InitialInstanceCount': 2
    }],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricsPublishFrequencyInSeconds': 10  # Default is 60 seconds
    }
)
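The metrics configuration takes effect when an endpoint is created from (or updated to) this configuration. As a minimal sketch of the remaining steps, assuming the usual boto3 SageMaker client:
sagemaker_client.create_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-config'
)

# Wait until the endpoint is InService before querying the new metrics
waiter = sagemaker_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='my-endpoint')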
Selecting a publishing frequency
After enabling enhanced metrics, configure the publishing frequency based on your monitoring needs.
Standard resolution (60 seconds): The default frequency provides detailed visibility for most production workloads. This is sufficient for capacity planning, troubleshooting, and optimization while keeping costs manageable.
High resolution (10 or 30 seconds): For critical applications that require near real-time monitoring, use 10-second or 30-second publishing. This is useful for aggressive auto scaling, highly variable traffic patterns, or in-depth troubleshooting.
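Because the frequency is part of the endpoint configuration, one way to change it on a running endpoint is to create a new configuration with the desired MetricsPublishFrequencyInSeconds and point the endpoint at it with UpdateEndpoint. A hedged sketch, reusing the variant settings from the earlier example and a hypothetical config name:
# New configuration that switches the endpoint back to the default 60-second frequency
sagemaker_client.create_endpoint_config(
    EndpointConfigName="my-config-standard",
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.g6.12xlarge',
        'InitialInstanceCount': 2
    }],
    MetricsConfig={
        'EnableEnhancedMetrics': True,
        'MetricsPublishFrequencyInSeconds': 60
    }
)

# Apply the new configuration to the running endpoint
sagemaker_client.update_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-config-standard'
)
Note that UpdateEndpoint rolls out the new configuration like any other endpoint update.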
Example use cases
This post describes three common scenarios where enhanced metrics deliver measurable business value. All three are covered in the accompanying notebook.
1. Track real-time GPU usage across inference components
When using inference components to run multiple models on a shared infrastructure, understanding GPU allocation and utilization is important for cost optimization and performance tuning. With enhanced metrics, you can query GPU allocation for each inference component.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'm1',
            'Expression': 'SEARCH(\'{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName="GPUUtilizationNormalized" InferenceComponentName="IC-my-model"\', \'SampleCount\', 10)'
        },
        {
            'Id': 'e1',
            'Expression': 'SUM(m1)'  # Returns the GPU count
        }
    ],
    StartTime=start_time,
    EndTime=end_time
)
In this query, the GpuId dimension identifies the individual GPUs assigned to each inference component. By tracking the SampleCount statistic, you can obtain the exact number of GPUs a particular inference component is using, as the sketch after the following list shows. This is essential for:
- Verifying that resource allocations match your configuration
- Detecting when an inference component scales up or down
- Calculating per-GPU costs for chargeback models
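For reference, here is a small sketch of reading the GPU count back out of the GetMetricData response above; the result shape (MetricDataResults with Timestamps and Values per query Id) is standard for this API.
# 'e1' holds the summed SampleCount, which equals the number of GPUs
# reporting metrics for the inference component in each period
gpu_counts = next(r for r in response['MetricDataResults'] if r['Id'] == 'e1')
for timestamp, value in zip(gpu_counts['Timestamps'], gpu_counts['Values']):
    print(f"{timestamp.isoformat()}: {int(value)} GPU(s) in use by IC-my-model")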
2. Cost attribution per model in multi-model deployments
One of the most requested features is understanding the actual cost of each model when multiple models share the same endpoint infrastructure. Enhanced metrics enable this through container-level GPU tracking. Here’s how to calculate the cumulative cost per model:
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'e1',
            'Expression': 'SEARCH(\'{/aws/sagemaker/InferenceComponents,InferenceComponentName,GpuId} MetricName="GPUUtilizationNormalized" InferenceComponentName="IC-my-model"\', \'SampleCount\', 10)'
        },
        {
            'Id': 'e2',
            'Expression': 'SUM(e1)'  # GPU count
        },
        {
            'Id': 'e3',
            'Expression': 'e2 * 5.752 / 4 / 360'  # Cost per 10 seconds, based on the ml.g6.12xlarge hourly cost
        },
        {
            'Id': 'e4',
            'Expression': 'RUNNING_SUM(e3)'  # Cumulative cost
        }
    ],
    StartTime=start_time,
    EndTime=end_time
)
This calculation:
- Counts the GPUs allocated to the inference component (e2)
- Calculates the cost per 10 seconds based on the instance's hourly cost (e3)
- Accumulates the total cost over time using RUNNING_SUM (e4)
For example, on an ml.g6.12xlarge instance ($5.752 per hour with 4 GPUs), if your model uses all 4 GPUs, the cost per 10 seconds is approximately $0.016. RUNNING_SUM works well for dashboards and cost tracking because the total increases continuously.
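To make the arithmetic in e3 concrete, here is the same formula as plain Python, using the example prices above:
instance_cost_per_hour = 5.752    # ml.g6.12xlarge hourly price used in this example
gpus_per_instance = 4
intervals_per_hour = 3600 / 10    # 360 ten-second intervals per hour

gpus_used_by_model = 4            # value of e2 when the model occupies all GPUs on one instance
cost_per_10s = gpus_used_by_model * instance_cost_per_hour / gpus_per_instance / intervals_per_hour
print(f"${cost_per_10s:.3f} per 10 seconds")    # ~$0.016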
3. Cluster-wide resource monitoring
Enhanced metrics enable comprehensive cluster monitoring by aggregating metrics across all inference components on endpoints.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'e1',
            'Expression': 'SUM(SEARCH(\'{/aws/sagemaker/InferenceComponents,EndpointName,GpuId} MetricName="GPUUtilizationNormalized" EndpointName="my-endpoint"\', \'SampleCount\', 10))'
        },
        {
            'Id': 'm2',
            'MetricStat': {
                'Metric': {
                    'Namespace': '/aws/sagemaker/Endpoints',
                    'MetricName': 'CPUUtilizationNormalized',
                    'Dimensions': [
                        {'Name': 'EndpointName', 'Value': 'my-endpoint'},
                        {'Name': 'VariantName', 'Value': 'AllTraffic'}
                    ]
                },
                'Period': 10,
                'Stat': 'SampleCount'  # Returns the instance count
            }
        },
        {
            'Id': 'e2',
            'Expression': 'm2 * 4 - e1'  # Free GPUs (assuming 4 GPUs per instance)
        }
    ],
    StartTime=start_time,
    EndTime=end_time
)
This query provides:
- The total number of GPUs used by all inference components (e1)
- The number of instances in the endpoint (m2)
- The number of GPUs available for new deployments (e2)
This visibility is critical for capacity planning and for ensuring you have sufficient resources to deploy new models or scale existing ones.
Creating an operational dashboard

The accompanying notebook shows you how to programmatically create CloudWatch dashboards that combine these metrics.
from endpoint_metrics_helper import create_dashboard

create_dashboard(
    dashboard_name="my-endpoint-monitoring",
    endpoint_name="my-endpoint",
    inference_components=[
        {'name': 'IC-model-a', 'label': 'MODEL_A'},
        {'name': 'IC-model-b', 'label': 'MODEL_B'}
    ],
    cost_per_hour=5.752,
    region='us-east-1'
)
This will create a dashboard containing:
- Cluster-level resource utilization (instances, used/unused GPUs)
- Cost tracking by model with cumulative totals
- Real-time cost per 10 seconds
This notebook also includes interactive widgets for ad hoc analysis.
from endpoint_metrics_helper import create_metrics_widget, create_cost_widget

# Cluster metrics
create_metrics_widget('my-endpoint')

# Per-model cost analysis
create_cost_widget('IC-model-a', cost_per_hour=5.752)
These widgets provide a drop-down time range selection (last 5/10/30 minutes, 1 hour, or a custom range) and display:
- Number of instances
- Total/Used/Free GPUs
- Cumulative cost per model
- Cost per 10 seconds
Best practices
- Start with 60-second resolution: This provides enough granularity for most use cases while keeping CloudWatch costs manageable. Note that only the resource utilization metrics incur CloudWatch charges; all other metric types are published at no additional charge.
- Use 10-second resolution selectively: Enable high-resolution metrics only for critical endpoints or during troubleshooting periods.
- Use dimensions strategically: Use the InferenceComponentName, ContainerId, and GpuId dimensions to drill down from a cluster-wide view to specific containers.
- Create a cost allocation dashboard: Use RUNNING_SUM formulas to track cumulative costs per model for accurate chargebacks and budgeting.
- Set alarms for unused GPU capacity: Monitor the unused GPU metric to make sure you maintain buffer capacity for scaling or new deployments (see the sketch after this list).
- Combine with call metrics: Correlate resource utilization and request patterns to understand the relationship between traffic and resource consumption.
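As a sketch of the unused GPU capacity alarm mentioned above: CloudWatch alarms can't evaluate SEARCH expressions directly, so one approach is to run the cluster-wide free-GPU query on a schedule, publish the result as a custom metric, and alarm on that. The custom namespace and metric name below are hypothetical.
# Publish the latest free-GPU value (query Id 'e2' from the cluster-wide query above)
# as a custom metric; with the default scan order the first value is the most recent
free_gpus = next(r for r in response['MetricDataResults'] if r['Id'] == 'e2')
if free_gpus['Values']:
    cloudwatch.put_metric_data(
        Namespace='Custom/SageMakerCapacity',   # hypothetical namespace
        MetricData=[{
            'MetricName': 'FreeGPUs',           # hypothetical metric name
            'Dimensions': [{'Name': 'EndpointName', 'Value': 'my-endpoint'}],
            'Value': free_gpus['Values'][0]
        }]
    )

# Alarm when fewer than 2 GPUs stay free for 5 consecutive minutes
cloudwatch.put_metric_alarm(
    AlarmName='my-endpoint-low-free-gpus',
    Namespace='Custom/SageMakerCapacity',
    MetricName='FreeGPUs',
    Dimensions=[{'Name': 'EndpointName', 'Value': 'my-endpoint'}],
    Statistic='Minimum',
    Period=60,
    EvaluationPeriods=5,
    Threshold=2,
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='breaching'
)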
Conclusion
Enhanced metrics for Amazon SageMaker AI endpoints transform the way you monitor, improve, and operate your ML workloads in production. By providing container-level visibility with a configurable publishing frequency, you can gain the operational intelligence you need to:
- Accurately attribute costs to individual models in multi-tenant deployments
- Monitor real-time GPU allocation and utilization across inference components
- Track resource availability across clusters for capacity planning
- Troubleshoot performance issues with accurate and detailed metrics
The combination of detailed metrics, flexible publishing frequency, and rich dimensions allows you to build advanced monitoring solutions that scale with your ML operations. Whether you’re running a single model or managing dozens of inference components across multiple endpoints, enhanced metrics provide the visibility you need to run AI efficiently at scale.
Get started today by enabling enhanced metrics on your SageMaker AI endpoint, and explore the complete implementation example and reusable helper functions in the included notebook.