Reinforcement learning scheduler reduces Kubernetes CPU usage by 20%

Researchers are tackling resource allocation in container orchestration, focusing on Kubernetes, a leading platform for demanding computing workloads. Hanlin Zhou, Hua Yong Chan, and Shun Yao Zhang from Universiti Sains Malaysia’s School of Computer Science and Xiamen Software Technology Institute, along with Meie Lin and Jingfei Ni, present an approach that uses reinforcement learning to schedule pods, the basic unit of deployment in Kubernetes. Their work reports improvements over the default Kubernetes scheduler: the SDQN-n model reduces average CPU usage per cluster node by up to 20%, which could make data centers more efficient and sustainable. The result matters because the energy consumption of cloud computing continues to rise.

The researchers chose average CPU utilization per node as a key performance metric because it directly affects CPU provisioning decisions for both cloud and on-premises infrastructure, as well as the performance and power consumption of co-located services. The study sets out a benchmark for pod scheduling and demonstrates the potential of reinforcement learning for optimizing resource allocation in dynamic containerized environments. In addition, the SDQN-n strategy of consolidating pods reduces overall CPU utilization and makes it possible to shut down idle machines, supporting more sustainable data center operations.

The team’s contributions include the SDQN framework, which integrates the DQN reinforcement learning paradigm into Kubernetes’ scheduling pipeline, and SDQN-n, which uses reinforcement learning to consolidate pods onto fewer nodes. The researchers highlight the adaptability of the architecture, noting that the reinforcement learning components can be tailored to the requirements of future scenarios, which supports the scheduler’s long-term applicability as cloud computing environments evolve. The work offers a path to better resource management within Kubernetes clusters, with lower operational costs, improved performance, and reduced environmental impact.

The study reports greater resource savings than both the default scheduler and alternative AI-driven approaches. The findings have direct implications for organizations deploying compute-intensive workloads in cloud or on-premises environments, giving them a practical way to optimize resource utilization and build more sustainable infrastructure.

Developing SDQN and SDQN-n Kubernetes schedulers

The scheduler describes each node with six input parameters: CPU usage, memory usage, pod utilization, health status, node uptime (hours), and number of running pods. Each parameter is calculated using formulas detailed in the study. CPU and memory usage are determined as the ratio of real-time consumption to total capacity, and pod utilization reflects the current workload pressure on each node as a percentage of the maximum possible pods. Node health is a binary metric, 1 when the node status is “ready” and 0 otherwise, and uptime is measured in hours from the node’s start time. Together, these inputs let the model assess the state of each node before making a scheduling decision.
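The six inputs described above can be sketched as a small feature-extraction function. This is only an illustration: the `NodeSnapshot` type, its field names, and the choice to express the three utilization values on a 0 to 100 scale are assumptions, not details taken from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    """Hypothetical container for raw per-node measurements."""
    cpu_used_millicores: float
    cpu_capacity_millicores: float
    mem_used_bytes: float
    mem_capacity_bytes: float
    running_pods: int
    max_pods: int
    ready: bool
    uptime_seconds: float


def node_features(n: NodeSnapshot) -> List[float]:
    """Build the six-element state vector for one node."""
    return [
        100.0 * n.cpu_used_millicores / n.cpu_capacity_millicores,  # CPU usage (%)
        100.0 * n.mem_used_bytes / n.mem_capacity_bytes,            # memory usage (%)
        100.0 * n.running_pods / n.max_pods,                        # pod utilization (%)
        1.0 if n.ready else 0.0,                                    # health: 1 if ready, else 0
        n.uptime_seconds / 3600.0,                                  # uptime in hours
        float(n.running_pods),                                      # number of running pods
    ]


# Example: a node at half its CPU capacity, a quarter of its memory, 11 of 110 pods.
example = NodeSnapshot(2000, 4000, 4e9, 16e9, 11, 110, True, 7200)
features = node_features(example)
```

In a real cluster the raw values would come from the Kubernetes API and a metrics source; that plumbing is omitted here.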

The SDQN algorithm uses a neural network to approximate the Q-function, which estimates the value of each state-action pair so that the scheduler can pick the highest-valued placement. The team defined a reward function designed to keep CPU and memory utilization within an optimal range: it awards +10 points for usage between 40% and 70%, deducts 2 points for every percentage point above 70%, and deducts 10 points for usage below 40%. Pod distribution and node uptime also contribute to the reward score, promoting workload distribution and stable node operation. The SDQN model itself consists of a 6-dimensional input layer, a single fully connected hidden layer that maps to 32 dimensions with ReLU activations, and a final fully connected output layer that estimates the Q-value.
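The utilization part of the reward can be written out as a short function. The sketch below covers only the CPU and memory thresholds given above; the pod-distribution and uptime terms are left out because the article does not give their formulas. Treating the 40% and 70% boundaries as inclusive, and summing the CPU and memory scores, are both assumptions.

```python
def utilization_reward(usage_pct: float) -> float:
    """Score one utilization value (0-100) against the 40-70% target band."""
    if usage_pct < 40.0:
        return -10.0                      # under-utilized node
    if usage_pct <= 70.0:
        return 10.0                       # inside the target band
    return -2.0 * (usage_pct - 70.0)      # 2 points per percentage point over 70%


def resource_reward(cpu_pct: float, mem_pct: float) -> float:
    """Assumed combination: add the CPU and memory scores."""
    return utilization_reward(cpu_pct) + utilization_reward(mem_pct)


# A node at 55% CPU and 80% memory: +10 for CPU, -20 for memory.
score = resource_reward(55.0, 80.0)
```

Note that the overshoot penalty grows without bound, so a node at 90% is scored far worse (-40) than an idle node (-10), which pushes the agent away from overload more strongly than away from idleness.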

SDQN-n builds on SDQN by restricting pod placement to a limited number of nodes (specifically two), which conserves resources and promotes energy efficiency. This constraint is reflected in a modified reward function: placements outside the top two candidate nodes are penalized by 50 points to encourage consolidation. Both models were trained with the Adam optimizer at a learning rate of 0.001. Each training step runs forward propagation to compute Q(s,a), then backpropagation against the target reward to update the network weights.
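To make the architecture and training step concrete, here is a minimal plain-Python sketch of a 6-32-1 network with ReLU, updated by Adam at a learning rate of 0.001, plus the top-two placement penalty. Several points are assumptions rather than details from the study: the class and function names, the squared-error loss, the weight initialization, the single scalar output per node, and ranking the "top two" nodes by Q-value.

```python
import math
import random


class TinyQNet:
    """6 -> 32 (ReLU) -> 1 Q-value estimator trained with Adam."""

    def __init__(self, n_in=6, n_hidden=32, lr=0.001, seed=0):
        rng = random.Random(seed)
        self.n_in, self.n_hidden, self.lr = n_in, n_hidden, lr
        # Flat parameter layout: W1 (n_hidden * n_in), b1, W2, b2.
        self.b1 = n_hidden * n_in
        self.w2 = self.b1 + n_hidden
        self.b2 = self.w2 + n_hidden
        size = self.b2 + 1
        self.p = [rng.uniform(-0.1, 0.1) for _ in range(size)]
        self.m = [0.0] * size   # Adam first-moment estimates
        self.v = [0.0] * size   # Adam second-moment estimates
        self.t = 0

    def _forward(self, x):
        n_in = self.n_in
        pre = [sum(self.p[j * n_in + i] * x[i] for i in range(n_in)) + self.p[self.b1 + j]
               for j in range(self.n_hidden)]
        hid = [max(0.0, z) for z in pre]
        q = sum(self.p[self.w2 + j] * hid[j] for j in range(self.n_hidden)) + self.p[self.b2]
        return q, hid, pre

    def q_value(self, x):
        return self._forward(x)[0]

    def train_step(self, x, target, beta1=0.9, beta2=0.999, eps=1e-8):
        """One forward/backward pass toward `target`; returns the loss before the update."""
        q, hid, pre = self._forward(x)
        err = q - target                      # gradient of 0.5 * (q - target)^2 w.r.t. q
        g = [0.0] * len(self.p)
        g[self.b2] = err
        for j in range(self.n_hidden):
            g[self.w2 + j] = err * hid[j]
            if pre[j] > 0.0:                  # ReLU passes gradient only when active
                dh = err * self.p[self.w2 + j]
                g[self.b1 + j] = dh
                for i in range(self.n_in):
                    g[j * self.n_in + i] = dh * x[i]
        self.t += 1
        for k, gk in enumerate(g):            # Adam update with bias correction
            self.m[k] = beta1 * self.m[k] + (1 - beta1) * gk
            self.v[k] = beta2 * self.v[k] + (1 - beta2) * gk * gk
            m_hat = self.m[k] / (1 - beta1 ** self.t)
            v_hat = self.v[k] / (1 - beta2 ** self.t)
            self.p[k] -= self.lr * m_hat / (math.sqrt(v_hat) + eps)
        return 0.5 * err * err


def sdqn_n_reward(base_reward, chosen_node, q_by_node, top_k=2, penalty=-50.0):
    """Apply the consolidation penalty when the chosen node is outside the top_k by Q-value."""
    top = sorted(q_by_node, key=q_by_node.get, reverse=True)[:top_k]
    return base_reward + (0.0 if chosen_node in top else penalty)


# Usage: score a node state, then take one training step toward an observed reward.
net = TinyQNet()
state = [0.5, 0.25, 0.1, 1.0, 0.2, 0.11]      # inputs scaled to small values (an assumption)
loss = net.train_step(state, target=10.0)
```

A production scheduler would more likely use a deep learning framework than hand-written backpropagation; the plain-Python version is used here only so the arithmetic of each step is visible.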

SDQN scheduler improves Kubernetes CPU resources

The consolidation strategy concentrates pods onto fewer nodes, maximizing resource utilization and minimizing waste. The data show that this approach not only reduces CPU load but also allows idle machines to be retired, paving the way for more energy-efficient data centers. The researchers note that the reinforcement learning components of the SDQN and SDQN-n architectures are easily adjustable and can adapt to different future scenarios and workload requirements. Their measurements show that average CPU utilization per node, a key performance metric, is consistently lower with the new scheduler, which directly affects CPU provisioning decisions for both cloud and on-premises servers.

This breakthrough has the potential to significantly save resources, reduce power consumption, and improve the scalability of containerized applications. In testing, SDQN-n’s pod consolidation strategy has proven to be particularly effective, reducing CPU usage by more than 20% by strategically placing compute-intensive pods. This effort establishes the foundation for greener, more sustainable data center operations and improved resource management in cloud computing environments.

SDQN-n significantly reduces Kubernetes CPU usage and improves clusters

SDQN’s effectiveness comes from reinforcement learning’s ability to adapt to the real-time conditions of each node and place pods so as to minimize overall CPU usage. The authors acknowledge that further work is needed to broaden the model’s applicability across different workload types and cluster configurations. Future research will also focus on tuning hyperparameters to improve resource conservation and scheduling robustness, and on investigating SDQN-n’s consolidation strategy as a blueprint for energy-efficient data centers.
