Curriculum-based deep reinforcement learning enables stable electric vehicle routing with time windows

Researchers are tackling the notoriously difficult electric vehicle routing problem with time windows (EVRPTW), a key challenge in sustainable logistics. Mertcan Daysalilar (University of Miami), Fuat Uygroglu (Cyprus International University), Gabriel Nicolosi (Missouri Technological University), and colleagues present a curriculum-based deep reinforcement learning framework that improves both the speed and the reliability of solutions. Existing deep reinforcement learning models often falter under the complex constraints of this problem. The new approach raises the difficulty progressively, which keeps training stable and yields strong generalization: although trained on much smaller instances, the model handles problems with up to 100 customers. The work is a significant step toward practical, efficient, and reliable electric vehicle routing in real-world applications.

The research team designed a structured three-step curriculum that gradually increases problem complexity. Agents first master distance and fleet optimization, then battery management, and finally the complete EVRPTW. This stepwise approach mitigates the sparse reward signals that typically plague end-to-end DRL models, promotes stable learning, and prevents policy collapse.
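
As a rough illustration of the staged training described above, the sketch below encodes the three phases as a list of constraint switches and trains through them in order. The `Phase` class, the flag names, and the `train_phase` callback are hypothetical, and the assumption that time windows are only enforced in the final phase is inferred from the phase descriptions rather than stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    battery_constraints: bool  # enforce state-of-charge limits and recharging
    time_windows: bool         # enforce customer time windows


# Hypothetical encoding of the three phases; which constraint is switched on
# in which phase is an assumption based on the article's description.
CURRICULUM = (
    Phase("A: distance and fleet", battery_constraints=False, time_windows=False),
    Phase("B: battery management", battery_constraints=True, time_windows=False),
    Phase("C: full EVRPTW", battery_constraints=True, time_windows=True),
)


def run_curriculum(train_phase, policy):
    """Train through the phases in order, carrying the policy forward."""
    for phase in CURRICULUM:
        policy = train_phase(policy, phase)
    return policy
```

The point of the design is that the policy learned in one phase becomes the starting point for the next, so the harder constraints are introduced to an agent that can already build reasonable routes.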

To keep learning consistent across phases, the team implemented a modified proximal policy optimization (PPO) algorithm with carefully tuned hyperparameters, value and advantage clipping, and adaptive learning rate scheduling. The core of the model is a heterogeneous attention encoder enhanced with a global-local attention mechanism and feature-wise linear modulation (FiLM). This architecture is designed to capture the distinct characteristics of depots, customers, and, importantly, charging stations, allowing agents to make routing decisions that account for energy constraints. Trained only on small instances with N=10 customers, the model generalized well and successfully handled unseen instances ranging from N=5 to N=100.
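
To make the clipping terms concrete, here is a minimal per-sample sketch of a PPO loss with a clipped probability ratio, a clipped advantage, and a clipped value update. It follows the standard PPO formulation; the function name, the default thresholds, and the way the terms are combined are illustrative assumptions, not the authors' settings.

```python
import math


def ppo_sample_loss(logp_new, logp_old, advantage, value_new, value_old, ret,
                    clip_eps=0.2, adv_clip=5.0, value_coef=0.5):
    """Per-sample PPO loss with ratio, advantage, and value clipping.

    All default thresholds are illustrative assumptions.
    """
    # Advantage clipping: bound extreme advantages (e.g. from large penalties).
    adv = max(-adv_clip, min(adv_clip, advantage))

    # Clipped surrogate objective on the probability ratio.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    policy_loss = -min(ratio * adv, clipped_ratio * adv)

    # Value clipping: limit how far the new value estimate moves per update.
    delta = max(-clip_eps, min(clip_eps, value_new - value_old))
    value_clipped = value_old + delta
    value_loss = max((value_new - ret) ** 2, (value_clipped - ret) ** 2)

    return policy_loss + value_coef * value_loss
```

Bounding the advantage and the value update limits how much a single badly infeasible rollout can move the policy, which is one plausible reason such clipping helps under dense constraints.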

Experiments show that this curriculum-based approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances, significantly outperforming standard DRL baselines, which often fail under dense constraints. The work narrows the gap between the speed of neural networks and the operational reliability required in real-world logistics. By decomposing the problem into manageable phases, the CB-DRL framework allows agents to learn robust policies that can handle the complexity of EVRPTW, offering a promising approach to sustainable and efficient delivery operations. It has the potential to significantly improve electric vehicle fleet planning and execution in dynamic, real-time environments.

This research establishes a clear pathway for applying deep reinforcement learning to complex combinatorial optimization problems, especially those with severe constraints. Although the researchers trained the model only on small instances with N=10 customers, it generalized robustly to unseen instances ranging from N=5 to N=100 and improved significantly on standard baseline methods for medium-sized problems. The experimental results confirm that the curriculum-based approach achieves high feasibility rates and competitive solution quality on out-of-distribution instances, addressing the sparse rewards and unstable training that often hinder standard DRL applications. The framework offers a viable path to both speed and operational reliability in electric vehicle routing, paving the way for more sustainable and efficient logistics.

Building skills progressively with a three-step EVRPTW curriculum

Scientists developed a curriculum-based deep reinforcement learning (CB-DRL) framework to address training instability when solving the electric vehicle routing problem with time windows (EVRPTW). The study introduces a structured three-stage curriculum that increases problem complexity step by step to improve training stability and generalization. The agent first learns distance and fleet optimization in Phase A, then battery management in Phase B, and finally the full EVRPTW in Phase C. This step-by-step approach addresses the dense constraints that make complex routing problems hard to learn.

To ensure stable learning at each stage, the researchers employed a modified proximal policy optimization (PPO) algorithm and carefully tuned the hyperparameters of each stage. Value and advantage clipping were combined with adaptive learning rate scheduling to further stabilize the learning process. The policy network itself is built on a heterogeneous attention encoder with both global-local attention and feature-wise linear modulation (FiLM), an important architectural innovation. This design explicitly captures the distinct characteristics of depots, customers, and charging stations, allowing the model to distinguish their roles within the routing problem.
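
Feature-wise linear modulation scales and shifts each feature of an embedding using coefficients computed from a conditioning vector. The sketch below shows that operation in plain Python. What the conditioning vector contains in the CB-DRL model is not described in the article, so `context` and all weight names here are hypothetical.

```python
def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(h, context, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out[i] = gamma[i] * h[i] + beta[i].

    gamma and beta are linear functions of the conditioning vector `context`
    (a hypothetical input, e.g. global instance features).
    """
    gamma = linear(context, w_gamma, b_gamma)  # per-feature scale
    beta = linear(context, w_beta, b_beta)     # per-feature shift
    return [g * hi + b for g, hi, b in zip(gamma, h, beta)]
```

For example, `film([1.0, 2.0], [1.0], [[2.0], [0.0]], [0.0, 1.0], [[0.0], [1.0]], [0.5, 0.0])` scales the first feature by 2 and shifts it by 0.5, and scales the second by 1 and shifts it by 1.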

The team designed a heterogeneous graph attention encoder that represents EVRPTW as a graph while recognizing the different function of each node type. Unlike standard attention models, this encoder uses separate query projection parameters W^Q_cust, W^Q_station, and W^Q_depot, allowing the model to learn explicit relational dynamics between nodes. For example, the distance between a customer and a charging station is weighted differently than the distance between two stations, reflecting its importance for feasibility. The resulting embeddings are processed by a global-local attention edge encoder, which fuses local neighborhood information and global routing context to aggregate features across different spatial scales.
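
The idea of type-specific query projections can be sketched as a single attention head in which the query matrix is selected by node type. This is a simplified illustration in plain Python, not the authors' encoder: the shared key matrix, the single head, the dictionary layout of `w_query`, and the function names are all assumptions.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(node_idx, features, node_types, w_query, w_key):
    """Attention of one node over all nodes.

    w_query maps a node type ("depot", "customer", "station") to its own
    query matrix; w_key is shared across types (an assumption of this sketch).
    """
    query = matvec(w_query[node_types[node_idx]], features[node_idx])
    keys = [matvec(w_key, f) for f in features]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)
```

Because each node type has its own query matrix, a charging station and a customer with identical raw features can still attend to their neighbours differently.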

The experiments used instances with N=10 customers for training and demonstrated robust generalization to unseen instances ranging from N=5 to N=100. The model significantly outperformed standard baseline techniques on medium-sized problems and achieved high feasibility and competitive solution quality on out-of-distribution instances where traditional DRL approaches failed. The curriculum-based approach narrows the gap between computational speed and operational reliability and demonstrates the value of structured learning in complex optimization tasks.

Curriculum learning stabilizes electric vehicle route planning

Scientists have developed a curriculum-based deep reinforcement learning (CB-DRL) framework to address the instability problem in solving the electric vehicle routing problem with time windows (EVRPTW). The research team tackled the challenge of optimizing electric vehicle routes while taking into account customer time windows, battery usage, and fleet size, a notoriously complex problem in sustainable logistics. The framework uses a structured three-phase curriculum that progressively increases problem complexity, starting with distance and fleet optimization (Phase A), followed by battery management (Phase B), and culminating in the complete EVRPTW (Phase C). The team reported significant improvements in training stability from a modified proximal policy optimization (PPO) algorithm with phase-specific hyperparameters, value and advantage clipping, and adaptive learning rate scheduling.
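
To show how the battery and time-window constraints interact along a route, here is a small feasibility check under simplifying assumptions that are not taken from the paper: energy use proportional to distance, constant speed, zero service time, waiting allowed before a window opens, and an instantaneous full recharge at stations. All names and parameters are hypothetical.

```python
import math


def route_is_feasible(route, coords, windows, stations, capacity,
                      rate=1.0, speed=1.0):
    """Check one route against battery and time-window constraints.

    route:    node ids, starting and ending at the depot
    coords:   node id -> (x, y)
    windows:  node id -> (earliest, latest); nodes without an entry are free
    stations: set of node ids where the battery is refilled
    """
    charge, clock = capacity, 0.0
    for prev, node in zip(route, route[1:]):
        leg = math.dist(coords[prev], coords[node])
        charge -= rate * leg
        clock += leg / speed
        if charge < 0:
            return False  # battery depleted before reaching the node
        earliest, latest = windows.get(node, (0.0, math.inf))
        if clock > latest:
            return False  # arrived after the time window closed
        clock = max(clock, earliest)  # wait if the vehicle arrives early
        if node in stations:
            charge = capacity  # simplifying assumption: instant full recharge
    return True
```

A check like this is the kind of signal a penalty or masking scheme could be built on, since either violation makes the whole route unusable.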

The results show that this approach narrows the gap between the speed of neural networks and the operational reliability required for real-world applications. The policy network is built on a heterogeneous attention encoder enhanced with global-local attention and feature-wise linear modulation (FiLM) to explicitly capture the distinct characteristics of depots, customers, and charging stations. Trained only on small instances with N=10 customers, the model generalized robustly to unseen instances ranging from N=5 to N=100. It also showed significant performance improvements over standard baseline models on medium-sized problems.

Specifically, the curriculum-based approach achieves high feasibility and competitive solution quality on out-of-distribution instances where traditional DRL baselines consistently fail. The results indicate that the CB-DRL framework copes with the sparse reward signals inherent in EVRPTW and avoids the instabilities caused by frequent constraint violations (such as battery depletion and missed deadlines) that plague standard end-to-end reinforcement learning models. The curriculum decouples learning the routing topology from ensuring feasibility under complex constraints, allowing agents to first learn basic routing before handling battery and time-window constraints. Tests demonstrate that the three-step curriculum allows the neural policy to achieve near-optimal performance and zero-shot generalization on benchmark instances. The objective function, defined as minimizing the total travel distance and fleet size combined through a weighting factor λ, was successfully optimized, demonstrating the framework’s ability to balance cost and efficiency. This work lays the foundation for a more robust and scalable solution to EVRPTW, paving the way for more sustainable and efficient logistics operations.
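
One plausible reading of the objective above is a weighted sum of total travel distance and the number of vehicles used. The sketch below computes that quantity for a set of routes. The additive form, with λ applied to the fleet size, is an assumption, since the article does not give the exact formula, and the function names are hypothetical.

```python
import math


def route_length(route, coords):
    """Euclidean length of a route given as a sequence of node ids."""
    return sum(math.dist(coords[a], coords[b])
               for a, b in zip(route, route[1:]))


def objective(routes, coords, lam):
    """Total distance plus lam times the number of vehicles (assumed form)."""
    total_distance = sum(route_length(r, coords) for r in routes)
    return total_distance + lam * len(routes)
```

A larger λ makes each additional vehicle more expensive, so the policy is pushed toward fewer, longer routes. A smaller λ favours shorter total distance even if that means more vehicles.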

👉 More information
🗞 A curriculum-based deep reinforcement learning framework for electric vehicle route problems
🧠ArXiv: https://arxiv.org/abs/2601.15038


