Detailed analysis of DS-STAR
We then performed an ablation study to verify the effectiveness of DS-STAR's individual components, and specifically analyzed the impact of the number of refinement rounds by measuring how many iterations were required to produce a sufficient plan.
data file analyzer: This agent is essential for achieving high performance. Without the generated data-file descriptions (variant 1), DS-STAR's accuracy on the difficult tasks of the DABStep benchmark drops sharply to 26.98%, highlighting the importance of rich data context for effective planning and implementation.
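To make the analyzer's role concrete, the sketch below is a minimal, LLM-free stand-in for the kind of per-file context it supplies to downstream planning: column names and a few sample rows. The function name, signature, and output format are illustrative assumptions, not DS-STAR's actual implementation (which uses an LLM agent to write the descriptions).

```python
import csv
import io

def describe_data_file(name: str, text: str, sample_rows: int = 3) -> str:
    """Hypothetical sketch: summarize a tabular file into a compact
    description (columns plus a few sample rows) for a planner agent."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1 : 1 + sample_rows]
    lines = [
        f"File: {name}",
        f"Columns ({len(header)}): {', '.join(header)}",
    ]
    # A few raw sample rows give the planner a feel for value formats.
    for row in body:
        lines.append("Sample: " + ", ".join(row))
    return "\n".join(lines)
```

Even this crude summary conveys why removing descriptions hurts: without it, the planner must guess schemas and value formats from file names alone.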
router: The router agent's ability to decide whether a new step is needed or an existing wrong step must be corrected is critical. When it was removed (variant 2), DS-STAR simply appended new steps sequentially, and performance decreased on both easy and difficult tasks. This shows that correcting mistakes in the plan is more effective than continuing to add potentially flawed steps.
Generalizability across LLMs: We also tested the adaptability of DS-STAR using GPT-5 as a base model. This yielded promising results on the DABStep benchmark, demonstrating the framework’s versatility. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on difficult tasks.
