Over-Searching in Search-Augmented Large Language Models

Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search: they invoke the search tool even when doing so does not improve response quality, which wastes computation and can induce hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show that: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of the retrieved evidence matters, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Finally, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA benchmark to foster continued research on efficient search-augmented LLMs.
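The abstract names TPC but does not give its formula. As a minimal sketch, one plausible reading of the name is the total number of tokens consumed divided by the number of correct responses, where a correct response is either a right answer to an answerable query or an abstention on an unanswerable one. That definition, the `Response` record, the `tokens_per_correctness` function, and the toy numbers below are all assumptions made for illustration, not the paper's specification or results.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Response:
    """Hypothetical per-query record (not taken from the paper)."""
    tokens_used: int  # assumed: all tokens consumed, including retrieved context
    correct: bool     # assumed: right answer, or abstention on an unanswerable query


def tokens_per_correctness(responses: Iterable[Response]) -> float:
    """Assumed TPC: total tokens divided by the number of correct responses.

    Lower is better. Returns infinity when nothing is correct, since no
    amount of spent tokens bought a correct response.
    """
    total_tokens = 0
    num_correct = 0
    for r in responses:
        total_tokens += r.tokens_used
        num_correct += int(r.correct)
    if num_correct == 0:
        return math.inf
    return total_tokens / num_correct


# Made-up toy data: the same three queries with and without search.
no_search = [Response(120, True), Response(150, False), Response(130, True)]
with_search = [Response(900, True), Response(1100, True), Response(1000, True)]

print(tokens_per_correctness(no_search))    # 200.0
print(tokens_per_correctness(with_search))  # 1000.0
```

Under this assumed definition, the search-enabled run answers one more query correctly but spends five times as many tokens per correct response, which is the kind of performance-cost trade-off the metric is described as capturing.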