On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

Machine Learning


As large language models (LLMs) are deployed more widely, one concern is that they can be misused to generate harmful content. Our work studies the alignment challenge, focusing on filters that prevent the generation of unsafe information. Two natural points of intervention are filtering the input prompt before it reaches the model and filtering the output after it is generated. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which no efficient prompt filter exists: adversarial prompts that elicit harmful behavior can be constructed easily and are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results hold under cryptographic hardness assumptions. In addition to these core findings, we formalize and study relaxed mitigation approaches and demonstrate further computational barriers. We conclude that safety cannot be achieved by designing filters that are external to the internals of the LLM (its architecture and weights); in particular, black-box access to the LLM is not sufficient. Based on our technical results, we argue that the intelligence of an aligned AI system cannot be separated from its judgment.


