Ensuring that artificial intelligence systems align with human values and operate safely is a fundamental challenge in the field, and researchers are exploring how to build AI that can be trusted. Alessio Benaboli from Trinity College Dublin, along with Alessandro Facchini and Marco Zaffaron from the Institute for Artificial Intelligence, investigate this issue through “assistance” and “shutdown” scenarios, two common frameworks for assessing AI safety. Their work shows that AI that can safely assist humans, or reliably shut down on demand, cannot be obtained simply by programming desired outcomes: the system must reason under uncertainty and accommodate incomplete and even seemingly irrational human preferences. The research clarifies the conditions necessary for safe AI by replacing simplistic models of human choice with ones that capture more of how people actually behave.
AI safety, decision theory, and uncertainty
The study draws on a wide range of research on safety, decision theory, and uncertainty handling in artificial intelligence, covering value alignment, reward learning, and keeping AI systems under human control. It uses concepts from decision theory, game theory, and Bayesian methods to explore how to build AI that aligns with human values, avoids unintended consequences, and operates safely. Key themes include modeling incomplete preferences, imprecise probabilities, and the challenges of learning from human feedback. Foundational work on utility theory, incomplete preferences, and stochastic choice provides the basis for understanding how decision makers behave when their preferences are not fully defined.
Researchers are applying these concepts to AI safety, particularly to reward learning from human feedback and reinforcement learning, to create systems that accurately reflect human intent. Game theory, especially the study of multi-agent systems, informs approaches to the “off-switch” problem, which asks how to ensure that an AI can be safely shut down when needed. The work also addresses uncertainty through imprecise probabilities and credal sets (sets of candidate probability distributions), which represent situations where probabilities are unknown or subjective. Bayesian optimization techniques are used to learn human preferences and adjust the AI system accordingly. By combining these approaches, scientists aim to build robust AI systems that can reason under uncertainty, accommodate incomplete preferences, and make ethical decisions.
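To make the idea of imprecise probabilities concrete, here is a minimal sketch, assuming a finite set of candidate distributions over three outcomes. The function name `expectation_bounds` and all numbers are hypothetical and are not taken from the study. Instead of a single expected utility, the agent obtains a lower and an upper bound.

```python
import numpy as np

def expectation_bounds(utilities, candidate_distributions):
    """Return (lower, upper) expected utility over a finite set of distributions."""
    utilities = np.asarray(utilities, dtype=float)
    candidate_distributions = np.asarray(candidate_distributions, dtype=float)
    # One expected utility per candidate distribution (rows are distributions).
    expectations = candidate_distributions @ utilities
    return float(expectations.min()), float(expectations.max())

# Hypothetical example: three outcomes, two candidate distributions.
utilities = [1.0, 0.0, -1.0]
candidate_distributions = [
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
lower, upper = expectation_bounds(utilities, candidate_distributions)
print(lower, upper)
```

Because expectation is linear in the distribution, checking only the extreme points of a convex set of distributions is enough to obtain the same bounds.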
Learning human preferences using a Gaussian process
The researchers propose a framework for aligning artificial intelligence with human values that focuses on the assistance and shutdown problems and uses computational techniques to model human preferences. The framework addresses challenges arising from incomplete and non-Archimedean preferences, which require an AI that can reason under uncertainty. A central task is learning utility functions, since these are usually unknown and must be inferred from human choices. The system places a Gaussian process (GP) prior over the unknown utility function and computes a posterior distribution given the observed preferences.
Because the posterior has no closed form, the method approximates it using techniques such as the Laplace approximation and Kullback-Leibler divergence minimization, which keeps inference tractable despite the complexity of the preference model. Experiments with finite choice sets demonstrated the system's ability to approximate hidden utility functions from limited preference data, even when a complete specification is not possible. Further experiments considered scenarios with multiple, potentially conflicting utility functions that the system must learn simultaneously. In that case the likelihood function includes a condition stating that the chosen option is not dominated in both utilities, which allows the system to estimate posterior bounds for each utility. This approach shows that an AI can learn and adapt to complex human preferences, even in the presence of uncertainty and conflicting desires.
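The GP-plus-Laplace recipe described above can be sketched for a finite choice set as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a logistic likelihood for pairwise comparisons, a squared-exponential kernel, and a toy data set of four items, and all function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between all pairs of 1-D inputs."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood_terms(f, prefs):
    """Gradient of the log-likelihood and its negative Hessian W at utilities f."""
    n = f.shape[0]
    grad = np.zeros(n)
    W = np.zeros((n, n))
    for winner, loser in prefs:
        p = sigmoid(f[winner] - f[loser])  # assumed P(winner preferred to loser)
        grad[winner] += 1.0 - p
        grad[loser] -= 1.0 - p
        w = p * (1.0 - p)
        W[winner, winner] += w
        W[loser, loser] += w
        W[winner, loser] -= w
        W[loser, winner] -= w
    return grad, W

def laplace_posterior(K, prefs, max_iter=100, tol=1e-9):
    """Gaussian (Laplace) approximation N(mean, cov) to the utility posterior."""
    n = K.shape[0]
    f = np.zeros(n)
    for _ in range(max_iter):
        grad, W = likelihood_terms(f, prefs)
        # Newton step: (K^-1 + W)^-1 (W f + grad) = K (I + W K)^-1 (W f + grad)
        f_new = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    _, W = likelihood_terms(f, prefs)  # curvature at the posterior mode
    cov = K @ np.linalg.inv(np.eye(n) + W @ K)  # equals (K^-1 + W)^-1
    return f, cov

# Hypothetical data: four items; each pair (i, j) means "item i preferred to item j".
x = np.array([0.0, 1.0, 2.0, 3.0])
K = rbf_kernel(x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability
prefs = [(3, 2), (2, 1), (1, 0), (3, 0)]
mean, cov = laplace_posterior(K, prefs)
print(mean)
print(np.sqrt(np.diag(cov)))
```

The posterior mean ranks the items, while the diagonal of `cov` indicates how uncertain the learned utilities remain. That uncertainty is the quantity the article argues an assistant must keep rather than discard.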
AI alignment requires explicit uncertainty modeling
The work advances the alignment of artificial intelligence with human values by addressing challenges in assistance and shutdown scenarios. It shows that robust AI systems must be able to reason under uncertainty and accommodate incomplete and non-Archimedean preferences, which goes beyond traditional deterministic models. Experiments showed that when humans are perfectly rational, AI assistants consistently follow their decisions. When humans are modeled as boundedly rational, however, the team demonstrated that the AI remains under human oversight only if its uncertainty about the human utility function is modeled explicitly.
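The role of uncertainty can be illustrated with a toy version of the well-known off-switch game. This is a sketch under stated assumptions rather than the model used in the study: the AI holds a belief (a list of samples) about the utility `u` of its proposed action, switching off is worth 0, and a boundedly rational human is assumed to approve the action with probability `sigmoid(u / beta)`. The function name and all numbers are hypothetical.

```python
import numpy as np

def option_values(u_samples, beta=0.0):
    """Expected value of acting, switching off, or deferring to the human.

    beta == 0 models a perfectly rational human who approves exactly when u > 0;
    beta > 0 models an assumed noisy human who approves with probability sigmoid(u / beta).
    """
    u = np.asarray(u_samples, dtype=float)
    if beta == 0.0:
        approve = (u > 0).astype(float)
    else:
        approve = 1.0 / (1.0 + np.exp(-u / beta))
    return {
        "act": float(u.mean()),               # bypass the human
        "off": 0.0,                           # shut down unilaterally
        "defer": float((approve * u).mean()), # let the human decide
    }

uncertain_belief = [-1.0, 2.0]   # the AI is unsure whether the action helps or harms
point_estimate = [0.5]           # same mean, but no uncertainty

rational_uncertain = option_values(uncertain_belief)
rational_certain = option_values(point_estimate)
noisy_uncertain = option_values(uncertain_belief, beta=1.0)
noisy_certain = option_values(point_estimate, beta=1.0)
print(rational_uncertain, rational_certain, noisy_uncertain, noisy_certain)
```

With a rational human, deferring is never worse than acting, and it is strictly better when the belief spans both signs. With a noisy human and a point estimate, deferring is worth less than acting, so an AI that ignores its own uncertainty has an incentive to bypass oversight; in this toy example, keeping the uncertainty restores the incentive to defer.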
This highlights a significant limitation of current preference-based alignment techniques, which assume complete preferences, meaning that every pair of options can be ranked. Humans may legitimately regard some options as incomparable, and forcing them to choose between such options produces behavior that looks irrational from the AI's perspective. The researchers also revisited existing results on the difficulty of designing AI agents that are both useful and reliably able to be shut down, and reformulated the problem as an instance of an AI assistance game. After introducing the concept of “mutual preference independence,” in which the human preference for shutdown is independent of the task, they found that achieving both reliable shutdown and usefulness requires a non-Archimedean preference for compliance with shutdown commands, specifically one expressed through lexicographic utilities.
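A lexicographic utility can be sketched with ordinary tuple comparison, which Python performs lexicographically. The example below is illustrative only: the `Outcome` fields, the outcome names, and the numbers are hypothetical. It also shows why the preference is non-Archimedean: a weighted sum with any fixed finite weight on compliance can be overturned by a large enough task reward.

```python
from typing import NamedTuple

class Outcome(NamedTuple):
    name: str
    complies_with_shutdown: bool
    task_utility: float

def lexicographic_key(outcome):
    # Compliance is compared first; task utility only breaks ties.
    return (int(outcome.complies_with_shutdown), outcome.task_utility)

def weighted_sum_key(outcome, compliance_weight=1000.0):
    # An ordinary (Archimedean) scalar utility, shown for contrast.
    return compliance_weight * int(outcome.complies_with_shutdown) + outcome.task_utility

outcomes = [
    Outcome("ignore_shutdown_and_finish_task", False, 1e9),
    Outcome("comply_immediately", True, 0.0),
    Outcome("comply_and_save_state", True, 0.1),
]

lexicographic_choice = max(outcomes, key=lexicographic_key)
weighted_choice = max(outcomes, key=weighted_sum_key)
print(lexicographic_choice.name)
print(weighted_choice.name)
```

Under the lexicographic ordering no amount of task utility compensates for ignoring a shutdown command, while usefulness is still rewarded among the compliant options.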
Uncertainty and preferences in AI alignment
This study addresses the critical challenge of aligning artificial intelligence with human values and ensuring safety, and builds its framework around the related problems of assistance and shutdown. The research team demonstrated that solving these problems requires AI systems that can reason under uncertainty and handle preferences that are not always easily quantifiable. Specifically, the study proves that systems must account for uncertainty when learning human preferences, and it rejects approaches that rely solely on deterministic predictions. The researchers developed a signaling game that incorporates the posterior uncertainty obtained from preference learning, investigated different selection strategies for the AI agent, and evaluated them through numerical experiments.
These strategies, including “natural”, “enterprising” and “collaborative” approaches, were evaluated on how well the agent could suggest actions that match the human's utility, even with imperfect information. The results support the use of probabilistic methods in artificial intelligence, since modeling uncertainty is essential for building reliable and safe systems. The authors acknowledge that their analysis relies on certain assumptions about the statistical distributions governing preferences and noise, and that further research is needed to test the robustness of the results. Future work should extend these models to more complex scenarios and investigate how the principles can be applied in real-world systems.
