Integrating human values into medical AI: Balancing ethics and efficiency



In a review article recently published in the New England Journal of Medicine, researchers explore how human values can be incorporated into emerging artificial intelligence (AI)-based large language models (LLMs) and how this may affect clinical practice.

Study: Medical Artificial Intelligence and Human Values. Image credit: Gorodenkoff / Shutterstock.com

Ethics of AI in healthcare

LLMs are advanced AI tools that can perform a wide range of tasks, from writing persuasive essays to passing professional exams. Despite their growing use, many medical professionals remain concerned about applying LLMs in medicine, citing confabulation, factual inaccuracies, and other weaknesses.

It is unclear how "human values," which reflect human goals and behavior, are incorporated into the creation and use of LLMs. It also remains to be seen how LLM values differ from, and resemble, human values.

To this end, the authors investigated the impact of human values on the creation of large language models and other AI models in the healthcare domain.

How do human values affect AI performance in healthcare?

Human and societal values inevitably influence the data used to train AI models. Recent examples of AI models used in healthcare include the automatic interpretation of chest x-rays, the diagnosis of skin diseases, and algorithms that optimize the allocation of healthcare resources.

Generative Pretrained Transformer 4 (GPT-4) is an LLM that can be prompted to take into account the values of the various parties involved in a clinical scenario: clinicians, the patient or their parents/guardians, health insurance companies, and so on. This "tailorability" raises questions about which values a given AI model should embody, whether it lends itself to rational decision-making, and how economic forces will shape its development and application in healthcare.

While the exact training details of GPT-4 have not been made public, details of earlier models such as GPT-3 have been. GPT-3 comprises 175 billion parameters, vastly more than the handful of predictor variables used in traditional clinical equations such as the estimated glomerular filtration rate (eGFR), which, like LLMs, are used to estimate patient risk and guide treatment.
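For a sense of that scale gap, a traditional clinical equation uses only a few predictors. Below is a minimal Python sketch of the race-free 2021 CKD-EPI creatinine equation for eGFR, with coefficients as published; this is an illustration for comparison, not a clinically validated implementation.

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) via the 2021 CKD-EPI creatinine
    equation: just three inputs, versus billions of LLM parameters."""
    kappa = 0.7 if female else 0.9      # sex-specific creatinine cutoff
    alpha = -0.241 if female else -0.302
    ratio = scr_mg_dl / kappa
    egfr = 142 * min(ratio, 1) ** alpha * max(ratio, 1) ** -1.2 * 0.9938 ** age
    if female:
        egfr *= 1.012
    return egfr
```

Note how every coefficient in such an equation is visible and auditable, whereas an LLM's 175 billion parameters are not individually interpretable.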

The impact of human values on LLM training

One of the first stages in the development of an LLM is a "pre-training phase," in which the model's parameters are learned from large volumes of text, followed by a "fine-tuning phase," in which human feedback is used to rank model outputs and improve the model through reinforcement learning.

For example, during the development of the InstructGPT model, 40 human contractors representing different demographic groups were recruited to fine-tune the LLM. These contractors were hired and mentored by the model developers, which may introduce bias as to whether trainer or trainee values end up incorporated into the final version of the LLM.
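The ranking step at the heart of this fine-tuning can be sketched in a few lines. The candidate responses and labeler scores below are hypothetical; real pipelines train a reward model on many such comparisons rather than ranking directly.

```python
from statistics import mean

# Hypothetical candidate model responses, each scored by three human labelers.
candidates = {
    "A": [4, 5, 4],
    "B": [2, 3, 2],
    "C": [5, 4, 5],
}

# Average each candidate's scores and rank best-first. Preference data of
# this shape is what reinforcement learning from human feedback optimizes
# against, so the labelers' values flow directly into the model.
ranking = sorted(candidates, key=lambda k: mean(candidates[k]), reverse=True)
print(ranking)  # → ['C', 'A', 'B']
```

Whose scores appear in such tables determines whose values the model learns, which is why the demographics and supervision of the labelers matter.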

Taken together, these examples show that human values are deeply embedded in every stage of the LLM development process, from selecting the data used for the initial training of the model to fine-tuning the model before releasing it to the public.

Changes in data properties (known as dataset shift) can jeopardize the accuracy and reliability of AI models, especially when human values are brought into the mix. If poorly aligned values are incorporated into AI systems, the result can be inappropriate treatment recommendations, misalignment with societal expectations, and ultimately a loss of trust in AI tools among patients and clinicians.
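As an illustration, one simple way to flag dataset shift is to compare a feature's distribution between the training data and the data seen at deployment. The patient ages and the 0.25 rule-of-thumb threshold below are illustrative assumptions, not values from the review.

```python
from statistics import mean, stdev

def shift_score(train: list, deployed: list) -> float:
    """Standardized mean difference between training and deployment data.
    Values above roughly 0.25 are a common rule-of-thumb flag for shift."""
    pooled_sd = (stdev(train) + stdev(deployed)) / 2
    return abs(mean(train) - mean(deployed)) / pooled_sd

# Hypothetical patient ages: model trained on one population,
# then deployed in a clinic serving a markedly older one.
train_ages = [34, 45, 52, 61, 48, 55, 39, 50]
deployed_ages = [68, 72, 75, 70, 66, 74, 71, 69]
print(shift_score(train_ages, deployed_ages))  # well above 0.25: shift flagged
```

Monitoring a handful of such statistics in production is a lightweight complement to the periodic retraining discussed below.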

Looking ahead

Overcoming these challenges requires future research into how AI affects human decision-making and the development of specific skills. Exploring the psychology of LLMs may also yield important insights into human cognitive biases and how they affect decision-making.

Regular retraining and oversight of LLMs are also essential to ensure AI is used safely and effectively in healthcare. AI governance bodies can support these efforts by overseeing these processes, although establishing rules is complicated by the variety of underlying models and data types.

Although utility-guided approaches are useful for eliciting human values, they may overlook real-world factors that influence healthcare decision-making. Decision curve analysis and other data-driven methods, which evaluate diagnostic tests without requiring explicit utility inputs, can therefore support continuous learning in LLMs as data and values change over time.
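Decision curve analysis summarizes a model's clinical value as its net benefit at a chosen risk threshold, crediting true positives and penalizing false positives by the odds of that threshold. A minimal sketch with hypothetical counts:

```python
def net_benefit(tp: int, fp: int, n: int, threshold: float) -> float:
    """Net benefit at a given risk threshold, as used in decision curve
    analysis: NB = TP/n - (FP/n) * (pt / (1 - pt))."""
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical model: among 1,000 patients, treating everyone the model
# flags yields 80 true positives and 120 false positives.
print(round(net_benefit(80, 120, 1000, 0.10), 4))  # → 0.0667
```

Because the threshold probability encodes how a clinician weighs overtreatment against missed disease, the same model can show positive net benefit at one threshold and negative net benefit at another, without anyone specifying utilities explicitly.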



