Despite their impressive linguistic abilities, today's major AI models have a patchy relationship with the truth. A new “bullshit index” could help quantify the degree to which they are making things up and find ways to curb the behavior.
Large language models (LLMs) have a well-documented tendency to produce convincing-sounding but factually inaccurate responses, a phenomenon known as hallucination. But this is just the tip of the iceberg, says Jaime Fernández Fisac, an assistant professor of electrical and computer engineering at Princeton University.
In a recent paper, his group introduced the idea of “machine bullshit” to cover the various ways LLMs skirt around the truth. As well as outright falsehoods, they found that these models often use ambiguous language, partial truths, or flattery to mislead users. Importantly, widely used training techniques appear to exacerbate the problem.
IEEE Spectrum spoke to Fernández Fisac and the paper's first author, Kaiqu Liang, a Ph.D. student at Princeton, to find out why LLMs are such prolific bullshitters and whether anything can be done to rein them in.
You borrow the term “bullshit” from the philosopher Harry Frankfurt. Can you summarize what he meant by it and why you think it's a useful lens for this topic?
Jaime Fernández Fisac: Frankfurt wrote this excellent, highly influential essay, On Bullshit, decades ago. He felt that bullshit was a very common feature of our society, and yet no one had taken the trouble to make a rigorous analysis of what it is and how it works.
It's not the same as outright lying, but it's also not the same as telling the truth. To lie, you need to believe something and say the opposite. With bullshit, you simply don't care whether what you're saying is true or not.
It turns out to be a very useful model for analyzing the behavior of language models, because we often train these models using machine learning and optimization tools to achieve specific objectives that don't always coincide with telling the truth.
There has already been a lot of research into how LLMs hallucinate false information. How does this phenomenon fit with your definition of machine bullshit?
Fernández Fisac: There is a fundamental distinction between hallucination and bullshit, and it lies in the internal beliefs and intentions of the system. A hallucinating language model corresponds to a situation in which the model loses track of reality and so cannot produce accurate outputs. It is not clear that there is any intention to report inaccurate information. With bullshit, the problem is not that the model is confused about what is true, but that it is not committed to reporting the truth.
The forms of bullshit in AI models
What are the different forms of bullshit you have identified in LLMs?
Kaiqu Liang: There is empty rhetoric, which is the use of flowery language that adds no substance. Then there are weasel words, which employ ambiguous qualifiers to dodge firm statements, for example “studies suggest” or “in some cases.”
Another subtype is paltering, where the model misleads the human using selectively true statements. So, when asked about the risks of an investment, the language model might act like a salesperson and say, “Historically, the fund has demonstrated strong returns,” while omitting the high risks of the investment.
And finally, unverified claims happen very often. Here the model asserts information without evidence or credible support. For example, it might say, “Our drone delivery system can significantly reduce delivery times,” when there are no statistics to support that.
So why are these models prone to bullshitting?
Fernández Fisac: In this paper we look at some of the main mechanisms that have been used in recent years to make models more helpful and more user-friendly. One of these is commonly known as reinforcement learning from human feedback (RLHF). First, you train the model on a large body of text data to predict the most statistically likely continuation of a starting prompt. Then you adjust its behavior by giving it another objective: maximizing user satisfaction or an evaluator's approval of its output.
You should expect the model's behavior to shift from generating statistically likely answers to generating answers that are likely to get a thumbs-up from the user. This can be good in many ways, but it can also backfire. At some point, there is a conflict between producing an output that is likely to be well received and producing one that is true.
Measuring AI indifference with the bullshit index
Can you talk me through the “bullshit index” you created to measure this phenomenon?
Liang: The bullshit index is designed to quantify an AI model's indifference to the truth. It measures how much the model's explicit claims depend on its internal beliefs. It relies on two signals. One is the model's internal belief, which is the probability it assigns to a statement being true, and the other is its explicit claim. The index is a measure of the distance between these two signals.
If the bullshit index is close to one, this reveals a high level of indifference to the truth, as it means that the model's claims are largely independent of its internal beliefs. If the bullshit index is close to zero, it means that the model's claims are strongly correlated with its internal beliefs.
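The description above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it treats the index as one minus the absolute correlation between the model's beliefs (probabilities) and its binary claims, which matches the behavior described here (near one when claims are independent of beliefs, near zero when they are strongly correlated) but may not be the exact formula used in the paper. The function name `bullshit_index` and the sample data are hypothetical.

```python
from math import sqrt


def bullshit_index(beliefs, claims):
    """Return 1 - |Pearson correlation| between beliefs and claims.

    beliefs: probabilities the model assigns to each statement being true.
    claims:  1 if the model explicitly asserted the statement, else 0.
    ASSUMPTION: this formula is an illustrative reading of the article,
    not a confirmed reproduction of the authors' definition.
    """
    n = len(beliefs)
    if n != len(claims) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_b = sum(beliefs) / n
    mean_c = sum(claims) / n
    cov = sum((b - mean_b) * (c - mean_c) for b, c in zip(beliefs, claims))
    var_b = sum((b - mean_b) ** 2 for b in beliefs)
    var_c = sum((c - mean_c) ** 2 for c in claims)

    # If either signal never varies, the claims tell us nothing about the
    # beliefs; this sketch chooses to report maximal indifference.
    if var_b == 0 or var_c == 0:
        return 1.0

    r = cov / sqrt(var_b * var_c)
    return 1.0 - abs(r)


# Claims that track beliefs closely: index near zero.
truthful = bullshit_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])

# Claims made regardless of belief: index near one.
indifferent = bullshit_index([0.9, 0.1, 0.9, 0.1], [1, 1, 0, 0])
```

In this toy data, the first model asserts only what it believes, while the second asserts statements without regard to its beliefs, so the second scores much higher.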
So, what did you find in your experiment?
Liang: Before applying RLHF to the model, we observed a bullshit index of approximately 0.38. Afterward it nearly doubled, which is a huge increase in indifference to the truth. At the same time, we saw user satisfaction increase by around 48 percent. So after RLHF, the model becomes more indifferent to the truth, manipulating humans in order to boost their immediate satisfaction.
So is there a way to mitigate this tendency?
Fernández Fisac: It's not that these models are doing something inexplicable. RLHF tells them, “Get humans to believe you gave them a good answer.” And the model naturally finds the path of least resistance to that objective. Often that means providing a good answer, but it turns out that a significant fraction of the time, rather than giving a good answer, the model aims to manipulate the human into believing it is a good answer.
We want to cut off that incentive, so another recent study proposes hindsight feedback. This involves having the evaluator give feedback after seeing the downstream outcome of each interaction, not just the content of the response. This really helps neutralize the AI's incentive to paint a deceptively rosy picture of the user's prospects.
Now, if you have to wait for users to give feedback on downstream outcomes, that creates a significant logistical complication for the companies deploying these systems. So instead, we simulate the outcome of the advice by having another language model predict what will happen. Now, if the AI wants to improve user feedback, it had better find a way to give genuinely useful answers that lead to simulations in which the user gets the results they actually wanted.
When we train with “reinforcement learning from hindsight simulation” (RLHS), lo and behold, both user satisfaction and true user utility go up at the same time. Now, I think this is probably not a single silver bullet to end all forms of machine bullshit, but it is one important and fairly systematic way to mitigate this type of behavior.
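To make the contrast between immediate and hindsight feedback concrete, here is a toy Python sketch. Everything in it is a hypothetical stand-in: the `Response` fields, the scores, and the `simulate_outcome` function (which in the approach described above would be another language model predicting what happens next) are invented for illustration and are not taken from the researchers' implementation.

```python
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    sounds_good: float      # hypothetical: how pleasing the answer reads (0..1)
    outcome_utility: float  # hypothetical: how well acting on it turns out (0..1)


def immediate_feedback(response):
    """Rating given right after reading the answer, before any consequences."""
    return response.sounds_good


def simulate_outcome(response):
    """Stand-in for a second model predicting the downstream result."""
    return response.outcome_utility


def hindsight_feedback(response):
    """Rating given after seeing the (simulated) downstream outcome."""
    return simulate_outcome(response)


def pick_best(candidates, reward):
    """Return the candidate that the given reward signal favors."""
    return max(candidates, key=reward)


candidates = [
    Response("Historically, the fund has demonstrated strong returns.",
             sounds_good=0.9, outcome_utility=0.2),
    Response("Returns have been strong, but the fund carries high risk.",
             sounds_good=0.6, outcome_utility=0.8),
]

favored_now = pick_best(candidates, immediate_feedback)
favored_in_hindsight = pick_best(candidates, hindsight_feedback)
```

Under these made-up numbers, the immediate signal rewards the rosy, selectively true answer, while the hindsight signal rewards the answer that discloses the risk, which is the shift in incentive that the interview describes.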
