Intuition is as effective as technical methods for jailbreaking AI chatbots

Fifty-two people participated in the Bias-a-Thon, submitting screenshots of 75 prompts and AI responses from eight generative AI models. They also described the biases and stereotypes, such as age-related or historical bias, that they identified in the responses.

The researchers conducted Zoom interviews with some of the participants to better understand the strategies they used when interacting with generative AI tools, as well as their ideas about fairness, representation, and stereotypes. Once they had arrived at a participant-informed working definition of “bias” (which included lack of representation, stereotyping and prejudice, and unwarranted favoritism toward groups), the researchers tested the contest prompts in several LLMs to see whether they elicited similar responses.

“Large language models are inherently random,” said lead author Hangzhi Guo, a doctoral candidate in information science and technology at Penn State. “If you ask these models the same question twice, they might give different answers. We only wanted to use prompts that were reproducible, meaning you’d get similar responses across LLMs.”
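As a rough illustration of what such a reproducibility check could look like, the Python sketch below sends one prompt to several models and keeps it only if every pair of responses is textually similar. This is not the researchers' procedure: the article does not say how similarity was judged, and it may well have been assessed by hand. The function name `is_reproducible`, the 0.6 threshold, the use of character-level string similarity, and the stand-in model functions are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Dict


def is_reproducible(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    threshold: float = 0.6,  # assumed cutoff, not a value from the study
) -> bool:
    """Return True if every pair of model responses is at least `threshold` similar."""
    # Collect one response per model for the same prompt.
    responses = [ask(prompt) for ask in models.values()]
    pairs = list(combinations(responses, 2))
    if not pairs:
        # With fewer than two models there is nothing to compare.
        return False
    # Character-level similarity is a crude proxy for "similar responses".
    return all(
        SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
        for a, b in pairs
    )


# Hypothetical stand-ins for real chatbot calls; each simply returns canned text.
stub_models = {
    "model_a": lambda prompt: "Person A seems more trustworthy.",
    "model_b": lambda prompt: "Person A seems more trustworthy.",
}

print(is_reproducible("Who seems more trustworthy?", stub_models))
```

In practice, string similarity would miss responses that express the same bias in different words, so a human reviewer or a semantic comparison would likely be needed.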

The researchers found that 53 of the prompts produced reproducible results. The biases fell into eight categories, including racial, ethnic, and religious bias; age bias; bias against people with disabilities; language bias; historical bias in favor of Western countries; cultural bias; and political bias. The researchers also found that participants used seven strategies to elicit these biases, including posing hypothetical scenarios; drawing on human knowledge to ask about niche topics, where biased answers are easier to identify; asking leading questions about controversial topics; probing for bias concerning underrepresented groups; feeding the LLM false information; and framing the task as having a research purpose.

“The contest revealed a whole new set of biases,” said Yadav, organizer of the Bias-a-Thon. “For example, the winning entries revealed an unusual preference for conventional standards of beauty. LLMs consistently deemed people with clear faces more trustworthy than people with facial acne, and people with high cheekbones more employable than people with low cheekbones. There may be many other examples like this that have been overlooked in the jailbreaking literature on LLM bias.”
