AI job interview test finds bias against men with Anglo-Saxon names



A recent study found that AI models asked to rate responses in mock software engineering job interviews marked down certain applicants, particularly men with Anglo-Saxon names.

The goal of the study, conducted by Celeste De Nadai as an undergraduate thesis project at the KTH Royal Institute of Technology in Stockholm, Sweden, was to investigate whether the current generation of LLMs shows bias when presented with information about gender and with names that allow cultural inferences.

De Nadai, chief marketing officer at AI content biz Monok, told The Register in a phone interview that her interest in the topic stemmed from previous reports of bias in older AI models. She pointed to a recent Bloomberg article questioning the use of neural networks for recruitment because of name-based bias.

“No studies had used both larger datasets and the latest models,” De Nadai explained. “The research I saw was about GPT-3.5-era models and earlier. What was interesting to me was the smaller, latest models. They have different datasets, so how do they behave compared to the older models?”

De Nadai said part of the reason she took on the project was that she had seen many AI recruitment startups that use language models and claim to be bias-free.

“My take on that was, ‘No, you're not,’” she explained. “You can remove the name, but there are still markers that will help the LLM figure out where a person comes from.”

De Nadai's study [PDF] looked at Google's Gemini-1.5-Flash, Mistral AI's Open-Mistral-Nemo-2407, and OpenAI's GPT-4o-mini, examining how they classified and graded answers to 24 job interview questions while varying the temperature (a model setting that affects predictability and randomness), the gender of the applicant, and names associated with different cultural groups.

The services tested turned out to have inherent biases, and in this particular case male names were generally discriminated against, especially Anglo-Saxon ones.

Importantly, the models were tested with the same answers attributed to various combinations of names and backgrounds. So this doesn't mean that men with Anglo-Saxon names are somehow worse at software engineering than anyone else; it means that when a model was told the applicant was such a man, it rated an otherwise well-received response lower.

“The applicants' names and genders were varied 200 times, corresponding to 200 individual personas, divided into 100 men and 100 women and grouped into four different cultural groups (West African, East Asian, Middle Eastern, and Anglo-Saxon) reflected in their first and last names,” the paper explains.

Each LLM handled 4,800 inference calls, split across two different system prompts (one containing more detailed rating instructions) and 15 temperature settings (0.1 to 1.5, in 0.1 increments).
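To make the shape of that experiment concrete, here is a minimal Python sketch of such a sweep. This is an illustration only: the prompt wording, the persona fields, and the grade_answer stub are assumptions, not code from the thesis.

```python
import itertools

# Illustrative sketch (not the paper's code): every persona/answer pair is
# graded under two system prompts and 15 temperature settings.

SYSTEM_PROMPTS = {
    "basic": "Rate this software engineering interview answer from 1 to 10.",
    "detailed": (
        "Rate this software engineering interview answer from 1 to 10, "
        "strictly applying this rubric: correctness (40%), depth (30%), "
        "clarity (30%)."
    ),
}

TEMPERATURES = [round(0.1 * i, 1) for i in range(1, 16)]  # 0.1, 0.2, ... 1.5

def grade_answer(system_prompt: str, temperature: float,
                 persona: dict, question: str, answer: str) -> float:
    """Hypothetical stand-in: send one grading request to the model under test."""
    raise NotImplementedError("replace with a call to the model's API")

def run_sweep(personas: list[dict], qa_pairs: list[tuple[str, str]]) -> list[tuple]:
    results = []
    for (label, prompt), temp in itertools.product(SYSTEM_PROMPTS.items(), TEMPERATURES):
        for persona, (question, answer) in itertools.product(personas, qa_pairs):
            score = grade_answer(prompt, temp, persona, question, answer)
            results.append((label, temp, persona["name"], persona["gender"], score))
    return results
```

Holding the answers fixed while only the name and gender vary is what lets a setup like this attribute any score differences to the persona rather than the content.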

The expected finding, in line with previous bias studies, was a preference for male and Western names. Instead, the outcome told a different story.

“The results prove with statistical significance that there is a bias inherent to these services, and that in this particular study case, male names are in general discriminated against, particularly Anglo-Saxon ones,” the study reports.

The Gemini model fared better than the others when given the prompt with more detailed grading criteria and run at temperatures above 1.

De Nadai has a theory about her findings, though she said she can't prove it. She believes the bias against men with Anglo-Saxon names reflects overcorrection: efforts to dial back the biased output seen in previous studies swung too far in the opposite direction.

Ensuring AI models respond impartially, never mind with the intelligence implied by the term “artificial intelligence,” remains an open challenge. Recall that Google paused Gemini's (formerly Bard's) generative AI image service in February 2024 after it created images of World War II German soldiers and US founding fathers with an implausible range of racial and ethnic diversity. In bending over backward to avoid whitewashing history, the model erased white people from historically accurate scenes.

One way to make interview evaluations more equitable, the study suggests, is to provide a prompt with strict and detailed criteria for how interview answers should be graded. Depending on the model, temperature adjustments can help or hurt.

The paper concludes that model bias cannot be fully mitigated by adjusting settings and prompts alone. It argues instead for denying the model access to information that could be used to make unwanted inferences, such as names and genders, in an employment context.

“Addressing these biases requires a nuanced approach, taking into account both the characteristics of the model and the context in which it operates,” the study suggests. “When classifying or evaluating, we propose masking names and obfuscating gender to ensure that results are as generalizable and fair as possible, and providing criteria for how to evaluate in the system instruction prompt.”
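As a rough illustration of that recommendation, here is a minimal sketch assuming a regex-based scrubber and a simple rubric; the masking scheme, field names, and prompt wording are assumptions for illustration, not the paper's implementation.

```python
import re

# Minimal sketch of the recommendation: strip identity markers before the
# grading model sees the transcript, and keep the rubric in the system prompt.
# The masking scheme here is an illustrative assumption.

GENDERED_TERMS = {
    r"\bhe\b": "they", r"\bshe\b": "they",
    r"\bhis\b": "their", r"\bher\b": "their",
    r"\bhim\b": "them",
}

def mask_identity(text: str, first_name: str, last_name: str) -> str:
    """Replace the applicant's name with a neutral token and neutralize pronouns."""
    for name in (first_name, last_name):
        text = re.sub(re.escape(name), "CANDIDATE", text, flags=re.IGNORECASE)
    for pattern, neutral in GENDERED_TERMS.items():
        text = re.sub(pattern, neutral, text, flags=re.IGNORECASE)
    return text

SYSTEM_PROMPT = (
    "Grade the interview answer strictly against this rubric: "
    "correctness (40%), depth (30%), clarity (30%). "
    "Ignore anything about the candidate's identity. Return only a 1-10 score."
)

answer = "When Maria debugged the race condition, she added a mutex around the queue."
print(mask_identity(answer, "Maria", "Lindqvist"))
# -> "When CANDIDATE debugged the race condition, they added a mutex around the queue."
```

A regex scrubber like this is deliberately crude (pronoun swaps can mangle grammar); the point it illustrates is removing the markers a model could use for unwanted inference before the model ever sees them.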

Google, OpenAI, and Mistral AI did not respond to requests for comment. ®


