AI struggles with sarcasm in non-American varieties of English



Emily Morter/Unsplash

In 2018, an Australian colleague asked me, “Hey, how are you?” My response – “I'm on the bus” – was met with a smirk. I had recently moved to Australia, and despite studying English for more than 20 years, it took me a while to become familiar with the variety of English spoken there.

It turns out that large language models – the artificial intelligence (AI) systems behind chatbots such as ChatGPT – face similar problems.

In a new study published in Findings of the Association for Computational Linguistics 2025, my colleagues and I present a new tool for assessing how well various large language models detect sentiment and sarcasm in three varieties of English: Australian English, Indian English and British English.

The results show there is still a long way to go before the promised benefits of AI can be enjoyed by everyone, regardless of which variety of English they speak.

Limited English

Large language models are often reported to achieve state-of-the-art performance on standardised sets of tasks known as benchmarks.

However, the majority of these benchmarks are written in Standard American English. This means the large language models actively marketed by commercial providers are trained and tested primarily on this single variety of English.

This has significant consequences.

For example, a recent study by my colleagues and me found that large language models are more likely to classify text as hate speech when it is written in African American English. They also often “default” to Standard American English, even when the input is in another variety, such as Irish English or Indian English.

Building on this research, we constructed BESSTIE.

What is BESSTIE?

BESSTIE is the first benchmark for sentiment and sarcasm classification across three varieties of English: Australian English, Indian English and British English.

For our purposes, “sentiment” is the emotional tone of a text: positive (“Not bad!” in Australian English) or negative (“I hate movies”). Sarcasm is defined as the use of words that say the opposite of what is meant, typically to express contempt or ridicule (“I love being ignored”).

We collected two types of data to build BESSTIE: Google Maps reviews of locations, and Reddit posts on carefully curated topics. We then applied language-variety predictors – AI models that specialise in detecting which variety of a language a text is written in. We selected only texts predicted to belong to a particular variety with a probability greater than 95%.

These two steps (location filtering and language-variety prediction) ensured the data represented national varieties of English, such as Australian English.
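The variety-prediction step above can be sketched in a few lines of Python. This is purely illustrative: `predict_variety` is a hypothetical stand-in for a real language-variety predictor model, and the names, labels and threshold handling here are assumptions, not the authors' actual code.

```python
# Sketch of filtering texts by predicted language variety.
# `predict_variety` is a hypothetical placeholder for a real
# language-variety predictor model, faked here for illustration.

def predict_variety(text):
    """Return a (variety_label, probability) pair for a text.

    A real predictor would be a trained classifier; this fake one
    keys off a single Australian English marker word ("arvo").
    """
    if "arvo" in text.lower():
        return ("en-AU", 0.98)
    return ("en-US", 0.60)

def filter_by_variety(texts, target="en-AU", threshold=0.95):
    """Keep only texts predicted as the target variety with p > threshold."""
    kept = []
    for text in texts:
        variety, prob = predict_variety(text)
        if variety == target and prob > threshold:
            kept.append(text)
    return kept

reviews = ["See you this arvo, not too bad!", "The movie was great."]
print(filter_by_variety(reviews))
```

Only the first review passes the filter, since it is predicted as Australian English with probability above the 95% cutoff.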

Next, we evaluated nine powerful, freely available large language models, including RoBERTa, Mistral, Gemma and Qwen.

Inflated claims

Overall, we found the large language models we tested performed better on Australian and British English (both “native” varieties of English) than on non-native Indian English.

We also found that large language models are better at detecting sentiment than sarcasm.

Sarcasm is particularly challenging, both as a linguistic phenomenon and as a problem for AI. For example, the models detected sarcasm in Australian English with only 62% accuracy. The figure was roughly 57% for Indian and British English.

These results are much lower than the performance claimed by the technology companies developing large language models. GLUE, for example, is a leaderboard that tracks how well AI models perform on sentiment classification of American English text.

The highest score there is 97.5%, achieved by the model Turing ULR v6; RoBERTa (from our model suite) also scores highly. Both models perform far better on American English than they do on Australian, Indian and British English in our observations.

National context matters

As more and more people around the world use large language models, researchers and practitioners are waking up to the fact that these tools need to be evaluated for particular national contexts.

For example, earlier this year the University of Western Australia launched a project with Google to improve how well large language models handle Aboriginal English.

Our benchmark can be used to evaluate future large language models on their ability to detect sentiment and sarcasm. We are also currently working on a project applying large language models in hospital emergency departments, where supporting a wide range of English varieties will help serve diverse patients.

The Conversation

The study, led by Dipankar Srirag, was funded by a grant from Google's Research Scholar Program awarded to Aditya Joshi and Diptesh Kanojia in 2024.



