If AI is going to rule the world, why can’t it solve the Spelling Bee?

My task for our AI overlords was simple: help me crack The New York Times Spelling Bee.

I spent the better part of a Saturday night trying to shape the letters G, Y, A, L, P, O, and N into as many words as I could. But after three hours, 141 points, and 37 words (including “nonapology,” “lagoon,” and “analogy”), I hit a wall. I was just a few words away from catapulting myself into the Spelling Bee's “genius” tier, a title given to those who have unscrambled 70% of all possible words using a given set of letters, the point at which the puzzle is considered effectively solved. (Anyone who solves 100% is given the title of “Queen Bee,” but even I have my limits.)

My human brain was clearly struggling, but this task seemed like child's play for an AI, so I fired up ChatGPT, told it I wanted to win the Spelling Bee, gave it the letters, and explained the rules. “Give me a list of all the words you can make with them,” I typed. “You can repeat letters, but each word must contain an 'N'.”

After I pressed Enter, ChatGPT responded with the following:

Below is a list of words that are formed with the letters “GYALPON” and always contain the letter “N”.
GlnPaYo
YPNL
Pignoil
Onapilgu
Lugnapio
Henri
Naigoppur
Japan
AOGPLNY
Naoguri
There are many possible combinations, so this is just a small list.

Not a very good start.

“The words must be in the dictionary,” I typed back. “Of course,” came the reply. “I'll provide a list of valid English words based on the letters you gave me.” The list that followed: “Angry, gallon, lang.”

Obviously this wasn't working, so I tried Microsoft's Copilot (“Yang,” “playing,” “planning,” “annoying”), Google's Gemini (“Gapon,” “Gon,” “Jean”), and Anthropic's Claude (“Mango,” “in progress,” “Lawn 17. Ray”). Meta AI told me that it had made sure to include only words recognized by dictionaries, in a list that contained “Narip” and “Nagi.” Meanwhile, Perplexity, a chatbot with ambitions to beat Google Search, simply wrote “Gal” hundreds of times before suddenly freezing.

Perplexity, a chatbot with ambitions to beat Google Search, malfunctioned when asked to create words from a set of letters. (Screenshot: Pranav Dixit/Engadget)

AI can now create images, videos, and audio as fast as you can type in a description of what you want. It can write poems, essays, and term papers. It can also be a poor imitation of your girlfriend, therapist, or personal assistant. And many believe AI will take over human jobs through automation and change the world in ways we can't even imagine. So why is AI so bad at solving simple word puzzles?

The answer lies in how large language models, the underlying technology behind the modern AI boom, work. Computer programming is traditionally logical and rule-based: you give a computer commands, it carries them out according to a set of instructions, and it produces a valid output. But machine learning, of which generative AI is a subset, is different.

“This is purely statistical,” Noah Giansiracusa, a professor of mathematics and data science at Bentley University, told me. “It's really about extracting patterns from the data and generating new data that roughly fits those patterns.”
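Giansiracusa's description can be made concrete with a toy example. The sketch below is not how a large language model works internally; it is a far smaller, letter-level stand-in, and the function names `learn_patterns` and `generate` are invented for illustration. It records which letter tends to follow which in three words from this story, then samples new strings that fit those patterns.

```python
import random
from collections import defaultdict

# Tiny training set, taken from the words mentioned earlier in this story.
TRAINING_WORDS = ["analogy", "lagoon", "nonapology"]


def learn_patterns(words):
    """Record, for each letter, the letters observed to follow it."""
    followers = defaultdict(list)
    for word in words:
        padded = "^" + word + "$"      # ^ marks the start, $ marks the end
        for current, nxt in zip(padded, padded[1:]):
            followers[current].append(nxt)
    return followers


def generate(followers, rng, max_length=12):
    """Sample one string by repeatedly picking a plausible next letter."""
    letters = []
    current = "^"
    while len(letters) < max_length:
        current = rng.choice(followers[current])
        if current == "$":
            break
        letters.append(current)
    return "".join(letters)


patterns = learn_patterns(TRAINING_WORDS)
rng = random.Random(0)                 # fixed seed so runs are repeatable
for _ in range(5):
    print(generate(patterns, rng))
```

Every string it prints follows letter patterns seen in the training words, but nothing checks the result against a dictionary, so the output can look word-like without being a word.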

OpenAI did not respond on the record, but a company spokesperson told me that this kind of “feedback” helps the model improve its understanding and its ability to tackle problems. “Word structure, anagrams, etc. are not common use cases for Perplexity, so our models are not optimized for them,” Perplexity spokesperson Sara Platnick told me. “As someone who plays Wordle/Connections/Mini Crossword every day, I'm excited to see what we come up with!” Microsoft and Meta declined to comment. Google and Anthropic had not responded at press time.

At the heart of large language models are “Transformers,” a technological breakthrough made by Google researchers in 2017. When you enter a prompt, a large language model breaks down words or parts of words into mathematical units called “tokens.” The Transformer analyzes each token in the context of the large datasets the model was trained on, so it can see how tokens relate to each other. Once the Transformer understands these relationships, it can respond to the prompt by inferring the next likely token in a sequence. If you're interested, the Financial Times has a great animated explainer detailing all of this.
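To make the idea of tokens concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `TOY_VOCAB`, its numeric IDs, and the `tokenize` function are all invented for illustration; real models learn much larger subword vocabularies from data and use more sophisticated splitting rules.

```python
# Toy greedy longest-match tokenizer. The vocabulary and IDs are invented
# for illustration; real models learn their subword vocabularies from data.
TOY_VOCAB = {"spell": 0, "ing": 1, "lag": 2, "oon": 3, "an": 4, "alogy": 5}


def tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest known pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return pieces


for word in ["lagoon", "analogy", "spelling"]:
    pieces = tokenize(word)
    ids = [TOY_VOCAB[p] for p in pieces]
    print(word, pieces, ids)
```

In this toy setup, “lagoon” reaches the model as two IDs for the pieces “lag” and “oon” rather than as six separate letters, so a rule about individual letters is not directly visible in the input.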

I thought I had given the chatbots precise instructions for generating Spelling Bee words, but all they did was convert my words into tokens and use a Transformer to spit out plausible responses. “It's not like computer programming or typing commands into a DOS prompt,” Giansiracusa says. “Your words are converted into numbers that are then processed statistically.” A purely logic-based query seemed like the worst use of an AI's skills; it was like trying to turn a screw with a resource-intensive hammer.
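For contrast, the rule-based approach takes only a few lines of traditional code. This is a minimal sketch, not anything the chatbots run: the function name `spelling_bee_words` and the short `WORDS` list are assumptions made for illustration, and a real solver would load a full dictionary file. It also assumes the puzzle's four-letter minimum, which the rules quoted earlier do not mention.

```python
# Hypothetical rule-based Spelling Bee checker (illustrative sketch only).
LETTERS = set("gyalpon")   # the seven letters from the puzzle in this story
REQUIRED = "n"             # every word must contain this letter
MIN_LENGTH = 4             # assumption: the puzzle's usual four-letter minimum

# Stand-in word list; a real solver would read a full dictionary instead.
WORDS = ["analogy", "lagoon", "nonapology", "japan", "henri", "play", "gal", "angry"]


def spelling_bee_words(words, letters=LETTERS, required=REQUIRED, min_length=MIN_LENGTH):
    """Return the words that obey every rule, in the order given."""
    valid = []
    for word in words:
        w = word.lower()
        if len(w) < min_length:
            continue          # too short
        if required not in w:
            continue          # missing the mandatory letter
        if not set(w) <= letters:
            continue          # uses a letter outside the allowed set
        valid.append(w)
    return valid


print(spelling_bee_words(WORDS))
```

Because each rule is an explicit check, a word such as “japan” is rejected for a stated reason (it uses letters outside the set) rather than being judged merely plausible or implausible.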

The success of an AI model also depends on the data used to train it. That's why AI companies are now eager to sign deals with news publishers: the fresher the training data, the better the responses. Generative AI, for example, is bad at suggesting chess moves, but it is at least marginally better at that task than at solving word puzzles. Giansiracusa points out that there are tons of chess games available on the internet that are almost certainly included in the training data for existing AI models. “I suspect there just aren't enough annotated Spelling Bee games online that an AI could train on,” he said.

“If your chatbot seems more confused by wordplay than a cat solving a Rubik's Cube, it's because it hasn't been specifically trained to perform complex wordplay,” says Sandi Besen, an artificial intelligence researcher at Neudesic, an AI company owned by IBM. “Wordplay has specific rules and constraints that the model will have a hard time following unless specifically instructed to do so during training, fine-tuning, and prompting.”

Despite all this, the world's leading AI companies continue to promote the technology as a panacea and make exaggerated claims about its capabilities. In April, both OpenAI and Meta boasted that their new AI models would be capable of “reasoning” and “planning.” In an interview, OpenAI Chief Operating Officer Brad Lightcap told the Financial Times that the next generation of GPT, the AI model that powers ChatGPT, would show progress in solving “hard problems” like reasoning. Joelle Pineau, Meta's vice president of AI research, told the publication that the company is “working hard to figure out how to make these models not just talk, but actually reason, plan, and have memory.”

My repeated attempts to get GPT-4o and Llama 3 to solve the Spelling Bee failed miserably. When I told ChatGPT that its suggestions (“Garon,” “Lang,” and “Angry”) were not in the dictionary, it offered “Galvanopy” instead. And when I mistakenly typed “sure” as “sur” in response to Meta AI's offer to come up with more words, the chatbot told me that “sur” was indeed another word that can be formed with the letters G, Y, A, L, P, O, and N.

Clearly, we are still far from achieving artificial general intelligence, a vague concept describing the point at which machines can perform most tasks as well as or better than humans. Some experts, like Yann LeCun, chief AI scientist at Meta, have been outspoken about the limitations of large language models, arguing that they will never reach human-level intelligence because they don't really use logic. At an event in London last year, LeCun said that the current generation of AI models “don't understand how the world works. They can't make plans. They can't really reason.” He added: “We do not have fully autonomous self-driving cars that can learn to drive in about 20 hours of practice, something a 17-year-old can do.”

But Giansiracusa strikes a more cautious tone: “We don't really know how humans reason. We don't know what intelligence actually is. We don't know if my brain is just a giant statistical computer, a kind of more efficient version of a large language model.”

Perhaps the key to living with generative AI without succumbing to hype or anxiety is to simply understand its inherent limitations. “These tools are not really designed for many of the things people are using them for,” said Chirag Shah, a professor of AI and machine learning at the University of Washington, who co-authored a high-profile research paper in 2022 that criticized the use of large language models in search engines. Shah thinks tech companies could be more transparent about what their AI can and can’t do before forcing it on us. But that ship may have already sailed. In recent months, the world’s largest tech companies (Microsoft, Meta, Samsung, Apple and Google) have all declared that they will tightly weave AI into their products, services and operating systems.

“The bots fail because they weren't designed for this,” Shah told me of my wordplay conundrum. It remains to be seen whether they also fail at all the other challenges tech companies are throwing at them.

Has an AI chatbot ever let you down? Email me and let me know!

Update, June 13, 2024 at 4:19 PM ET: This story has been updated to include a statement from Perplexity.
