First published on Strange Loop Canon, April 23, 2024.
On goal drift and lower reliability. Or, why can't LLMs play Conway's Game of Life?
Over the past few years, every time we came up with a problem that LLMs could not do, they went on to pass it with flying colors. But even as they pass these tests, they still cannot answer some questions that seem simple, and it is unclear why.
So for the past few weeks, I've been obsessed with figuring out the failure modes of LLMs. This piece started as an exploration of what I found. It's admittedly a little wonky, but I think it's interesting. The failures of AI can teach us more about what it can do than its successes.
The starting point was broader: the need for task-by-task evaluations of the many jobs that LLMs will eventually end up doing. But then I started asking myself how we could find the limits of their ability to reason, so that we can trust their ability to learn.
As I've written many times, LLMs are hard to evaluate, and it is difficult to separate their ability to reason from what they were trained on. So I wanted to find a way to test their ability to reason iteratively and answer questions.
