Why video games still plague AI models



As large language models (LLMs) have rapidly improved, the benchmarks used to test them have evolved too, adding more complex problems to challenge the latest models. But LLMs haven’t improved in every area, and one task remains elusive: LLMs are still very bad at playing video games.

While a few models have managed to beat specific games (for example, Gemini 2.5 Pro beat Pokémon Blue in May 2025), these exceptions prove the rule. The AI that eventually won took far longer to complete the game than a typical human player, made odd mistakes, and required custom software to guide its interactions with the game.

Julian Togelius, director of New York University’s Game Innovation Lab and co-founder of the AI game testing company Modl.ai, explored the implications of LLMs’ limitations in video games in a recent paper. He spoke with IEEE Spectrum about what a lack of video game skills can tell us about the broader state of AI in 2026.

LLMs are rapidly improving at coding, and your paper frames coding as a kind of well-behaved game. What do you mean by that?

Julian Togelius: Coding is very well-behaved in the sense that you have tasks. These are like levels. You get a specification, you write the code, and then you run it.

Rewards are immediate and detailed. The code must compile, run without crashing, and then typically pass tests. Often there is also an explanation of why and how it failed.

Game designer Raph Koster’s theory is that games are fun because we learn to play them as we play. From that perspective, writing code is a very well-designed game. And in fact, writing code is something that many people enjoy doing.

LLMs struggle with video games in a way they don’t with coding. This seems surprising considering their success with coding and games like chess and Go. What makes video games so hard?

Togelius: It’s not just LLMs that struggle with this. We don’t have general game AI.

There is a widespread belief that because we can build an AI that plays a particular game well, we should be able to build one AI that can play any game. I’m not sure we’ll get there.

People will mention that Google’s AlphaZero [which is not an LLM] can play both Go and chess. However, it had to be retrained and re-engineered for each one. And these are similar games in terms of input and output space. Most games differ far more from each other. They have different mechanics and different input representations.

There’s also the issue of data. Some of the games that AI can play reasonably well, like Minecraft and Pokémon, have literally millions of hours of guides and are among the most studied games in the world. For lesser-known games, far less material exists.

Video Game Benchmarks for LLM Performance

One factor that may be helping LLMs improve at coding is the proliferation of benchmarks. There are many benchmarks an LLM can attempt, and the results can be scored so the model can be adjusted to improve its performance. Developing benchmarks for playing video games, however, is less clear-cut. Why is that?

Togelius: I’ve built many game-based AI benchmarks over the years. One was the General Video Game AI competition, which ran for seven years. We tested agents on publicly available games, and each time we ran the competition, we invented 10 new games to test them on.

One of the reasons we stopped was that progress had stalled. Agents got better at some games and worse at others. This was before LLMs came along.

We recently updated this framework for LLMs. They fail. They’re really bad. All of them. They don’t even perform as well as a simple search algorithm.

Why? They were never trained on these games, and they’re very bad at spatial reasoning. That isn’t surprising, because that kind of reasoning isn’t in the training data either.
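For readers unfamiliar with what “a simple search algorithm” can look like in a game-playing setting, here is a minimal sketch. It assumes a toy grid game with a forward model (a function that predicts the result of an action) and uses breadth-first search to plan a route. The grid format and the names `step` and `plan` are hypothetical and are not taken from the competition framework. The search needs no training data at all, only the ability to simulate moves.

```python
from collections import deque

# Hypothetical action set for a toy grid game: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, pos, action):
    """Toy forward model: return the new position, or the same one if blocked."""
    r, c = pos
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]) and grid[nr][nc] != "#":
        return (nr, nc)
    return pos

def plan(grid, start, goal):
    """Breadth-first search for a shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pos, actions = queue.popleft()
        if pos == goal:
            return actions
        for action in MOVES:
            nxt = step(grid, pos, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # goal unreachable


grid = ["S.#",
        ".##",
        "..G"]
print(plan(grid, start=(0, 0), goal=(2, 2)))
```

The planner works on any grid it is handed because it reasons over simulated positions rather than recalled examples, which is exactly the spatial bookkeeping Togelius says LLMs handle poorly.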

This seems contradictory. LLMs are bad at games. But at the same time, they’re rapidly improving at coding, a skill that can be used to create games. How do these facts fit together?

Togelius: It’s very weird. You can go into Cursor or Claude, write a single prompt, and get a playable game. LLMs write code better the more typical that code is, so the game will be very typical. If you ask for something like Asteroids, it will work. That’s impressive.

However, you won’t get a good or innovative game that way. That may seem strange. The reason is that the LLM cannot play the game. Game development is an iterative process: you write, you playtest, and you adjust how the game feels. An LLM can’t do that.

I think it’s the same, to some extent, when designing other software. Yes, you can ask an LLM to create a GUI with a large number of buttons. But the LLM doesn’t really know how to use it.

Companies like Nvidia and Google have been talking about using simulations, including game-like environments, to improve AI performance. If AI can’t master games in general, how optimistic should we be about that approach?

Togelius: Games can be both easier and harder than the real world. They’re easier because there are fewer levels of abstraction. They’re harder because games are much more diverse. In the real world, the same physics applies everywhere.

One example is Waymo, which uses a world model in its training loop. This makes sense, since driving is pretty much the same everywhere. There is far less variety than in games.

It’s confusing for people. They see LLMs writing academic papers on quantum physics and wonder, “Why can’t it play both Halo and Space Invaders?” But these games are in some ways more different from each other than two academic papers are.
