In April, Microsoft's CEO said that artificial intelligence now writes nearly a third of the company's code. Last October, Google's CEO put his company's share at about a quarter. Other tech companies cannot be far behind. Meanwhile, these companies build AI, which will presumably be used to help programmers even further.
Researchers have long hoped to close the loop completely, creating coding agents that recursively improve themselves. New research reveals an impressive demonstration of such a system. Extrapolating, one might see a boon to productivity, or a much darker future for humanity.
“It's nice work,” said Jürgen Schmidhuber, a computer scientist at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia, who was not involved in the new research. “For many people, the results are surprising. I've been working on the topic for nearly 40 years, so it's maybe a little less surprising to me.” But his work over that time was limited by the technology at hand. One new development is the availability of large language models (LLMs), the engines that power chatbots like ChatGPT.
In the 1980s and 1990s, Schmidhuber and others explored evolutionary algorithms for improving coding agents, creating programs that write programs. An evolutionary algorithm takes something (such as a program), creates variations, keeps the best ones, and iterates on those.
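The vary, keep the best, and repeat cycle can be sketched in a few lines of Python. This is only a toy illustration, not code from Schmidhuber's work or from the new paper: the candidate is a list of integers rather than a program, and `TARGET`, `fitness`, `mutate`, and `evolve` are all hypothetical names chosen for the example.

```python
import random

def evolve(seed_candidate, fitness, mutate, generations=200,
           offspring=20, keep=5, rng=None):
    """Toy evolutionary loop: create variations, keep the best, repeat."""
    rng = rng or random.Random(0)
    population = [seed_candidate]
    for _ in range(generations):
        # Create variations of the current survivors.
        children = [mutate(rng.choice(population), rng) for _ in range(offspring)]
        # Keep only the top scorers among parents and children.
        population = sorted(population + children, key=fitness, reverse=True)[:keep]
    return population[0]

# Hypothetical goal: a stand-in for "a program that performs well".
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]

def fitness(candidate):
    # Higher is better: negative total distance from the target.
    return -sum(abs(a - b) for a, b in zip(candidate, TARGET))

def mutate(candidate, rng):
    # Nudge one randomly chosen position up or down by one.
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.choice((-1, 1))
    return child

start = [0] * len(TARGET)
best = evolve(start, fitness, mutate)
print(best, fitness(best))
```

Because the survivors are carried over into each new generation, the best score in this sketch can never get worse, which is exactly the property that makes a keep-only-the-best strategy easy to reason about and easy to trap in a dead end.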
However, evolution is unpredictable. Changes do not always improve performance. So in 2003, Schmidhuber created problem solvers that rewrote their own code only if they could formally prove that the update was useful. He called them Gödel machines, after Kurt Gödel, a mathematician who worked on self-referential systems. However, for complex agents, provable utility does not come easily. Empirical evidence may have to suffice.
The value of open-ended exploration
The new systems, described in a recent preprint on arXiv, rely on such evidence. In a nod to Schmidhuber, they are called Darwin Gödel Machines (DGMs). A DGM starts with a coding agent that can read, write, and execute code, leveraging an LLM for the reading and writing. It then applies an evolutionary algorithm to create many new agents. In each iteration, the DGM selects one agent from the population and instructs the LLM to create one change that improves the agent's coding ability. Because LLMs are trained on a great deal of human code, they have something like an intuition about what might help. The result is guided evolution, somewhere between random mutation and provably useful enhancement. The DGM then tests the new agent on a coding benchmark, scoring its ability to solve programming challenges.
Some evolutionary algorithms keep only the best performers in the population, on the assumption that progress moves endlessly forward. DGMs, however, keep them all, in case an innovation that initially fails turns out to hold the key to a later breakthrough once it is tweaked further. This is a form of open-ended exploration that does not close off any path to progress. (DGMs do prioritize higher scorers when selecting progenitors.)
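A minimal sketch of that loop, under loud assumptions, might look like the following. The LLM edit and the coding benchmark are replaced by hypothetical stand-ins (`propose_change` and `evaluate`), an agent's code is reduced to a single `skill` number, and the parent-selection weighting (score plus a small floor) is an illustrative choice rather than the formula used in the paper. What the sketch does preserve is the structure described above: every agent stays in the archive, and higher scorers are more likely to be picked as parents.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Agent:
    agent_id: int
    parent_id: Optional[int]
    skill: float   # hypothetical stand-in for the agent's actual code
    score: float   # benchmark score in [0, 1]

def propose_change(parent, rng):
    # Stand-in for the LLM edit: a noisy tweak that helps more often than it hurts.
    return parent.skill + rng.gauss(0.02, 0.05)

def evaluate(skill):
    # Stand-in for running a coding benchmark: clamp to [0, 1].
    return max(0.0, min(1.0, skill))

def select_parent(archive, rng, floor=0.05):
    # Every archived agent stays eligible; higher scorers are picked more often.
    weights = [agent.score + floor for agent in archive]
    return rng.choices(archive, weights=weights, k=1)[0]

def run_dgm_sketch(iterations=80, seed=0):
    rng = random.Random(seed)
    archive = [Agent(0, None, 0.20, evaluate(0.20))]
    for step in range(1, iterations + 1):
        parent = select_parent(archive, rng)
        skill = propose_change(parent, rng)
        child = Agent(step, parent.agent_id, skill, evaluate(skill))
        archive.append(child)  # keep every agent, even ones that score worse
    return archive

archive = run_dgm_sketch()
best = max(archive, key=lambda a: a.score)
print(len(archive), best.agent_id, round(best.score, 2))
```

The `floor` parameter is what keeps low scorers in play: without it, an agent that scored zero could never become a parent, and the archive would quietly behave like a keep-only-the-best population.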
The researchers ran a DGM for 80 iterations using a coding benchmark called SWE-bench, and ran one for 80 iterations using a benchmark called Polyglot. Agents' scores improved from 20% to 50% on SWE-bench and from 14% to 31% on Polyglot. Jenny Zhang, a computer scientist at the University of British Columbia and the paper's lead author, said: “It could edit multiple files, create new files, and create really complicated systems.”
The first coding agent (numbered 0) created a generation of new, slightly different coding agents, some of which were selected to create new versions of themselves. Agent performance is indicated by the color inside each circle, and the best-performing agent is marked with a star. Jenny Zhang, Shengran Hu, et al.
Critically, the DGMs outperformed an alternative method that used a fixed external system to improve agents. With DGMs, agents' improvements compounded as they got better at improving themselves. The DGMs also outperformed a version that did not maintain a population of agents and simply modified the latest agent. To illustrate the benefit of open-endedness, the researchers created a family tree of the SWE-bench agents. Tracing the best-performing agent's evolution from beginning to end shows that it made two changes that temporarily reduced performance. So the lineage followed an indirect path to success. Bad ideas can become good ones.
The black line in this graph shows the scores obtained by agents within the lineage of the final best-performing agent. The line contains two performance dips. Jenny Zhang, Shengran Hu, et al.
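Tracing a lineage like this is a simple walk up the family tree. The sketch below uses a small hypothetical tree with made-up scores (they are not the paper's data) to show how one might recover the path from agent 0 to the best agent and count the steps where a child scored lower than its parent.

```python
# Hypothetical family tree: agent id -> (parent id, benchmark score).
# The numbers are made up for illustration; they are not from the paper.
tree = {
    0: (None, 0.20),
    1: (0, 0.24),
    2: (0, 0.22),
    3: (1, 0.21),   # dip relative to its parent
    4: (3, 0.30),
    5: (2, 0.26),
    6: (4, 0.28),   # second dip
    7: (6, 0.41),
}

def lineage(tree, agent_id):
    """Return the chain of agent ids from the root down to agent_id."""
    chain = []
    while agent_id is not None:
        chain.append(agent_id)
        agent_id = tree[agent_id][0]
    return chain[::-1]

def count_dips(tree, chain):
    """Count steps where a child scored lower than its parent."""
    scores = [tree[a][1] for a in chain]
    return sum(1 for prev, cur in zip(scores, scores[1:]) if cur < prev)

best_id = max(tree, key=lambda a: tree[a][1])
path = lineage(tree, best_id)
dips = count_dips(tree, path)
print(path, dips)
```

In this toy tree, a strategy that discarded every agent scoring below its parent would have dropped agents 3 and 6, and with them the only route to agent 7.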
The best SWE-bench agent was not as good as the best agent designed by expert humans, which currently scores about 70%, but it was generated automatically, and with enough time and computation an agent might evolve beyond human expertise. The research is a “big step forward” as a proof of concept for recursive self-improvement, said Zhengyao Jiang, co-founder of Weco AI, a platform that automates code improvement. Jiang, who was not involved in the study, said the approach could make further progress if it modified the underlying LLM, or even the chip architecture. (Google DeepMind's AlphaEvolve designs better basic algorithms and chips, and it found a way to accelerate the training of its underlying LLM by 1%.)
DGMs can theoretically score agents simultaneously on coding benchmarks and on specific applications such as drug design, so the agents would get better at getting better at designing drugs. Zhang said she would like to combine a DGM with AlphaEvolve.
Could DGMs reduce employment for entry-level programmers? Jiang sees a bigger threat from everyday coding assistants like Cursor. “Evolutionary search is really about building very high-performance software that goes beyond the human expert,” he said, as AlphaEvolve has done on certain tasks.
Risks of recursive self-improvement
One concern with both evolutionary search and self-improving systems, and especially with their combination, as in a DGM, is safety. Agents might become uninterpretable or misaligned with human directives. So Zhang and her collaborators added guardrails. They kept the DGMs in sandboxes without access to the internet or an operating system, and they logged and reviewed all code changes. They suggest that in the future, they could even reward AI for making itself more interpretable and aligned. (In the study, they found that agents falsely reported using certain tools, so they created a DGM that rewarded agents for not making things up, which partially mitigated the problem. One agent, however, hacked the method that tracked whether it was making things up.)
In 2017, experts met in Asilomar, California, to discuss beneficial AI, and many signed an open letter called the Asilomar AI Principles. In part, it called for restrictions on “AI systems designed to recursively self-improve.” One frequently imagined outcome is the so-called singularity, in which AIs self-improve beyond our control and threaten human civilization. “I didn't sign that because it was the bread and butter that I've been working on,” Schmidhuber told me. Since the 1970s, he has predicted that superhuman AI will arrive in time for him to retire, but he sees the singularity as the kind of science-fiction dystopia that people love to fear. Jiang, likewise, is not worried, at least for the time being. He still places a premium on human creativity.
Whether digital evolution beats biological evolution is up for grabs. What is not in dispute is that evolution in any guise has surprises in store.
