A service billed as the “first AI software engineer” appears to be rather bad at its job, based on a recent evaluation.
The auto-coder is called “Devin” and was introduced in March 2024. The bot's creator, an outfit called Cognition AI, has made claims such as “Devin can build and deploy apps end to end” and “can autonomously find and fix bugs in codebases.” The tool reached general availability in December 2024, starting at $500 per month.
“Devin is an autonomous AI software engineer that can write, run and test code, helping software engineers work on personal tasks or their team projects,” Cognition's documentation declares. It “can review PRs, support code migrations, respond to on-call issues, build web applications, and even perform personal assistant tasks like ordering your lunch on DoorDash so you can stay locked in on your codebase.”
The service uses Slack as its main interface for commands, which are sent to its computing environment, a Docker container that hosts a terminal, browser, code editor, and planner. The AI agent supports API integration with external services. This allows it, for example, to send email messages on a user's behalf via SendGrid.
Devin is a “compound AI system,” meaning it relies on multiple underlying AI models, a set that currently includes OpenAI's GPT-4o and is expected to evolve over time.
In theory, you could ask it to take on tasks like migrating code to nbdev, a Jupyter Notebook development platform, and expect it to succeed. But that may be asking too much.
Early assessments of Devin have found problems. Cognition AI posted a promotional video that purports to show the AI coder autonomously completing projects on the Upwork freelancer-for-hire platform. Software developer Carl Brown analyzed that video and debunked it on his Internet of Bugs YouTube channel.
The software agent was also called out by another YouTube code pundit for allegedly including critical security issues.
Now three data scientists affiliated with Answer.AI, an AI research and development lab founded by Jeremy Howard and Eric Ries, have tested Devin and found that it completed just three of 20 tasks successfully.
In an analysis conducted earlier this month by Hamel Husain, Isaac Flath, and Johno Whitaker, Devin started well, successfully pulling data from a Notion database into Google Sheets. The AI agent also managed to create a planet tracker for checking claims about the historical positions of Jupiter and Saturn.
But as the three researchers continued their testing, they ran into problems.
“Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions,” the researchers explained in their report. “Even more concerning was Devin's tendency to press forward with tasks that weren't actually possible.”
As an example, they cited how Devin, when asked to deploy multiple applications to the infrastructure deployment platform Railway, failed to recognize that this was not supported and spent more than a day trying approaches that did not work and hallucinating features that do not exist.
Of the 20 tasks presented to Devin, the AI software engineer completed just three satisfactorily: the two cited above and a third challenge to research how to build a Discord bot in Python. Three other tasks produced inconclusive results, and 14 projects were outright failures.
The researchers said Devin provided a polished user experience that was impressive when it worked.
“But that's the problem. It rarely worked,” they wrote.
“More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways.”
Cognition AI did not respond to a request for comment. ®
