AI agents promise to do the work for you while you sleep. The reality is more complicated

Summer Yue may work on safety and alignment on Meta’s superintelligence team, but even she admits she’s not immune to overconfidence when it comes to autonomous AI agents.

In a post on X Monday, Yue described how an OpenClaw autonomous AI agent, set up to run locally on a Mac mini, deleted her entire inbox, ignoring instructions to pause and ask for confirmation first.

“I had to run towards the Mac Mini like I was defusing a bomb,” she said. It was a “rookie mistake,” she added. She explained that while the workflow worked in a test inbox that she used to safely try out the agent for several weeks, in the real inbox the agent lost its original instructions.

Yue’s experience stands in stark contrast to viral posts such as “The Lobster Revolution: How 24/7 AI agents changed everything,” in which Peter Diamandis argues that always-on AI involves far less friction.

“Please tell me what it feels like to use this,” Diamandis wrote. “When you wake up in the morning, your agent — my agent is named Skippy, and he’s hilarious, sarcastic, and incredibly talented — has done eight hours of work for you while you slept. It’s read your 1,000 pages of markdown. It’s organized your files. It’s drafted three project plans. It’s booked your travels. It’s researched the questions you had at 11 p.m. that you forgot about.”

“I felt withdrawn when my Mac mini went offline for six hours,” he added. “It’s like my best friend is gone.”

These dueling accounts of the power of AI agents capture the tension at the heart of today’s push toward “always-on” AI. Tools like OpenClaw and Claude Code make it technically possible to run agents for long periods of time, fueling excitement about the idea of AI that works while you sleep. In practice, however, early users say autonomy remains fragile and unpredictable, and that it takes significant effort to manage. Rather than replacing human work, today’s agents often require continuous monitoring, guardrails, and intervention, especially once the stakes rise beyond low-risk experiments.

AI agents work best when the task is simple and low risk.

Shyamal Anadkat, who previously worked as an applied AI engineer at OpenAI, said that most successful agents today still require frequent human check-ins or are limited to tightly bounded, well-defined tasks, but stressed that this will change as measurement and evaluation techniques improve.

“A system that has 95% accuracy on individual steps can be confusing in a 20-step autonomous workflow,” says Anadkat. “Long-term planning is still weak.” As a result, agents may work well with short task chains, but tend to fall apart when asked to manage complex multi-day projects, he explained. Memory is another major limitation. “For many agents, memory is non-existent or fragile. We need a system that can maintain a consistent model of working context, priorities, and constraints.”
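The arithmetic behind Anadkat's 95% figure can be sketched in a few lines. This is an illustration under a simplifying assumption that is not in the original reporting: every step succeeds or fails independently, and a single failed step derails the whole workflow. The function name `chain_success_rate` is hypothetical.

```python
def chain_success_rate(step_accuracy: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumes steps are independent and that one failure sinks the whole run.
    """
    return step_accuracy ** steps


# Compare a single step, a short chain, and a 20-step workflow.
for steps in (1, 5, 20):
    rate = chain_success_rate(0.95, steps)
    print(f"{steps:>2} steps at 95% per step: {rate:.1%} end to end")
```

Under that assumption, a 20-step workflow finishes without a single error only about 36% of the time, which is one way to read why short task chains hold up while long ones fall apart.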

But that doesn’t mean the promise of AI agents is all fantasy, says Yoav Shoham, former chief scientist at Google, professor emeritus at Stanford University, and co-founder of AI21 Labs. It does mean there is a risk of people getting ahead of themselves. He explained that current AI agents are most effective when tasks are low risk, loosely defined, and tolerant of failure.

“Developers like toys. We have this toy that can do amazing things,” he said. “As long as what they’re doing is fairly simple, the risk is fairly low, and there’s a tolerance for error, that’s fine.” One example is an agent that reads 10,000 websites overnight and does something interesting with the results, returning useful information.

But for mission-critical enterprise workflows, the hurdles are much higher. Businesses need systems that are verifiable, reproducible, and cost-effective. Those requirements quickly undercut the promise of fully autonomous, always-on agents. Higher degrees of automation are already possible in highly structured areas such as coding and mathematics. But in most real-world business processes, the effort required to make agents trustworthy often outweighs the benefits, Shoham says.

Brett Greenstein, chief AI officer at consulting firm West Monroe, said tools like OpenClaw feel like a tipping point similar to the one generative AI reached when ChatGPT launched in 2022, making the idea of AI agents broadly accessible for the first time. Still, it’s not a 24/7 “magic solution.”

“It can keep you working long hours and working on things, but it’s like a toddler that needs supervision,” he said. Some tasks are best done while you’re asleep, like scanning your LinkedIn messages or following the news. “I don’t know if I can answer customer feedback in my sleep,” he said.

The ability to delegate to an AI agent feels powerful

Still, Greenstein emphasized that there is little doubt that the ability to delegate real-world tasks to AI agents will be very appealing to users. He pointed to his own experience, in which he gave an AI agent the mundane job of getting his clothes picked up from the dry cleaners, and then watched it quietly complete the task from beginning to end.

The agent independently contacted the cleaners, planned and arranged the pickup over email, monitored a doorbell camera to confirm the pickup, and notified Greenstein when the work was completed. The episode showed how agents can work across multiple systems and adapt when things don’t go as planned. However, it also highlighted why such tools still require strict guardrails and oversight before being deployed, especially in enterprise environments.

“OpenClaw is set up in a way that most people shouldn’t feel safe,” Greenstein says. “I don’t feel it’s mature enough to be trusted as a part of our lives.” For AI to be accepted into everyday life and business operations, it will need to earn trust over time, just as trust is established socially, he added.

Still, the demand is already clear. Greenstein pointed to meetups and early industry gatherings dedicated to OpenClaw, calling it an unusually rapid emergence for such a young tool. “This shows that people want AI that actually helps, systems that not only answer questions but also initiate actions,” he said.

Aaron Levie, CEO of cloud-based content management and collaboration company Box, calls what’s happening now with AI agents a “small sign” of what might happen in the future.

“Some end up not showing up, and some become the norm,” he said, referring to two years ago when AI company Cognition introduced an early agent called Devin that integrated with Slack for task delegation, bug fixing, data analysis, and code reviews. At the time, it was still considered futuristic, but now “no one is confused that this is standard practice,” he says. “You can put Claude Code in Slack and get to work. What seemed like a completely crazy idea is now basically the norm for modern engineering teams.”

However, Levie emphasized that while AI agents are very good at automating certain individual tasks, and may be able to fully automate some of them, they are still bad at handling the broader, context-heavy work that makes up most jobs, such as managing relationships and attending meetings.

“When you hear AI institutes say they will automate all knowledge work within 24 months, that is usually a very narrow definition of work,” he said. “The definition of what an agent can do is not the same as the definition of the job for which they are hired in the economic world.”

The trust factor matters when things go wrong

Avinash Bhootkuri, a staff data scientist at a top Fortune 500 retailer, said most enterprise AI agents “absolutely need a babysitter” and can currently only work in enterprise environments with strictly limited autonomy and extensive guardrails. “The stakes are huge,” he explained.

For example, he described building an agent system for enterprise cybersecurity in which AI agents proactively investigate rather than simply triggering alerts and waiting for human review. Rather than bombarding analysts with thousands of alerts, the agents gather evidence in real time by querying threat intelligence databases, analyzing behavioral patterns, and filtering out false positives before determining whether a situation merits escalation.

The system relies on strictly limited autonomy and extensive guardrails to reduce human workload without eliminating oversight.

In cybersecurity, he explained, when an agent makes a mistake, the consequences are immediate and severe. “AI will either block legitimate customers (causing huge revenue losses) or it will allow sophisticated attackers to enter the network,” he said. “It’s absolutely critical if things don’t go well.”

Brianna Whitehead, who runs an AI operations consultancy that builds AI-powered systems for executives and founders, says the industry is in a “trust adjustment phase.”

AI agents can do more than most people allow them to do, but not as much as the hype suggests, she said.

“The real skill is not in building the agent, but in designing the handoff,” she explained. “Most people either overtrust their agents and end up cleaning up the mess, or they micromanage all the deliverables, wondering why it feels like the AI is doing more work instead of less.” The idea, she says, is to design clear handoff points, where some things are fully delegated, others get a quick review, and other tasks are done only by humans.

So far, she says, agents are “really good” in what she calls the middle tier of knowledge work. “Organizing meeting notes into action items, drafting follow-up emails in someone else’s voice, summarizing research, and organizing competing priorities into a clear plan are tasks that would take up two to three hours of a smart person’s day.”

But AI agents are not ready for prime time on anything that requires reading a room, navigating ambiguous situations, or making relationship-dependent decisions. “I had a client who wanted to completely automate communication with investors,” she said. “The AI could draft beautifully, but it couldn’t sense when funders were starting to lose interest, so we needed a different approach. The agent drafted the email, but a human had to decide whether to send it or not.”
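The handoff points Whitehead describes can be pictured as a small routing table. The sketch below is purely illustrative: the tier names, task names, and tier assignments are assumptions made for this example, not her system. Only the three-way split (fully delegated, quick review, human only) comes from her account.

```python
from enum import Enum


class Handoff(Enum):
    DELEGATE = "agent completes the task on its own"
    REVIEW = "agent drafts, a human gives it a quick review"
    HUMAN_ONLY = "a human does the task"


# Hypothetical policy table; task names and tiers are illustrative only.
POLICY = {
    "organize_meeting_notes": Handoff.DELEGATE,
    "draft_investor_email": Handoff.REVIEW,
    "decide_whether_to_send_investor_email": Handoff.HUMAN_ONLY,
}


def route(task: str) -> Handoff:
    """Look up a task's handoff tier.

    Tasks not listed in the table fall back to the most conservative tier.
    """
    return POLICY.get(task, Handoff.HUMAN_ONLY)


for task in ("organize_meeting_notes", "draft_investor_email", "negotiate_contract"):
    print(f"{task}: {route(task).value}")
```

Defaulting unlisted tasks to the human-only tier is a design choice of this sketch: an agent that meets something nobody planned for hands it back instead of improvising.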

For now, it can be difficult to sleep when working with AI agents

For now, working with AI agents may have more to do with staying half-awake than sleeping on the job. Tools like OpenClaw can run for hours at a time, but for many early users, that autonomy comes with a new kind of vigilance: checking logs, reviewing output, and intervening before problems occur.

That dynamic was captured in a recent viral post titled “Token Anxiety.” In it, investor Nikunj Kothari said his friend left a party early not because he was tired but because he wanted to get back to his agents. “No one questions it anymore,” Kothari wrote. “Half the room is thinking the same thing. The other half is probably checking on the agent’s progress. At the party.”

The dream of AI working while you sleep may become a reality. But for now, many people are still staying up.
