AI Agent Startup Founded by Ex-Google DeepMinder • The Register

Machine Learning


When Ang Li, co-founder of agent software biz Simular, began working at Google DeepMind in 2017, software engineers at the search giant were skeptical of the utility of machine learning, or artificial intelligence (AI) as it has come to be called.

As Li explained to The Register in an interview, production teams between 2017 and 2019 would often say, “machine learning doesn't work in production.”

“We have a lot of papers, so that's kind of funny,” he said.

At one point, the Google Ads team asked the DeepMind crew to apply the AlphaGo system (the one that conquered the game of Go) to Google's ad system to improve ad revenue.

“I think some people tried it, but they actually reduced their income,” Li said. “The real world systems are so complicated, so that's the interesting part.”

Machine learning methods are based on statistics, Li says, and they assume static datasets.

“But in the real world, this assumption doesn't hold,” he explained. “In the real world, for example, YouTube has videos uploaded every day. In ads, search queries come in every day. And the distribution of this data continues to change. That's why machine learning doesn't actually work in production.”
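Li's point about shifting distributions can be illustrated with a toy experiment. The sketch below is not drawn from Google's systems; the data, the one-dimensional threshold "model", and the size of the drift are all assumptions chosen for illustration. A classifier is fitted once on a static sample, then scored on fresh data from the same distribution and on data whose distribution has moved.

```python
import random
from statistics import mean

def make_data(n, pos_mean, neg_mean, rng):
    """Draw n labeled 1-D points: label 1 ~ N(pos_mean, 1), label 0 ~ N(neg_mean, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = pos_mean if label else neg_mean
        data.append((rng.gauss(center, 1.0), label))
    return data

def fit_threshold(data):
    """'Train' by placing the decision threshold midway between the class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (mean(pos) + mean(neg)) / 2

def accuracy(threshold, data):
    """Fraction of points for which 'x > threshold' matches the label."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in data)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Fit once on a static dataset, as the statistical assumption expects.
train = make_data(2000, pos_mean=2.0, neg_mean=0.0, rng=rng)
threshold = fit_threshold(train)

# Fresh data from the same distribution vs. data where both classes drifted by +2.
same_dist = make_data(2000, pos_mean=2.0, neg_mean=0.0, rng=rng)
shifted = make_data(2000, pos_mean=4.0, neg_mean=2.0, rng=rng)

acc_static = accuracy(threshold, same_dist)
acc_drift = accuracy(threshold, shifted)
print(f"same distribution: {acc_static:.2f}, after drift: {acc_drift:.2f}")
```

The frozen threshold keeps working on data that looks like the training set and degrades once the inputs move, which is the failure mode Li describes for production systems whose data changes daily.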

That was before OpenAI released ChatGPT on November 30, 2022. Almost three years into the generative AI hype cycle, and after billions in capital expenditures, machine learning still doesn't work all that well. But investors are smitten.

As we noted last month, AI agents (AI models using tools in a loop) successfully complete only about 30% of office tasks.

However, the success rate depends on the benchmark being used and on when it is measured. The OSWorld benchmark, which assesses how well agent software can handle real computer tasks, was established in April 2024. Its tasks consist of directives such as “Update the bookkeeping sheet with recent transactions from the provided folder and detail the costs over the past few days.”

At the time, GPT-4 (Vision), the top-performing AI agent, managed an overall success rate of 12.24%.

As of about a week ago, the top performer was the GUI Test-time Scaling Agent, or GTA1, which achieved a 45.2% task success rate on the OSWorld benchmark when paired with OpenAI's o3 model. GTA1 comes from researchers at Salesforce AI, the Australian National University, and the University of Hong Kong.

That is a significant improvement over last year's state of the art, but even the best agents still fail office automation tasks more than half the time. Human workers manage a task completion score of 72.36%.

When Li co-founded Simular with Jiachen Yang in 2023, he said, he told people that the company was building agents. But people didn't understand and tried to convince him to call it an assistant. Now everyone is building agents.

“The definition of an agent is a system that can interact with the environment and continue to improve itself,” he said.

Simular's S2 agent framework is currently ranked fourth on OSWorld and sixth on the AndroidWorld benchmark, reflecting the company's vision for autonomous computing.

“Essentially for now, we need to carry our computers every day, but we don't need to do that in the future,” Li said. “It means that computers will become human… They'll book tickets for you, book tables, go shopping.”

The agent will also have knowledge of users' habits and preferences, stored locally on their computers, Li said. “This is the vision we're driving.”

A recent manifestation of that vision is Simular Pro, a $500-a-month computer-use agent for macOS (Apple Silicon) designed to automate desktop tasks. That is not a price for casual use. Rather, Li expects adoption in industries such as insurance and healthcare.

“Normally, this happens in what we call API-deficient industries. That means there is no API [for programmatic access to data],” Li explained.

“In insurance, healthcare, and finance, there's no API for developers and businesses to automate workflows. It's pretty painful for them. They have to hire people from all over the world to sit at computers. If they can automate this, they say it's a huge productivity boost for them.”

To win organizations over to this type of office task automation, agents may need to get things right at least as often as human employees do. Li, however, contends that the industry has lost its way.

“We believe everyone else is doing the wrong thing,” Li said. “It's not really wrong. They just don't seem to be going in the right direction. Everyone says agents are based on LLMs. We think this type of technology is just part of the framework for reinforcement learning.”

Reinforcement learning involves both exploration, which means trying different paths, and exploitation, which means running a known solution without taking other options into account.

Other companies are focused too much on the exploration portion and don't spend enough time on the exploitation portion, he said. Simular's S2 agent framework starts by using an LLM for exploration, but once it finds a solution, it can convert the actions into symbolic code, such as JavaScript, and run the task predictably and programmatically until the code breaks and the LLM needs to rewrite it.
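The explore-then-replay pattern described above can be sketched in a few lines. This is an illustrative toy, not Simular's implementation: the class `ExploreThenReplayAgent`, the `fake_explore` planner stub, and the dictionary that stands in for a desktop environment are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

Action = str
Environment = Dict[str, bool]  # toy stand-in: which UI actions currently work

def run_script(script: List[Action], env: Environment) -> bool:
    """Replay a stored action list; succeed only if every action is still valid."""
    return all(env.get(action, False) for action in script)

class ExploreThenReplayAgent:
    """Hypothetical agent: explore once with a costly planner, then replay stored code."""

    def __init__(self, explore: Callable[[str, Environment], List[Action]]):
        self.explore = explore  # expensive planner (an LLM in the article's description)
        self.scripts: Dict[str, List[Action]] = {}
        self.explore_calls = 0

    def perform(self, task: str, env: Environment) -> bool:
        script = self.scripts.get(task)
        if script is not None and run_script(script, env):
            return True  # exploitation: the known solution still works
        # Exploration: no script yet, or the stored one broke, so plan again.
        self.explore_calls += 1
        script = self.explore(task, env)
        self.scripts[task] = script
        return run_script(script, env)

def fake_explore(task: str, env: Environment) -> List[Action]:
    """Hypothetical planner stub: returns the actions that currently work, sorted."""
    return sorted(action for action, ok in env.items() if ok)

agent = ExploreThenReplayAgent(fake_explore)
env = {"click_submit": True, "open_form": True}
results = [agent.perform("file_claim", env) for _ in range(3)]
calls_before_change = agent.explore_calls

# The interface changes, so the stored script breaks and the planner runs again.
env = {"open_form": True, "click_send": True}
result_after_change = agent.perform("file_claim", env)
calls_after_change = agent.explore_calls
print(results, calls_before_change, result_after_change, calls_after_change)
```

In this sketch the planner is invoked once for three runs of the same task, and only a second time after the environment changes, which mirrors the claim that the stored code runs until it breaks.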

Li sees Simular as a technical infrastructure company, not as a maker of agent products. The goal he describes is to develop a neuro-symbolic continual reinforcement learning framework for building agents.

He said continual learning is one of the most challenging problems for AI researchers. The problem is that if you continue to train your neural net with new data, you will “catastrophically forget what you learned 10 days ago.” And there's the cost issue. Ultimately, it becomes untenable to keep retraining a static model in order to add knowledge.

Li believes that continual learning will be needed for the industry to reach what it calls AGI, or artificial general intelligence, meaning AI models that handle most tasks as well as humans. ®
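The forgetting effect Li mentions can be shown with a deliberately tiny model. The sketch below is an assumption-laden toy, not a neural network: a single weight is trained by gradient descent on one task and then on a second, conflicting task. Real networks forget in subtler ways, but the mechanism of new updates overwriting old knowledge is the same.

```python
def sgd(w, data, lr=0.05, epochs=50):
    """Plain stochastic gradient descent on squared error for the model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

def mse(w, data):
    """Mean squared error of y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

xs = [0.5, 1.0, 1.5, 2.0]
task_a = [(x, 2.0 * x) for x in xs]   # older knowledge: y = 2x
task_b = [(x, -1.0 * x) for x in xs]  # newer data: y = -x

w = sgd(0.0, task_a)
loss_a_before = mse(w, task_a)  # near zero: task A has been learned

w = sgd(w, task_b)              # keep training, but only on the new data
loss_a_after = mse(w, task_a)   # large: task A has been overwritten
loss_b_after = mse(w, task_b)   # near zero: task B has been learned

print(f"task A loss before: {loss_a_before:.4f}, after: {loss_a_after:.4f}")
```

Retraining on both tasks together would avoid the overwrite here, but that is the retraining cost Li says becomes untenable as knowledge accumulates.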


