Apple researchers develop local AI agent to interact with apps

Ferret-UI Lite matches or exceeds benchmark performance of models up to 24 times larger, despite having only 3 billion parameters. Here are the details:

A little background on Ferret

In December 2023, a team of nine researchers published a study called “FERRET: Refer and Ground Anything Anywhere at Any Granularity.” In it, they introduced a multimodal large language model (MLLM) capable of understanding natural language references to specific parts of an image.

Since then, Apple has published a series of follow-up papers extending the Ferret family of models, including Ferret-v2, Ferret-UI, and Ferret-UI 2.

In particular, the Ferret-UI variants were trained to extend the original capabilities of FERRET and to overcome what the researchers identified as a shortcoming of general-domain MLLMs.

From the original Ferret-UI paper:

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

The original Ferret-UI research also included an interesting application of the technology, in which a user could converse with the model to better understand how to interact with an interface.

A few days ago, Apple further extended the Ferret-UI family of models with a study called “Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.”

The original Ferret-UI was built on a 13B-parameter model and focused primarily on mobile UI understanding and fixed-resolution screenshots. Ferret-UI 2 then expanded the system to support multiple platforms and higher-resolution perception.

In contrast, Ferret-UI Lite is a much more lightweight model, designed to run on-device while remaining competitive with very large GUI agents.

Ferret-UI Lite

According to the researchers behind the new paper, “most existing techniques for GUI agents […] focus on large foundation models.” That’s because “the powerful reasoning and planning capabilities of large-scale server-side models enable these agent systems to excel in a variety of GUI navigation tasks.”

They note that considerable progress has been made in both multi-agent and end-to-end GUI systems, which take different approaches to streamlining the many tasks involved in an agent's interaction with a GUI (“low-level GUI grounding, screen understanding, multi-step planning, self-reflection”). Even so, these systems are essentially too large and too computationally intensive to run well on a device.

So they set out to develop Ferret-UI Lite, a 3 billion parameter variant of Ferret-UI. It is “built with several key components, based on insights into training small-scale language models.”

Ferret-UI Lite leverages:

  • Real and synthetic training data from multiple GUI domains.
  • On-the-fly (that is, inference-time) cropping and zoom-in techniques to better understand specific segments of the GUI.
  • Supervised fine-tuning and reinforcement learning techniques.

The result is a model that performs nearly as well as, or better than, competing GUI agent models with up to 24 times as many parameters (that is, up to roughly 72 billion).

The overall architecture (detailed extensively in the study) is interesting, but the inference-time cropping and zoom-in techniques are particularly noteworthy.

The model makes an initial prediction, crops around it, and then re-predicts in that cropped area. This helps compensate for the limited capacity of such small models to process large numbers of image tokens.
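
The predict, crop, and re-predict loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `predict` stands in for a hypothetical model call that returns a click point relative to the region it is shown, and the crop size, clamping rule, and `toy_predict` stand-in are all invented for the example.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # left, top, right, bottom (pixels)


def crop_box(center: Point, image_size: Tuple[int, int],
             crop_size: Tuple[int, int]) -> Box:
    """Return a crop window around `center`, clamped to the image bounds."""
    (cx, cy), (w, h) = center, image_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    left = int(min(max(cx - cw / 2, 0), w - cw))
    top = int(min(max(cy - ch / 2, 0), h - ch))
    return (left, top, left + cw, top + ch)


def zoom_in_predict(predict: Callable[[Box], Point],
                    image_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Point:
    """Two-pass grounding: coarse guess on the full screen, then a refined
    guess on a crop around it, mapped back to full-screen coordinates."""
    full: Box = (0, 0, image_size[0], image_size[1])
    coarse = predict(full)                      # pass 1: whole screenshot
    box = crop_box(coarse, image_size, crop_size)
    fx, fy = predict(box)                       # pass 2: zoomed-in region
    return (box[0] + fx, box[1] + fy)           # back to screen coordinates


# Hypothetical stand-in for the model: its error shrinks with the size of
# the region it looks at, mimicking a small model that sees details better
# when zoomed in. The true target is an assumption for this demo.
TARGET = (900.0, 1500.0)


def toy_predict(region: Box) -> Point:
    left, top, right, _ = region
    err = (right - left) / 20
    return (TARGET[0] - left + err, TARGET[1] - top + err)
```

With the toy predictor on a 1000×2000 screen, the coarse pass misses the target by 50 pixels on each axis, while the refined pass on a 200×200 crop misses by only 10. A real model's behavior will of course differ.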

Another notable contribution of this paper is how Ferret-UI Lite essentially generates its own training data. The researchers built a multi-agent system that directly interacts with a live GUI platform to generate synthetic training samples at scale.

The system consists of a curriculum task generator that proposes goals of increasing difficulty, a planning agent that breaks each goal into steps, a grounding agent that executes those steps on the screen, and a critic model that evaluates the results.
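
To make the division of labor concrete, here is a minimal Python sketch of such a four-role loop. Everything in it is an assumption for illustration: the `Trajectory` record, the function names, and the toy stand-ins are hypothetical, and the paper's actual agents are models interacting with a live GUI rather than simple functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    goal: str
    steps: List[str]
    actions: List[str] = field(default_factory=list)
    success: bool = False


def generate_samples(propose_goal: Callable[[int], str],
                     plan: Callable[[str], List[str]],
                     ground: Callable[[str], str],
                     critic: Callable[[Trajectory], bool],
                     rounds: int) -> List[Trajectory]:
    """Run the loop once per difficulty level and keep every trajectory,
    including the ones the critic marks as failures."""
    samples = []
    for difficulty in range(1, rounds + 1):
        goal = propose_goal(difficulty)        # curriculum task generator
        traj = Trajectory(goal=goal, steps=plan(goal))  # planning agent
        for step in traj.steps:
            traj.actions.append(ground(step))  # grounding agent acts on screen
        traj.success = critic(traj)            # critic judges the outcome
        samples.append(traj)
    return samples


# Toy stand-ins (hypothetical): harder goals simply chain more taps, and the
# critic "fails" anything longer than two actions to simulate hard tasks.
def toy_goal(difficulty: int) -> str:
    return " then ".join(f"tap item {i}" for i in range(1, difficulty + 1))


def toy_plan(goal: str) -> List[str]:
    return goal.split(" then ")


def toy_ground(step: str) -> str:
    return f"click<{step}>"


def toy_critic(traj: Trajectory) -> bool:
    return len(traj.actions) <= 2
```

Calling `generate_samples(toy_goal, toy_plan, toy_ground, toy_critic, rounds=3)` yields three trajectories of one, two, and three steps, the last of which is labeled a failure. Keeping failed trajectories in the output is a design choice of this sketch, made so that the data includes errors as well as clean successes.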

With this pipeline, the training data captures the messiness of real-world interaction, such as errors, unexpected states, and recovery strategies. That is much harder to achieve when relying only on clean, human-annotated data.

Interestingly, while Ferret-UI and Ferret-UI 2 used iPhone screenshots and other Apple interfaces for evaluation, Ferret-UI Lite was trained and evaluated on Android, web, and desktop GUI environments, using benchmarks such as AndroidWorld and OSWorld.

The researchers did not explicitly say why they chose this route for Ferret-UI Lite, but it likely reflects where large-scale, reproducible GUI agent testbeds are currently available.

Regardless, the researchers found that while Ferret-UI Lite performed well on short-horizon, low-level tasks, it was weaker on more complex, multi-step interactions. That trade-off is largely expected given the constraints of a small, on-device model.

On the other hand, Ferret-UI Lite offers a local and, by extension, private agent that autonomously interacts with app interfaces based on user requests, since no data needs to be sent to the cloud and processed on a remote server. That is an appealing prospect for users.

For more information on the study, including benchmark breakdowns and results, see the paper.
