
Apple's research paper describes how the company has been developing Ferret-UI, a generative AI system specifically designed to understand app screens.
The paper is somewhat vague about its potential applications, perhaps intentionally so, but the most appealing possibility would be powering a more advanced Siri.
Challenges to surpassing ChatGPT
Large language models (LLMs) power systems like ChatGPT. Their training material is text, mostly taken from websites.
Multimodal large language models (MLLMs) aim to extend the capabilities of AI systems so that they also understand non-textual information such as images, video, and audio.
MLLMs are currently not very good at understanding the output of mobile apps. There are several reasons for this, starting with the mundane one that smartphone screen aspect ratios differ from those used in most training images.
More specifically, many of the images they need to recognize, such as icons and buttons, are very small.
Additionally, rather than taking in the information all at once, as they would when interpreting a static image, they need to be able to interact with the app.
Apple's Ferret-UI
These are problems that Apple researchers believe they have solved with an MLLM system they call Ferret-UI (UI stands for user interface).
Considering that UI screens typically exhibit a vertical aspect ratio and contain smaller objects of interest (icons, text, etc.) than natural images, we have built in “any resolution” on top of Ferret to magnify details and take advantage of enhanced visual features. […]
We carefully collect training samples from a wide range of basic UI tasks, such as icon recognition, finding text, and widget listing. These samples are formatted for instruction-following, with region annotations to facilitate precise referring and grounding. To enhance the model's reasoning capabilities, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference.
They say the results are better than GPT-4V and other existing UI-focused MLLMs.
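The “any resolution” idea in the quote above can be illustrated with a small sketch. The article does not spell out the mechanism, so this assumes one plausible reading: the screenshot is cut into two sub-images across its longer side, so that small icons and text take up a larger share of each piece the model sees. The function name `split_screen` and the box layout are hypothetical and do not come from Apple's code.

```python
from typing import List, Tuple

# A crop box as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return two crop boxes that cut a screenshot across its longer side.

    Assumption for illustration: portrait screens are split into top and
    bottom halves, landscape screens into left and right halves.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait (or square): top half and bottom half.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: left half and right half.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    # Example: a 1170 x 2532 portrait screenshot.
    for box in split_screen(1170, 2532):
        print(box)
```

Each half of a tall portrait screenshot is much closer to square than the full screen, which is nearer to the proportions of typical training images.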
From UI development to advanced Siri
The paper describes what the researchers achieved rather than how it might be used. This is typical of many research papers, and there may be several reasons for it.
First, the researchers themselves may not know how their work will ultimately be used. They focus on solving technical problems rather than on potential applications. It may take time for product people to discover ways to make use of it.
Second, they may have been instructed not to disclose the intended use, or to be deliberately vague about it, especially where Apple is involved.
But there are three possible ways this ability could be used…
First, it could be a useful tool for evaluating the effectiveness of a UI. Developers could create a draft version of their app and let Ferret-UI determine how easy it is to understand and use. This could be faster and cheaper than human usability testing.
Second, it could have accessibility applications. For example, rather than a simple screen reader reading everything on an iPhone screen aloud to a blind person, it could summarize what appears on the screen and list the available options. The user could then tell iOS what they want to do and have the system do it for them.
Apple provides an example of this, in which Ferret-UI is shown a screen containing podcast shows. The system's output is: “This screen is a podcast application that allows users to browse and play new and featured podcasts. There are also options to play, download, and search for specific podcasts.”
Third, and most interesting, it could be used to power a very advanced form of Siri. A user could give Siri an instruction like: “Check tomorrow's flights from JFK to Boston and book me a seat on a flight that arrives by 10 a.m., with a total fare under $200.” Siri would then interact with the airline's app to carry out the task.
Thanks, A.K. 9to5Mac composite of images from Solen Feyissa on Unsplash and Apple.