Apple researchers develop local AI agent that interacts with apps – 9to5Mac

Despite having only 3 billion parameters, Ferret-UI Lite matches or exceeds the benchmark performance of models up to 24 times larger. Here are the details.

A little background on the Ferret

In December 2023, a team of 9 researchers published a study titled “FERRET: Refer and Ground Anything Anywhere at Any Granularity”. In it, they introduced a multimodal large language model (MLLM) that could understand natural-language references to specific parts of an image:

Image: Apple

Since then, Apple has released a series of follow-ups expanding the Ferret family, including Ferret-v2, Ferret-UI, and Ferret-UI 2.

The Ferret-UI variants specifically extended FERRET’s original capabilities and were trained to address what the researchers identified as a shortcoming of general-domain MLLMs: their limited ability to understand UI screens.

From the original Ferret-UI paper:

Recent advances in multimodal large language models (MLLMs) are remarkable, but these general-domain MLLMs often fall short in their ability to understand and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a novel MLLM tailored for better understanding of mobile user interface screens, equipped with referring, grounding, and reasoning capabilities. Since UI screens tend to have a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, text) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.

Image: Apple
The original Ferret-UI study included an interesting application of the technology where the user could talk to the model to better understand how to interact with the interface, as seen on the right.

A few days ago, Apple expanded the Ferret-UI model family with a study called Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.

Ferret-UI was built on a 13B-parameter model and focused primarily on understanding mobile UIs from fixed-resolution screenshots. Ferret-UI 2, meanwhile, expanded the system to support more platforms and higher-resolution perception.

In contrast, Ferret-UI Lite is a much lighter model, designed to run on devices while remaining competitive with significantly larger GUI agents.

Ferret-UI Lite

According to the researchers behind the new paper, “most existing GUI agent methods (…) focus on large foundation models,” because “strong server-side large model reasoning and planning capabilities enable these agent systems to achieve impressive capabilities in various GUI navigation tasks.”

They note that while much progress has been made in both multi-agent and complex GUI systems, which take different approaches to the many tasks involved in agent interaction with GUIs (“low-level GUI grounding, screen understanding, multi-step planning, and self-reflection”), these systems are fundamentally too large and computationally intensive to run well on-device.

So they decided to develop Ferret-UI Lite, a 3-billion-parameter variant of Ferret-UI that “is built with several key components driven by insights from training small” language models.

Ferret-UI Lite uses:

  • Real and synthetic training data from multiple GUI domains;
  • Run-time (or inference-time) cropping and zooming techniques to better understand specific segments of the GUI;
  • Supervised tuning and reinforcement learning techniques.

The result is a model that closely matches or even outperforms competing GUI agent models with up to 24 times the number of parameters.

Image: Apple

While the entire architecture (which is thoroughly detailed in the study) is interesting, the inference-time cropping and zooming technique is particularly noteworthy.

The model makes an initial prediction, crops around it, and then re-predicts within that cropped region. This helps the small model compensate for its limited capacity to process large numbers of image tokens.
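The two-pass idea can be sketched in a few lines. This is an illustrative geometry-only sketch, not the paper’s implementation: `predict_box` stands in for the model’s grounding call, and the margin value is an assumption.

```python
def crop_and_zoom_ground(predict_box, image_size, query, margin=0.25):
    """Two-pass grounding sketch: coarse prediction, zoomed crop, refinement.

    predict_box(region, query) is a hypothetical stand-in for the model's
    grounding call; it returns a box (x0, y0, x1, y1) in the coordinates of
    the region it was given. Only the coordinate bookkeeping is modeled here.
    """
    w, h = image_size
    # Pass 1: coarse prediction over the full screenshot
    x0, y0, x1, y1 = predict_box((0, 0, w, h), query)
    # Expand the predicted box by a margin and clamp it to the image bounds
    mx = (x1 - x0) * margin
    my = (y1 - y0) * margin
    cx0, cy0 = max(0, x0 - mx), max(0, y0 - my)
    cx1, cy1 = min(w, x1 + mx), min(h, y1 + my)
    # Pass 2: re-predict inside the zoomed crop, where targets appear larger
    rx0, ry0, rx1, ry1 = predict_box((cx0, cy0, cx1, cy1), query)
    # Map the refined box from crop coordinates back to full-image coordinates
    return (cx0 + rx0, cy0 + ry0, cx0 + rx1, cy0 + ry1)
```

The refinement pass sees the target at a larger effective resolution, which is what lets a small model recover detail it would otherwise lose when downscaling a full screenshot into a limited token budget.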

Image: Apple

Another notable contribution of the paper is how Ferret-UI Lite essentially generates its own training data. The researchers built a multi-agent system that interacts directly with live GUI platforms to generate large-scale synthetic training examples.

A curriculum task generator proposes objectives of increasing difficulty, a planning agent breaks them down into steps, a grounding agent executes them on screen, and a critic model evaluates the results.

Image: Apple

With this pipeline, the training system captures the messiness of real-world interaction (errors, unexpected states, and recovery strategies), something that would be much harder to achieve with clean, human-annotated data alone.
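The loop described above can be sketched as follows. All four agent interfaces here are hypothetical stand-ins for what are, in the paper, model-driven components; only the control flow is illustrated.

```python
def generate_trajectories(task_generator, planner, grounder, critic, n_tasks=3):
    """Sketch of a multi-agent synthetic-data loop (illustrative, not the
    paper's code): generate a task, plan it, execute it step by step on a
    live GUI, and keep only the episodes the critic accepts."""
    dataset = []
    for difficulty in range(n_tasks):
        task = task_generator(difficulty)       # curriculum: harder over time
        steps = planner(task)                   # decompose into UI actions
        trajectory = [grounder(step) for step in steps]  # execute on screen
        if critic(task, trajectory):            # filter failed episodes
            dataset.append({"task": task, "trajectory": trajectory})
    return dataset
```

Because the grounding agent acts on a live interface, failed steps and unexpected screen states end up in the trajectories, and the critic decides which episodes are worth keeping as training examples.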

Interestingly, while Ferret-UI and Ferret-UI 2 used iPhone screenshots and other Apple interfaces for their evaluations, Ferret-UI Lite was trained and evaluated on Android, web, and desktop GUI environments using benchmarks like AndroidWorld and OSWorld.

The researchers don’t explicitly note why they chose this path for Ferret-UI Lite, but it likely reflects where reproducible, large-scale GUI-agent test environments are available today.

Be that as it may, the researchers found that while Ferret-UI Lite performed well on low-level, short-horizon tasks, it was less strong on more complicated multi-step interactions, a trade-off largely to be expected given the constraints of a small on-device model.

On the other hand, Ferret-UI Lite offers a local, and therefore private, agent (no data needs to be sent to the cloud for processing on remote servers) that autonomously interacts with app interfaces based on user requests, which seems pretty cool.

To learn more about the study, including comparative analyses and results, click this link.


FTC: We use income-earning auto affiliate links. More.
