Apple’s latest Ferret AI model is a step towards letting Siri see and control iPhone apps

Apple is still working on ways to help Siri see apps on the display, with a new paper explaining how the concept works in a version of Ferret that can run natively on the iPhone.

Apple's work on AI systems that can run directly on a smartphone is gradually accelerating. While its immediate attempts to deliver a new, more contextual Siri aren't quite ready for prime time, Apple is still looking ahead to other updates it can make to its assistant and to Apple Intelligence.

The way forward seems to be to focus on its strength: local query processing.

Ferret's origins

In 2023, Apple and researchers from Cornell University introduced an open-source multimodal LLM called "Ferret." The model could use regions of an image as part of a query, such as identifying what appears inside a drawn area of a photograph.

Half a year later, in April 2024, the work expanded into a new version called Ferret-UI that could understand user interface elements. That is, an AI that can read a screenshot of a phone's display, identify the important elements on the screen, and potentially interact with the user interface of an open application.

The February 2026 paper on "Ferret-UI Lite" describes a natural progression: a version of Ferret that attempts to fix a problem with its predecessors. Specifically, they relied on large language models (LLMs) that were quite large and not really designed for on-device processing at all.

Using these cloud-based LLMs made sense because their planning and reasoning capabilities were substantial and produced strong results. However, the approach requires data to be sent to remote servers, when privacy and security advocates would prefer the data be processed locally.

While the team made progress on both GUI agents and multi-agent systems for this task, especially in reducing the work required for agents to interact with the user interface, there was still too much computation involved to run it locally on a smartphone.

This led to the creation of a new Lite version of Ferret-UI.

Thin and fast

The result is Ferret-UI Lite, an end-to-end GUI agent that works across multiple platforms, including mobile, web, and desktop systems. Crucially, it is something that can run on a smartphone like the iPhone without too much trouble.

To achieve this, Ferret-UI Lite is built with 3 billion parameters, trained on GUI data from both real and synthetic sources. The team also improved inference-time performance using chain-of-thought reasoning and visual tools, along with reinforcement learning.

Cropping the screen image based on predictions minimizes the amount of data that needs to be parsed, as discussed in the Ferret-UI Lite paper

As an example of how Ferret-UI Lite assists locally processed queries, a zoom mechanism is included to help parse the UI image. The model makes an initial prediction, and based on that prediction, the image is cropped around the expected location.

With less image data to work with, the model can focus on the information presented in that cropped area, allowing it to refine the prediction considerably.

To the researchers, this apparently mimics human behavior when looking at something in detail.
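The coarse-to-fine idea can be sketched in a few lines. This is a minimal illustration of the crop-and-refine loop described above, not Apple's actual model code; the two predictor functions are hypothetical stand-ins for the model's full-screen and zoomed-in passes.

```python
# Sketch of a coarse-to-fine "zoom" mechanism: predict, crop, re-predict.
# The predictor callables are placeholders, not a real model API.

def crop_around(x, y, width, height, crop_size=512):
    """Clamp a crop_size window centered on (x, y) to the screen bounds."""
    left = max(0, min(x - crop_size // 2, width - crop_size))
    top = max(0, min(y - crop_size // 2, height - crop_size))
    return left, top, min(crop_size, width), min(crop_size, height)

def refine_prediction(coarse_predict, fine_predict, width, height):
    # Step 1: coarse prediction over the full screenshot.
    x, y = coarse_predict()
    # Step 2: crop around that guess and re-predict within the crop.
    left, top, cw, ch = crop_around(x, y, width, height)
    fx, fy = fine_predict(left, top, cw, ch)
    # Step 3: map the refined point back to full-screen coordinates.
    return left + fx, top + fy

# Toy usage: a coarse guess near (800, 1500) on an iPhone-sized screen,
# with a dummy fine pass that simply returns the center of the crop.
coarse = lambda: (800, 1500)
fine = lambda l, t, w, h: (w // 2, h // 2)
print(refine_prediction(coarse, fine, 1170, 2532))  # → (800, 1500)
```

The second pass sees only the cropped window, so the model spends its capacity on a small, relevant region instead of the whole screenshot.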

Promising research

While the resulting Ferret-UI Lite isn't groundbreaking, the results it achieved are still impressive considering it was pitted against server-scale LLM agents. In some cases, the team claims it can outperform larger models.

In the ScreenSpot-Pro GUI grounding test, the model achieves an accuracy of 53.3%. This is more than 15% better than UI-TARS-1.5, an LLM with 7 billion parameters.

However, not everything is great. In the GUI navigation task, Ferret-UI Lite trailed the larger models, though it was still on par with UI-TARS-1.5.

Finally, the paper concludes that the experiment "validates the effectiveness of these strategies for small-scale agents," while also pointing out limitations. Scaling down GUI agents presents both promise and challenges, and the team hopes this work will inform future research.