Google published a research paper on extracting user intent from user interactions, which can then be used by autonomous agents. The method uses small on-device models that do not need to send data back to Google, which protects the user’s privacy.
The researchers discovered they were able to solve the problem by splitting it into two tasks. Their solution worked so well it was able to beat the base performance of multi-modal large language models (MLLMs) in massive data centers.
Smaller Models On Browsers And Devices
The focus of the research is on identifying the user intent through the series of actions that a user takes on their mobile device or browser while also keeping that information on the device so that no information is sent back to Google. That means the processing must happen on the device.
They accomplished this in two stages.
- In the first stage, the model on the device summarizes what the user was doing.
- The sequence of summaries is then sent to a second model that identifies the user intent.
The researchers explained:
“…our two-stage approach demonstrates superior performance compared to both smaller models and a state-of-the-art large MLLM, independent of dataset and model type.
Our approach also naturally handles scenarios with noisy data that traditional supervised fine-tuning methods struggle with.”
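The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the structure, not the researchers' implementation: the names `extract_intent`, `toy_summarize`, and `toy_identify_intent` are hypothetical, and the two toy functions are plain string stand-ins for what would be on-device models.

```python
from typing import Callable, List, Sequence

# Hypothetical signatures; the paper's actual model interfaces may differ.
Summarizer = Callable[[str], str]                  # stage 1: one interaction -> one summary
IntentIdentifier = Callable[[Sequence[str]], str]  # stage 2: all summaries -> one intent


def extract_intent(
    interactions: Sequence[str],
    summarize: Summarizer,
    identify_intent: IntentIdentifier,
) -> str:
    """Run stage 1 on every interaction, then stage 2 on the list of summaries."""
    summaries: List[str] = [summarize(step) for step in interactions]
    return identify_intent(summaries)


def toy_summarize(interaction: str) -> str:
    # Stand-in for the first model (assumption: real input would include a screenshot).
    return f"User {interaction}"


def toy_identify_intent(summaries: Sequence[str]) -> str:
    # Stand-in for the second model; it only joins the summaries together.
    return f"Intent inferred from {len(summaries)} summaries: " + "; ".join(summaries)


interactions = [
    "typed 'flights to Paris' in the search box",
    "clicked the first result",
]
intent = extract_intent(interactions, toy_summarize, toy_identify_intent)
print(intent)
```

The point of the split is that the second stage never sees raw screens, only short summaries, which is what lets both stages stay small enough to run on the device.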
Intent Extraction From UI Interactions
Intent extraction from screenshots and text descriptions of user interactions is a technique that was proposed in 2025 and relies on Multimodal Large Language Models (MLLMs). The researchers say they applied this approach to their problem, but with an improved prompt.
The researchers explained that extracting intent is not a trivial problem to solve and that there are multiple errors that can happen along the steps. The researchers use the word trajectory to describe a user journey within a mobile or web application, represented as a sequence of interactions.
The user journey (trajectory) is turned into a formula where each interaction step consists of two parts:
- An Observation
This is the visual state of the screen (screenshot) of where the user is at that step.
- An Action
The specific action that the user performed on that screen (like clicking a button, typing text, or clicking a link).
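As a rough illustration, a trajectory of observation and action pairs could be modeled like this in Python. The class and field names (`screenshot_path`, `kind`, `target`, `text`) and the `describe` helper are assumptions made for this sketch and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """Visual state of the screen at one step (hypothetical field)."""
    screenshot_path: str


@dataclass(frozen=True)
class Action:
    """What the user did on that screen (hypothetical fields)."""
    kind: str                   # e.g. "click" or "type"
    target: str                 # the UI element that was acted on
    text: Optional[str] = None  # typed text, if any


@dataclass(frozen=True)
class Step:
    """One interaction step: an observation paired with an action."""
    observation: Observation
    action: Action


# A trajectory is an ordered sequence of steps.
Trajectory = List[Step]


def describe(step: Step) -> str:
    """Render one step as a short, human-readable line."""
    typed = f" with text {step.action.text!r}" if step.action.text is not None else ""
    return (
        f"{step.action.kind} on {step.action.target}{typed} "
        f"(screen: {step.observation.screenshot_path})"
    )


trajectory: Trajectory = [
    Step(Observation("step_0.png"), Action("type", "search box", "running shoes")),
    Step(Observation("step_1.png"), Action("click", "first result")),
]
for step in trajectory:
    print(describe(step))
```

Keeping the observation and the action as separate fields mirrors the two-part description of each step, and the ordered list preserves the sequence of the user journey.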
They described three qualities of a good extracted intent:
- “faithful: only describes things that actually occur in the trajectory;
- comprehensive: provides all of the information about the user intent required to re-enact the trajectory;
- and relevant: does not contain extraneous information beyond what is needed for comprehensiveness.”
Challenging To Evaluate Extracted Intents
The researchers explain that grading extracted intent is difficult for two reasons: user intents contain complex details (like dates or transaction data), and they are inherently subjective and ambiguous. The reason trajectories are subjective is…