Apple has introduced ReALM, an AI system designed to understand on-screen tasks, conversational context, and background processes, bringing voice assistants like Siri one step closer to being context-aware.
As more tech companies make strides with AI systems, Apple’s ReALM just might offer a glimpse into the future of more intuitive, context-aware voice assistants, the tech giant disclosed in a recent research paper.
The research paper, titled “ReALM: Reference Resolution As Language Modeling,” was authored by Joel Ruben Antony Moniz et al. at Apple.
The system’s development is rooted in the idea that understanding context, including ambiguous references in human speech, is crucial for a conversational assistant. This involves integrating both conversational and on-screen contexts to enhance the user experience.
What is ReALM? And how does it work?
ReALM stands for Reference Resolution As Language Modeling. Reference resolution in discourse, according to Tutorials Point, is the task of determining which entities are referred to by which linguistic expressions, with the term “reference” denoting a linguistic expression used to refer to an entity or individual.
“While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized,” the paper explains.
“Human speech,” the paper stated, “typically contains ambiguous references such as “they” or “that”, whose meaning is obvious (to other humans) given the context. Being able to understand context, including references like these, is essential for a conversational assistant that aims to allow a user to naturally communicate their requirements to an agent, or to have a conversation with it.”
Thus, enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants, Apple said.
In a bid to solve this problem, ReALM uses a novel method of converting on-screen information into text, which allows the system to bypass traditional image-recognition processes and run more efficiently on-device. The model considers both the content visible on the user’s screen and ongoing background tasks when deciding what a query refers to.
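Apple has not released code for ReALM, but the idea of representing a screen as text can be illustrated with a rough sketch. Assuming each parsed UI element arrives with its visible text and a bounding-box position (the field names below are hypothetical), the elements can be sorted top-to-bottom and left-to-right and joined into lines, giving a language model a plain-text stand-in for the screenshot:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    """A parsed on-screen element (hypothetical structure for illustration)."""
    text: str    # visible text, e.g. "(555) 010-4477"
    top: float   # vertical position of the bounding box
    left: float  # horizontal position of the bounding box

def screen_to_text(elements: list[UIElement], row_tolerance: float = 10.0) -> str:
    """Render parsed UI elements as plain text, roughly preserving layout.

    Elements whose vertical positions fall within `row_tolerance` of each
    other are treated as one line, read left to right.
    """
    ordered = sorted(elements, key=lambda e: (e.top, e.left))
    lines: list[list[UIElement]] = []
    for el in ordered:
        if lines and abs(el.top - lines[-1][0].top) <= row_tolerance:
            lines[-1].append(el)
        else:
            lines.append([el])
    return "\n".join(
        " ".join(e.text for e in sorted(line, key=lambda e: e.left))
        for line in lines
    )

# Example: a business page with a name, an address, and a phone number.
screen = [
    UIElement("Joe's Pizza", top=0, left=0),
    UIElement("123 Main St", top=40, left=0),
    UIElement("(555) 010-4477", top=40, left=200),
]
print(screen_to_text(screen))
# Joe's Pizza
# 123 Main St (555) 010-4477
```

How elements are segmented into entities and how much spatial layout to keep are design choices; the sketch above only shows the general idea of trading a screenshot for text.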
Apple’s ReALM models outperformed GPT-4, despite having fewer parameters, the tech company claimed. A practical application of ReALM would be when a user is browsing a website and wants to call a business. By instructing Siri with a command like “call the business,” Siri can identify the phone number on the screen and initiate the call.
Apple’s ReALM: The Research
The researchers compared ReALM against two baselines, MARRS and ChatGPT (in both its GPT-3.5 and GPT-4 variants), and found that ReALM’s performance was notably superior.
The study focused on a critical task known as reference resolution, which involves extracting the entity or entities pertinent to a user’s query from three distinct types of relevant entities:
- On-screen Entities: These are entities that are currently displayed on a user’s screen.
- Conversational Entities: These are entities relevant to the conversation, either coming from a previous turn by the user or from the virtual assistant.
- Background Entities: These are relevant entities that come from background processes, not necessarily visible on the user’s screen or directly part of the interaction with the virtual agent.
The task was formulated as a multiple-choice challenge for the Large Language Model (LLM), where the objective was to select the relevant entity from a list displayed on the user’s screen.
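In concrete terms, that formulation can be sketched as a prompt that numbers the candidate entities and asks the model to choose among them. The wording and function below are illustrative, not taken from the paper:

```python
def build_prompt(query: str, entities: list[str]) -> str:
    """Format reference resolution as a multiple-choice question for an LLM."""
    options = "\n".join(f"{i}. {entity}" for i, entity in enumerate(entities, start=1))
    return (
        "Candidate entities on screen:\n"
        f"{options}\n\n"
        f"User request: {query}\n"
        "Which entity numbers does the request refer to?"
    )

print(build_prompt(
    "call the business",
    ["Joe's Pizza", "123 Main St", "(555) 010-4477"],
))
```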
The research utilised datasets that were either synthetically created or developed with the help of annotators. Each data point contained a user query, a list of entities, and the ground-truth entity or entities relevant to the query. The evaluation allowed for any permutation of the correct entities, ensuring a comprehensive assessment of the model’s performance.
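A data point of that shape, together with an order-insensitive check against the ground truth, might look like the following sketch (the structure and field names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    """One evaluation example: a query, candidate entities, and ground truth."""
    query: str
    entities: list[str]     # candidates shown to the model
    ground_truth: set[int]  # indices of the relevant entities

def is_correct(predicted: list[int], example: DataPoint) -> bool:
    """Accept any permutation of the correct entities."""
    return set(predicted) == example.ground_truth

example = DataPoint(
    query="call the business",
    entities=["Joe's Pizza", "123 Main St", "(555) 010-4477"],
    ground_truth={2},              # the phone number (0-indexed)
)
print(is_correct([2], example))    # True
print(is_correct([0, 2], example)) # False
```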
The models compared were:
– MARRS: A non-LLM-based baseline approach, specifically designed for reference resolution.
– ChatGPT: Utilised both GPT-3.5 and GPT-4 variants with in-context learning. GPT-4 was provided with a screenshot for on-screen reference resolution, significantly enhancing its performance.
– ReALM: The proposed model fine-tuned a FLAN-T5 model, with the parsed input and entities converted into a sentence-wise textual format for training (a rough sketch of this setup follows the list).
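As a loose illustration of that setup, the snippet below runs a single supervised fine-tuning step on a public FLAN-T5 checkpoint with one input/target pair in a sentence-wise format. The checkpoint name, example strings, and hyperparameters are placeholders; the paper’s actual training data, prompt format, and model sizes differ:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; the paper fine-tunes FLAN-T5 models of several sizes.
checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# One illustrative training pair: the parsed screen and query as input,
# the relevant entity number as the target.
source = (
    "Entities: 1. Joe's Pizza 2. 123 Main St 3. (555) 010-4477\n"
    "Request: call the business"
)
target = "3"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**inputs, labels=labels).loss  # standard seq2seq training loss
loss.backward()
optimizer.step()
print(float(loss))
```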
The research results were presented in a comparative table showcasing the accuracy of different models across various datasets. ReALM demonstrated superior performance, outperforming both MARRS and GPT-4 in several key metrics. Notably, ReALM was able to understand more domain-specific questions due to its fine-tuning on user requests.