Earlier this year, Apple brought together members of the academic research community for an exclusive two-day workshop on the current state of the art in natural language understanding, with discussions and presentations on key challenges and potential solutions.
Balancing Privacy and Performance
A critical theme was the difficulty of advancing natural language understanding while preserving user privacy, given foundation models' propensity to memorize training data. Princeton assistant professor Danqi Chen highlighted research showing privacy risks in popular hybrid approaches that pair a parametric model with k-nearest-neighbors retrieval. Her work demonstrates that fine-tuning only the parametric model reduces privacy leakage.
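To make the hybrid approach concrete, here is a minimal sketch of the kNN-retrieval-plus-parametric-model idea: the final next-token distribution interpolates the parametric model's prediction with a distribution built from the k nearest neighbors in a datastore. All names, sizes, and the interpolation weight are illustrative, not taken from the research discussed at the workshop.

```python
import numpy as np

def knn_lm_interpolate(p_parametric, datastore_keys, datastore_tokens,
                       query, vocab_size, k=2, lam=0.5):
    """Interpolate a parametric distribution with a kNN distribution."""
    # Distances from the query context vector to every stored key.
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Turn negative distances into weights over the retrieved tokens.
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, idx in zip(weights, nearest):
        p_knn[datastore_tokens[idx]] += w
    # lam controls how strongly the retrieved (potentially private)
    # datastore contents influence the output distribution.
    return lam * p_knn + (1 - lam) * p_parametric

vocab_size = 5
p_param = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
keys = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
tokens = np.array([3, 2, 2])  # next tokens stored alongside each key
query = np.array([1.0, 0.0])

p_final = knn_lm_interpolate(p_param, keys, tokens, query, vocab_size)
```

Because the datastore holds raw training examples, the retrieval path can leak memorized data directly, which is one intuition behind the privacy risk described above.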
The workshop also addressed the daunting task of integrating foundation models into production systems. With their massive data needs, foundation models currently require substantial computing resources, which poses challenges for deployment on smaller devices like phones. Presenters explored several model compression techniques, such as weight pruning and lower-precision encoding, covered in papers from University of Washington professor Luke Zettlemoyer and Cornell Tech PhD candidate Rajiv Movva.
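The two compression techniques named above can be sketched in a few lines. This is a generic, illustrative implementation of magnitude-based weight pruning and symmetric 8-bit quantization, not code from the workshop papers; the matrix and sparsity level are made up.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights."""
    threshold = np.quantile(np.abs(weights).flatten(), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale  # dequantize later with q * scale

w = np.array([[0.8, -0.05, 0.3],
              [-0.9, 0.02, -0.4]])
pruned = prune_by_magnitude(w, sparsity=0.5)
q, scale = quantize_int8(pruned)
recovered = q.astype(np.float32) * scale  # approximate reconstruction
```

Pruning shrinks the model by discarding weights that contribute little, while quantization stores each surviving weight in one byte instead of four; production systems typically combine such tricks with retraining to recover lost accuracy.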
Foundation models are also prone to "hallucinations", generating false or extraneous details. As Pascale Fung's research underscores, such hallucinations can lead to the propagation of incorrect information, which can be detrimental. Grounding agents in verified external knowledge from auxiliary models and systems, rather than relying solely on foundation models, may mitigate these issues.
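The grounding idea can be illustrated with a deliberately simple sketch: instead of letting a generative model answer from its parameters alone, the system first consults a store of verified facts and declines to answer when nothing relevant is found. The store and substring matching below are hypothetical stand-ins for a real knowledge base and retriever.

```python
# Hypothetical verified knowledge store (illustrative entries only).
VERIFIED_FACTS = {
    "capital of france": "Paris is the capital of France.",
    "boiling point of water": "Water boils at 100 °C at sea level.",
}

def grounded_answer(question: str) -> str:
    """Answer only when a verified external fact supports it."""
    q = question.lower()
    for key, fact in VERIFIED_FACTS.items():
        if key in q:
            return fact  # answer is backed by the external store
    # Declining beats hallucinating a plausible-sounding falsehood.
    return "I don't have a verified answer for that."

print(grounded_answer("What is the capital of France?"))
print(grounded_answer("Who won the 1846 chess championship?"))
```

The design choice here is the fallback: a grounded system trades coverage for trustworthiness by refusing rather than guessing.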
Leveraging multimodal context to refine conversational understanding was another area of interest. Workshop presenters explored multimodal techniques using visual, prosodic, and contextual signals to reduce ambiguity and improve conversational systems. Hadas Kotek from Apple demonstrated how context can help interpret phrases like "today" or "the lowest one".
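A toy example shows why such phrases need context: their meaning lives outside the utterance itself. The context dictionary below is a hypothetical stand-in for the richer visual and dialogue signals discussed in the talk, not an actual Apple system.

```python
import datetime

def resolve_reference(phrase: str, context: dict):
    """Resolve context-dependent phrases against surrounding signals."""
    if phrase == "today":
        return context["current_date"].isoformat()
    if phrase == "the lowest one":
        # e.g. items currently shown on screen, compared by price
        items = context["on_screen_items"]
        return min(items, key=lambda x: x["price"])["name"]
    return phrase  # no contextual interpretation needed

context = {
    "current_date": datetime.date(2023, 8, 1),
    "on_screen_items": [
        {"name": "Hotel A", "price": 210},
        {"name": "Hotel B", "price": 145},
    ],
}
print(resolve_reference("today", context))
print(resolve_reference("the lowest one", context))
```

Without the on-screen state, "the lowest one" is unanswerable; with it, the reference resolves unambiguously.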
Mari Ostendorf from the University of Washington outlined other contextual elements crucial for natural dialogue, including external knowledge, dialogue history, and prosodic signals.
Presenters also covered multimodal learning to obtain joint representations of text and images. This is vital for systems that use text to search for images or describe images in text. Yinfei Yang, an Apple researcher, presented his work on learning sparse text and image representation in grounded tokens, a crucial step towards achieving this.
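As a rough intuition for sparse grounded-token representations, consider projecting both a text embedding and an image embedding onto a shared token vocabulary, with a non-negative activation keeping the result sparse, so matching becomes a dot product over interpretable token dimensions. The projections, sizes, and activation below are invented for illustration and are not the method from the presented work.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, vocab_size = 4, 10
# Hypothetical learned projections from each modality to the vocabulary.
W_text = rng.normal(size=(embed_dim, vocab_size))
W_image = rng.normal(size=(embed_dim, vocab_size))

def to_sparse_tokens(dense, W):
    # ReLU followed by log1p keeps activations sparse and non-negative.
    return np.log1p(np.maximum(dense @ W, 0.0))

text_emb = rng.normal(size=embed_dim)
image_emb = rng.normal(size=embed_dim)
text_tokens = to_sparse_tokens(text_emb, W_text)
image_tokens = to_sparse_tokens(image_emb, W_image)
score = float(text_tokens @ image_tokens)  # cross-modal retrieval score
```

Because both modalities land in the same vocabulary-sized space, standard sparse-retrieval machinery (inverted indexes) can serve text-to-image search.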
Synthetic Data Generation
Presenters discussed using foundation models to generate high-quality synthetic datasets, reducing reliance on human-labeled data. Despite the promising privacy advantages of synthetic data, measuring and ensuring the quality and diversity of output data remain challenging, especially for multi-turn interactions. The workshop highlighted potential solutions, such as using foundation models for review and analysis, and incorporating human annotation.
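One of the challenges above, keeping synthetic output diverse, can be sketched with a simple filtering step: after a (here simulated) foundation model produces candidate examples, near-duplicates are dropped. The candidate list and the token-overlap measure are deliberately simple placeholders for a real generator and similarity metric.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def filter_diverse(candidates, max_overlap=0.6):
    """Greedily keep candidates that don't overlap too much with kept ones."""
    kept = []
    for cand in candidates:
        if all(token_overlap(cand, k) <= max_overlap for k in kept):
            kept.append(cand)
    return kept

candidates = [
    "set a timer for ten minutes",
    "set a timer for 10 minutes",      # near-duplicate, gets filtered
    "remind me to call mom tomorrow",
    "play some relaxing jazz music",
]
print(filter_diverse(candidates))
```

In practice this filtering role is exactly where the article's "foundation models for review and analysis" come in, replacing the crude word-overlap check with model-based quality judgments.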
While Apple may not be as vocal or visibly active as tech giants like Microsoft and Meta in publicizing its AI ambitions, it has been diligently exploring behind the scenes, contributing valuable research and earnestly investigating ways to integrate powerful but responsible AI into consumer products. Its approach is not only about leveraging the transformative power of AI but about doing so in a manner that respects user privacy and safety.
Privacy preservation, knowledge grounding, multimodal understanding, and controlled synthetic data generation appear crucial to leveraging large language models in safe, useful real-world applications. With thoughtful research advancing these fronts, maybe one day we will get a Siri that can interpret meaning as effectively as humans do.
For those interested in delving deeper into these themes, Apple has curated videos of select workshop talks and related academic papers on natural language understanding on their website.