Title: Full System Architecture Modeling for Wearable Egocentric Contextual AI
ArXiv ID: 2512.16045
Date: 2025-12-18
Authors: Vincent T. Lee, Tanfer Alan, Sung Kim, Ecenur Ustun, Amr Suleiman, Ajit Krisshna, Tim Balbekov, Armin Alaghi, Richard Newcombe
📝 Abstract
The next generation of human-oriented computing will require always-on, spatially-aware wearable devices to capture egocentric vision and functional primitives (e.g., Where am I? What am I looking at?, etc.). These devices will sense an egocentric view of the world around us to observe all human-relevant signals across space and time to construct and maintain a user's personal context. This personal context, combined with advanced generative AI, will unlock a powerful new generation of contextual AI personal assistants and applications. However, designing a wearable system to support contextual AI is a daunting task because of the system's complexity and stringent power constraints due to weight and battery restrictions. To understand how to guide design for such systems, this work provides the first complete system architecture view of one such wearable contextual AI system (Aria2), along with the lessons we have learned through the system modeling and design space exploration process. We show that an end-to-end full system model view of such systems is vitally important, as no single component or category overwhelmingly dominates system power. This means long-range design decisions and power optimizations need to be made in the full system context to avoid running into limits caused by other system bottlenecks (i.e., Amdahl's law as applied to power) or as bottlenecks change. Finally, we reflect on lessons and insights for the road ahead, which will be important toward eventually enabling all-day, wearable, contextual AI systems.
📄 Full Content
Full System Architecture Modeling for Wearable Egocentric Contextual AI
Vincent T. Lee†, Tanfer Alan†, Sung Kim†, Ecenur Ustun†, Amr Suleiman†, Ajit Krisshna‡, Tim Balbekov‡, Armin Alaghi†, Richard Newcombe†
Meta Reality Labs Research†
Meta Reality Labs Silicon‡
{vtlee, tanfer, sungmk, ecenurustun, amrsuleiman, ajitkrisshna, timbalb, alaghi, newcombe}@meta.com
Abstract—The next generation of human-oriented computing will require always-on, spatially-aware wearable devices to capture egocentric vision and functional primitives (e.g., Where am I?, What am I looking at?, etc.). These devices will sense an egocentric view of the world around us to observe all human-relevant signals across space and time to construct and maintain a user's personal context. This personal context, combined with advanced generative AI, will unlock a powerful new generation of contextual AI personal assistants and applications. However, designing a wearable system to support contextual AI is a daunting task because of the system's complexity and stringent power constraints due to weight and battery restrictions. To understand how to guide design for such systems, this work provides the first complete system architecture view of one such wearable contextual AI system (Aria2), along with the lessons we have learned through the system modeling and design space exploration process. We show that an end-to-end full system model view of such systems is vitally important, as no single component or category overwhelmingly dominates system power. This means long-range design decisions and power optimizations need to be made in the full system context to avoid running into limits caused by other system bottlenecks (i.e., Amdahl's law as applied to power) or as bottlenecks change. Finally, we reflect on lessons and insights for the road ahead, which will be important toward eventually enabling all-day, wearable, contextual AI systems.
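To make the abstract's "Amdahl's law as applied to power" point concrete, the sketch below works through a toy power budget. The component names and milliwatt values are placeholder assumptions for illustration only, not figures from the Aria2 system model.

```python
# Illustrative only: component power numbers are placeholders,
# not measurements or estimates from the paper.
component_power_mw = {
    "sensors": 120.0,
    "compute": 150.0,
    "memory": 80.0,
    "connectivity": 100.0,
    "display_audio": 60.0,
}

def system_power_after_optimizing(component: str, factor: float) -> float:
    """Total system power if one component's power is reduced by `factor`x."""
    total = sum(component_power_mw.values())
    return total - component_power_mw[component] + component_power_mw[component] / factor

baseline = sum(component_power_mw.values())
improved = system_power_after_optimizing("compute", factor=4.0)
print(f"baseline: {baseline:.0f} mW, after 4x compute reduction: {improved:.0f} mW")
# Even an aggressive 4x reduction of the largest single component lowers
# total power by only ~22% here: the untouched components bound the gain,
# analogous to how the serial fraction bounds speedup in Amdahl's law.
```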
I. INTRODUCTION
Wearable egocentric-perception devices, such as smart glasses, promise to enable the next generation of human-oriented computing by observing the world from humans' point of view and combining these observations with AI to unlock powerful new personalized contextual artificial intelligence (AI) assistants and capabilities [13]. Similar to the Xerox Alto introduced over 50 years ago, these systems are positioned to revolutionize how we interact with computing devices [32]. However, unlike previous generations of human-oriented computing, e.g., personal computers and smartphones, wearable glasses will also have access to egocentric signals such as eye gaze, head pose, and hand positions, which, along with other personal signals, form the user's personal context over time. This personal context [2] contains significantly richer, more structured, and longer-term personal information which, when combined with generative AI, will unlock more powerful personalized contextual AI applications.
In recent years, generative AI has made significant strides in creating sophisticated models capable of understanding and generating human-like text and images. However, these AI systems are trained on vast datasets that do not include the nuanced personal data an egocentric wearable device can capture, which limits their ability to provide personalized and context-aware assistance. This gap can be bridged by integrating personal context and egocentric signals, i.e., how the user experiences the world, into AI systems. For instance, AI could observe personal food intake or daily exercise activity and propose restaurants or routines to meet personal dietary or health goals. This personally tailored AI service, which augments and complements human capabilities, is what we refer to as contextual AI.
Wearable, always-on devices, such as smart glasses [1], [9], are uniquely positioned to fill this gap by continuously capturing egocentric signals like eye gaze, head pose, and hand positions, along with other personal context signals. The egocentric signals required to construct personal context are computed by a set of egocentric primitives (Where am I?, What do I see?, etc.). Each primitive has an implementation that takes information sensed from the world around us, computes over it, and generates an egocentric signal. For instance, an egocentric primitive implementation such as hand tracking uses outward-facing cameras to sense visual data, compute over it, and generate hand position signals.
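As a rough illustration of this sense-compute-signal structure, the sketch below defines a hypothetical `EgocentricPrimitive` interface with a hand-tracking example. The class, method, and sensor names are our own assumptions for exposition, not part of the Aria2 software stack.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Sequence

@dataclass
class EgocentricSignal:
    name: str          # e.g., "hand_pose"
    payload: Any       # e.g., 3D keypoints per hand
    timestamp_ns: int  # capture time of the underlying sensor data

class EgocentricPrimitive(ABC):
    """A primitive senses raw data, computes over it, and emits a signal."""

    # Sensing modalities this primitive needs; drives sensor type and placement.
    required_sensors: Sequence[str] = ()

    @abstractmethod
    def compute(self, sensor_frames: dict, timestamp_ns: int) -> EgocentricSignal:
        ...

class HandTracking(EgocentricPrimitive):
    required_sensors = ("left_mono_camera", "right_mono_camera")

    def compute(self, sensor_frames, timestamp_ns):
        # A real implementation would run a hand-pose model on the
        # outward-facing camera frames; here we emit a placeholder result.
        keypoints = {"left": [], "right": []}
        return EgocentricSignal("hand_pose", keypoints, timestamp_ns)
```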
From these egocentric signals and workloads, we can derive the device architecture requirements for always-on personal contextual AI. The required input sensing modalities for each primitive define the type, number, and placement of sensors for a wearable device (e.g., RGB camera, inertial measurement units, etc.). The specific algorithm implementations define the architectural resources required to support each workload, as well as the design optimization trade-offs between compute, memory, communication, and other resources.
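Building on that idea, a minimal sketch of how per-primitive requirements could be rolled up into device-level sensor and resource budgets is shown below. The primitives, modalities, and numeric estimates are illustrative assumptions, not values reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class PrimitiveRequirements:
    name: str
    sensors: tuple           # input sensing modalities
    compute_gops: float      # sustained compute estimate (illustrative)
    memory_mb: float         # working-set estimate (illustrative)
    link_mbps: float         # on-device communication estimate (illustrative)

# Hypothetical workload set; all numbers are placeholders.
primitives = [
    PrimitiveRequirements("localization", ("mono_cameras", "imu"), 20.0, 64.0, 5.0),
    PrimitiveRequirements("hand_tracking", ("mono_cameras",), 15.0, 32.0, 3.0),
    PrimitiveRequirements("eye_gaze", ("eye_cameras",), 5.0, 16.0, 1.0),
]

# Sensor selection: the union of modalities required by any primitive.
device_sensors = sorted({s for p in primitives for s in p.sensors})

# Resource sizing: assume all primitives run concurrently, so sum their demands.
total_gops = sum(p.compute_gops for p in primitives)
total_memory = sum(p.memory_mb for p in primitives)
total_link = sum(p.link_mbps for p in primitives)

print("sensors:", device_sensors)
print(f"compute: {total_gops} GOPS, memory: {total_memory} MB, link: {total_link} Mbps")
```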