Title: Full System Architecture Modeling for Wearable Egocentric Contextual AI
ArXiv ID: 2512.16045
Date: 2025-12-18
Authors: Vincent T. Lee, Tanfer Alan, Sung Kim, Ecenur Ustun, Amr Suleiman, Ajit Krisshna, Tim Balbekov, Armin Alaghi, Richard Newcombe
📝 Abstract
The next generation of human-oriented computing will require always-on, spatially-aware wearable devices to capture egocentric vision and functional primitives (e.g., Where am I? What am I looking at?, etc.). These devices will sense an egocentric view of the world around us to observe all human-relevant signals across space and time to construct and maintain a user's personal context. This personal context, combined with advanced generative AI, will unlock a powerful new generation of contextual AI personal assistants and applications. However, designing a wearable system to support contextual AI is a daunting task because of the system's complexity and stringent power constraints due to weight and battery restrictions. To understand how to guide design for such systems, this work provides the first complete system architecture view of one such wearable contextual AI system (Aria2), along with the lessons we have learned through the system modeling and design space exploration process. We show that an end-to-end full system model view of such systems is vitally important, as no single component or category overwhelmingly dominates system power. This means long-range design decisions and power optimizations need to be made in the full system context to avoid running into limits caused by other system bottlenecks (i.e., Amdahl's law as applied to power) or as bottlenecks change. Finally, we reflect on lessons and insights for the road ahead, which will be important toward eventually enabling all-day, wearable, contextual AI systems.
📄 Full Content
Full System Architecture Modeling for Wearable Egocentric Contextual AI
Vincent T. Lee†, Tanfer Alan†, Sung Kim†, Ecenur Ustun†, Amr Suleiman†, Ajit Krisshna‡, Tim Balbekov‡, Armin Alaghi†, Richard Newcombe†
Meta Reality Labs Research†
Meta Reality Labs Silicon‡
{vtlee, tanfer, sungmk, ecenurustun, amrsuleiman, ajitkrisshna, timbalb, alaghi, newcombe}@meta.com
Abstract—The next generation of human-oriented computing will require always-on, spatially-aware wearable devices to capture egocentric vision and functional primitives (e.g., Where am I?, What am I looking at?, etc.). These devices will sense an egocentric view of the world around us to observe all human-relevant signals across space and time to construct and maintain a user's personal context. This personal context, combined with advanced generative AI, will unlock a powerful new generation of contextual AI personal assistants and applications. However, designing a wearable system to support contextual AI is a daunting task because of the system's complexity and stringent power constraints due to weight and battery restrictions. To understand how to guide design for such systems, this work provides the first complete system architecture view of one such wearable contextual AI system (Aria2), along with the lessons we have learned through the system modeling and design space exploration process. We show that an end-to-end full system model view of such systems is vitally important, as no single component or category overwhelmingly dominates system power. This means long-range design decisions and power optimizations need to be made in the full system context to avoid running into limits caused by other system bottlenecks (i.e., Amdahl's law as applied to power) or as bottlenecks change. Finally, we reflect on lessons and insights for the road ahead, which will be important toward eventually enabling all-day, wearable, contextual AI systems.
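To make the abstract's "Amdahl's law as applied to power" point concrete, the sketch below works through a toy power budget. The component names and milliwatt values are placeholder assumptions for illustration only, not figures from the Aria2 system model.

```python
# Illustrative only: component power numbers are placeholders,
# not measurements or estimates from the paper.
component_power_mw = {
    "sensors": 120.0,
    "compute": 150.0,
    "memory": 80.0,
    "connectivity": 100.0,
    "display_audio": 60.0,
}

def system_power_after_optimizing(component: str, factor: float) -> float:
    """Total system power if one component's power is reduced by `factor`x."""
    total = sum(component_power_mw.values())
    return total - component_power_mw[component] + component_power_mw[component] / factor

baseline = sum(component_power_mw.values())
improved = system_power_after_optimizing("compute", factor=4.0)
print(f"baseline: {baseline:.0f} mW, after 4x compute reduction: {improved:.0f} mW")
# Even an aggressive 4x reduction of the largest single component lowers
# total power by only ~22% here: the untouched components bound the gain,
# analogous to how the serial fraction bounds speedup in Amdahl's law.
```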
I. INTRODUCTION
Wearable egocentric-perception devices, such as smart glasses, promise to enable the next generation of human-oriented computing by observing the world from humans' point of view and combining these observations with AI to unlock powerful new personalized contextual artificial intelligence (AI) assistants and capabilities [13]. Similar to the Xerox Alto introduced over 50 years ago, these systems are positioned to revolutionize how we interact with computing devices [32]. However, unlike previous generations of human-oriented computing, e.g., personal computers and smartphones, wearable glasses will also have access to egocentric signals such as eye gaze, head pose, and hand positions, which, along with other personal signals, form the user's personal context over time. This personal context [2] contains significantly richer, more structured, and longer-term personal information which, when combined with generative AI, will unlock more powerful personalized contextual AI applications.
In recent years, generative AI has made significant strides in creating sophisticated models capable of understanding and generating human-like text and images. However, these AI systems are trained on vast datasets that do not include the nuanced personal data an egocentric wearable device can capture, which limits their ability to provide personalized and context-aware assistance. This gap can be bridged by integrating personal context and egocentric signals, i.e., how the user experiences the world, into AI systems. For instance, AI could observe personal food intake or daily exercise activity and propose restaurants or routines to meet personal dietary or health goals. This personally tailored AI service, which augments and complements human capabilities, is what we refer to as contextual AI.
Wearable, always-on devices, such as smart glasses [1], [9], are uniquely positioned to fill this gap by continuously capturing egocentric signals like eye gaze, head pose, and hand positions, along with other personal context signals. The egocentric signals required to construct personal context are computed by a set of egocentric primitives (Where am I?, What do I see?, etc.). Each primitive has an implementation that takes information sensed from the world around us, computes over it, and generates an egocentric signal. For instance, an egocentric primitive implementation such as hand tracking uses outward-facing cameras to sense visual data, compute over it, and generate hand position signals.
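As a rough illustration of this sense-compute-signal structure, the sketch below defines a hypothetical `EgocentricPrimitive` interface with a hand-tracking example. The class, method, and sensor names are our own assumptions for exposition, not part of the Aria2 software stack.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Sequence

@dataclass
class EgocentricSignal:
    name: str          # e.g., "hand_pose"
    payload: Any       # e.g., 3D keypoints per hand
    timestamp_ns: int  # capture time of the underlying sensor data

class EgocentricPrimitive(ABC):
    """A primitive senses raw data, computes over it, and emits a signal."""

    # Sensing modalities this primitive needs; drives sensor type and placement.
    required_sensors: Sequence[str] = ()

    @abstractmethod
    def compute(self, sensor_frames: dict, timestamp_ns: int) -> EgocentricSignal:
        ...

class HandTracking(EgocentricPrimitive):
    required_sensors = ("left_mono_camera", "right_mono_camera")

    def compute(self, sensor_frames, timestamp_ns):
        # A real implementation would run a hand-pose model on the
        # outward-facing camera frames; here we emit a placeholder result.
        keypoints = {"left": [], "right": []}
        return EgocentricSignal("hand_pose", keypoints, timestamp_ns)
```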
From these egocentric signals and workloads, we can derive the device architecture requirements for always-on personal contextual AI. The required input sensing modalities for each primitive define the type, number, and placement of sensors for a wearable device (e.g., RGB camera, inertial measurement units, etc.). The specific algorithm implementations define the architectural resources required to support each workload, as well as the design optimization trade-offs between compute, memory, communication, and other resources.
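Building on that idea, a minimal sketch of how per-primitive requirements could be rolled up into device-level sensor and resource budgets is shown below. The primitives, modalities, and numeric estimates are illustrative assumptions, not values reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class PrimitiveRequirements:
    name: str
    sensors: tuple           # input sensing modalities
    compute_gops: float      # sustained compute estimate (illustrative)
    memory_mb: float         # working-set estimate (illustrative)
    link_mbps: float         # on-device communication estimate (illustrative)

# Hypothetical workload set; all numbers are placeholders.
primitives = [
    PrimitiveRequirements("localization", ("mono_cameras", "imu"), 20.0, 64.0, 5.0),
    PrimitiveRequirements("hand_tracking", ("mono_cameras",), 15.0, 32.0, 3.0),
    PrimitiveRequirements("eye_gaze", ("eye_cameras",), 5.0, 16.0, 1.0),
]

# Sensor selection: the union of modalities required by any primitive.
device_sensors = sorted({s for p in primitives for s in p.sensors})

# Resource sizing: assume all primitives run concurrently, so sum their demands.
total_gops = sum(p.compute_gops for p in primitives)
total_memory = sum(p.memory_mb for p in primitives)
total_link = sum(p.link_mbps for p in primitives)

print("sensors:", device_sensors)
print(f"compute: {total_gops} GOPS, memory: {total_memory} MB, link: {total_link} Mbps")
```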