On Extending Semantic Abstraction for Efficient Search of Hidden Objects

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: On Extending Semantic Abstraction for Efficient Search of Hidden Objects
ArXiv ID: 2512.22220
Date: 2025-12-22
Authors: Tasha Pais, Nikhilesh Belulkar

📝 Abstract

Semantic Abstraction's key observation is that 2D VLMs' relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.

💡 Deep Analysis

Deep Dive into On Extending Semantic Abstraction for Efficient Search of Hidden Objects.

Semantic Abstraction’s key observation is that 2D VLMs’ relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as “abstract object” representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.

📄 Full Content

On Extending Semantic Abstraction for Efficient Search of Hidden Objects Nikhilesh Belulkar nb2953@columbia.edu Department of Computer Science Columbia University New York, NY 10025 Tasha Pais tdp2129@columbia.edu Department of Computer Science Columbia University New York, NY 10025 Abstract Semantic Abstraction’s key observation is that 2D VLMs’ relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects. Additional Resources: Code 1 Problem Definition In the world of personal assistant robots, retrieval of specific objects given a time constraint is an important task. For instance, a user could command a robot to "find my keys" while getting ready to leave the house. Since the robot has sufficiently explored the environment from past actions and has a quantitative understanding of where an individual’s unique preference for an object’s placement is, the robot’s memory is much better suited to quickly search and retrieve an object while a user continues a different task. This can be a form of human-robot collaboration when the task is clear and often repeated. A point cloud is the most complete representation of an object in 3D space. When aggregating views of a partially hidden object from multiple camera angles, it can be difficult to localize the object. Given that an object may be hidden in one view and visible in another, we concatenate an object’s point cloud from multiple views. 2 Method Our method uses semantic abstraction as the primary system over which we develop our algorithm to optimize search. The key steps of our procedure are summarized below: 1. Collect multiple egocentric views of the environment through interaction, ex: opening cabinets and drawers. I Can’t Believe It’s Not Better Workshop III at NeurIPS 2022. arXiv:2512.22220v1 [cs.CV] 22 Dec 2025 2. From these views, query semantic abstraction with objects and their labels to generate point clouds for individual objects, ex: "apple in fridge", "tomato in fridge", "fork in drawer". 3. Aggregate the location data from several views and use the Expectation-Maximization algorithm to generate a Gaussian Mixture Model (GMM) about the location of a particular object. 4. Once this GMM has been developed, use it to optimize search times when queried for a specific object. 2.1 Step One The robot interacts with its environment collecting several different views. Since this is an embodied AI task, we chose RoboTHOR as the simulation environment. This random interaction is currently an explicitly predefined procedure in our code. In the future, we intend to use an LLM like ChatGPT to translate natural language queries into actions and provide an understanding of where certain objects are typically located to begin search. Figure 1: The camera view most often used to query Semantic Abstraction 2.2 Step Two Given an image, we utilized much of the infrastructure provided by semantic abstraction to generate a point cloud representing the relevancy of different parts of the image to the query provided. A query is made to CLIP. CLIP is a database that contains images and captions. Through contrastive learning CLIP localizes the parts of an image relevant to a caption. It utilizes this understanding to score how relevant each pixel of an image is to a particular query. For instance in the image below a relevancy score is provided to each pixel in the image when queried with "fork in cabinet". Further, we also used the intrinsic and extrinsic camera parameters of each view to map where each pixel is in three-dimensional space. This is best visualized through the point cloud produced below. Figure 2: Query to localize the location of a fork inside a cabinet Once each pixel’s 3D location is mapped to its corresponding CLIP relevancy score, we can create a buffer of point clouds in 3-dimensional space labeled with their object names. 2 2.3 Step Three Once we have the precise location of an object mapped over time, we then use this data to perform the Expectation-Maximization algorithm to generate a Gaussian Mixture Model. The EM randomly initializes the means and covariance matrices of a specific number of Gaussian distributions. The algorithm then computes t

…(Full text truncated)…

📄 Read Full PDF on ArXiv