Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control


๐Ÿ“ Original Info

  • Title: Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
  • ArXiv ID: 2512.24826
  • Date: 2025-12-31
  • Authors: Jason Armitage, Rico Sennrich

๐Ÿ“ Abstract

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

📄 Full Content

Vision-language models (VLMs) trained on 2D visual inputs ingest 3D scenes in the form of sequences of discrete viewpoints. Systems rely on this method to perform tasks where samples at test time include an additional dimension Wang et al. (2022); Voigt et al. (2023); Xue et al. (2024); Liu et al. (2024). Consider a visual sequence to be a set of views rendered from a camera moving over a 3D scene. In this case, the problem reduces to predicting in order the location of viewpoints over x-, y-, and z-axes. Given a system and a 3D reconstructed scene, we posit that an optimal sequence of 2D viewpoints exists, that is, the sequence that, when paired with the respective linguistic representations, is most likely to return an accurate prediction from the VLM.

Why should we be concerned with the precise locations of the viewpoints presented to the VLM? Reasoning over the features of several objects at the same time is highly challenging for these systems Linghu et al. (2024). In the generation of 3D scenes, systems can expect with some confidence that the left door of a car will be the same colour as the right. The front view of a building is in contrast a less reliable indicator of the appearance of its roof or rear door. A high level of confidence in a VLM’s assessments is of utmost importance if these systems are to be of use for experts performing scientific analysis. In planetary science, features such as a specific boulder’s texture form the core evidence when performing geological analysis on visual data from a Mars rover (see Figure 1). Object appearances vary depending on the lighting at the time of day, wind conditions, and the position of the cameras in relation to the objects Bell III et al. (2022); Paar et al. (2023). Rock surface details are highly variable: features of interest clearly visible on one side of a boulder may be covered by detritus on all other sides.

In addition to these variances in 3D scenes, the internal dynamics of VLMs are also unknown. Finetuning improves generalisation for tasks where the test samples are 2D. In contrast, availability of cross-modal training data with 3D visual inputs is restricted Poole et al. (2023). This problem is acute in exploratory science where visual evidence is novel and costly to generate. Additional training is compromised in frontier research by limitations on domain-relevant 3D assets Hofmann et al. (2023); Barnes et al. (2018); Yuan et al. (2024).

We propose a straightforward approach to predicting a VLM’s errors and tune an in-scene controller guided by a measure that quantifies both the information presented in the scene and model uncertainty. The success of this approach will rely on the ability of our measure to express the complexity of the scene. A measure of one or two features alone is insufficient: all available variables should be accounted for. The estimated value of this metric should combine the information for all variables in a way that limits redundancies, or regret as defined in Krichevsky (1968), to a minimum. To limit compute and data costs, we target a process that avoids differentiable training associated with auxiliary models or finetuning the VLM.

©2025 Copyright held by the authors. This is the authors’ version of the work. It is posted here for your personal use. Not for redistribution.

Information theory provides an efficient and principled framework in which to develop statistical approaches with provable bounds Kullback (1997). We opt for a form of mutual information estimation over multiple variables (MI) as the basis for our metric. Assuming no access to the VLM’s parameters at inference time, we apply a zeroth-order algorithm (ZO) to optimise the expressiveness of our MI measure by reducing the redundancies between variables Chen et al. (2017); Malladi et al. (2023); Wollstadt et al. (2023) of the model’s predictions. Our multi-information metric Studený (1987) optimised with a ZO algorithm (MI-ZO) overcomes the theoretical challenges of estimating MI measures over n > 2 Berrett et al. (2019) mixed discrete and continuous Gao et al. (2017) variables using active regret minimisation.

Accurate and fast reasoning on 3D scenes is a key to unlocking the potential of integrating 2D VLMs into processes where automation complements human expertise. An acute challenge is presented by scenes populated with similar objects containing minor differences. In addition to forms of scientific analysis noted above, this challenge is evident in asset creation for 3D virtual environments. Examples include generating a set of buildings to populate a street and 3D representations of components in a particle accelerator Chen et al. (2022); Peixoto et al. (2020); Huang et al. (2022). A deficit of efficient methods to evaluate and perform rapid testing on 3D generations slows development and constrains adoption of systems Hofmann et al. (2023). New methods for assisting experts in planetary science and adjacent domains are also required as the volume of observations and data processing are on the rise National Aeronautics and Space Administration (2024); Cuillandre et al. (2024).

We formulate the problem of the dimensional shift presented by 3D scenes as adaptive control of the in-scene camera to return an optimal sequence of actions. In robotics, efficient methods from control theory are used to optimise the selection of sequences of views (Swain and Stricker, 1993; Tallamraju et al., 2019). Here control of the camera is optimised by measuring the information of the visual scene in relation to the linguistic description. The result is a method for applications with complex 3D scenes that requires no access to the VLM’s parameters or costly backpropagation Malladi et al. (2023). Our research presents four contributions to improve VLMs trained on 2D data in analysis performed on 3D multi-object scenes:

• A new algorithm denoted as MI-ZO that estimates online multi-information with active regret minimisation on n > 2 variables using zeroth-order optimisation. Our method reduces overlaps between a set of continuous and discrete variables representing cross-modal inputs with a novel application of zeroth-order optimisation. A set of multivariate multi-information proposals is provided. Theoretical proofs for our method are added in Appendix G.

• A novel framework to apply adaptive control with multivariate information theoretic measures to predict the actions of an in-scene camera using noisy VLM outputs. Our controller is tailored to low-data settings with compute-efficient polynomial regression, least-squares approximation, and an interaction matrix.

• A diagnostic comprising 3D scenes of polygons and descriptions with two levels of complexity termed UC-3DS-MI (Uniform and Complex 3D Scenes for Multi-information) and a new custom measure of variance to demonstrate the relation of online feedback and active regret minimisation in estimating multivariate mutual information.

• Three new cross-modal benchmarks with 3D multi-object scenes (GeoProperties-3DS, FeatureID-3DS, and PartialView-3DS). We introduce sets of 3D scenes with language descriptions to evaluate methods for controlling an in-scene camera to enable a VLM system to analyse object properties, handle feature identification on constrained action counts, and adapt to visual occlusions.

Additional details are available on our project page at https://mi-zo.github.io/mi-zo/.

We reduce the challenge of promoting accurate and fast reasoning by VLMs on 3D scenes to predicting a sequence of camera actions on the scene that returns a correct response with the least number of views. To predict the optimal sequence of actions, we propose measuring the information capacity of our multimodal inputs.

An optimal method will return accurate assessments of the information in the inputs at low cost. Information theoretic principles are a natural choice for these measurements:

• Simple measurements on the visual and linguistic elements of the inputs can be obtained at negligible cost and serve as additional entropy sources.

• Entropy is a fundamental logarithmic measure of information and a basis for the statistical theory of estimation Shannon and Weaver (1949); Kullback (1997). In cases with n = 2 variables, entropy estimators have been proposed that provide theoretical guarantees of convergence at different scales of data, including in problems where sample sizes are small Kraskov et al. (2004).

• The problem is that mutual information provides no theoretical guarantees of optimality when the number of variables or individual measurements exceeds two. Short of perfect knowledge of the inputs, a measurement with n > 2 variables may be negative when some nonzero quantity of information is redundant Te Sun (1980), invalidating our estimate.
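To make the negativity issue concrete, here is a small worked example (ours, not from the paper) that computes three-way interaction information for an XOR relation via the inclusion-exclusion form over entropies. For fair coins X, Y and Z = X XOR Y, the result is exactly −1 bit, illustrating how a naive extension of MI beyond two variables can go negative:

```python
import itertools
import math

def H(p):
    # Shannon entropy in bits of a distribution given as {outcome: probability}
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

# Joint distribution of (x, y, z) with z = x XOR y and x, y fair coins.
joint = {}
for x, y in itertools.product([0, 1], [0, 1]):
    joint[(x, y, x ^ y)] = 0.25

def marginal(joint, idx):
    # Marginalise the joint distribution onto the variable indices in idx.
    m = {}
    for k, v in joint.items():
        key = tuple(k[i] for i in idx)
        m[key] = m.get(key, 0.0) + v
    return m

subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
h = {s: H(marginal(joint, s)) for s in subsets}

# Interaction information I(X;Y;Z) by inclusion-exclusion over entropies.
interaction = (h[(0,)] + h[(1,)] + h[(2,)]
               - h[(0, 1)] - h[(0, 2)] - h[(1, 2)]
               + h[(0, 1, 2)])
print(interaction)  # -1.0
```

Each variable carries 1 bit and each pair is independent, yet the triple is fully determined, so the three-way term comes out negative: exactly the failure mode the bullet above describes.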

Our next step is to demonstrate what the problem of negative measurement in MI estimation and the resulting regret documented in theoretical work Krichevsky (1968) means in practice. To make this relevant to 3D multi-object scenes, we have designed a novel diagnostic called UC-3DS-MI.

Our diagnostic UC-3DS-MI is a novel set of 3D scenes with varying numbers of objects and a range of features (see Appendix B). The aim is to demonstrate how minimising regret can result in accurate expression of the information in both simple uniform and complex scenes. Complexity is defined in simple terms as the number of colours and sides of the simple polygons in sample scenes. In contrast, quantifying complexity in relation to real-world objects is undermined by biases in the training data of a VLM Parashar et al. (2024). The International Space Station is a complex device with a non-uniform geometry, but trivial for large VLMs to handle as it appears frequently in datasets.

We next define the setup used for the UC-3DS-MI diagnostic and in our experiments (see Figure 2). Prediction errors in VLM decisions and complexity estimates are collected for n demonstrations ahead of starting the runs. During evaluation (see also Appendix B), the sequence of camera actions in the measurement round contains default rotations. In the correction round, the controller predicts camera actions tuned on feedback from the demonstrations and measurement round. VLM decisions on scene summary questions in this second round are scored and averaged over all samples in the benchmark. In UC-3DS-MI, our analysis compares the impact of MI measurements for samples where the VLM system responses are correct and incorrect. We compute the means of multivariate measurements and correctness labels for each viewpoint to estimate the joint probability densities over both the full set of scenes and the uniform and complex subsets. Two variants of MI metrics learned by our MI-ZO algorithm are assessed against bivariate and multivariate measures estimated by standard methods:

โ€ข ๐บ๐‘‚-๐ฟ๐ธ ๐ท-๐‘‚ ๐ฟ ar . Regret minimisation over four inputs including global and local visual inputs from the CIELAB colour space. โ€ข ๐บ๐ป-๐ฟ๐ธ ๐ท ar . Regret minimisation over three inputs including a global visual hue variable extracted from the HSV colour model.

Results for our information theoretic measurements are reported by variant and method in Figure 3 (see also Appendix D for additional results). Note in the figure that variants of the two metrics with active regret minimisation (MI-ar) separate samples with correct and incorrect labels by learning the relation between information content in a scene and system error. The outcome with camera control is that the most advantageous view of the scene can be selected. Additional results from the diagnostic are provided in Appendix E.

Ahead of detailing our method to obtain expressive MI estimates, we present prior work in control theory for computer vision, measuring mutual information over multiple variables, and zeroth-order optimisation.

Our work defines a controller that uses simple estimates of information content in the inputs for cross-modal tasks with 3D data to control an in-scene camera. Efficient control methods for tasks with 3D visual inputs form a canonical topic in robotics Christie et al. (2008); Wiedemann et al. (2023); Gonultas et al. (2023) but have only been explored in a handful of cases for tasks in 3D scenes. System identification -a standard method in control theory -is adapted by Jaques et al. (2021) to calibrate a camera and estimate 3D poses of objects in scenes using videos. The same authors introduced system identification and control into deep learning models to learn physics from images Jaques et al. (2020). System identification is the term used in control theory for the process of learning or improving a model with measurements Ljung and Glad (1994). We also propose a method that fits the broad definition of system identification in using data to learn a system -in this case, to learn the system that leads to a 2D VLM incorrectly interpreting a 3D scene. Our approach differs in using efficient information theoretic measures to identify system errors. The result is that no pretraining or finetuning is used and only a handful of samples are required to run demonstrations.

Multivariate Mutual Information Our proposal for measuring multi-information is grounded in prior work assessing the quantity of information in sets of entropy sources expressed in statistical measurements Watanabe (1960). Seminal theoretical analysis on challenges in estimating mutual information on multiple variables includes Te Sun (1980) and Berrett et al. (2019). Mohammadi et al. (2018) propose a method to eliminate redundant mutual information in sets of variables ahead of learning. Cabeli et al. (2021) improve conditional mutual information estimation for mixed variable types designed for causal models Verny et al. (2017) by switching negative results to null values. Steeg (2017) extended the hierarchical decomposition in the work of Watanabe (1960) and earlier publications Ver Steeg and Galstyan (2016) for unsupervised representation learning in several domains. Our approach to measuring information content aligns closer to multi-information, extending the bivariate case outlined by Studený (1987) to n > 2 variables. Sources in our setup are visual variables extracted from scenes.

Zeroth-order Optimisation Zeroth-order optimisation (ZO) methods minimise an objective function without resorting to direct evaluation of the gradient. In this way, ZO dispenses with the computational costs of backpropagating through the graph. Methods that work directly with function values were an early area of investigation in minimising the cost in the form of a function (Conn et al., 2009). Malladi et al. (2023) modified the ZO-SGD process with a memory-efficient algorithm that finetunes LLMs with non-differentiable metrics. Efficient optimisation on function values motivated Hoffman et al. (2022) to use the ZO-AdaMM Chen et al. (2019) algorithm to enhance molecule design. In our use case of combining entropies, ZO fits our requirements for an optimisation method in limiting computational operations and replacing access to model parameters with non-differentiable scores. Our ZO algorithm ensures MI expresses the information in inputs by reducing the redundancies between the variables. To our knowledge, the application of ZO in multi-information estimation is a novel proposal in optimisation.

In this section, we specify a method for estimating multi-information with active regret minimisation that minimises redundancies in a set of entropy sources with a ZO algorithm (see Algorithm 1). We then outline the design of our controller and illustrate how it uses our MI measures to guide the camera in a 3D scene to optimal views.

Our aim in designing a method to manage regret is to ensure that the addition of variables always contributes to assessing whether the VLM is correct or not. The measure for quantifying these relationships is MI. In MI-ZO, this is implemented by ensuring the addition of information is performed in line with the feedback in the initial demonstrations and after each round of inputs (see Algorithm 1). The redundancies between a set of variables are reduced by forming these into a weighted mixture distribution and the subsequent novel application of a ZO algorithm.

Sources We describe the inputs constituting the sources of entropy. A viewpoint render is converted to a colour space and a series of intervals is constructed over n axes, where n = 2 for the CIELAB colour space and the single hue axis is selected for the HSV colour model (see our ablation on colour spaces in Subsection 4.3). These intervals are adjacent and enable multivariate estimation with mixed variable types Hall and Morton (1993). A Sobel operator is passed over grayscale object masks segmented from the image to estimate local edge density (LED). MI variant GH-LED is the product of global hue (GH) and local edge density histograms. GO-LED-OL is composed of global AB axes (GO) with an additional series at object level on LAB axes (OL). Scaling factors Λ are computed over counts of noun phrases and descriptor terms in the textual input.
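As a rough illustration of the two source types (our own sketch under stated assumptions, not the paper's implementation; the function names, toy image, and threshold are ours), local edge density can be computed with a Sobel pass and global hue with a binned histogram:

```python
# Minimal sketch: a Sobel-based local edge density (LED) and a global hue
# histogram (GH) over toy 2D grids. Pure Python; all names are illustrative.

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_edge_density(img, thresh=1.0):
    # Fraction of interior pixels whose Sobel gradient magnitude exceeds thresh.
    h, w = len(img), len(img[0])
    edges = 0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[i - 1 + a][j - 1 + b]
                     for a in range(3) for b in range(3))
            if (gx * gx + gy * gy) ** 0.5 > thresh:
                edges += 1
    return edges / ((h - 2) * (w - 2))

def hue_histogram(hues, bins=8):
    # Normalised histogram over hue values in [0, 1).
    counts = [0] * bins
    for row in hues:
        for v in row:
            counts[min(int(v * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Toy 6x6 grayscale image: left half dark, right half bright (one vertical edge).
img = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]
led = sobel_edge_density(img)       # 0.5: half the interior pixels sit on the edge band
gh = hue_histogram([[0.1] * 6] * 6)  # all mass in the first hue bin
```

Histograms of this kind serve as the entropy sources that the mixture distribution below combines.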

Weighted Mixture Distribution Our selection of a mixture distribution for Mixture(X) avoids integration when combining histograms for sources and measures of Λ linguistic inputs. Sources are combined at time t with values for each component weight θ_mix set by π summing to 1:

Mixture(X_t) = Σ_i θ_mix,i · p_i(x_t),  with Σ_i θ_mix,i = 1   (1)
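In code, combining normalised source histograms with weights that sum to 1 is a convex combination (a minimal sketch of this step; the function name is ours):

```python
def mixture(hists, weights):
    # Convex combination of normalised histograms; weights must sum to 1,
    # which keeps the result a valid probability vector without integration.
    assert abs(sum(weights) - 1.0) < 1e-9
    length = len(hists[0])
    return [sum(w * h[i] for w, h in zip(weights, hists)) for i in range(length)]

combined = mixture([[1.0, 0.0], [0.0, 1.0]], [0.3, 0.7])  # -> [0.3, 0.7]
```

Because each component is normalised and the weights form a simplex point, the mixture itself stays normalised, which is what lets the ZO step below treat the weights as the only free parameters.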

Optimising Feedback to Minimise Regret Assume the information of a constituent source is composed of units 𝔟_u and 𝔟_d in the set 𝔅 specified by the states Up and Down. Orientation is defined as the contribution to MI where Up is additive and Down is reductive in relation to the product of the MI calculation. A reduction means that MI is failing to express a proportion of the information capacity in the inputs. Consider the similarities and differences in the visual sources in Table 1. Our ZO algorithm is applied to eliminate the redundancies between these sources: in practical terms, to express only the differences that contribute to a correct assessment. At each step, signed changes to the mixing weights are initialised and only retained in the update if the running mean on MI increases. A regret minimisation process of this form is equivalent to an optimal mixture of unit states from each source. The goal is to identify a separating margin that minimises the loss on the distance of the closest units 𝔟_u and 𝔟_d:

min over w, b of (1/2)‖w‖²  subject to  y_j (wᵀ α_j + b) ≥ 1 for all units j   (2)

The objective in our gradient-free optimisation step is to reduce the delta on losses for the current set of sources in relation to the mean over all sets accumulated to i − 1 rounds. A finite difference approximates a set of parameters for each set of components φ in Mixture(X). We minimise the finite difference between the two loss terms as follows:

v_{t,j} = [ L(φ_t + ε u_j) − L(φ_t − ε u_j) ] / (2ε) · u_j   (3)

where v_t consists of multiple iterations r {v_{t,1}, . . . , v_{t,r}} averaged to get the single update:

v_t = (1/r) Σ_{j=1}^{r} v_{t,j}   (4)
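A gradient-free update of this kind can be sketched as follows (our own simplified reading: two-point finite differences averaged over r random directions, with a candidate retained only when the running mean of the MI score improves; the function and parameter names are ours, not the paper's):

```python
import random

def zo_minimise_regret(mi_fn, weights, steps=200, eps=0.05, lr=0.1, r=4):
    # mi_fn(weights) -> scalar MI estimate (noisy, non-differentiable).
    # Candidate weight updates are kept only if the running mean of MI improves.
    running_mean, count = mi_fn(weights), 1
    for _ in range(steps):
        # Average r two-point finite-difference estimates of the ascent direction.
        grad = [0.0] * len(weights)
        for _ in range(r):
            u = [random.gauss(0, 1) for _ in weights]
            up = [w + eps * ui for w, ui in zip(weights, u)]
            dn = [w - eps * ui for w, ui in zip(weights, u)]
            fd = (mi_fn(up) - mi_fn(dn)) / (2 * eps)
            grad = [g + fd * ui / r for g, ui in zip(grad, u)]
        cand = [w + lr * g for w, g in zip(weights, grad)]
        s = sum(abs(c) for c in cand) or 1.0
        cand = [abs(c) / s for c in cand]  # keep weights on the simplex
        val = mi_fn(cand)
        if val >= running_mean:            # retain only improving candidates
            count += 1
            running_mean += (val - running_mean) / count
            weights = cand
    return weights
```

By construction the returned weights never score worse than the starting point, which is the sketch-level analogue of keeping the reductive (Down) contribution from accumulating.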

To demonstrate the ability of our algorithm to minimise regret at a rate that is constant with the quantity of feedback available, we provide scores for MI metrics on the UC-3DS-MI diagnostic in Table 1 at three levels of feedback. Note that performance improves in line with the number of variables added and only variants with active regret minimisation benefit from feedback. Theoretical analysis for our method is provided next and detailed in Appendix G.

We begin our theoretical analysis by defining multi-information MI and specifying the condition observed in the numerical analysis where, precluding access to advance information on all outcomes, the respective densities for uniform and complex inputs tend toward equivalent means. If the definition of Shannon entropy is adhered to, there is no guarantee of returning a positive value in all cases Cover and Thomas (2006) (see Lemma 1 and the subsequent proof in Appendix G).

Our aim in the remainder of the analysis is to describe the conditions where a function g acts on an estimate of MI(x, y) such that the reductive contribution of 𝔟_d is bounded as ≤ g(w, α, γ), where w is the weight vector normal to a separating hyperplane, α is the vector representing observed information, and γ is the margin between additive 𝔟_u and reductive 𝔟_d units. The subsequent analysis proposes that if the hyperplane is selected and updated online using regret minimisation as defined in the previous subsection, then the negative contribution to the mutual information will remain bounded.

We proceed after stating Theorem 1 and the initial definitions to prove the following:

Theorem 2 (Function to maximise the margin). For any X, an optimisation process f(x) = wᵀx + b exists to solve the minimax form of deriving the margin using the hyperplane in ℝ^D that maximises the distance between 𝔟_u and 𝔟_d:

max over w, b of min over j of  y_j (wᵀ x_j + b) / ‖w‖

Our proof defines an optimisation function f(x) = wᵀx + b that characterises the separating hyperplane in ℝ^D. By converting the primal optimisation problem into a dual form Rockafellar (1974), we obtain an expression for the margin γ = 2/‖w‖. The analysis is then extended to the online setting where the solution ω_t is a projection to the nearest point in the convex set of variable values C, which is updated by information α_t and the gradient ∇ of the loss L scaled by step size η_t:

ω_{t+1} = Π_C ( ω_t − η_t ∇L(ω_t; α_t) )   (5)

Regret is dependent on the interaction of the margin and the inner product ⟨α, w⟩. Analysis of this relation confirms that the selected hyperplane determined by our function maximises the separation between unit types and guarantees that the negative impact of 𝔟_d is bounded.
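The online projected update described above can be sketched directly (a generic projected-gradient step onto a simple box set; this is our illustration, and the paper's convex set C need not be a box):

```python
def project_box(w, lo=0.0, hi=1.0):
    # Euclidean projection onto the box C = [lo, hi]^d: clip each coordinate.
    return [min(max(x, lo), hi) for x in w]

def online_step(w, grad, eta):
    # omega_{t+1} = Pi_C(omega_t - eta_t * gradient of the loss)
    return project_box([wi - eta * g for wi, g in zip(w, grad)])

w_next = online_step([0.2, 0.95], [-1.0, -1.0], 0.1)  # -> approximately [0.3, 1.0]
```

For a box set the projection reduces to coordinate-wise clipping, which keeps the per-step cost linear in the dimension, in line with the paper's emphasis on compute-efficient updates.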

Our controller consists of a chain of functions to predict camera actions 𝔞 for n > 1 conversation rounds. We combine sample-efficient data filters to exploit the information capacity in MI measurements estimated by the MI-ZO algorithm (Algorithm 1).

A Central Unit predicts errors and confidence scores on axis-level traces and scores from two Component Models. View-level data updates a low-dimensional representation of the 3D space in the form of an interaction matrix updated with a strong product at each step. A full specification of the controller is presented in Appendix F.
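One way to read the strong-product update (our interpretation; the paper's exact matrix construction is specified in Appendix F) is via the adjacency identity for the graph strong product, A1 ⊠ A2 = A1⊗A2 + A1⊗I + I⊗A2:

```python
def kron(A, B):
    # Kronecker product of two matrices given as lists of lists.
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(len(A[0]) * q)]
            for i in range(len(A) * p)]

def eye(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def strong_product(A1, A2):
    # Adjacency matrix of the strong product: A1(x)A2 + A1(x)I + I(x)A2.
    n1, n2 = len(A1), len(A2)
    K1, K2, K3 = kron(A1, A2), kron(A1, eye(n2)), kron(eye(n1), A2)
    return [[K1[i][j] + K2[i][j] + K3[i][j] for j in range(n1 * n2)]
            for i in range(n1 * n2)]
```

Under this reading, the strong product of the edge graph K2 with itself yields the complete graph K4: every composite state is reachable from every other, which is the kind of dense low-dimensional interaction structure a per-step update can maintain cheaply.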

We implement a controller framework for empirical testing of multi-information and optimal control methods with open-source VLMs. Evaluation on closed systems is precluded to limit the likelihood of data contamination invalidating subsequent benchmarking Xu et al. (2024). Our three benchmarks are designed to identify methods that predict the optimal sequence of viewpoints for appraising 3D scenes. Experiments in these benchmarks present a video-to-text system with a render of a viewpoint of the 3D scene and a description. Input pairs form turns in a conversation where state is accumulated to present the VLM with a complete view of the 3D scene. At the end of the sequence, the VLM assesses the state in relation to the scene and returns a boolean label {True, False}.

An initial measurement round uses a camera initialised at the nearest of four distances to the scene and rotated to cardinal viewpoints by a default sequence of actions. In the correction round, the controller predicts camera actions based on data from the prior round and data for demonstrations on n scenes run prior to the start of the evaluation (see Appendix C). Demonstrations provide the controller with feedback in the form of the prediction errors made by the VLM. Baselines: Comparison methods include a standard control algorithm Ljung (1979), a linear layer optimised with stochastic gradient descent (SGD), and a Radial Basis Function (RBF) network Orr et al. (1996). Camera: A single camera is added to the 3D scene at (x, y, z) coordinates (0, 0.5, -40) and pointed at the origin in all viewpoints. The initial viewpoint is front and a single rotation about the x-axis places the camera at ±90° to the left or right of its current position. Rotation about the y-axis is ±45° and constrained to the front or back as starting positions. Zoom operations are rotations on the z-axis performed in increments of ±5 in the range [-10, -25] forward or backward to the origin. Metrics: We use balanced error rate (BER) to assess performance on multi-object scenes for scientific analysis in our first benchmark Zhao et al. (2020). Ablations over multiple colour spaces suggest there is a ceiling to gains from adding new sources (see Table 3). Eye-level comparisons indicate that the increase in information extracted for a single sample peaks as the differences between inputs reduce.
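Balanced error rate averages the per-class error rates, which keeps scores meaningful when True/False labels are imbalanced; a minimal reference implementation (ours, for boolean labels):

```python
def balanced_error_rate(y_true, y_pred):
    # BER = (false negative rate + false positive rate) / 2 for boolean labels.
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if not t]
    fnr = sum(1 for t, p in pos if not p) / len(pos)
    fpr = sum(1 for t, p in neg if p) / len(neg)
    return (fnr + fpr) / 2

balanced_error_rate([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 0, 1])  # -> 0.5
```

A classifier that answers True on every sample scores BER = 0.5 regardless of the class balance, so the metric cannot be gamed by exploiting a skewed label distribution.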

Benchmark: In our first evaluation on virtual environments, we introduce the FeatureID-3DS benchmark to test feature identification over multiple objects. A VLM provides a boolean response on world features (e.g. “ladder”, “doorway”) on presentation of a language description. A scene summary containing the features is presented on the final turn. Applications with low latency between feedback and generation rely on efficient estimation Hofmann et al. (2023). In order to assess control methods on prioritising high-information viewpoints, the camera action count is restricted to 5 in the correction round. Counts of camera actions are the main factor in efficiency, as each action entails a round of inputs and responses from the VLM. This is demonstrated by wall-clock time results in Appendix D indicating that time increases at a rate that is linear in the number of actions. Results: Our GO-LED-OL_ar metric assists the Poly+ZO+MI controller to prioritise high-information viewpoints.

[Table: accuracy for Chat-UniVi-13B at 5 and 8 camera actions (Acc@5, Acc@8) with Δ on Round 1.]

Benchmark: Our second assessment on reasoning over virtual environments focuses on offsetting object occlusions. Scenes in the PartialView-3DS benchmark consist of two objects located on opposite sides of a partition to test the adaptation of control methods when there is full or partial occlusion of one of the objects at every viewpoint. A matching description is selected from five descriptions presented on each turn. At the end of the round, the system uses the information collected to select the correct summary of the scene. Camera action count in both rounds is set to 8. Results: Our light and derivative-free Poly+ZO+MI controller using multi-information with active regret minimisation (see variants with ar in Table 5) adapts camera actions to return views that improve VLM performance and outperforms other standard control algorithms, an RBF network, and a linear layer optimised with SGD. Variance over runs for all methods is minor (see additional results from the experiments included in Appendix D).

In this paper, we propose a novel multi-information estimation method and efficient derivative-free control to predict camera actions that provide optimal sequences of views on 3D scenes. This method improves the performance of VLMs trained on 2D visual data when performing cross-modal tasks on multi-object 3D scenes. Numerical and theoretical analysis provides the basis for obtaining the first application of multivariate mutual information estimation to enhance the performance of VLM systems on empirical 3D tasks. As part of this research, we design and implement a framework for evaluating control and information theoretic measures, design a set of scenes to illustrate the impact of minimising regret when selecting inputs for calculating multi-information metrics, and present three novel cross-modal benchmarks on 3D multi-object scenes.

Prior work (2023) retrieves optimal viewpoints of 3D objects to improve a 2D CLIP model (see Figure 4). We propose benchmarks that focus on reducing errors when reasoning over visual properties, understanding object features with limited views, and handling view occlusions to improve reasoning by VLMs trained on 2D visual inputs on cross-modal tasks with 3D multi-object scenes. Scenes in all our benchmarks contain multiple objects from a single class.

Our GeoProperties-3DS benchmark consists of sets of scenes that we extract from 3D mesh models of rocks, regolith, and other geological features (see Figure 5) generated from observations performed as part of NASA Mars 2020 missions Bell III et al. (2022). Each scene contains a small set of objects. A total of 6 collections are used in our experiments with each group consisting of 5 individual scenes. We prepare textual descriptions of the objects in scenes describing surface features visible from a subset of viewpoints. The description matches a single member of the 5 scenes in each collection. Our benchmark is designed to assess methods that reduce the likelihood of false positives by 2D VLMs. We extract scenes from 3D models of the Raton target (sol 130) and the area around the Rochette rock (sol 196).

Scenes in FeatureID-3DS are composed from a set of models drawn from ShapeNetCore topLevelSynsetId 04460130 (2015) and a floor mesh in RGBA [0.9, 0.9, 0.7, 1.0]. The dataset comprises 60 glb files with language descriptions. Human-generated descriptions of objects are formed into a sentence for each viewpoint with a template. The true description is one of 5 samples presented as a list. To ensure matches from the true sample, matching modifiers in negative samples are replaced. Each text file is composed of descriptions from 6 viewpoints and a single summary string for the 60 scenes in the dataset. A feature is identified in the description for one of the viewpoints where visibility is confirmed.

“There is a gray building and a white and orange building.”

“The gray building has a doorway.”

PartialView-3DS scenes are generated from ShapeNetCore topLevelSynsetId [03001627, 04379243] (category: Chair, Table). Objects are separated by a partition mesh intersecting a floor mesh and limiting visibility to a single item from four of six cardinal viewpoints. A three-stage process to synthesize descriptions for scenes starts with human-generated natural language descriptions of constituent objects from Han et al. (2020). Modifiers are verified by hand and corrected to match the objects. Sentences are generated for viewpoints by adding an indefinite article and period. Summary descriptions of the scene consist of object-level texts conjoined into a single sentence. The PartialView-3DS benchmark consists of a total of 60 glb files and paired description files. Language descriptions refer to the 3D scene from one of 6 viewpoints.
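The sentence templating step can be sketched as follows (our guess at the recipe described above: prepend an indefinite article, terminate with a period, and conjoin object-level texts for the summary; the function names and join rule are ours):

```python
def viewpoint_sentence(desc):
    # Prepend an indefinite article and terminate with a period.
    article = "An" if desc[0].lower() in "aeiou" else "A"
    return f"{article} {desc.rstrip('.')}."

def scene_summary(descs):
    # Object-level texts conjoined into a single summary sentence.
    parts = [viewpoint_sentence(d)[:-1] for d in descs]
    tail = [p[0].lower() + p[1:] for p in parts[1:]]
    return " and ".join([parts[0]] + tail) + "."

viewpoint_sentence("gray chair")               # -> "A gray chair."
scene_summary(["gray chair", "orange table"])  # -> "A gray chair and an orange table."
```

Keeping generation template-based makes the negative samples controllable: swapping a single modifier yields a distractor that differs from the true description in exactly one attribute.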


Multi-information estimation methods where ๐‘› > 2 variables represent cross-modal inputs are evaluated with our UC-3DS-MI dataset (see Figure 8). Data design is balanced with 24 uniform scenes and an equal number of complex scenes. Each scene contains two polygon objects initialised in one of 6 position groups defined in relation to the centroid of a floor mesh. Abstract polygons are selected to limit the impact of class bias in the training data of VLM systems. Language descriptions and scenes are split between concentration on colour or geometry. Complexity relates to matches or differences between objects on these criteria. VLM systems provide predictions on the correctness of descriptions in relation to 6 viewpoints for each 3D scene.

Our method selects optimal viewpoints in a sequence with a controller guided by information-theoretic measures. The basis for our method is the principle that a VLM's errors in assessments will change given a visual input captured from an optimal viewpoint in relation to the textual input. In selected instances, assessment errors are returned by VLMs on all viewpoints presented to the VLM (see Figure 9). A resulting hard limit exists on improvements in the accuracy of methods that run only in inference.

Figure 9: We perform a qualitative assessment and find instances of incorrect predictions returned by the VLMs used in the assessments on optimal viewpoints and simple descriptions. In the samples presented here, the VLM returns a label of False for the view on the left. As a result, a hard limit on improvements from methods that are limited to inference-time changes exists.

In this section, we provide core specifications and computing requirements to perform the experiments in the main paper. A run consists of R = 2 rounds for the reported results, with a sequence of actions estimated during the correction round following a measurement round and n demonstrations. The control operations performed by all evaluated methods are: updating demonstration data with system decisions in the measurement round for (X, Y) viewpoints and z-axis levels; updating coefficients measured on the set of decisions with corresponding viewpoint labels; and predicting the set of camera actions for the correction round. Methods receive prediction errors for viewpoints where an error was marked in demonstrations and during the measurement round.
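The two-round protocol above can be sketched as a plain control loop; `vlm_predict` and `fit_controller` are hypothetical stand-ins for the VLM query and the coefficient update, not the paper's implementation:

```python
def run_control(scenes, vlm_predict, fit_controller, n_demos=2):
    """Two-round protocol sketch: demonstrations plus a measurement round
    collect VLM decisions with viewpoint labels; a fitted controller then
    predicts camera actions for the correction round."""
    # Demonstrations: (scene, system decision, true viewpoint label).
    demos = [(s, vlm_predict(s), s["label"]) for s in scenes[:n_demos]]
    # Measurement round: update demonstration data with system decisions.
    for scene in scenes[n_demos:]:
        demos.append((scene, vlm_predict(scene), scene["label"]))
    # Update coefficients on the set of decisions with viewpoint labels.
    controller = fit_controller(demos)
    # Correction round: predict the set of camera actions per scene.
    return [controller(s) for s in scenes]
```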

To assess the results in Tables 2 and 3, video is created with OpenCV 4.9.0.80.

Infrastructure A single NVIDIA A100 80GB GPU is used for all runs reported in the experiments section to support VLM systems at inference time. Our controller adds no GPU processing or VRAM requirements additional to the hardware allocation for running the VLM. In-scene camera operations are performed with a single NVIDIA GeForce RTX 2080 Super 8GB GPU.

We present fine-grained detail on the efficacy of our multivariate metrics with active regret minimisation in distributing the scores for correct and incorrect responses in separate regions in Figures 10 and 11. These are compared to variants with no ar on scenes in our diagnostic defined as uniform and complex.

In Figures 12 and 13, score distributions for the two variants of our metrics with active regret minimisation (ar) are analysed in relation to the median scores. This assessment is also provided for variants with no ar and for univariate metrics.

We provide a visualisation in Figure 14 of how univariate measures, computed with the constituent variables in our multivariate metrics, perform as indicators of the complexity of inputs in relation to the decisions output by the VLM. The distributions indicate that any variable taken on its own provides scores that conflate samples resulting in a correct or an incorrect decision by the system.

Figure 10: The MI-ar metric groups correct and incorrect scores in distinct regions of the distribution, for both uniform objects and more complex scenes.

Figure 11: Active minimisation of regret distributes correct and incorrect scores in identifiable concentrations for a second multi-information metric.

Figure 12: GO-LED-OL scores for correct and incorrect responses around the median. The LED score is shown in Figure 13. The median is preferred over the mean for low skew in small-sample tests. Kurtosis κ indicates moderate to low tail concentration for multi-information with active regret.

Figure 13: GH-LED scores for correct and incorrect responses distributed around the median.

We assess MI-ar metrics with the two neural methods in our evaluations on the GeoProperties-3DS benchmark (see Table 7). Limited improvements from these methods are a reasonable expectation in the low-data regimes of the tasks that our research focuses on. Neural methods with standard first-order optimisation such as Stochastic Gradient Descent (SGD) are better suited to large-scale data scenarios (Anthony and Bartlett, 2009). In contrast, our approach is effective for scenarios suited to online optimisation with few samples available for training.


We report standard deviations σ for the two benchmark experiments by system and method in Tables 8 and 9.

Positioning of a camera in a 3D scene results in minimal computational overhead. The leading factor in assessing the efficiency of methods is the number of camera actions, as each action incurs a constant increment in the overall time required to process inputs (see Figure 17). We observe that the computational efficiency of all methods evaluated is a secondary concern. To provide a complete picture, mean operations per point of accuracy for each control method are reported in Figure 18. Our methods require fewer CPU operations to improve the accuracy of a Video-LLaMA-13B system on the FeatureID-3DS benchmark, the main benchmark that we propose to assess time efficiency.

Figure 16: Variance over runs by method (Chat-UniVi-13B) for feature identification on a budget of camera actions.

Figure 17: Wall-clock time in relation to the number of actions for a Video-LLaMA-13B system in the FeatureID-3DS benchmark is a linear progression. A single camera action equals one conversation turn with the result that action count is the primary factor in time to process inputs.

Results: Our GO-LED-OL-ar metric assists the Poly+ZO+MI controller to prioritise viewpoints that reduce VLM errors with fewer actions (see scores for Acc@5 in Table 4). Analysis of viewpoint replacements provided in Appendix D indicates that views displaying features with strong visual prominence are prioritised. Results for the above show only minor variations between runs and are presented in Appendix D.

We evaluate our best-performing method and system pair for FeatureID-3DS on input pairs with modified descriptions to assess the influence of phrasing. In the first test, descriptions are replaced by n-grams containing random selections of alphabetic characters. These samples are generated using the method proposed by Chu et al. (2022). Performance collapses with linguistic inputs that are unrelated to scenes. We also perform an assessment with descriptions consisting of an additional sentence. Declines in performance are attributable to the VLMs being trained on shorter textual inputs (see Table 10).

|  | rnd n-grams: Acc@8 | Δ on best | 2 sentences: Acc@8 | Δ on best |
|---|---|---|---|---|
| Ours + GO-LED-OL_ar | 15.2 | -29.3 | 35.3 | -9.2 |
| Ours + GH-LED_ar | 15.3 | -38.0 | 43.9 | -9.4 |

Table 10: Performance on FeatureID-3DS for the best variant (i.e. Video-LLaMA-13B with our controller and MI-ar metrics) using n-grams with alphabetic characters selected at random (rnd n-grams) and longer descriptions of 2 sentences. The delta is the drop in performance compared with the default in the paper of 1-sentence descriptions.

An analysis of the ability of VLMs to discriminate 3D scenes based on the object counts in view is performed with scenes from our diagnostic dataset and additional samples. We split UC-3DS-MI into a subset with two objects and scenes with a single object. A third subset is generated with three objects to complete the data required to run the test. Weak performances for all methods on scenes with three objects are due to the inability of the open-source VLMs to reason over scenes with a high number of objects.

Sensitivity in relation to combined visual and linguistic outputs is dependent on the combination of mixture components in computing MI. The advantages of negating overlaps between variables are displayed in the distances between scores for MI-ar. This relation between minimising regret and sensitivity in MI on limited samples motivates a diagnostic on statistical complexity in model optimisation. We study changes in the parameters of a logistic regression model estimated in a framework for Gibbs sampling. Our metric is the difference between maximum and minimum posterior concentrations, and starts with samples from the conditional distribution p(X, y) = p(X | y) · p(y):

where

Model selection is determined by the low model bias and the theoretical guarantees provided by logistic regression when the target for real-valued measures is {0, 1} (Efron, 1975). Our diagnostic examines variance in the posterior β of the model while a logistic regression is applied to cumulative counts of paired MI estimates in increments of 6 samples. PC dispersion is a point difference over the inverse of the standard deviation σ of β

where ๐‘(๐›ฝ|๐‘‹, ๐‘ฆ) โˆ ๐‘(๐‘ฆ|๐‘‹, ๐›ฝ) ๐‘(๐›ฝ). Our end measure quantifies absolute dispersion of variance in the model as the number of input samples increases. Posterior concentration is proportional to model stability supplied by the metric over the range of sample sizes from the increment when at least one of each of {0, 1} labels is recorded. A low value on the change in posterior concentration for ๐‘€ ๐ผ-๐‘Ž๐‘Ÿ methods in Table 12 indicates stable updates in the beta parameter of the model over a run. Sensitivity for ๐‘€ ๐ผ-๐‘Ž๐‘Ÿ variants to information (see Table 1 in the main paper) supports application in scenarios with limited demonstrations and online feedback.

Plots to illustrate the differences in posterior concentration for variants of our metrics are provided in Figure 19.

An additional diagnostic quantifies the stability of a model fitted using MI variants with and without active regret minimisation (see Figure 20). Given the low availability of data for 3D scenes paired with language descriptions, a desirable property is the ability to fit a model with a minimum number of samples from dataset D that predicts the boolean target of system responses. We design the analysis as a direct comparison to demonstrate the contribution of MI methods in training a model to identify viewpoints commensurate with the quantity of information content presented by a scene.

We provide the full specification for our controller to predict camera actions. Assuming a prior process where a VLM returns predictions on a sample scene τ and language description l from dataset D, a set of functions models these predictions and MI-ZO to output a sequence of camera actions. The actions position the camera to return a set of viewpoints defined by the (X, Y) and z axes. Functions are described by subprocess and in a figure with detailed specifications on individual operations at each stage.

Interval Estimation To enable working directly with continuous MI scores, a function converts values to intervals i, with the selection of interval size based on entropy maximisation on proposals for cut points generated with Halton sequences Halton (2005). We prefer this estimation method to a random number generator to reduce variation. Variance is further reduced by computing a mean over cut point proposals returning maximum entropy. The result is an automatic process that requires no set interval widths in experiments.
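The interval-estimation step can be sketched as follows, assuming a one-dimensional van der Corput/Halton sequence for cut-point proposals; the names and defaults are illustrative, not the paper's:

```python
import math

def van_der_corput(n, base):
    """n-th van der Corput value in the given base (a Halton coordinate)."""
    q, denom = 0.0, 1.0
    while n:
        denom *= base
        n, rem = divmod(n, base)
        q += rem / denom
    return q

def entropy_of_partition(scores, cuts):
    """Shannon entropy (nats) of the histogram induced by the cut points."""
    edges = sorted(cuts)
    counts = [0] * (len(edges) + 1)
    for s in scores:
        counts[sum(s > e for e in edges)] += 1
    n = len(scores)
    return -sum(c / n * math.log(c / n) for c in counts if c)

def halton_intervals(scores, n_cuts=3, n_proposals=50):
    """Propose cut points from a Halton sequence scaled to the score range
    and keep the mean of the maximum-entropy proposals."""
    lo, hi = min(scores), max(scores)
    best_h, best = -1.0, []
    for p in range(n_proposals):
        cuts = sorted(lo + (hi - lo) * van_der_corput(p * n_cuts + j + 1, 2)
                      for j in range(n_cuts))
        h = entropy_of_partition(scores, cuts)
        if h > best_h + 1e-12:
            best_h, best = h, [cuts]           # new maximum-entropy proposal
        elif h > best_h - 1e-12:
            best.append(cuts)                  # tie: keep for averaging
    # Variance reduction: mean over proposals that tie on maximum entropy.
    return [sum(c[j] for c in best) / len(best) for j in range(n_cuts)]
```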

Component Models Two component models CM filter data in the responses with the proxy labels Ŷ generated during the first step in Algorithm 2. During the correction round, proxy labels replace the boolean values on the match of a viewpoint with its description that are provided as actual labels during the measurement round. To limit processing time, modelling in both models is performed with iterative least squares.

Component Model 1 increments the probability of prediction error P[x ≠ x′] by the system for a viewpoint V_p where an error was marked in the prior round. Coefficients are measured for the set of decisions Dec with corresponding viewpoint labels, and the demonstration data are updated. Test error rates and score-based measures on scene attributes are processed by Component Model 2, which ranks z-axis levels Dim_z for each viewpoint. Traces tr of the covariance matrices in each component model are retained as indicators of model fit.
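As a sketch of the least-squares fit and the retained covariance trace, here using a closed-form one-feature regression rather than the iterative scheme; names are ours:

```python
def fit_ols_trace(X, y):
    """Ordinary least squares for one feature with intercept, returning the
    trace of the parameter covariance matrix as a model-fit indicator
    (the role the covariance traces play in Component Models 1 and 2)."""
    n = len(X)
    mx, my = sum(X) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    sxy = sum((x - mx) * (v - my) for x, v in zip(X, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [v - (intercept + slope * x) for x, v in zip(X, y)]
    sigma2 = sum(r * r for r in resid) / max(n - 2, 1)  # residual variance
    # tr(Cov) for [intercept, slope] of simple linear regression.
    tr = sigma2 * (1.0 / n + mx ** 2 / sxx) + sigma2 / sxx
    return slope, intercept, tr
```

A trace near zero indicates a tight fit, so its normalised inverse can serve as the acceptance factor described for the Central Unit below.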

Central Unit Outputs are passed to the Central Unit of the controller. Acceptance of VLM feedback and decisions on the z-axis level by viewpoint are modelled using the covariance traces, normalised and converted to an inverse factor. View-level data are passed to update element values over a low-dimensional representation of the scene in the form of an interaction matrix.

Interaction Matrix An interaction matrix A is a graph XYZ and a superset over the scene composed of an element drawn from set XY (element xy) with set Z (element z). Elements in each case are unique instances by position. Set XY is a cyclic graph consisting of adjacent nodes, with a direct edge to any element in the fully connected graph of set Z. Pendant nodes in the factor graph XY define the adjacency matrix of graph XYZ. A graph product will result in the targeted structure for the graph, and a strong product ⊠ provides a specific advantage in preserving the connectivity of the factors in the edges of any vertex set.
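A small sketch of the interaction matrix as the strong product of a cycle over (X, Y) positions and a complete graph over z-levels; the sizes are illustrative:

```python
def cycle_edges(n):
    """Directed edge set (both directions) of an n-cycle."""
    return ({(i, (i + 1) % n) for i in range(n)}
            | {((i + 1) % n, i) for i in range(n)})

def complete_edges(n):
    """Edge set of the complete graph on n vertices."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def strong_product_adjacency(n_xy, n_z):
    """Adjacency matrix of the strong product of a cycle over XY positions
    and a complete graph over z-levels: vertices (xy, z) are adjacent when
    each coordinate is equal or adjacent in its factor (and not both equal)."""
    cyc, comp = cycle_edges(n_xy), complete_edges(n_z)
    verts = [(a, b) for a in range(n_xy) for b in range(n_z)]
    idx = {v: i for i, v in enumerate(verts)}
    A = [[0] * len(verts) for _ in verts]
    for (a1, b1) in verts:
        for (a2, b2) in verts:
            if (a1, b1) == (a2, b2):
                continue
            xy_ok = a1 == a2 or (a1, a2) in cyc
            z_ok = b1 == b2 or (b1, b2) in comp
            if xy_ok and z_ok:
                A[idx[(a1, b1)]][idx[(a2, b2)]] = 1
    return A
```

In the strong product each vertex of C4 ⊠ K2 has degree (2 + 1)(1 + 1) − 1 = 5, which illustrates how the product preserves the connectivity of both factors.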

We present details of the theoretical analysis and proofs for the MI-ZO algorithm. The section begins by proving nonpositivity when estimating multi-information, presents the reformulation of the problem in our approach using a duality, moves the analysis into the online setting, and concludes with a proof of Theorem 1, which states that a function exists that places an upper bound on elements in component variables that reduce the expressivity of multivariate combinations:

Theorem 1 (Function with upper bound on nonpositive contribution). There exists a function that combines a set of n > 2 entropies H(Dim_n) with an upper bound on the nonpositive contribution of reductive units to an output estimate MI that is constant with a bound on the inner product of the vector of total units and the vector of observed information.

In this section, we define the conditions and detail a proof for when multi-information is nonpositive. We default to multi-information as a term for multiple mutual information terms Studený (1987). We use a semicolon to indicate a function that is not symmetric. In all instances, joint entropy H(x, y) < ∞.

We begin with the interpretation of mutual information as differences in Shannon entropies McAllester and Stratos (2020):

Definition 1. Mutual information for a variable x, M(x : x), is equal to the entropy of x, H(x). For discrete random variables (x, y), mutual information M(x, y) is the sum of entropy x and entropy y minus the joint entropy of (x, y):

M(x, y) = H(x) + H(y) − H(x, y)
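The identity in Definition 1 can be checked on empirical samples with plug-in entropies:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy in bits of an empirical discrete distribution."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """M(x, y) = H(x) + H(y) - H(x, y) for paired discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
```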

Definition 2. For a pair of variables in the set X, the mutual information of any member H(x_i) is in the distribution over x_i in relation to y. To limit redundancy between members of X, multi-information MI between multiple input variables and the target y is a combination of single and joint entropies Te Sun (1980):

Definition 3. Additional joint entropies over all pairs of variables are included for ๐‘‹ with more than 2 members. For any ๐‘›, multi-information follows the chain rule and is summarised as

Lemma 1 (Nonpositivity). Let each member of X be a set A_n; then the intersection of A is a positive or negative value when the number of members of X is greater than 1.

Proof of Lemma 1. First consider that when ๐‘‹ contains exactly two elements, then the intersection by the inclusion-exclusion principle for ๐‘›=3 is:

Then if the condition is met:

the case in Equation 13 extends by the submodularity of entropy Krause and Guestrin (2005) to:
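The possibility of a negative value for n = 3 can be demonstrated numerically with the inclusion-exclusion combination of entropies on an XOR triple, which is pairwise independent but jointly dependent:

```python
import math
from collections import Counter
from itertools import combinations

def H(samples):
    """Plug-in Shannon entropy in bits; samples may be tuples."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def co_information(cols):
    """Inclusion-exclusion sum of entropies over all non-empty subsets of
    variables; for two variables this is I(x; y), for three it is the
    co-information I(x; y; z), which can be negative."""
    total = 0.0
    for k in range(1, len(cols) + 1):
        for subset in combinations(range(len(cols)), k):
            joint = list(zip(*(cols[i] for i in subset)))
            total += (-1) ** (k + 1) * H(joint)
    return total

# XOR triple: any pair is independent, but z is determined by (x, y).
xs = [0, 0, 1, 1]
ys = [0, 1, 0, 1]
zs = [x ^ y for x, y in zip(xs, ys)]
```

Here the pairwise terms vanish while the three-way value is −1 bit, the nonpositive case covered by Lemma 1.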

In this section, we state assumptions and introduce terms for separable regions in a space over the variables with finite dimensions. Our problem of identifying information by unit type when ๐‘› > 2 is converted to a duality. Definitions and a lower bound are provided for a function that separates constituent units of inputs.

Assumption 1. For the set of elements in X in a multivariate function that results in a single output, note that for H(X_1), H(X_2), . . . , H(X_n), each input is conditionally independent of the other inputs given certain subsets of the variables Te Sun (1980). We assume that each member consists of multiple units 𝔅 ⊆ R^D.

Definition 4. The state of any unit 𝔟 ∈ 𝔅 is additive or reductive in relation to the product of the MI calculation. To specify these states, we use the term 𝔲 to denote Up and 𝔡 to denote Down.

Definition 5. A real vector space containing all 𝔅 contains affine subspaces of dimension n − 1 in R^n, creating regions defined by the inequalities

where w is a weight parameter of the hyperplane

and d is the scaling of the vector v that is orthogonal to the hyperplane, applied as

Remark 1. Note that γ can also be expressed in terms of the weight vector w, which is normal to the hyperplane:

where w^T x + b = 0 defines the separating hyperplane, and w^T is the transpose of w.

Theorem 2 (Function to maximise the margin). For any X, an optimisation process f(x) = w^T x + b exists to solve the minimax form of deriving the margin using the hyperplane in R^D that maximises the distance between 𝔟_𝔲 and 𝔟_𝔡:

Proof of Theorem 2. We will convert the primal problem into a dual and derive a lower bound on the latter following classic Lagrangian duality Rockafellar (1974). First, applying a multiplier λ for the constraint ‖w‖² ≤ 1, a sum of Lagrange multiplications for each group, Σ_{i=1}^M u_i y_i and Σ_{i=1}^M v_i y_i, can be derived as follows:

where w and bias are the weight and bias parameters of the hyperplane, M is the number of elements in the group, u_i is the multiplier for the constraints w^T x_i − bias ≥ t, and v_i is the multiplier for the constraints w^T x_i − bias ≤ −t.

We have formulated the primal problem with weak duality and can now progress to the second stage: deriving from this new form a lower bound by finding the supremum sup_{λ ∈ R^m} q(λ),

noting that ๐‘„ is the matrix of the quadratic form of the problem and ๐‘ž( ฮป) = -1 โ€ข ฮป = ๐‘› โ€ข eigmin(๐‘„) is the lower bound derived from ๐‘„.

We start by defining a formulation of the separating function proposed in the last section in an online learning setting, in the form Out_t = Out_{t−1} − η_t L(Out_{t−1}), where some outcome Out is iterated using step size η. We specify optimisation on real values and prove an algorithm with active regret minimisation on information.
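A sketch of this iteration with the update projected onto a convex set C (here an interval, matching the projection in Definition 9 below); the signed `loss` callable stands in for the unspecified L:

```python
def project(x, lo, hi):
    """Euclidean projection onto the interval [lo, hi] (a 1-D convex set C)."""
    return max(lo, min(hi, x))

def online_update(loss, x0, lo, hi, rounds=50, eta0=0.5):
    """Projected online iteration Out_t = Proj_C(Out_{t-1} - eta_t * L(Out_{t-1}))
    with a decaying step size eta_t = eta0 / sqrt(t)."""
    x = x0
    for t in range(1, rounds + 1):
        x = project(x - (eta0 / t ** 0.5) * loss(x), lo, hi)
    return x
```

With a signed error such as loss(x) = x − target, the iterate contracts toward the target while remaining inside C.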

Definition 8. We underline the difference between full and partial information settings, characterised by a specified number τ of rounds where the function has access to observed information, denoted α:

Definition 9. A computational solution ω is updated by an online process where a point is mapped to the next nearest point in the convex set C of potential variable values, where π* is the optimal policy in the set Π. Then for every t to τ,

Remark 2. We have the equivalence between regret and the distance between 𝔟_𝔲 and 𝔟_𝔡, such that an indicator function I applies (Equation 29).

Proof of Lemma 2. The proof of Lemma 2 builds on the definition of redundancy in universal compression and a subsequent proof given in (Cover and Thomas, 2006, Ch. 13). We present a worked example of the above with a set of inputs X = x_1, x_2, . . . , x_n following an unknown distribution p_θ. For every p in P, there is an associated prior distribution h ∈ {h_1, h_2, . . . , h_m} over parameters Θ. The Redundancy(p_θ, q) of a code is estimated using the Kullback-Leibler divergence

We assume that ๐‘‹ = {1, 2, 3} and ๐œƒ = {0, 1} with a probability associated when observing ๐‘‹ = 2 in both distributions ๐‘ 1 and ๐‘ 2 denoted as โ…. If ๐‘ 1 = (1 -โ…, โ…, 0), then the KL divergence between ๐‘ž and ๐‘ 1 is

and for ๐‘ 2 where ๐‘ 2 = (0, โ…, 1 -โ…):

The optimal ๐‘ž for the problem is the candidate with minimum KL for both ๐‘ 1 and ๐‘ 2 is

We now turn to the information capacity of a channel, Cap. In the minimax case, we have

Given the inferred distribution

the M for ๐œƒ = 1 and ๐œƒ = 2 is

and

Then finding the optimal q*(θ) yields the maximum Cap. Since h(θ) = {0.5, 0.5} maximises the mutual information, the optimal distribution is also h*(θ) = {0.5, 0.5}:

Remark 3. We follow Krichevsky (1968) in drawing an equivalence between redundancy and regret in the context of information theory. In the online setting, the dual form in Equation 28 is extended to express the relation between minimising regret and increasing the separation between 𝔟_𝔲 and 𝔟_𝔡 in the presence of observed information α.
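The worked redundancy example above can be reproduced numerically; ε = 0.1 is an illustrative value:

```python
import math

def kl(p, q):
    """KL divergence in bits, skipping zero-probability terms in p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

eps = 0.1
p_1 = (1 - eps, eps, 0.0)
p_2 = (0.0, eps, 1 - eps)

# The mixture q under the uniform prior h = {0.5, 0.5} equalises the
# divergence to both sources, the minimax-redundancy candidate.
q = tuple((a + b) / 2 for a, b in zip(p_1, p_2))
r1, r2 = kl(p_1, q), kl(p_2, q)

def capacity(h):
    """Cap = I(theta; X) for prior h over the two sources."""
    mix = tuple(h[0] * a + h[1] * b for a, b in zip(p_1, p_2))
    return h[0] * kl(p_1, mix) + h[1] * kl(p_2, mix)
```

By the symmetry of p_1 and p_2, the equalised redundancy and the capacity under h* = {0.5, 0.5} coincide, as the example states.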

Then in the setting where α is not present, the information capacity of the channel is dependent on the function f_mon over the processes of minimising min Rk(ω) and maximising distance max d being monotonic:

Reordering the above in terms of the solution in Equation 26, f_mon holds iff a policy π* penalises the squared deviation such that the approximation of w + bias after the period τ is within Δ of the estimated μ in relation to the empirical mean μ_emp when α is present:

The active regret (ar) measure GO-LED-OL-ar is assessed against metrics with fewer inputs and a variant calculated with no ar. Scores for the feedback setting demonstrate the impact when metrics with ar are provided continuous feedback on y correctness labels. Decomposition over variables indicates that more inputs contribute to the sensitivity of multivariate metrics with ar and demonstrates the benefits of minimising regret in securing additive effects from new sources when n > 2. Feedback on correctness provides no guaranteed improvements in univariate or multivariate measures without an active process for minimising regret.

Test performance on our two evaluations on virtual environments is measured with mean accuracy Acc on VLM decisions Dec in the scene summary question SQ, calculated over all N scenes:

Acc_SQ = (100 / N) Σ_{i=1}^{N} I[Dec_i is correct]

We report results for our benchmark on reasoning over visual properties to perform scientific analysis. The new GeoProperties-3DS benchmark compares control methods to variants of our GO-LED-OL and GH-LED metrics in reducing errors when reasoning over the properties of objects in 3D scenes. Versions incorporating active regret minimisation (MI-ar) close to double the performance of a VLM with no control. Performance is measured using balanced error rate (BER).

Cross-modal benchmarks for VLMs where visual inputs are 3D include problems with individual objects Xue et al. (2024) and aligning inputs to improve 3D shape understanding Liu et al. (2024). Search algorithms are proposed by Voigt et al. (2023).


“1. A square two-level coffee table with open areas in all four corners of first level, brown in color.
3. A square shaped broad yellow table with a glass top.
4. A table with brown color polish with long length seating.
5. A four-legged snooker table of green color at top and dark brown at bottom and legs.”

“1. A square shaped broad yellow table with a glass top.
2. A four-legged snooker table of green color at top and dark brown at bottom and legs.
3. A table with brown color polish with long length seating.
4. A square two-level coffee table with open areas in all four corners of first level, brown in color.
5. A four-legged round table made up of steel and wood from below and around transparent glass on it.”

(1/M) Σ_{i=1}^{M} f(x_i, w, bias, y_i) ≤ Rk(ω)

∀t, α_t = 1: (1/M) Σ_{i=1}^{M} f(x_i, w, bias, y_i), min Rk(ω) ⟺ max d_{Up,Down}
