Discovering and Learning Probabilistic Models of Black-Box AI Capabilities

Reading time: 5 minutes

📝 Abstract

Black-box AI (BBAI) systems such as foundational models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI’s capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI’s planning capabilities. It uses the Monte-Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI’s capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.


📄 Content

Users are increasingly utilizing black-box AI systems (BBAIs) that accept high-level objectives and attempt to achieve them. Such BBAIs range from purely digital agents (e.g., vision/language model based task aids such as LLaVA (Liu et al. 2023)) to vision-language-action (VLA) models that control physical robots (e.g., Ha et al. (2023) and Black et al. (2025)). However, currently it is difficult to predict what objectives such BBAIs can reliably achieve and under which conditions. BBAIs can have surprising limitations and side-effects which make their effective usage all but impossible in risk-sensitive scenarios.

This paper presents a new approach for discovering and modeling the limits and capabilities of BBAIs. Our results show that planning domain definition languages (e.g., probabilistic PDDL) can be used effectively for learning and expressing BBAI capability models, and can provide a layer of reliability over BBAIs. Research on world-model learning (Aineto, Celorrio, and Onaindia (2019); Hafner et al. (2025); Geffner (2018)) addresses the orthogonal problem of learning models of an agent’s primitive actions. Models of such actions cannot inform the user about the agent’s capabilities, because capabilities also depend on the agent’s planning and reasoning processes.

Indeed, users may wish to understand the agent function (which takes into account the BBAI’s unknown planning and reasoning mechanisms) rather than the primitive actions of the agent. For example, executing a household robot’s capability to “make coffee” or “clean kitchen” may require planning and executing policies over the primitive actions. A model for the clean-kitchen capability would give the conditions under which the agent can clean the kitchen and a probability distribution over the possible outcomes of executing that capability, expressed in the user’s high-level vocabulary.
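
To make the notion of a capability model concrete, here is a minimal sketch of the kind of object described above: a precondition over relational literals plus a probability distribution over possible outcomes. The class names, the literal syntax, and the example numbers are all illustrative assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    add: frozenset       # literals made true by this outcome
    delete: frozenset    # literals made false by this outcome
    probability: float

@dataclass(frozen=True)
class Capability:
    name: str
    precondition: frozenset  # conjunction of literals that must hold
    outcomes: tuple          # distribution over Outcome objects

    def applicable(self, state: frozenset) -> bool:
        # The capability can be attempted iff every precondition
        # literal holds in the abstract state.
        return self.precondition <= state

# Hypothetical example: "clean kitchen" succeeds 90% of the time,
# and changes nothing on failure.
clean_kitchen = Capability(
    name="clean_kitchen",
    precondition=frozenset({"(robot-in kitchen)", "(has cleaning-supplies)"}),
    outcomes=(
        Outcome(frozenset({"(kitchen-clean)"}), frozenset(), 0.9),
        Outcome(frozenset(), frozenset(), 0.1),
    ),
)

state = frozenset({"(robot-in kitchen)", "(has cleaning-supplies)"})
print(clean_kitchen.applicable(state))  # True
```

Such a model answers exactly the two questions posed above: when can the capability be executed, and what may happen when it is.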

A key limitation of prior work in this area (see Sec. 6 for a broader discussion) has been the restricted expressiveness of the learned models (e.g., deterministic PDDL models and/or conjunctive preconditions), the simplicity of the agents considered (e.g., agents with fixed task-specific policies), and the assumption of known capability sets. This limits their applicability to realistic settings featuring stochasticity, learning agents with evolving capabilities, and non-stationary decision-making algorithms.

We assume only the ability to access the environment, to instruct the agent to complete a task, and knowledge of an abstraction function that translates environment states into a relational vocabulary. Since several directions of research address the problem of learning such abstraction functions (e.g., Shah, Nagpal, and Srivastava (2025), Konidaris, Kaelbling, and Lozano-Perez (2018), Ahmetoglu et al. (2022), Peng et al. (2024), and James, Rosman, and Konidaris (2020)), we focus on the problem of learning capability models given an abstraction function.
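
As a rough illustration of what such a given abstraction function does, the sketch below maps a low-level environment state (here a dict of raw features) to relational literals in the user's vocabulary. The predicates and thresholds are invented for the example; the cited lines of work address how such functions can be learned.

```python
# Hypothetical abstraction function: low-level state -> relational literals.
# Feature names, predicates, and the 0.1 threshold are illustrative only.

def abstraction(x: dict) -> frozenset:
    literals = set()
    if x["robot_room"] == "kitchen":
        literals.add("(robot-in kitchen)")
    if x["mess_level"] < 0.1:
        literals.add("(kitchen-clean)")
    return frozenset(literals)

print(abstraction({"robot_room": "kitchen", "mess_level": 0.05}))
```

Everything downstream (capability discovery and model learning) only sees states through this relational lens, which is what makes the learned models user-interpretable.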

Intuitively, our capability-assessment algorithm operates as follows. It observes the BBAI’s interaction with the environment to discover capabilities that induce state changes discernible with the abstraction function. Playing the role of an interviewer, it creates and assigns evaluation tasks (queries) for the BBAI. It maintains optimistic and pessimistic models of the discovered capabilities, uses them to create new queries using customized MCTS-based algorithms, and eventually learns true capability models.
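
The interviewer idea can be caricatured with a much simpler stand-in for the paper's MCTS-based query generation: keep a pessimistic hypothesis (every candidate literal is a required precondition) and an optimistic one (none is), pose a query that distinguishes them, and prune whichever hypothesis the observed outcome contradicts. All names here (`interview`, `run_query`, the candidate literals) are our own illustrative assumptions.

```python
def interview(candidate_literals, run_query, budget=100):
    """Prune optimistic/pessimistic precondition hypotheses via queries.

    run_query(without=lit) poses a task in which `lit` is false and
    returns True iff the BBAI still succeeds (i.e., lit is not required).
    """
    pessimistic = set(candidate_literals)  # assume everything is required
    optimistic = set()                     # assume nothing is required
    for _ in range(budget):
        unresolved = pessimistic - optimistic
        if not unresolved:
            break                          # the two models have converged
        lit = next(iter(unresolved))
        if run_query(without=lit):
            pessimistic.discard(lit)       # succeeded without it: not required
        else:
            optimistic.add(lit)            # failed without it: required
    return optimistic, pessimistic

# Simulated noise-free agent whose true precondition is known to us:
truth = {"(has supplies)"}
opt, pess = interview(
    {"(has supplies)", "(robot-in kitchen)"},
    run_query=lambda without: without not in truth,
)
print(opt == pess == truth)  # True: both models converge to the truth
```

The real algorithm must additionally discover the capabilities themselves, handle stochastic outcomes, and choose informative queries, which is where the MCTS machinery comes in.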

Our experiments show that this approach can reveal surprising gaps in the capabilities of BBAIs and enable safe usage as well as better design of BBAI systems. It learns models of BBAI capabilities in a language with conditional probabilistic effects without syntactic restrictions on the conditions, for agents including LLM and VLM-based implementations. To our knowledge this is the first approach for learning user-interpretable capability models for the assessment of such a broad class of BBAIs in stochastic settings.

Our main contributions are as follows:

(1) a capability-modeling framework for expressing the capabilities of a BBAI in an environment, with a formal definition of the learning objective (Sec. 2.2); (2) the PCML algorithm for discovering BBAI capabilities and modeling them (Sec. 3); (3) theoretical results on the soundness, completeness, and convergence properties of PCML (Sec. 4); (4) empirical evaluation demonstrating the scope and efficacy of PCML across diverse agents and environments (Sec. 5).

We evaluate BBAIs operating in a stochastic, fully observable environment E that is characterized by a set of environment states X and a set of low-level actions A. We assume access to a simulator S_E for E that supports standard functionality: resetting to an initial state, reverting to any previously encountered state x ∈ X, stepping the simulator given an action a ∈ A to obtain the next state and outcome, and querying the set of possible actions. Otherwise, we do not assume any explicit knowledge.
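
The four simulator operations listed above can be summarized as an interface sketch. The method names and signatures below are our own; the text only specifies the four operations, and the toy implementation exists purely to show the contract.

```python
from typing import Any, Iterable, Protocol

class Simulator(Protocol):
    """Hedged sketch of the interface assumed for the simulator S_E."""
    def reset(self) -> Any: ...            # reset to an initial state
    def revert(self, x: Any) -> None: ...  # revert to a previously seen state
    def step(self, a: Any) -> tuple: ...   # apply action a -> (next_state, outcome)
    def actions(self) -> Iterable: ...     # set of currently possible actions

class ToyCounter:
    """Trivial conforming simulator: the state is an integer, actions are +1/-1."""
    def __init__(self):
        self.x = 0
    def reset(self):
        self.x = 0
        return self.x
    def revert(self, x):
        self.x = x
    def step(self, a):
        self.x += a
        return self.x, None
    def actions(self):
        return (+1, -1)

sim = ToyCounter()
sim.step(+1)
sim.step(+1)
sim.revert(0)   # jump back to the initial state encountered earlier
print(sim.x)    # 0
```

The revert operation is the important one for capability assessment: it lets the interviewer replay the agent from the same state many times to estimate outcome probabilities.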

This content is AI-processed based on ArXiv data.
