On the computability of conditional probability

📝 Original Info

  • Title: On the computability of conditional probability
  • ArXiv ID: 1005.3014
  • Date: 2019-11-19
  • Authors: Researchers mentioned in the ArXiv original paper

📝 Abstract

As inductive inference and machine learning methods in computer science see continued success, researchers are aiming to describe ever more complex probabilistic models and inference algorithms. It is natural to ask whether there is a universal computational procedure for probabilistic inference. We investigate the computability of conditional probability, a fundamental notion in probability theory and a cornerstone of Bayesian statistics. We show that there are computable joint distributions with noncomputable conditional distributions, ruling out the prospect of general inference algorithms, even inefficient ones. Specifically, we construct a pair of computable random variables in the unit interval such that the conditional distribution of the first variable given the second encodes the halting problem. Nevertheless, probabilistic inference is possible in many common modeling settings, and we prove several results giving broadly applicable conditions under which conditional distributions are computable. In particular, conditional distributions become computable when measurements are corrupted by independent computable noise with a sufficiently smooth bounded density.

📄 Full Content

The use of probability to reason about uncertainty has wide-ranging applications in science and engineering, and some of the most important computational problems relate to conditioning, which is used to perform Bayesian inductive reasoning in probabilistic models. As researchers have faced more complex phenomena, their representations have also increased in complexity, which in turn has led to more complicated inference algorithms. It is natural to ask whether there is a universal inference algorithm: in other words, whether it is possible to automate probabilistic reasoning via a general procedure that can compute conditional probabilities for an arbitrary computable joint distribution.

We demonstrate that there are computable joint distributions with noncomputable conditional distributions. As a consequence, no general algorithm for computing conditional probabilities can exist. Of course, the fact that generic algorithms cannot exist for computing conditional probabilities does not rule out the possibility that large classes of distributions may be amenable to automated inference. The challenge for mathematical theory is to explain the widespread success of probabilistic methods and characterize the circumstances when conditioning is possible. In this vein, we describe broadly applicable conditions under which conditional probabilities are computable.

We begin by describing a setting, probabilistic programming, that motivates the search for these results. We proceed to describe the technical frameworks for our results, computable probability theory and the modern formulation of conditional probability. We then highlight related work, and end the introduction with a summary of results of the paper.

Within probabilistic artificial intelligence and machine learning, probabilistic programming provides formal languages and algorithms for describing and computing answers from probabilistic models. Probabilistic programming languages themselves build on modern programming languages and their facilities for recursion, abstraction, modularity, etc., to enable practitioners to define intricate, in some cases infinite-dimensional, models by implementing a generative process that produces an exact sample from the model’s joint distribution. Probabilistic programming languages have been the focus of a long tradition of research within programming languages, model checking, and formal methods. For some of the early approaches within the AI and machine learning community, see, e.g., the languages PHA [Poole 1991], IBAL [Pfeffer 2001], Markov Logic [Richardson and Domingos 2006], λ○ [Park et al. 2008], Church [Goodman et al. 2008], HANSEI [Kiselyov and Shan 2009], and Infer.NET [Minka et al. 2010].

In many of these languages, one can easily represent the higher-order stochastic processes (e.g., distributions on data structures, distributions on functions, and distributions on distributions) that are essential building blocks in modern nonparametric Bayesian statistics. In fact, the most expressive such languages are each capable of describing the same robust class as the others, namely the class of computable distributions, which delineates those from which a probabilistic Turing machine can sample to arbitrary accuracy.

Traditionally, inference algorithms for probabilistic models have been derived and implemented by hand. In contrast, probabilistic programming systems have introduced varying degrees of support for computing conditional distributions. Given the rate of progress toward broadening the scope of these algorithms, one might hope that there would eventually be a generic algorithm supporting the entire class of computable distributions.

Despite recent progress towards such a general algorithm, support for conditioning with respect to continuous random variables has remained incomplete. Our results explain why this is necessarily the case.

In order to study computable probability theory and the computability of conditioning, we work within the framework of Type-2 Theory of Effectivity (TTE) and use appropriate representations for topological and measurable objects such as distributions, random variables, and maps between them. This framework builds upon and contains as a special case ordinary Turing computation on discrete spaces, and gives us a basis for precisely describing the operations that probabilistic programming languages are capable of performing.

In particular, we study the computability of distributions on computable Polish spaces including, e.g., certain spaces of distributions on distributions. In Section 2 we present the necessary definitions and results from computable probability theory.

For an experiment with a discrete set of outcomes, computing conditional probabilities is, in principle, straightforward as it is simply a ratio of probabilities. However, in the case of conditioning on the value of a continuous random variable, this ratio is undefined. Furthermore, in modern Bayesian statistics, and especially the probabilistic programming setting, it is common to place distributions on higher-order objects, and so one is already in a situation where elementary notions of conditional probability are insufficient and more sophisticated measure-theoretic notions are necessary. Kolmogorov [1933] gave an axiomatic characterization of conditional probabilities and an abstract construction of them using Radon-Nikodym derivatives, but this definition and construction do not yield a general recipe for their calculation. There is a further problem: in this setting, conditional probabilities are formalized as measurable functions that are defined only up to measure zero sets. Therefore, without additional assumptions, a conditional probability is not necessarily well-defined for any particular value of the conditioning random variable. This has long been understood as a challenge for statistical applications, in which one wants to evaluate conditional probabilities given particular values for observed random variables. In this paper, we are therefore especially interested in situations where it makes sense to ask for the conditional distribution given a particular point. One of our main results is in the setting where there is a unique continuous conditional distribution. In this case, conditioning yields a canonical answer, which is a natural desideratum for statistical applications.
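For the discrete case mentioned at the start of this discussion, the "ratio of probabilities" can be made concrete. The following is a minimal sketch, assuming a finite joint distribution stored as a dictionary with exact rational weights; the function name `conditional_pmf` and the toy weather table are illustrative and do not come from the paper.

```python
from fractions import Fraction

def conditional_pmf(joint, y):
    """Return P(X = x | Y = y) for a finite joint pmf given as {(x, y): prob}."""
    # Marginal probability of the conditioning event {Y = y}.
    marginal = sum(p for (_, y2), p in joint.items() if y2 == y)
    if marginal == 0:
        # The ratio is undefined when the conditioning event has probability zero.
        raise ValueError("conditioning event has probability zero")
    return {x: p / marginal for (x, y2), p in joint.items() if y2 == y}

# Hypothetical toy joint distribution (illustrative numbers only).
joint = {
    ("rain", "wet"): Fraction(3, 10),
    ("rain", "dry"): Fraction(1, 10),
    ("sun", "wet"): Fraction(1, 10),
    ("sun", "dry"): Fraction(5, 10),
}
posterior = conditional_pmf(joint, "wet")
```

The `ValueError` branch is exactly the case that arises for every single value of a continuous random variable, which is why the elementary ratio does not extend to that setting.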

A large body of work in probability and statistics is concerned with the derivation of conditional probabilities and distributions in special circumstances, each situation often requiring some special insight into the structure of the answer, especially when it was desirable for conditional probabilities and distributions to be defined at points, as in Bayesian statistical applications. This state of affairs motivated work on constructive definitions of conditioning (such as those due to Tjur [1974; 1975; 1980], Pfanzagl [1979], and Rao [1988; 2005]), although this work has not been sensitive to issues of computability.

Under certain conditions, such as when conditional densities exist, conditioning can proceed using the classic Bayes’ rule; however, it may not be possible to compute the density of a computable distribution (if the density exists at all), as we describe in Section 9.2.

We recall the basics of the measure-theoretic approach to conditional probability in Section 3, and in Section 4 we use notions from computable probability theory to consider the sense in which conditioning could be potentially computable.

We now describe several other connections between conditional probability and computation.

1.4.1 Complexity Theory of Finite Discrete Distributions. Conditional probabilities for computable distributions on finite, discrete sets are clearly computable, but may not be efficiently so. In this finite discrete setting, there are already interesting questions of computational complexity, which have been explored by a number of authors through extensions of Levin’s theory of average-case complexity [Levin 1986]. For example, under cryptographic assumptions, it is difficult to sample from the conditional distribution of a uniformly distributed binary string of length n given its image under a one-way function. This can be seen to follow from the work of Ben-David, Chor, Goldreich, and Luby [1992] in their theory of polynomial-time samplable distributions, which has since been extended by Yamakami [1999] and others. Other positive and negative complexity results have been obtained in the particular case of Bayesian networks by Cooper [1990] and Dagum and Luby [1993; 1997]. Extending these complexity results to the more general setting considered here could bear on the practice of statistical AI and machine learning.

1.4.2 Computable Bayesian Learners. Osherson, Stob, and Weinstein [1988] study learning theory in the setting of identifiability in the limit (see [Gold 1967] and [Putnam 1965] for more details on this setting) and prove that a certain type of “computable Bayesian” learner fails to identify the index of a (computably enumerable) set that is “computably identifiable” in the limit. More specifically, a “Bayesian learner” is required to return an index for a set with the highest conditional probability given a finite prefix of an infinite sequence of random draws from the unknown set. An analysis by Roy [2011] of their construction reveals that the conditional distribution of the index given the infinite sequence is an everywhere discontinuous function (on every measure one set), hence noncomputable for much the same reason as our elementary construction involving a mixture of measures concentrated on the rationals and on the irrationals (see Section 5). As we argue, in the context of statistical analysis, it is more appropriate to study the conditioning operator when it is restricted to those random variables whose conditional distributions admit versions that are continuous everywhere, or at least on a measure one set.

1.4.3 Induction with respect to Universal Priors. Our work is distinct from the study of conditional distributions with respect to priors that are universal for partial computable functions (as defined using Kolmogorov complexity) by Solomonoff [1964], Zvonkin and Levin [1970], and Hutter [2007]. The computability of conditional distributions also has a rather different character in Takahashi’s work on the algorithmic randomness of points defined using universal Martin-Löf tests [Takahashi 2008]. The objects with respect to which one is conditioning in these settings are typically not computable (e.g., the universal semimeasure is merely lower semicomputable). In the present paper, we are interested in the problem of computing conditional distributions of random variables that are computable, even though the conditional distribution may itself be noncomputable.

1.4.4 Radon-Nikodym Derivatives. In the abstract setting, conditional probabilities are (suitably measurable) Radon-Nikodym derivatives. In work motivated by questions in algorithmic randomness, Hoyrup and Rojas [2011] study notions of computability for absolute continuity and for Radon-Nikodym derivatives as elements in $L^1$, i.e., the space of integrable functions. Hoyrup, Rojas, and Weihrauch [2011] then show an equivalence between the problem of computing Radon-Nikodym derivatives as elements in $L^1$ and computing the characteristic function of computably enumerable sets. The noncomputability of the Radon-Nikodym derivative operator is demonstrated by a pair $\mu, \nu$ of computable measures whose Radon-Nikodym derivative $d\mu/d\nu$ is not computable as an element in $L^1(\nu)$. However, the Radon-Nikodym derivatives they study do not correspond to conditional probabilities, and so the computability of the operator restricted to those maps arising in the construction of conditional probabilities is not addressed by this work. The underlying notion of computability is another important difference. An element in $L^1(\nu)$ is an equivalence class of functions, every pair agreeing on a set of $\nu$-measure one. Thus one cannot, in general, evaluate these derivatives at points in a well-defined manner. Most Bayesian statisticians would be unfamiliar with, and perhaps unsatisfied by, this notion of computability, especially in settings where their statistical models admit continuous versions of Radon-Nikodym derivatives that are unique, and thus well-defined pointwise. Regardless, we will show that even in such settings, computing conditional probabilities is not possible in general, even in the weaker $L^1$ sense. On the other hand, our positive results do yield computable probabilities/distributions defined pointwise.

Following our presentation of computable probability theory and conditional probability in Sections 2 through 4, we provide our main positive and negative results about the computability of conditional probability, which we now summarize. Recall that measurable functions are often defined only up to a measure-zero set; any two functions that agree almost everywhere are called versions of each other.

In Proposition 5.1, we construct random variables X and C that are computable on a P-measure one set, such that every version of the conditional distribution map $P[C \mid X = \cdot\,]$ (i.e., a probability-measure-valued function f such that f(X) is a regular version of the conditional distribution $P[C \mid X]$) is discontinuous everywhere, even when restricted to a $P_X$-measure one subset. (We make these notions precise in Section 4.) The construction makes use of the elementary fact that the indicator function for the rationals in the unit interval (the so-called Dirichlet function) is itself nowhere continuous.

Because every function computable on a domain D is continuous on D, discontinuity is a fundamental barrier to computability, and so this construction rules out the possibility of a completely general algorithm for conditioning. A natural question is whether conditioning is a computable operation when we restrict the operator to random variables for which some version of the conditional distribution is continuous everywhere, or at least on a measure one set.

In fact, even under this restriction, conditioning is not even continuous, let alone computable, as we show in Section 6. We further demonstrate that if some computer program purports to state true facts about the conditional distribution of a computable joint distribution provided as input, then we can uniformly find some other representation of the input distribution such that the program does not output any nontrivial fact about the conditional distribution.

Our central result, Theorem 7.6, provides a pair of random variables that are computable on a measure one set, but such that the conditional distribution of one variable given the other is not computable on any measure one set (though some version is continuous on a measure one set). The construction involves encoding the halting times of all Turing machines into the conditional distribution map while ensuring that the joint distribution remains computable. This result yields another proof of the noncomputability of the conditioning operation restricted to measures having conditional distributions that are continuous on a measure one set.

In Section 8 we extend our central result by constructing a pair of random variables, again computable on a measure one set, whose conditional distribution map is noncomputable but has an everywhere continuous version with infinitely differentiable conditional probability maps. This construction proceeds by smoothing out the distribution constructed in Section 7, but in such a way that one can still compute the halting problem relative to the conditional distribution. This result implies that conditioning is not a computable operation, even when we further restrict to the case where the conditional distribution has an everywhere continuous version.

Despite the noncomputability of conditioning in general, conditional distribution maps are often computable in practice. We provide some explanation of this phenomenon by characterizing several circumstances in which conditioning is a computable operation. Under suitable computability hypotheses, conditioning is computable in the discrete setting (Proposition 9.2) and where there is a conditional density (Corollary 9.6).

We also characterize a situation in which conditioning is possible in the presence of noisy data, capturing many natural models in science and engineering. Let U, V, and E be computable random variables where U and E are real-valued, and suppose that $P_E$ is absolutely continuous with a bounded computable density $p_E$ and E is independent of U and V. We can think of U + E as the corruption of an idealized measurement U by an independent source of additive error E. In Corollary 9.7, we show that the conditional distribution map $P[(U, V) \mid U + E = \cdot\,]$ is computable.

Finally, we discuss how symmetry, in the form of exchangeability, can contribute to the computability of conditional distributions.
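To give some intuition for the noisy-measurement setting summarized above (an idealized value U observed as U + E), here is a minimal sketch of Bayes' rule in a deliberately simple special case. It assumes U takes finitely many values and E is Gaussian; both are assumptions made for illustration, as are the function names. It uses floating point rather than exact real arithmetic, so it illustrates the formula and is not the construction behind Corollary 9.7.

```python
from math import exp, pi, sqrt

def gaussian_density(e, sigma=1.0):
    """Bounded, smooth density standing in for the noise density p_E (assumed Gaussian)."""
    return exp(-0.5 * (e / sigma) ** 2) / (sigma * sqrt(2 * pi))

def posterior_given_noisy_sum(prior, s, noise_density=gaussian_density):
    """Return P(U = u | U + E = s) for a finitely supported prior {u: prob}.

    Since E is independent of U, the density of U + E at s is
    sum_u P(U = u) * p_E(s - u), and Bayes' rule gives the posterior weights.
    """
    weights = {u: p * noise_density(s - u) for u, p in prior.items()}
    total = sum(weights.values())  # density of U + E at s; positive for Gaussian noise
    return {u: w / total for u, w in weights.items()}

# A fair prior on {0, 1}, observed halfway between the two values.
posterior = posterior_given_noisy_sum({0.0: 0.5, 1.0: 0.5}, s=0.5)
```

Because the noise density is evaluated at s - u, the posterior varies continuously with the observation s, in contrast with the discontinuous conditional distribution maps described earlier.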

We now give some background on computable probability theory, which will enable us to formulate our results. The foundations of the theory include notions of computability for probability measures developed by Edalat [1996], Weihrauch [1999], Schröder [2007], and Gács [2005]. Computable probability theory itself builds off notions and results in computable analysis, specifically the Type-2 Theory of Effectivity. For a general introduction to this approach to real computation, see Weihrauch [2000], Braverman [2005] or Braverman and Cook [2006].

We first recall some elementary definitions from computability theory (see, e.g., Rogers [1987, Ch. 5]). A set of natural numbers (potentially in some correspondence with, e.g., rationals, integers, or other finitely describable objects with an implicit enumeration) is computable when there is a computer program that, given k, outputs whether or not k is in the set. A set is computably enumerable (c.e.) when there is a computer program that outputs every element of the set eventually. Note that a set is computable when both it and its complement are c.e. We say that a sequence of sets $\{B_n\}$ is computable uniformly in n when there is a single computer program that, given n and k, outputs whether or not k is in $B_n$. We say that the sequence is c.e. uniformly in n when there is a computer program that, on input n, outputs every element of $B_n$ eventually.
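The difference between deciding membership and merely enumerating members can be sketched with a dovetailing loop. The example below is an illustration only: it assumes a step-bounded check as a stand-in for "program k halts within a given number of steps", and it uses Collatz trajectories purely as a convenient toy process. All names are hypothetical.

```python
from itertools import count, islice

def reaches_one_within(k, steps):
    """Bounded check: does the Collatz trajectory of k hit 1 in at most `steps` steps?"""
    n = k
    for _ in range(steps + 1):
        if n == 1:
            return True
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return False

def enumerate_ce_set():
    """Dovetail over (k, step bound) so that every member is output eventually."""
    seen = set()
    for bound in count(1):
        for k in range(1, bound + 1):
            if k not in seen and reaches_one_within(k, bound):
                seen.add(k)
                yield k

# The enumeration never terminates on its own, so take a finite prefix.
first_members = list(islice(enumerate_ce_set(), 5))
```

Members need not appear in increasing order, and a number that has not yet appeared might still appear later. This is why an enumeration alone does not yield a membership test unless the complement can be enumerated as well.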

We now recall basic notions of computability for real numbers (see, e.g., [Weihrauch 2000, Ch. 4.2] or [Nies 2009, Ch. 1.8]). We say that a real r is a c.e. real (sometimes called a left-c.e. real) when the set of rationals $\{q \in \mathbb{Q} : q < r\}$ is c.e. A real r is computable when both it and its negative are c.e. Equivalently, a real is computable when there is a program that approximates it to any given accuracy (e.g., given an integer k as input, the program reports a rational that is within $2^{-k}$ of the real). A function $f : \mathbb{N} \to \mathbb{R}$ is lower semicomputable when $f(n)$ is a c.e. real, uniformly in n (i.e., when the collection of rationals less than $f(n)$ is c.e. uniformly in n). Likewise, a function is upper semicomputable when its negative is lower semicomputable. The function f is computable if and only if it is both lower and upper semicomputable.
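As a concrete instance of the approximation-based definition, the following sketch computes rational approximations to $\sqrt{2}$ using only exact integer arithmetic. The choice of $\sqrt{2}$ and the function name are illustrative assumptions; any real with such an approximating program is computable in the sense above.

```python
from fractions import Fraction
from math import isqrt

def sqrt2_approx(k):
    """Return a rational q with |q - sqrt(2)| < 2**-k, for an integer k >= 0."""
    scale = 2 ** (k + 1)
    # isqrt gives floor(sqrt(2) * scale) exactly, with no floating-point error.
    numerator = isqrt(2 * scale * scale)
    # q <= sqrt(2) < q + 1/scale, so the error is below 2**-(k+1) < 2**-k.
    return Fraction(numerator, scale)

q = sqrt2_approx(20)
```

Each output is a rational lying below $\sqrt{2}$, so running the program for k = 0, 1, 2, ... also enumerates rationals approaching $\sqrt{2}$ from below, in line with the c.e. characterization.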

Recall that a Polish space is a topological space that admits a metric under which it is a complete separable metric space. Computable Polish spaces, as developed in computable analysis [Hemmerling 2002; Weihrauch 1993] and effective domain theory [Blanck 1997; Edalat and Heckmann 1998], provide a convenient framework for formulating results in computable probability theory. For consistency, we largely use definitions from [Hoyrup and Rojas 2009b] and [Galatolo et al. 2010]. Additional details about computable Polish spaces, sometimes called computable metric spaces or effective Polish spaces, can also be found in [Weihrauch 2000, Ch. 8.1], [Gács 2005, §B.3], and [Moschovakis 2009, Ch. 3I].

Definition 2.1 (Computable Polish space [Galatolo et al. 2010, Def. 2.3.1]). A computable Polish space is a triple $(S, \delta, D)$ for which $\delta$ is a metric on the set $S$ satisfying

(1) $(S, \delta)$ is a complete separable metric space;

(2) $D = \{s_i\}_{i \in \mathbb{N}}$ is an enumeration of a dense subset of $S$, called ideal points; and

(3) the real numbers $\delta(s_i, s_j)$ are computable, uniformly in $i$ and $j$.

In particular, note that condition (1) implies that the topological space determined by the metric space $(S, \delta)$ is a Polish space. Let $B(s_i, q_j)$ denote the ball of radius $q_j$ centered at $s_i$. We call the elements of the set

$$\mathcal{B}_S := \{B(s_i, q_j) : s_i \in D \text{ and } q_j \in \mathbb{Q} \text{ s.t. } q_j > 0\} \tag{1}$$

the ideal balls of $S$, and fix the canonical enumeration of them induced by that of $D$ and $\mathbb{Q}$.

Let $\mathscr{B}_S$ denote the Borel σ-algebra on a Polish space $S$, i.e., the σ-algebra generated by the open balls of $S$. Let $\mathcal{M}_1(S)$ denote the set of Borel probability measures on $S$.

In this paper we primarily work with computable Polish spaces. As such, unless otherwise noted, the σ-algebras will always be the Borel σ-algebras on such spaces, which in particular makes them standard Borel spaces. Measurable functions between Polish spaces will always be measurable with respect to the Borel σ-algebras. We will sometimes refer to measurable subsets of a probability space as events.

Example 2.2. The set $\{0, 1\}$ is a computable Polish space under the discrete metric, where $\delta(0, 1) = 1$. Cantor space, the set $\{0, 1\}^\infty$ of infinite binary sequences, is a computable Polish space under its usual metric and the dense set of eventually constant strings (under a standard enumeration of finite strings).

The set R of real numbers is a computable Polish space under the Euclidean metric with the dense set Q of rationals (under its standard enumeration).

Suppose we are given a finite sequence $(T_0, \delta_0, D_0), \ldots, (T_{n-1}, \delta_{n-1}, D_{n-1})$ of computable Polish spaces. Then the product metric space $\prod_{i=0}^{n-1} T_i$ (with any one of the equivalent standard product metrics) is a computable Polish space where the ideal points consist of all finite products of ideal points. Furthermore, given a countably infinite such sequence $(T_0, \delta_0, D_0), (T_1, \delta_1, D_1), \ldots$ that is uniformly computable and has a fixed bound on the diameter, the product metric space is the metric space whose underlying space is $\prod_{i \in \mathbb{N}} T_i$ and whose metric is given by $\delta(x, y) = \sum_{i \in \mathbb{N}} 2^{-i} \delta_i(x_i, y_i)$; this too can be made into a computable Polish space, by taking the ideal points to be those sequences $(x_0, x_1, \ldots)$ with each $x_i \in D_i$ such that for all but finitely many terms $i$, the point $x_i$ is the first element in the enumeration of $D_i$. Note that in both the finite and infinite case, all projection maps are computable.
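The bounded-diameter assumption is what makes the infinite product metric approximable: the tail of the weighted sum can be bounded in advance. The sketch below assumes, for simplicity, that every component metric is bounded by 1 and returns exact rationals (in general the component distances are only computable reals). Points are represented as functions from an index to a coordinate, and all names are illustrative.

```python
from fractions import Fraction

def product_metric_approx(x, y, component_metric, k):
    """Approximate sum_i 2**-i * d_i(x_i, y_i) to within 2**-k.

    Assumes each d_i is bounded by 1, so the omitted tail over i >= k + 2
    is at most 2**-(k + 1), which is below 2**-k.
    """
    return sum(
        (Fraction(1, 2 ** i) * component_metric(i, x(i), y(i)) for i in range(k + 2)),
        Fraction(0),
    )

def discrete_metric(i, a, b):
    """Discrete metric on {0, 1}, used for every coordinate."""
    return Fraction(0) if a == b else Fraction(1)

# Two binary sequences that differ only at coordinate 3.
zeros = lambda i: 0
one_at_three = lambda i: 1 if i == 3 else 0
d = product_metric_approx(zeros, one_at_three, discrete_metric, k=5)
```

Only finitely many coordinates are ever inspected for a given accuracy, which is the same reason the projection maps are computable.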

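Definition 2.3 (Computable point [Galatolo et al. 2010, Def. 2.3.2]). Let $(S, \delta, D)$ be a computable Polish space with $D = \{s_j\}_{j \in \mathbb{N}}$ and $x \in S$. Given a sequence $\{i_k\}_{k \in \mathbb{N}}$ of natural numbers, we say that the sequence $\{s_{i_k}\}_{k \in \mathbb{N}}$ of ideal points is a representation of the point $x$ when $\delta(s_{i_k}, x) < 2^{-k}$ for all $k$. When the sequence $\{i_k\}_{k \in \mathbb{N}}$ is computable, we say that the representation is computable and that the point $x$ is computable.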

Remark 2.4. A real α ∈ R is computable (as in Section 2.1) if and only if α is a computable point of R (as a computable Polish space). Although most of the familiar reals are computable, there are only countably many computable reals, and so almost every real is not computable.

The notion of a c.e. open set (or $\Sigma^0_1$ class) is fundamental in classical computability theory, and admits a simple definition in an arbitrary computable Polish space.

Definition 2.5 (C.e. open set [Galatolo et al. 2010, Def. 2.3.3]). Let $(S, \delta, D)$ be a computable Polish space with the corresponding enumeration $\{B_i\}_{i \in \mathbb{N}}$ of the ideal open balls $\mathcal{B}_S$. We say that $U \subseteq S$ is a c.e. open set when there is some c.e. set $E \subseteq \mathbb{N}$ such that $U = \bigcup_{i \in E} B_i$.

Note that the class of c.e. open sets is closed under computable unions and finite intersections. A computable function can be thought of as a continuous function whose local modulus of continuity is witnessed by a program. It is important to consider the computability of partial functions, because many natural and important random variables are continuous only on a measure one subset of their domain.

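Definition 2.6 (Computable partial function [Galatolo et al. 2010, Def. 2.3.6]). Let $(S, \delta_S, D_S)$ and $(T, \delta_T, D_T)$ be computable Polish spaces, the latter with the corresponding enumeration $\{B_n\}_{n \in \mathbb{N}}$ of the ideal open balls $\mathcal{B}_T$. A function $f : S \to T$ is said to be computable on $R \subseteq S$ when there is a computable sequence $\{U_n\}_{n \in \mathbb{N}}$ of c.e. open sets $U_n \subseteq S$ such that $f^{-1}[B_n] \cap R = U_n \cap R$ for all $n \in \mathbb{N}$.

We call such a sequence $\{U_n\}_{n \in \mathbb{N}}$ a witness to the computability of $f$. Note that the notion of being computable on a set $R$ can be relativized to an oracle $A \subseteq \mathbb{N}$ in the obvious way. A function is continuous on $R$ if and only if it is $A$-computable on $R$ for some oracle $A$.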

Remark 2.7. Let S and T be computable Polish spaces. If $f : S \to T$ is computable on some subset $R \subseteq S$, then for every computable point $x \in R$, the point $f(x)$ is also computable. One can show that f is computable on R when there is an oracle Turing machine that, upon being fed representations of points $x \in R$ on its oracle tape, computes representations of their images $f(x) \in T$. (For more details, see [Hoyrup and Rojas 2009b, Prop. 3.3.2].)

The standard notion of computability of functions between computable Polish spaces is too restrictive in most cases when the inputs to these functions are points in a probability space. For example, the Heaviside function f (x) = 1(x ≥ 0) is not computable on any set containing a neighborhood of 0. However, we can reliably compute the image of a Gaussian random variable under f , because the Gaussian random variable is nonzero with probability one, and f is computable on R \ {0}.
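The Heaviside example can be sketched as a program acting on a representation of its input, here modeled as a function `approx(k)` returning a rational within $2^{-k}$ of x. This interface and the `max_precision` cutoff are assumptions made so that the sketch terminates; an actual Type-2 machine would simply run forever on an input equal to 0.

```python
from fractions import Fraction

def heaviside(approx, max_precision=64):
    """Compute 1(x >= 0) from a representation of x, valid whenever x != 0.

    `approx(k)` must return a rational within 2**-k of x.
    """
    for k in range(max_precision + 1):
        q = approx(k)
        eps = Fraction(1, 2 ** k)
        if q - eps > 0:
            return 1  # x is at least q - eps, which is positive
        if q + eps < 0:
            return 0  # x is at most q + eps, which is negative
    # At x = 0 no finite precision separates x from both signs.
    # The cutoff is a sketch-only assumption; the idealized search never halts here.
    raise RuntimeError("could not separate x from 0 at the allowed precision")

positive = heaviside(lambda k: Fraction(1, 3))
negative = heaviside(lambda k: Fraction(-1, 1000))
```

The search halts on every nonzero input, and the set of inputs where it fails has probability zero under a Gaussian distribution, which is the sense in which the image of a Gaussian random variable under f can be computed reliably.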

For a measure space $(\Omega, \mathcal{G}, \mu)$, a set $E \in \mathcal{G}$ is a µ-null set when $\mu(E) = 0$. More generally, for $p \in [0, \infty]$, we say that E is a µ-measure p set when $\mu(E) = p$. A predicate P on Ω is said to hold µ-almost everywhere (abbreviated µ-a.e.) if the event $E_P = \{\varpi \in \Omega : P(\varpi) \text{ does not hold}\}$ is a µ-null set. When $E_P$ is a µ-null set and µ is a probability measure, we will instead say that P holds µ-almost surely, and we likewise say that an event $E \in \mathcal{G}$ occurs µ-almost surely (abbreviated µ-a.s.) when $\mu(E) = 1$. In each case, we may drop the prefix µ when it is clear from context (in particular, when it holds of P).

Definition 2.8. Let S and T be Polish spaces and µ a probability measure on S. A measurable function f : S → T is µ-almost continuous when it is continuous on a µ-measure one set. When S and T are computable Polish spaces, the measurable function f is µ-almost computable when it is computable on a µ-measure one set.

(See [Hoyrup and Rojas 2009b] for further development of the theory of almost computable functions.) The following result relates µ-almost continuity to µ-a.e. continuity, i.e., the set of continuity points being a µ-measure one set. The proofs of the following proposition and lemma are due to François Dorais, Gerald Edgar, and Jason Rute [2013].

Proposition 2.9. Let X and Y be Polish spaces, let µ be a probability measure on X, and let $f : X \to Y$ be a µ-almost continuous function. Then there is a µ-a.e. continuous $g : X \to Y$ that agrees with f µ-a.e.

We will need the following technical lemma. Recall that a $G_\delta$ set is a countable intersection of open sets.

Lemma 2.10. Let X be a Polish space. If $D \subseteq X$ is a nonempty $G_\delta$ set then there is a measurable map $h : X \to D$ such that $\lim_{x \to x_0} h(x) = x_0$ for every $x_0 \in D$.

) will be as required. By definition, it is always possible to find a suitable $h(x) \in D$ for each $x \in U_0 \setminus D$. To ensure that h is measurable, fix an enumeration $(d_i)_{i \in \mathbb{N}}$ of a countable dense subset of D and, if $x \in U_0 \setminus D$, define $h(x)$ to be the first element in this list that matches all the necessary requirements. (We must have $h(x) = x$ for $x \in D$, and it does not matter how $h(x)$ is defined when $x \notin U_0$ so long as the end result is measurable.) □

Proof of Proposition 2.9. By a classical result of Kuratowski [Kechris 1995, I.3.B, Thm. 3.8], we may assume (after first possibly changing its value on a µ-null set) that f is continuous on a µ-measure one $G_\delta$ set $D \subseteq X$.

Let h be as in Lemma 2.10. Then $g = f \circ h$ is a measurable function that agrees with f on D, and, because h takes values in D and f is continuous on D,

$$\lim_{x \to x_0} g(x) = \lim_{x \to x_0} f(h(x)) = f(x_0) = g(x_0)$$

for all $x_0 \in D$. □

Remark 2.11. Let S and T be computable Polish spaces. A set $X \subseteq S$ is an effective $G_\delta$ set (or $\Pi^0_2$ class) when it is the intersection of a uniformly computable sequence of c.e. open sets. Suppose that $f : S \to T$ is computable on $R \subseteq S$ with $\{U_n\}_{n \in \mathbb{N}}$ a witness to the computability of f. One can show that there is an effective $G_\delta$ set $R' \supseteq R$ and a function $f' : S \to T$ such that $f'$ is computable on $R'$, the restriction of $f'$ to R and f are equal as functions, and $\{U_n\}_{n \in \mathbb{N}}$ is a witness to the computability of $f'$. Furthermore, a $G_\delta$-code for some such $R'$ can be computed uniformly from a code for the witness $\{U_n\}_{n \in \mathbb{N}}$. For details, see [Hoyrup 2008, Thm. 1.6.2.1]; this generalizes a classical result of Kuratowski [Kechris 1995, I.3.B, Thm. 3.8]. In conclusion, one can always assume that the set R is an effective $G_\delta$ set.

We will introduce a weaker notion of computability for functions in Section 2.6.

Intuitively, a random variable maps an input source of randomness to an output, inducing a distribution on the output space. Here we will use a sequence of independent fair coin flips as our source of randomness. We formalize this via the probability space $(\{0, 1\}^\infty, \mathcal{F}, P)$, where $\{0, 1\}^\infty$ is the product space of infinite binary sequences, $\mathcal{F}$ is its Borel σ-algebra (generated by the set of basic clopen cylinders extending each finite binary sequence), and P is the product measure formed from the uniform distribution on $\{0, 1\}$. Throughout the rest of the paper we will take $(\{0, 1\}^\infty, \mathcal{F}, P)$ to be the basic probability space. We will use a SANS SERIF font for random variables.

Definition 2.12 (Random variable and its distribution). Let S be a Polish space. A random variable in S is a measurable function X : {0, 1} ∞ → S. For a measurable subset A ⊆ S, we let {X ∈ A} denote the inverse image X⁻¹[A] = {ϖ ∈ {0, 1} ∞ : X(ϖ) ∈ A}, and for x ∈ S we similarly define the event {X = x }. We will write P X for the distribution of X, which is the measure on S defined by

$$P_X(B) := P\{X \in B\}$$

for all measurable subsets B ⊆ S.

If S is a computable Polish space then we say a random variable X in S is a P-almost computable random variable when it is P-almost computable as a measurable function. Intuitively, X is a P-almost computable random variable when there is a program that, given access to an oracle bit tape ϖ ∈ {0, 1} ∞ , outputs a representation of the point X(ϖ) (i.e., enumerates a sequence {x i } in D where δ (x i , X(ϖ)) < 2^-i for all i), for all but a P-measure zero subset of bit tapes ϖ ∈ {0, 1} ∞ .

Even though the source of randomness is a sequence of discrete bits, there are P-almost computable random variables with continuous distributions, such as a uniform random variable (gotten by subdividing the unit interval according to the random bit tape) or an i.i.d. sequence of uniformly distributed random variables (by splitting up the given element of {0, 1} ∞ into countably many disjoint subsequences and dovetailing the constructions). (For explicit constructions, see, e.g., [Freer and Roy 2010, Ex. 3, 4].)
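As a minimal illustration of the first construction (a sketch under our own assumptions, not the construction of [Freer and Roy 2010]), the following Python fragment reads a bit tape and emits dyadic rationals q_i with |q_i − U| < 2^-i, where U is the point of [0, 1] coded by the tape. The names `uniform_approximations` and `tape_from_prefix` are hypothetical, and the tape is simulated by a generator rather than by true coin flips.

```python
from fractions import Fraction
from itertools import islice


def uniform_approximations(bits):
    """Yield dyadic rationals q_0, q_1, ... with |q_i - U| < 2**-i,
    where U = sum_k bits[k] * 2**-(k+1) is the point coded by the tape."""
    partial = Fraction(0)
    for k, b in enumerate(bits):
        partial += Fraction(b, 2 ** (k + 1))
        # After reading bits 0..k, U lies in [partial, partial + 2**-(k+1)],
        # so the k-th output is within 2**-(k+1) < 2**-k of U.
        yield partial


def tape_from_prefix(prefix):
    """Hypothetical helper: a bit tape that starts with `prefix`, then all zeros."""
    yield from prefix
    while True:
        yield 0


# The tape 1, 0, 1, 0, 0, ... codes the point 5/8.
approx = list(islice(uniform_approximations(tape_from_prefix([1, 0, 1])), 5))
```

Each output depends only on a finite prefix of the tape, which is the sense in which the map from tapes to points is computable.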

It is crucial that we consider random variables that are merely computable on a P-measure one subset of {0, 1} ∞ . To see why, consider the following example, which was communicated to us by Martín Escardó. For a real α ∈ [0, 1], we say that a binary random variable X : {0, 1} ∞ → {0, 1} is a Bernoulli(α) random variable when P X {1} = α. There is a Bernoulli(1/2) random variable that is computable on all of {0, 1} ∞ , given by the program that simply outputs the first bit of the input sequence. Likewise, when α is dyadic (i.e., a rational whose denominator is a power of 2), there is a Bernoulli(α) random variable that is computable on all of {0, 1} ∞ . However, this is not possible for any other choice of α (e.g., 1/3).

Lemma 2.13. Let α ∈ [0, 1] be a nondyadic real. Every Bernoulli(α) random variable X : {0, 1} ∞ → {0, 1} is discontinuous, hence not computable on all of {0, 1} ∞ .

Proof. Assume X is continuous. Let Z 0 := X⁻¹(0) and Z 1 := X⁻¹(1). Then {0, 1} ∞ = Z 0 ∪ Z 1 , and so both are closed (as well as open). The compactness of {0, 1} ∞ implies that these closed subspaces are also compact, and so Z 0 and Z 1 can each be written as a finite disjoint union of clopen basis elements. But each of these elements has dyadic measure, hence their sum cannot be either α or 1 − α, contradicting the fact that P(Z 1 ) = 1 − P(Z 0 ) = α. □

On the other hand, for an arbitrary computable α ∈ [0, 1], consider the random variable X α given by X α (x) = 1 if ∑_{i=0}^∞ x i 2^(-i-1) < α and 0 otherwise. This construction, due to Mann [1973], is a Bernoulli(α) random variable and is computable on every point of {0, 1} ∞ other than a binary expansion of α. Not only are these random variables P-almost computable, but they can be shown to be optimal in their use of input bits, via the classic analysis of rational-weight coins by Knuth and Yao [1976]. Hence it is natural to focus our attention on random variables that are merely P-almost computable.
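The behaviour of X α can be sketched in Python as follows. This is an illustrative sketch only: we assume α is given exactly as a rational (a `Fraction`), whereas the text allows any computable real, and the function name `bernoulli` and the `max_bits` cutoff are our own. The program reads bits until the dyadic interval determined by the prefix lies entirely on one side of α; on a binary expansion of α it never decides, which the sketch reports as `None`.

```python
from fractions import Fraction
from itertools import product


def bernoulli(alpha, bits, max_bits=64):
    """Return 1 if the point coded by `bits` is < alpha, 0 if it is >= alpha,
    or None if the bits read so far do not decide the comparison."""
    lo = Fraction(0)      # left endpoint of the current dyadic interval
    width = Fraction(1)   # its length
    for n, b in enumerate(bits):
        if n >= max_bits:
            break
        width /= 2
        if b:
            lo += width
        if lo + width < alpha:   # the whole interval lies strictly below alpha
            return 1
        if lo >= alpha:          # the whole interval lies at or above alpha
            return 0
    return None


# Exhaustive check over all prefixes of length 10 for alpha = 1/3.
outcomes = [bernoulli(Fraction(1, 3), list(p)) for p in product([0, 1], repeat=10)]
ones, zeros, undecided = outcomes.count(1), outcomes.count(0), outcomes.count(None)
```

Among the 1024 prefixes of length 10, exactly one (the prefix of the binary expansion of 1/3) leaves the output undecided, and the fraction of prefixes mapped to 1 is 341/1024, just below 1/3.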

The setting of P-almost computable random variables is a natural one for probability theory, and the standard operations on random variables preserve P-almost computability, including, e.g., addition and multiplication of P-almost computable real random variables, composition with P-almost computable measurable functions, and cartesian products.

We now introduce the class of computable probability measures on computable Polish spaces.

Let (S, δ S , D S ) be a computable Polish space, and recall that B S denotes its Borel sets and M 1 (S) its Borel probability measures. Consider the subset D P,S ⊆ M 1 (S) consisting of those probability measures that are concentrated on a finite subset of D S and where the measure of each atom is rational, i.e., ν ∈ D P,S if and only if ν = q 1 δ t 1 + ⋯ + q k δ t k for some rationals q i ≥ 0 such that q 1 + ⋯ + q k = 1 and some points t i ∈ D S , where for t ∈ S the {0, 1}-valued Dirac measure δ t satisfies δ t (A) = 1 if and only if t ∈ A for all measurable sets A. It is a standard fact (see, e.g., Gács [2005, §B.6.2]) that D P,S is dense in M 1 (S) under the Prokhorov metric δ P given by

$$\delta_P(\mu, \nu) := \inf\{\varepsilon > 0 : \mu(A) \le \nu(A^{\varepsilon}) + \varepsilon \text{ for every } A \in \mathcal{B}_S\},$$

where

$$A^{\varepsilon} := \{p \in S : \exists q \in A,\ \delta_S(p, q) < \varepsilon\} = \bigcup_{p \in A} B_{\varepsilon}(p)$$

is the ε-neighborhood of A and B ε (p) is the open ball of radius ε about p. Moreover, (M 1 (S), δ P , D P,S ) is a computable Polish space. (See [Hoyrup and Rojas 2009b, Prop. 4.1.1].) We say that µ ∈ M 1 (S) is a computable probability measure when µ is a computable point in M 1 (S) as a computable Polish space. Note that when the space S is clear from context we will refer to D P,S simply as D P .

One can define computability on the space of probability measures in other natural ways. Early work by Weihrauch [1999] and Müller [1999] formalized the computability of probability measures in terms of the lower semicomputability of the measure as a function on the set of open sets and in terms of the computability of the measure as a linear operator acting on bounded continuous functions; these notions are equivalent. (See Schröder [2007] for a more general setting.) These notions of computability also agree with the notion of computability defined here in terms of the Prokhorov metric.

Proposition 2.14 ( [Hoyrup and Rojas 2009b, Thm. 4.2.1]). Let S be a computable Polish space. A probability measure µ ∈ M 1 (S) is computable if and only if the measure µ(A) of a c.e. open set A ⊆ S is a c.e. real, uniformly in A. □

Note that the measure P on {0, 1} ∞ is a computable probability measure. We can also characterize the class of computable probability measures in terms of the uniform computability of the integrals of bounded continuous functions.

Proposition 2.15 ([Hoyrup and Rojas 2009b, Cor. 4.3.1]). Let S be a computable Polish space, let µ be a probability measure on S, and let F be the set of computable functions from S to R + . Then µ is computable if and only if ∫ f dµ is a c.e. real, uniformly in f ∈ F . □

Corollary 2.16. Let S be a computable Polish space, let µ be a probability measure on S, and let F be the set of computable functions from S to [0, 1]. Then µ is computable if and only if ∫ f dµ is a computable real, uniformly in f ∈ F .

Proof. First observe that both f and 1 − f are non-negative functions. Therefore, by Proposition 2.15, the reals ∫ f dµ and ∫ (1 − f ) dµ = 1 − ∫ f dµ are both c.e., and hence the real ∫ f dµ is computable. □

Having explained the computability of probability measures in terms of integration, we now relate it to the computability of random variables defined on computable Polish spaces.

Definition 2.17 (Computable probability space [Galatolo et al. 2010, Def. 2.4.1]). A computable probability space is a pair (S, µ) where S is a computable Polish space and µ is a computable probability measure on S.

The distribution of a P-almost computable random variable in a computable Polish space is computable.

Proposition 2.18 ( [Galatolo et al. 2010, Prop. 2.4.2]). Let X be a P-almost computable random variable in a computable Polish space S. Then its distribution is a computable point in the computable Polish space M 1 (S). □

On the other hand, given a computable measure, there is a P-almost computable random variable with that distribution. Proposition 2.19 ( [Hoyrup and Rojas 2009b, Thm. 5.1.1]). Let µ be a computable probability measure on a computable Polish space S. Then there is a P-almost computable random variable in S whose distribution is µ. □

In summary, the computable probability measures on a computable Polish space are precisely the distributions of P-almost computable random variables in that space. For this result in a more general setting, see [Schröder 2007, Prop. 4.3].

Further, if µ is a computable probability measure and f is computable on a µ-measure one set, then the pushforward µ ∘ f⁻¹ is a computable distribution. This fact, along with Proposition 2.19, shows that we have lost no generality in taking ({0, 1} ∞ , F , P) to be our basic probability space.

All of the standard distributions (e.g., normal, uniform, geometric, exponential) found in probability textbooks, and then all the transformations of these distributions by P-almost computable functions, are easily shown to be computable distributions.

Another important class of functions on a probability space is the class of L 1 -computable functions. For more details, including some of the history of L 1 -computability, see Hoyrup and Rojas [2009a, §3.1] and Miyabe [2013].

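Definition 2.20 (The metric space of L 1 (µ) functions [Hoyrup and Rojas 2009a, §3.1]). Let µ be a probability measure on a Polish space S, and let F be the set of µ-integrable functions from S to R. The L 1 (µ) distance ∫ |f − g| dµ is a pseudometric on F , and it becomes a metric on the quotient of F by µ-almost everywhere equality.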

This metric space is called the space of L 1 (µ) functions on S, and we will often speak interchangeably of a µ-integrable function S → R and its equivalence class.

We will make use of the following set of L 1 functions.

Definition 2.21 (Ideal points for L 1 [Gács 2005, §2]). Let (S, δ, D) be a computable Polish space. Define E to be the smallest set of functions containing the constant function 1 and the functions {g u,r,1/n : u ∈ D, r ∈ Q, n ≥ 1}, where

$$g_{u,r,\varepsilon}(x) := \bigl|\, 1 - |\delta(x, u) - r|^{+} / \varepsilon \,\bigr|^{+}, \qquad |a|^{+} := \max\{a, 0\},$$

and closed under max, min, and rational linear combinations.

Such functions can be thought of as continuous analogues of step functions having a finite number of steps, each step of which corresponds to a basic open ball with rational radius and ideal center.
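To make the shape of these functions concrete, here is a small Python sketch. It assumes the hat-function form g_{u,r,ε}(x) = |1 − |δ(x, u) − r|⁺/ε|⁺ with |a|⁺ = max{a, 0}, as in Gács [2005], and it specializes to the real line with δ(x, y) = |x − y|; the function name `hat` and the default metric are our own choices for illustration.

```python
def hat(u, r, eps, dist=lambda x, y: abs(x - y)):
    """Return g_{u,r,eps}: equal to 1 on the closed ball of radius r about u,
    equal to 0 outside the ball of radius r + eps, and linear in between."""
    def g(x):
        excess = max(0.0, dist(x, u) - r)    # |delta(x, u) - r|^+
        return max(0.0, 1.0 - excess / eps)  # |1 - excess / eps|^+
    return g


g = hat(u=0.0, r=1.0, eps=0.5)
values = [g(0.5), g(1.25), g(2.0)]
```

The three sample points lie inside the ball, halfway down the slope, and outside the support, so `values` is `[1.0, 0.5, 0.0]`. As n grows, g_{u,r,1/n} approaches the indicator of the closed ball pointwise while remaining continuous.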

Lemma 2.22 ( [Hoyrup and Rojas 2009a, Prop. 3]). Let µ be a computable probability measure on a computable Polish space (S, δ, D). The set E is dense in the L 1 (µ) functions on S, and the distances between points in E are computable under the standard enumeration, making this space into a computable Polish space. □

We say that an L 1 (µ) function on a computable Polish space S is L 1 (µ)-computable when it is a computable point in the L 1 (µ) functions on S.

Lemma 2.23 (Hoyrup and Rojas [2009a, Thm. 4 Claim 2 and Thm. 5 Claim 2]). Let (S, µ) be a computable probability space and let T be a computable Polish space. A function f : S → T is L 1 (µ)-computable if and only if ∫ f dµ is a computable real and for each r ∈ N, the function f is computable on some set of µ-measure at least 1 − 2^-r , uniformly in r . □

In particular, note that every integrable µ-almost computable function is L 1 (µ)-computable.

We obtain the following immediate corollary of Lemma 2.23 using the fact that if a function is µ-almost computable with a computable µ-integral, then we can uniformly find a collection of ideal points that converge to it in L 1 (µ).

Corollary 2.24. Let (S, µ) be a computable probability space and let T be a computable Polish space. Let f 0 , f 1 , . . . : S → T be a sequence of uniformly µ-almost computable functions that converge effectively in L 1 (µ) to a function f ∈ L 1 (µ). Then f is L 1 (µ)-computable. □

Let (S, µ) be a computable probability space. We know that the µ-measure of a c.e. open set A ⊆ S is a c.e. real. In general, the measure of a c.e. open set is not a computable real. On the other hand, if A is a decidable subset (i.e., S \ A is also c.e. open) then µ(S \ A) is a c.e. real, and therefore, by the identity µ(A) + µ(S \ A) = 1, we have that µ(A) is a computable real. In connected spaces, the only decidable subsets are the empty set and the whole space. However, there exists a useful surrogate when dealing with measure spaces.

Definition 2.25 (Almost decidable set [Galatolo et al. 2010, Def. 3.1.3]). Let (S, µ) be a computable probability space. A measurable subset A ⊆ S is said to be µ-almost decidable when there are two c.e. open sets U and V such that U ⊆ A and V ⊆ S \ A and µ(U ) + µ(V ) = 1. In this case we say that (U , V ) witnesses the µ-almost decidability of A.

The following lemma is immediate.

Lemma 2.26 ( [Galatolo et al. 2010, Prop. 3.1.1]). Let (S, µ) be a computable probability space, and let A be µ-almost decidable. Then µ(A) is a computable real. □
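The argument behind Lemma 2.26 can be sketched in Python. We assume, purely for illustration, that the witness (U, V) is presented as two nondecreasing streams of rational lower bounds converging to µ(U) and µ(V); the names `measure_from_witness` and `lower_bounds` are hypothetical. Since µ(U) ≤ µ(A) ≤ 1 − µ(V) and µ(U) + µ(V) = 1, waiting until the two lower bounds sum to more than 1 − ε pins µ(A) down to an interval of width less than ε.

```python
from fractions import Fraction


def measure_from_witness(lower_u, lower_v, eps):
    """Return (lo, hi) with lo <= mu(A) <= hi and hi - lo < eps, given streams
    of lower bounds on mu(U) and mu(V) for a witness (U, V) of A."""
    for lu, lv in zip(lower_u, lower_v):
        if lu + lv > 1 - eps:
            return lu, 1 - lv
    raise ValueError("lower bounds exhausted before reaching precision eps")


def lower_bounds(limit):
    """Hypothetical stream of lower bounds increasing to `limit`."""
    n = 0
    while True:
        yield limit * (1 - Fraction(1, 2 ** n))
        n += 1


# Example: A = [0, 1/3] in [0, 1] with Lebesgue measure, witnessed by
# U = [0, 1/3) and V = (1/3, 1], whose measures are 1/3 and 2/3.
lo, hi = measure_from_witness(
    lower_bounds(Fraction(1, 3)), lower_bounds(Fraction(2, 3)), Fraction(1, 1000)
)
```

The search terminates for every ε > 0 precisely because µ(U) + µ(V) = 1; for a general pair of disjoint c.e. open sets it could run forever.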

While we may not be able to compute the probability measure of ideal balls, we can compute a new basis of ideal balls for which we can. (See also Bosserhoff [2008, Lem. 2.15].)

Lemma 2.27 ([Galatolo et al. 2010, Thm. 3.1.2]). Let (S, µ) be a computable probability space, and let D S be the ideal points of S with standard enumeration {d i } i ∈N . There is a computable sequence {r j } j ∈N of reals, dense in the positive reals, such that the balls {B(d i , r j )} i, j ∈N form a basis of µ-almost decidable sets, which we call a µ-almost decidable basis.

We now show that every c.e. open set of a computable probability space (S, µ) is the union of a computable sequence of µ-almost decidable subsets.

Lemma 2.28 (Almost decidable subsets). Let (S, µ) be a computable probability space with ideal points {d i } i ∈N , and let {r j } j ∈N be a computable sequence of reals such that {B(d i , r j )} i, j ∈N is a µ-almost decidable basis. Let V be a c.e. open set. Then, uniformly in {r j } j ∈N and V , we can compute a sequence of µ-almost decidable sets {V k } k ∈N such that V k ⊆ V k+1 for each k ∈ N and ⋃_{k ∈ N} V k = V .

Proof. Let {B k } k ∈N be a standard enumeration of the ideal balls of S where B k = B(d m k , q l k ), and let E ⊆ N be a c.e. set such that V = ⋃_{k ∈ E} B k . Consider the c.e. set

Because {d i } i ∈N is dense in S and {r j } j ∈N is dense in the positive reals we have for each k ∈ N that

In particular this implies that the set

Using the notion of an almost decidable set, we have the following characterization of computable measures.

Corollary 2.29. Let (S, µ) be a computable probability space with ideal points {d i } i ∈N , and let {r j } j ∈N be a computable sequence of reals such that {B(d i , r j )} i, j ∈N is a µ-almost decidable basis. Let ν ∈ M 1 (S) be a probability measure on S that is absolutely continuous with respect to µ. Then ν is computable uniformly in the sequence {ν (B(d i , r j ))} i, j ∈N .

Proof. Let V be a c.e. open set of S. By Proposition 2.14, it suffices to show that ν (V ) is a c.e. real, uniformly in V . By Lemma 2.28, we can compute a nested sequence {V k } k ∈N of µ-almost decidable sets whose union is V . By the absolute continuity of ν with respect to µ, these sets are also ν -almost decidable. Because V is open, ν (V ) = sup k ∈N ν (V k ), which is the supremum of a sequence of reals that is computable uniformly in the sequence {ν (B(d i , r j ))} i, j ∈N . □

We close with the following extension of Corollary 2.16.

Proposition 2.30. Let S, T be computable Polish spaces, µ a probability measure on T , B a µ-almost decidable subset of R, and f : dt) is a computable function, uniformly in f and B.

Proof. This follows immediately from Propositions 3.2.3 and 4.3.1 of [Hoyrup and Rojas 2009b]. □

Let µ be a probability measure on a measurable space of outcomes S, and let A, B ⊆ S be events. Informally, given that event A has occurred, the probability that event B also occurs, written µ(B|A), must satisfy µ(A) µ(B|A) = µ(A ∩ B). Clearly µ(B|A) is uniquely defined if and only if µ(A) > 0, which leads to the following definition.

Definition 3.1 (Conditioning on positive-measure events). Suppose that µ(A) > 0. Then the conditional probability of B given A, written µ(B|A), is defined by

$$\mu(B \mid A) := \frac{\mu(B \cap A)}{\mu(A)}.$$

It is straightforward to check that, for any fixed event A ⊆ S with µ(A) > 0, the set function µ( • |A) is a probability measure.
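For a finite outcome space, this elementary definition can be computed exactly. The following Python sketch (with a hypothetical helper `conditional` and a fair die as the example measure) forms µ( • |A) and leaves conditioning undefined when µ(A) = 0.

```python
from fractions import Fraction


def conditional(mu, A):
    """Return mu(. | A) as a dict, for a finite measure mu given as {outcome: mass}."""
    mass_A = sum(p for s, p in mu.items() if s in A)
    if mass_A == 0:
        raise ValueError("conditioning event has measure zero")
    # mu(B | A) = mu(B ∩ A) / mu(A), applied to each singleton B = {s}.
    return {s: (p / mass_A if s in A else Fraction(0)) for s, p in mu.items()}


die = {s: Fraction(1, 6) for s in range(1, 7)}
given_even = conditional(die, {2, 4, 6})
prob_at_least_4 = sum(p for s, p in given_even.items() if s >= 4)
```

Here `given_even` is again a probability measure (its masses sum to 1), and the conditional probability of rolling at least 4 given an even roll is 2/3.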

We will often be interested in the case where B and A are events of the form {Y ∈ D} and {X ∈ C}. In this case, we define the abbreviation

$$P\{Y \in D \mid X \in C\} := P(\{Y \in D\} \mid \{X \in C\}).$$

Again, this is well-defined when P{X ∈ C} > 0. When P{X = x } > 0, we may simply write P{Y ∈ D | X = x } for P{Y ∈ D | X ∈ {x }}.

This elementary notion of conditioning is undefined when the conditioning event has zero measure, such as when a continuous random variable takes a particular value. In the modern formulation of conditional probability due to Kolmogorov [1933], one defines conditioning with respect to (the σ -algebra generated by) a random variable rather than an individual event. In theory, this yields a consistent solution to the problem of conditioning on the value of general (and in particular, continuous) random variables, although we will see that other issues arise. (See Kallenberg [2002, Ch. 6] for a rigorous treatment.)

In order to bridge the divide between the elementary notion of conditioning on events and the abstract approach of conditioning on random variables, consider the case of conditioning on a random variable X taking values in a countable discrete set S and satisfying P{X = x } > 0 for all x ∈ S. Let {Y ∈ B} be an event. Then the conditional probability that Y ∈ B given X, written P[Y ∈ B|X], is the random variable satisfying P

for all measurable subsets A ⊆ S. For sets A of the form {x }, for x ∈ S, we have

, and so (9) yields a more abstract characterization of elementary conditional probability for positive-measure events.

The general case is captured by the same defining property. Let X be a random variable in a measurable space S. Then the conditional probability that Y ∈ B given X, written P[Y ∈ B|X], is defined to be a random variable in [0, 1] of the form f B (X) where again f B : S → [0, 1] is such that (9) holds for all measurable subsets A ⊆ S. In many situations, such a function f B is itself the object of interest and so we will let P[Y ∈ B|X = • ] denote an arbitrary such function. We may then re-express its defining property in the following more intuitive form:

$$P\{Y \in B, X \in A\} = \int_A P[Y \in B \mid X = x]\, P_X(dx) \tag{10}$$

for all measurable subsets A ⊆ S.

The existence of the conditional probability P[Y ∈ B|X], or equivalently, the existence of P[Y ∈ B|X = • ], follows from the Radon-Nikodym theorem. Recall that a measure µ on a measurable space S is absolutely continuous with respect to another measure ν on the same space, written µ ≪ ν, if ν (A) = 0 implies µ(A) = 0 for all measurable sets A ⊆ S.

Theorem 3.2 (Radon-Nikodym). Let S be a measurable space and let µ and ν be σ -finite measures on S such that µ ≪ ν . Then there exists a nonnegative measurable function dµ/dν : S → R + such that

$$\mu(A) = \int_A \frac{d\mu}{d\nu}\, d\nu \tag{11}$$

for all measurable subsets A ⊆ S. □

We call any function dµ/dν satisfying Equation (11) for all measurable subsets A ⊆ S a Radon-Nikodym derivative (of µ with respect to ν ).

Note that if g is also a Radon-Nikodym derivative of µ with respect to ν , then g = dµ/dν outside a ν-null set, and so Radon-Nikodym derivatives are unique up to a null set. (Functions that agree a.e. are called versions.) We may safely refer to the Radon-Nikodym derivative when we want to ignore such differences, but in some cases these differences are important.

It is straightforward to verify that

$$P\{Y \in B, X \in \cdot\,\} \ll P_X,$$

both considered as measures on S, and so a function f B satisfying (9) for all measurable subsets A ⊆ S always exists, but it is only defined up to a null set. This is inconsequential when the conditional probability P[Y ∈ B|X] is the object of interest. In applications, especially statistical ones, however, the function P[Y ∈ B|X = • ] mapping values in S to probabilities is the object of interest, and, moreover, one typically wants to evaluate this function at particular observed values x ∈ S. Because P[Y ∈ B|X = • ] is merely determined up to a P X -null set, interpreting its values at individual points is problematic.

As mentioned in the introduction, the fact that general conditional probabilities are not uniquely defined at points is the subject of a large literature. However, in some circumstances, two versions of P[Y ∈ B|X = • ] must agree at individual points. In particular, if two versions are continuous at a point in the support of the distribution P X , then they agree on the value at that point. In order to state this claim formally, we first recall the definition of the support of a distribution.

Definition 3.3. Let µ be a measure on a topological space S with open sets 𝒮. Then the support of µ, written supp(µ), is defined to be the set of points x ∈ S such that all open neighborhoods of x have positive measure, i.e.,

$$\operatorname{supp}(\mu) := \{x \in S : \mu(U) > 0 \text{ for all } U \in \mathcal{S} \text{ with } x \in U\}.$$

Note that the support of µ can equivalently be defined as the smallest closed set of µ-measure one. We now state our claim formally.

Lemma 3.4. Let S be a Polish space. Suppose f 1 , f 2 : S → [0, 1] are two versions of P[Y ∈ B|X = • ]. If x ∈ S is a point of continuity of f 1 and f 2 , and x ∈ supp(P X ), then f 1 (x) = f 2 (x). In particular, if f 1 and f 2 are continuous on a P X -measure one set D ⊆ S, then they agree everywhere in D ∩ supp(P X ). □

The proof is immediate from the following elementary result.

Lemma 3.5. Let f 1 , f 2 : S → T be two measurable functions between Polish spaces S and T , and suppose that f 1 = f 2 almost everywhere with respect to some measure µ on S. Let D ⊆ S be a set of µ-measure one. If x ∈ S is a point of continuity of f 1 and f 2 on D, and x ∈ supp(µ), then f 1 (x) = f 2 (x). In particular, if f 1 and f 2 are continuous on D, then they agree everywhere in D ∩ supp(µ).

Proof. Let δ T be any metric under which T is complete. Define the measurable function g : S → R by

$$g(x) := \delta_T(f_1(x), f_2(x)).$$

We know that g = 0 µ-a.e., and also that g is continuous at x on D, because f 1 and f 2 are continuous at x on D and δ T is continuous (on all of T × T ). Assume, for the purpose of contradiction, that g(x) = ε > 0.

By the continuity of g on D, there is an open neighborhood U of x such that g > ε/2 on U ∩ D. Because x ∈ supp(µ) and µ(D) = 1, we have µ(U ∩ D) = µ(U ) > 0, contradicting the fact that g = 0 µ-a.e. □

The observation that continuity gives a unique answer to conditioning on zero-measure events of the form {X = x } is an old one, going back to at least Tjur [1974].

For a pair of random variables X and Y taking values in a pair of measurable spaces S and T , respectively, it is natural to consider not just individual conditional probabilities P[Y ∈ B|X], for measurable subsets B ⊆ T , but the entire conditional distribution P[Y|X] := P[Y ∈ • |X]. Unfortunately, the fact that Radon-Nikodym derivatives are only defined up to a null set can cause problems. In particular, while it is the case that

$$P[Y \in B \mid X] = \sum_{n \in \mathbb{N}} P[Y \in B_n \mid X] \quad \text{a.s.}$$

for every countable measurable partition B 0 , B 1 , . . . of a measurable set B ⊆ T , the random set function given by B ↦ P[Y ∈ B|X] need not be a measure in general because the exceptional null set may depend on the sequence. However, when T is Polish, we can construct versions of the conditional probabilities that combine to produce a measure. In order to make this definition precise, we recall the notion of a probability kernel.

Definition 3.6 (Probability kernel). Let S and T be Polish spaces. A function κ : S × B T → [0, 1] is called a probability kernel (from S to T ) when

(1) for every s ∈ S, the function κ(s, • ) is a probability measure on T ; and

(2) for every B ∈ B T , the function κ( • , B) is measurable.

For every κ : S × B T → [0, 1], let κ̄ be the map s ↦ κ(s, • ). It can be shown that κ is a probability kernel from S to T if and only if κ̄ is a (Borel) measurable function from S to M 1 (T ) [Kallenberg 2002, Lem. 1.40], where we adopt the weak topology on M 1 (T ), which is Polish because T is.

We say that a conditional distribution P[Y|X] has a regular version when, for some probability kernel κ from S to T ,

$$P[Y \in B \mid X] = \kappa(X, B) \quad \text{a.s.}$$

for every measurable subset B ⊆ T . In this case, we say that κ̄(X) is a regular version of the conditional distribution.

Proposition 3.7 (Regular versions [Kallenberg 2002, Lem. 6.3]). Let X and Y be random variables in a Polish space S and a measurable space T , respectively. Then there is a regular version of the conditional distribution P[Y|X], which is, moreover, determined by the joint distribution of X and Y.

As with the derivatives underlying conditional probabilities, κ̄ is only defined up to a P X -null set. When such a kernel κ exists, i.e., when there is a regular version of the conditional distribution P[Y|X], we define P[Y|X = • ] to be equal to some arbitrary version of κ̄.

Despite the fact that the kernels underlying regular versions of conditional distributions are defined only up to sets of measure zero, it follows immediately from Lemma 3.5 that when S and T are Polish, any two versions of P[Y|X = • ] that are continuous on some subset of the support of P X must agree on that subset. More carefully, let κ̄ 1 (X) and κ̄ 2 (X) be regular versions of the conditional distribution P[Y|X]. If x ∈ S is a point of continuity of κ̄ 1 and κ̄ 2 , and x ∈ supp(P X ), then κ̄ 1 (x) = κ̄ 2 (x). In particular, if both maps are continuous on a set D ⊆ S, then they agree everywhere in D ∩ supp(P X ).

When conditioning on a random variable whose distribution concentrates on a countable set, it is well known that a regular version of the conditional distribution can be built by elementary conditioning with respect to single events. This includes the special case of conditioning on discrete random variables, i.e., those concentrating on a countable discrete subspace.

Lemma 3.8. Let X and Y be random variables in Polish spaces S and T , respectively. Suppose the distribution of X concentrates on a countable set R ⊆ S, i.e., P X (R) = 1 and x ∈ R implies P X {x } > 0. Let ν be an arbitrary probability measure on T . Define the function κ : S × B T → [0, 1] by

$$\kappa(x, B) := P\{Y \in B \mid X = x\}$$

for all x ∈ R and κ̄(x) = ν for x ∉ R. Then κ is a probability kernel and κ̄(X) is a regular version of the conditional distribution P[Y|X].

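Proof. The function κ is well-defined because P{X = x } > 0 for all x ∈ R. It follows that κ̄(x) is a probability measure for every x. Because R is countable, κ̄ is also measurable and so κ is a probability kernel from S to T . Note that P{X ∈ R} = 1 and so, for all measurable sets A ⊆ S and B ⊆ T , we have

$$P\{X \in A, Y \in B\} = \sum_{x \in R \cap A} P\{X = x, Y \in B\} = \sum_{x \in R \cap A} \kappa(x, B)\, P\{X = x\} = \int_A \kappa(x, B)\, P_X(dx).$$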

That is, κ(X, B) is the conditional probability of the event {Y ∈ B} given X, and so κ̄(X) is a regular version of the conditional distribution P[Y|X]. □
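The construction in Lemma 3.8 is easy to sketch for a finite joint distribution. In the Python fragment below, the joint law is a dictionary {(x, y): probability}, the set R is the set of x values with positive marginal mass, and `fallback` plays the role of the arbitrary measure ν; the names `discrete_kernel`, `joint`, and `nu` are ours, chosen only for illustration.

```python
from fractions import Fraction


def discrete_kernel(joint, fallback):
    """Build x -> P{Y in . | X = x} from a finite joint pmf {(x, y): prob}.
    Returns the kernel together with the marginal pmf of X."""
    marginal = {}
    for (x, _y), p in joint.items():
        marginal[x] = marginal.get(x, Fraction(0)) + p

    def kappa(x):
        if marginal.get(x, 0) == 0:
            return dict(fallback)  # x lies outside R: return the arbitrary measure nu
        return {y: p / marginal[x] for (x2, y), p in joint.items() if x2 == x}

    return kappa, marginal


joint = {
    (0, "a"): Fraction(1, 4),
    (0, "b"): Fraction(1, 4),
    (1, "a"): Fraction(1, 2),
}
nu = {"a": Fraction(1)}
kappa, p_x = discrete_kernel(joint, nu)

# Defining property with A = {0, 1} and B = {"a"}:
# the sum over x in A of kappa(x, B) * P{X = x} should equal P{X in A, Y in B}.
lhs = sum(kappa(x).get("a", Fraction(0)) * p_x[x] for x in (0, 1))
rhs = joint[(0, "a")] + joint[(1, "a")]
```

The choice of `fallback` affects the kernel only at points outside R, a P X -null set, which is why it may be arbitrary.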

Beyond the setting of conditioning on discrete random variables, explicit formulas for conditional distributions are also available when Bayes’ rule applies. We begin by introducing the notion of a dominated kernel. (The usual terms, such as dominated families or models, refer to measurable families of probability measures, i.e., probability kernels.)

Definition 3.9 (dominated kernel). A probability kernel κ from T to S is dominated when there is a σ -finite measure ν on S such that κ̄(t) ≪ ν for every t ∈ T .

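Let X and Y be random variables in Polish spaces S and T , respectively, and let κ̄ X|Y (Y) be a regular version of P[X|Y] such that κ X|Y is dominated. Then there exists a (product) measurable function p X|Y ( • | • ) : S × T → R + such that

$$\kappa_{X|Y}(y, A) = \int_A p_{X|Y}(x \mid y)\, \nu(dx) \tag{20}$$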

for every measurable set A ⊆ S and every y ∈ T .

Definition 3.10 (conditional density). We call any such function p X |Y a conditional density of X given Y (with respect to ν ).

Common finite-dimensional, parametric families of distributions (e.g., exponential families like Gaussian, gamma, etc.) are dominated, and so, in probabilistic models composed from these families, conditional densities exist and Bayes’ rule gives a formula for expressing the conditional distribution. We give a proof of this classic result for completeness.

Lemma 3.11 (Bayes’ rule [Schervish 1995, Thm. 1.13]). Let X and Y be random variables as in Proposition 3.7, and assume that there exists a conditional density p X |Y of X given Y with respect to a σ -finite measure ν . Let µ be an arbitrary distribution on T and define κ : S × B T → [0, 1] by

$$\kappa(x, B) := \frac{\int_B p_{X|Y}(x \mid y)\, P_Y(dy)}{\int p_{X|Y}(x \mid y)\, P_Y(dy)} \tag{21}$$

for those points x ∈ S where the denominator is positive and finite, and by κ̄(x) = µ otherwise. Then κ is a probability kernel and κ̄(X) is a regular version of the conditional distribution P[Y|X].
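As a hedged illustration of Bayes' rule in the simplest dominated setting, the Python sketch below takes Y to range over a finite set with prior P Y , and X given Y = y to have a density with respect to Lebesgue measure. The posterior at x is the normalized product of likelihood and prior, and an arbitrary fallback distribution is returned where the denominator vanishes. The names `bayes_kernel` and `uniform_density`, and the specific uniform likelihoods, are assumptions made for the example.

```python
from fractions import Fraction


def uniform_density(a, b):
    """Density of the uniform distribution on [a, b] with respect to Lebesgue measure."""
    def p(x):
        return Fraction(1, b - a) if a <= x <= b else Fraction(0)
    return p


def bayes_kernel(prior, likelihood, fallback):
    """Return x -> posterior over y, where prior is {y: P_Y{y}} and
    likelihood is {y: density of X given Y = y}."""
    def kappa(x):
        weights = {y: likelihood[y](x) * p for y, p in prior.items()}
        denom = sum(weights.values())
        if denom <= 0:
            return dict(fallback)  # denominator not positive: use the arbitrary mu
        return {y: w / denom for y, w in weights.items()}
    return kappa


prior = {0: Fraction(1, 2), 1: Fraction(1, 2)}
likelihood = {0: uniform_density(0, 1), 1: uniform_density(0, 2)}
mu_fallback = {0: Fraction(1), 1: Fraction(0)}
kappa = bayes_kernel(prior, likelihood, mu_fallback)
```

With exact rationals the denominator is always finite, so only the positivity check is needed here. The point x = 3 has zero density under both hypotheses, so the kernel there is set by the fallback; such points form a P X -null set.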

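Proof. Let κ̄ X|Y (Y) be a regular version of the conditional distribution P[X|Y]. By hypothesis, κ X |Y is dominated by ν and p X |Y is a conditional density with respect to ν. By Proposition 3.7 and Fubini’s theorem, for measurable sets A ⊆ S and B ⊆ T , we have that

$$\begin{aligned} P\{X \in A, Y \in B\} &= \int_B \kappa_{X|Y}(y, A)\, P_Y(dy) \\ &= \int_B \Bigl( \int_A p_{X|Y}(x \mid y)\, \nu(dx) \Bigr) P_Y(dy) \\ &= \int_A \Bigl( \int_B p_{X|Y}(x \mid y)\, P_Y(dy) \Bigr) \nu(dx). \end{aligned} \tag{24}$$

Taking B = T , we have

$$P_X(A) = \int_A \Bigl( \int p_{X|Y}(x \mid y)\, P_Y(dy) \Bigr) \nu(dx). \tag{25}$$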

Because P X (S) = 1, this implies that the set of points x for which the denominator of the right-hand side of (21) is infinite has ν -measure zero, and thus P X -measure zero. Taking A to be the set of points x for which the denominator is zero, we see that P X (A) = 0. It follows that (21) characterizes κ up to a P X -null set.

By (25), we see that the denominator is a density of P X with respect to ν , and so we have

$$\int_A \kappa(x, B)\, P_X(dx) = \int_A \Bigl( \int_B p_{X|Y}(x \mid y)\, P_Y(dy) \Bigr) \nu(dx)$$

for all measurable sets A ⊆ S and B ⊆ T . Finally, by the definition of κ, Equation (24), and the fact that the denominator is positive and finite for P X -almost every x, we see that κ̄(X) is a regular version of the conditional distribution P[Y|X]. □

Comparing Bayes’ rule (21) to the definition of conditional density (20), we see that any conditional density of Y given X (with respect to P Y ) satisfies

$$p_{Y|X}(y \mid x) = \frac{p_{X|Y}(x \mid y)}{\int p_{X|Y}(x \mid y')\, P_Y(dy')}$$

for P (X,Y) -almost every (x, y).

The following result suggests why the mere a.e. definedness of conditional distributions can be ignored by those working entirely within the framework of dominated families.

Proposition 3.12. Let X and Y be random variables on Polish spaces S and T , respectively, let κ(X) be a regular version of the conditional distribution P[Y|X], and let R ⊆ S. If a conditional density p X |Y (x |y) of X given Y is continuous on R × T , positive, and bounded, then κ as defined in (21) is a version of κ that is continuous on R. In particular, if R is a P X -measure one subset, then κ is a P X -almost continuous version.

We defer the proof to Section 9.2. We will use this result in the proof of Lemma 7.3, towards our central result.

Before we lay the foundations for the remainder of the paper and define notions of computability for conditional probability and conditional distributions in the abstract setting, we address the computability of distributions conditioned on positive-measure sets. In order for the distributions obtained from positive-measure sets to be computable, we will need the conditioning events to be almost decidable sets.

Lemma 4.1 ([Galatolo et al. 2010, Prop. 3.1.2]). Let (S, µ) be a computable probability space and let A be a µ-almost decidable subset of S satisfying µ(A) > 0. Then µ( • |A) is a computable probability measure, uniformly in a witness to the µ-almost decidability of A.

Proof. By Lemma 2.27 there is a µ-almost decidable basis for S. Note that µ( • |A) is absolutely continuous with respect to µ. Hence by Corollary 2.29, it suffices to show that µ(B ∩ A)/µ(A) is computable for a µ-almost decidable set B, uniformly in witnesses to the µ-almost decidability of A and B. All subsequent statements in this proof are uniform in both. Now, B ∩ A is µ-almost decidable with computable witness, and so its measure, the numerator, is a computable real. The denominator is likewise the measure of a set that is almost decidable with computable witness, hence is a computable real. Finally, the ratio of two computable reals, the second of which is positive, is itself computable. □

In the abstract setting, conditional probabilities are random variables. In many applications of probability, including statistics, the conditional probability map, or some version of it, is the actual object of interest, and so the computability of this map is our focus.

Let B ⊆ T be a measurable set. Viewing P[Y ∈ B|X = • ] as a function from S to [0, 1], recall that we can speak formally as to whether this function is everywhere computable, P X -almost computable, and/or L 1 -computable. Recall also that the function P[Y ∈ B|X = • ] may have many versions that agree only up to a null set. Despite this, their almost computability does not differ (up to a change in domain by a null set).

Lemma 4.2. Let f be a measurable function from a computable probability space (S, µ) to a computable Polish space T . If any version of f is computable on a µ-measure p set, then every version of f is computable on a µ-measure p set. In particular, if one version is µ-almost computable, then all versions are.

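Proof. Let f be computable on a µ-measure p set D, and let g be a version of f , i.e., Z := {s ∈ S : f (s) ≠ g(s)} is a µ-null set. Then f and g agree on D \ Z, and so g is computable on D \ Z, which is again a µ-measure p set. In particular, if f is µ-almost computable, then it is computable on a µ-measure one set, and so g is as well. □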

We can develop notions of computability for conditional distributions in a similar way. We begin by characterizing the computability of probability kernels.

Definition 4.3 (Computable probability kernel). Let S and T be computable Polish spaces and let κ : S × B T → [0, 1] be a probability kernel from S to T . Then we say that κ is a computable probability kernel when κ̄ : S → M 1 (T ) given by κ̄(s) := κ(s, • ) is a computable function in the ordinary sense between S and the computable Polish space M 1 (T ) induced by T . Similarly, we say that κ is computable on a subset D ⊆ S when κ̄ is computable on D.

As we will see, this notion of computability corresponds with a more direct notion of computability for κ, which we now develop. We begin by noting that the collection of sets of the form

P T (A, q) := {η ∈ M 1 (T ) : η(A) > q}    (28)

for A open and q rational, forms a subbasis for the weak topology on M 1 (T ) (which is the topology induced by the Prokhorov metric). Indeed, it suffices for A to range over finite unions of some countable basis of T . We will also omit mention of T when the ambient space is clear from context. The next result relates balls in the Prokhorov metric to the subbasis elements above. Recall that δ P denotes the Prokhorov metric and that the collection D P of measures with finitely many point masses on elements of D T , each assigned rational mass, forms a dense set.

Proposition 4.4 ([Gács 2005, Prop. B.17]). Let ν, µ ∈ M 1 (T ), and assume that ν is supported on a finite set S. Then the condition δ P (ν, µ) < ϵ is equivalent to the finite set of conditions

for all A ⊆ S. □

The next corollary states that we can compute a representation for a Prokhorov ball in terms of the subbasis elements. The sets are easily defined from those in Proposition 4.4.

Corollary 4.5. Uniformly in ν ∈ D P and ϵ ∈ Q, we can compute a finite collection of pairs (A i , q i ) i ≤n , each A i a finite union of open balls of radius ϵ around elements of D T and each q i a rational, such that

Finally, as a direct consequence of [Hoyrup and Rojas 2009b, Prop. 4.2.1], these subbasis elements are c.e. open.

Proposition 4.6. Let A be a c.e. open subset of T and q be a rational. Then the set P(A, q) is c.e. open in the Prokhorov metric, uniformly in A and q. □

Recall that a lower semicomputable function from a computable Polish space to [0, 1] is one for which the preimage of (q, 1] is c.e. open, uniformly in rationals q. Furthermore, we say that a function f from a computable Polish space S to [0, 1] is lower semicomputable on D ⊆ S when there is a uniformly computable sequence {U q } q ∈Q of c.e. open sets such that

We can also interpret a computable probability kernel κ as a computable map sending each c.e. open set A ⊆ T to a lower semicomputable function κ( • , A). Proof. Let q ∈ (0, 1) be rational, let A ⊆ T be c.e. open, and define I := (q, 1]. Then

where P(A, q) is as in (28). By Proposition 4.6, P(A, q) is even c.e. open. Suppose κ is computable on D. Then there is a c.e. open set V A,q , uniformly computable in q and A, such that

and so κ( • , A) is lower semicomputable on D, uniformly in A.

Conversely, suppose κ( • , A) is lower semicomputable on D, uniformly in A. Then by (32), uniformly in A and q, we can find a c.e. open V A,q such that ( 33

Hence κ is computable on D. □

Let X and Y be random variables in computable Polish spaces S and T , respectively, and let κ(X) be a regular version of the conditional distribution P[Y|X]. The above notions of computability are suitable for talking about the computability of κ or any other version of it, and are appropriate notions of computability for statistical applications.

It is also straightforward to verify that C and X are conditionally independent, given an indicator for the event {X rational}. Therefore, all versions f of P[C = 1|X = • ] satisfy, for P X -almost every x,

The right hand side, considered as a function of x, is called the Dirichlet function, and is nowhere continuous.

Suppose some version of f were continuous at a point y on a P X -measure one set R. Then there would exist an open interval I containing y such that the image of I ∩ R contains 0 or 1, but not both. However, R must contain all rationals in I and Lebesgue-almost every irrational in I . Furthermore, the image of every rational in I ∩ R is 1, and the image of Lebesgue-almost every irrational in I ∩ R is 0, a contradiction. □

Although we cannot hope to compute P[C = 1|X = • ] on a P X -measure one set, we can compute it in a weaker sense. Proposition 5.2.

Proof. By Corollary 2.24, it suffices to construct a sequence of uniformly P X -almost computable functions that converge effectively in L 1 (P X ) to

$\tfrac{1}{2} \min_{m < n \le k} |r_m - r_n|$ be half the minimum distance between any pair among r 0 , . . . , r k , and define, for every k ∈ N,

Note that the set on which f k takes the value 1 is uniformly P X -almost decidable in part because its boundary points are irrationals, a null set. It is then clear that the functions f k , for k ∈ N, are uniformly P X -almost computable. For every k ∈ N, we have that P{X = r n for some n > k} = $2^{-k-2}$. Therefore,

completing the proof. □

Conditioning in general can produce discontinuous conditional distributions, which is an obstruction to a conditioning operator being computable. But even if we restrict our attention to distributions that admit conditional distributions that are continuous on their support, the operation of conditioning cannot be computable because, as we will show, it is discontinuous. Indeed, conditioning is discontinuous in a rather strong way. We use the recursion theorem to explain the computational consequences. Namely, for any potential program analysis that aims to perform conditioning on an arbitrary given distribution, there is a representation of that distribution such that the program analysis cannot identify a single nontrivial fact about its conditional distribution.

To begin, we formalize the notion of a conditioning operator.

Definition 6.1. Let F ⊆ M 1 ([0, 1] 2 ) be a set of probability measures. A map Φ : F × [0, 1] → M 1 ([0, 1]) is a conditioning operator (for F ) if, for all distributions µ ∈ F and random variables X and Y with joint distribution µ, we have Φ(µ, x) = P[Y|X = x] for P X -almost all x.

Observe, by Proposition 3.7, that there is a conditioning operator for all

) are taken to be the canonical computable Polish spaces.

The previous section motivates restricting one’s attention to conditioning operators for the set F 0 ⊆ M 1 ([0, 1] 2 ) of probability distributions on pairs (X, Y) of random variables in [0, 1] such that there exists a P X -almost continuous version of the conditional distribution map P[Y|X = • ]. We will show that conditioning operators for F 0 are not computable, simply on grounds of continuity.

Recall that the name of a probability measure µ ∈ M 1 (T ) on a computable Polish space T is given by a Cauchy sequence in the dense elements D P,T of the associated Prokhorov metric. Note that F 0 contains D P,[0,1] 2 . Further recall that P T (A, q) is defined to be the set {η ∈ M 1 (T ) : η(A) > q}, for any open set A ⊆ T and rational q ∈ Q. Let A(T ) := {(A, q) : A is a finite union of open balls in T , and q ∈ Q}.

Given a computable Cauchy sequence in the Prokhorov metric that converges to a measure µ ∈ M 1 (T ), by Corollary 4.5 we can compute a sequence ⟨A i , q i ⟩ i ∈N in A(T ) such that ⋂ i ∈N P T (A i , q i ) = {µ}. Further by Proposition 4.6, given a finite sequence ⟨A i , q i ⟩ i ≤n in A(T ) we can compute, uniformly in ⟨A i , q i ⟩ i ≤n , a ball in the Prokhorov metric contained in ⋂ i ≤n P T (A i , q i ). Therefore, uniformly in a probability measure µ and a collection ⟨A i , q i ⟩ i ∈N with ⋂ i ∈N P T (A i , q i ) = {µ}, we can computably recover a name for µ. Conversely, from a name for µ, we can uniformly compute such a collection.

Lemma 6.3.

and rational ϵ ∈ (0, 1), we can uniformly find a measure µ ∈ D P,[0,1] 2 such that δ P (µ, ν ) < ϵ and Φ(µ, x) = α for every conditioning operator Φ for F .

Proof. Relative to {ν i } i ∈N and {x i } i ∈N , we can computably find an element p * ∈ D P,[0,1] 2 such that p * ({x } × [0, 1]) = 0 and δ P (p * , ν ) < ϵ/2. Let δ x denote the Dirac measure on [0, 1] concentrating on x and let τ ⊗ τ ′ denote the product measure on [0, 1] 2 with respective marginal distributions τ , τ ′ ∈ M 1 ([0, 1]). Defining p := (ϵ/2)(δ x ⊗ α) + (1 − ϵ/2)p * , it is easy to check that δ P (p, p * ) ≤ ϵ/2, hence δ P (ν, p) < ϵ. Because p * ({x } × [0, 1]) = 0 and p * is a finite mixture of point masses, every conditioning operator Φ for F must satisfy Φ(p, x) = α, and so we may take µ := p. □

Proposition 6.4. Let F ⊆ M 1 ([0, 1] 2 ) contain D P,[0,1] 2 . Every conditioning operator on F is discontinuous everywhere, hence noncomputable.

Proof. On M 1 ([0, 1]), adopt the weak topology (induced by the standard topology on [0, 1]). On M 1 ([0, 1] 2 ) × [0, 1], adopt the product topology induced by the weak and standard topologies, respectively. Then Lemma 6.3 implies that every conditioning operator Φ for F is discontinuous everywhere. Hence every conditioning operator is noncomputable. □

The above definitions and proposition capture the essential difficulty of conditioning: a finite approximation to the joint distribution determines nothing about the result of conditioning on a particular point.
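The mixture used in the proof of Lemma 6.3 can be made concrete for finitely supported measures. The sketch below is illustrative only: it represents a measure on [0, 1]² as a dictionary from points to rational masses, and it uses total variation distance as a stand-in for the Prokhorov metric (total variation is an upper bound on it). The function names are hypothetical.

```python
from fractions import Fraction

def perturb(p_star, x, alpha, eps):
    """Mixture (eps/2) * (delta_x (x) alpha) + (1 - eps/2) * p_star."""
    w = Fraction(eps) / 2
    p = {pt: (1 - w) * mass for pt, mass in p_star.items()}
    for y, mass in alpha.items():
        p[(x, y)] = p.get((x, y), 0) + w * mass
    return p

def conditional(p, x):
    """Law of Y given X = x, for finitely supported p with p({x} x [0,1]) > 0."""
    row = {y: mass for (x0, y), mass in p.items() if x0 == x}
    total = sum(row.values())
    return {y: mass / total for y, mass in row.items()}

def total_variation(p, q):
    """Total variation distance; an upper bound on the Prokhorov distance."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

# Usage: p_star puts no mass on the vertical line at x = 1/2.
half = Fraction(1, 2)
p_star = {(Fraction(1, 4), half): half, (Fraction(3, 4), half): half}
alpha = {0: Fraction(1, 3), 1: Fraction(2, 3)}
p = perturb(p_star, half, alpha, Fraction(1, 100))
print(conditional(p, half) == alpha)   # the conditional at x is exactly alpha
print(total_variation(p, p_star))      # yet p is within eps/2 of p_star
```

However small ϵ is taken, the conditional distribution at x is whatever α was chosen, which is the discontinuity that Proposition 6.4 exploits.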

We now establish a stronger notion of noncomputability, namely that it is not even possible to always produce some nontrivial fact about a conditional distribution. For e ∈ N, let φ e denote the partial computable function defined by code e. The recursion theorem, due to Kleene [1938], states that when F is a total computable function, there is some integer i for which the partial computable functions φ i and φ F (i) are equal partial functions (i.e., they are defined on the same inputs, and are equal where they are defined). For more details, see [Rogers 1987, Ch. 11].

Definition 6.5. A program φ e : N → N represents a distribution µ e on a computable Polish space T if it is total and on input k, the output value φ e (k) is a code for a pair (A k , q k ) ∈ A(T ) such that {µ e } = ⋂ k ∈N P T (A k , q k ).

Definition 6.6. A program φ a : N 3 → N is a conditioning program for F ⊆ M 1 ([0, 1] 2 ) if it is total and whenever e represents a computable distribution µ e ∈ F , there exists a conditioning operator Φ for F such that, for every code j ∈ N for a computable real x ∈ [0, 1], and for every k ∈ N, the return value φ a (e, j, k) is a code for either the empty string or an element (A, q) ∈ A([0, 1]) such that Φ(µ e , x) ∈ P [0,1] (A, q) and P [0,1] (A, q) ≠ M 1 ([0, 1]).

Theorem 6.7 (Nonapproximable conditional distributions). Suppose that φ a is a conditioning program for a set F ⊆ M 1 ([0, 1] 2 ) containing D P,[0,1] 2 . Let e ∈ N be a code for a computable distribution µ e on [0, 1] 2 , and let j ∈ N be a code for a computable real x ∈ [0, 1]. Then uniformly in a, e, and j, we can compute an i ∈ N such that µ e = µ i and φ a (i, j, k) is a code for the empty string for every k ∈ N.

Proof. Uniformly in a, we can compute some b ∈ N such that for all n, m, r ∈ N: if the value φ a (n, m, r ′ ) is a code for the empty string for all r ′ ≤ r , then φ b (n, m, r ) is also a code for the empty string; and otherwise φ b (n, m, r ) = φ a (n, m, r ′ ), where r ′ is the least index such that φ a (n, m, r ′ ) is not a code for the empty string. Note that for each n, m ∈ N, the value φ b (n, m, k) is a code for the empty string for all k ∈ N if and only if φ a (n, m, k) is a code for the empty string for all k. Let η n, j denote the least index k ∈ N such that φ b (n, j, k) is not a code for the empty string, if such k exists, and ∞ otherwise. Note that for each k ∈ N, we can compute (uniformly in n, e, a, and j) whether or not k < η n, j (even though the finiteness of η n, j may not be computable). For k ∈ N, let (A k , q k ) be the pair coded by φ e (k).
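The passage from φ a to φ b, and the decidability of "k < η n, j", can be sketched with ordinary functions standing in for program codes. This is a toy model under stated assumptions: `EMPTY` stands in for "a code for the empty string", `phi_a` is any total Python function of three arguments, and the names `first_nonempty` and `precedes_eta` are hypothetical.

```python
EMPTY = None  # stand-in for "a code for the empty string"

def first_nonempty(phi_a):
    """Build phi_b: on (n, m, r), return the first non-empty output of
    phi_a(n, m, r') over r' <= r, or EMPTY if there is none."""
    def phi_b(n, m, r):
        for r_prime in range(r + 1):
            out = phi_a(n, m, r_prime)
            if out is not EMPTY:
                return out
        return EMPTY
    return phi_b

def precedes_eta(phi_b, n, j, k):
    """Decide whether k < eta_{n,j}. Once phi_b answers non-empty it does so
    for all larger k, so a single evaluation at k settles the question."""
    return phi_b(n, j, k) is EMPTY

# Usage with a toy phi_a that first commits to a fact at step 3.
def toy_phi_a(n, m, r):
    return ("A", r) if r in (3, 5) else EMPTY

phi_b = first_nonempty(toy_phi_a)
print([precedes_eta(phi_b, 0, 0, k) for k in range(6)])
```

Each individual comparison halts, but no finite number of them reveals whether η n, j is infinite, which is the gap the recursion-theorem argument exploits.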

Define the total computable function F : N → N such that for n, k ∈ N,

where e ′ ∈ N is defined as follows. Let (A ′ , q ′ ) be the pair coded by φ b (n, j, η n, j ). First, compute a Prokhorov ball B ⊆ M 1 ([0, 1] 2 ) contained within ⋂ ℓ ≤η n, j P [0,1] 2 (A ℓ , q ℓ ). Next, compute some α ∈ M 1 ([0, 1]) such that α(A ′ ) = 0, and hence α ∉ P [0,1] (A ′ , q ′ ). Then, by Lemma 6.3, compute a code e ′ for a distribution ν ∈ B such that Φ(ν, x) = α for every conditioning operator Φ for F . By the recursion theorem, we can compute an index i, uniformly in a, e and j, such that φ F (i) = φ i . We now argue that η i, j = ∞, which implies that φ i = φ e by (45).

Suppose, for a contradiction, that η i, j ∈ N. Then φ b (i, j, η i, j ) = (A ′ , q ′ ) for some (A ′ , q ′ ) ∈ A([0, 1]). Hence, as φ a is a conditioning program, there is some conditioning operator Φ for F , such that Φ(µ i , x) ∈ P [0,1] (A ′ , q ′ ), where µ i is the measure represented by φ i . By construction, for every conditioning operator Φ for F , we have Φ(µ i , x) ∉ P [0,1] (A ′ , q ′ ), a contradiction. □

These results rely on the density of the finitely supported discrete probability distributions D P,[0,1] 2 . However, analogous results can be established if we restrict ourselves to absolutely continuous distributions admitting continuous joint density functions. In this case, the role of the finitely supported discrete distributions would be played by absolutely continuous distributions with sharp but continuous bump functions concentrating on small sets. The fundamental obstruction is the same: partial information in the weak topology does not suffice to condition continuously.

In this section, we construct a pair of P-almost computable random variables X in [0, 1] and N in N such that the conditional probability map P[N = k |X = • ] is not even L 1 (P X )-computable, despite the existence of a P X -almost continuous version. Our construction in this section can be thought of as providing a single witness to the noncomputability of the conditioning operator.

Let M n denote the nth Turing machine, under a standard enumeration, and let h : N → N ∪ {∞} be the map given by h(n) := ∞ if M n does not halt (on input 0) and h(n) := k if M n halts (on input 0) at the kth step. We may then take ∅ ′ : N → {0, 1} to denote the halting set

which is computably enumerable but not computable. The set ∅ ′ and the function h are computable from each other because

We now use h to define a pair of P-almost computable random variables (N, X) such that ∅ ′ is computable from P

Let N, C, U, and V be independent P-almost computable random variables such that N is a geometric(1/5) random variable, C is a Bernoulli(1/3) random variable, and U and V are uniformly distributed random variables in [0, 1].

Let ⌊x⌋ denote the greatest integer y ≤ x, and note that $\lfloor 2^k V \rfloor$ is uniformly distributed in $\{0, 1, 2, \ldots, 2^k - 1\}$ and is P-almost computable. For each k ∈ N, consider the derived random variable

Note that lim k →∞ X k almost surely exists. Define X ∞ := lim k →∞ X k , and observe that X ∞ = V a.s. Finally, define X := X h(N) .

Proposition 7.1. The random variable X is P-almost computable.

Proof. Because U and V are computable on a P-measure one set and a.s. nondyadic, their binary expansions {U n : n ∈ N} and {V n : n ∈ N} (which are uniquely determined with probability 1) are themselves P-almost computable random variables in {0, 1}, uniformly in n.

Fig. 1. A visualization of the distribution of (W, Y), as defined in Theorem 7.6. In this plot, the darkness of each pixel is proportional to the mass contained in that region. Note that this is not a plot of the (noncomputable) density, but rather of a discretization to a given pixel size. Regions that appear (at low resolution) to be uniform can suddenly be revealed (at higher resolutions) to be patterned. Deciding whether the pattern is in fact uniform (as opposed to nonuniform but at a finer granularity than the resolution of this printer/display) is tantamount to solving the halting problem, but it is possible to sample from this distribution nonetheless.

for every $\ell \in \{0, 1, \ldots, 2^{k+1} - 1\}$. It follows immediately that a density p of $2^{k+1} X_k$ with respect to Lebesgue measure on $[0, 2^{k+1}]$ is given by p

As X k admits a density with respect to Lebesgue measure on [0, 1] for all k ∈ N ∪ {∞}, it follows that the conditional distribution of X given N admits a conditional density p X |N (with respect to Lebesgue measure on [0, 1]) given by p

We may summarize our central result as follows.

Theorem 7.6. There are P-almost computable random variables W and Y on [0, 1] such that the conditional distribution map P[Y|W = • ] is P W -almost continuous but not P W -almost computable.

Proof. Let X and N be as above, let Y be uniformly distributed on [0, 1], and let M be the geometric(1/2) random variable given by $M := \lfloor -\log_2 Y \rfloor$. Finally, let W be such that P

The result then follows from Lemma 7.3 and Proposition 7.5. □
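The proof extracts a geometric(1/2) variable M from the uniform variable Y. A short numerical sketch of that step follows, assuming the floor is applied to −log₂ Y (so that M ≥ 0) and the convention P[M = m] = 2^−(m+1) for m = 0, 1, 2, …; both are our reading of the construction. Floating-point arithmetic is used for illustration only.

```python
import math
import random

def geometric_from_uniform(y):
    """Map y in (0, 1] to floor(-log2(y)); M = m exactly when
    2**-(m+1) < y <= 2**-m, an interval of length 2**-(m+1)."""
    return math.floor(-math.log2(y))

def empirical_pmf(num_samples, seed=0):
    """Monte Carlo estimate of the law of M from uniform draws."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_samples):
        y = 1.0 - rng.random()  # uniform on (0, 1], avoids log2(0)
        m = geometric_from_uniform(y)
        counts[m] = counts.get(m, 0) + 1
    return {m: c / num_samples for m, c in sorted(counts.items())}

# Usage: the estimated masses should be close to 1/2, 1/4, 1/8, ...
pmf = empirical_pmf(100_000)
print({m: round(p, 3) for m, p in pmf.items() if m < 4})
```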

For a visualization of the distribution of (W, Y), see Figure 1. The proof of Proposition 7.5 shows that not only is the conditional distribution map P[N|X = • ] not P X -almost computable, but that, in fact, it computes the halting set ∅ ′ ; to make this precise, we would define the notion of an oracle that encodes P[N|X = • ] using, e.g., infinite strings as in the Type-2 Theory of Effectivity. Despite not having a definition of computability from the conditional distribution map, we can easily relativize the notion of computability for the conditional distribution map, to obtain the following.

Corollary 7.7. If P[N|X = • ] is A-computable on a set of P X -measure greater than 5/6 for an oracle A ⊆ N, then A computes the halting set, i.e., A ≥ T ∅ ′ . In particular, P[N|X = • ] is not computable on any set of P X -measure greater than 5/6, and hence no version of P[N|X = • ] is P X -almost computable.

Proof. Suppose A is such that P[N|X = • ] is A-computable on a set of P X -measure greater than 5/6. Then P[N = 1|X = • ] is A-computable on the same set. But then, by the argument in the proof of Proposition 7.5, the function h, and hence the halting set ∅ ′ , is computable from A. □

On the other hand, by Lemma 7.4, the conditional distribution map P[N|X = • ] is in fact ∅ ′ -computable on a P X -measure one set, and so the bound in Corollary 7.7 is the best possible.

Computable operations map computable points to computable points, and so Corollary 7.7 provides another context in which conditioning operators are noncomputable (cf. Proposition 6.4).

The next result shows that Proposition 7.5 also rules out the computability of conditional probability maps in the weaker sense of L 1 (P X )-computability. This extends [Hoyrup and Rojas 2011, Prop. 3], which states that there is a pair of computable measures µ ≪ ν on a computable Polish space such that the Radon-Nikodym derivative dµ/dν is not L 1 -computable. Namely, we show that in the case of measures on [0, 1], the measures can be taken to be of the form µ = P{X ∈ • , N = k} and ν = P{X ∈ • } = P X .

Proposition 7.8. For every k ∈ N, the map P

Proof. Let k ∈ N. By Proposition 7.5, the conditional probability map P[N = k |X = • ] is not computable on any measurable set R of P X -measure greater than 5/6. On the other hand, by Lemma 2.23, the conditional probability map is L 1 (P X )-computable only if, for each r ∈ N, the map is computable on some set of P X -measure at least $1 - 2^{-r}$, uniformly in r . This does not hold, and so every conditional probability map P

It is natural to ask whether this construction can be modified to produce a pair of P-almost computable random variables, like N and X, such that the corresponding conditional distribution

Next we define, for every k ∈ N, the random variables Z k mimicking the construction of X k . Specifically, for k ∈ N, define

and let Z ∞ := lim k→∞ Z k = V a.s. Then the nth bit of Z k is

where S ℓ denotes the ℓth bit of S. It is straightforward to show from (i) and (ii) above that P Z k admits an infinitely differentiable density p Z k with respect to Lebesgue measure on [0, 1].

To complete the construction, we define Z := Z h(N) . The following results are analogous to those in the almost continuous construction.

Proof. By construction, the conditional density of Z given N is everywhere continuous, bounded, and positive. The result follows from Proposition 3.12 for R = [0, 1]. □

We next show that, for each k ∈ N, the conditional probability map P[N = k |Z = • ] is not computable. Our proof relies on the fact that, for each k ∈ N, there is a large set of points x ∈ [0, 1] for which the density p Z |N (x |k) := p Z h(k ) (x) agrees with p X |N (x |k). This will then allow us to use techniques similar to those of Proposition 7.5.

We say a real x ∈ [0, 1] is valid for P S if x ∈ (1/8, 1/2) ∪ (5/8, 1). In particular, when D ∉ {0, 4}, then S is valid for P S . The following are then consequences of the construction of S and the definition of valid points:

We begin with the problem of conditioning on a random variable that takes values in a discrete set. Given an appropriate notion of computability for discrete sets, conditioning is always possible in this setting, as it reduces to the elementary notion of conditional probability with respect to single events.

Definition 9.1 (Computably discrete set). Let S be a computable Polish space. A subset D ⊆ S is computably discrete when there exists a function f : S → N that is computable and injective on D. In particular, such a D is countable. We call f the witness to the discreteness of D.

The proof above relies on the ability to compute probabilities for continuity sets. In practice, computing probabilities can be inefficient. We give an alternative proof of Proposition 9.2 via a rejection-sampling argument, which yields an algorithm that can be much more efficient. We begin with a version of a well-known result.

Lemma 9.3 (rejection sampling). Let X and Y be random variables in computable Polish spaces S and T , respectively, let B ⊆ S be a measurable set with positive P X -measure, let (X 0 , Y 0 ), (X 1 , Y 1 ), . . . be a sequence of i.i.d. copies of (X, Y), and define ν to be the distribution of (X 0 , X 1 , . . . ). Further define g : S N → N to be the map

Remark 9.5. One might likewise obtain an algorithm for conditioning in the context of Proposition 9.4 using the proof of Proposition 9.2 via computable rejection sampling. In particular, the following classical argument may be carried out computably. For x ∈ R, let E x be a P X -almost continuous random variable in {0, 1} such that P[E x = 1|Y = y] = (1/M) p X|Y (x |y), where M satisfies p X|Y (x |y) < M for all x ∈ R and y ∈ T . (Such an M exists because of the boundedness condition.) Then, for every Borel B ⊆ T and every x ∈ R, we have

providing the reduction to (classical) rejection sampling.
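Classical rejection sampling, as invoked in Lemma 9.3, draws i.i.d. copies of (X, Y) until the first one with X ∈ B and returns the corresponding Y, whose law is P[Y | X ∈ B]. A minimal sketch follows; the toy joint distribution and all function names are hypothetical and chosen only so that the answer can be checked by hand.

```python
import random

def rejection_sample(sample_pair, in_B, rng):
    """Draw i.i.d. pairs until X lands in B, then return the matching Y.
    Halts with probability one whenever P[X in B] > 0."""
    while True:
        x, y = sample_pair(rng)
        if in_B(x):
            return y

def toy_pair(rng):
    """Toy model: Y uniform on {0, 1, 2}, X = Y plus a fair coin flip."""
    y = rng.randrange(3)
    x = y + rng.randrange(2)
    return x, y

# Usage: condition on the event {X = 1}. By hand, P[Y = 0 | X = 1] = 1/2.
rng = random.Random(0)
draws = [rejection_sample(toy_pair, lambda x: x == 1, rng) for _ in range(20_000)]
print(sum(d == 0 for d in draws) / len(draws))
```

The expected number of draws per sample is 1/P[X ∈ B], so the method degrades as the conditioning event becomes rare and is unavailable when the event has probability zero.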

Because the computability result Proposition 9.4 was established in a way that obviously relativizes to any oracle, we now obtain a proof of its analogue for a continuous conditional density, as promised in Section 3.2.

Proof of Proposition 3.12. Follows from the proof of Proposition 9.4 by relativizing with respect to an arbitrary oracle. □

Corollary 9.6 (Density and independence). Let U, V, and X be P-almost computable random variables in computable Polish spaces, where X is independent of V given U. Assume that there exists a bounded and computable conditional density p X|U (x |u) of X given U. Then the kernel P[(U, V)|X = • ] is computable.

Proof. Let Y = (U, V). Then p X|Y (x |(u, v)) = p X |U (x |u) is the conditional density of X given Y (with respect to ν ). The result then follows immediately from Proposition 9.4. □

As an immediate consequence of Corollary 9.6, we obtain the computability of the following common situation in probabilistic modeling, where the observed random variable has been corrupted by independent absolutely continuous noise.

Corollary 9.7 (Independent noise). Let U and E be P-almost computable random variables in R, and let V be a P-almost computable random variable in a computable Polish space. Define X = U + E. If P E is absolutely continuous (with respect to Lebesgue measure) with a bounded computable density p E , and E is independent of U and V, then the conditional distribution map P[(U, V)|X = • ] is computable.

Proof. We have that

p X|U (x |u) = p E (x − u)

is the conditional density of X given U (with respect to Lebesgue measure). The result then follows from Corollary 9.6. □

Pour-El and Richards [1989, Ch. 1, Thm. 2] show that a twice continuously differentiable computable function has a computable derivative (despite the fact that Myhill [1971] exhibits a computable function from [0, 1] to R whose derivative is continuous, but not computable). Therefore, noise with a sufficiently smooth computable distribution has a computable density, and by Corollary 9.7, if such noise has a bounded density, then an almost computable random variable corrupted by such noise still admits a computable conditional distribution map.
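For intuition about the independent-noise setting of Corollary 9.7, consider the special case where U has finite support. Bayes' rule then gives P[U = u | X = x] proportional to P[U = u] · p E (x − u). The sketch below assumes Gaussian noise and a two-point prior purely for illustration, and uses floating-point arithmetic rather than exact computable-real arithmetic; the function names are hypothetical.

```python
import math

def gaussian_density(sigma):
    """Density of centered Gaussian noise with standard deviation sigma."""
    def p_E(e):
        return math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return p_E

def posterior(prior, noise_density, x):
    """P[U = u | X = x] for X = U + E, with E independent of U.
    Uses p_{X|U}(x | u) = p_E(x - u) and normalizes over the prior's support."""
    weights = {u: mass * noise_density(x - u) for u, mass in prior.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

# Usage: a fair prior on {0, 1} observed through unit-variance noise.
prior = {0.0: 0.5, 1.0: 0.5}
print(posterior(prior, gaussian_density(1.0), 0.5))  # symmetric observation
print(posterior(prior, gaussian_density(1.0), 1.0))  # favors u = 1
```

The normalizing constant is bounded away from zero because the noise density is positive everywhere; this is one simple way to see why bounded, well-behaved noise makes the conditional distribution tractable in this toy case.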

Furthermore, Corollary 9.7 implies that the conditional distribution of a random variable given a continuous observation cannot always be uniformly approximated using noise in the sense that one cannot computably tell how little noise must be present to obtain a given accuracy. For example, consider our main construction (Theorem 7.6) corrupted with noise E = σ Z, where Z is a standard Gaussian noise and σ > 0 determines the standard deviation. Even though, as σ → 0, the conditional distribution given the corrupted observation converges weakly to the uncorrupted conditional distribution with σ = 0, the noncomputability of the uncorrupted conditional distribution implies that one cannot uniformly compute a value of σ from a desired bound on the error introduced to the conditional distribution corrupted by the noise σ Z.

Freer and Roy [2010] show how to compute conditional distributions in the setting of exchangeable sequences, i.e., sequences of random variables whose joint distribution is invariant to permutations of the indices. A classic result by de Finetti shows that exchangeable sequences of random variables are in fact conditionally i.i.d. sequences, conditioned on a random measure, often called the directing random measure. Freer and Roy describe how to transform an algorithm for sampling an exchangeable sequence into a rule for computing the posterior distribution of the directing random measure given observations. The result is a corollary of a computable version of de Finetti’s theorem [Freer and Roy 2009, 2012], and covers a wide range of common scenarios in nonparametric Bayesian statistics (often where no conditional density exists).

West Tower, 661 University Ave., Suite 710, Toronto, ON M5G 1M1, droy@utstat.toronto.edu arXiv:1005.3014v4 [math.LO] 16 Nov 2019 1 INTRODUCTION

Reference

This content is AI-processed based on open access ArXiv data.
