On the computability of conditional probability



NATHANAEL L. ACKERMAN, Harvard University
CAMERON E. FREER, Massachusetts Institute of Technology
DANIEL M. ROY, University of Toronto

As inductive inference and machine learning methods in computer science see continued success, researchers are aiming to describe ever more complex probabilistic models and inference algorithms. It is natural to ask whether there is a universal computational procedure for probabilistic inference. We investigate the computability of conditional probability, a fundamental notion in probability theory and a cornerstone of Bayesian statistics. We show that there are computable joint distributions with noncomputable conditional distributions, ruling out the prospect of general inference algorithms, even inefficient ones. Specifically, we construct a pair of computable random variables in the unit interval such that the conditional distribution of the first variable given the second encodes the halting problem. Nevertheless, probabilistic inference is possible in many common modeling settings, and we prove several results giving broadly applicable conditions under which conditional distributions are computable. In particular, conditional distributions become computable when measurements are corrupted by independent computable noise with a sufficiently smooth bounded density.
Contents

1 Introduction
  1.1 Probabilistic Programming
  1.2 Computable Probability Theory
  1.3 Conditional Probability
  1.4 Other Related Work
    1.4.1 Complexity Theory of Finite Discrete Distributions
    1.4.2 Computable Bayesian Learners
    1.4.3 Induction with respect to Universal Priors
    1.4.4 Radon–Nikodym Derivatives
  1.5 Summary of Results
2 Computable Probability Theory
  2.1 Computable and Computably Enumerable Reals
  2.2 Computable Polish Spaces
  2.3 Notions of Computability for Functions on Probability Spaces
  2.4 Computable and Almost Computable Random Variables
  2.5 Computable Probability Measures
  2.6 Weaker Notions of Computability for Functions on Probability Spaces
  2.7 Almost Decidable Sets and Bases
3 Conditional Probabilities and Distributions
  3.1 Conditional Distributions
  3.2 Dominated Families
4 Computable Conditional Probabilities and Distributions
5 Discontinuous Conditional Distributions
6 Conditioning is Discontinuous
7 Noncomputable Almost-Continuous Conditional Distributions
8 Noncomputable Everywhere Continuous Conditional Distributions
9 Positive Results
  9.1 Discrete Random Variables
  9.2 Continuous and Dominated Setting
  9.3 Conditioning on Noisy Observations
  9.4 Exchangeable Setting
Acknowledgments
References

Authors' addresses: Nathanael L. Ackerman, Harvard University, Department of Mathematics, One Oxford St., Cambridge, MA, 02138-2901, nate@math.harvard.edu; Cameron E. Freer, Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 77 Massachusetts Ave., Cambridge, MA, 02139-4301, freer@mit.edu; Daniel M. Roy, University of Toronto, Department of Statistical Sciences, 100 St. George St., Toronto, ON, M5S 3G3 and Vector Institute, MaRS Centre, West Tower, 661 University Ave., Suite 710, Toronto, ON M5G 1M1, droy@utstat.toronto.edu

1 INTRODUCTION

The use of probability to reason about uncertainty has wide-ranging applications in science and engineering, and some of the most important computational problems relate to conditioning, which is used to perform Bayesian inductive reasoning in probabilistic models. As researchers have faced more complex phenomena, their representations have also increased in complexity, which in turn has led to more complicated inference algorithms. It is natural to ask whether there is a universal inference algorithm — in other words, whether it is possible to automate probabilistic reasoning via a general procedure that can compute conditional probabilities for an arbitrary computable joint distribution. We demonstrate that there are computable joint distributions with noncomputable conditional distributions. As a consequence, no general algorithm for computing conditional probabilities can exist. Of course, the fact that generic algorithms cannot exist for computing conditional probabilities does not rule out the possibility that large classes of distributions may be amenable to automated inference.
The challenge for mathematical theory is to explain the widespread success of probabilistic methods and characterize the circumstances when conditioning is possible. In this vein, we describe broadly applicable conditions under which conditional probabilities are computable. We begin by describing a setting, probabilistic programming, that motivates the search for these results. We proceed to describe the technical frameworks for our results, computable probability theory and the modern formulation of conditional probability. We then highlight related work, and end the introduction with a summary of the results of the paper.

1.1 Probabilistic Programming

Within probabilistic artificial intelligence and machine learning, probabilistic programming provides formal languages and algorithms for describing and computing answers from probabilistic models. Probabilistic programming languages themselves build on modern programming languages and their facilities for recursion, abstraction, modularity, etc., to enable practitioners to define intricate, in some cases infinite-dimensional, models by implementing a generative process that produces an exact sample from the model's joint distribution. Probabilistic programming languages have been the focus of a long tradition of research within programming languages, model checking, and formal methods. For some of the early approaches within the AI and machine learning community, see, e.g., the languages PHA [Poole 1991], IBAL [Pfeffer 2001], Markov Logic [Richardson and Domingos 2006], λ◦ [Park et al. 2008], Church [Goodman et al. 2008], HANSEI [Kiselyov and Shan 2009], and Infer.NET [Minka et al. 2010].
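To make the idea of a generative process concrete, here is a toy generative program written in ordinary Python rather than in one of the languages cited above; the model, its parameters, and all names are illustrative inventions, not drawn from the paper. Each run produces one exact sample from the joint distribution the program implicitly defines.

```python
import random

# A toy generative program (purely illustrative).  Running it produces
# one exact sample from the joint distribution it implicitly defines:
# a random number of components, a parameter per component, and a
# noisy observation drawn from a randomly chosen component.

def sample_model(rng):
    # Recursion yields a geometric(1/2) number of components, k >= 1.
    def num_components():
        return 1 if rng.random() < 0.5 else 1 + num_components()

    k = num_components()
    means = [rng.gauss(0, 1) for _ in range(k)]   # one parameter per component
    z = rng.randrange(k)                          # choose a component uniformly
    x = rng.gauss(means[z], 0.1)                  # noisy draw from that component
    return {"k": k, "z": z, "x": x}

draw = sample_model(random.Random(1))
```

Conditioning such a program on an observed value of x — inferring k and z — is precisely the kind of inference problem at issue here.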
In many of these languages, one can easily represent the higher-order stochastic processes (e.g., distributions on data structures, distributions on functions, and distributions on distributions) that are essential building blocks in modern nonparametric Bayesian statistics. In fact, the most expressive such languages are each capable of describing the same robust class as the others — the class of computable distributions, which delineates those from which a probabilistic Turing machine can sample to arbitrary accuracy. Traditionally, inference algorithms for probabilistic models have been derived and implemented by hand. In contrast, probabilistic programming systems have introduced varying degrees of support for computing conditional distributions. Given the rate of progress toward broadening the scope of these algorithms, one might hope that there would eventually be a generic algorithm supporting the entire class of computable distributions. Despite recent progress toward such a general algorithm, support for conditioning with respect to continuous random variables has remained incomplete. Our results explain why this is necessarily the case.

1.2 Computable Probability Theory

In order to study computable probability theory and the computability of conditioning, we work within the framework of the Type-2 Theory of Effectivity (TTE) and use appropriate representations for topological and measurable objects such as distributions, random variables, and maps between them. This framework builds upon, and contains as a special case, ordinary Turing computation on discrete spaces, and gives us a basis for precisely describing the operations that probabilistic programming languages are capable of performing. In particular, we study the computability of distributions on computable Polish spaces including, e.g., certain spaces of distributions on distributions.
In Section 2 we present the necessary definitions and results from computable probability theory.

1.3 Conditional Probability

For an experiment with a discrete set of outcomes, computing conditional probabilities is, in principle, straightforward, as it is simply a ratio of probabilities. However, in the case of conditioning on the value of a continuous random variable, this ratio is undefined. Furthermore, in modern Bayesian statistics, and especially in the probabilistic programming setting, it is common to place distributions on higher-order objects, and so one is already in a situation where elementary notions of conditional probability are insufficient and more sophisticated measure-theoretic notions are necessary. Kolmogorov [1933] gave an axiomatic characterization of conditional probabilities and an abstract construction of them using Radon–Nikodym derivatives, but this definition and construction do not yield a general recipe for their calculation. There is a further problem: in this setting, conditional probabilities are formalized as measurable functions that are defined only up to measure zero sets. Therefore, without additional assumptions, a conditional probability is not necessarily well-defined for any particular value of the conditioning random variable. This has long been understood as a challenge for statistical applications, in which one wants to evaluate conditional probabilities given particular values for observed random variables. In this paper, we are therefore especially interested in situations where it makes sense to ask for the conditional distribution given a particular point. One of our main results is in the setting where there is a unique continuous conditional distribution. In this case, conditioning yields a canonical answer, which is a natural desideratum for statistical applications.
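In the discrete case mentioned above, conditioning really is just a ratio of probabilities. A minimal sketch, using a made-up two-variable joint distribution (the values and helper name are illustrative only):

```python
from fractions import Fraction

# A hypothetical joint distribution of two discrete random variables
# (X, Y), given as exact probabilities.  Conditioning on {Y = y} is
# a ratio: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y).
joint = {
    ("rain", "wet"): Fraction(3, 10),
    ("rain", "dry"): Fraction(1, 10),
    ("sun", "wet"): Fraction(1, 10),
    ("sun", "dry"): Fraction(5, 10),
}

def condition_on_y(joint, y):
    """Return the conditional distribution P(X = . | Y = y)."""
    p_y = sum(p for (x, y2), p in joint.items() if y2 == y)
    if p_y == 0:
        raise ValueError("conditioning event has probability zero")
    return {x: p / p_y for (x, y2), p in joint.items() if y2 == y}

posterior = condition_on_y(joint, "wet")
```

The failure mode is visible in the code: when P(Y = y) = 0, as happens for every point value of a continuous Y, the ratio is undefined, which is exactly the difficulty the measure-theoretic treatment addresses.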
A large body of work in probability and statistics is concerned with the derivation of conditional probabilities and distributions in special circumstances, each situation often requiring some special insight into the structure of the answer, especially when it was desirable for conditional probabilities and distributions to be defined at points, as in Bayesian statistical applications. This state of affairs motivated work on constructive definitions of conditioning (such as those due to Tjur [1974; 1975; 1980], Pfanzagl [1979], and Rao [1988; 2005]), although this work has not been sensitive to issues of computability. Under certain conditions, such as when conditional densities exist, conditioning can proceed using the classic Bayes' rule; however, it may not be possible to compute the density of a computable distribution (if the density exists at all), as we describe in Section 9.2. We recall the basics of the measure-theoretic approach to conditional probability in Section 3, and in Section 4 we use notions from computable probability theory to consider the sense in which conditioning could be potentially computable.

1.4 Other Related Work

We now describe several other connections between conditional probability and computation.

1.4.1 Complexity Theory of Finite Discrete Distributions. Conditional probabilities for computable distributions on finite, discrete sets are clearly computable, but may not be efficiently so. In this finite discrete setting, there are already interesting questions of computational complexity, which have been explored by a number of authors through extensions of Levin's theory of average-case complexity [Levin 1986]. For example, under cryptographic assumptions, it is difficult to sample from the conditional distribution of a uniformly distributed binary string of length n given its image under a one-way function.
This can be seen to follow from the work of Ben-David, Chor, Goldreich, and Luby [1992] in their theory of polynomial-time samplable distributions, which has since been extended by Yamakami [1999] and others. Other positive and negative complexity results have been obtained in the particular case of Bayesian networks by Cooper [1990] and Dagum and Luby [1993; 1997]. Extending these complexity results to the more general setting considered here could bear on the practice of statistical AI and machine learning.

1.4.2 Computable Bayesian Learners. Osherson, Stob, and Weinstein [1988] study learning theory in the setting of identifiability in the limit (see [Gold 1967] and [Putnam 1965] for more details on this setting) and prove that a certain type of "computable Bayesian" learner fails to identify the index of a (computably enumerable) set that is "computably identifiable" in the limit. More specifically, a "Bayesian learner" is required to return an index for a set with the highest conditional probability given a finite prefix of an infinite sequence of random draws from the unknown set. An analysis by Roy [2011] of their construction reveals that the conditional distribution of the index given the infinite sequence is an everywhere discontinuous function (on every measure one set), hence noncomputable for much the same reason as our elementary construction involving a mixture of measures concentrated on the rationals and on the irrationals (see Section 5). As we argue, in the context of statistical analysis, it is more appropriate to study the conditioning operator when it is restricted to those random variables whose conditional distributions admit versions that are continuous everywhere, or at least on a measure one set.

1.4.3 Induction with respect to Universal Priors.
Our work is distinct from the study of conditional distributions with respect to priors that are universal for partial computable functions (as defined using Kolmogorov complexity) by Solomonoff [1964], Zvonkin and Levin [1970], and Hutter [2007]. The computability of conditional distributions also has a rather different character in Takahashi's work on the algorithmic randomness of points defined using universal Martin-Löf tests [Takahashi 2008]. The objects with respect to which one is conditioning in these settings are typically not computable (e.g., the universal semimeasure is merely lower semicomputable). In the present paper, we are interested in the problem of computing conditional distributions of random variables that are computable, even though the conditional distribution may itself be noncomputable.

1.4.4 Radon–Nikodym Derivatives. In the abstract setting, conditional probabilities are (suitably measurable) Radon–Nikodym derivatives. In work motivated by questions in algorithmic randomness, Hoyrup and Rojas [2011] study notions of computability for absolute continuity and for Radon–Nikodym derivatives as elements in L^1, i.e., the space of integrable functions. Hoyrup, Rojas, and Weihrauch [2011] then show an equivalence between the problem of computing Radon–Nikodym derivatives as elements in L^1 and computing the characteristic function of computably enumerable sets. The noncomputability of the Radon–Nikodym derivative operator is demonstrated by a pair µ, ν of computable measures whose Radon–Nikodym derivative dµ/dν is not computable as an element in L^1(ν). However, the Radon–Nikodym derivatives they study do not correspond to conditional probabilities, and so the computability of the operator restricted to those maps arising in the construction of conditional probabilities is not addressed by this work. The underlying notion of computability is another important difference.
An element in L^1(ν) is an equivalence class of functions, every pair agreeing on a set of ν-measure one. Thus one cannot, in general, evaluate these derivatives at points in a well-defined manner. Most Bayesian statisticians would be unfamiliar and perhaps unsatisfied with this notion of computability, especially in settings where their statistical models admit continuous versions of Radon–Nikodym derivatives that are unique, and thus well-defined pointwise. Regardless, we will show that even in such settings, computing conditional probabilities is not possible in general, even in the weaker L^1 sense. On the other hand, our positive results do yield computable probabilities/distributions defined pointwise.

1.5 Summary of Results

Following our presentation of computable probability theory and conditional probability in Sections 2 through 4, we provide our main positive and negative results about the computability of conditional probability, which we now summarize. Recall that measurable functions are often defined only up to a measure-zero set; any two functions that agree almost everywhere are called versions of each other. In Proposition 5.1, we construct random variables X and C that are computable on a P-measure one set, such that every version of the conditional distribution map P[C | X = ·] (i.e., a probability-measure-valued function f such that f(X) is a regular version of the conditional distribution P[C | X]) is discontinuous everywhere, even when restricted to a P_X-measure one subset. (We make these notions precise in Section 4.) The construction makes use of the elementary fact that the indicator function for the rationals in the unit interval (the so-called Dirichlet function) is itself nowhere continuous.
Because every function computable on a domain D is continuous on D, discontinuity is a fundamental barrier to computability, and so this construction rules out the possibility of a completely general algorithm for conditioning. A natural question is whether conditioning is a computable operation when we restrict the operator to random variables for which some version of the conditional distribution is continuous everywhere, or at least on a measure one set. In fact, even under this restriction, conditioning is not even continuous, let alone computable, as we show in Section 6. We further demonstrate that if some computer program purports to state true facts about the conditional distribution of a computable joint distribution provided as input, then we can uniformly find some other representation of the input distribution such that the program does not output any nontrivial fact about the conditional distribution. Our central result, Theorem 7.6, provides a pair of random variables that are computable on a measure one set, but such that the conditional distribution of one variable given the other is not computable on any measure one set (though some version is continuous on a measure one set). The construction involves encoding the halting times of all Turing machines into the conditional distribution map while ensuring that the joint distribution remains computable. This result yields another proof of the noncomputability of the conditioning operation restricted to measures having conditional distributions that are continuous on a measure one set. In Section 8 we extend our central result by constructing a pair of random variables, again computable on a measure one set, whose conditional distribution map is noncomputable but has an everywhere continuous version with infinitely differentiable conditional probability maps.
This construction proceeds by smoothing out the distribution constructed in Section 7, but in such a way that one can still compute the halting problem relative to the conditional distribution. This result implies that conditioning is not a computable operation, even when we further restrict to the case where the conditional distribution has an everywhere continuous version. Despite the noncomputability of conditioning in general, conditional distribution maps are often computable in practice. We provide some explanation of this phenomenon by characterizing several circumstances in which conditioning is a computable operation. Under suitable computability hypotheses, conditioning is computable in the discrete setting (Proposition 9.2) and where there is a conditional density (Corollary 9.6). We also characterize a situation in which conditioning is possible in the presence of noisy data, capturing many natural models in science and engineering. Let U, V, and E be computable random variables, where U and E are real-valued, and suppose that P_E is absolutely continuous with a bounded computable density p_E and E is independent of U and V. We can think of U + E as the corruption of an idealized measurement U by an independent source of additive error E. In Corollary 9.7, we show that the conditional distribution map P[(U, V) | U + E = ·] is computable (even if P[(U, V) | U = ·] is not). Finally, we discuss how symmetry, in the form of exchangeability, can contribute to the computability of conditional distributions.

2 COMPUTABLE PROBABILITY THEORY

We now give some background on computable probability theory, which will enable us to formulate our results. The foundations of the theory include notions of computability for probability measures developed by Edalat [1996], Weihrauch [1999], Schröder [2007], and Gács [2005].
Computable probability theory itself builds on notions and results in computable analysis, specifically the Type-2 Theory of Effectivity. For a general introduction to this approach to real computation, see Weihrauch [2000], Braverman [2005], or Braverman and Cook [2006].

2.1 Computable and Computably Enumerable Reals

We first recall some elementary definitions from computability theory (see, e.g., Rogers [1987, Ch. 5]). A set of natural numbers (potentially in some correspondence with, e.g., rationals, integers, or other finitely describable objects with an implicit enumeration) is computable when there is a computer program that, given k, outputs whether or not k is in the set. A set is computably enumerable (c.e.) when there is a computer program that outputs every element of the set eventually. Note that a set is computable when both it and its complement are c.e. We say that a sequence of sets {B_n} is computable uniformly in n when there is a single computer program that, given n and k, outputs whether or not k is in B_n. We say that the sequence is c.e. uniformly in n when there is a computer program that, on input n, outputs every element of B_n eventually.

We now recall basic notions of computability for real numbers (see, e.g., [Weihrauch 2000, Ch. 4.2] or [Nies 2009, Ch. 1.8]). We say that a real r is a c.e. real (sometimes called a left-c.e. real) when the set of rationals {q ∈ Q : q < r} is c.e. A real r is computable when both it and its negative are c.e. Equivalently, a real is computable when there is a program that approximates it to any given accuracy (e.g., given an integer k as input, the program reports a rational that is within 2^{-k} of the real). A function f : N → R is lower semicomputable when f(n) is a c.e. real, uniformly in n (i.e., when the collection of rationals less than f(n) is c.e. uniformly in n). Likewise, a function is upper semicomputable when its negative is lower semicomputable.
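The "approximation to any given accuracy" characterization suggests a standard concrete representation of a computable real in a program: a function that, given k, returns a rational within 2^{-k}. The following sketch (with illustrative helper names, not taken from the paper) shows such a representation for √2, and how arithmetic on computable reals amounts to querying the arguments at higher precision.

```python
from fractions import Fraction

# A computable real can be represented by a program that, on input k,
# returns a rational within 2^-k of the real.

def sqrt2(k):
    """Rational within 2^-k of sqrt(2), found by bisection."""
    lo, hi = Fraction(1), Fraction(2)
    while hi - lo > Fraction(1, 2**k):
        mid = (lo + hi) / 2
        if mid * mid <= 2:
            lo = mid
        else:
            hi = mid
    return lo  # sqrt(2) lies in (lo, hi), so |lo - sqrt(2)| < 2^-k

def add(r, s):
    """Sum of two computable reals, as another approximation program."""
    # To get within 2^-k of r + s, query each summand at precision 2^-(k+1).
    return lambda k: r(k + 1) + s(k + 1)

two_sqrt2 = add(sqrt2, sqrt2)
```

Relativizing the same picture — allowing the approximation program an oracle — recovers the non-computable reals as well.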
The function f is computable if and only if it is both lower and upper semicomputable.

2.2 Computable Polish Spaces

Recall that a Polish space is a topological space that admits a metric under which it is a complete separable metric space. Computable Polish spaces, as developed in computable analysis [Hemmerling 2002; Weihrauch 1993] and effective domain theory [Blanck 1997; Edalat and Heckmann 1998], provide a convenient framework for formulating results in computable probability theory. For consistency, we largely use definitions from [Hoyrup and Rojas 2009b] and [Galatolo et al. 2010]. Additional details about computable Polish spaces, sometimes called computable metric spaces or effective Polish spaces, can also be found in [Weihrauch 2000, Ch. 8.1], [Gács 2005, §B.3], and [Moschovakis 2009, Ch. 3I].

Definition 2.1 (Computable Polish space [Galatolo et al. 2010, Def. 2.3.1]). A computable Polish space is a triple (S, δ, D) for which δ is a metric on the set S satisfying

(1) (S, δ) is a complete separable metric space;
(2) D = {s_i}_{i∈N} is an enumeration of a dense subset of S, called ideal points; and,
(3) the real numbers δ(s_i, s_j) are computable, uniformly in i and j.

In particular, note that condition (1) implies that the topological space determined by the metric space (S, δ) is a Polish space.

Let B(s_i, q_j) denote the ball of radius q_j centered at s_i. We call the elements of the set

    B_S := { B(s_i, q_j) : s_i ∈ D and q_j ∈ Q s.t. q_j > 0 }    (1)

the ideal balls of S, and fix the canonical enumeration of them induced by that of D and Q. Let ℬ_S denote the Borel σ-algebra on a Polish space S, i.e., the σ-algebra generated by the open balls of S. Let M_1(S) denote the set of Borel probability measures on S. In this paper we primarily work with computable Polish spaces.
As such, unless otherwise noted, the σ-algebras will always be the Borel σ-algebras on such spaces — in particular, making them standard Borel spaces. Measurable functions between Polish spaces will always be measurable with respect to the Borel σ-algebras. We will sometimes refer to measurable subsets of a probability space as events.

Example 2.2. The set {0, 1} is a computable Polish space under the discrete metric, where δ(0, 1) = 1. Cantor space, the set {0, 1}^∞ of infinite binary sequences, is a computable Polish space under its usual metric and the dense set of eventually constant strings (under a standard enumeration of finite strings). The set R of real numbers is a computable Polish space under the Euclidean metric with the dense set Q of rationals (under its standard enumeration).

Suppose we are given a finite sequence (T_0, δ_0, D_0), ..., (T_{n-1}, δ_{n-1}, D_{n-1}) of computable Polish spaces. Then the product metric space ∏_{i=0}^{n-1} T_i (with any one of the equivalent standard product metrics) is a computable Polish space where the ideal points consist of all finite products of ideal points. Furthermore, given a countably infinite such sequence (T_0, δ_0, D_0), (T_1, δ_1, D_1), ... that is uniformly computable and has a fixed bound on the diameters, the product metric space is the metric space whose underlying set is ∏_{i∈N} T_i and whose metric is given by δ(x, y) = Σ_{i∈N} 2^{-i} δ_i(x_i, y_i); this too can be made into a computable Polish space, by taking the ideal points to be those sequences (x_0, x_1, ...) with each x_i ∈ D_i such that for all but finitely many terms i, the point x_i is the first element in the enumeration of D_i. Note that in both the finite and infinite cases, all projection maps are computable.

Definition 2.3 (Computable point [Galatolo et al. 2010, Def. 2.3.2]). Let (S, δ, D) be a computable Polish space with D = {s_j}_{j∈N} and x ∈ S.
Given a sequence {i_k}_{k∈N} of natural numbers, we say that the sequence {s_{i_k}}_{k∈N} of elements of D is a representation of the point x if δ(s_{i_k}, x) < 2^{-k} for all k. When {i_k}_{k∈N} is a computable sequence such that {s_{i_k}}_{k∈N} is a representation of x, we say that {s_{i_k}}_{k∈N} is a computable representation, and that the point x is computable.

Remark 2.4. A real α ∈ R is computable (as in Section 2.1) if and only if α is a computable point of R (as a computable Polish space). Although most of the familiar reals are computable, there are only countably many computable reals, and so almost every real is not computable.

The notion of a c.e. open set (or Σ^0_1 class) is fundamental in classical computability theory, and admits a simple definition in an arbitrary computable Polish space.

Definition 2.5 (C.e. open set [Galatolo et al. 2010, Def. 2.3.3]). Let (S, δ, D) be a computable Polish space with the corresponding enumeration {B_i}_{i∈N} of the ideal open balls B_S. We say that U ⊆ S is a c.e. open set when there is some c.e. set E ⊆ N such that U = ∪_{i∈E} B_i.

Note that the class of c.e. open sets is closed under computable unions and finite intersections. A computable function can be thought of as a continuous function whose local modulus of continuity is witnessed by a program. It is important to consider the computability of partial functions, because many natural and important random variables are continuous only on a measure one subset of their domain.

Definition 2.6 (Computable partial function [Galatolo et al. 2010, Def. 2.3.6]). Let (S, δ_S, D_S) and (T, δ_T, D_T) be computable Polish spaces, the latter with the corresponding enumeration {B_n}_{n∈N} of the ideal open balls B_T. A function f : S → T is said to be continuous on R ⊆ S when f restricted to R is continuous as a function from R, under the subspace topology, to T.
A function f : S → T is said to be computable on R ⊆ S when there is a computable sequence {U_n}_{n∈N} of c.e. open sets U_n ⊆ S such that f^{-1}[B_n] ∩ R = U_n ∩ R for all n ∈ N. We call such a sequence {U_n}_{n∈N} a witness to the computability of f. Note that the notion of being computable on a set R can be relativized to an oracle A ⊆ N in the obvious way. A function is continuous on R if and only if it is A-computable on R for some oracle A.

Remark 2.7. Let S and T be computable Polish spaces. If f : S → T is computable on some subset R ⊆ S, then for every computable point x ∈ R, the point f(x) is also computable. One can show that f is computable on R when there is an oracle Turing machine that, upon being fed a representation of a point x ∈ R on its oracle tape, computes a representation of its image f(x) ∈ T. (For more details, see [Hoyrup and Rojas 2009b, Prop. 3.3.2].)

2.3 Notions of Computability for Functions on Probability Spaces

The standard notion of computability of functions between computable Polish spaces is too restrictive in most cases when the inputs to these functions are points in a probability space. For example, the Heaviside function f(x) = 1(x ≥ 0) is not computable on any set containing a neighborhood of 0. However, we can reliably compute the image of a Gaussian random variable under f, because the Gaussian random variable is nonzero with probability one, and f is computable on R \ {0}.

For a measure space (Ω, G, µ), a set E ∈ G is a µ-null set when µ(E) = 0. More generally, for p ∈ [0, ∞], we say that E is a µ-measure p set when µ(E) = p. A predicate P on Ω is said to hold µ-almost everywhere (abbreviated µ-a.e.) if the event E_P = {ϖ ∈ Ω : P(ϖ) does not hold} is a µ-null set.
When E_P is a µ-null set but µ is a probability measure, we will instead say the event P holds µ-almost surely, and we likewise say that an event E ∈ G occurs µ-almost surely (abbreviated µ-a.s.) when µ(E) = 1. In each case, we may drop the prefix µ when it is clear from context (in particular, when it holds of P).

Definition 2.8. Let S and T be Polish spaces and µ a probability measure on S. A measurable function f : S → T is µ-almost continuous when it is continuous on a µ-measure one set. When S and T are computable Polish spaces, the measurable function f is µ-almost computable when it is computable on a µ-measure one set. (See [Hoyrup and Rojas 2009b] for further development of the theory of almost computable functions.)

The following result relates µ-almost continuity to µ-a.e. continuity, i.e., the set of continuity points being a µ-measure one set. The proofs of the following proposition and lemma are due to François Dorais, Gerald Edgar, and Jason Rute [2013].

Proposition 2.9. Let X and Y be Polish spaces, let µ be a probability measure on X, and let f : X → Y be a µ-almost continuous function. Then there is a µ-a.e. continuous g : X → Y that agrees with f µ-a.e.

We will need the following technical lemma. Recall that a G_δ set is a countable intersection of open sets.

Lemma 2.10. Let X be a Polish space. If D ⊆ X is a nonempty G_δ set then there is a measurable map h : X → D such that lim_{x→x_0} h(x) = x_0 for every x_0 ∈ D.

Proof. Suppose D = ∩_{n∈N} U_n, where (U_n)_{n∈N} is a descending sequence of open sets such that U_n ⊆ ∪_{x_0∈D} B(x_0, 1/(n+1)). Any measurable retraction h : X → D with the property that if x ∈ U_n \ U_{n+1} then d(h(x), x) < 1/(n+1) will be as required. By definition, it is always possible to find a suitable h(x) ∈ D for each x ∈ U_0 \ D.
To ensure that h is measurable, fix an enumeration (d_i)_{i∈N} of a countable dense subset of D and, if x ∈ U_0 \ D, define h(x) to be the first element in this list that satisfies all the necessary requirements. (We must have h(x) = x for x ∈ D, and it does not matter how h(x) is defined when x ∉ U_0, so long as the end result is measurable.) □

Proof of Proposition 2.9. By a classical result of Kuratowski [Kechris 1995, I.3.B, Thm. 3.8], we may assume (after first possibly changing its value on a µ-null set) that f is continuous on a µ-measure one G_δ set D ⊆ X. Let h be as in Lemma 2.10. Then g = f ∘ h is a measurable function that agrees with f on D, and lim_{x→x_0} g(x) = f(lim_{x→x_0} h(x)) = f(x_0) = g(x_0) for all x_0 ∈ D. □

Remark 2.11. Let S and T be computable Polish spaces. A set X ⊆ S is an effective G_δ set (or Π⁰₂ class) when it is the intersection of a uniformly computable sequence of c.e. open sets. Suppose that f : S → T is computable on R ⊆ S with {U_n}_{n∈N} a witness to the computability of f. One can show that there is an effective G_δ set R′ ⊇ R and a function f′ : S → T such that f′ is computable on R′, the restriction of f′ to R is equal to f as a function, and {U_n}_{n∈N} is a witness to the computability of f′. Furthermore, a G_δ-code for some such R′ can be computed uniformly from a code for the witness {U_n}_{n∈N}. For details, see [Hoyrup 2008, Thm. 1.6.2.1]; this generalizes a classical result of Kuratowski [Kechris 1995, I.3.B, Thm. 3.8]. In conclusion, one can always assume that the set R is an effective G_δ set. We will introduce a weaker notion of computability for functions in Section 2.6.

2.4 Computable and Almost Computable Random Variables

Intuitively, a random variable maps an input source of randomness to an output, inducing a distribution on the output space. Here we will use a sequence of independent fair coin flips as our source of randomness.
We formalize this via the probability space ({0,1}^∞, F, P), where {0,1}^∞ is the product space of infinite binary sequences, F is its Borel σ-algebra (generated by the basic clopen cylinders extending each finite binary sequence), and P is the product measure formed from the uniform distribution on {0,1}. Throughout the rest of the paper we will take ({0,1}^∞, F, P) to be the basic probability space. We will use a sans serif font for random variables.

Definition 2.12 (Random variable and its distribution). Let S be a Polish space. A random variable in S is a measurable function X : {0,1}^∞ → S. For a measurable subset A ⊆ S, we let {X ∈ A} denote the inverse image X^{-1}[A] = {ϖ ∈ {0,1}^∞ : X(ϖ) ∈ A}, and for x ∈ S we similarly define the event {X = x}. We will write P_X for the distribution of X, which is the measure on S defined by P_X(·) := P{X ∈ ·}.

If S is a computable Polish space, then we say a random variable X in S is a P-almost computable random variable when it is P-almost computable as a measurable function. Intuitively, X is a P-almost computable random variable when there is a program that, given access to an oracle bit tape ϖ ∈ {0,1}^∞, outputs a representation of the point X(ϖ) (i.e., enumerates a sequence {x_i} in D with δ(x_i, X(ϖ)) < 2^{-i} for all i), for all but a P-measure zero subset of bit tapes ϖ ∈ {0,1}^∞.

Even though the source of randomness is a sequence of discrete bits, there are P-almost computable random variables with continuous distributions, such as a uniform random variable (obtained by subdividing the unit interval according to the random bit tape) or an i.i.d. sequence of uniformly distributed random variables (obtained by splitting the given element of {0,1}^∞ into countably many disjoint subsequences and dovetailing the constructions).
(For explicit constructions, see, e.g., [Freer and Roy 2010, Ex. 3, 4].)

It is crucial that we consider random variables that are merely computable on a P-measure one subset of {0,1}^∞. To see why, consider the following example, which was communicated to us by Martín Escardó. For a real α ∈ [0,1], we say that a binary random variable X : {0,1}^∞ → {0,1} is a Bernoulli(α) random variable when P_X{1} = α. There is a Bernoulli(1/2) random variable that is computable on all of {0,1}^∞, given by the program that simply outputs the first bit of the input sequence. Likewise, when α is dyadic (i.e., a rational whose denominator is a power of 2), there is a Bernoulli(α) random variable that is computable on all of {0,1}^∞. However, this is not possible for any other choice of α (e.g., 1/3).

Lemma 2.13. Let α ∈ [0,1] be a nondyadic real. Every Bernoulli(α) random variable X : {0,1}^∞ → {0,1} is discontinuous, hence not computable on all of {0,1}^∞.

Proof. Suppose, toward a contradiction, that X is continuous. Let Z_0 := X^{-1}(0) and Z_1 := X^{-1}(1). Then {0,1}^∞ = Z_0 ∪ Z_1, and so both sets are closed (as well as open). The compactness of {0,1}^∞ implies that these closed subspaces are also compact, and so Z_0 and Z_1 can each be written as a finite disjoint union of clopen basis elements. But each of these elements has dyadic measure, hence their sum cannot be either α or 1 − α, contradicting the fact that P(Z_1) = 1 − P(Z_0) = α. □

On the other hand, for an arbitrary computable α ∈ [0,1], consider the random variable X_α given by X_α(x) = 1 if ∑_{i=0}^∞ x_i 2^{-i-1} < α, and 0 otherwise. This construction, due to Mann [1973], yields a Bernoulli(α) random variable that is computable at every point of {0,1}^∞ other than a binary expansion of α.
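Mann's construction can be sketched as a short program. This is our illustration, not code from the paper; `alpha_bits` and `omega_bits` are hypothetical helpers exposing the bits of a fixed binary expansion of α and of the input tape.

```python
def mann_bernoulli(alpha_bits, omega_bits):
    """Sketch of Mann's construction X_alpha (our illustration, not the paper's code).

    alpha_bits(i): i-th bit of a fixed binary expansion of alpha (hypothetical helper).
    omega_bits(i): i-th bit of the input tape omega.
    Decides at the first index where the two expansions differ; for nondyadic alpha
    this computes X_alpha, and it fails to halt exactly when omega is the chosen
    binary expansion of alpha.
    """
    i = 0
    while True:
        a, w = alpha_bits(i), omega_bits(i)
        if w < a:
            return 1  # omega < alpha at the first differing bit
        if w > a:
            return 0  # omega > alpha at the first differing bit
        i += 1
```

For α = 1/3 = 0.010101...₂ one can take `alpha_bits = lambda i: i % 2`; the program then halts on every tape except 010101... itself.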
Not only are these random variables P-almost computable, but they can be shown to be optimal in their use of input bits, via the classic analysis of rational-weight coins by Knuth and Yao [1976]. Hence it is natural to focus our attention on random variables that are merely P-almost computable. The setting of P-almost computable random variables is a natural one for probability theory, and the standard operations on random variables preserve P-almost computability, including, e.g., addition and multiplication of P-almost computable real random variables, composition with P-almost computable measurable functions, and Cartesian products.

2.5 Computable Probability Measures

We now introduce the class of computable probability measures on computable Polish spaces. Let (S, δ_S, D_S) be a computable Polish space, and recall that B_S denotes its Borel sets and M_1(S) its Borel probability measures. Consider the subset D_{P,S} ⊆ M_1(S) consisting of those probability measures that are concentrated on a finite subset of D_S and for which the measure of each atom is rational, i.e., ν ∈ D_{P,S} if and only if ν = q_1 δ_{t_1} + · · · + q_k δ_{t_k} for some rationals q_i ≥ 0 such that q_1 + · · · + q_k = 1 and some points t_i ∈ D_S, where for t ∈ S the {0,1}-valued Dirac measure δ_t satisfies δ_t(A) = 1 if and only if t ∈ A, for all measurable sets A. It is a standard fact (see, e.g., Gács [2005, §B.6.2]) that D_{P,S} is dense in the Prokhorov metric δ_P given by

δ_P(µ, ν) := inf{ε > 0 : ∀A ∈ B_S, µ(A) ≤ ν(A^ε) + ε},  (2)

where

A^ε := {p ∈ S : ∃q ∈ A, δ_S(p, q) < ε} = ⋃_{p∈A} B_ε(p)  (3)

is the ε-neighborhood of A and B_ε(p) is the open ball of radius ε about p. Moreover, (M_1(S), δ_P, D_{P,S}) is a computable Polish space. (See [Hoyrup and Rojas 2009b, Prop. 4.1.1].)
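To make the density of D_{P,S} concrete, here is a small sketch of ours (not from the paper) for S = [0,1] with the Euclidean metric: a rational discrete measure within any desired Prokhorov distance of the uniform distribution.

```python
from fractions import Fraction

def discretize_uniform(n):
    """Sketch (our illustration): approximate Uniform[0,1] by an element of D_P,
    namely n atoms of rational weight 1/n at the midpoints (2k+1)/(2n).
    Every point of [0,1] lies within 1/(2n) of an atom, so the Prokhorov
    distance to Uniform[0,1] is at most 1/(2n)."""
    return [(Fraction(2 * k + 1, 2 * n), Fraction(1, n)) for k in range(n)]
```

Taking n large enough thus produces a measure in D_P within any requested ε of the uniform distribution, illustrating why D_{P,S} serves as the set of ideal points of M_1(S).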
We say that µ ∈ M_1(S) is a computable probability measure when µ is a computable point in the computable Polish space M_1(S). When the space S is clear from context, we will refer to D_{P,S} simply as D_P.

One can define computability on the space of probability measures in other natural ways. Early work by Weihrauch [1999] and Müller [1999] formalized the computability of probability measures in terms of the lower semicomputability of the measure as a function on the set of open sets, and in terms of the computability of the measure as a linear operator acting on bounded continuous functions; these notions are equivalent. (See Schröder [2007] for a more general setting.) These notions of computability also agree with the notion of computability defined here in terms of the Prokhorov metric.

Proposition 2.14 ([Hoyrup and Rojas 2009b, Thm. 4.2.1]). Let S be a computable Polish space. A probability measure µ ∈ M_1(S) is computable if and only if the measure µ(A) of a c.e. open set A ⊆ S is a c.e. real, uniformly in A. □

Note that the measure P on {0,1}^∞ is a computable probability measure. We can also characterize the class of computable probability measures in terms of the uniform computability of the integrals of bounded continuous functions:

Proposition 2.15 ([Hoyrup and Rojas 2009b, Cor. 4.3.1]). Let S be a computable Polish space, let µ be a probability measure on S, and let F be the set of computable functions from S to R_+. Then µ is computable if and only if ∫ f dµ is a c.e. real, uniformly in f ∈ F. □

Corollary 2.16. Let S be a computable Polish space, let µ be a probability measure on S, and let F be the set of computable functions from S to [0,1]. Then µ is computable if and only if ∫ f dµ is computable, uniformly in f ∈ F.

Proof. First observe that both f and 1 − f are non-negative computable functions.
Therefore, by Proposition 2.15, the reals ∫ f dµ and ∫ (1 − f) dµ are both c.e.; since their sum is 1, the real ∫ f dµ is computable. □

Having explained the computability of probability measures in terms of integration, we now relate it to the computability of random variables defined on computable Polish spaces.

Definition 2.17 (Computable probability space [Galatolo et al. 2010, Def. 2.4.1]). A computable probability space is a pair (S, µ) where S is a computable Polish space and µ is a computable probability measure on S.

The distribution of a P-almost computable random variable in a computable Polish space is computable.

Proposition 2.18 ([Galatolo et al. 2010, Prop. 2.4.2]). Let X be a P-almost computable random variable in a computable Polish space S. Then its distribution is a computable point in the computable Polish space M_1(S). □

On the other hand, given a computable measure, there is a P-almost computable random variable with that distribution.

Proposition 2.19 ([Hoyrup and Rojas 2009b, Thm. 5.1.1]). Let µ be a computable probability measure on a computable Polish space S. Then there is a P-almost computable random variable in S whose distribution is µ. □

In summary, the computable probability measures on a computable Polish space are precisely the distributions of P-almost computable random variables in that space. For this result in a more general setting, see [Schröder 2007, Prop. 4.3]. Further, if µ is a computable probability measure and f is computable on a µ-measure one set, then the pushforward µ ∘ f^{-1} is a computable distribution. This fact, along with Proposition 2.19, shows that we have lost no generality in taking ({0,1}^∞, F, P) to be our basic probability space.
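As a concrete instance of this correspondence, here is a sketch of ours (not from the paper) of the uniform random variable of Section 2.4: a program that, given the bit tape, outputs a rational approximation of X(ω) to any requested precision.

```python
def uniform_from_bits(omega, k):
    """Sketch (our illustration): the P-almost computable random variable
    X(omega) = sum_i omega[i] * 2^{-i-1} is uniformly distributed on [0,1].
    Reading the first k bits of the tape yields a dyadic rational s_k with
    0 <= X(omega) - s_k <= 2^{-k}, i.e., a representation of X(omega)
    to precision 2^{-k}."""
    return sum(omega[i] * 2.0 ** (-(i + 1)) for i in range(k))
```

By Proposition 2.18, the distribution of this random variable, the uniform distribution on [0,1], is a computable point of M_1([0,1]).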
All of the standard distributions (e.g., normal, uniform, geometric, exponential) found in probability textbooks, as well as all transformations of these distributions by P-almost computable functions, are easily shown to be computable distributions.

2.6 Weaker Notions of Computability for Functions on Probability Spaces

Another important class of functions on a probability space is the class of L¹-computable functions. For more details, including some of the history of L¹-computability, see Hoyrup and Rojas [2009a, §3.1] and Miyabe [2013].

Definition 2.20 (The metric space of L¹(µ) functions [Hoyrup and Rojas 2009a, §3.1]). Let µ be a probability measure on a Polish space S, and let F be the set of µ-integrable functions from S to R. Then δ(f, g) := ∫ |f − g| dµ is a metric on the quotient of F by the equivalence relation f ∼ g iff ∫ |f − g| dµ = 0. This metric space is called the space of L¹(µ) functions on S, and we will often speak interchangeably of a µ-integrable function S → R and its equivalence class.

We will make use of the following set of L¹ functions.

Definition 2.21 (Ideal points for L¹ [Gács 2005, §2]). Let (S, δ, D) be a computable Polish space. Define E to be the smallest set of functions containing the constant function 1 and the functions {g_{u,r,1/n} : u ∈ D, r ∈ Q, n ≥ 1}, where

g_{u,r,ε}(x) := max(0, 1 − max(0, δ(x, u) − r)/ε),  (4)

that is closed under max, min, and rational linear combinations.

Such functions can be thought of as continuous analogues of step functions having a finite number of steps, each step of which corresponds to a basic open ball with rational radius and ideal center.

Lemma 2.22 ([Hoyrup and Rojas 2009a, Prop. 3]). Let µ be a computable probability measure on a computable Polish space (S, δ, D).
The set E is dense in the space of L¹(µ) functions on S, and the distances between points of E are computable under the standard enumeration, making this space into a computable Polish space. □

We say that an L¹(µ) function on a computable Polish space S is L¹(µ)-computable when it is a computable point in the space of L¹(µ) functions on S.

Lemma 2.23 (Hoyrup and Rojas [2009a, Thm. 4 Claim 2 and Thm. 5 Claim 2]). Let (S, µ) be a computable probability space. A function f : S → R is L¹(µ)-computable if and only if ∫ f dµ is a computable real and, for each r ∈ N, the function f is computable on some set of µ-measure at least 1 − 2^{−r}, uniformly in r. □

In particular, note that every integrable µ-almost computable function is L¹(µ)-computable. We obtain the following immediate corollary of Lemma 2.23, using the fact that if a function is µ-almost computable with a computable µ-integral, then we can uniformly find a collection of ideal points that converge to it in L¹(µ).

Corollary 2.24. Let (S, µ) be a computable probability space. Let f_0, f_1, . . . : S → R be a sequence of uniformly µ-almost computable functions that converge effectively in L¹(µ) to a function f ∈ L¹(µ). Then f is L¹(µ)-computable. □

2.7 Almost Decidable Sets and Bases

Let (S, µ) be a computable probability space. We know that the µ-measure of a c.e. open set A ⊆ S is a c.e. real. In general, the measure of a c.e. open set is not a computable real. On the other hand, if A is moreover decidable (i.e., S \ A is also c.e. open), then µ(S \ A) is a c.e. real, and therefore, by the identity µ(A) + µ(S \ A) = 1, we have that µ(A) is a computable real. In connected spaces, the only decidable subsets are the empty set and the whole space. However, there exists a useful surrogate when dealing with measure spaces.
Denition 2.25 (Almost decidable set [Galatolo et al . 2010, Def. 3.1.3]). Let ( S , µ ) be a computable probability space. A measurable subset A ⊆ S is said to be µ -almost decidable when there are two c.e. open sets U and V such that U ⊆ A and V ⊆ S \ A and µ ( U ) + µ ( V ) = 1 . In this case we say that ( U , V ) witnesses the µ -almost decidability of A . The following lemma is immediate. Lemma 2.26 ([Gala tolo et al . 2010, Prop. 3.1.1]). Let ( S , µ ) be a computable probability space, and let A be µ -almost decidable. Then µ ( A ) is a computable real. □ While we may not be able to compute the probability measur e of ideal balls, we can compute a new basis of ideal balls for which we can. (See also Bosserho [2008, Lem. 2.15].) Lemma 2.27 ([Gala tolo et al . 2010, Thm. 3.1.2]). Let ( S , µ ) be a computable probability space, and let D S be the ideal points of S with standard enumeration { d i } i ∈ N . There is a computable sequence On the Computability of Conditional Probability 15 { r j } j ∈ N of reals, dense in the positive reals, such that the balls { B ( d i , r j )} i , j ∈ N form a basis of µ -almost decidable sets, which we call a µ -almost decidable basis . □ W e now show that ev er y c.e. open set of a computable probability space ( S , µ ) is the union of a computable sequence of µ -almost decidable subsets. Lemma 2.28 (Almost decid able subsets). Let ( S , µ ) be a computable probability space with ideal points { d i } i ∈ N , and let { r j } j ∈ N be a computable sequence of reals such that { B ( d i , r j )} i , j ∈ N is a µ -almost decidable basis. Let V be a c.e. op en set. Then, uniformly in { r j } j ∈ N and V , we can compute a sequence of µ -almost decidable sets { V k } k ∈ N such that V k ⊆ V k + 1 for each k , and  k ∈ N V k = V . Proof. Let { B k } k ∈ N be a standard enumeration of the ideal balls of S where B k = B ( d m k , q l k ) , and let E ⊆ N b e a c.e. set such that V =  k ∈ E B k . Consider the c.e. 
set

F_k := {(i, j) : δ_S(d_i, d_{m_k}) + r_j < q_{l_k}}.  (5)

Because {d_i}_{i∈N} is dense in S and {r_j}_{j∈N} is dense in the positive reals, we have, for each k ∈ N, that B_k = ⋃_{(i,j)∈F_k} B(d_i, r_j). In particular, this implies that the set F := ⋃_{k∈E} F_k is a c.e. set with V = ⋃_{(i,j)∈F} B(d_i, r_j). Let {(i_n, j_n)}_{n∈N} be a computable enumeration of F, and let V_k := ⋃_{n≤k} B(d_{i_n}, r_{j_n}), which is µ-almost decidable. By construction, V_k ⊆ V_{k+1} for each k, and ⋃_{k∈N} V_k = V. □

Using the notion of an almost decidable set, we have the following characterization of computable measures.

Corollary 2.29. Let (S, µ) be a computable probability space with ideal points {d_i}_{i∈N}, and let {r_j}_{j∈N} be a computable sequence of reals such that {B(d_i, r_j)}_{i,j∈N} is a µ-almost decidable basis. Let ν ∈ M_1(S) be a probability measure on S that is absolutely continuous with respect to µ. Then ν is computable uniformly in the sequence {ν(B(d_i, r_j))}_{i,j∈N}.

Proof. Let V be a c.e. open set of S. By Proposition 2.14, it suffices to show that ν(V) is a c.e. real, uniformly in V. By Lemma 2.28, we can compute a nested sequence {V_k}_{k∈N} of µ-almost decidable sets whose union is V. By the absolute continuity of ν with respect to µ, these sets are also ν-almost decidable. Because V is open, ν(V) = sup_{k∈N} ν(V_k), which is the supremum of a sequence of reals that is computable uniformly in the sequence {ν(B(d_i, r_j))}_{i,j∈N}. □

We close with the following extension of Corollary 2.16.

Proposition 2.30. Let S and T be computable Polish spaces, µ a probability measure on T, B a µ-almost decidable subset of T, and f : S × T → R a bounded function, computable on R × T for some R ⊆ S. Then the map s ↦ ∫_B f(s, t) µ(dt) is computable on R, uniformly in f and B.

Proof.
This follows immediately from Propositions 3.2.3 and 4.3.1 of [Hoyrup and Rojas 2009b]. □

3 CONDITIONAL PROBABILITIES AND DISTRIBUTIONS

Let µ be a probability measure on a measurable space of outcomes S, and let A, B ⊆ S be events. Informally, given that event A has occurred, the probability that event B also occurs, written µ(B | A), must satisfy µ(A) µ(B | A) = µ(A ∩ B). Clearly µ(B | A) is uniquely defined if and only if µ(A) > 0, which leads to the following definition.

Definition 3.1 (Conditioning on positive-measure events). Suppose that µ(A) > 0. Then the conditional probability of B given A, written µ(B | A), is defined by

µ(B | A) = µ(B ∩ A) / µ(A).  (6)

It is straightforward to check that, for any fixed event A ⊆ S with µ(A) > 0, the set function µ(· | A) is a probability measure. We will often be interested in the case where B and A are events of the form {Y ∈ D} and {X ∈ C}. In this case, we define the abbreviation

P{Y ∈ D | X ∈ C} := P({Y ∈ D} | {X ∈ C}).  (7)

Again, this is well-defined when P{X ∈ C} > 0. When P{X = x} > 0, we may simply write

P{Y ∈ D | X = x}  (8)

for P{Y ∈ D | X ∈ {x}}.

This elementary notion of conditioning is undefined when the conditioning event has zero measure, such as when a continuous random variable takes a particular value. In the modern formulation of conditional probability due to Kolmogorov [1933], one defines conditioning with respect to (the σ-algebra generated by) a random variable rather than an individual event. In theory, this yields a consistent solution to the problem of conditioning on the value of general (and in particular, continuous) random variables, although we will see that other issues arise. (See Kallenberg [2002, Ch. 6] for a rigorous treatment.)
In order to bridge the divide between the elementary notion of conditioning on events and the abstract approach of conditioning on random variables, consider the case of conditioning on a random variable X taking values in a countable discrete set S and satisfying P{X = x} > 0 for all x ∈ S. Let {Y ∈ B} be an event. Then the conditional probability that Y ∈ B given X, written P[Y ∈ B | X], is the random variable satisfying P[Y ∈ B | X] = P{Y ∈ B | X = x} when X = x. Note that there is a measurable function f_B : S → [0, 1] satisfying

P{Y ∈ B, X ∈ A} = ∫_A f_B(x) P_X(dx)  (9)

for all measurable subsets A ⊆ S. For sets A of the form {x}, for x ∈ S, we have P{Y ∈ B, X = x} = f_B(x) P{X = x}, hence f_B(x) = P{Y ∈ B | X = x}. In summary, P[Y ∈ B | X] = f_B(X), and so (9) yields a more abstract characterization of elementary conditional probability for positive-measure events.

The general case is captured by the same defining property. Let X be a random variable in a measurable space S. Then the conditional probability that Y ∈ B given X, written P[Y ∈ B | X], is defined to be a random variable in [0, 1] of the form f_B(X), where again f_B : S → [0, 1] is such that (9) holds for all measurable subsets A ⊆ S. In many situations, such a function f_B is itself the object of interest, and so we will let P[Y ∈ B | X = ·] denote an arbitrary such function. We may then re-express its defining property in the following more intuitive form:

P{Y ∈ B, X ∈ A} = ∫_A P[Y ∈ B | X = x] P_X(dx)  (10)

for all measurable subsets A ⊆ S.

The existence of the conditional probability P[Y ∈ B | X], or equivalently, the existence of P[Y ∈ B | X = ·], follows from the Radon–Nikodym theorem.
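In the countable discrete case just described, f_B can be computed directly from the joint distribution. The following is our own sketch (a dictionary of probabilities stands in for the joint pmf, an assumption made for illustration).

```python
def conditional_prob(joint, B):
    """Sketch (our illustration): given a joint pmf joint[(x, y)] = P{X = x, Y = y}
    with every x-marginal positive, return f_B with
    f_B[x] = P{Y in B | X = x} = P{Y in B, X = x} / P{X = x},
    as in the discussion around Eq. (9)."""
    marginal, numer = {}, {}
    for (x, y), p in joint.items():
        marginal[x] = marginal.get(x, 0.0) + p
        if y in B:
            numer[x] = numer.get(x, 0.0) + p
    return {x: numer.get(x, 0.0) / m for x, m in marginal.items()}
```

One can check the defining property (9) directly here: for any set A of x-values, summing f_B[x] · P{X = x} over x ∈ A recovers P{Y ∈ B, X ∈ A}.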
Recall that a measure µ on a measurable space S is absolutely continuous with respect to another measure ν on the same space, written µ ≪ ν, if ν(A) = 0 implies µ(A) = 0 for all measurable sets A ⊆ S.

Theorem 3.2 (Radon–Nikodym). Let S be a measurable space and let µ and ν be σ-finite measures on S such that µ ≪ ν. Then there exists a nonnegative measurable function dµ/dν : S → R_+ such that

µ(A) = ∫_A (dµ/dν) dν  (11)

for all measurable subsets A ⊆ S. □

We call any function dµ/dν satisfying Equation (11) for all measurable subsets A ⊆ S a Radon–Nikodym derivative (of µ with respect to ν). Note that if g is also a Radon–Nikodym derivative of µ with respect to ν, then g = dµ/dν outside a ν-null set, and so Radon–Nikodym derivatives are unique up to a null set. (Functions that agree a.e. are called versions.) We may safely refer to the Radon–Nikodym derivative when we want to ignore such differences, but in some cases these differences are important.

It is straightforward to verify that the function P[Y ∈ B | X = ·] is a Radon–Nikodym derivative of P{Y ∈ B, X ∈ ·} with respect to P{X ∈ ·} = P_X, both considered as measures on S, and so a function f_B satisfying (9) for all measurable subsets A ⊆ S always exists, but it is only defined up to a null set. This is inconsequential when the conditional probability P[Y ∈ B | X] is the object of interest. In applications, especially statistical ones, however, the function P[Y ∈ B | X = ·] mapping values in S to probabilities is the object of interest, and, moreover, one typically wants to evaluate this function at particular observed values x ∈ S. Because P[Y ∈ B | X = ·] is determined only up to a P_X-null set, interpreting its values at individual points is problematic. As mentioned in the introduction, the fact that general conditional probabilities are not uniquely defined at points is the subject of a large literature.
However, in some circumstances, two versions of P[Y ∈ B | X = ·] must agree at individual points. In particular, if two versions are continuous at a point in the support of the distribution P_X, then they agree at that point. In order to state this claim formally, we first recall the definition of the support of a distribution:

Definition 3.3. Let µ be a measure on a topological space S. Then the support of µ, written supp(µ), is defined to be the set of points x ∈ S all of whose open neighborhoods have positive measure, i.e.,

supp(µ) := {x ∈ S : ∀ open B ⊆ S (x ∈ B ⟹ µ(B) > 0)}.  (12)

Note that the support of µ can equivalently be defined as the smallest closed set of µ-measure one. We now state our claim formally:

Lemma 3.4. Let S be a Polish space. Suppose f_1, f_2 : S → [0, 1] satisfy P[Y ∈ B | X] = f_1(X) = f_2(X) a.s. If x ∈ S is a point of continuity of f_1 and f_2, and x ∈ supp(P_X), then f_1(x) = f_2(x). In particular, if f_1 and f_2 are continuous on a P_X-measure one set D ⊆ S, then they agree everywhere on D ∩ supp(P_X). □

The proof is immediate from the following elementary result.

Lemma 3.5. Let f_1, f_2 : S → T be two measurable functions between Polish spaces S and T, and suppose that f_1 = f_2 almost everywhere with respect to some measure µ on S. Let D ⊆ S be a set of µ-measure one. If x ∈ S is a point of continuity of f_1 and f_2 on D, and x ∈ supp(µ), then f_1(x) = f_2(x). In particular, if f_1 and f_2 are continuous on D, then they agree everywhere on D ∩ supp(µ).

Proof. Let δ_T be any metric under which T is complete. Define the measurable function g : S → R by

g(x) = δ_T(f_1(x), f_2(x)).  (13)

We know that g = 0 µ-a.e., and also that g is continuous at x on D, because f_1 and f_2 are continuous at x on D and δ_T is continuous (on all of T × T).
Assume, for the purpose of contradiction, that g(x) = ε > 0. By the continuity of g at x on D, there is an open neighborhood B of x such that g(B ∩ D) ⊆ (ε/2, 3ε/2). But x ∈ supp(µ), hence µ(B ∩ D) = µ(B) > 0, contradicting g = 0 µ-a.e. □

The observation that continuity gives a unique answer to conditioning on zero-measure events of the form {X = x} is an old one, going back to at least Tjur [1974].

3.1 Conditional Distributions

For a pair of random variables X and Y taking values in measurable spaces S and T, respectively, it is natural to consider not just the individual conditional probabilities P[Y ∈ B | X], for measurable subsets B ⊆ T, but the entire conditional distribution P[Y | X] := P[Y ∈ · | X]. Unfortunately, the fact that Radon–Nikodym derivatives are only defined up to a null set can cause problems. In particular, while it is the case that

∑_j P[Y ∈ B_j | X] = P[Y ∈ B | X] a.s.  (14)

for every countable measurable partition B_0, B_1, . . . of a measurable set B ⊆ T, the random set function B ↦ P[Y ∈ B | X] need not be a measure in general, because the exceptional null set may depend on the partition. However, when T is Polish, we can construct versions of the conditional probabilities that combine to produce a measure. In order to make this definition precise, we recall the notion of a probability kernel.

Definition 3.6 (Probability kernel). Let S and T be Polish spaces. A function κ : S × B_T → [0, 1] is called a probability kernel (from S to T) when (1) for every s ∈ S, the function κ(s, ·) is a probability measure on T; and (2) for every B ∈ B_T, the function κ(·, B) is measurable.

For every κ : S × B_T → [0, 1], let κ̄ be the map s ↦ κ(s, ·). It can be shown that κ is a probability kernel from S to T if and only if κ̄ is a (Borel) measurable function from S to M_1(T) [Kallenberg 2002, Lem.
1.40], where we adopt the weak topology on M_1(T), which is Polish because T is.

We say that a conditional distribution P[Y | X] has a regular version when, for some probability kernel κ from S to T,

P[Y ∈ B | X] = κ(X, B) a.s.  (15)

for every measurable subset B ⊆ T. In this case, we say that κ̄(X) is a regular version of the conditional distribution.

Proposition 3.7 (Regular versions [Kallenberg 2002, Lem. 6.3]). Let X and Y be random variables in a Polish space S and a measurable space T, respectively. Then there is a regular version of the conditional distribution P[Y | X], which is, moreover, determined by the joint distribution of X and Y. □

As with the derivatives underlying conditional probabilities, κ̄ is only defined up to a P_X-null set. When such a kernel κ exists, i.e., when there is a regular version of the conditional distribution P[Y | X], we define P[Y | X = ·] to be equal to some arbitrary version of κ̄.

Despite the fact that the kernels underlying regular versions of conditional distributions are defined only up to sets of measure zero, it follows immediately from Lemma 3.5 that when S and T are Polish, any two versions of P[Y | X = ·] that are continuous on some subset of the support of P_X must agree on that subset. More carefully, let κ̄_1(X) and κ̄_2(X) be regular versions of the conditional distribution P[Y | X]. If x ∈ S is a point of continuity of κ̄_1 and κ̄_2, and x ∈ supp(P_X), then κ̄_1(x) = κ̄_2(x). In particular, if both maps are continuous on a set D ⊆ S, then they agree everywhere on D ∩ supp(P_X).

When conditioning on a random variable whose distribution concentrates on a countable set, it is well known that a regular version of the conditional distribution can be built by elementary conditioning with respect to single events.
This includes the special case of conditioning on discrete random variables, i.e., those concentrating on a countable discrete subspace.

Lemma 3.8. Let X and Y be random variables in Polish spaces S and T, respectively. Suppose the distribution of X concentrates on a countable set R ⊆ S, i.e., P_X(R) = 1 and x ∈ R implies P_X{x} > 0. Let ν be an arbitrary probability measure on T. Define the function κ : S × B_T → [0, 1] by

κ(x, B) := P{Y ∈ B | X = x}  (16)

for all x ∈ R, and by κ̄(x) = ν for x ∉ R. Then κ is a probability kernel and κ̄(X) is a regular version of the conditional distribution P[Y | X].

Proof. The function κ is well-defined because P{X = x} > 0 for all x ∈ R. It follows that κ̄(x) is a probability measure for every x. Because R is countable, κ̄ is also measurable, and so κ is a probability kernel from S to T. Note that P{X ∈ R} = 1, and so, for all measurable sets A ⊆ S and B ⊆ T, we have

∫_A κ(x, B) P_X(dx) = ∑_{x∈R∩A} P{Y ∈ B | X = x} P{X = x}  (17)
 = ∑_{x∈R∩A} P{Y ∈ B, X = x}  (18)
 = P{Y ∈ B, X ∈ A}.  (19)

That is, κ(X, B) is the conditional probability of the event {Y ∈ B} given X, and so κ̄(X) is a regular version of the conditional distribution P[Y | X]. □

3.2 Dominated Families

Beyond the setting of conditioning on discrete random variables, explicit formulas for conditional distributions are also available when Bayes' rule applies. We begin by introducing the notion of a dominated kernel. (The usual terminology, such as dominated families or models, refers to measurable families of probability measures, i.e., probability kernels.)

Definition 3.9 (dominated kernel). A probability kernel κ from T to S is dominated when there is a σ-finite measure ν on S such that κ̄(t) ≪ ν for every t ∈ T.
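A standard example of Definition 3.9 (our illustration, not an example from the paper): the Gaussian location family, dominated by Lebesgue measure on R.

```python
import math

def gaussian_kernel_density(x, y, sigma=1.0):
    """Sketch (our illustration): the kernel kappa(y, .) = N(y, sigma^2) from R
    to R is dominated by Lebesgue measure nu on R; its Radon-Nikodym derivative
    with respect to nu is the normal pdf below, which thus serves as a
    conditional density p_{X|Y}(x | y)."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
```

Because the same dominating measure ν works for every value of y, this family is dominated in the sense above, and Bayes' rule (below) applies to it.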
Let X and Y be random variables in Polish spaces S and T, respectively, and let κ̄_{X|Y}(Y) be a regular version of P[X | Y] such that κ_{X|Y} is dominated. Then there exists a (product) measurable function p_{X|Y}(· | ·) : S × T → R_+ such that p_{X|Y}(· | y) is a Radon–Nikodym derivative of κ̄_{X|Y}(y) with respect to ν for every y ∈ T, i.e.,

κ_{X|Y}(y, A) = ∫_A p_{X|Y}(x | y) ν(dx) (20)

for every measurable set A ⊆ S and every y ∈ T.

Definition 3.10 (conditional density). We call any such function p_{X|Y} a conditional density of X given Y (with respect to ν).

Common finite-dimensional, parametric families of distributions (e.g., exponential families like Gaussian, gamma, etc.) are dominated, and so, in probabilistic models composed from these families, conditional densities exist and Bayes' rule gives a formula for expressing the conditional distribution. We give a proof of this classic result for completeness.

Lemma 3.11 (Bayes' rule [Schervish 1995, Thm. 1.13]). Let X and Y be random variables as in Proposition 3.7, and assume that there exists a conditional density p_{X|Y} of X given Y with respect to a σ-finite measure ν. Let µ be an arbitrary distribution on T and define κ : S × B_T → [0, 1] by

κ(x, B) = ∫_B p_{X|Y}(x | y) P_Y(dy) / ∫ p_{X|Y}(x | y) P_Y(dy), B ∈ B_T, (21)

for those points x ∈ S where the denominator is positive and finite, and by κ̄(x) = µ otherwise. Then κ is a probability kernel and κ̄(X) is a regular version of the conditional distribution P[Y | X].

Proof. Let κ̄_{X|Y}(Y) be a regular version of the conditional distribution P[X | Y]. By hypothesis, κ_{X|Y} is dominated by ν and p_{X|Y} is a conditional density with respect to ν.
By Proposition 3.7 and Fubini's theorem, for measurable sets A ⊆ S and B ⊆ T, we have that

P{X ∈ A, Y ∈ B} = ∫_B κ_{X|Y}(y, A) P_Y(dy) (22)
= ∫_B ( ∫_A p_{X|Y}(x | y) ν(dx) ) P_Y(dy) (23)
= ∫_A ( ∫_B p_{X|Y}(x | y) P_Y(dy) ) ν(dx). (24)

Taking B = T, we have

P_X(A) = ∫_A ( ∫ p_{X|Y}(x | y) P_Y(dy) ) ν(dx). (25)

Because P_X(S) = 1, this implies that the set of points x for which the denominator of the right-hand side of (21) is infinite has ν-measure zero, and thus P_X-measure zero. Taking A to be the set of points x for which the denominator is zero, we see that P_X(A) = 0. It follows that (21) characterizes κ up to a P_X-null set. By (25), we see that the denominator is a density of P_X with respect to ν, and so we have

∫_A κ(x, B) P_X(dx) = ∫_A κ(x, B) ( ∫ p_{X|Y}(x | y) P_Y(dy) ) ν(dx), (26)

for all measurable sets A ⊆ S and B ⊆ T. Finally, by the definition of κ, Equation (24), and the fact that the denominator is positive and finite for P_X-almost every x, we see that κ̄(X) is a regular version of the conditional distribution P[Y | X]. □

Comparing Bayes' rule (21) to the definition of conditional density (20), we see that any conditional density of Y given X (with respect to P_Y) satisfies

p_{Y|X}(y | x) = p_{X|Y}(x | y) / ∫ p_{X|Y}(x | y′) P_Y(dy′), (27)

for P_{(X,Y)}-almost every (x, y). The following result suggests why the mere a.e. definedness of conditional distributions can be ignored by those working entirely within the framework of dominated families.

Proposition 3.12. Let X and Y be random variables on Polish spaces S and T, respectively, let κ̄(X) be a regular version of the conditional distribution P[Y | X], and let R ⊆ S.
If a conditional density p_{X|Y}(x | y) of X given Y is continuous on R × T, positive, and bounded, then κ as defined in (21) is a version of κ̄ that is continuous on R. In particular, if R is a P_X-measure one subset, then κ is a P_X-almost continuous version.

We defer the proof to Section 9.2. We will use this result in the proof of Lemma 7.3, towards our central result.

4 COMPUTABLE CONDITIONAL PROBABILITIES AND DISTRIBUTIONS

Before we lay the foundations for the remainder of the paper and define notions of computability for conditional probability and conditional distributions in the abstract setting, we address the computability of distributions conditioned on positive-measure sets. In order for the distributions obtained from positive-measure sets to be computable, we will need the conditioning events to be almost decidable sets.

Lemma 4.1 ([Galatolo et al. 2010, Prop. 3.1.2]). Let (S, µ) be a computable probability space and let A be a µ-almost decidable subset of S satisfying µ(A) > 0. Then µ(· | A) is a computable probability measure, uniformly in a witness to the µ-almost decidability of A.

Proof. By Lemma 2.27 there is a µ-almost decidable basis for S. Note that µ(· | A) is absolutely continuous with respect to µ. Hence, by Corollary 2.29, it suffices to show that µ(B ∩ A)/µ(A) is computable for a µ-almost decidable set B, uniformly in witnesses to the µ-almost decidability of A and B. All subsequent statements in this proof are uniform in both. Now, B ∩ A is µ-almost decidable with computable witness, and so its measure, the numerator, is a computable real. The denominator is likewise the measure of a set that is almost decidable with computable witness, hence is a computable real. Finally, the ratio of two computable reals is itself computable. □

In the abstract setting, conditional probabilities are random variables.
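The ratio µ(B ∩ A)/µ(A) in the proof of Lemma 4.1 can be computed exactly in a simple setting. The sketch below takes µ to be Lebesgue measure on [0, 1] and restricts to finite unions of rational intervals (an assumed toy setting in which measures are exact rationals rather than limits of approximations).

```python
from fractions import Fraction as F

# Sketch of Lemma 4.1: conditioning a computable measure on a positive-measure
# event A. Here mu is Lebesgue measure on [0, 1] and sets are finite unions of
# rational intervals, so every measure involved is an exact rational.
def measure(intervals):
    return sum(b - a for a, b in intervals)

def intersect(I, J):
    out = []
    for a, b in I:
        for c, d in J:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                out.append((lo, hi))
    return out

A = [(F(0), F(1, 3))]                            # conditioning event, mu(A) = 1/3 > 0
B = [(F(1, 4), F(3, 4))]                         # query event
cond = measure(intersect(B, A)) / measure(A)     # mu(B | A) = mu(B ∩ A) / mu(A)
assert cond == F(1, 4)
```

In the general computable setting the numerator and denominator are only approximated to any desired accuracy, but the division step is the same.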
In many applications of probability, including statistics, the conditional probability map, or some version of it, is the actual object of interest, and so the computability of this map is our focus. Let B ⊆ T be a measurable set. Viewing P[Y ∈ B | X = ·] as a function from S to [0, 1], recall that we can speak formally as to whether this function is everywhere computable, P_X-almost computable, and/or L1-computable. Recall also that the function P[Y ∈ B | X = ·] may have many versions that agree only up to a null set. Despite this, their almost computability does not differ (up to a change in domain by a null set).

Lemma 4.2. Let f be a measurable function from a computable probability space (S, µ) to a computable Polish space T. If any version of f is computable on a µ-measure p set, then every version of f is computable on a µ-measure p set. In particular, if one version is µ-almost computable, then all versions are.

Proof. Let f be computable on a µ-measure p set D, and let g be a version of f, i.e., Z := {s ∈ S : f(s) ≠ g(s)} is a µ-null set. Therefore, f = g on D \ Z. Hence g is computable on the µ-measure p set D \ Z. If f is µ-almost computable, then it is computable on a µ-measure one set, and so g is as well. □

We can develop notions of computability for conditional distributions in a similar way. We begin by characterizing the computability of probability kernels.

Definition 4.3 (Computable probability kernel). Let S and T be computable Polish spaces and let κ : S × B_T → [0, 1] be a probability kernel from S to T. Then we say that κ is a computable probability kernel when κ̄ : S → M1(T) given by κ̄(s) := κ(s, ·) is a computable function in the ordinary sense between S and the computable Polish space M1(T) induced by T. Similarly, we say that κ is computable on a subset D ⊆ S when κ̄ is computable on D.
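The two views of a kernel in Definition 4.3 can be made concrete. The sketch below represents a kernel on assumed finite toy spaces S = {0, 1} and T = {'a', 'b'} both as the map κ̄ (points to measures) and as the evaluation κ(s, B), and checks that each κ̄(s) is a probability measure.

```python
from fractions import Fraction as F

# Sketch of Definition 4.3: a probability kernel kappa from S to T, given by
# the map kappa_bar sending each s to the measure kappa(s, .). The spaces and
# values are hypothetical toy choices.
def kappa_bar(s):
    return {'a': F(1, 3), 'b': F(2, 3)} if s == 0 else {'a': F(3, 4), 'b': F(1, 4)}

def kappa(s, B):
    # Evaluation view: kappa(s, B) = kappa_bar(s)(B) for a subset B of T.
    return sum(p for t, p in kappa_bar(s).items() if t in B)

for s in (0, 1):
    # Each kappa_bar(s) is a probability measure on T.
    assert sum(kappa_bar(s).values()) == 1
assert kappa(0, {'a', 'b'}) == 1 and kappa(1, {'a'}) == F(3, 4)
```

In the computable setting, S and T are infinite and κ̄ must be computable as a function into M1(T) with the Prokhorov metric; the finite example only illustrates the two equivalent interfaces.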
As we will see, this notion of computability corresponds with a more direct notion of computability for κ, which we now develop. We begin by noting that the collection of sets of the form

P_T(A, q) := {µ ∈ M1(T) : µ(A) > q} (28)

for A open and q rational, forms a subbasis for the weak topology on M1(T) (which is the topology induced by the Prokhorov metric). Indeed, it suffices for A to range over finite unions of some countable basis of T. We will omit mention of T when the ambient space is clear from context. The next result relates balls in the Prokhorov metric to the subbasis elements above. Recall that δ_P denotes the Prokhorov metric and that the collection D_P of measures with finitely many point masses on elements of D_T, each assigned rational mass, forms a dense set.

Proposition 4.4 ([Gács 2005, Prop. B.17]). Let ν, µ ∈ M1(T), and assume that ν is supported on a finite set S. Then the condition δ_P(ν, µ) < ϵ is equivalent to the finite set of conditions

µ(A^ϵ) > ν(A) − ϵ (29)

for all A ⊆ S. □

The next corollary states that we can compute a representation for a Prokhorov ball in terms of the subbasis elements. The sets are easily defined from those in Proposition 4.4.

Corollary 4.5. Uniformly in ν ∈ D_P and ϵ ∈ Q, we can compute a finite collection of pairs (A_i, q_i)_{i ≤ n}, each A_i a finite union of open balls of radius ϵ around elements of D_T and each q_i a rational, such that

{µ ∈ M1(T) : δ_P(µ, ν) < ϵ} = ⋂_{i ≤ n} P(A_i, q_i). □ (30)

Finally, as a direct consequence of [Hoyrup and Rojas 2009b, Prop. 4.2.1], these subbasis elements are c.e. open.

Proposition 4.6. Let A be a c.e. open subset of T and q be a rational. Then the set P(A, q) is c.e. open in the Prokhorov metric, uniformly in A and q. □

Recall that a lower semicomputable function from a computable Polish space to [0, 1] is one for which the preimage of (q, 1] is c.e.
open, uniformly in rationals q. Furthermore, we say that a function f from a computable Polish space S to [0, 1] is lower semicomputable on D ⊆ S when there is a uniformly computable sequence {U_q}_{q ∈ Q} of c.e. open sets such that

f^{−1}((q, 1]) ∩ D = U_q ∩ D. (31)

We can also interpret a computable probability kernel κ as a computable map sending each c.e. open set A ⊆ T to a lower semicomputable function κ(·, A).

Lemma 4.7. Let S and T be computable Polish spaces, let κ be a probability kernel from S to T, and let D ⊆ S. If κ̄ is computable on D, then κ(·, A) is lower semicomputable on D uniformly in the c.e. open set A, and conversely.

Proof. Let q ∈ (0, 1) be rational, let A ⊆ T be c.e. open, and define I := (q, 1]. Then

κ(·, A)^{−1}[I] = {x : κ̄(x)(A) ∈ I} = κ̄^{−1}[P(A, q)], (32)

where P(A, q) is as in (28). By Proposition 4.6, P(A, q) is c.e. open. Suppose κ̄ is computable on D. Then there is a c.e. open set V_{A,q}, uniformly computable in q and A, such that

V_{A,q} ∩ D = κ̄^{−1}[P(A, q)] ∩ D = κ(·, A)^{−1}[I] ∩ D, (33)

and so κ(·, A) is lower semicomputable on D, uniformly in A. Conversely, suppose κ(·, A) is lower semicomputable on D, uniformly in A. Then by (32), uniformly in A and q, we can find a c.e. open V_{A,q} such that (33) holds. By Corollary 4.5, every basic open ball in the Prokhorov metric is the finite intersection of sets of the form P(A, q), which are c.e. open themselves because a finite intersection of c.e. open sets is c.e. open. Therefore, uniformly in a c.e. open set U in the Prokhorov metric, we can find a c.e. open set V in S such that

V ∩ D = κ̄^{−1}[U] ∩ D. (34)

Hence κ̄ is computable on D. □

Let X and Y be random variables in computable Polish spaces S and T, respectively, and let κ̄(X) be a regular version of the conditional distribution P[Y | X].
The above notions of computability are suitable for talking about the computability of κ or any other version of it, and are appropriate notions of computability for statistical applications. Intuitively, a probability kernel κ is computable when, for some (and hence for any) version of κ, there is a program that, given as input a representation of a point s ∈ S, outputs a representation of the measure κ̄(s) for P_X-almost every input s.

5 DISCONTINUOUS CONDITIONAL DISTRIBUTIONS

Our study of the computability of conditional distributions begins at the following roadblock: a conditional distribution need not have any version that is continuous or even almost continuous (in the sense described in Section 4). This will rule out almost computability (though not L1-computability). We will work with the standard effective presentations of the spaces R, N, {0, 1}, as well as product spaces thereof, as computable Polish spaces. For example, we will use R under the Euclidean metric, along with the ideal points Q under their standard enumeration.

Recall that a random variable C is a Bernoulli(p) random variable when P{C = 1} = 1 − P{C = 0} = p. A random variable N is a geometric(p) random variable when it takes values in N = {0, 1, 2, . . .} and satisfies

P{N = n} = p^n (1 − p) (35)

for all n ∈ N. A random variable that takes values in a finite set is uniformly distributed when it assigns equal probability to each element. A continuous random variable U on the unit interval is uniformly distributed when the probability that it falls in the subinterval [ℓ, r] is r − ℓ. It is easy to show that the distributions of these random variables are computable, provided p, ℓ, and r are computable reals and p ∈ [0, 1].
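For the geometric distribution, computability is especially transparent when p is rational: the pmf (35) is exactly computable, and the tail mass gives an effective truncation bound. A minimal sketch:

```python
from fractions import Fraction as F

# Sketch: the geometric(p) pmf of (35) is exactly computable for rational p,
# and partial sums miss exactly the tail mass p^{k+1}, giving an effective
# error bound for any probability query.
p = F(1, 2)

def pmf(n):
    return p ** n * (1 - p)            # P{N = n} = p^n (1 - p)

def prob_leq(k):
    return sum(pmf(n) for n in range(k + 1))

# The partial sum up to k misses exactly the tail p^{k+1}:
assert prob_leq(10) == 1 - p ** 11
# So |P{N in B} - P{N in B, N <= k}| <= p^{k+1}, which goes to 0 effectively.
```

The same pattern (exact finite computation plus an effectively vanishing tail bound) is what makes these distributions computable in the formal sense.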
Let N, C, and U be independent P-almost computable random variables such that N is a geometric(1/2) random variable, C is a Bernoulli(1/2) random variable, and U is a uniformly distributed random variable in [0, 1]. Fix a computable enumeration {r_i}_{i ∈ N} of the rational numbers (without repetition) in (0, 1), and consider the random variable

X := r_N, if C = 1; U, otherwise, (36)

which is also P-almost computable because it is a computable function of C, U, and N.

Proposition 5.1. Every version of P[C = 1 | X = ·] is discontinuous everywhere on every P_X-measure one set. In particular, no version is P_X-almost computable.

Proof. Note that P{X rational} = 1/2 and, furthermore, P{X = r_k} = 2^{−k−2} > 0. Therefore, any two versions of P[C = 1 | X = ·] must agree on all rationals in [0, 1]. In addition, because P_U ≪ P_X, i.e.,

P{U ∈ A} > 0 ⟹ P{X ∈ A} > 0 (37)

for all measurable sets A ⊆ [0, 1], any two versions must agree on a Lebesgue-measure one set of the irrationals in [0, 1]. An elementary calculation shows that

P{C = 1 | X rational} = 1, (38)

while

P{C = 1 | X irrational} = 0. (39)

It is also straightforward to verify that C and X are conditionally independent, given an indicator for the event {X rational}. Therefore, all versions f of P[C = 1 | X = ·] satisfy, for P_X-almost every x,

f(x) = 1, if x is rational; 0, if x is irrational. (40)

The right-hand side, considered as a function of x, is called the Dirichlet function, and is nowhere continuous. Suppose some version f were continuous at a point y of a P_X-measure one set R. Then there would exist an open interval I containing y such that the image of I ∩ R contains 0 or 1, but not both. However, R must contain all rationals in I and Lebesgue-almost every irrational in I.
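As an aside, the random variable X of (36) is easy to sample, which is the intuitive content of its almost computability; the pathology lies only in the conditional probability map. The sketch below uses one concrete (assumed) repetition-free enumeration of the rationals in (0, 1), by increasing denominator.

```python
import random
from fractions import Fraction

# Sampler for X of (36): X = r_N when C = 1, and X = U otherwise.
def nth_rational(n):
    # r_n in a fixed repetition-free enumeration of the rationals in (0, 1).
    seen, q = [], 2
    while True:
        for p in range(1, q):
            r = Fraction(p, q)
            if r not in seen:
                seen.append(r)
                if len(seen) > n:
                    return r
        q += 1

def sample_X(rng):
    N = 0
    while rng.random() < 0.5:        # geometric(1/2): P{N = n} = 2^{-n-1}
        N += 1
    if rng.random() < 0.5:           # C = 1: land exactly on the rational r_N
        return float(nth_rational(N))
    return rng.random()              # C = 0: uniform U on [0, 1]

rng = random.Random(0)
xs = [sample_X(rng) for _ in range(1000)]
# About half the mass lands exactly on low-denominator rationals r_N.
hits = sum(1 for x in xs
           if any(abs(x - p / q) < 1e-12 for q in range(2, 16) for p in range(1, q)))
assert 350 < hits < 650
```

Sampling to any accuracy is unproblematic; by Proposition 5.1, it is recovering P[C = 1 | X = ·] from such samples, or from any finite description of the joint distribution, that fails.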
Furthermore, the image of every rational in I ∩ R is 1, and the image of Lebesgue-almost every irrational in I ∩ R is 0, a contradiction. □

Although we cannot hope to compute P[C = 1 | X = ·] on a P_X-measure one set, we can compute it in a weaker sense.

Proposition 5.2. P[C = 1 | X = ·] is L1(P_X)-computable.

Proof. By Corollary 2.24, it suffices to construct a sequence of uniformly P_X-almost computable functions that converge effectively in L1(P_X) to P[C = 1 | X = ·]. Let ϕ_k = (1/2) min_{m < n ≤ k} |r_m − r_n| be half the minimum distance between any pair among r_0, . . . , r_k, and define, for every k ∈ N,

f_k(x) := 1, if |x − r_n| < (1/((k + 1)√2)) min(2^{−k−2}, ϕ_k) holds for some n ≤ k; 0, otherwise. (41)

Note that the set on which f_k takes the value 1 is uniformly P_X-almost decidable, in part because its boundary points are irrationals, a null set. It is then clear that the functions f_k, for k ∈ N, are uniformly P_X-almost computable. For every k ∈ N, we have that P{X = r_n for some n > k} = 2^{−k−2}. Therefore,

∫ |P[C = 1 | X = x] − f_k(x)| P_X(dx) (42)
≤ P{X = r_n for some n > k} + ∫_{[0,1]\Q} |P[C = 1 | X = x] − f_k(x)| P_X(dx) (43)
≤ 2^{−k−2} + (1/((k + 1)√2)) Σ_{n=0}^{k} 2^{−k−2} < 2^{−k−1}, (44)

completing the proof. □

6 CONDITIONING IS DISCONTINUOUS

Conditioning in general can produce discontinuous conditional distributions, which is an obstruction to a conditioning operator being computable. But even if we restrict our attention to distributions that admit conditional distributions that are continuous on their support, the operation of conditioning cannot be computable because, as we will show, it is discontinuous. Indeed, conditioning is discontinuous in a rather strong way. We use the recursion theorem to explain the computational consequences.
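Before moving on, the error bound (42)-(44) for the approximants f_k of Proposition 5.2 can be evaluated directly. The sketch below uses one concrete (assumed) enumeration of the rationals and computes the two error contributions: the rational atoms with n > k, and the irrational mass inside the k + 1 small intervals, where P_X has density 1/2 from the uniform component.

```python
import math
from fractions import Fraction

# Direct check of (42)-(44): the L1(P_X) error of f_k from (41) is < 2^{-k-1}.
def first_rationals(m):
    # r_0, ..., r_{m-1} of a fixed repetition-free enumeration of Q ∩ (0, 1).
    out, q = [], 2
    while len(out) < m:
        for p in range(1, q):
            r = Fraction(p, q)
            if r not in out:
                out.append(r)
        q += 1
    return [float(r) for r in out[:m]]

def l1_error(k):
    r = first_rationals(k + 1)                  # r_0, ..., r_k
    phi_k = 0.5 * min(abs(a - b) for i, a in enumerate(r) for b in r[:i])
    radius = min(2.0 ** (-k - 2), phi_k) / ((k + 1) * math.sqrt(2))
    # (i) rational atoms r_n with n > k: true value 1, f_k = 0.
    err_rational = 2.0 ** (-k - 2)
    # (ii) irrationals in the k+1 intervals: f_k = 1, true value 0;
    #      P_X has density 1/2 there (the uniform component of X).
    err_irrational = (k + 1) * (2 * radius) * 0.5
    return err_rational + err_irrational

for k in range(1, 12):
    assert l1_error(k) < 2.0 ** (-k - 1)
```

The factor 1/((k + 1)√2) in (41) keeps the intervals both small enough for the bound and irrational-endpointed, so that the sets are almost decidable.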
Namely, for any potential program analysis that aims to perform conditioning on an arbitrary given distribution, there is a representation of that distribution such that the program analysis cannot identify a single nontrivial fact about its conditional distribution. To begin, we formalize the notion of a conditioning operator.

Definition 6.1. Let F ⊆ M1([0, 1]²) be a set of probability measures. A map Φ : M1([0, 1]²) × [0, 1] → M1([0, 1]) is a conditioning operator (for F) if, for all distributions µ ∈ F and random variables X and Y with joint distribution µ, we have Φ(µ, x) = P[Y | X = x] for P_X-almost all x.

Observe, by Proposition 3.7, that there is a conditioning operator for every F ⊆ M1([0, 1]²).

Definition 6.2. Let F ⊆ M1([0, 1]²). A conditioning operator for F is computable if it is computable on F × [0, 1], considered as a function M1([0, 1]²) × [0, 1] → M1([0, 1]), where both M1([0, 1]²) × [0, 1] and M1([0, 1]) are taken to be the canonical computable Polish spaces.

The previous section motivates restricting one's attention to conditioning operators for the set F_0 ⊆ M1([0, 1]²) of probability distributions on pairs (X, Y) of random variables in [0, 1] such that there exists a P_X-almost continuous version of the conditional distribution map P[Y | X = ·]. We will show that conditioning operators for F_0 are not computable, simply on grounds of continuity.

Recall that the name of a probability measure µ ∈ M1(T) on a computable Polish space T is given by a Cauchy sequence in the dense elements D_{P,T} of the associated Prokhorov metric. Note that F_0 contains D_{P,[0,1]²}. Further recall that P_T(A, q) is defined to be the set {η ∈ M1(T) : η(A) > q}, for any open set A ⊆ T and rational q ∈ Q. Let

A(T) := {(A, q) : A is a finite union of open balls in T, and q ∈ Q}.
Given a computable Cauchy sequence in the Prokhorov metric that converges to a measure µ ∈ M1(T), by Corollary 4.5 we can compute a sequence ⟨A_i, q_i⟩_{i ∈ N} in A(T) such that ⋂_{i ∈ N} P_T(A_i, q_i) = {µ}. Further, by Proposition 4.6, given a finite sequence ⟨A_i, q_i⟩_{i ≤ n} in A(T), we can compute, uniformly in ⟨A_i, q_i⟩_{i ≤ n}, a ball in the Prokhorov metric contained in ⋂_{i ≤ n} P_T(A_i, q_i). Therefore, uniformly in a probability measure µ and a collection ⟨A_i, q_i⟩_{i ∈ N} with ⋂_{i ∈ N} P_T(A_i, q_i) = {µ}, we can computably recover a name for µ. Conversely, from a name for µ, we can uniformly compute such a collection.

Lemma 6.3. Let F ⊆ M1([0, 1]²) contain D_{P,[0,1]²}. For every α ∈ D_{P,[0,1]}, computable representation {ν_i}_{i ∈ N} of ν ∈ M1([0, 1]²), computable representation {x_i}_{i ∈ N} of x ∈ [0, 1], and rational ϵ ∈ (0, 1), we can uniformly find a measure µ ∈ D_{P,[0,1]²} such that δ_P(µ, ν) < ϵ and Φ(µ, x) = α for every conditioning operator Φ for F.

Proof. Relative to {ν_i}_{i ∈ N} and {x_i}_{i ∈ N}, we can computably find an element p* ∈ D_{P,[0,1]²} such that p*({x} × [0, 1]) = 0 and δ_P(p*, ν) < ϵ/2. Let δ_x denote the Dirac measure on [0, 1] concentrating on x, and let τ ⊗ τ′ denote the product measure on [0, 1]² with respective marginal distributions τ, τ′ ∈ M1([0, 1]). Defining

p := (ϵ/2)(δ_x ⊗ α) + (1 − ϵ/2) p*,

it is easy to check that δ_P(p, p*) ≤ ϵ/2, hence δ_P(ν, p) < ϵ. Because p*({x} × [0, 1]) = 0 and p* is a finite mixture of point masses, every conditioning operator Φ for F must satisfy Φ(p, x) = α, and so we may take µ := p. □

Proposition 6.4. Let F ⊆ M1([0, 1]²) contain D_{P,[0,1]²}. Every conditioning operator on F is discontinuous everywhere, hence noncomputable.

Proof. On M1([0, 1]), adopt the weak topology (induced by the standard topology on [0, 1]).
On M1([0, 1]²) × [0, 1], adopt the product topology induced by the weak and standard topologies, respectively. Then Lemma 6.3 implies that every conditioning operator Φ for F is discontinuous everywhere. Hence every conditioning operator is noncomputable. □

The above definitions and proposition capture the essential difficulty of conditioning: a finite approximation to the joint distribution determines nothing about the result of conditioning on a particular point. We now establish a stronger notion of noncomputability, namely that it is not even possible to always produce some nontrivial fact about a conditional distribution.

For e ∈ N, let φ_e denote the partial computable function defined by code e. The recursion theorem, due to Kleene [1938], states that when F is a total computable function, there is some integer i for which the partial computable functions φ_i and φ_{F(i)} are equal partial functions (i.e., they are defined on the same inputs, and are equal where they are defined). For more details, see [Rogers 1987, Ch. 11].

Definition 6.5. A program φ_e : N → N represents a distribution µ_e on a computable Polish space T if it is total and, on input k, the output value φ_e(k) is a code for a pair (A_k, q_k) ∈ A(T) such that {µ_e} = ⋂_{k ∈ N} P_T(A_k, q_k).

Definition 6.6. A program φ_a : N³ → N is a conditioning program for F ⊆ M1([0, 1]²) if it is total and, whenever e represents a computable distribution µ_e ∈ F, there exists a conditioning operator Φ for F such that, for every code j ∈ N for a computable real x ∈ [0, 1] and for every k ∈ N, the return value φ_a(e, j, k) is a code for either the empty string or an element (A, q) ∈ A([0, 1]) such that Φ(µ_e, x) ∈ P_{[0,1]}(A, q) and P_{[0,1]}(A, q) ≠ M1([0, 1]).

Theorem 6.7 (Nonapproximable conditional distributions).
Suppose that φ_a is a conditioning program for a set F ⊆ M1([0, 1]²) containing D_{P,[0,1]²}. Let e ∈ N be a code for a computable distribution µ_e on [0, 1]², and let j ∈ N be a code for a computable real x ∈ [0, 1]. Then, uniformly in a, e, and j, we can compute an i ∈ N such that µ_e = µ_i and φ_a(i, j, k) is a code for the empty string for every k ∈ N.

Proof. Uniformly in a, we can compute some b ∈ N such that for all n, m, r ∈ N: if the value φ_a(n, m, r′) is a code for the empty string for all r′ ≤ r, then φ_b(n, m, r) is also a code for the empty string; and otherwise φ_b(n, m, r) = φ_a(n, m, r′), where r′ is the least index such that φ_a(n, m, r′) is not a code for the empty string. Note that for each n, m ∈ N, the value φ_b(n, m, k) is a code for the empty string for all k ∈ N if and only if φ_a(n, m, k) is a code for the empty string for all k.

Let η_{n,j} denote the least index k ∈ N such that φ_b(n, j, k) is not a code for the empty string, if such a k exists, and ∞ otherwise. Note that for each k ∈ N, we can compute (uniformly in n, e, a, and j) whether or not k < η_{n,j} (even though the finiteness of η_{n,j} may not be computable). For k ∈ N, let (A_k, q_k) be the pair coded by φ_e(k). Define the total computable function F : N → N such that for n, k ∈ N,

φ_{F(n)}(k) = φ_e(k), if k < η_{n,j}; φ_{e′}(k − η_{n,j}), if k ≥ η_{n,j}, (45)

where e′ ∈ N is defined as follows. Let (A′, q′) be the pair coded by φ_b(n, j, η_{n,j}). First, compute a Prokhorov ball B ⊆ M1([0, 1]²) contained within ⋂_{ℓ ≤ η_{n,j}} P_{[0,1]²}(A_ℓ, q_ℓ). Next, compute some α ∈ M1([0, 1]) such that α(A′) = 0, and hence α ∉ P_{[0,1]}(A′, q′). Then, by Lemma 6.3, compute a code e′ for a distribution ν ∈ B such that Φ(ν, x) = α for every conditioning operator Φ for F.
By the recursion theorem, we can compute an index i, uniformly in a, e, and j, such that φ_{F(i)} = φ_i. We now argue that η_{i,j} = ∞, which implies that φ_i = φ_e by (45). Suppose, for a contradiction, that η_{i,j} ∈ N. Then φ_b(i, j, η_{i,j}) = (A′, q′) for some (A′, q′) ∈ A([0, 1]). Hence, as φ_a is a conditioning program, there is some conditioning operator Φ for F such that Φ(µ_i, x) ∈ P_{[0,1]}(A′, q′), where µ_i is the measure represented by φ_i. By construction, for every conditioning operator Φ for F, we have Φ(µ_i, x) ∉ P_{[0,1]}(A′, q′), a contradiction. □

These results rely on the density of the finitely supported discrete probability distributions D_{P,[0,1]²}. However, analogous results can be established if we restrict ourselves to absolutely continuous distributions admitting continuous joint density functions. In this case, the role of the finitely supported discrete distributions would be played by absolutely continuous distributions with sharp but continuous bump functions concentrating on small sets. The fundamental obstruction is the same: partial information in the weak topology does not suffice to condition continuously.

7 NONCOMPUTABLE ALMOST-CONTINUOUS CONDITIONAL DISTRIBUTIONS

In this section, we construct a pair of P-almost computable random variables X in [0, 1] and N in N such that the conditional probability map P[N = k | X = ·] is not even L1(P_X)-computable, despite the existence of a P_X-almost continuous version. Our construction in this section can be thought of as providing a single witness to the noncomputability of the conditioning operator.

Let M_n denote the n-th Turing machine, under a standard enumeration, and let h : N → N ∪ {∞} be the map given by h(n) := ∞ if M_n does not halt (on input 0) and h(n) := k if M_n halts (on input 0) at the k-th step.
We may then take ∅′ to denote the halting set

{ℓ : M_ℓ halts on input 0}, (46)

which is computably enumerable but not computable. The set ∅′ and the function h are computable from each other because

∅′ = {n ∈ N : h(n) < ∞}. (47)

We now use h to define a pair of P-almost computable random variables (N, X) such that ∅′ is computable from P[N | X = ·]. Let N, C, U, and V be independent P-almost computable random variables such that N is a geometric(1/5) random variable, C is a Bernoulli(1/3) random variable, and U and V are uniformly distributed random variables in [0, 1]. Let ⌊x⌋ denote the greatest integer y ≤ x, and note that ⌊2^k V⌋ is uniformly distributed in {0, 1, 2, . . . , 2^k − 1} and is P-almost computable. For each k ∈ N, consider the derived random variable

X_k := (2⌊2^k V⌋ + C + U) / 2^{k+1}. (48)

Note that lim_{k→∞} X_k almost surely exists. Define X_∞ := lim_{k→∞} X_k, and observe that X_∞ = V a.s. Finally, define X := X_{h(N)}.

Proposition 7.1. The random variable X is P-almost computable.

Proof. Because U and V are computable on a P-measure one set and a.s. nondyadic, their binary expansions {U_n : n ∈ N} and {V_n : n ∈ N} (which are uniquely determined with probability 1) are themselves P-almost computable random variables in {0, 1}, uniformly in n. For each k ≥ 0, define the random variable

D_k = V_k, if h(N) > k; C, if h(N) = k; U_{k−h(N)−1}, if h(N) < k. (49)

By simulating M_N for k steps, we can decide whether h(N) is less than, equal to, or greater than k. Therefore the random variables {D_k}_{k ≥ 0} are P-almost computable, uniformly in k. We now show that, with probability one, {D_k}_{k ≥ 0} is the binary expansion of X, thus demonstrating that X is itself a P-almost computable random variable.
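The digit scheme (49) can be checked numerically. True halting times are of course noncomputable; the sketch below substitutes a small hypothetical table for h so that the identity between the digits D_k and the value X_{h(N)} of (48) can be tested.

```python
import random

# Sketch of Proposition 7.1: the digits D_k of (49) reconstruct X_{h(N)}.
# h below is an assumed stand-in table for the (noncomputable) halting times.
h = {0: 3, 1: float('inf'), 2: 0, 3: 5}

def digits_to_real(bits):
    return sum(b * 2.0 ** (-k - 1) for k, b in enumerate(bits))

def D(k, n, C, U_bits, V_bits):
    if h[n] > k:
        return V_bits[k]              # h(N) > k: copy a digit of V
    if h[n] == k:
        return C                      # h(N) = k: insert the coin C
    return U_bits[k - h[n] - 1]       # h(N) < k: copy digits of U

rng = random.Random(1)
for n in h:
    C = rng.randrange(2)
    U_bits = [rng.randrange(2) for _ in range(60)]
    V_bits = [rng.randrange(2) for _ in range(60)]
    x = digits_to_real([D(k, n, C, U_bits, V_bits) for k in range(60)])
    if h[n] == float('inf'):
        expected = digits_to_real(V_bits)                 # X = X_infty = V
    else:
        m = h[n]
        floor_V = sum(V_bits[k] * 2 ** (m - 1 - k) for k in range(m))
        U = digits_to_real(U_bits)
        expected = (2 * floor_V + C + U) / 2 ** (m + 1)   # X = X_m, eq. (48)
    assert abs(x - expected) < 1e-12
```

In the actual construction, deciding the three cases of (49) requires only simulating M_N for k steps, which is what makes the digits, and hence X, almost computable.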
Let D denote the P-almost computable random real whose binary expansion is {D_k}_{k ≥ 0}. There are two cases to consider. First, conditioned on the event {h(N) = ∞}, we have that D_k = V_k for all k ≥ 0, and so D = V = X_∞ = X almost surely. In the second case, let m ∈ N, and condition on the event {h(N) = m}. We must then show that D = X_m a.s. Note that

⌊2^m X_m⌋ = ⌊2^m V⌋ = Σ_{k=0}^{m−1} 2^{m−1−k} V_k = ⌊2^m D⌋, (50)

and thus the binary expansions agree for the first m digits. Finally, notice that 2^{m+1} X_m − 2⌊2^m X_m⌋ = C + U, and so the next binary digit of X_m is C, followed by the binary expansion of U, thus agreeing with D for all k ≥ 0. □

We now show that P[N | X = ·] is P_X-almost continuous. We begin by characterizing the conditional density of X given N.

Lemma 7.2. For each k ∈ N ∪ {∞}, the distribution of X_k admits a density p_{X_k} with respect to Lebesgue measure on [0, 1] given by p_{X_∞}(x) = 1 and

p_{X_k}(x) = 4/3, if ⌊2^{k+1} x⌋ is even; 2/3, if ⌊2^{k+1} x⌋ is odd, (51)

for k < ∞.

Proof. We have X_∞ = V a.s., and so the constant function taking the value 1 is a density of X_∞ with respect to Lebesgue measure on [0, 1]. Let k ∈ N. With probability one, the integer part of 2^{k+1} X_k is 2⌊2^k V⌋ + C while the fractional part is U. Therefore, the distribution of 2^{k+1} X_k (and hence X_k) admits a piecewise constant density with respect to Lebesgue measure. In particular, ⌊2^{k+1} X_k⌋ ≡ C (mod 2) almost surely, and 2⌊2^k V⌋ is independent of C and uniformly distributed on {0, 2, . . . , 2^{k+1} − 2}. Therefore,

P{⌊2^{k+1} X_k⌋ = ℓ} = 2^{−k} · (2/3, if ℓ is even; 1/3, if ℓ is odd), (52)

Fig. 1. A visualization of the distribution of (W, Y), as defined in Theorem 7.6. In this plot, the darkness of each pixel is proportional to the mass contained in that region.
Note that this is not a plot of the (noncomputable) density, but rather of a discretization to a given pixel size. Regions that appear (at low resolution) to be uniform can suddenly be revealed (at higher resolutions) to be patterned. Deciding whether the pattern is in fact uniform (as opposed to nonuniform but at a finer granularity than the resolution of this printer/display) is tantamount to solving the halting problem, but it is possible to sample from this distribution nonetheless.

for every ℓ ∈ {0, 1, . . . , 2^{k+1} − 1}. It follows immediately that a density p of 2^{k+1} X_k with respect to Lebesgue measure on [0, 2^{k+1}] is given by

p(x) = 2^{−k} · (2/3, if ⌊x⌋ is even; 1/3, if ⌊x⌋ is odd). (53)

A density of X_k is then obtained by the rescaling p_{X_k}(x) = 2^{k+1} · p(2^{k+1} x). □

As X_k admits a density with respect to Lebesgue measure on [0, 1] for all k ∈ N ∪ {∞}, it follows that the conditional distribution of X given N admits a conditional density p_{X|N} (with respect to Lebesgue measure on [0, 1]) given by

p_{X|N}(x | n) = p_{X_{h(n)}}(x) (54)

for x ∈ [0, 1] and n ∈ N. By Bayes' rule (Lemma 3.11), for every B ⊆ N and P_X-almost every x ∈ [0, 1],

P[N ∈ B | X = x] = ϕ(x, B) / ϕ(x, N), (55)

where

ϕ(x, B) := Σ_{n ∈ B} p_{X|N}(x | n) · P{N = n}. (56)

Lemma 7.3. P[N | X = ·] is P_X-almost continuous.

Proof. For every k ∈ N ∪ {∞}, the density p_{X_k} is bounded by 4/3, positive, and continuous on the P_X-measure one set R of nondyadic reals in the unit interval. Therefore, p_{X|N} is positive, bounded, and continuous on R × N, and so, by Proposition 3.12, P[N | X = ·] is P_X-almost continuous. □

Lemma 7.4. P[N | X = ·] is ∅′-computable on a P_X-measure one set.

Proof. Let B ⊆ N be a computable set.
By Lemma 4.7, Equation (55), and the computability of division, it suffices to show that ϕ_B := ϕ(·, B) is P_X-almost computable from ∅′. Note that P_X is absolutely continuous with respect to Lebesgue measure (on [0, 1]) and vice versa, and so P_X-almost and Lebesgue-almost computability coincide. We will therefore drop the measure when referring to almost computability for the remainder of the proof. Define, for each n ∈ ℕ, the function a_n on [0, 1] given by

a_n(x) := 3, if h(n) = ∞;  4, if h(n) < ∞ and ⌊2^{h(n)+1} x⌋ is even;  2, if h(n) < ∞ and ⌊2^{h(n)+1} x⌋ is odd.   (57)

Clearly the functions {a_n(x)}_{n ∈ ℕ} are almost computable from ∅′, uniformly in n. Observe that, for all n ∈ ℕ and x ∈ [0, 1],

a_n(x) = 3 p_{X|N}(x | n)   (58)

by Lemma 7.2. Also recall that P{N = n} = (4/5) · 5^{-n} for all n ∈ ℕ. Hence, for any finite set F ⊆ ℕ, the function

ϕ_F(x) := ϕ(x, F) = ∑_{n ∈ F} p_{X|N}(x | n) · P{N = n} = (4/15) ∑_{n ∈ F} a_n(x) · 5^{-n}   (59)

is almost computable from ∅′, uniformly in F. However, for every k ∈ ℕ and x ∈ [0, 1],

|ϕ_B(x) − ϕ_{F_k}(x)| ≤ ∑_{n > k} p_{X|N}(x | n) · P{N = n} ≤ (4/15) · 5^{-k},   (60)

where F_k = B ∩ {0, ..., k}. It follows that ϕ_B is almost computable from ∅′. □

Proposition 7.5. Let R ⊆ [0, 1] be a measurable subset of P_X-measure greater than 5/6. For each k ∈ ℕ, the conditional probability map P[N = k | X = ·] is neither lower nor upper semicomputable on R.

Proof. First note that if R had Lebesgue measure no greater than 3/4, then, for k ∈ ℕ,

P_{X_k}(R) ≤ (1/2) · (4/3) + (1/4) · (2/3) = 5/6,   (61)

and so P_X(R) ≤ 5/6, a contradiction. Hence R has Lebesgue measure greater than 3/4. Fix k ∈ ℕ. By (55) and the definition of ϕ,

P[N = k | X = x] = p_{X|N}(x | k) · P{N = k} / ϕ(x, ℕ)   (62)

for a.e. x.
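The geometric decay in (60) is what makes ϕ_B computable from ∅′: the truncation error is bounded a priori, independently of x. The small illustration below substitutes a toy total function for the noncomputable halting-time function h; the function names are our own.

```python
from math import floor, inf

def a(n, x, h):
    """A digit a_n(x) in {2, 3, 4} following Eq. (57): 3 when h(n) = inf,
    otherwise determined by a binary digit of x."""
    if h(n) == inf:
        return 3
    return 4 if floor(2 ** (h(n) + 1) * x) % 2 == 0 else 2

def phi(F, x, h):
    """Eq. (59): phi_F(x) = (4/15) * sum_{n in F} a_n(x) * 5^{-n}."""
    return (4 / 15) * sum(a(n, x, h) * 5 ** (-n) for n in F)

# Toy stand-in for the (noncomputable) halting-time function:
h = lambda n: n if n % 2 == 0 else inf

B = range(50)        # plays the role of a computable set of indices
x = 0.2371
for k in (2, 5, 8):
    F_k = [n for n in B if n <= k]
    err = abs(phi(B, x, h) - phi(F_k, x, h))
    assert err <= (4 / 15) * 5 ** (-k)   # the tail bound of Eq. (60)
```

Each truncation is within (4/15)·5^{-k} of the full series, matching the bound exactly in the worst case where every omitted digit is 4.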
The density p_{X|N}(· | k) = p_{X_{h(k)}}, given by Lemma 7.2, is a piecewise constant function and computable on a measure one set. Furthermore, P{N = k} = 4 · 5^{-k-1} is a computable real. Hence it remains to show that, as a function of x, the denominator ϕ(x, ℕ) := ∑_{n ∈ ℕ} p_{X|N}(x | n) · P{N = n} of the right-hand side is neither lower nor upper semicomputable on R. We will show the former; the latter follows in a similar fashion. Recall the sequence of functions {a_n}_{n ∈ ℕ} on [0, 1] from (57), and define the function τ : [0, 1] → ℝ by

τ(x) := ∑_{n ∈ ℕ} a_n(x) · 5^{-n},   (63)

which furthermore satisfies

τ(x) = ∑_{n ∈ ℕ} 3 p_{X|N}(x | n) · (5/4) P{N = n}   (64)

for all x ∈ [0, 1]. Observe that for every x ∈ [0, 1], the real τ(x) has a unique base-5 expansion with all digits contained in {2, 3, 4}, which must necessarily be given by the sequence a_0(x), a_1(x), .... For each ℓ ∈ ℕ, define

τ_ℓ(x) := 5^ℓ (τ(x) − ∑_{n < ℓ} a_n(x) · 5^{-n}) = ∑_{n ∈ ℕ} a_{n+ℓ}(x) · 5^{-n}.   (65)

Note that τ_0 (= τ) is lower semicomputable on R precisely when ϕ(·, ℕ) is lower semicomputable on R. Further note that if a_ℓ(x) = 2, then τ_ℓ(x) < 3 (where the inequality is strict because for every n, there exists an m > n such that h(m) = ∞). On the other hand, if a_ℓ(x) ≥ 3, then τ_ℓ(x) > 3. We now prove by induction, uniformly in ℓ, that if τ_ℓ is lower semicomputable on R, then τ_{ℓ+1} is lower semicomputable on R, and we can compute whether or not h(ℓ) is finite. This then implies that if τ were lower semicomputable on R, then ∅′ would be computable, a contradiction. Suppose τ_ℓ is lower semicomputable on R. Then we can compute (as a c.e. open subset of [0, 1]) a set S such that S ∩ R = τ_ℓ^{-1}[(3, ∞)] ∩ R. Simultaneously run M_ℓ (on input 0). If h(ℓ) < ∞, then we will eventually notice this fact by observing that M_ℓ halts.
It is also the case that S ∩ R ⊆ {x ∈ [0, 1] : a_ℓ(x) = 4}, and so, by the definition of the functions {a_n}, the set S ∩ R has Lebesgue measure at most 1/2. Therefore S has measure at most 1/2 plus the measure of [0, 1] − R, i.e., S has measure less than 1/2 + (1 − 3/4) = 3/4. On the other hand, if h(ℓ) = ∞, then S ⊇ R, and hence S has Lebesgue measure greater than 3/4, which we will eventually notice (and which rules out the first case). Hence we can compute whether or not h(ℓ) is finite, which in turn implies that a_ℓ is computable on a measure one set. Therefore, as τ_{ℓ+1}(x) = 5 · (τ_ℓ(x) − a_ℓ(x)), the function τ_{ℓ+1} is lower semicomputable on R. □

We may summarize our central result as follows.

Theorem 7.6. There are P-almost computable random variables W and Y on [0, 1] such that the conditional distribution map P[Y | W = ·] is P_W-almost continuous but not P_W-almost computable.

Proof. Let X and N be as above, let Y be uniformly distributed on [0, 1], and let M be the geometric(1/2) random variable given by M := ⌊−log_2 Y⌋. Finally, let W be such that P[W | Y](ϖ) = P[X | N = M(ϖ)]. Note that we have

P[Y ∈ (2^{-n-1}, 2^{-n}) | W = x] = P[Y ∈ [2^{-n-1}, 2^{-n}] | W = x] = P[N = n | X = x]

for n ∈ ℕ and x ∈ [0, 1]. The result then follows from Lemma 7.3 and Proposition 7.5. □

For a visualization of the distribution of (W, Y), see Figure 1. The proof of Proposition 7.5 shows that not only is the conditional distribution map P[N | X = ·] not P_X-almost computable, but that, in fact, it computes the halting set ∅′; to make this precise, we would define the notion of an oracle that encodes P[N | X = ·] using, e.g., infinite strings as in the Type-2 Theory of Effectivity.
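The digit-by-digit extraction driving this argument can be made concrete. The sketch below is our own illustration: it uses exact rationals in place of the semicomputable approximations the proof actually manipulates, and pads a finite digit prefix with 3s (machines that never halt), then recovers the digits, and hence the halting answers they encode, exactly as in (65).

```python
from fractions import Fraction

def tau_value(digits):
    """tau = sum_n a_n * 5^{-n} (cf. Eq. (63)) for digits in {2, 3, 4};
    a finite prefix is padded with the digit 3 ('never halts' forever),
    which contributes the exact tail (3/4) * 5^{-(len - 1)}."""
    prefix = sum(Fraction(d, 5 ** n) for n, d in enumerate(digits))
    return prefix + Fraction(3, 4) / 5 ** (len(digits) - 1)

def extract(tau, count):
    """Recover a_0, a_1, ... from tau via Eq. (65): every tail tau_l lies
    strictly between a_l + 1/2 and a_l + 1, so a_l = floor(tau_l - 1/2),
    and then tau_{l+1} = 5 * (tau_l - a_l)."""
    out, t = [], tau
    for _ in range(count):
        a = int(t - Fraction(1, 2))   # truncation = floor, since t > 1/2
        out.append(a)
        t = 5 * (t - a)
    return out

digits = [4, 3, 2, 3, 3, 4, 2, 3]     # 3 encodes h(n) = infinity
assert extract(tau_value(digits), len(digits)) == digits
# Each recovered digit answers a halting question: digit != 3 iff M_n halts.
halts = [d != 3 for d in extract(tau_value(digits), len(digits))]
```

The point of the proposition is precisely that a lower semicomputable τ would let one run this extraction effectively, deciding the halting problem.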
Despite not having a definition of computability from the conditional distribution map, we can easily relativize the notion of computability for the conditional distribution map, to obtain the following.

Corollary 7.7. If P[N | X = ·] is A-computable on a set of P_X-measure greater than 5/6 for an oracle A ⊆ ℕ, then A computes the halting set, i.e., A ≥_T ∅′. In particular, P[N | X = ·] is not computable on any set of P_X-measure greater than 5/6, and hence no version of P[N | X = ·] is P_X-almost computable.

Proof. Suppose A is such that P[N | X = ·] is A-computable on a set of P_X-measure greater than 5/6. Then P[N = 1 | X = ·] is A-computable on the same set. But then, by the argument in the proof of Proposition 7.5, the function h, and hence the halting set ∅′, is computable from A. □

On the other hand, by Lemma 7.4, the conditional distribution map P[N | X = ·] is in fact ∅′-computable on a P_X-measure one set, and so the bound in Corollary 7.7 is the best possible. Computable operations map computable points to computable points, and so Corollary 7.7 provides another context in which conditioning operators are noncomputable (cf. Proposition 6.4). The next result shows that Proposition 7.5 also rules out the computability of conditional probability maps in the weaker sense of L^1(P_X)-computability. This extends [Hoyrup and Rojas 2011, Prop. 3], which states that there is a pair of computable measures μ ≪ ν on a computable Polish space such that the Radon–Nikodym derivative dμ/dν is not L^1-computable. Namely, we show that in the case of measures on [0, 1], the measures can be taken to be of the form μ = P{X ∈ ·, N = k} and ν = P{X ∈ ·} = P_X.

Proposition 7.8. For every k ∈ ℕ, the map P[N = k | X = ·] is not L^1(P_X)-computable.

Proof. Let k ∈ ℕ.
By Proposition 7.5, the conditional probability map P[N = k | X = ·] is not computable on any measurable set R of P_X-measure greater than 5/6. On the other hand, by Lemma 2.23, the conditional probability map is L^1(P_X)-computable only if, for each r ∈ ℕ, the map is computable on some set of P_X-measure at least 1 − 2^{-r}, uniformly in r. This does not hold, and so the conditional probability map P[N = k | X = ·] is not L^1(P_X)-computable. □

It is natural to ask whether this construction can be modified to produce a pair of P-almost computable random variables, like N and X, such that the corresponding conditional distribution map is not P_X-almost computable even though it has an everywhere continuous version. We provide such a strengthening in the next section.

8 NONCOMPUTABLE EVERYWHERE CONTINUOUS CONDITIONAL DISTRIBUTIONS

In Section 7, we demonstrated a pair of computable random variables that admit a conditional distribution that is continuous on a measure one set but still noncomputable on every measure one set. It is therefore natural to ask whether we can construct a pair of random variables (Z, N) that is computable and admits an everywhere continuous version of the conditional distribution map P[N | Z = ·], which is itself nonetheless not computable. In fact, we do so now, using a construction similar to that of (X, N) in Section 7. If we think of the construction of the kth bit of X as an iterative process, we see that there are two distinct stages. During the first stage, which occurs so long as k < h(N), the bits of X simply mimic those of the uniform random variable V. Then during the second stage, once k ≥ h(N), the bits mimic those of (1/2)(C + U).
Our construction of Z will differ in the second stage, where the bits of Z will instead mimic those of a random variable S specially designed to smooth out the rough edges caused by the biased Bernoulli random variable C, while still allowing us to encode the halting set. In particular, S will be absolutely continuous and will have an infinitely differentiable density. We now begin the construction. Let N, U, V, and C be as in the first construction. We next define several random variables from which we will construct S, and then Z.

Lemma 8.1. There is a random variable F in [0, 1] with the following properties:
(1) F is P-almost computable.
(2) P_F admits a computable density p_F with respect to Lebesgue measure (on [0, 1]) that is infinitely differentiable everywhere.
(3) p_F(0) = 2/3 and p_F(1) = 4/3.
(4) (d^n_+/dx^n) p_F(0) = (d^n_−/dx^n) p_F(1) = 0 for all n ≥ 1 (where d^n_−/dx^n and d^n_+/dx^n are the left and right derivatives, respectively). □

(See Figure 2 for one such random variable.) Let F be as in Lemma 8.1, and independent of all earlier random variables mentioned. Note that F is almost surely nondyadic, and so the r-th bit F_r of F is a P-almost computable random variable, uniformly in r. Let D be a P-almost computable random variable, independent of all earlier random variables mentioned, and uniformly distributed on {0, 1, ..., 7}. Consider

S = (1/8) × { F, if D = 0;  4 + (1 − F), if D = 4;  4C + (D mod 4) + U, otherwise. }   (66)

It is clear that S is also P-almost computable, and straightforward to show that (i) P_S admits an infinitely differentiable and computable density p_S with respect to Lebesgue measure on [0, 1]; and (ii) for all n ≥ 0, we have (d^n_+/dx^n) p_S(0) = (d^n_−/dx^n) p_S(1). (For a visualization of the density p_S, see Figure 3.)

Fig. 2.
(le) The graph of the function defined by f ( x ) = exp { −( 1 − x 2 ) − 1 } , for x ∈ (− 1 , 1 ) , and 0 otherwise, a C ∞ bump function whose derivatives at ± 1 are all 0. (right) A density p ( y ) = 2 / 3 ( 1 + Φ ( 2 y − 1 )/ Φ ( 1 )) , for y ∈ ( 0 , 1 ) , of a random variable satisfying Lemma 8.1, where Φ ( y ) =  y − 1 f ( x ) d x is the integral of the bump function. Next we dene, for e very k ∈ N , the random variables Z k mimicking the construction of X k . Specically , for k ∈ N , dene Z k : = ⌊ 2 k V ⌋ + S 2 k , (67) and let Z ∞ : = lim k →∞ Z k = V a.s. Then the n th bit of Z k is ( Z k ) n =  V n , n < k ; S n − k , n ≥ k a.s. , (68) where S ℓ denotes the ℓ th bit of S . It is straightforward to sho w from (i) and (ii) above that P Z k admits an innitely dierentiable density p Z k with respect to Lebesgue measure on [ 0 , 1 ] . T o complete the construction, we dene Z : = Z h ( N ) . The following results are analogous to those in the almost continuous construction. Lemma 8.2. The random variable Z is P -almost computable. □ Lemma 8.3. There is an everywhere continuous version of P [ N | Z = · ] . Proof. By construction, the conditional density of Z given N is everywhere continuous, bounde d, and positive. The result follo ws from Proposition 3.12 for R = [ 0 , 1 ] . □ W e next show that, for each k ∈ N , the conditional probability map P [ N = k | Z = · ] is not com- putable. Our proof relies on the fact that, for each k ∈ N , there is a large set of points x ∈ [ 0 , 1 ] for which the density p Z | N ( x | k ) : = p Z h ( k ) ( x ) agrees with p X | N ( x | k ) . This will then allow us to use techniques similar to those of Proposition 7.5. W e say a real x ∈ [ 0 , 1 ] is valid for P S if x ∈ ( 1 8 , 1 2 ) ∪ ( 5 8 , 1 ) . In particular , when D < { 0 , 4 } , then S is valid for P S . 
The following are then consequences of the construction of S and the definition of valid points:

Fig. 3. Graph of p_S, the density of S, when S is constructed from F as given in Figure 2.

(iii) If x is valid for P_S, then p_S(x) ∈ {2/3, 4/3}. In particular, p_S(x) = 4/3 for x ∈ (1/8, 1/2), and p_S(x) = 2/3 for x ∈ (5/8, 1).
(iv) The Lebesgue measure of the set of points valid for P_S is (1/2 − 1/8) + (1 − 5/8) = 3/4.

For k < ∞, we say that x ∈ [0, 1] is valid for P_{Z_k} if the fractional part of 2^k x is valid for P_S, and we say that every x is valid for P_{Z_∞}. Let A_k be the collection of x valid for P_{Z_k}, and let A_∞ = [0, 1]. Note that, at all points x that are valid for P_{Z_k}, the density p_{Z_k}(x) agrees with p_{X_k}(x), and, at all points x that are valid for P_{Z_∞}, the density p_{Z_∞}(x) agrees with p_{X_∞}(x). For each k < ∞, the set A_k consists of the set of points valid for P_S first dilated by a factor of 2^{-k} and then tiled 2^k times, and so A_k has Lebesgue measure 3/4; the set A_∞ = [0, 1] has Lebesgue measure one.

Proposition 8.4. Let R ⊆ [0, 1] be a measurable subset of P_Z-measure greater than 11/12. Then for each k ∈ ℕ, the conditional probability map P[N = k | Z = ·] is neither lower semicomputable nor upper semicomputable on R.

Proof. We proceed analogously to the proof of Proposition 7.5. First note that if R had Lebesgue measure no greater than 7/8, then, for k ∈ ℕ,

P_{Z_k}(R) ≤ (1/2) · (4/3) + (3/8) · (2/3) = 11/12,   (69)

and so P_Z(R) ≤ 11/12, a contradiction. Hence R has Lebesgue measure greater than 7/8. Fix k ∈ ℕ. By Bayes' rule (Lemma 3.11),

P[N = k | Z = x] = p_{Z|N}(x | k) · P{N = k} / ∑_{n ∈ ℕ} p_{Z|N}(x | n) · P{N = n}   (70)

for P_Z-a.e. x, and hence for Lebesgue-a.e. x.
Now, p_{Z|N}(· | k) is piecewise (on a finite number of pieces with rational endpoints) a computable dilation and translation of the computable function p_S. By (ii) above, the endpoints of the pieces line up, and so p_{Z|N}(· | k) is computable. Because P{N = k} = (4/5) · 5^{-k} is a computable real, it remains to show that, as a function of x, the denominator ∑_{n ∈ ℕ} p_{Z|N}(x | n) · P{N = n} of the right-hand side of (70) is neither lower nor upper semicomputable on R. We will show the former; the latter follows in a similar fashion. Define, for each ℓ ∈ ℕ, the function ψ_ℓ : [0, 1] → ℝ by

ψ_ℓ(x) = 5^ℓ ∑_{n ≥ ℓ} 3 p_{Z|N}(x | n) · (5/4) P{N = n}.   (71)

Note that ψ_0 is equal to 15/4 times the denominator. Hence it suffices to show that ψ_0 is not lower semicomputable on R. We now prove by induction, uniformly in ℓ, that if ψ_ℓ is lower semicomputable on R, then ψ_{ℓ+1} is lower semicomputable on R, and we can compute whether or not h(ℓ) is finite. This then implies that if ψ_0 were lower semicomputable on R, then ∅′ would be computable, a contradiction. Suppose that ψ_ℓ is lower semicomputable on R. Then we can compute (as a c.e. open subset of [0, 1]) a set S such that S ∩ R = ψ_ℓ^{-1}[(3, ∞)] ∩ R. If ℓ is such that h(ℓ) = ∞, then for x ∈ [0, 1] we have p_{Z|N}(x | ℓ) = 1, and so

ψ_ℓ(x) = 3 + (1/5) ψ_{ℓ+1}(x) > 3.   (72)

In particular, S will contain R, and hence have Lebesgue measure greater than 7/8. On the other hand, if ℓ is instead such that h(ℓ) < ∞, then S ∩ R ∩ A_{h(ℓ)} has Lebesgue measure at most 1/2. Hence

S ⊆ (S ∩ R ∩ A_{h(ℓ)}) ∪ ([0, 1] \ R) ∪ ([0, 1] \ A_{h(ℓ)}),   (73)

and so the Lebesgue measure of S is less than 1/2 + (1 − 7/8) + (1 − 3/4) = 7/8. Simultaneously, (1) enumerate a c.e.
sequence of basic open sets whose union is S, hence computing arbitrarily good lower bounds on its measure, and (2) run M_ℓ (on input 0). Either S will have measure greater than 7/8 or M_ℓ will halt, but not both. Hence we will eventually learn whether or not h(ℓ) is finite, which in turn shows that p_{Z|N}(· | ℓ) is computable on [0, 1], and so ψ_{ℓ+1} is also lower semicomputable on R. □

As before, it follows immediately that P[N | Z = ·] is not computable, even on a P_Z-measure one set. It is possible to carry on the same development, showing that the conditional probability map is not computable as an element of L^1(P_Z). We state only the following strengthening of Corollary 7.7.

Corollary 8.5. Let Φ be a conditioning operator for the set of probability distributions on pairs (X, Y) of random variables in [0, 1] such that there exists an everywhere continuous version of the conditional distribution map P[Y | X = ·]. Then Φ is noncomputable. □

9 POSITIVE RESULTS

Despite the fact that conditioning is not computable in general, many practitioners view conditioning as a routine operation involving the application of Bayes' rule. Indeed, below we show that conditioning is computable under mild assumptions when the observation is discrete, or when Bayes' rule is applicable, or when the observation is corrupted by independent "smooth" computable noise. We conclude by examining the setting where a notion of symmetry known as exchangeability holds.

9.1 Discrete Random Variables

We begin with the problem of conditioning on a random variable that takes values in a discrete set. Given an appropriate notion of computability for discrete sets, conditioning is always possible in this setting, as it reduces to the elementary notion of conditional probability with respect to single events.

Definition 9.1 (Computably discrete set). Let S be a computable Polish space.
A subset D ⊆ S is computably discrete when there exists a function f : S → ℕ that is computable and injective on D. In particular, such a D is countable. We call f the witness to the discreteness of D.

Proposition 9.2. Let X and Y be P-almost computable random variables in computable Polish spaces S and T, respectively, let D ⊆ S be a computably discrete subset with witness f, and assume that P{X = d} > 0 for all d ∈ D. Then the conditional distribution map P[Y | X = ·] is computable on D, uniformly in X, Y, and f.

Proof. Let A ⊆ T be a P_Y-almost decidable set. By Lemma 4.7, it suffices to show that P[Y ∈ A | X = ·] is computable on D, uniformly in μ, f, and (the witness for the P_Y-almost decidability of) A. Let x ∈ D. Uniformly in f and x, we can find a witness to a P_X-almost decidable set B_x ⊆ S such that x ∈ B_x and no other element of D is in B_x. It then follows that

P[Y ∈ A | X = x] = P[Y ∈ A | X ∈ B_x] a.s.   (74)

By Lemma 4.1, P[Y ∈ A | X ∈ B_x] is a computable real uniformly in μ and (the witness for the P_X-almost decidability of) B_x, hence uniformly in μ, f, and x. □

The proof above relies on the ability to compute probabilities for continuity sets. In practice, computing probabilities can be inefficient. We give an alternative proof of Proposition 9.2 via a rejection-sampling argument, which yields an algorithm that can be much more efficient. We begin with a version of a well-known result.

Lemma 9.3 (rejection sampling). Let X and Y be random variables in computable Polish spaces S and T, respectively, let B ⊆ S be a measurable set with positive P_X-measure, let (X_0, Y_0), (X_1, Y_1), ... be a sequence of i.i.d. copies of (X, Y), and define ν to be the distribution of (X_0, X_1, ...). Further define g : S^ℕ → ℕ to be the map

(x_0, x_1, ...) ↦ inf{n ∈ ℕ : x_n ∈ B}   (75)

and set

ζ := g(X_0, X_1, ...).   (76)
Then Y_ζ has distribution P{Y ∈ · | X ∈ B} and, if B is a P_X-almost decidable set, then g is ν-almost computable, uniformly in a witness for B.

Proof. For all n ∈ ℕ, we have

P{ζ = n} = P{X_n ∈ B} ∏_{j=0}^{n-1} P{X_j ∉ B} = P{X ∈ B} (1 − P{X ∈ B})^n.

Summing this expression over all n ∈ ℕ, we obtain

P{ζ < ∞} = 1.   (77)

For P_ζ-almost all n ∈ ℕ,

P{Y_ζ ∈ · | ζ = n} = P{Y_n ∈ · | X_n ∈ B and X_j ∉ B for all j < n}   (78)
= P{Y_n ∈ · | X_n ∈ B} = P{Y ∈ · | X ∈ B},   (79)

where the first equality follows from the definition of ζ, the second from the i.i.d. property, and the last by definition. As the right-hand side has no dependence on n, it follows from the chain rule of conditional expectation and (77) that P{Y_ζ ∈ ·} = P{Y ∈ · | X ∈ B}. We now establish the ν-almost computability of g. It follows from P{ζ < ∞} = 1 that g is ℕ-valued on a ν-measure one set. By definition, the characteristic function of B, written χ_B : S → {0, 1}, is P_X-almost computable uniformly in a witness for the almost decidability of B. Hence g(x_0, x_1, ...) = min{n ∈ ℕ : χ_B(x_n) = 1} on a ν-measure one set A ⊆ S^ℕ of input sequences. But then

A ∩ g^{-1}(n) = A ∩ ((χ_B^{-1}(0))^n × χ_B^{-1}(1) × S^ℕ),

and so g is ν-almost computable. □

We may now give an alternative proof of Proposition 9.2.

Proof of Proposition 9.2 (via rejection sampling). Let μ be the distribution of (X, Y), which is computable. Uniformly in μ, we may compute a sequence (X_0, Y_0), (X_1, Y_1), ... of independent P-almost computable random variables with distribution μ. Let f be a witness to the computable discreteness of D. Then (f(X_0), Y_0), (f(X_1), Y_1), ... is also a computable sequence of independent P-almost computable random variables.
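The algorithmic content of Lemma 9.3 is the familiar rejection sampler. The following sketch (with a toy joint distribution of our own choosing) conditions on the event X ∈ B by discarding pairs whose first coordinate misses B:

```python
import random

def rejection_sample(draw_xy, in_B, rng):
    """Lemma 9.3 as an algorithm: draw i.i.d. copies of (X, Y) and return
    the first Y whose paired X lands in B, yielding a sample from
    P{Y in . | X in B}.  The loop terminates a.s. since P{X in B} > 0."""
    while True:
        x, y = draw_xy(rng)
        if in_B(x):
            return y

# Toy joint law (an assumption, for illustration only):
# X uniform on {0, 1, 2}; given X = x, Y = 1 with probability x/2.
def draw_xy(rng):
    x = rng.randrange(3)
    return x, 1 if rng.random() < x / 2 else 0

rng = random.Random(1)
n = 100_000
freq = sum(rejection_sample(draw_xy, lambda x: x >= 1, rng) for _ in range(n)) / n
# Exact conditional value: P{Y = 1 | X in {1, 2}} = (1/2 + 1)/2 = 3/4.
print(freq)
```

Note that the sampler never evaluates any probabilities, which is exactly why this route can be more efficient than the proof via almost decidable sets.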
Uniformly in f, we can compute a sequence of disjoint P_X-continuity sets ⟨B_k⟩_{k ∈ ℕ} such that x ∈ B_{f(x)} for P_X-almost all x. For k ∈ ℕ, let g_k be the function from Lemma 9.3 with respect to the set B_k. Uniformly in k, the random variable ζ_k = g_k(f(X_0), f(X_1), ...) is P-almost computable, and then so is the random variable Y_{ζ_k}. For k ∈ ℕ, let γ_k be the distribution of Y_{ζ_k}. By Lemma 9.3, γ_k = P{Y ∈ · | f(X) = k}, and γ_k is computable, uniformly in k, f, and μ. It follows that the map γ : ℕ → M_1(T) given by k ↦ γ_k is computable, uniformly in f and μ. Note, however, that for P_X-almost all x, we have P[Y | X = x] = P{Y ∈ · | f(X) = f(x)}. Therefore P[Y | X = ·] = γ ∘ f is P_X-almost computable, uniformly in f and μ. □

Rejection sampling has been used to give informal semantics of conditioning operators for discrete random variables in probabilistic programming languages such as λ◦ [Park et al. 2008] and Church [Goodman et al. 2008].

9.2 Continuous and Dominated Setting

The most common way to perform conditioning given a continuous random variable is via Bayes' rule, which requires the existence of a conditional density. Using the characterization of computable measures in terms of the computability of integration, we can characterize when Bayes' rule is computable.

Proposition 9.4. Let X and Y be P-almost computable random variables on computable Polish spaces S and T, and let R ⊆ S. If p_{X|Y}(x | y) is a conditional density of X given Y that is positive, bounded, and computable on R × T, then κ as defined in (21) is computable on R. In particular, if R is a P_X-measure one subset, then κ is a P_X-almost computable version.

Proof.
By Bayes' rule (Lemma 3.11), a version of P[Y | X = ·] is given by the ratio in Equation (21):

κ(x, B) = ∫_B p_{X|Y}(x | y) P_Y(dy) / ∫ p_{X|Y}(x | y) P_Y(dy),  B ∈ B_T.   (80)

By Proposition 2.30, when B is a P_Y-almost decidable set, x ↦ ∫_B p_{X|Y}(x | y) P_Y(dy) is a computable function of x on R, uniformly in a witness for the P_Y-almost decidability of B. Further, we know that ∫ p_{X|Y}(x | y) P_Y(dy) is non-zero on a P_X-measure one set, and so the ratio κ(·, B) is a P_X-almost computable function, uniformly in a witness for the P_Y-almost decidability of B. Hence for c.e. open sets E, the function κ(·, E) is lower semicomputable uniformly in a representation for E. Therefore, by Lemma 4.7, the function κ̄ is computable, as desired. □

Remark 9.5. One might likewise obtain an algorithm for conditioning in the context of Proposition 9.4 using the proof of Proposition 9.2 via computable rejection sampling. In particular, the following classical argument may be carried out computably. For x ∈ R, let E_x be a P_X-almost continuous random variable in {0, 1} such that P[E_x = 1 | Y = y] = (1/M) p_{X|Y}(x | y), where M satisfies p_{X|Y}(x | y) < M for all x ∈ R and y ∈ T. (Such an M exists because of the boundedness condition.) Then, for every Borel B ⊆ T and every x ∈ R, we have

κ(x, B) = P{Y ∈ B, E_x = 1} / P{E_x = 1} = P{Y ∈ B | E_x = 1},   (81)

providing the reduction to (classical) rejection sampling. Because the computability result Proposition 9.4 was established in a way that obviously relativizes to any oracle, we now obtain a proof of its analogue for a continuous conditional density, as promised in Section 3.2.

Proof of Proposition 3.12. Follows from the proof of Proposition 9.4 by relativizing with respect to an arbitrary oracle. □

Corollary 9.6 (Density and independence).
Let U, V, and X be P-almost computable random variables in computable Polish spaces, where X is independent of V given U. Assume that there exists a bounded and computable conditional density p_{X|U}(x | u) of X given U. Then the kernel P[(U, V) | X = ·] is computable.

Proof. Let Y = (U, V). Then p_{X|Y}(x | (u, v)) = p_{X|U}(x | u) is the conditional density of X given Y (with respect to ν). The result then follows immediately from Proposition 9.4. □

9.3 Conditioning on Noisy Observations

As an immediate consequence of Corollary 9.6, we obtain the computability of the following common situation in probabilistic modeling, where the observed random variable has been corrupted by independent absolutely continuous noise.

Corollary 9.7 (Independent noise). Let U and E be P-almost computable random variables in ℝ, and let V be a P-almost computable random variable in a computable Polish space. Define X = U + E. If P_E is absolutely continuous (with respect to Lebesgue measure) with a bounded computable density p_E, and E is independent of U and V, then the conditional distribution map P[(U, V) | X = ·] is computable.

Proof. We have that

p_{X|U}(x | u) = p_E(x − u)   (82)

is the conditional density of X given U (with respect to Lebesgue measure). The result then follows from Corollary 9.6. □

Pour-El and Richards [1989, Ch. 1, Thm. 2] show that a twice continuously differentiable computable function has a computable derivative (despite the fact that Myhill [1971] exhibits a computable function from [0, 1] to ℝ whose derivative is continuous, but not computable). Therefore, noise with a sufficiently smooth computable distribution has a computable density, and by Corollary 9.7, if such noise has a bounded density, then an almost computable random variable corrupted by such noise still admits a computable conditional distribution map.
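When U is moreover supported on finitely many points, the conditional distribution in the setting of Corollary 9.7 is a finite instance of Bayes' rule with the noise density as likelihood. A minimal sketch follows; the Gaussian noise, its scale, and all names are our own illustrative choices.

```python
import math

def posterior_given_noisy(x, prior, noise_density):
    """Bayes' rule in the spirit of Corollary 9.7: for a finitely supported
    prior on U and observation X = U + E with noise density p_E,
    P[U = u | X = x] is proportional to p_E(x - u) * P{U = u}
    (the likelihood of Eq. (82) plugged into the ratio of Eq. (80))."""
    weights = {u: noise_density(x - u) * p for u, p in prior.items()}
    z = sum(weights.values())   # the denominator; positive for positive p_E
    return {u: w / z for u, w in weights.items()}

# Gaussian noise with standard deviation 0.5 (variance 0.25):
p_E = lambda e: math.exp(-e * e / (2 * 0.25)) / math.sqrt(2 * math.pi * 0.25)

post = posterior_given_noisy(0.9, {0: 0.5, 1: 0.5}, p_E)
print(post)  # mass concentrates on u = 1, the value nearer the observation
```

Since p_E here is positive, bounded, and computable, this finite computation is exactly the kind the corollary guarantees can be carried out effectively.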
Furthermore, Corollary 9.7 implies that the conditional distribution of a random variable given a continuous observation cannot always be uniformly approximated using noise, in the sense that one cannot computably tell how little noise must be present to obtain a given accuracy. For example, consider our main construction (Theorem 7.6) corrupted with noise E = σZ, where Z is a standard Gaussian random variable and σ > 0 determines the standard deviation. Even though, as σ → 0, the conditional distribution given the corrupted observation converges weakly to the uncorrupted conditional distribution with σ = 0, the noncomputability of the uncorrupted conditional distribution implies that one cannot uniformly compute a value of σ from a desired bound on the error introduced to the conditional distribution corrupted by the noise σZ.

9.4 Exchangeable Setting

Freer and Roy [2010] show how to compute conditional distributions in the setting of exchangeable sequences, i.e., sequences of random variables whose joint distribution is invariant to permutations of the indices. A classic result by de Finetti shows that exchangeable sequences of random variables are in fact conditionally i.i.d. sequences, conditioned on a random measure, often called the directing random measure. Freer and Roy describe how to transform an algorithm for sampling an exchangeable sequence into a rule for computing the posterior distribution of the directing random measure given observations. The result is a corollary of a computable version of de Finetti's theorem [Freer and Roy 2009, 2012], and covers a wide range of common scenarios in nonparametric Bayesian statistics (often where no conditional density exists).

ACKNOWLEDGMENTS

A preliminary version of this article appears as "Noncomputable conditional distributions" in Proceedings of the 26th Annual IEEE Symposium on Logic in Computer Science, 107–116 [Ackerman et al.
2011], which was based upon results first appearing in an arXiv preprint [Ackerman et al. 2010]. C. E. Freer was partially supported by NSF grants DMS-0901020 and DMS-0800198, DARPA Contract Award No. FA8750-14-C-0001, and a grant from the John Templeton Foundation. D. M. Roy was partially supported by graduate fellowships from the National Science Foundation and MIT Lincoln Laboratory, a Newton International Fellowship, a Research Fellowship at Emmanuel College, and an Ontario Early Researcher Award. Work on this publication was also made possible through the support of ARO grant W911NF-13-1-0212, ONR grant N00014-13-1-0333, DARPA Contract Award No. FA8750-14-2-0004, and Google's "Rethinking AI" project. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation or the U.S. Government. The authors would like to thank Jeremy Avigad, Henry Cohn, Leslie Kaelbling, Vikash Mansinghka, Arno Pauly, Hartley Rogers, Michael Sipser, and Joshua Tenenbaum for helpful discussions, Bill Thurston for a useful comment regarding Lemma 8.1, Martín Escardó for a helpful discussion regarding computable random variables, and Quinn Culver, Noah Goodman, Jonathan Huggins, Leslie Kaelbling, Bjørn Kjos-Hanssen, Vikash Mansinghka, Timothy O'Donnell, Geoff Patterson, and the anonymous referees for comments on earlier drafts.

REFERENCES

Nathanael L. Ackerman, Cameron E. Freer, and Daniel M. Roy. 2010. On the computability of conditional probability. (2010). Preprint, http://arxiv.org/abs/1005.3014v1.
Nathanael L. Ackerman, Cameron E. Freer, and Daniel M. Roy. 2011. Noncomputable conditional distributions. In Proc. of the 26th Ann. IEEE Symp. on Logic in Comput. Sci. (LICS 2011). IEEE Computer Society, 107–116.
Shai Ben-David, Benny Chor, Oded Goldreich, and Michael Luby. 1992. On the theory of average case complexity. J.
Comput. System Sci. 44, 2 (1992), 193–219.
Jens Blanck. 1997. Domain representability of metric spaces. Annals of Pure and Applied Logic 83, 3 (1997), 225–247.
Volker Bosserhoff. 2008. Notions of probabilistic computability on represented spaces. J.UCS 14, 6 (2008), 956–995.
Mark Braverman. 2005. On the complexity of real functions. In FOCS'05—Proc. of the 46th Ann. IEEE Symp. on Foundations of Comput. Sci. IEEE Computer Society, 155–164.
Mark Braverman and Stephen Cook. 2006. Computing over the reals: foundations for scientific computing. Notices Amer. Math. Soc. 53, 3 (2006), 318–329.
Gregory F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42, 2–3 (1990), 393–405.
Paul Dagum and Michael Luby. 1993. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 1 (1993), 141–153.
Paul Dagum and Michael Luby. 1997. An optimal approximation algorithm for Bayesian inference. Artificial Intelligence 93, 1–2 (1997), 1–27.
François G. Dorais, Gerald Edgar, and Jason Rute. 2013. Continuity on a measure one set versus measure one set of points of continuity. MathOverflow. http://mathoverflow.net/q/146063 (version: 2013-10-27).
Abbas Edalat. 1996. The Scott topology induces the weak topology. In Proc. of the 11th Ann. IEEE Symp. on Logic in Comput. Sci. (New Brunswick, NJ, 1996). IEEE Comput. Soc. Press, Los Alamitos, CA, 372–381.
Abbas Edalat and Reinhold Heckmann. 1998. A computational model for metric spaces. Theoret. Comput. Sci. 193, 1–2 (1998), 53–73.
Cameron E. Freer and Daniel M. Roy. 2009. Computable exchangeable sequences have computable de Finetti measures. In Mathematical Theory and Computational Practice (CiE 2009), Proc. of the 5th Conf. on Computability in Europe (Lecture Notes in Comput. Sci.), Klaus Ambos-Spies, Benedikt Löwe, and Wolfgang Merkle (Eds.), Vol. 5635. Springer, 218–231.
Cameron E. Freer and Daniel M.
Roy. 2010. Posterior distributions are computable from predictive distributions. In Proc. of the 13th Int. Conf. on Artificial Intelligence and Statistics (AISTATS 2010) (Y. W. Teh and M. Titterington, eds.), JMLR: W&CP 9. 233–240.
Cameron E. Freer and Daniel M. Roy. 2012. Computable de Finetti measures. Ann. Pure Appl. Logic 163, 5 (2012), 530–546.
Peter Gács. 2005. Uniform test of algorithmic randomness over a general space. Theoret. Comput. Sci. 341, 1–3 (2005), 91–137.
Stefano Galatolo, Mathieu Hoyrup, and Cristóbal Rojas. 2010. Effective symbolic dynamics, random points, statistical behavior, complexity and entropy. Inform. and Comput. 208, 1 (2010), 23–41.
E. Mark Gold. 1967. Language identification in the limit. Inform. and Control 10, 5 (1967), 447–474.
Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: a language for generative models. In Proc. of the 24th Conf. on Uncertainty in Artificial Intelligence.
Armin Hemmerling. 2002. Effective metric spaces and representations of the reals. Theoret. Comput. Sci. 284, 2 (2002), 347–372.
Mathieu Hoyrup. 2008. Computability, Randomness and Ergodic Theory on Metric Spaces. Ph.D. Dissertation. Université Paris Diderot (Paris VII).
Mathieu Hoyrup and Cristóbal Rojas. 2009a. An application of Martin-Löf randomness to effective probability theory. In Mathematical Theory and Computational Practice, Klaus Ambos-Spies, Benedikt Löwe, and Wolfgang Merkle (Eds.). Lecture Notes in Comput. Sci., Vol. 5635. Springer, 260–269.
Mathieu Hoyrup and Cristóbal Rojas. 2009b. Computability of probability measures and Martin-Löf randomness over metric spaces. Inform. and Comput. 207, 7 (2009), 830–847.
Mathieu Hoyrup and Cristóbal Rojas. 2011. Absolute continuity of measures and preservation of randomness. (2011). Preprint, https://members.loria.fr/MHoyrup/abscont.pdf.
Mathieu Hoyrup, Cristóbal Rojas, and Klaus Weihrauch. 2011. Computability of the Radon-Nikodym derivative. In Models of Computation in Context, Proc. of the 7th Conf. on Computability in Europe (CiE 2011) (Lecture Notes in Comput. Sci.), Benedikt Löwe, Dag Normann, Ivan N. Soskov, and Alexandra A. Soskova (Eds.), Vol. 6735. Springer, 132–141.
Marcus Hutter. 2007. On universal prediction and Bayesian confirmation. Theoret. Comput. Sci. 384, 1 (2007), 33–48.
Olav Kallenberg. 2002. Foundations of modern probability (2nd ed.). Springer, New York. xx+638 pages.
Alexander S. Kechris. 1995. Classical descriptive set theory. Graduate Texts in Mathematics, Vol. 156. Springer-Verlag, New York. xviii+402 pages.
Oleg Kiselyov and Chung-chieh Shan. 2009. Embedded probabilistic programming. In Domain-Specific Languages (Lecture Notes in Computer Science), Walid Mohamed Taha (Ed.), Vol. 5658. Springer, 360–384.
Stephen Cole Kleene. 1938. On notation for ordinal numbers. J. Symbolic Logic 3, 4 (1938), 150–155.
Donald E. Knuth and Andrew C. Yao. 1976. The complexity of nonuniform random number generation. In Algorithms and Complexity (Proc. Sympos., Carnegie-Mellon Univ., Pittsburgh, Pa., 1976). Academic Press, New York, 357–428.
A. N. Kolmogorov. 1933. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer. vii+62 pages.
Leonid A. Levin. 1986. Average case complete problems. SIAM J. Comput. 15, 1 (1986), 285–286.
Irwin Mann. 1973. Probabilistic recursive functions. Trans. Amer. Math. Soc. 177 (1973), 447–467.
T. Minka, J. M. Winn, J. P. Guiver, and D. A. Knowles. 2010. Infer.NET 2.4. Microsoft Research Cambridge. http://research.microsoft.com/infernet.
Kenshi Miyabe. 2013. L1-computability, layerwise computability and Solovay reducibility. Computability 2, 1 (2013), 15–29.
Yiannis N. Moschovakis. 2009. Descriptive set theory (2nd ed.). Mathematical Surveys and Monographs, Vol. 155. American Mathematical Society, Providence, RI. xiv+502 pages.
Norbert Th. Müller. 1999. Computability on random variables. Theoret. Comput. Sci. 219, 1–2 (1999), 287–299.
J. Myhill. 1971. A recursive function, defined on a compact interval and having a continuous derivative that is not recursive. Michigan Math. J. 18 (1971), 97–98.
André Nies. 2009. Computability and randomness. Oxford Logic Guides, Vol. 51. Oxford University Press, Oxford. xvi+433 pages.
Daniel N. Osherson, Michael Stob, and Scott Weinstein. 1988. Mechanical learners pay a price for Bayesianism. J. Symbolic Logic 53, 4 (1988), 1245–1251.
Sungwoo Park, Frank Pfenning, and Sebastian Thrun. 2008. A probabilistic language based on sampling functions. ACM Trans. Program. Lang. Syst. 31, 1 (2008), 1–46.
J. Pfanzagl. 1979. Conditional distributions as derivatives. Ann. Probab. 7, 6 (1979), 1046–1050.
Avi Pfeffer. 2001. IBAL: A probabilistic rational programming language. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann Publ., 733–740.
David Poole. 1991. Representing Bayesian networks within probabilistic Horn abduction. In Proc. of the 7th Conf. on Uncertainty in Artificial Intelligence. 271–278.
Marian B. Pour-El and J. Ian Richards. 1989. Computability in analysis and physics. Springer-Verlag, Berlin. xii+206 pages.
Hilary Putnam. 1965. Trial and error predicates and the solution to a problem of Mostowski. J. Symbolic Logic 30 (1965), 49–57.
M. M. Rao. 1988. Paradoxes in conditional probability. J. Multivariate Anal. 27, 2 (1988), 434–446.
M. M. Rao. 2005. Conditional measures and applications (2nd ed.). Pure and Applied Mathematics, Vol. 271. Chapman & Hall/CRC, Boca Raton, FL. xxiv+483 pages.
Matthew Richardson and Pedro Domingos. 2006. Markov logic networks. Machine Learning 62, 1–2 (2006), 107–136.
Hartley Rogers, Jr. 1987. Theory of recursive functions and effective computability (2nd ed.). MIT Press, Cambridge, MA. xxii+482 pages.
Daniel M. Roy. 2011.
Computability, inference and modeling in probabilistic programming. Ph.D. Dissertation. Massachusetts Institute of Technology.
Mark J. Schervish. 1995. Theory of statistics. Springer-Verlag, New York. xvi+702 pages.
Matthias Schröder. 2007. Admissible representations for probability measures. Math. Log. Q. 53, 4–5 (2007), 431–445.
R. J. Solomonoff. 1964. A formal theory of inductive inference II. Inform. and Control 7 (1964), 224–254.
Hayato Takahashi. 2008. On a definition of random sequences with respect to conditional probability. Inform. and Comput. 206, 12 (2008), 1375–1382.
Tue Tjur. 1974. Conditional probability distributions. Institute of Mathematical Statistics, University of Copenhagen, Copenhagen. iv+370 pages.
Tue Tjur. 1975. A constructive definition of conditional distributions. Institute of Mathematical Statistics, University of Copenhagen, Copenhagen. 20 pages.
Tue Tjur. 1980. Probability based on Radon measures. John Wiley & Sons Ltd., Chichester. xi+232 pages.
Klaus Weihrauch. 1993. Computability on computable metric spaces. Theoret. Comput. Sci. 113, 2 (1993), 191–210.
Klaus Weihrauch. 1999. Computability on the probability measures on the Borel sets of the unit interval. Theoret. Comput. Sci. 219, 1–2 (1999), 421–437.
Klaus Weihrauch. 2000. Computable analysis: an introduction. Springer-Verlag, Berlin. x+285 pages.
Tomoyuki Yamakami. 1999. Polynomial time samplable distributions. J. Complexity 15, 4 (1999), 557–574.
A. K. Zvonkin and L. A. Levin. 1970. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Uspekhi Mat. Nauk 25, 6 (156) (1970), 85–127.
