Geometry and Expressive Power of Conditional Restricted Boltzmann Machines
Guido Montúfar¹, Nihat Ay¹,²,³, and Keyan Ghazi-Zahedi¹

¹ Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
² Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, 04009 Leipzig, Germany
³ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

Abstract

Conditional restricted Boltzmann machines are undirected stochastic neural networks with a layer of input and output units connected bipartitely to a layer of hidden units. These networks define models of conditional probability distributions on the states of the output units given the states of the input units, parametrized by interaction weights and biases. We address the representational power of these models, proving results on their ability to represent conditional Markov random fields and conditional distributions with restricted supports, the minimal size of universal approximators, the maximal model approximation errors, and the dimension of the set of representable conditional distributions. We contribute new tools for investigating conditional probability models, which allow us to improve the results that can be derived from existing work on restricted Boltzmann machine probability models.

Keywords: conditional restricted Boltzmann machine, universal approximation, Kullback-Leibler approximation error, expected dimension

1 Introduction

Restricted Boltzmann machines (RBMs) (Smolensky 1986; Freund and Haussler 1994) are generative probability models defined by undirected stochastic networks with bipartite interactions between visible and hidden units. These models are well known in machine learning applications, where they are used to infer distributed representations of data and to train the layers of deep neural networks (Hinton et al. 2006; Bengio 2009).
The restricted connectivity of these networks allows them to be trained efficiently on the basis of cheap inference and finite Gibbs sampling (Hinton 2002; 2012), even when they are defined with many units and parameters. An RBM defines Gibbs-Boltzmann probability distributions over the observable states of the network, depending on the interaction weights and biases. An introduction is offered by Fischer and Igel (2012). The expressive power of these probability models has attracted much attention and has been studied in numerous papers, treating, in particular, their universal approximation properties (Younes 1996; Le Roux and Bengio 2008; Montúfar and Ay 2011), approximation errors (Montúfar et al. 2011), efficiency of representation (Martens et al. 2013; Montúfar and Morton 2015), and dimension (Cueto et al. 2010).

In certain applications it is preferable to work with conditional probability distributions instead of joint probability distributions. For example, in a classification task, the conditional distribution may be used to indicate a belief about the class of an input, without modeling the probability of observing that input; in sensorimotor control, it can describe a stochastic policy for choosing actions based on world observations; and in the context of information communication, it can describe a channel. RBMs naturally define models of conditional probability distributions, called conditional restricted Boltzmann machines (CRBMs). These models inherit many of the nice properties of RBM probability models, such as cheap inference and efficient training. Specifically, a CRBM is defined by clamping the states of an input subset of the visible units of an RBM. For each input state one obtains a conditional distribution over the states of the output visible units. See Figure 1 for an illustration of this architecture.
This kind of conditional model, and slight variants thereof, has seen success in many applications; for example, in classification (Larochelle and Bengio 2008), collaborative filtering (Salakhutdinov et al. 2007), motion modeling (Taylor et al. 2007; Zeiler et al. 2009; Mnih et al. 2012; Sutskever and Hinton 2007), and reinforcement learning (Sallans and Hinton 2004). So far, however, there is not much theoretical work addressing the expressive power of CRBMs. We note that it is relatively straightforward to obtain some results on the expressive power of CRBMs from the existing theoretical work on RBM probability models. Nevertheless, an accurate analysis requires taking into account the specificities of the conditional case. Formally, a CRBM is a collection of RBMs, with one RBM for each possible input value. These RBMs differ in the biases of the hidden units, as these are influenced by the input values. However, these hidden biases are not independent across the different inputs, and, moreover, the same interaction weights and the same biases of the visible units are shared across all inputs. This sharing of parameters draws a substantial distinction between CRBM models and independent tuples of RBM models.

In this paper we address the representational power of CRBMs, contributing theoretical insights into the optimal number of hidden units. Our focus lies on the classes of conditional distributions that can possibly be represented by a CRBM with a fixed number of inputs and outputs, depending on the number of hidden units. Having said this, we do not discuss the problem of finding the optimal parameters that give rise to a desired conditional distribution (although our derivations include an algorithm that does this), nor problems related to incomplete knowledge of the target conditional distributions and generalization errors.
A number of training methods for CRBMs have been discussed in the references listed above, depending on the concrete applications. The problems that we deal with here are the following: 1) are distinct parameters of the model mapped to distinct conditional distributions; and what is the smallest number of hidden units that suffices for obtaining a model that can 2) approximate any target conditional distribution arbitrarily well (a universal approximator); 3) approximate any target conditional distribution without exceeding a given error tolerance; 4) approximate selected classes of conditional distributions arbitrarily well? We provide non-trivial solutions to all of these problems. We focus on the case of binary units, but the main ideas extend to the case of discrete non-binary units.

This paper is organized as follows. Section 2 contains formal definitions and elementary properties of CRBMs. Section 3 investigates the geometry of CRBM models in three subsections. In Section 3.1 we study the dimension of the sets of conditional distributions represented by CRBMs and show that in most cases this is the dimension expected from counting parameters (Theorem 4). In Section 3.2 we address the universal approximation problem, deriving upper and lower bounds on the minimal number of hidden units that suffices for this purpose (Theorem 7). In Section 3.3 we analyze the maximal approximation errors of CRBMs (assuming optimal parameters) and derive an upper bound on the minimal number of hidden units that suffices to approximate every conditional distribution within a given error tolerance (Theorem 11). Section 4 investigates the expressive power of CRBMs in two subsections.

Figure 1: Architecture of a CRBM, with input layer x_1, ..., x_k, hidden layer z_1, ..., z_m, and output layer y_1, ..., y_n, interaction weights V and W, and biases c and b. An RBM is the special case with k = 0.
In Section 4.1 we describe how CRBMs can represent natural families of conditional distributions that arise in Markov random fields. In Section 4.2 we study the ability of CRBMs to approximate conditional distributions with restricted supports. This section addresses, especially, the approximation of deterministic conditional distributions (Theorem 21). In Section 5 we offer a discussion and an outlook. In order to present the main results in a concise way, we have deferred all proofs to the appendices. Nonetheless, we think that the proofs are interesting in their own right, and we have prepared them with a fair amount of detail.

2 Definitions

We will denote the set of probability distributions on {0,1}^n by ∆_n. A probability distribution p ∈ ∆_n is a vector of 2^n non-negative entries p(y), y ∈ {0,1}^n, adding to one: Σ_{y ∈ {0,1}^n} p(y) = 1. The set ∆_n is a (2^n − 1)-dimensional simplex in R^{2^n}.

We will denote the set of conditional distributions of a variable y ∈ {0,1}^n, given another variable x ∈ {0,1}^k, by ∆_{k,n}. A conditional distribution p(·|·) ∈ ∆_{k,n} is a 2^k × 2^n row-stochastic matrix with rows p(·|x) ∈ ∆_n, x ∈ {0,1}^k. The set ∆_{k,n} is a 2^k(2^n − 1)-dimensional polytope in R^{2^k × 2^n}. It can be regarded as the 2^k-fold Cartesian product ∆_{k,n} = ∆_n × ··· × ∆_n, with one probability simplex ∆_n for each possible input state x ∈ {0,1}^k. We will use the abbreviation [N] := {1, ..., N}, where N is a natural number.

Definition 1.
The conditional restricted Boltzmann machine (CRBM) with k input units, n output units, and m hidden units, denoted RBM^k_{n,m}, is the set of all conditional distributions in ∆_{k,n} that can be written as

    p(y|x) = (1 / Z(W, b, Vx + c)) Σ_{z ∈ {0,1}^m} exp(z^T V x + z^T W y + b^T y + c^T z),  for all y ∈ {0,1}^n, x ∈ {0,1}^k,

with normalization function

    Z(W, b, Vx + c) = Σ_{y ∈ {0,1}^n} Σ_{z ∈ {0,1}^m} exp(z^T V x + z^T W y + b^T y + c^T z),  for all x ∈ {0,1}^k.

Here, x, y, and z are column state vectors of the k input units, n output units, and m hidden units, respectively, and ^T denotes transposition. The parameters of this model are the matrices of interaction weights V ∈ R^{m×k}, W ∈ R^{m×n} and the vectors of biases b ∈ R^n, c ∈ R^m.

When there are no input units (k = 0), the model RBM^k_{n,m} reduces to the restricted Boltzmann machine probability model with n visible units and m hidden units, denoted RBM_{n,m}. We can view RBM^k_{n,m} as a collection of 2^k restricted Boltzmann machine probability models with shared parameters. For each input x ∈ {0,1}^k, the output distribution p(·|x) is the probability distribution represented by RBM_{n,m} for the parameters W, b, (Vx + c). All p(·|x) have the same interaction weights W and the same biases b for the visible units, and differ only in the biases (Vx + c) for the hidden units. The joint behavior of these distributions with shared parameters is not trivial.

The model RBM^k_{n,m} can also be regarded as representing block-wise normalized versions of the joint probability distributions represented by RBM_{n+k,m}. Namely, a joint distribution p ∈ RBM_{n+k,m} ⊆ ∆_{k+n} is an array with entries p(x, y), x ∈ {0,1}^k, y ∈ {0,1}^n. Conditioning p on x is equivalent to considering the normalized x-th row: p(y|x) = p(x, y) / Σ_{y'} p(x, y'), y ∈ {0,1}^n.
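To make Definition 1 concrete, the following brute-force sketch (our own illustration, not part of the paper; the function name is ours) evaluates p(·|x) for a small CRBM. The sum over the hidden states factorizes over hidden units, Σ_z exp(z^T h + b^T y) = exp(b^T y) Π_j (1 + exp(h_j)) with h = Vx + Wy + c, so only the outputs y need to be enumerated:

```python
import itertools
import numpy as np

def crbm_conditional(V, W, b, c, x):
    """Evaluate p(.|x) of Definition 1 by brute force (illustration only)."""
    x = np.asarray(x, dtype=float)
    n = W.shape[1]
    weights = []
    for y in itertools.product([0, 1], repeat=n):
        y = np.asarray(y, dtype=float)
        # sum over z in {0,1}^m factorizes over hidden units:
        # sum_z exp(z'(Vx + Wy + c) + b'y) = exp(b'y) * prod_j (1 + exp(h_j))
        h = V @ x + W @ y + c
        weights.append(np.exp(b @ y) * np.prod(1.0 + np.exp(h)))
    weights = np.array(weights)
    return weights / weights.sum()   # divides by Z(W, b, Vx + c)

rng = np.random.default_rng(0)
k, n, m = 2, 2, 3
V, W = rng.normal(size=(m, k)), rng.normal(size=(m, n))
b, c = rng.normal(size=n), rng.normal(size=m)
p = crbm_conditional(V, W, b, c, x=[1, 0])
print(p)   # a strictly positive distribution on {0,1}^2
```

Because the sum over z factorizes, the cost is linear in m; only the enumeration over y remains exponential in n.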
3 Geometry of Conditional Restricted Boltzmann Machines

In this section we investigate three basic questions about the geometry of CRBM models. First, what is the dimension of a CRBM model? Second, how many hidden units does a CRBM need in order to be able to approximate every conditional distribution arbitrarily well? Third, how accurate are the approximations of a CRBM, depending on the number of hidden units?

3.1 Dimension

The model RBM^k_{n,m} is defined by marginalizing out the hidden units of a graphical model. This implies that several choices of parameters may represent the same conditional distributions. In turn, the dimension of the set of representable conditional distributions may, in principle, be smaller than the number of model parameters. When the dimension of RBM^k_{n,m} is equal to the number of parameters, dim(RBM^k_{n,m}) = (k + n)m + n + m, or, otherwise, equal to the dimension of the ambient polytope of conditional distributions, dim(RBM^k_{n,m}) = 2^k(2^n − 1), then the model is said to have the expected dimension. In this section we show that RBM^k_{n,m} has the expected dimension for most triples (k, n, m). In particular, we show that this holds in all practical cases, where the number of hidden units m is smaller than exponential with respect to the number of visible units k + n.

The dimension of a parametric model is given by the maximum of the rank of the Jacobian of its parametrization (assuming mild differentiability conditions). Computing the rank of the Jacobian is not easy in general. One resort is to compute the rank only in the limit of large parameters, which corresponds to considering a piece-wise linearized version of the original model, called the tropical model. Cueto et al. (2010) used this approach to study the dimension of RBM probability models. Here we apply their ideas in order to study the dimension of CRBM conditional models.
The following functions from coding theory will be useful for phrasing the results:

Definition 2. Let A(n, d) denote the cardinality of the largest subset of {0,1}^n whose elements are pairwise at least Hamming distance d apart. Let K(n, d) denote the smallest cardinality of a set such that every element of {0,1}^n is at most Hamming distance d from that set.

Cueto et al. (2010) showed that dim(RBM_{n,m}) = nm + n + m for m + 1 ≤ A(n, 3), and dim(RBM_{n,m}) = 2^n − 1 for m ≥ K(n, 1). It is known that A(n, 3) ≥ 2^{n − ⌈log₂(n+1)⌉} and K(n, 1) ≤ 2^{n − ⌊log₂(n+1)⌋}. In turn, the probability model RBM_{n,m} has the expected dimension for most pairs (n, m). Noting that dim(RBM^k_{n,m}) ≥ dim(RBM_{k+n,m}) − (2^k − 1), we directly infer the following bounds for the dimension of conditional models:

Proposition 3.
• dim(RBM^k_{n,m}) ≥ (n + k)m + n + m + k − (2^k − 1) for m + 1 ≤ A(k + n, 3).
• dim(RBM^k_{n,m}) = 2^k(2^n − 1) for m ≥ K(k + n, 1).

These bounds are too loose and do not allow us to attest whether the conditional model has the expected dimension, unless m ≥ K(k + n, 1). Hence we need to study the conditional model in more detail. We obtain the following result:

Theorem 4. The conditional model RBM^k_{n,m} has the expected dimension in the following cases:
• dim(RBM^k_{n,m}) = (k + n + 1)m + n for m + 1 ≤ A(k + n, 4).
• dim(RBM^k_{n,m}) = 2^k(2^n − 1) for m ≥ K(k + n, 1).

We note the following practical version of the theorem, which results from inserting appropriate bounds on the functions A and K:

Corollary 5. The conditional model RBM^k_{n,m} has the expected dimension in the following cases:
• dim(RBM^k_{n,m}) = (n + k + 1)m + n for m ≤ 2^{(k+n) − ⌊log₂((k+n)² − (k+n) + 2)⌋}.
• dim(RBM^k_{n,m}) = 2^k(2^n − 1) for m ≥ 2^{(k+n) − ⌊log₂(k+n+1)⌋}.
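Theorem 4 can be checked numerically on small instances. The following sketch (ours, not from the paper) estimates dim(RBM^k_{n,m}) as the rank of a finite-difference Jacobian of the parametrization at a random parameter. For k = 1, n = 2, m = 2 we have m ≥ K(k + n, 1) = K(3, 1) = 2, so the theorem predicts the full dimension 2^k(2^n − 1) = 6, below the parameter count (k + n + 1)m + n = 10:

```python
import itertools
import numpy as np

def conditional_table(theta, k, n, m):
    """Flattened 2^k x 2^n conditional array of RBM^k_{n,m} at theta = (V, W, b, c)."""
    V = theta[:m * k].reshape(m, k)
    W = theta[m * k:m * (k + n)].reshape(m, n)
    b = theta[m * (k + n):m * (k + n) + n]
    c = theta[m * (k + n) + n:]
    rows = []
    for x in itertools.product([0, 1], repeat=k):
        w = []
        for y in itertools.product([0, 1], repeat=n):
            # hidden units summed out analytically, as in Definition 1
            h = V @ np.array(x, float) + W @ np.array(y, float) + c
            w.append(np.exp(b @ np.array(y, float)) * np.prod(1.0 + np.exp(h)))
        w = np.array(w)
        rows.append(w / w.sum())
    return np.concatenate(rows)

k, n, m = 1, 2, 2                         # here m = 2 >= K(k + n, 1) = 2
d_params = (k + n + 1) * m + n            # 10 parameters
rng = np.random.default_rng(1)
theta0 = rng.normal(size=d_params)
eps = 1e-6
# central finite-difference Jacobian, one column per parameter
J = np.stack([(conditional_table(theta0 + eps * e, k, n, m)
               - conditional_table(theta0 - eps * e, k, n, m)) / (2 * eps)
              for e in np.eye(d_params)], axis=1)
r = np.linalg.matrix_rank(J, tol=1e-7)
print(r, d_params)                        # expect rank 6 = 2^k (2^n - 1) < 10
```

The maximum Jacobian rank is attained on an open dense parameter set, so a random parameter almost surely realizes the model dimension.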
These results show that, in all cases of practical interest, where m is less than exponential in k + n, the dimension of the CRBM model is indeed equal to the number of model parameters. In all these cases, almost every conditional distribution that can be represented by the model is represented by at most finitely many different choices of parameters. On the other hand, the dimension alone is not very informative about the ability of a model to approximate target distributions. In particular, it may be that a high-dimensional model covers only a tiny fraction of the set of all conditional distributions, or that a low-dimensional model can approximate any target conditional relatively well. We address the minimal dimension and number of parameters of a universal approximator in the next section. In the subsequent section we address the approximation errors depending on the number of parameters.

3.2 Universal Approximation

In this section we ask for the smallest number of hidden units m for which the model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well. Note that each conditional distribution p(y|x) can be identified with the set of joint distributions of the form r(x, y) = q(x) p(y|x), with strictly positive marginals q(x). In particular, by fixing a marginal distribution, we obtain an identification of ∆_{k,n} with a subset of ∆_{k+n}. Figure 2 illustrates this identification in the case n = k = 1 and q ≡ 1/2. This implies that universal approximators of joint probability distributions define universal approximators of conditional distributions. We know that RBM_{n+k,m} is a universal approximator whenever m ≥ (1/2) 2^{k+n} − 1 (see Montúfar and Ay 2011), and therefore:

Figure 2: The polytope of conditional distributions ∆_{1,1} embedded in the probability simplex ∆_2 (vertices δ_00, δ_10, δ_01, δ_11).
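The identification underlying Figure 2 is easy to make explicit. A minimal sketch (ours; function names are illustrative), fixing the uniform input marginal q ≡ 1/2^k:

```python
import numpy as np

def embed(cond):
    """r(x, y) = q(x) p(y|x) with uniform q, embedding Delta_{k,n} into Delta_{k+n}."""
    q = np.full(cond.shape[0], 1.0 / cond.shape[0])
    return q[:, None] * cond

def recover(joint):
    """Condition on x: normalize each row of the joint array."""
    return joint / joint.sum(axis=1, keepdims=True)

p = np.array([[0.2, 0.8],    # p(.|x=0)
              [0.7, 0.3]])   # p(.|x=1); the case k = n = 1 of Figure 2
r = embed(p)
assert abs(r.sum() - 1.0) < 1e-12   # r is a joint distribution on {0,1}^2
assert np.allclose(recover(r), p)   # conditioning recovers p exactly
```

Since conditioning inverts the embedding row by row, any model that approximates the embedded joint distributions arbitrarily well also approximates the conditionals arbitrarily well, which is the argument used for Proposition 6.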
Proposition 6. The model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well whenever m ≥ (1/2) 2^{k+n} − 1.

This improves previous results by Younes (1996) and van der Maaten (2011). On the other hand, since conditional models do not need to model the input-state distribution, in principle it is possible that RBM^k_{n,m} is a universal approximator even if RBM_{n+k,m} is not. In fact, we obtain the following improvement of Proposition 6, which does not follow from corresponding results for RBM probability models:

Theorem 7. The model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well whenever

    m ≥ (1/2) 2^k (2^n − 1),           if k ≥ 1;
    m ≥ (3/8) 2^k (2^n − 1) + 1,       if k ≥ 3;
    m ≥ (1/4) 2^k (2^n − 1 + 1/30),    if k ≥ 21.

In fact, the model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well whenever m ≥ 2^k K(r)(2^n − 1) + 2^{S(r)} P(r), where r is any natural number satisfying k ≥ 1 + ··· + r =: S(r), and K and P are functions (defined in Lemma 30 and Proposition 32) which tend to approximately 0.2263 and 0.0269, respectively, as r tends to infinity.

We note the following weaker but practical version of Theorem 7:

Corollary 8. Let k ≥ 1. The model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well whenever m ≥ (1/2) 2^k (2^n − 1) = (1/2) 2^{k+n} − (1/2) 2^k.

These results are significant because they reduce the bounds following from universal approximation results for probability models by an additive term of order 2^k, which corresponds precisely to the order of parameters needed in order to model the input-state distributions. As expected, the asymptotic behavior of the theorem's bound is exponential in the number of input and output units. This lies in the nature of the universal approximation property.
A crude lower bound on the number of hidden units needed for universal approximation can be obtained by comparing the number of parameters of the model with the dimension of the conditional polytope:

Proposition 9. If the model RBM^k_{n,m} can approximate every conditional distribution from ∆_{k,n} arbitrarily well, then necessarily m ≥ (1/(n + k + 1)) (2^k (2^n − 1) − n).

The results presented above highlight the fact that CRBM universal approximation may be possible with a drastically smaller number of hidden units than RBM universal approximation, for the same number of visible units. However, even with these reductions, the universal approximation property requires an enormous number of hidden units. In order to provide a more informative description of the approximation capabilities of CRBMs, in the next section we investigate how the maximal approximation error decreases as hidden units are added to the model.

3.3 Maximal Approximation Errors

From a practical perspective it is not necessary to approximate conditional distributions arbitrarily well; fair approximations suffice. This is especially important if the number of required hidden units grows disproportionately with the quality of the approximation. In this section we investigate the maximal approximation errors of CRBMs depending on the number of hidden units. Figure 3 gives a schematic illustration of the maximal approximation error of a conditional model.

Figure 3: Schematic illustration of the maximal approximation error D_{M_{k,n}} of a model M_{k,n} ⊆ ∆_{k,n}.
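For concreteness, one can tabulate the gap between the necessary bound of Proposition 9 and the sufficient bound of Corollary 8; a small sketch (ours; function names are illustrative):

```python
import math

def m_necessary(k, n):
    """Proposition 9: parameter-counting lower bound on m."""
    return math.ceil((2**k * (2**n - 1) - n) / (n + k + 1))

def m_sufficient(k, n):
    """Corollary 8: (1/2) 2^k (2^n - 1) hidden units suffice."""
    return math.ceil(2**k * (2**n - 1) / 2)

for k, n in [(1, 1), (3, 2), (5, 4)]:
    print(k, n, m_necessary(k, n), m_sufficient(k, n))
```

For example, for k = 3 and n = 2 the necessary bound gives m ≥ 4 while the sufficient bound gives m ≥ 12; both scale like 2^{k+n}, differing by roughly the factor (n + k + 1)/2.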
The Kullback-Leibler divergence of two probability distributions p and q in ∆_{k+n} is given by

    D(p ‖ q) := Σ_x Σ_y p(x) p(y|x) log [ p(x) p(y|x) / (q(x) q(y|x)) ]
              = D(p_X ‖ q_X) + Σ_x p(x) D(p(·|x) ‖ q(·|x)),

where p_X(x) = Σ_{y ∈ {0,1}^n} p(x, y) denotes the marginal distribution over x ∈ {0,1}^k. The divergence of two conditional distributions p(·|·) and q(·|·) in ∆_{k,n} is given by

    D(p(·|·) ‖ q(·|·)) := Σ_x u_X(x) D(p(·|x) ‖ q(·|x)),

where u_X denotes the uniform distribution over x. Even if the divergence between two joint distributions does not vanish, the divergence between their conditional distributions may vanish.

The divergence from a conditional distribution p(·|·) to the set M_{k,n} of conditional distributions defined by a model of joint probability distributions M_{k+n} is given by

    D(p(·|·) ‖ M_{k,n}) := inf_{q ∈ M_{k,n}} D(p(·|·) ‖ q(·|·)) = inf_{q ∈ M_{k+n}} [ D(u_X p(·|·) ‖ q) − D(u_X ‖ q_X) ].

The maximum of the divergence from a conditional distribution to M_{k,n} satisfies

    D_{M_{k,n}} := max_{p(·|·) ∈ ∆_{k,n}} D(p(·|·) ‖ M_{k,n}) ≤ max_{p ∈ ∆_{k+n}} D(p ‖ M_{k+n}) =: D_{M_{k+n}}.

Hence we can bound the maximal divergence of a CRBM by the maximal divergence of an RBM (studied in Montúfar et al. 2011) and obtain the following:

Proposition 10. If m ≤ 2^{(n+k)−1} − 1, then the divergence from any conditional distribution p(·|·) ∈ ∆_{k,n} to the model RBM^k_{n,m} is bounded by

    D_{RBM^k_{n,m}} ≤ D_{RBM_{k+n,m}} ≤ (n + k) − ⌊log₂(m + 1)⌋ − (m + 1) / 2^{⌊log₂(m+1)⌋}.

This proposition implies the universal approximation result from Proposition 6 as the special case with vanishing approximation error, but it does not imply Theorem 7 in the same way.
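The conditional divergence defined above is straightforward to compute for explicitly given conditional distribution tables; a short sketch (ours), using the convention 0 log 0 = 0:

```python
import numpy as np

def cond_kl(p, q):
    """D(p(.|.) || q(.|.)) = sum_x u(x) sum_y p(y|x) log(p(y|x)/q(y|x)),
    with the uniform input distribution u, as defined in Section 3.3."""
    u = 1.0 / p.shape[0]
    mask = p > 0                      # terms with p(y|x) = 0 contribute nothing
    return u * np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([[1.0, 0.0], [0.5, 0.5]])
q = np.array([[0.9, 0.1], [0.5, 0.5]])
print(cond_kl(p, q))   # only the x = 0 row contributes: (1/2) log(1/0.9)
print(cond_kl(p, p))   # the divergence of a conditional to itself is 0
```

Note that the rows where p and q agree contribute zero, illustrating the remark above: two joint distributions may have positive divergence while their conditionals coincide.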
Taking more specific properties of the conditional model into account, we can improve the proposition and obtain the following:

Theorem 11. Let l ∈ [n]. The divergence from any conditional distribution in ∆_{k,n} to the model RBM^k_{n,m} is bounded from above by D_{RBM^k_{n,m}} ≤ n − l, whenever

    m ≥ (1/2) 2^k (2^l − 1),           if k ≥ 1;
    m ≥ (3/8) 2^k (2^l − 1) + 1,       if k ≥ 3;
    m ≥ (1/4) 2^k (2^l − 1 + 1/30),    if k ≥ 21.

In fact, the divergence from any conditional distribution in ∆_{k,n} to RBM^k_{n,m} is bounded from above by D_{RBM^k_{n,m}} ≤ n − l, where l is the largest integer with m ≥ 2^{k − S(r)} F(r)(2^l − 1) + R(r).

This theorem implies the universal approximation result from Theorem 7 as the special case with vanishing approximation error. We note the following weaker but practical version of Theorem 11 (analogous to Corollary 8):

Corollary 12. Let k ≥ 1 and l ∈ [n]. The divergence from any conditional distribution in ∆_{k,n} to the model RBM^k_{n,m} is bounded from above by D_{RBM^k_{n,m}} ≤ n − l, whenever m ≥ (1/2) 2^k (2^l − 1).

Given an error tolerance, we can use these bounds to find a sufficient number of hidden units that guarantees approximations within this tolerance. In plain terms, the results presented above show that the worst-case approximation error of CRBMs decreases at least with the logarithm of the number of hidden units. On the other hand, in practice one is not interested in approximating all possible conditional distributions, but only special classes. One can expect that CRBMs can approximate certain classes of conditional distributions better than others. This is the subject of the next section.

Figure 4: Example of a Markov random field and a corresponding RBM architecture that can represent it.
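Corollary 12 can be turned into a small lookup: given k, n, and m, the guaranteed worst-case error is n − l for the largest feasible l. A sketch (ours; the fallback value n is the trivial bound attained by the uniform output distribution, which requires no hidden units):

```python
import math

def error_bound(k, n, m):
    """Largest guaranteed bound D <= n - l from Corollary 12: pick the
    largest l in [n] with m >= (1/2) 2^k (2^l - 1)."""
    for l in range(n, 0, -1):
        if m >= 2**k * (2**l - 1) / 2:
            return n - l
    return n   # trivial bound: D <= n for the uniform output distribution

# the worst-case error guarantee shrinks as hidden units are added (k = 2, n = 4)
print([error_bound(2, 4, m) for m in [0, 2, 6, 14, 30]])   # [4, 3, 2, 1, 0]
```

With k = 2 and n = 4, each unit decrease of the guaranteed error roughly doubles the required number of hidden units, matching the logarithmic decrease stated above.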
4 Representation of Special Classes of Conditional Models

In this section we ask about the classes of conditional distributions that can be compactly represented by CRBMs, and whether CRBMs can approximate interesting conditional distributions using only a moderate number of hidden units. The first part of the question concerns familiar classes of conditional distributions that can be expressed in terms of CRBMs, which in turn would allow us to compare CRBMs with other models and to develop a more intuitive picture of Definition 1. The second part of the question clearly depends on the specific problem at hand. Nonetheless, some classes of conditional distributions may be considered generally interesting, as they contain solutions to all instances of certain classes of problems. An example is the class of deterministic conditional distributions, which suffices to solve any Markov decision problem in an optimal way.

4.1 Representation of Conditional Markov Random Fields

In this section we discuss the ability of CRBMs to represent conditional Markov random fields, depending on the number of hidden units that they have. The main idea is that each hidden unit of an RBM can be used to model the pure interaction of a group of visible units. This idea appeared in previous work by Younes (1996), in the context of universal approximation.

Definition 13. Consider a simplicial complex I on [N]; that is, a collection of subsets of [N] = {1, ..., N} such that A ∈ I implies B ∈ I for all B ⊆ A, and ∅ ∈ I. The random field E_I ⊆ ∆_N with interactions I is the set of probability distributions of the form

    p(x) = (1/Z) exp( Σ_{A ∈ I} θ_A Π_{i ∈ A} x_i ),  for all x = (x_1, ..., x_N) ∈ {0,1}^N,

with normalization Z = Σ_{x' ∈ {0,1}^N} exp( Σ_{A ∈ I} θ_A Π_{i ∈ A} x'_i ) and parameters θ_A ∈ R, A ∈ I.

We obtain the following result:

Theorem 14. Let I be a simplicial complex on [k + n]. If m ≥ |{A ∈ I : A ⊄ [k], A ≠ {k+1}, ..., A ≠ {k+n}, A ≠ ∅}|, then the model RBM^k_{n,m} can represent every conditional distribution of (x_{k+1}, ..., x_{k+n}), given (x_1, ..., x_k), that can be represented by E_I ⊆ ∆_{k+n}.

An interesting special case is when each output distribution can be chosen arbitrarily from a given Markov random field:

Corollary 15. Let I be a simplicial complex on [n] and for each x ∈ {0,1}^k let p_x be some probability distribution from E_I ⊆ ∆_n. If m ≥ 2^k (|I| − 1) − |{A ∈ I : |A| = 1}|, then the model RBM^k_{n,m} can represent the conditional distribution defined by q(y|x) = p_x(y), for all y ∈ {0,1}^n and all x ∈ {0,1}^k.

We note the following direct implication for RBM probability models:

Corollary 16. Let I be a simplicial complex on [n]. If m ≥ |{A ∈ I : |A| > 1}|, then RBM_{n,m} can represent any probability distribution p from E_I.

Figure 4 illustrates a Markov random field and an RBM architecture that can represent it.

4.2 Approximation of Conditional Distributions with Restricted Supports

In this section we continue the discussion of the classes of conditional distributions that can be represented by CRBMs, depending on the number of hidden units. Here we focus on a hierarchy of conditional distributions defined by the total number of input-output pairs with positive probability.

Definition 17. For any k, n, and 0 ≤ d ≤ 2^k(2^n − 1), let C_{k,n}(d) ⊆ ∆_{k,n} denote the union of all d-dimensional faces of ∆_{k,n}; that is, the set of conditional distributions that have a total of 2^k + d or fewer non-zero entries,

    C_{k,n}(d) := { p(·|·) ∈ ∆_{k,n} : |{(x, y) : p(y|x) > 0}| ≤ 2^k + d }.

Note that C_{k,n}(2^k(2^n − 1)) = ∆_{k,n}. The vertices (zero-dimensional faces) of ∆_{k,n} are the conditional distributions which assign positive probability to only one output, given each input; these are called deterministic.
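Definition 13 can be made concrete with a few lines of code; the following sketch (ours; interactions are indexed from 0 rather than 1) enumerates the distribution of a small random field E_I:

```python
import itertools
import numpy as np

def mrf_distribution(N, interactions):
    """Distribution of the random field E_I from Definition 13:
    p(x) proportional to exp( sum_{A in I} theta_A prod_{i in A} x_i ).
    `interactions` maps each non-empty A (a frozenset of unit indices) to theta_A."""
    states = list(itertools.product([0, 1], repeat=N))
    energy = [sum(theta * np.prod([x[i] for i in A])
                  for A, theta in interactions.items())
              for x in states]
    w = np.exp(energy)
    return dict(zip(states, w / w.sum()))

# pairwise field on 3 units with interactions {1}, {2}, {1,2}, {2,3} (0-indexed)
theta = {frozenset([0]): 0.5, frozenset([1]): -0.2,
         frozenset([0, 1]): 1.0, frozenset([1, 2]): -1.5}
p = mrf_distribution(3, theta)
# By Corollary 16, an RBM with m >= |{A in I : |A| > 1}| = 2 hidden units
# can represent any distribution of this field.
print(sum(p.values()))
```

The two higher-order terms θ_{12} and θ_{23} are exactly the interactions that, per Corollary 16, each cost one hidden unit; the singleton terms are absorbed by the visible biases.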
By Carathéodory's theorem, every element of C_{k,n}(d) is a convex combination of (d + 1) or fewer deterministic conditional distributions. The sets C_{k,n}(d) arise naturally in the context of reinforcement learning and partially observable Markov decision processes (POMDPs). Namely, every finite POMDP has an associated effective dimension d, which is the dimension of the set of all state processes that can be generated by stationary stochastic policies. Montúfar et al. (2014) showed that the policies represented by conditional distributions from the set C_{k,n}(d) are sufficient to generate all the processes that can be generated by ∆_{k,n}. In general, the effective dimension d is relatively small, such that C_{k,n}(d) is a much smaller policy search space than ∆_{k,n}. We have the following result:

Proposition 18. If m ≥ 2^k + d − 1, then the model RBM^k_{n,m} can approximate every element from C_{k,n}(d) arbitrarily well.

This result reflects the intuitive fact that each hidden unit can be used to model the probability of one input-output pair. Since each conditional distribution has 2^k input-output probabilities that are completely determined by the other probabilities (due to normalization), it is interesting to ask whether the number of hidden units indicated in the proposition is strictly necessary. Further below, Theorem 21 will show that, indeed, hidden units are required for modeling the positions of the positive-probability input-output pairs, even if their specific values do not need to be modeled. We note that certain structures of positive-probability input-output pairs can be modeled with fewer hidden units than stated in Proposition 18. A simple example is the following direct generalization of Corollary 8:

Proposition 19.
If d is divisible by 2^k and m ≥ d/2, then the model RBM^k_{n,m} can approximate every element from C_{k,n}(d) arbitrarily well, provided the set of positive-probability outputs is the same for all inputs.

In the following we focus on deterministic conditional distributions. This is a particularly interesting and simple class of conditional distributions with restricted supports. It is well known that any finite Markov decision process (MDP) has an optimal policy defined by a stationary deterministic conditional distribution (see Bellman 1957; Ross 1983). Furthermore, Ay et al. (2013) showed that it is always possible to define simple two-dimensional manifolds that approximate all deterministic conditional distributions arbitrarily well.

Certain classes of conditional distributions (in particular deterministic conditionals) coming from feedforward networks can be approximated arbitrarily well by CRBMs:

Theorem 20. The model RBM^k_{n,m} can approximate arbitrarily well every conditional distribution which can be represented by a feedforward network with k input units, a hidden layer of m linear threshold units, and an output layer of n sigmoid units. In particular, the model RBM^k_{n,m} can approximate arbitrarily well every deterministic conditional distribution from ∆_{k,n} which can be represented by a feedforward linear threshold network with k input, m hidden, and n output units.

The representational power of feedforward linear threshold networks has been studied intensively in the literature. For example, Wenzel et al. (2000) showed that a feedforward linear threshold network with k ≥ 1 input, m hidden, and n = 1 output units can represent the following:

• Any Boolean function f: {0,1}^k → {0,1}, when m ≥ 3 · 2^{k−1−⌊log₂(k+1)⌋}; e.g., when m ≥ (3/(k+2)) 2^k.

• The parity function f_parity: {0,1}^k → {0,1}; x ↦ Σ_i x_i mod 2, when m ≥ k.
• The indicator function of any union of $m$ linearly separable subsets of $\{0,1\}^k$.

Although CRBMs can approximate this rich class of deterministic conditional distributions arbitrarily well, the next result shows that the number of hidden units required for universal approximation of deterministic conditional distributions is rather large:

Theorem 21. The model $\mathrm{RBM}^k_{n,m}$ can approximate every deterministic policy from $\Delta_{k,n}$ arbitrarily well if $m \geq \min\left\{2^{k-1}, \frac{3n}{k+2} 2^k\right\}$ and only if $m \geq 2^{k/2} - \frac{(n+k)^2}{2n}$.

By this theorem, in order to approximate all deterministic conditional distributions arbitrarily well, a CRBM requires exponentially many hidden units with respect to the number of input units.

5 Conclusion

This paper gives a theoretical description of the representational capabilities of conditional restricted Boltzmann machines (CRBMs), relating model complexity and model accuracy. CRBMs are based on the well-studied restricted Boltzmann machine (RBM) probability models. We proved an extensive series of results that generalize recent theoretical work on the representational power of RBMs in a non-trivial way.

We studied the problem of parameter identifiability. We showed that every CRBM with up to exponentially many hidden units (in the number of input and output units) represents a set of conditional distributions of dimension equal to the number of model parameters. This implies that in all practical cases, CRBMs do not waste parameters, and, generically, only finitely many choices of the interaction weights and biases produce the same conditional distribution.

We addressed the classical problems of universal approximation and approximation quality.
Our results show that a CRBM with $m$ hidden units can approximate every conditional distribution of $n$ output units, given $k$ input units, with a Kullback-Leibler approximation error of at most $n - \log_2(m/2^{k-1} + 1)$ (assuming optimal parameters). Thus this model is a universal approximator whenever $m \geq \frac{1}{2} 2^k (2^n - 1)$. In fact, we provided tighter bounds depending on $k$. For instance, if $k \geq 21$, then the universal approximation property is attained whenever $m \geq \frac{1}{4} 2^k \left(2^n - \frac{29}{30}\right)$. Our proof is based on an upper bound for the complexity of an algorithm that packs Boolean cubes with sequences of non-overlapping stars, for which improvements may be possible. It is worth mentioning that the set of conditional distributions for which the approximation error is maximal may be very small. This is a largely open and difficult problem. We note that our results can be plugged into certain analytic integrals (Montúfar and Rauh 2014) to produce upper bounds for the expectation value of the approximation error when approximating conditional distributions drawn from a product Dirichlet density on the polytope of all conditional distributions. For future work it would be interesting to extend our (optimal-parameter) considerations by an analysis of the CRBM training complexity and the errors resulting from non-optimal parameter choices.

We also studied specific classes of conditional distributions that can be represented by CRBMs, depending on the number of hidden units. We showed that CRBMs can represent conditional Markov random fields by using each hidden unit to model the interaction of a group of visible variables. Furthermore, we showed that CRBMs can approximate all binary functions with $k$ input bits and $n$ output bits arbitrarily well if $m \geq 2^{k-1}$ or $m \geq \frac{3n}{k+2} 2^k$, and only if $m \geq 2^{k/2} - \frac{(n+k)^2}{2n}$.
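These closed-form bounds are easy to evaluate numerically. The following sketch (Python; the function names are ours, and the formulas are transcribed from the bounds as stated above, so treat them as an illustration rather than part of the results) tabulates the worst-case error bound and the hidden-unit thresholds:

```python
import math

def kl_error_bound(k, n, m):
    # Worst-case KL approximation error for RBM^k_{n,m} with optimal
    # parameters: n - log2(m / 2^(k-1) + 1), clipped below at 0.
    return max(0.0, n - math.log2(m / 2 ** (k - 1) + 1))

def universal_m(k, n):
    # The error bound vanishes exactly at m = (1/2) 2^k (2^n - 1).
    return 2 ** (k - 1) * (2 ** n - 1)

def deterministic_sufficient_m(k, n):
    # Theorem 21, sufficient side: min{2^(k-1), (3n/(k+2)) 2^k}.
    return min(2 ** (k - 1), 3 * n * 2 ** k / (k + 2))

def deterministic_necessary_m(k, n):
    # Theorem 21, necessary side: 2^(k/2) - (n+k)^2 / (2n).
    return 2 ** (k / 2) - (n + k) ** 2 / (2 * n)

k, n = 4, 3
m_star = universal_m(k, n)                   # 2^3 * (2^3 - 1) = 56
print(m_star, kl_error_bound(k, n, m_star))  # error bound reaches 0 here
print(kl_error_bound(k, n, 0))               # with no hidden units: n
```

Note that both deterministic-policy thresholds grow exponentially in $k$, but with exponents $k$ and $k/2$ respectively, which is the gap left open by Theorem 21.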
In particular, this implies that there are exponentially many deterministic conditional distributions which can only be approximated arbitrarily well by a CRBM if the number of hidden units is exponential in the number of input units. This aligns with well-known examples of functions that cannot be compactly represented by shallow feedforward networks, and reveals some of the intrinsic constraints of CRBM models that may prevent them from grossly over-fitting.

We think that the developed techniques can be used for studying other conditional probability models as well. In particular, for future work it would be interesting to compare the representational power of CRBMs and of combinations of CRBMs with feedforward nets (combined models of this kind include CRBMs with retroactive connections and recurrent temporal RBMs). Also, it would be interesting to apply our techniques to study stacks of CRBMs and other multilayer conditional models. Finally, although our analysis focuses on the case of binary units, the main ideas can be extended to the case of discrete non-binary units.

A Details on the Dimension

Proof of Proposition 3. Each joint distribution of $x$ and $y$ has the form $p(x,y) = p(x)p(y|x)$, and the set $\Delta_k$ of all marginals $p(x)$ has dimension $2^k - 1$. This shows the first statement. The items follow directly from the corresponding statements for the probability model.

Proof of Theorem 4. We will prove a stronger statement, where the condition on $m$ appearing in the first item is relaxed to the following: the set $\{0,1\}^{k+n}$ contains $m$ disjoint radius-$1$ Hamming balls whose union does not contain any set of the form $[x] := \{(x,y) \in \{0,1\}^{k+n} : y \in \{0,1\}^n\}$ for $x \in \{0,1\}^k$, and whose complement has full affine rank as a subset of $\mathbb{R}^{k+n}$.

The proof is based on the ideas developed in (Cueto et al.
2010) for studying the RBM probability model.

We consider the Jacobian of $\mathrm{RBM}^k_{n,m}$ for the parametrization given in Definition 1. The dimension of $\mathrm{RBM}^k_{n,m}$ is the maximum rank of the Jacobian over all possible choices of $\theta = (W, V, b, c) \in \mathbb{R}^N$, $N = n + m + (n+k)m$. Let $h_\theta(v) := \operatorname{argmax}_{z \in \{0,1\}^m} p(z|v)$ denote the most likely hidden state of $\mathrm{RBM}_{k+n,m}$ given the visible state $v = (x,y)$, depending on the parameter $\theta$. After a few direct algebraic manipulations, we find that the maximum rank of the Jacobian is bounded from below by the maximum over $\theta$ of the dimension of the column-span of the matrix $A_\theta$ with rows

$\left[(1, x^\top, y^\top),\; (1, x^\top, y^\top) \otimes h_\theta(x,y)^\top\right]$, for all $(x,y) \in \{0,1\}^{k+n}$, (1)

modulo vectors whose $(x,y)$-th entries are independent of $y$ given $x$. Here $\otimes$ is the Kronecker product, which is defined by $(a_{ij})_{i,j} \otimes (b_{kl})_{k,l} = (a_{ij}b_{kl})_{ik,jl}$. The modulo operation has the effect of disregarding the input distribution $p(x)$ in the joint distribution $p(x,y) = p(x)p(y|x)$ represented by the RBM. For example, from the first block of $A_\theta$ we can remove the columns that correspond to $x$ without affecting the mentioned column-span. Summarizing, the maximal column-rank of $A_\theta$ modulo the vectors whose $(x,y)$-th entries are independent of $y$ given $x$ is a lower bound for the dimension of $\mathrm{RBM}^k_{n,m}$.

Note that $A_\theta$ depends on $\theta$ in a discrete way; the parameter space $\mathbb{R}^N$ is partitioned into finitely many regions where $A_\theta$ is constant. The piecewise linear map thus emerging, with linear pieces represented by the $A_\theta$, is the tropical CRBM morphism, and its image is the tropical CRBM model. Each linear region of the tropical morphism corresponds to an inference function $h_\theta\colon \{0,1\}^{k+n} \to \{0,1\}^m$ taking visible state vectors to the most likely hidden state vectors.
Geometrically, such an inference function corresponds to $m$ slicings of the $(k+n)$-dimensional unit hypercube. Namely, every hidden unit divides the visible space $\{0,1\}^{k+n} \subset \mathbb{R}^{k+n}$ into two halfspaces, according to its preferred state. Each of these $m$ slicings defines a column block of the matrix $A_\theta$. More precisely, $A_\theta = (A \,|\, A_{C_1} \,|\, \cdots \,|\, A_{C_m})$, where $A$ is the matrix with rows $(1, v_1, \ldots, v_{k+n})$ for all $v \in \{0,1\}^{k+n}$, and $A_C$ is the same matrix with rows multiplied by the indicator function of the set $C$ of points $v$ classified as positive by a linear classifier (slicing).

If we consider only linear classifiers that select rows of $A$ corresponding to disjoint Hamming balls of radius one (that is, such that the $C_i$ are disjoint radius-one Hamming balls), then the rank of $A_\theta$ is equal to the number of such classifiers times $(n+k+1)$ (which is the rank of each block $A_{C_i}$), plus the rank of $A_{\{0,1\}^{k+n} \setminus \cup_{i \in [m]} C_i}$ (which is the remainder rank of the first block $A$). The column-rank modulo functions of $x$ is equal to this rank, minus $k+1$ (which is the dimension of the functions of $x$ spanned by columns of $A$), minus at most the number of cylinder sets $[x] = \{(x,y) : y \in \{0,1\}^n\}$, $x \in \{0,1\}^k$, that are contained in $\cup_{i \in [m]} C_i$. This completes the proof of the general statement in the first item.

The example given in the first item is a consequence of the following observations. Each cylinder set $[x]$ contains $2^n$ points. If a given cylinder set $[x]$ intersects a radius-$1$ Hamming ball $B$ but is not contained in it, then it also intersects the radius-$2$ Hamming sphere around $B$. Choosing the radius-$1$ Hamming ball slicings $C_1, \ldots, C_m$ to have centers at least Hamming distance $4$ apart, we can ensure that their union does not contain any cylinder set $[x]$.
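The rank count in this slicing argument can be checked numerically in a small case. The sketch below (Python with NumPy; not part of the proof) takes two disjoint radius-$1$ Hamming balls in $\{0,1\}^4$, builds the blocks $A$, $A_{C_1}$, $A_{C_2}$, and verifies that the rank of $(A \,|\, A_{C_1} \,|\, A_{C_2})$ equals the number of classifiers times $(l+1)$ plus the affine rank of the remaining rows:

```python
import numpy as np
from itertools import product

l = 4                                        # l = k + n visible units
V = np.array(list(product((0, 1), repeat=l)))
A = np.hstack([np.ones((2 ** l, 1)), V])     # rows (1, v)

def ball_mask(center):
    # Indicator of the radius-1 Hamming ball around `center`
    d = (V != np.array(center)).sum(axis=1)
    return d <= 1

# Centers at Hamming distance 4, so the balls are disjoint
masks = [ball_mask((0, 0, 0, 0)), ball_mask((1, 1, 1, 1))]
blocks = [A] + [A * m[:, None] for m in masks]   # A_C: rows zeroed outside C
A_theta = np.hstack(blocks)

rest = ~(masks[0] | masks[1])                # points covered by no ball
expected = len(masks) * (l + 1) + np.linalg.matrix_rank(A[rest])
print(np.linalg.matrix_rank(A_theta), expected)   # both equal 14 here
```

Each ball contributes rank $l+1 = 5$ (its $5$ points are affinely independent, as noted below in Section B), and the $6$ uncovered points have affine rank $4$, giving $14$ in total.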
The second item follows from the second item of Proposition 3: when the probability model $\mathrm{RBM}_{n+k,m}$ is full dimensional, then $\mathrm{RBM}^k_{n,m}$ is full dimensional.

Proof of Corollary 5. For the maximal cardinality of distance-$4$ binary codes of length $l$ it is known that $A(l,4) \geq 2^r$, where $r$ is the largest integer with $2^r < \frac{2^l}{1 + (l-1) + (l-1)(l-2)/2}$ (Gilbert 1952; Varshamov 1957), and so $A_2(l,4) \geq 2^{l - \lfloor \log_2(l^2 - l + 2) \rfloor}$. Furthermore, for the minimal size of radius-one covering codes of length $l$ it is known that $K(l,1) \leq 2^{l - \lfloor \log_2(l+1) \rfloor}$ (Cueto et al. 2010).

B Details on Universal Approximation

B.1 Sufficient Number of Hidden Units

This section contains the proof of Theorem 7 about the minimal size of CRBM universal approximators. The proof is constructive; given any target conditional distribution, it proceeds by adjusting the weights of the hidden units successively until obtaining the desired approximation. The idea of the proof is that each hidden unit can be used to model the probability of an output vector, for several different input vectors. The probability of a given output vector can be adjusted at will by a single hidden unit, jointly for several input vectors, when these input vectors are in general position. This comes at the cost of generating dependent output probabilities for all other inputs in the same affine space. The main difficulty of the proof lies in the construction of sequences of successively conflict-free groups of affinely independent inputs, and in estimating the shortest possible length of such sequences exhausting all possible inputs. The proof is composed of several lemmas and propositions. We start with a few definitions:

Definition 22.
Given two probability distributions $p$ and $q$ on a finite set $\mathcal{X}$, the Hadamard product or renormalized entry-wise product $p * q$ is the probability distribution on $\mathcal{X}$ defined by $(p * q)(x) = p(x)q(x) / \sum_{x'} p(x')q(x')$ for all $x \in \mathcal{X}$.

When building this product, we assume that the supports of $p$ and $q$ are not disjoint, so that the normalization term does not vanish. The probability distributions that can be represented by RBMs can be described in terms of Hadamard products. Namely, for every probability distribution $p$ that can be represented by $\mathrm{RBM}_{n,m}$, the model $\mathrm{RBM}_{n,m+1}$ with one additional hidden unit can represent precisely the probability distributions of the form $p' = p * q$, where $q = \lambda_0 r + (1 - \lambda_0) s$ is a mixture, with $\lambda_0 \in [0,1]$, of two strictly positive product distributions $r(x) = \prod_{i=1}^n r_i(x_i)$ and $s(x) = \prod_{i=1}^n s_i(x_i)$. In other words, each additional hidden unit amounts to Hadamard-multiplying the distributions representable by an RBM with the distributions representable as mixtures of product distributions. The same result is obtained by considering only the Hadamard products with mixtures where $r$ is equal to the uniform distribution. In this case, the distributions $p' = p * q$ are of the form $p' = \lambda p + (1 - \lambda)\, p * s$, where $s$ is any strictly positive product distribution and $\lambda = \frac{\lambda_0}{\lambda_0 + 2^n (1 - \lambda_0) \sum_x p(x) s(x)}$ is any weight in $[0,1]$.

Definition 23. A probability sharing step is a transformation taking a probability distribution $p$ to $p' = \lambda p + (1 - \lambda)\, p * s$, for some strictly positive product distribution $s$ and some $\lambda \in [0,1]$.

We will need two more standard definitions from coding theory:

Definition 24.
A radius-$1$ Hamming ball in $\{0,1\}^k$ is a set $B$ consisting of a length-$k$ binary vector and all its immediate neighbors; that is, $B = \{x \in \{0,1\}^k : d_H(x,z) \leq 1\}$ for some $z \in \{0,1\}^k$, where $d_H(x,z) := |\{i \in [k] : x_i \neq z_i\}|$ denotes the Hamming distance between $x$ and $z$. Here $[k] := \{1, \ldots, k\}$.

Definition 25. An $r$-dimensional cylinder set in $\{0,1\}^k$ is a set $C$ of length-$k$ binary vectors with arbitrary values in $r$ coordinates and fixed values in the other coordinates; that is, $C = \{x \in \{0,1\}^k : x_i = z_i \text{ for all } i \in \Lambda\}$ for some $z \in \{0,1\}^k$ and some $\Lambda \subseteq [k]$ with $k - |\Lambda| = r$.

The geometric intuition is simple: a cylinder set corresponds to the vertices of a face of a unit cube, and a radius-$1$ Hamming ball corresponds to the vertices of a corner of a unit cube. The vectors in a radius-$1$ Hamming ball are affinely independent. See Figure 5A for an illustration.

In order to prove Theorem 7, for each $k \in \mathbb{N}$ and $n \in \mathbb{N}$ we want to find an $m_{k,n} \in \mathbb{N}$ such that: for any given strictly positive conditional distribution $q(\cdot|\cdot)$, there exists $p \in \mathrm{RBM}_{n+k,0}$ and $m_{k,n}$ probability sharing steps taking $p$ to a strictly positive joint distribution $p'$ with $p'(\cdot|\cdot) = q(\cdot|\cdot)$. The idea is that the starting distribution is represented by an RBM with no hidden units, and each sharing step is realized by adding a hidden unit to the RBM. In order to obtain these sequences of sharing steps, we will use the following technical lemma:

Lemma 26. Let $B$ be a radius-$1$ Hamming ball in $\{0,1\}^k$ and let $C$ be a cylinder subset of $\{0,1\}^k$ containing the center of $B$. Let $\lambda_x \in (0,1)$ for all $x \in B \cap C$, let $\tilde{y} \in \{0,1\}^n$, and let $\delta_{\tilde{y}}$ denote the Dirac delta on $\{0,1\}^n$ assigning probability one to $\tilde{y}$.
Let $p \in \Delta_{k+n}$ be a strictly positive probability distribution with conditionals $p(\cdot|x)$, and let

$p'(\cdot|x) := \begin{cases} \lambda_x p(\cdot|x) + (1-\lambda_x)\delta_{\tilde{y}}, & \text{for all } x \in B \cap C \\ p(\cdot|x), & \text{for all } x \in \{0,1\}^k \setminus C. \end{cases}$

Then, for any $\epsilon > 0$, there is a probability sharing step taking $p$ to a joint distribution $p''$ with conditionals satisfying $\sum_y |p''(y|x) - p'(y|x)| \leq \epsilon$ for all $x \in (B \cap C) \cup (\{0,1\}^k \setminus C)$.

Proof. We define the sharing step $p \mapsto \lambda p + (1-\lambda)\, p * s$ with a product distribution $s$ supported on $\{\tilde{y}\} \times C \subseteq \{0,1\}^{k+n}$. Note that given any distribution $q$ on $C$ and a radius-$1$ Hamming ball $B$ whose center is contained in $C$, there is a product distribution $s$ on $C$ such that $(s(x))_{x \in C \cap B} \propto (q(x))_{x \in C \cap B}$. In other words, the restriction of a product distribution $s$ to a radius-$1$ Hamming ball $B$ can be made proportional to any non-negative vector of length $|B|$. To see this, note that a product distribution is a vector with entries $s(x) = \prod_{i \in [k]} s_i(x_i)$ for all $x = (x_1, \ldots, x_k)$, with factor distributions $s_i$. Hence the restriction of $s$ to $B$ is given by the vector $\left(\prod_i s_i(0),\; \frac{s_1(1)}{s_1(0)}\prod_i s_i(0),\; \ldots,\; \frac{s_k(1)}{s_k(0)}\prod_i s_i(0)\right)$, where, without loss of generality, we chose $B$ centered at $(0, \ldots, 0)$. Now, by choosing the factor distributions $s_i$ appropriately, the vector $\left(\frac{s_1(1)}{s_1(0)}, \ldots, \frac{s_k(1)}{s_k(0)}\right)$ can be made arbitrary in $\mathbb{R}^k_+$.

We have the following two implications of Lemma 26:

Corollary 27.
For any $\epsilon > 0$ and $q(\cdot|x) \in \Delta_n$ for all $x \in B \cap C$, there is an $\epsilon' > 0$ such that, for any strictly positive joint distribution $p \in \Delta_{k+n}$ with conditionals satisfying $\sum_y |p(y|x) - \delta_0(y)| \leq \epsilon'$ for all $x \in B \cap C$, there are $2^n - 1$ sharing steps taking $p$ to a joint distribution $p''$ with conditionals satisfying $\sum_y |p''(y|x) - p'(y|x)| \leq \epsilon$ for all $x \in (B \cap C) \cup (\{0,1\}^k \setminus C)$, where $\delta_0$ is the Dirac delta on $\{0,1\}^n$ assigning probability one to the vector of zeros and

$p'(\cdot|x) := \begin{cases} q(\cdot|x), & \text{for all } x \in B \cap C \\ p(\cdot|x), & \text{for all } x \in \{0,1\}^k \setminus C. \end{cases}$

Proof. Consider any $x \in B \cap C$. We will show that the probability distribution $q(\cdot|x) \in \Delta_n$ can be written as the transformation of a Dirac delta by $2^n - 1$ sharing steps. Then the claim follows from Lemma 26. Let $\sigma\colon \{0,1\}^n \to \{0, \ldots, 2^n - 1\}$ be an enumeration of $\{0,1\}^n$. Let $p^{(0)}(y|x) = \delta_{\sigma^{-1}(0)}(y)$ be the starting distribution (the Dirac delta concentrated at the state $\tilde{y} \in \{0,1\}^n$ with $\sigma(\tilde{y}) = 0$) and let the $t$-th sharing step be defined by

$p^{(t)}(y) = \lambda^x_{\sigma^{-1}(t)}\, p^{(t-1)}(y|x) + \left(1 - \lambda^x_{\sigma^{-1}(t)}\right) \delta_{\sigma^{-1}(t)}(y)$,

for some weight $\lambda^x_{\sigma^{-1}(t)} \in [0,1]$. After $2^n - 1$ sharing steps, we obtain the distribution

$p^{(2^n-1)}(y|x) = \sum_{\tilde{y}} \Big(\prod_{\tilde{y}' : \sigma(\tilde{y}') > \sigma(\tilde{y})} \lambda^x_{\tilde{y}'}\Big) \left(1 - \lambda^x_{\tilde{y}}\right) \delta_{\tilde{y}}(y)$, for all $y \in \{0,1\}^n$,

whereby $\lambda^x_{\tilde{y}} := 0$ for $\sigma(\tilde{y}) = 0$. This distribution is equal to $q(\cdot|x)$ for the following choice of weights:

$\lambda^x_{\tilde{y}} := 1 - \frac{q(\tilde{y}|x)}{1 - \sum_{\tilde{y}' : \sigma(\tilde{y}') > \sigma(\tilde{y})} q(\tilde{y}'|x)}$, for all $\tilde{y} \in \{0,1\}^n$.

It is easy to verify that these weights satisfy the condition $\lambda^x_{\tilde{y}} \in [0,1]$ for all $\tilde{y} \in \{0,1\}^n$, and $\lambda^x_{\tilde{y}} = 0$ for the $\tilde{y}$ with $\sigma(\tilde{y}) = 0$, independently of the specific choice of $\sigma$.

Note that this corollary does not make any statement about the rows $p''(\cdot|x)$ with $x \in C \setminus B$.
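The weight construction in this proof can be checked directly. The sketch below (Python; an illustration only, treating the sharing steps as exact probability mixtures rather than their approximate realizations in the model) transforms the Dirac delta on the first state into an arbitrary strictly positive target in $N - 1$ steps:

```python
def sharing_steps_to(q):
    # Transform the Dirac delta on state 0 into q over states 0..N-1 using
    # N-1 mixing steps p <- lam * p + (1 - lam) * delta_t, with the weights
    # lam_t = 1 - q[t] / (1 - sum_{t' > t} q[t']) from the proof.
    # Assumes q is strictly positive (so no denominator vanishes).
    N = len(q)
    p = [0.0] * N
    p[0] = 1.0                       # p^(0) = delta_0
    for t in range(1, N):
        tail = sum(q[t + 1:])
        lam = 1.0 - q[t] / (1.0 - tail)
        p = [lam * pi for pi in p]   # shrink the current distribution
        p[t] += 1.0 - lam            # and mix in the Dirac delta on state t
    return p

print(sharing_steps_to([0.1, 0.2, 0.3, 0.4]))   # recovers the target
```

Each loop iteration is one sharing step, so $2^n - 1$ steps suffice for a target on $\{0,1\}^n$, as the corollary states.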
When transforming the $(B \cap C)$-rows of $p$ according to Lemma 26, the $(C \setminus B)$-rows get transformed as well, in a non-trivial dependent way. Fortunately, there is a sharing step that allows us to "reset" exactly certain rows to a desired point measure, without introducing new non-trivial dependencies:

Corollary 28. For any $\epsilon > 0$, any cylinder set $C \subseteq \{0,1\}^k$, and any $\tilde{y} \in \{0,1\}^n$, any strictly positive joint distribution $p$ can be transformed by a probability sharing step to a joint distribution $p''$ with conditionals satisfying $\sum_y |p''(y|x) - p'(y|x)| \leq \epsilon$ for all $x \in \{0,1\}^k$, where

$p'(\cdot|x) := \begin{cases} \delta_{\tilde{y}}, & \text{for all } x \in C \\ p(\cdot|x), & \text{for all } x \in \{0,1\}^k \setminus C. \end{cases}$

Proof. The sharing step can be defined as $p'' = \lambda p + (1-\lambda)\, p * s$ with $s$ close to the uniform distribution on $\{\tilde{y}\} \times C$ and $\lambda$ close to $0$ (close enough depending on $\epsilon$).

We will refer to a sharing step as described in Corollary 28 as a reset of the $C$-rows of $p$.

With all the observations made above, we can construct an algorithm that generates an arbitrarily accurate approximation of any given conditional distribution by applying a sequence of sharing

Input: a strictly positive joint distribution $p$, a target conditional distribution $q(\cdot|\cdot)$, and $\epsilon > 0$
Output: a transformation $p'$ of the input distribution with $\sum_y |p'(y|x) - q(y|x)| \leq \epsilon$ for all $x$

Initialize $B \leftarrow \emptyset$;
while $B \not\supseteq \{0,1\}^k$ do
    Choose (disjoint) cylinder sets $C_1, \ldots$
, $C_K$ packing $\{0,1\}^k \setminus B$;
    If needed, perform at most $K$ sharing steps resetting the $C_i$-rows of $p$ for all $i \in [K]$, taking $p(\cdot|x)$ close to $\delta_0$ for all $x \in C_i$ for all $i \in [K]$ and leaving all other rows close to their current values, according to Corollary 28;
    for each $i \in [K]$ do
        Perform at most $2^n - 1$ sharing steps taking $p(\cdot|x)$ close to $q(\cdot|x)$ for all $x \in B_i$, where $B_i$ is some star contained in $C_i$, and leaving the $(\{0,1\}^k \setminus C_i)$-rows close to their current values, according to Corollary 27;
    end
    $B \leftarrow B \cup (\cup_{i \in [K]} B_i)$;
end

Algorithm 1: Algorithmic illustration of the proof of Theorem 7. The algorithm performs sequential sharing steps on a strictly positive joint distribution $p \in \Delta_{k+n}$ until the resulting distribution $p'$ has a conditional distribution $p'(\cdot|\cdot)$ satisfying $\sum_y |p'(y|x) - q(y|x)| \leq \epsilon$ for all $x$. Here $B \subseteq \{0,1\}^k$ denotes the set of inputs $x$ that have already been processed in the current iteration.

steps to any given strictly positive joint distribution. We denote by star the intersection of a radius-$1$ Hamming ball and a cylinder set containing the center of the ball. See Figure 5A. The details of the algorithm are given in Algorithm 1.

In order to obtain a bound on the number $m$ of hidden units for which $\mathrm{RBM}^k_{n,m}$ can approximate a given target conditional distribution arbitrarily well, we just need to evaluate the number of sharing steps run by Algorithm 1. For this purpose, we investigate the combinatorics of sharing step sequences and evaluate their worst-case lengths. We can choose as starting distribution some $p \in \mathrm{RBM}_{n+k,0}$ with conditionals satisfying $\sum_y |p(y|x) - \delta_0(y)| \leq \epsilon'$ for all $x \in \{0,1\}^k$, for some $\epsilon' > 0$ small enough depending on the target conditional $q(\cdot|\cdot)$ and the targeted approximation accuracy.

Definition 29. A sequence of stars $B_1, \ldots$
, $B_l$ packing $\{0,1\}^k$ with the property that the smallest cylinder set containing any of the stars in the sequence does not intersect any previous star in the sequence is called a star packing sequence for $\{0,1\}^k$.

The number of sharing steps run by Algorithm 1 is bounded from above by $(2^n - 1)$ times the length of a star packing sequence for the set of inputs $\{0,1\}^k$. Note that the choices of stars and the lengths of the possible star packing sequences are not unique. Figure 5B gives an example showing that starting a sequence with large stars is not necessarily the best strategy to produce a short sequence. The next lemma states that there is a class of star packing sequences of a certain length, depending on the size of the input space. Thereby, this lemma upper-bounds the worst-case complexity of Algorithm 1.

Lemma 30. Let $r \in \mathbb{N}$, $S(r) := 1 + 2 + \cdots + r$, $k \geq S(r)$, $f_i(z) := 2^{S(i-1)} + (2^i - (i+1))z$, and $F(r) := f_r(f_{r-1}(\cdots f_2(f_1)))$. There is a star packing sequence for $\{0,1\}^k$ of length $2^{k - S(r)} F(r)$. Furthermore, for this sequence, Algorithm 1 requires at most $R(r) := \prod_{i=2}^r (2^i - (i+1))$ resets.

Figure 5: A) Examples of radius-$1$ Hamming balls in cylinder sets of dimension $3$, $2$, and $1$.
The cylinder sets are shown as bold vertices connected by dashed edges, and the nested Hamming balls (stars) as bold vertices connected by solid edges. B) Three examples of star packing sequences for $\{0,1\}^3$. C) Illustration of the star packing sequence constructed in Lemma 30 for $\{0,1\}^6$.

Proof. The star packing sequence is constructed by the following procedure. In each step, we define a set of cylinder sets packing all sites of $\{0,1\}^k$ that have not been covered by stars so far, and include a sub-star of each of these cylinder sets in the sequence.

As an initialization step, we split $\{0,1\}^k$ into $2^{k-S(r)}$ $S(r)$-dimensional cylinder sets, denoted $D^{(j_1)}$, $j_1 \in \{1, \ldots, 2^{k-S(r)}\}$.

In the first step, for each $j_1$, the $S(r)$-dimensional cylinder set $D^{(j_1)}$ is packed by $2^{S(r-1)}$ $r$-dimensional cylinder sets $C^{(j_1),i}$, $i \in \{1, \ldots, 2^{S(r-1)}\}$. For each $i$, we define the star $B^{(j_1),i}$ as the radius-$1$ Hamming ball within $C^{(j_1),i}$ centered at the smallest element of $C^{(j_1),i}$ (with respect to the lexicographic order of $\{0,1\}^k$), and include it in the sequence. At this point, the set of sites in $D^{(j_1)}$ that have not yet been covered by stars is $D^{(j_1)} \setminus (\cup_i B^{(j_1),i})$. This set is split into $2^r - (r+1)$ $S(r-1)$-dimensional cylinder sets, which we denote by $D^{(j_1,j_2)}$, $j_2 \in \{1, \ldots, 2^r - (r+1)\}$. Note that $\cup_{j_1} D^{(j_1,j_2)}$ is a cylinder set, and hence, for each $j_2$, the $(\cup_{j_1} D^{(j_1,j_2)})$-rows of a conditional distribution being processed by Algorithm 1 can be jointly reset by one single sharing step, to achieve $p'(\cdot|x) \approx \delta_0$ for all $x \in \cup_{j_1} D^{(j_1,j_2)}$.

In the second step, for each $j_2$, the cylinder set $D^{(j_1,j_2)}$ is packed by $2^{S(r-2)}$ $(r-1)$-dimensional cylinder sets $C^{(j_1,j_2),i}$, $i \in \{1, \ldots, 2^{S(r-2)}\}$, and the corresponding stars are included in the sequence.
The procedure is iterated until the $r$-th step. In this step, each $D^{(j_1,\ldots,j_r)}$ is a $1$-dimensional cylinder set and is packed by a single $1$-dimensional cylinder set $C^{(j_1,\ldots,j_r),1} = B^{(j_1,\ldots,j_r),1}$. Hence, at this point, all of $\{0,1\}^k$ has been exhausted and the procedure terminates.

Table 1: Numerical evaluation of the bounds from Proposition 31. Each row evaluates the universal approximation bound $m^{(r)}_{n,k}$ for a value of $r$.

    r       m^{(r)}_{n,k} = 2^{k-S(r)} F(r) (2^n - 1) + R(r)
    1       2^{k-1}  *    1 * (2^n - 1) +    0
    2       2^{k-3}  *    3 * (2^n - 1) +    1
    3       2^{k-6}  *   20 * (2^n - 1) +    4
    4       2^{k-10} *  284 * (2^n - 1) +   44
    5       2^{k-15} * 8408 * (2^n - 1) + 1144
    ...     ...
    > 17    2^k * 0.2263 * (2^n - 1) + 2^{S(r)} * 0.0269

Summarizing, the procedure is initialized by creating the branches $D^{(j_1)}$, $j_1 \in [2^{k-S(r)}]$. In the first step, each branch $D^{(j_1)}$ produces $2^{S(r-1)}$ stars and splits into the branches $D^{(j_1,j_2)}$, $j_2 \in [2^r - (r+1)]$. More generally, in the $i$-th step, each branch $D^{(j_1,\ldots,j_i)}$ produces $2^{S(r-i)}$ stars and splits into the branches $D^{(j_1,\ldots,j_i,j_{i+1})}$, $j_{i+1} \in [2^{r-(i-1)} - (r+1-(i-1))]$. The total number of stars is given precisely by $2^{k-S(r)}$ times the value of the iterated function $F(r) = f_r(f_{r-1}(\cdots f_2(f_1)))$, whereby $f_1 = 1$. The total number of resets is given by the number of branches created from the first step on, which is precisely $R(r) = \prod_{i=2}^r (2^i - (i+1))$.

Figure 5C offers an illustration of these star packing sequences. The figure shows the case $k = S(3) = 6$. In this case, there is only one initial branch $D^{(1)} = \{0,1\}^6$. The stars $B^{(1),i}$, $i \in [2^{S(2)}] = [8]$, are shown in solid blue, $B^{(1,1),i}$, $i \in [2^{S(1)}] = [2]$, in dashed red, and $B^{(1,1,1),1}$ in dotted green.
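The star and reset counts of Lemma 30 can be reproduced directly from the recursive definitions. The following sketch (Python; helper names are ours) evaluates $S(r)$, $F(r)$, and $R(r)$ and recovers the coefficients tabulated in Table 1:

```python
def S(r):
    # S(r) = 1 + 2 + ... + r
    return r * (r + 1) // 2

def F(r):
    # F(r) = f_r(f_{r-1}(...f_2(f_1))), with f_1 = 1 and
    # f_i(z) = 2^{S(i-1)} + (2^i - (i+1)) * z
    z = 1
    for i in range(2, r + 1):
        z = 2 ** S(i - 1) + (2 ** i - (i + 1)) * z
    return z

def R(r):
    # R(r) = prod_{i=2}^{r} (2^i - (i+1)), the number of resets
    out = 1
    for i in range(2, r + 1):
        out *= 2 ** i - (i + 1)
    return out

print([F(r) for r in range(1, 6)])   # [1, 3, 20, 284, 8408]
print([R(r) for r in range(2, 6)])   # [1, 4, 44, 1144]
```

These are exactly the coefficients $F(r)$ and $R(r)$ appearing in the rows $r = 1, \ldots, 5$ of Table 1.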
For clarity, only these stars are highlighted. The stars $B^{(1,j_2),i}$ and $B^{(1,j_2,1),1}$ resulting from split branches are similar, translated versions of the highlighted ones.

With this, we obtain the general bound of the theorem:

Proposition 31 (Theorem 7, general bound). Let $k \geq S(r)$. The model $\mathrm{RBM}^k_{n,m}$ can approximate every conditional distribution from $\Delta_{k,n}$ arbitrarily well whenever $m \geq m^{(r)}_{n,k}$, where $m^{(r)}_{n,k} := 2^{k-S(r)} F(r)(2^n - 1) + R(r)$.

Proof. This follows in view of the complexity of Algorithm 1 for the sequence given in Lemma 30.

In order to make the universal approximation bound more comprehensible, in Table 1 we evaluate the sequence $m^{(r)}_{n,k}$ for $r = 1, 2, 3, \ldots$ and $k \geq S(r)$. Furthermore, the next proposition gives an explicit expression for the coefficients $2^{-S(r)}F(r)$ and $R(r)$ appearing in the bound. This yields the second part of Theorem 7. In general, the bound $m^{(r)}_{n,k}$ decreases with increasing $r$, except possibly for a few values of $k$ when $n$ is small. For a pair $(k,n)$, any $m^{(r)}_{n,k}$ with $k \geq S(r)$ is a sufficient number of hidden units for obtaining a universal approximator.

Proposition 32 (Theorem 7, explicit bounds). The function $K(r) := 2^{-S(r)}F(r)$ is bounded from below and above as $K(6)\prod_{i=7}^r \left(1 - \frac{i-3}{2^i}\right) \leq K(r) \leq K(6)\prod_{i=7}^r \left(1 - \frac{i-4}{2^i}\right)$ for all $r \geq 6$. Furthermore, $K(6) \approx 0.2442$ and $K(\infty) \approx 0.2263$. Moreover, $R(r) := \prod_{i=2}^r (2^i - (i+1)) = 2^{S(r)} P(r)$, where $P(r) := \frac{1}{2}\prod_{i=2}^r \left(1 - \frac{i+1}{2^i}\right)$, and $P(\infty) \approx 0.0269$.

Proof. From the definition of $S(r)$ and $F(r)$, we obtain that

$K(r) = 2^{-r} + K(r-1)\left(1 - 2^{-r}(r+1)\right)$. (2)

Note that $K(1) = \frac{1}{2}$, and that $K(r)$ decreases monotonically.
Now, note that if $K(r-1) \leq \frac{1}{c}$, then the left-hand side of Equation (2) is bounded from below as $K(r) \geq K(r-1)\left(1 - 2^{-r}(r+1-c)\right)$. For a given $c$, let $r_c$ be the first $r$ for which $K(r-1) \leq \frac{1}{c}$, assuming that such an $r$ exists. Then

$K(r) \geq K(r_c - 1)\prod_{i=r_c}^r \left(1 - \frac{i+1-c}{2^i}\right)$, for all $r \geq r_c$. (3)

Similarly, if $K(r) > \frac{1}{b}$ for all $r \geq r_b$, then

$K(r) \leq K(r_b - 1)\prod_{i=r_b}^r \left(1 - \frac{i+1-b}{2^i}\right)$, for any $r \geq r_b$.

Direct computations show that $K(6) \approx 0.2442 \leq \frac{1}{4}$. On the other hand, using the computational engine Wolfram|Alpha (accessed June 1, 2014), we obtain that $\prod_{i=0}^\infty \left(1 - \frac{i-3}{2^i}\right) \approx 7.7413$. Plugging both terms into Equation (3) yields that $K(r)$ is always bounded from below by $0.2259$. Since $K(r)$ is never smaller than or equal to $\frac{1}{5}$, we obtain that $K(r) \leq K(r_0 - 1)\prod_{i=r_0}^r \left(1 - \frac{i-4}{2^i}\right)$, for any $r_0$ and $r \geq r_0$. Using $r_0 = 7$, the right-hand side evaluates in the limit of large $r$ to approximately $0.2293$. Numerical evaluation of $K(r)$ from Equation (2) for $r$ up to one million (using Matlab R2013b) indicates that, indeed, $K(r)$ tends to approximately $0.2263$ for large $r$.

We close this subsection with the remark that the proof strategy can be used not only to study universal approximation, but also approximability of selected classes of conditional distributions:

Remark 33. If we only want to model a restricted class of conditional distributions, then adapting Algorithm 1 to these restrictions may yield tighter bounds for the number of hidden units that suffices to represent these restricted conditionals. For example: if we only want to model the target conditionals $q(\cdot|x)$ for the inputs $x$ from a subset $S \subseteq \{0,1\}^k$ and do not care about $q(\cdot|x)$ for $x \notin S$, then in the algorithm we just need to replace $\{0,1\}^k$ by $S$. In this case, a cylinder set packing of $S \setminus B$ is understood as a collection of disjoint cylinder sets $C_1, \ldots$
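The recursion (2) is cheap to iterate, so the numerical claims in this proof are easy to reproduce (a sketch; we iterate far fewer than the one million steps mentioned above, which already suffices for four-digit accuracy):

```python
def K(r):
    # Equation (2): K(r) = 2^{-r} + K(r-1) * (1 - (r+1) * 2^{-r}),
    # with K(1) = 1/2.
    val = 0.5
    for i in range(2, r + 1):
        val = 2.0 ** (-i) + val * (1.0 - (i + 1) * 2.0 ** (-i))
    return val

print(round(K(6), 4))    # ~0.2442
print(round(K(50), 4))   # ~0.2263, the limiting value K(infinity)
```

The sequence decreases monotonically from $K(1) = \frac{1}{2}$ and stabilizes near $0.2263$ well before $r = 50$, consistent with Proposition 32.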
, $C_K \subseteq \{0,1\}^k$ with $\cup_{i \in [K]} C_i \supseteq S \setminus B$ and $(\cup_{i \in [K]} C_i) \cap B = \emptyset$. Furthermore, if for some cylinder set $C_i$ and a corresponding star $B_i \subseteq C_i$ the conditionals $q(\cdot|x)$ with $x \in B_i$ have a common support set $T \subseteq \{0,1\}^n$, then the $C_i$-rows of $p$ can be reset to a distribution $\delta_y$ with $y \in T$, and only $|T| - 1$ sharing steps are needed to transform $p$ to a distribution whose conditionals approximate $q(\cdot|x)$ for all $x \in B_i$ to any desired accuracy. In particular, for the class of target conditional distributions with $\operatorname{supp} q(\cdot|x) = T$ for all $x$, the term $2^n - 1$ in the complexity bound of Algorithm 1 is replaced by $|T| - 1$.

B.2 Necessary Number of Hidden Units

Proposition 9 follows from simple parameter counting arguments. In order to make this rigorous, we first observe that universal approximation of (conditional) probability distributions by Boltzmann machines, or any other models based on exponential families, with or without hidden variables, requires the number of model parameters to be at least as large as the dimension of the set being approximated. We denote by $\Delta_{\mathcal{X},\mathcal{Y}}$ the set of conditionals with inputs from a finite set $\mathcal{X}$ and outputs from a finite set $\mathcal{Y}$. Accordingly, we denote by $\Delta_{\mathcal{Y}}$ the set of probability distributions on $\mathcal{Y}$.

Lemma 34. Let $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ be finite sets. Let $\mathcal{M} \subseteq \Delta_{\mathcal{X},\mathcal{Y}}$ be defined as the set of conditionals of the marginal $\mathcal{M}' \subseteq \Delta_{\mathcal{X}\times\mathcal{Y}}$ of an exponential family $\mathcal{E} \subseteq \Delta_{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}}$. If $\mathcal{M}$ is a universal approximator of conditionals from $\Delta_{\mathcal{X},\mathcal{Y}}$, then $\dim(\mathcal{E}) \geq \dim(\Delta_{\mathcal{X},\mathcal{Y}}) = |\mathcal{X}|(|\mathcal{Y}| - 1)$.

The intuition of this lemma is that, for models defined by marginals of exponential families, the set of conditionals that can be approximated arbitrarily well is essentially equal to the set of conditionals that can be represented exactly, implying that there are no low-dimensional universal approximators of this type.
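The parameter-counting consequence of this lemma (spelled out in the proof of Proposition 9 below) can be compared with the constructive bound of Theorem 7. A small sketch (Python; function names are ours, and the formulas are transcribed from the statements in this appendix):

```python
def dim_lower_bound_m(k, n):
    # Lemma 34 with parameter count mn + mk + n + m + 2^k - 1 >= 2^k (2^n - 1)
    # gives the necessary condition m >= (2^(k+n) - 2^k - n) / (n + k + 1).
    return (2 ** (k + n) - 2 ** k - n) / (n + k + 1)

def universal_sufficient_m(k, n):
    # Sufficient bound from Theorem 7 (the r = 1 row of Table 1).
    return 2 ** (k - 1) * (2 ** n - 1)

k, n = 4, 3
print(dim_lower_bound_m(k, n), universal_sufficient_m(k, n))
```

Both thresholds scale as $\Theta(2^{k+n})$; the gap between them is only a factor of roughly $(n+k+1)/2$, so the parameter-counting bound is tight up to a polynomial factor.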
Proof of Lemma 34. We consider first the case of probability distributions; that is, the case with $|\mathcal{X}| = 1$ and $\mathcal{X}\times\mathcal{Y} \cong \mathcal{Y}$. Let $\mathcal{M}$ be the image of the exponential family $\mathcal{E}$ under a differentiable map $f$ (for example, the marginal map). The closure $\overline{\mathcal{E}}$, which consists of all distributions that can be approximated arbitrarily well by $\mathcal{E}$, is a compact set. Since $f$ is continuous, the image of $\overline{\mathcal{E}}$ is also compact, and $\overline{\mathcal{M}} = \overline{f(\mathcal{E})} = f(\overline{\mathcal{E}})$. The model $\mathcal{M}$ is a universal approximator if and only if $\overline{\mathcal{M}} = \Delta_{\mathcal{Y}}$. The set $\overline{\mathcal{E}}$ is a finite union of exponential families: one exponential family $\mathcal{E}_F$ for each possible support set $F$ of distributions from $\overline{\mathcal{E}}$. When $\dim(\mathcal{E}) < \dim(\Delta_{\mathcal{Y}})$, each point of each $\mathcal{E}_F$ is a critical point of $f$ (the Jacobian is not surjective at that point). By Sard's theorem, each $\mathcal{E}_F$ is mapped by $f$ to a set of measure zero in $\Delta_{\mathcal{Y}}$. Hence the finite union $\cup_F f(\mathcal{E}_F) = f(\cup_F \mathcal{E}_F) = f(\overline{\mathcal{E}}) = \overline{\mathcal{M}}$ has measure zero in $\Delta_{\mathcal{Y}}$, and so $\overline{\mathcal{M}} \neq \Delta_{\mathcal{Y}}$.

For the general case, with $|\mathcal{X}| \ge 1$, note that $\mathcal{M} \subseteq \Delta_{\mathcal{X},\mathcal{Y}}$ is a universal approximator iff the joint model $\Delta_{\mathcal{X}} \mathcal{M} = \{p(x)q(y|x) \colon p \in \Delta_{\mathcal{X}}, q \in \mathcal{M}\} \subseteq \Delta_{\mathcal{X}\times\mathcal{Y}}$ is a universal approximator. The latter is the marginal of the exponential family $\Delta_{\mathcal{X}} * \mathcal{E} = \{p * q \colon p \in \Delta_{\mathcal{X}}, q \in \mathcal{E}\} \subseteq \Delta_{\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}}$. Hence the claim follows from the first part.

Proof of Proposition 9. If $\mathrm{RBM}^k_{n,m}$ is a universal approximator of conditionals from $\Delta_{k,n}$, then the model consisting of all probability distributions of the form $p(x,y) = \frac{1}{Z}\sum_z \exp(z^\top W y + z^\top V x + b^\top y + c^\top z + f(x))$ is a universal approximator of probability distributions from $\Delta_{k+n}$. The latter is the marginal of an exponential family of dimension $mn + mk + n + m + 2^k - 1$. Thus, by Lemma 34, $m \ge \frac{2^{k+n} - 2^k - n}{n + k + 1}$.

C Details on the Maximal Approximation Errors

Proof of Proposition 10. We have that $D_{\mathrm{RBM}^k_{n,m}} \le \max_{p \in \Delta_{k+n} \colon p_X = u_X} D(p \,\|\, \mathrm{RBM}_{n+k,m})$. The right-hand side is bounded by $n$, since the RBM model contains the uniform distribution.
It is also bounded by the maximal divergence $D_{\mathrm{RBM}_{n+k,m}} \le (n+k) - \lfloor \log_2(m+1) \rfloor - \frac{m+1}{2^{\lfloor \log_2(m+1) \rfloor}}$ (Montúfar et al. 2013).

In order to prove Theorem 11, we will upper bound the approximation errors of CRBMs by the approximation errors of submodels of CRBMs. First, we note the following:

Lemma 35. The maximal divergence of a conditional model that is a Cartesian product of a probability model is bounded from above by the maximal divergence of that probability model: if $\mathcal{M} = \times_{x \in \{0,1\}^k} \mathcal{N} \subseteq \Delta_{k,n}$ for some $\mathcal{N} \subseteq \Delta_n$, then $D_{\mathcal{M}} \le D_{\mathcal{N}}$.

Proof. For any $p \in \Delta_{k,n}$, we have
$$D(p\|\mathcal{M}) = \inf_{q \in \mathcal{M}} \frac{1}{2^k}\sum_x D(p(\cdot|x)\,\|\,q(\cdot|x)) = \frac{1}{2^k}\sum_x \inf_{q(\cdot|x) \in \mathcal{N}} D(p(\cdot|x)\,\|\,q(\cdot|x)) \le \frac{1}{2^k}\sum_x D_{\mathcal{N}} = D_{\mathcal{N}}.$$

Definition 36. Given a partition $\mathcal{Z} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_L\}$ of $\{0,1\}^n$, the partition model $\mathcal{P}_{\mathcal{Z}} \subseteq \Delta_n$ is the set of all probability distributions on $\{0,1\}^n$ with constant value on each partition block.

The set $\{0,1\}^l$, $l \le n$, naturally defines a partition of $\{0,1\}^n$ into the cylinder sets $\{y \in \{0,1\}^n \colon y_{[l]} = z\}$ for all $z \in \{0,1\}^l$. The divergence from $\mathcal{P}_{\mathcal{Z}}$ is bounded from above by $D_{\mathcal{P}_{\mathcal{Z}}} \le n - l$.

Now, the model $\mathrm{RBM}^k_{n,m}$ can approximate certain products of partition models arbitrarily well:

Proposition 37. Let $\mathcal{Z} = \{0,1\}^l$ with $l \le n$. Let $r$ be any integer with $k \ge S(r)$. The model $\mathrm{RBM}^k_{n,m}$ can approximate any conditional distribution from the product of partition models $\mathcal{P}^k_{\mathcal{Z}} := \mathcal{P}_{\mathcal{Z}} \times \cdots \times \mathcal{P}_{\mathcal{Z}}$ arbitrarily well whenever $m \ge 2^{k - S(r)} F(r)(|\mathcal{Z}| - 1) + R(r)$.

Proof. This is analogous to the proof of Proposition 19, with a few differences. Each element $z$ of $\mathcal{Z}$ corresponds to a cylinder set $\{y \in \{0,1\}^n \colon y_{[l]} = z\}$, and the collection of these cylinder sets for all $z \in \mathcal{Z}$ is a partition of $\{0,1\}^n$.
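The divergence bound for cylinder-set partition models, $D_{\mathcal{P}_{\mathcal{Z}}} \le n - l$, can be checked numerically. The sketch below (function and variable names are ours) uses the fact that, within $\mathcal{P}_{\mathcal{Z}}$, the divergence minimizer for a given $p$ spreads each block's mass uniformly over the $2^{n-l}$ points of the block:

```python
# Numerical check of the bound D_{P_Z} <= n - l for cylinder-set partition
# models. The blocks fix the first l of n bits; the KL minimizer within P_Z
# assigns each point the value (block mass) / 2^(n-l).
import itertools
import math
import random

def divergence_from_partition_model(p, n, l):
    """inf_{q in P_Z} D(p || q) in bits, for blocks fixing the first l bits."""
    block_mass = {}
    for y, py in p.items():
        block_mass[y[:l]] = block_mass.get(y[:l], 0.0) + py
    d = 0.0
    for y, py in p.items():
        if py > 0:
            q = block_mass[y[:l]] / 2 ** (n - l)  # uniform value on the block
            d += py * math.log2(py / q)
    return d

n, l = 4, 2
states = list(itertools.product([0, 1], repeat=n))
random.seed(0)
weights = [random.random() for _ in states]
p = {y: w / sum(weights) for y, w in zip(states, weights)}
assert divergence_from_partition_model(p, n, l) <= n - l

# A point measure attains the bound: its divergence equals n - l = 2 bits.
delta = {y: (1.0 if y == states[0] else 0.0) for y in states}
assert divergence_from_partition_model(delta, n, l) == 2.0
```

For a generic $p$ the divergence stays strictly below $n - l$; point measures attain the bound, since a block of $2^{n-l}$ points must share the mass of a single point uniformly.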
Now we can run Algorithm 1 in a slightly different way, with sharing steps defined by $p' = \lambda p + (1-\lambda) u_z$, where $u_z$ is the uniform distribution on the cylinder set corresponding to $z$.

Proof of Theorem 11. This follows directly from Lemma 35 and Proposition 37.

D Details on the Representation of Conditional Distributions from Markov Random Fields

The proof of Theorem 14 is based on ideas from Younes (1996), who discussed the universal approximation property of Boltzmann machines. We will use the following (Younes 1996, Lemma 1):

Lemma 38. Let $\varrho$ be a real number. Consider a fixed integer $N$ and binary variables $x_1, \ldots, x_N$. There are real numbers $w$ and $b$ such that:
• If $\varrho \ge 0$, then $\log(1 + \exp(w(x_1 + \cdots + x_N) + b)) = \varrho \prod_i x_i + Q(x_1, \ldots, x_N)$.
• If $\varrho \le 0$, then $\log(1 + \exp(w(x_1 + \cdots + x_{N-1} - x_N) + b)) = \varrho \prod_i x_i + Q(x_1, \ldots, x_N)$.
In each case $Q$ is a polynomial of degree less than $N$ in $x_1, \ldots, x_N$.

The following is a generalization of (Younes 1996, Lemma 2):

Lemma 39. Let $I$ and $J$ be two simplicial complexes on $[n]$ with $J \subseteq I$. If $p$ is any distribution from $\mathcal{E}_I$ and $m \ge |\{A \in I \setminus J \colon |A| > 1\}|$, then there is a distribution $p' \in \mathcal{E}_J$ such that $p * p'$ is contained in $\mathrm{RBM}_{n,m}$.

Proof. The proof follows closely the arguments presented in (Younes 1996, Lemma 2). Let $K = \{A \in I \setminus J \colon |A| > 1\}$. Consider an RBM with $n$ visible units and $m = |K|$ hidden units. Consider a joint distribution $q(x,u) = \frac{1}{Z}\exp(H(x,u))$ of the fully observable RBM, defined as follows. We label the hidden units by the subsets $A \in K$. For each $A \in K$, let $s(A)$ denote the largest element of $A$, and let
$$H(x,u) = \sum_{A \in K} u_A \left(w_A S_A(x_A) + b_A\right) + \sum_{s \in [n]} b_s x_s,$$
where $S_A(x_A) = \sum_{s \in A, s \ldots}$

$\ldots$ $f(x) = \mathrm{hs}(W^\top \mathrm{hs}(Vx + c) + b)$, for all $x \in \{0,1\}^k$, for some generic choice of $W, V, b, c$.

Proof.
Consider the conditional distribution $p(\cdot|x)$. This is the visible marginal of $p(y,z|x) = \frac{1}{Z}\exp((Vx + c)^\top z + b^\top y + z^\top W y)$. Consider weights $\alpha$ and $\beta$, with $\alpha$ large enough, such that $\mathrm{argmax}_z\, (\alpha Vx + \alpha c)^\top z = \mathrm{argmax}_z\, \left[(\alpha Vx + \alpha c)^\top z + (\beta W^\top z + \beta b)^\top y\right]$ for all $y \in \{0,1\}^n$. Note that for generic choices of $V$ and $c$, the set $\mathrm{argmax}_z\, (\alpha Vx + \alpha c)^\top z$ consists of a single point $z^* = \mathrm{hs}(Vx + c)$. We have $\mathrm{argmax}_{(y,z)} \left[(\alpha Vx + \alpha c)^\top z + (\beta W^\top z + \beta b)^\top y\right] = \left(z^*, \mathrm{argmax}_y\, (\beta W^\top z^* + \beta b)^\top y\right)$. Here, again, for generic choices of $W$ and $b$, the set $\mathrm{argmax}_y\, (\beta W^\top z^* + \beta b)^\top y$ consists of a single point $y^* = \mathrm{hs}(W^\top z^* + b)$. The joint distribution $p(y,z|x)$ with parameters $t\beta W$, $t\alpha V$, $t\beta b$, $t\alpha c$ tends to the point measure $\delta_{(y^*,z^*)}(y,z)$ as $t \to \infty$. In this case $p(y|x)$ tends to $\delta_{y^*}(y)$ as $t \to \infty$, where $y^* = \mathrm{hs}(W^\top z^* + b) = \mathrm{hs}(W^\top \mathrm{hs}(Vx + c) + b)$, for all $x \in \{0,1\}^k$.

Proof of Theorem 20. The second statement is precisely Lemma 40. For the more general statement, the arguments are as follows. Note that the conditional distribution $p(y|z)$ of the output units given the hidden units is the same for a CRBM and for its feedforward network version. Furthermore, for each input $x$, the CRBM output distribution is $p(y|x) = \sum_z \left(q(z|x) * p(z)\right) p(y|z)$, where
$$q(z|x) = \frac{\exp(z^\top Vx + c^\top z)}{\sum_{z'}\exp(z'^\top Vx + c^\top z')}$$
is the conditional distribution represented by the first layer,
$$p(y,z) = \frac{\exp(z^\top W y + b^\top y)}{\sum_{y',z'}\exp(z'^\top W y' + b^\top y')}$$
is the distribution represented by the RBM with parameters $W, b, 0$, and
$$q(z|x) * p(z) = \frac{q(z|x)\,p(z)}{\sum_{z'} q(z'|x)\,p(z')}, \quad \text{for all } z,$$
is the renormalized entry-wise product of the conditioned distribution $q(\cdot|x)$ and the RBM hidden marginal distribution $p(z) = \sum_y p(y,z)$.
Now, if $q$ is deterministic, then $q(z|x) * p(z)$ is the same as $q(z|x)$, regardless of $p(z)$ (which is strictly positive).

The proof of Theorem 21 builds on the following lemma, which describes a combinatorial property of the deterministic policies that can be approximated arbitrarily well by CRBMs. Recall that the Heaviside step function $\mathrm{hs}$ maps a real number $a$ to $0$ if $a < 0$, to $1/2$ if $a = 0$, and to $1$ if $a > 0$.

Lemma 41. Consider a function $f \colon \{0,1\}^k \to \{0,1\}^n$. The model $\mathrm{RBM}^k_{n,m}$ can approximate the deterministic policy $p(y|x) = \delta_{f(x)}(y)$ arbitrarily well only if there is a choice of the model parameters $W, V, b, c$ for which
$$f(x) = \mathrm{hs}\left(W^\top \mathrm{hs}\left([W, V]\begin{bmatrix} f(x) \\ x \end{bmatrix} + c\right) + b\right), \quad \text{for all } x \in \{0,1\}^k,$$
where the Heaviside function $\mathrm{hs}$ is applied entry-wise to its argument.

Proof. Consider a choice of $W, V, b, c$. For each input state $x$, the conditional represented by $\mathrm{RBM}^k_{n,m}$ is equal to the mixture distribution $p(y|x) = \sum_z p(z|x)\, p(y|x,z)$, with mixture components $p(y|x,z) = p(y|z) \propto \exp((z^\top W + b^\top)y)$ and mixture weights $p(z|x) \propto \sum_{y'} \exp((z^\top W + b^\top)y' + z^\top(Vx + c))$ for all $z \in \{0,1\}^m$. The support of a mixture distribution is equal to the union of the supports of the mixture components with non-zero mixture weights. In the present case, if $\sum_y |p(y|x) - \delta_{f(x)}(y)| \le \alpha$, then $\sum_y |p(y|x,z) - \delta_{f(x)}(y)| \le \alpha/\epsilon$ for all $z$ with $p(z|x) > \epsilon$, for any $\epsilon > 0$. Choosing $\alpha$ small enough, $\alpha/\epsilon$ can be made arbitrarily small for any fixed $\epsilon > 0$. In this case, for every $z$ with $p(z|x) > \epsilon$, necessarily
$$(z^\top W + b^\top)\, f(x) > (z^\top W + b^\top)\, y, \quad \text{for all } y \neq f(x), \qquad (5)$$
and hence $\mathrm{sgn}(z^\top W + b^\top) = \mathrm{sgn}(f(x) - \tfrac{1}{2})$.
Furthermore, the probability assigned by $p(z|x)$ to all $z$ that do not satisfy Equation (5) has to be very close to zero (upper bounded by a function that decreases with $\alpha$). The probability of $z$ given $x$ is given by
$$p(z|x) = \frac{1}{Z_{z|x}} \exp(z^\top(Vx + c)) \sum_{y'} \exp((z^\top W + b^\top)y').$$
In view of Equation (5), for all $z$ with $p(z|x) > \epsilon$, if $\alpha$ is small enough, $p(z|x)$ is arbitrarily close to $\frac{1}{Z_{z|x}} \exp(z^\top(Vx + c)) \exp((z^\top W + b^\top)f(x))$. This holds, in particular, for every $z$ that maximizes $p(z|x)$. Therefore, $\mathrm{argmax}_z\, p(z|x) = \mathrm{argmax}_z\, z^\top(W f(x) + Vx + c)$. Each of these $z$ must satisfy Equation (5). This completes the proof.

Proof of Theorem 21. Sufficient condition: The bound $2^k - 1$ follows directly from Proposition 18. For the second bound, note that any function $f \colon \{0,1\}^k \to \{0,1\}^n;\ x \mapsto y$ can be computed by a parallel composition of the functions $f_i \colon x \mapsto y_i$, for all $i \in [n]$. Hence the bound follows from Lemma 40 and the fact that a feedforward linear threshold network with $\frac{3}{k+2}2^k$ hidden units can compute any Boolean function.

Necessary condition: Recall that a linear threshold function with $N$ input bits and $M$ output bits is a function of the form $\{0,1\}^N \to \{0,1\}^M;\ y \mapsto \mathrm{hs}(Wy + b)$ with $W \in \mathbb{R}^{M \times N}$ and $b \in \mathbb{R}^M$. Lemma 41 shows that each deterministic policy that can be approximated arbitrarily well by $\mathrm{RBM}^k_{n,m}$ corresponds to the $y$-coordinate fixed points of a map defined as the composition of two linear threshold functions $\{0,1\}^{k+n} \to \{0,1\}^m;\ (x,y) \mapsto \mathrm{hs}([W,V]\begin{bmatrix} y \\ x \end{bmatrix} + c)$ and $\{0,1\}^m \to \{0,1\}^n;\ z \mapsto \mathrm{hs}(W^\top z + b)$.
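The fixed-point condition of Lemma 41 can be made concrete with a minimal hand-built example. In the sketch below (the specific parameter values are ours, chosen so that no threshold argument is exactly zero), the identity policy $f(x) = x$ on one input and one output bit satisfies the condition with a single hidden unit:

```python
# Sketch of the two-stage linear threshold map from Lemma 41 for k = n = m = 1.
# The parameter values W, V, b, c below are hypothetical, chosen so that the
# identity policy f(x) = x satisfies f(x) = hs(W * hs(W*f(x) + V*x + c) + b)
# and no threshold argument hits the tie value 0.

def hs(a: float) -> float:
    """Heaviside step: 0 for a < 0, 1/2 for a = 0, 1 for a > 0."""
    return 0.0 if a < 0 else (0.5 if a == 0 else 1.0)

W, V, b, c = 2.0, 2.0, -1.0, -1.0  # single hidden, input, and output unit

def policy_fixed_point(x: float, y: float) -> float:
    """y-coordinate of the composed map (x, y) -> hs(W * hs(W*y + V*x + c) + b)."""
    z = hs(W * y + V * x + c)  # first threshold layer, reads (x, y)
    return hs(W * z + b)       # second threshold layer, reads z

for x in (0.0, 1.0):
    assert policy_fixed_point(x, x) == x  # identity policy is a fixed point
```

The threshold arguments here take the values $-1$, $1$, and $3$, so the $\mathrm{hs}$ tie value $1/2$ never arises, matching the "generic choice of parameters" assumption in Lemma 40.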
In particular, we can upper bound the number of deterministic policies that can be approximated arbitrarily well by $\mathrm{RBM}^k_{n,m}$ by the total number of compositions of two linear threshold functions, one with $n+k$ inputs and $m$ outputs and the other with $m$ inputs and $n$ outputs. Let $\mathrm{LTF}(N,M)$ denote the number of linear threshold functions with $N$ inputs and $M$ outputs. It is known (Ojha 2000; Wenzel et al. 2000) that $\mathrm{LTF}(N,M) \le 2^{N^2 M}$. The number of deterministic policies that can be approximated arbitrarily well by $\mathrm{RBM}^k_{n,m}$ is thus bounded above by $\mathrm{LTF}(n+k, m) \cdot \mathrm{LTF}(m,n) \le 2^{m(n+k)^2 + nm^2}$. The actual number may be much smaller, in view of the fixed-point and shared-parameter constraints. On the other hand, the number of deterministic policies in $\Delta_{k,n}$ is as large as $(2^n)^{2^k} = 2^{n2^k}$. The claim follows from comparing these two numbers.

Acknowledgments

We acknowledge support from the DFG Priority Program Autonomous Learning (DFG-SPP 1527). G. M. and K. G.-Z. would like to thank the Santa Fe Institute for hosting them during the initial work on this article.

References

N. Ay, G. Montúfar, and J. Rauh. Selection criteria for neuromanifolds of stochastic dynamics. In Y. Yamaguchi, editor, Advances in Cognitive Neurodynamics (III), pages 147–154. Springer, 2013. URL http://dx.doi.org/10.1007/978-94-007-4792-0_20.

R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, Jan. 2009. URL http://dx.doi.org/10.1561/2200000006.

M. A. Cueto, J. Morton, and B. Sturmfels. Geometry of the restricted Boltzmann machine. In M. Viana and H. Wynn, editors, Algebraic Methods in Statistics and Probability II, AMS Special Session, volume 2. AMS, 2010.

A. Fischer and C. Igel. An introduction to restricted Boltzmann machines. In L. Alvarez, M. Mejail, L. Gomez, and J.
Jacobo, editors, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, volume 7441 of Lecture Notes in Computer Science, pages 14–36. Springer Berlin Heidelberg, 2012. URL http://dx.doi.org/10.1007/978-3-642-33275-3_2.

Y. Freund and D. Haussler. Unsupervised learning of distributions of binary vectors using two layer networks. Technical report, Computer Research Laboratory, University of California, Santa Cruz, 1994.

E. N. Gilbert. A comparison of signalling alphabets. Bell System Technical Journal, 31:504–522, 1952.

G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002. URL http://dx.doi.org/10.1162/089976602760128018.

G. E. Hinton. A practical guide to training restricted Boltzmann machines. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 599–619. Springer Berlin Heidelberg, 2012. URL http://dx.doi.org/10.1007/978-3-642-35289-8_32.

G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 536–543. ACM, 2008.

N. Le Roux and Y. Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649, 2008.

J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel. On the expressive power of restricted Boltzmann machines. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.
Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2877–2885. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5020-on-the-expressive-power-of-restricted-boltzmann-machines.pdf.

V. Mnih, H. Larochelle, and G. E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. CoRR, abs/1202.3748, 2012.

G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011.

G. Montúfar and J. Morton. When does a mixture of products contain a product of mixtures? SIAM Journal on Discrete Mathematics, 29:321–347, 2015. URL http://dx.doi.org/10.1137/140957081.

G. Montúfar and J. Rauh. Scaling of model approximation errors and expected entropy distances. Kybernetika, 50(2):234–245, 2014.

G. Montúfar, J. Rauh, and N. Ay. Expressive power and approximation errors of restricted Boltzmann machines. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 415–423. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4380-expressive-power-and-approximation-errors-of-restricted-boltzmann-machines.pdf.

G. Montúfar, J. Rauh, and N. Ay. Maximal information divergence from statistical models defined by neural networks. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information, LNCS 8085, pages 759–766. Springer, 2013. URL http://dx.doi.org/10.1007/978-3-642-40020-9_85.

G. Montúfar, K. Ghazi-Zahedi, and N. Ay. A theory of cheap control in embodied systems. arXiv preprint arXiv:1407.6836, 2014.

P. C. Ojha. Enumeration of linear threshold functions from the lattice of hyperplane intersections.
IEEE Transactions on Neural Networks, 11(4):839–850, July 2000. ISSN 1045-9227. doi: 10.1109/72.857765.

S. M. Ross. Introduction to Stochastic Dynamic Programming: Probability and Mathematical Statistics. Academic Press, Inc., Orlando, FL, USA, 1983.

R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 791–798, New York, NY, USA, 2007. ACM.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063–1088, 2004.

P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 194–281. MIT Press, Cambridge, MA, USA, 1986. URL http://dl.acm.org/citation.cfm?id=104279.104290.

I. Sutskever and G. E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. In M. Meila and X. Shen, editors, AISTATS, volume 2 of JMLR Proceedings, pages 548–555. JMLR.org, 2007.

G. W. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1345–1352. MIT Press, 2007. URL http://papers.nips.cc/paper/3078-modeling-human-motion-using-binary-latent-variables.pdf.

L. van der Maaten. Discriminative restricted Boltzmann machines are universal approximators for discrete data. Technical Report EWI-PRB TR 2011001, Delft University of Technology, 2011.

R. R. Varshamov. Estimate of the number of signals in error correcting codes. Doklady Akad. Nauk SSSR, 117:739–741, 1957.
W. Wenzel, N. Ay, and F. Pasemann. Hyperplane arrangements separating arbitrary vertex classes in n-cubes. Advances in Applied Mathematics, 25(3):284–306, 2000. URL http://dx.doi.org/10.1006/aama.2000.0701.

L. Younes. Synchronous Boltzmann machines can be universal approximators. Applied Mathematics Letters, 9(3):109–113, 1996. URL http://www.sciencedirect.com/science/article/pii/0893965996000419.

M. Zeiler, G. Taylor, N. Troje, and G. E. Hinton. Modeling pigeon behaviour using a conditional restricted Boltzmann machine. In 17th European Symposium on Artificial Neural Networks (ESANN), 2009.