Building fast Bayesian computing machines out of intentionally stochastic, digital parts

Vikash Mansinghka [1,2,3] and Eric Jonas [1,3]
[1] The authors contributed equally to this work.
[2] Computer Science & Artificial Intelligence Laboratory, MIT
[3] Department of Brain & Cognitive Sciences, MIT

The brain interprets ambiguous sensory information faster and more reliably than modern computers, using neurons that are slower and less reliable than logic gates. But Bayesian inference, which underpins many computational models of perception and cognition, appears computationally challenging even given modern transistor speeds and energy budgets. The computational principles and structures needed to narrow this gap are unknown. Here we show how to build fast Bayesian computing machines using intentionally stochastic, digital parts, narrowing this efficiency gap by multiple orders of magnitude. We find that by connecting stochastic digital components according to simple mathematical rules, one can build massively parallel, low-precision circuits that solve Bayesian inference problems and are compatible with the Poisson firing statistics of cortical neurons. We evaluate circuits for depth and motion perception, perceptual learning and causal reasoning, each performing inference over 10,000+ latent variables in real time, a 1,000x speed advantage over commodity microprocessors. These results suggest a new role for randomness in the engineering and reverse-engineering of intelligent computation.

Our ability to see, think and act depends on our mind's ability to process uncertain information and identify probable explanations for inherently ambiguous data. Many computational models of the perception of motion [1], motor learning [2], higher-level cognition [3,4] and cognitive development [5] are based on Bayesian inference in rich, flexible probabilistic models of the world.
Machine intelligence systems, including Watson [6], autonomous vehicles [7] and other robots [8], and the Kinect [9] system for gestural control of video games, also all depend on probabilistic inference to resolve ambiguities in their sensory input. But brains solve these problems with greater speed than modern computers, using information processing units that are orders of magnitude slower and less reliable than the switching elements in the earliest electronic computers. The original UNIVAC I ran at 2.25 MHz [10], and RAM from twenty years ago had one bit error per 256 MB per month [11]. In contrast, the fastest neurons in human brains operate at less than 1 kHz, and synaptic transmission can completely fail up to 50% of the time [12].

This efficiency gap presents a fundamental challenge for computer science. How is it possible to solve problems of probabilistic inference with an efficiency that begins to approach that of the brain? Here we introduce intentionally stochastic but still digital circuit elements, along with composition laws and design rules, that together narrow the efficiency gap by multiple orders of magnitude.

Our approach both builds on and departs from the principles behind digital logic. Like traditional digital gates, stochastic digital gates consume and produce discrete symbols, which can be represented via binary numbers. Also like digital logic gates, our circuit elements can be composed and abstracted via simple mathematical rules, yielding larger computational units whose behavior can be analyzed in terms of their constituents. We describe primitives and design rules for both stateless and synchronously clocked circuits. But unlike digital gates and circuits, our gates and circuits are intentionally stochastic: each output is a sample from a probability distribution conditioned on the inputs, and (except in degenerate cases) simulating a circuit twice will produce different results.
The numerical probability distributions themselves are implicit, though they can be estimated via the circuits' long-run time-averaged behavior. And also unlike digital gates and circuits, Bayesian reasoning arises naturally via the dynamics of our synchronously clocked circuits, simply by fixing the values of the circuit elements representing the data.

We have built prototype circuits that solve problems of depth and motion perception and perceptual learning, plus a compiler that can automatically generate circuits for solving causal reasoning problems given a description of the underlying causal model. Each of these systems illustrates the use of stochastic digital circuits to accelerate Bayesian inference in an important class of probabilistic models, including Markov Random Fields, nonparametric Bayesian mixture models, and Bayesian networks. Our prototypes show that this combination of simple choices at the hardware level, a discrete, digital representation for information coupled with intentionally stochastic rather than ideally deterministic elements, has far-reaching architectural consequences. For example, software implementations of approximate Bayesian reasoning typically rely on high-precision arithmetic and serial computation. We show that our synchronous stochastic circuits can be implemented at very low bit precision, incurring only a negligible decrease in accuracy. This low precision enables us to make fast, small, power-efficient circuits the core of our designs. We also show that these reductions in computing unit size are sufficient to let us exploit the massive parallelism that has always been inherent in complex probabilistic models, at a granularity that has previously been impossible to exploit. The resulting high computation density drives the performance gains we see from stochastic digital circuits, narrowing the efficiency gap with neural computation by multiple orders of magnitude.
Our approach is fundamentally different from existing approaches for reliable computation with unreliable components [13-15], which view randomness as either a source of error whose impact needs to be mitigated or as a mechanism for approximating arithmetic calculations. Our combinational circuits are intentionally stochastic, and we depend on them to produce exact samples from the probability distributions they represent. Our approach is also different from, and complementary to, classic analog [16] and modern mixed-signal [17] neuromorphic computing approaches: stochastic digital primitives and architectures could potentially be implemented using neuromorphic techniques, providing a means of applying these designs to problems of Bayesian inference.

In theory, stochastic digital circuits could be used to solve any computable Bayesian inference problem with a computable likelihood [18] by implementing a Markov chain for inference in a Turing-complete probabilistic programming language [19,20]. Stochastic circuits can thus implement inference and learning techniques for diverse intelligent computing architectures, including both probabilistic models defined over structured, symbolic representations [5] as well as sparse, distributed, connectionist representations [21]. In contrast, hardware accelerators for belief propagation algorithms [22-24] can only answer queries about marginal probabilities or most probable configurations, only apply to finite graphical models with discrete or binary nodes, and cannot be used to learn model parameters from data. For example, the formulation of perceptual learning we present here is based on inference in a nonparametric Bayesian model to which belief propagation does not apply.
Additionally, because stochastic digital circuits produce samples rather than probabilities, their results capture the complex dependencies between variables in multi-modal probability distributions, and can also be used to solve otherwise intractable problems in decision theory by estimating expected utilities.

Stochastic Digital Gates and Stateless Stochastic Circuits

(Figure 1 about here)

Digital logic circuits are based on a gate abstraction defined by Boolean functions: deterministic mappings from input bit values to output bit values [25]. For elementary gates, such as the AND gate, these are given by truth tables; see Figure 1A. Their power and flexibility come in part from the composition laws that they support, shown in Figure 1B. The output from one gate can be connected to the input of another, yielding a circuit that computes the composition of the Boolean functions represented by each gate. The compound circuit can also be treated as a new primitive, abstracting away its internal structure. These simple laws have proved surprisingly powerful: they enable complex circuits to be built up out of reusable pieces.

Stochastic digital gates (see Figure 1C) are similar to Boolean gates, but consume a source of random bits to generate samples from conditional probability distributions. Stochastic gates are specified by conditional probability tables; these give the probability that a given output will result from a given input. Digital logic corresponds to the degenerate case where all the probabilities are 0 or 1; see Figure 1D for the conditional probability table for an AND gate. Many stochastic gates with m input bits and n output bits are possible. Figure 1E shows one central example, the THETA gate, which generates draws from a biased coin whose bias is specified on the input. Supplementary material outlining serial and parallel implementations is available in [26].
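As a concrete illustration, the comparator implementation of the THETA gate (Figure 1G) can be sketched in a few lines of Python. The bit width m = 4 and the use of Python's random module as the entropy source are illustrative assumptions, not the paper's hardware design.

```python
import random

def theta_gate(weight: int, m: int = 4) -> int:
    """THETA gate sketch: the m-bit input encodes P(OUT = 1) = weight / 2**m.
    Draw m fresh (pseudo)random bits as an integer R and output 1 iff
    R < IN, as in the comparator implementation of Figure 1G."""
    r = random.randrange(1 << m)  # m fresh random bits
    return 1 if r < weight else 0
```

Time-averaging the output recovers the implicit probability table entry: with weight 12 and m = 4, the long-run frequency of ones approaches 12/16 = 0.75.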
Crucially, stochastic gates support generalizations of the composition laws from digital logic, shown in Figure 1F. The output of one stochastic gate can be fed as the input to another, yielding samples from the joint probability distribution over the random variables simulated by each gate. The compound circuit can also be treated as a new primitive that generates samples from the marginal distribution of the final output given the first input. As with digital gates, an enormous variety of circuits can be constructed using these simple rules.

Fast Bayesian Inference via Massively Parallel Stochastic Transition Circuits

Most digital systems are based on deterministic finite state machines; the template for these machines is shown in Figure 2A. A stateless digital circuit encodes the transition function that calculates the next state from the previous state, and the clocking machinery (not shown) iterates the transition function repeatedly. This abstraction has proved enormously fruitful; the first microprocessors had roughly 2^20 distinct states. In Figure 2B, we show the stochastic analogue of this synchronous state machine: a stochastic transition circuit. Instead of the combinational logic circuit implementing a deterministic transition function, it contains a combinational stochastic circuit implementing a stochastic transition operator that samples the next state from a probability distribution that depends on the current state. It thus corresponds to a Markov chain in hardware. To be a valid transition circuit, this transition operator must have a unique stationary distribution P(S|X) to which it ergodically converges. A number of recipes for suitable transition operators can be constructed, such as Metropolis sampling [27] and Gibbs sampling [28]; most of the results we present rely on variations on Gibbs sampling.
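A minimal software model of a stochastic transition circuit is a one-bit state register updated by a stochastic transition operator on each clock tick. The kernel below is an illustrative assumption, chosen so its unique stationary distribution is easy to verify by hand (P(S=0) = 2/3, P(S=1) = 1/3); it is not a circuit from the paper.

```python
import random

# Transition kernel for a one-bit state register, written as
# P(next = 1 | current): an illustrative two-state Markov chain whose
# stationary distribution is P(S=0) = 2/3, P(S=1) = 1/3.
KERNEL = {0: 0.1,   # P(next = 1 | current = 0)
          1: 0.8}   # P(next = 1 | current = 1)

def transition(state: int) -> int:
    """One clock tick of the transition circuit: sample the next state
    from a distribution that depends only on the current state."""
    return 1 if random.random() < KERNEL[state] else 0
```

Iterating the circuit and time-averaging the register estimates the stationary distribution, just as the paper's synchronously clocked circuits do.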
More details on efficient implementations of stochastic transition circuits for Gibbs sampling and Metropolis-Hastings can be found elsewhere [26]. Note that if the input X represents observed data and the state S represents a hypothesis, then the transition circuit implements Bayesian inference.

We can scale up to challenging problems by exploiting the composition laws that stochastic transition circuits support. Consider a probability distribution defined over three variables, P(A,B,C) = P(A)P(B|A)P(C|A). We can construct a transition circuit that samples from the overall state (A,B,C) by composing transition circuits for updating A|B,C, B|A and C|A; this assembly is shown in Figure 2C. As long as the underlying probability model does not have any zero-probability states, ergodic convergence of each constituent transition circuit then implies ergodic convergence of the whole assembly [29]. The only requirement for scheduling transitions is that each circuit must be left fixed while circuits for variables that interact with it are transitioning. This scheduling requirement, that a transition circuit's value be held fixed while others that read from its internal state or serve as inputs to its next transition are updating, is analogous to the so-called "dynamic discipline" that defines valid clock schedules for traditional sequential logic [30]. Deterministic and stochastic schedules, implementing cycle or mixture hybrid kernels [29], are both possible. This simple rule also implies a tremendous amount of exploitable parallelism in stochastic transition circuits: if two variables are conditionally independent given the current settings of all the others, they can be updated at the same time.

Assemblies of stochastic transition circuits implement Bayesian reasoning in a straightforward way: by fixing, or "clamping", some of the variables in the assembly.
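A software sketch of this three-variable assembly follows; the conditional probability tables are illustrative assumptions chosen so the answers are easy to check, not values from the paper. Each transition circuit resamples its variable from the conditional distribution given its neighbours, and clamping a variable simply means skipping its update.

```python
import random

# Illustrative model P(A,B,C) = P(A) P(B|A) P(C|A) over binary variables.
P_A = {1: 0.3, 0: 0.7}
P_B = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.9, 0: 0.1}}   # P(B=b | A=a) as P_B[a][b]
P_C = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.6, 0: 0.4}}   # P(C=c | A=a) as P_C[a][c]

def joint(a, b, c):
    return P_A[a] * P_B[a][b] * P_C[a][c]

def flip(p_one):
    return 1 if random.random() < p_one else 0

def gibbs_sweep(state, clamp=None):
    """One scan of the composed transition circuits T_{A|B,C}, T_{B|A},
    T_{C|A}. Variables named in `clamp` are held fixed, so the chain
    converges to the conditional distribution given the clamped values."""
    clamp = clamp or set()
    a, b, c = state
    if 'A' not in clamp:
        w1, w0 = joint(1, b, c), joint(0, b, c)
        a = flip(w1 / (w1 + w0))
    if 'B' not in clamp:
        b = flip(P_B[a][1])
    if 'C' not in clamp:
        c = flip(P_C[a][1])
    return (a, b, c)
```

With nothing clamped, the chain explores the full joint, so the long-run frequency of A = 1 approaches P(A=1) = 0.3; clamping C = 1 shifts it to P(A=1 | C=1) = 0.18/0.53 ≈ 0.34.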
If no variables are fixed, the circuit explores the full joint distribution, as shown in Figures 2E and 2F. If a variable is fixed, the circuit explores the conditional distribution on the remaining variables, as shown in Figures 2G and 2H. Simply by changing which transition circuits are updated, the circuit can be used to answer different probabilistic queries; these can be varied online based on the needs of the application.

(Figure 2 about here.)

The accuracy of ultra-low-precision stochastic transition circuits

The central operation in many Markov chain techniques for inference is called DISCRETE-SAMPLE, which generates draws from a discrete-output probability distribution whose weights are specified on its input. For example, in Gibbs sampling, this distribution is the conditional probability of one variable given the current values of all other variables that directly depend on it. One implementation of this operation is shown in Figure 3A; each stochastic transition circuit from Figure 2 could be implemented by one such circuit, with multiplexers to select log-probability values based on the neighbors of each random variable. Because only the ratios of the raw probabilities matter, and the probabilities themselves naturally vary on a log scale, extremely low precision representations can still provide accurate results. High entropy (i.e. nearly uniform) distributions are resilient to truncation because their values are nearly equal to begin with, differing only slightly in terms of their low-order bits. Low entropy (i.e. nearly deterministic) distributions are resilient because truncation is unlikely to change which outcomes have nonzero probability.
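This truncation argument can be checked numerically. The sketch below rounds log-probabilities down onto a fixed-point grid, a stand-in for the paper's custom energy encoding (the grid spacing and the test distribution are assumptions), and measures the relative entropy between the exact distribution and the one a low-precision DISCRETE-SAMPLE gate would actually draw from.

```python
import math
import random

def truncate_log_probs(log_ps, frac_bits):
    """Round each log-probability down to a grid with spacing
    2**-frac_bits, then renormalize: the output distribution of a
    low-precision discrete-sample gate."""
    step = 2.0 ** -frac_bits
    q = [step * math.floor(lp / step) for lp in log_ps]
    m = max(q)
    ws = [math.exp(v - m) for v in q]
    z = sum(ws)
    return [w / z for w in ws]

def relative_entropy_bits(p, q):
    """KL(p || q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For a random 1000-outcome distribution, the divergence introduced by truncation stays small even with only a couple of fractional bits, consistent with the resilience argument above.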
Figure 3B quantifies this low-precision property, showing the relative entropy (a canonical information-theoretic measure of the difference between two distributions) between the output distributions of low-precision implementations of the circuit from Figure 3A and an accurate floating-point implementation. Discrete distributions on 1000 outcomes were used, spanning the full range of possible entropies, from almost 10 bits (for a uniform distribution on 1000 outcomes) to 0 bits (for a deterministic distribution), with error nearly undetectable until fewer than 8 bits are used. Figure 3C shows example distributions on 10 outcomes, and Figure 3D shows the resulting impact on computing element size. Extensive quantitative assessments of the impact of low bit precision have also been performed, providing additional evidence that only very low precision is required [26].

(Figure 3 about here.)

Efficiency gains on depth and motion perception and perceptual learning problems

Our main results are based on an implementation where each stochastic gate is simulated using digital logic, consuming entropy from an internal pseudorandom number generator [31]. This allows us to measure the performance and fault-tolerance improvements that flow from stochastic architectures, independent of physical implementation. We find that stochastic circuits make it practical to perform stochastic inference over several probabilistic models with 10,000+ latent variables in real time and at low power on a single chip. These designs achieve a 1,000x speed advantage over commodity microprocessors, despite using gates that are 10x slower. In [26], we also show architectures that exhibit minimal degradation of accuracy in the presence of fault rates as high as one bit error for every 100 state transitions, in contrast to conventional architectures where failure rates are measured in bit errors (failures) per billion hours of operation [32].
Our first application is to depth and motion perception, via Bayesian inference in lattice Markov Random Field models [28]. The core problem is matching pixels from two images of the same scene, taken at distinct but nearby points in space or in time. The matching is ambiguous on the basis of the images alone, as multiple pixels might share the same value [33]; prior knowledge about the structure of the scene must be applied, which is often cast in terms of Bayesian inference [34]. Figure 4A illustrates the template probabilistic model most commonly used. The X variables contain the unknown displacement vectors. Each Y variable contains a vector of pixel similarity measurements, one per possible pair of matched pixels based on X. The pairwise potentials between the X variables encode scene structure assumptions; in typical problems, unknown values are assumed to vary smoothly across the scene, with a small number of discontinuities at the boundaries of objects. Figure 4B shows the conditional independence structure in this problem: X variables in alternating lattice positions are conditionally independent of one another, allowing the entire Markov chain over the X variables to be updated in a two-phase clock, independent of lattice size. Figure 4C shows the dataflow for the software-reprogrammable probabilistic video processor we developed to solve this family of problems; this processor takes a problem specification based on pairwise potentials and Y values, and produces a stream of posterior samples. Comparing the hardware to hand-optimized C versions on a commodity workstation, we see a 500x performance improvement.

(Figure 4 about here.)

We have also built stochastic architectures for solving perceptual learning problems, based on fully Bayesian inference in Dirichlet process mixture models [35,36].
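Returning to the lattice model: the two-phase checkerboard schedule can be sketched in software. The Ising-style smoothness potential and the coupling strength below are illustrative stand-ins for the pairwise potentials in the depth and motion model, not the processor's actual specification. Sites of one checkerboard colour are conditionally independent given the other colour, so each phase could update all of its sites simultaneously in hardware.

```python
import math
import random

def checkerboard_sweep(grid, coupling=1.0):
    """One two-phase Gibbs sweep over a grid of +/-1 latent variables.
    Phase 0 updates sites with even i+j, phase 1 the odd sites; within a
    phase no update reads another update from the same phase, so the
    whole phase could run in parallel."""
    h, w = len(grid), len(grid[0])
    for phase in (0, 1):                       # the two-phase clock
        for i in range(h):
            for j in range(w):
                if (i + j) % 2 != phase:
                    continue
                s = sum(grid[ni][nj]
                        for ni, nj in ((i - 1, j), (i + 1, j),
                                       (i, j - 1), (i, j + 1))
                        if 0 <= ni < h and 0 <= nj < w)
                # P(x = +1 | neighbours) for an Ising-style smoothness prior
                p_up = 1.0 / (1.0 + math.exp(-2.0 * coupling * s))
                grid[i][j] = 1 if random.random() < p_up else -1
    return grid
```

With a strong coupling, repeated sweeps drive neighbouring sites toward agreement, the discrete analogue of the smoothly varying depth and motion fields described above.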
Dirichlet process mixtures allow the number of clusters in a perceptual dataset to be automatically discovered during inference, without assuming an a priori limit on the model's complexity, and form the basis of many models of human categorization [37,38]. We tested our prototype on the problem of discovering and classifying handwritten digits from binary input images. Our circuit for solving this problem operates on an online data stream, and efficiently tracks the number of perceptual clusters in this input; see [26] for architectural and implementation details and additional characterizations of performance. As with our depth and motion perception architecture, we observe speedups of over ~2,000x as compared to a highly optimized software implementation. Of this ~2,000x difference in speed, roughly 256x is directly due to parallelism: all of the pixels are independent dimensions, and can therefore be updated simultaneously.

(Figure 5 about here.)

Automatically generated causal reasoning circuits and spiking implementations

Digital logic gates and their associated design rules are so simple that circuits for many problems can be generated automatically. Digital logic also provides a common target for device engineers, and has been implemented using many different physical mechanisms: classically with vacuum tubes, then with MOSFETs in silicon, and even on spintronic devices [39]. Here we provide two illustrations of the analogous simplicity and generality of stochastic digital circuits, both relevant for the reverse-engineering of intelligent computation in the brain.

We have built a compiler that can automatically generate circuits for solving arbitrary causal reasoning problems in Bayesian network models. Bayesian network formulations of causal reasoning have played central roles in machine intelligence [22] and computational models of cognition in both humans and rodents [4].
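The "no a priori limit on complexity" property of the perceptual learning architecture above comes from the Dirichlet process's prior over partitions, the Chinese restaurant process. A minimal sketch of that prior (the concentration parameter alpha = 1.0 is an arbitrary choice for illustration):

```python
import random

def crp_partition(n_items, alpha=1.0):
    """Sample a partition from the Chinese restaurant process: each item
    joins an existing cluster with probability proportional to that
    cluster's size, or opens a new cluster with probability proportional
    to alpha. Returns per-item labels and per-cluster counts; the number
    of clusters is discovered, not fixed in advance."""
    counts, labels = [], []
    for _ in range(n_items):
        r = random.uniform(0.0, sum(counts) + alpha)
        for k, c in enumerate(counts):
            if r < c:
                counts[k] += 1
                labels.append(k)
                break
            r -= c
        else:                       # open a new cluster
            labels.append(len(counts))
            counts.append(1)
    return labels, counts
```

In the full mixture model, these prior weights are multiplied by data likelihoods during each Gibbs update, so clusters appear and disappear as the evidence demands.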
Figure 6A shows a Bayesian network for diagnosing the behavior of an intensive care unit monitoring system. Bayesian inference within this network can be used to infer probable states of the ICU given ambiguous patterns of evidence, that is, to reason from observed effects back to their probable causes. Figure 6B shows a factor graph representation of this model [40]; this more general data structure is used as the input to our compiler. Figure 6C shows inference results from three representative queries, each corresponding to a different pattern of observed data.

We have also explored implementations of stochastic transition circuits in terms of spiking elements governed by Poisson firing statistics. Figure 6D shows a spiking network that implements a Markov chain like the one in Figure 2. The stochastic transition circuit corresponding to a latent variable X is implemented via a bank of Poisson-spiking elements {X_i}, with one unit X_i per possible value of the variable. The rate for each spiking element X_i is determined by the unnormalized conditional log probability of the variable setting it corresponds to, following the discrete-sample gate from Figure 3A: the time to first spike is t(X_i) ~ Exp(e^{e_i}), with e_i obtained by summing energy contributions from all connected variables. The output value of X is determined by argmin_i {t(X_i)}, i.e. the element that spiked first, implemented by fast lateral inhibition between the X_i's. It is easy to show that this implements exponentiation and normalization of the energies, leading to a correct implementation of a stochastic transition circuit for Gibbs sampling; see [26] for more information. Elements are clocked quasi-synchronously, reflecting the conditional independence structure and parallel update scheme from Figure 2D, yielding samples from the correct equilibrium distribution. This spiking implementation helps to narrow the gap with recent theories in computational neuroscience.
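The first-spike race can be checked in a few lines. Here the race is simulated by drawing spike times directly; with rates e^{e_i}, the earliest spike lands on unit i with probability e^{e_i} / Σ_k e^{e_k}, which is exactly the exponentiate-and-normalize step of the discrete-sample gate. The specific energies in the test are illustrative.

```python
import math
import random

def first_spike_sample(energies):
    """One discrete sample via a race of Poisson-spiking units: unit i's
    first spike time is Exponential with rate exp(e_i); lateral
    inhibition (modeled here as argmin over spike times) keeps only the
    winner, so unit i wins with probability exp(e_i) / sum_k exp(e_k)."""
    times = [random.expovariate(math.exp(e)) for e in energies]
    return min(range(len(energies)), key=times.__getitem__)
```

Because the minimum of independent exponentials selects index i with probability proportional to its rate, time-averaging the winners recovers the softmax of the energies.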
For example, there have been recent proposals that neural spikes correspond to samples [41], and that some spontaneous spiking activity corresponds to sampling from the brain's unclamped prior distribution [42]. Combining these local elements, via our composition and abstraction laws, into massively parallel, low-precision, intentionally stochastic circuits may help to bridge the gap between probabilistic theories of neural computation and the computational demands of complex probabilistic models and approximate inference [43].

(Figure 6 about here.)

Discussion

To further narrow the efficiency gap with the brain, and to scale to more challenging Bayesian inference problems, we need to improve the convergence rate of our architectures. One approach would be to initialize the state in a transition circuit via a separate, feed-forward, combinational circuit that approximates the equilibrium distribution of the Markov chain. Machine perception software that uses machine learning to construct fast, compact initializers is already in use [9]. Analyzing the number of transitions needed to close the gap between a good initialization and the target distribution may be harder [44]. However, some feedforward Monte Carlo inference strategies for Bayesian networks provably yield precise estimates of probabilities in polynomial time if the underlying probability model is sufficiently stochastic [45]; it remains to be seen whether similar conditions apply to stateful stochastic transition circuits.

It may also be fruitful to search for novel electronic devices, or previously unusable dynamical regimes of existing devices, that are as well matched to the needs of intentionally stochastic circuits as transistors are to logical inverters, potentially even via a spiking implementation. Physical phenomena that proved too unreliable for implementing Boolean logic gates may be viable building blocks for machines that perform Bayesian inference.
Computer engineering has thus far focused on deterministic mechanisms of remarkable scale and complexity: billions of parts that are expected to make trillions of state transitions with perfect repeatability [46]. But we are now engineering computing systems to exhibit more intelligence than they once did, and to identify probable explanations for noisy, ambiguous data drawn from large spaces of possibilities, rather than calculate the definite consequences of perfectly known assumptions with high precision. The apparent intractability of probabilistic inference has complicated these efforts, and challenged the viability of Bayesian reasoning as a foundation for engineering intelligent computation and for reverse-engineering the mind and brain.

At the same time, maintaining the illusion of rock-solid determinism has become increasingly costly. Engineers now attempt to build digital logic circuits in the deep sub-micron regime [47] and even inside cells [48]; in both these settings, the underlying physics has stochasticity that is difficult to suppress. Energy budgets have grown increasingly restricted, from the scale of the datacenter [49] to the mobile device [50], yet we spend substantial energy to operate transistors in deterministic regimes. And efforts to understand the dynamics of biological computation, from biological neural networks to gene expression networks [51], have all encountered stochastic behavior that is hard to explain in deterministic, digital terms. Our intentionally stochastic digital circuit elements and stochastic computing architectures suggest a new direction for reconciling these trends, and enable the design of a new class of fast, Bayesian digital computing machines.
Acknowledgements

The authors would like to acknowledge Tomaso Poggio, Thomas Knight, Gerald Sussman, Rakesh Kumar and Joshua Tenenbaum for numerous helpful discussions and comments on early drafts, and Tejas Kulkarni for contributions to the spiking implementation.

Figure 1. (A) Boolean gates, such as the AND gate, are mathematically specified by truth tables: deterministic mappings from binary inputs to binary outputs. (B) Compound Boolean circuits can be synthesized out of sub-circuits that each calculate different sub-functions, and treated as a single gate that implements the composite function, without reference to its internal details. (C) Each stochastic gate samples from a discrete probability distribution conditioned on an input; for clarity, we show an external source of random bits driving the stochastic behavior. (D) Composing gates that sample B given A and C given B yields a network that samples from the joint distribution over B and C given A; abstraction yields a gate that samples from the marginal distribution C|A. When only one sample path has nonzero probability, this recovers the composition of Boolean functions. (E) The THETA gate is a stochastic gate that generates samples from a Bernoulli distribution whose parameter theta is specified via the m input bits. Like all stochastic digital gates, it can be specified by a conditional probability table, analogously to how Boolean gates can be specified via a truth table. (F) Each time a new output sample is triggered (e.g. because its internal randomness source updates), a different output sample may be generated; time-averaging the output makes it possible to estimate the entries in the probability table, which are otherwise implicit. (G) The THETA gate can be implemented by comparing the output of a source of (pseudo)random bits to the input coin weight.
(H) Deterministic gates, such as the AND gate shown here, can be viewed as degenerate stochastic gates specified by conditional probability tables whose entries are either 0 or 1. This permits fluid interoperation of deterministic and stochastic gates in compound circuits. (I) A parallel circuit implementing a Binomial random variable can be built by combining THETA gates and adders using the composition laws from (D).

Figure 2. Stochastic transition circuits and massively parallel Bayesian inference. (A) A deterministic finite state machine consists of a register and a transition function implemented via combinational logic. (B) A stochastic transition circuit consists of a register and a stochastic transition operator implemented by a combinational stochastic circuit. Each stochastic transition circuit T_{S|X} is parameterized by some input X, and its internal combinational stochastic block P(S_{t+1} | S_t, X) must ergodically converge to a unique stationary distribution P(S|X) for all X. (C) Stochastic transition circuits can be composed to construct samplers for probabilistic models over multiple variables, by wiring together stochastic transition circuits for each variable based on their interactions. This circuit samples from a distribution P(A,B,C) = P(A)P(B|A)P(C|A).
(D) Each network of stochastic transition circuits can be scheduled in many ways; here we sho w one serial schedule and one parallel schedule for the transition circuit from (C). Con vergence depends only on respecting the in v ariant that no stochastic transition circuit transitions while other circuits that interact with it are transitioning. (E) The Markov chain implemented by this transition circuit. (F) T ypical stochastic ev olutions of the state in this circuit. (G) Inference can be implemented by clamping state v ariables to specific values; this yields a restricted Marko v chain that con ver ges to the conditional distribution over the unclamped variables giv en the clamped ones. Here we sho w the chain obtained by fixing C = 1 . (H) T ypical stochastic ev olutions of the state in this clamped transition circuit. Changing which v ariables are fixed allo ws the inference problem to be changed dynamically as the circuit is running. 18 CLK D Q X S|X S t+1 ~ P(S t+1 |S t , X) X S t+1 S t T S|X f A A B C f AC f AB T A|BC T C|A T B|A (1, 0, 0) 0.12 (1, 0, 1) 0.29 (1, 1, 0) 0.29 (0, 0, 0) 0.29 0.04 0.63 (1, 1, 1) 0.29 (0, 0, 1) 0.04 0.04 0.63 0.29 (0, 1, 0) 0.04 0.04 0.04 0.92 0.04 0.88 0.04 0.04 0.29 0.29 0.37 (0, 1, 1) 0.04 0.29 0.29 0.37 0.04 0.33 0.29 0.29 0.08 (0, 0, 1) 0.53 (0, 1, 1) 0.06 (1, 0, 1) 0.42 0.42 0.1 1 (1, 1, 1) 0.47 0.04 0.65 0.31 0.04 0.96 A) B) C) D) E) F) G) H) Tuesday, April 9, 13 CLK D Q X S|X S t+1 ~ P(S t+1 |S t , X) X S t+1 S t T S|X f A A B C f AC f AB T A|BC T C|A T B|A (1, 0, 0) 0.12 (1, 0, 1) 0.29 (1, 1, 0) 0.29 (0, 0, 0) 0.29 0.04 0.63 (1, 1, 1) 0.29 (0, 0, 1) 0.04 0.04 0.63 0.29 (0, 1, 0) 0.04 0.04 0.04 0.92 0.04 0.88 0.04 0.04 0.29 0.29 0.37 (0, 1, 1) 0.04 0.29 0.29 0.37 0.04 0.33 0.29 0.29 0.08 (0, 0, 1) 0.53 (0, 1, 1) 0.06 (1, 0, 1) 0.42 0.42 0.1 1 (1, 1, 1) 0.47 0.04 0.65 0.31 0.04 0.96 A) B) C) D) E) F) G) H) Tuesday, April 9, 13 IN OUT m m D Q S t+1 S t Combination al State T ransition L ook up T ab le CLK D Q X S|X S t+1 ~ 
Figure 3. (A) The discrete-sample gate is a central building block for stochastic transition circuits, used to implement Gibbs transition operators that update a variable by sampling from its conditional distribution given the variables it interacts with. The gate renormalizes the input log probabilities it is given, converts them to probabilities (by exponentiation), and then samples from the resulting distribution. Input energies are specified via a custom fixed-point coding scheme. (B) Discrete-sample gates remain accurate even when implemented at extremely low bit-precision. Here we show the relative entropy between true distributions and their low-precision implementations, for millions of distributions over discrete sets with 1000 elements; accuracy loss is negligible even when only 8 bits of precision are used.
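The gate's behavior, and the low-precision claim of panel (B), can be checked with a small software sketch. The quantizer below (4 fractional bits, which with sign and integer bits gives an 8-bit code) is a stand-in for the paper's custom m.n fixed-point scheme, whose exact format is not reproduced here:

```python
import math
import random

random.seed(1)

def quantize(e, frac_bits=4):
    # Stand-in fixed-point coding: keep `frac_bits` fractional bits.
    scale = 1 << frac_bits
    return round(e * scale) / scale

def softmax(energies):
    # Renormalize (subtract the max) for numerical stability, then exponentiate.
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    return [x / z for x in exps]

def kl(p, q):
    # Relative entropy D(p || q), the accuracy measure used in panel (B).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Random energies (log probabilities) over a 1000-element discrete set.
energies = [random.uniform(-8.0, 0.0) for _ in range(1000)]
p_full = softmax(energies)
p_low = softmax([quantize(e) for e in energies])

# The low-precision distribution is nearly indistinguishable from the
# full-precision one.
loss = kl(p_full, p_low)
```

Sampling an index from `p_low` then implements P(out = i | {e_k}) = exp(e_i) / Σ_k exp(e_k), the gate's output distribution.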
(C) The accuracy of low-precision discrete-sample gates can be understood by considering multinomial distributions with high, medium and low entropy. High entropy distributions involve outcomes with very similar probability, insensitive to ratios, while low entropy distributions are dominated by the location of the most probable outcome. (D) Low-precision transition circuits save area as compared to high-precision floating point alternatives; these area savings make it possible to economically exploit massive parallelism, by fitting many sampling units on a single chip.
[Figure 3 graphic: the gate takes K fixed-point energy inputs e_1, ..., e_K in m.n format (e.g. 11.125, -7.5) and emits a sample with P(out = i | {e_k}) = exp(e_i) / Σ_k exp(e_k).]
Figure 4. (A) A Markov Random Field for solving depth and motion perception, as well as other dense matching problems. Each X_i,j node stores the hidden quantity to be estimated, e.g. the disparity of a pixel. Each f_LP ensures adjacent Xs are either similar or very different, i.e. that depth and motion fields vary smoothly on objects but can contain discontinuities at object boundaries. Each Y_i,j node stores a per-latent-pixel vector of similarity information for a range of candidate matches, linked to the Xs by the f_E potentials. (B) The conditional independencies in this model permit many different parallelization strategies, from fully space-parallel implementations to virtualized implementations where blocks of pixels are updated in parallel. (C) Depth perception results. The left input image, plus the depth maps obtained by software (middle) and hardware (right) engines for solving the Markov Random Field. (D) Motion perception results. One input frame, plus the motion flow vector fields for software (middle) and hardware (right) solutions.
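The space-parallel updates behind these results rest on the grid's conditional independencies: in a 4-connected MRF, same-parity pixels share no edges, so an entire parity class can transition at once without violating the invariant from Figure 2(D). A minimal checkerboard-schedule sketch (illustrative only; the paper's implementations are synthesized circuits, not software):

```python
# Checkerboard ("red-black") schedule for a small 4-connected grid MRF.
H, W = 4, 6

def neighbors(i, j):
    # 4-connected neighborhood, clipped to the grid boundary.
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            yield (ni, nj)

# Two parallel stages: even-parity sites, then odd-parity sites.
schedule = [
    [(i, j) for i in range(H) for j in range(W) if (i + j) % 2 == p]
    for p in (0, 1)
]

# Invariant check: no site in a stage is adjacent to another site in the
# same stage, so all of its sites may be updated simultaneously.
def stage_is_independent(stage):
    sites = set(stage)
    return all(n not in sites for s in sites for n in neighbors(*s))

ok = all(stage_is_independent(stage) for stage in schedule)
```

Virtualized implementations tile the same idea: each stage is split into blocks that are streamed through a fixed pool of sampling units.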
(E) Energy versus time for software and hardware solutions to depth perception, including both 8-bit and 12-bit hardware. Note that the hardware is roughly 500x faster than the software on this frame. (F) Energy versus time for software and hardware solutions to motion perception.
[Figure 4 graphic: energy-versus-time traces for stereo vision and optical flow; input images with software and low-precision hardware outputs.]
Figure 5. (A) Example samples from the posterior distribution of cluster assignments for a nonparametric mixture model. The two samples show posterior variance, reflecting the uncertainty between three and four source clusters. (B) Typical handwritten digit images from the MNIST corpus 52, showing a high degree of variation across digits of the same type. (C) The digit clusters discovered automatically by a stochastic digital circuit for inference in Dirichlet process mixture models. Each image represents a cluster; each pixel represents the probability that the corresponding image pixel is black. Clusters are sorted according to the most probable true digit label of the images in the cluster. Note that these cluster labels were not provided to the circuit. Both the clusters and the number of clusters were discovered automatically by the circuit over the course of inference. (D) The receiver operating characteristic (ROC) curves that result from classifying digits using the learned clusters; quantitative results are competitive with state-of-the-art classifiers. (E) The time required for one cycle through the outermost transition circuit in hardware, versus the corresponding time for one sweep of a highly optimized software implementation of the same sampler, which is ~2000x slower.
[Figure 5 graphic: scatter plots of two posterior samples over (x_0, x_1); bar charts of P(next cluster | {x_i}) over cluster IDs 0, 1, 2 and a new cluster.]
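The per-cluster probabilities P(next cluster | {x_i}) shown in panel (A) are driven by the Chinese-restaurant-process prior of the Dirichlet process mixture: an existing cluster is chosen in proportion to its size, a new cluster in proportion to the concentration parameter alpha. A minimal sketch of that prior term (the full Gibbs update also weights each option by the cluster's likelihood for the data point):

```python
def crp_predictive(cluster_sizes, alpha):
    # Chinese-restaurant-process predictive distribution over which cluster
    # the next data point joins; the last slot is a brand-new cluster.
    n = sum(cluster_sizes)
    weights = list(cluster_sizes) + [alpha]
    z = n + alpha
    return [w / z for w in weights]

# Example: three existing clusters of sizes 5, 3 and 2, concentration 1.
probs = crp_predictive([5, 3, 2], alpha=1.0)
```

Because the new-cluster slot always has nonzero mass, the sampler can grow (and shrink) the number of clusters over the course of inference, which is how the circuit discovers the number of digit clusters automatically.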
Figure 6. (A) A Bayesian network model for ICU alarm monitoring, showing measurable variables, hidden variables, and diagnostic variables of interest. (B) A factor graph representation of this Bayesian network, rendered by the input stage for our stochastic transition circuit synthesis software. (C) A representation of the factor graph showing evidence variables as well as a parallel schedule for the transition circuits automatically extracted by our compiler: all nodes of the same color can be transitioned simultaneously. (D) Three diagnosis results from Bayesian inference in the alarm network, showing high-accuracy diagnoses (with some posterior uncertainty) from an automatically generated circuit. (E) The schematic of a spiking neural implementation of a stochastic transition circuit assembly for sampling from the three-variable probabilistic model from Figure 2. (F) The spike raster (black) and state sequence (blue) that result from simulating the circuit. (G) The spiking simulation yields state distributions that agree with exact simulation of the underlying Markov chain.
[Figure 6 graphic: the alarm network (nodes such as HRBP, SaO2, Catechol, VentLung); example findings (e.g. High: PRESS; Low: MINVOL) with posterior P(Cause) bars; conditioning variables (fixed) and parallel stages of inference; the transformation, parallelization and inference pipeline.]
1. Weiss, Y., Simoncelli, E. P. & Adelson, E. H. Motion illusions as optimal percepts. Nature Neuroscience 5, 598–604 (2002).
2. Körding, K. P. & Wolpert, D. M. Bayesian integration in sensorimotor learning. Nature 427, 244–247 (2004).
3. Griffiths, T. L. & Tenenbaum, J. B. Optimal predictions in everyday cognition. Psychological Science 17, 767–773 (2006).
4. Blaisdell, A. P., Sawa, K., Leising, K. J. & Waldmann, M. R. Causal reasoning in rats. Science 311, 1020–1022 (2006).
5. Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285 (2011).
6. Ferrucci, D. et al. Building Watson: an overview of the DeepQA project. AI Magazine 31, 59–79 (2010).
7. Thrun, S. Probabilistic robotics. Communications of the ACM 45, 52–57 (2002).
8. Thrun, S., Burgard, W. & Fox, D. Probabilistic Robotics, vol. 1 (MIT Press, Cambridge, 2005).
9. Shotton, J. et al. Real-time human pose recognition in parts from single depth images. Communications of the ACM 56, 116–124 (2013).
10. Eckert Jr., J. P., Weiner, J. R., Welsh, H. F. & Mitchell, H. F. The UNIVAC system. In Papers and Discussions Presented at the Dec. 10–12, 1951, Joint AIEE-IRE Computer Conference: Review of Electronic Digital Computers, 6–16 (ACM, 1951).
11. Shivakumar, P., Kistler, M., Keckler, S. W., Burger, D. & Alvisi, L. Modeling the effect of technology trends on the soft error rate of combinational logic. In Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on, 389–398 (IEEE, 2002).
12. Rosenmund, C., Clements, J. & Westbrook, G. Nonuniform probability of glutamate release at a hippocampal synapse. Science 262, 754–757 (1993).
13. von Neumann, J. The Computer and the Brain (1958).
14. Akgul, B. E., Chakrapani, L. N., Korkmaz, P. & Palem, K. V. Probabilistic CMOS technology: a survey and future directions. In Very Large Scale Integration, 2006 IFIP International Conference on, 1–6 (IEEE, 2006).
15. Gaines, B. Stochastic computing systems. Advances in Information Systems Science 2, 37–172 (1969).
16. Mead, C. Neuromorphic electronic systems. Proceedings of the IEEE 78, 1629–1636 (1990).
17. Choudhary, S. et al. Silicon neurons that compute. In Artificial Neural Networks and Machine Learning – ICANN 2012, 121–128 (Springer, 2012).
18. Ackerman, N. L., Freer, C. E. & Roy, D. M. On the computability of conditional probability. ArXiv e-prints (2010). arXiv:1005.3014.
19. Mansinghka, V. K. Natively Probabilistic Computation. Ph.D. thesis, Massachusetts Institute of Technology (2009).
20. Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K. & Tenenbaum, J. B. Church: a language for generative models. In Uncertainty in Artificial Intelligence (2008).
21. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 5.
22. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann Publishers, San Francisco, 1988).
23. Lin, M., Lebedev, I. & Wawrzynek, J. High-throughput Bayesian computing machine with reconfigurable hardware. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 73–82 (ACM, 2010).
24. Vigoda, B. W. Continuous-Time Analog Circuits for Statistical Signal Processing. Ph.D. thesis, Massachusetts Institute of Technology (2003).
25. Shannon, C. E. A Symbolic Analysis of Relay and Switching Circuits. Ph.D. thesis, Massachusetts Institute of Technology (1940).
26. Mansinghka, V. & Jonas, E. Supplementary material on stochastic digital circuits (2014). URL http://probcomp.csail.mit.edu/VMEJ-circuits-supplement.pdf.
27. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087 (1953).
28. Geman, S. & Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 721–741 (1984).
29. Andrieu, C., De Freitas, N., Doucet, A. & Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning 50, 5–43 (2003).
30. Ward Jr., S. A. & Halstead, R. H. Computation Structures (The MIT Press, 1990).
31. Marsaglia, G. Xorshift RNGs. Journal of Statistical Software 8, 1–6 (2003).
32. Wang, F. & Agrawal, V. D. Soft error rate determination for nanometer CMOS VLSI logic. In 40th Southwest Symposium on Systems Theory, 324–328 (2008).
33. Marr, D. & Poggio, T. Cooperative computation of stereo disparity. Science 194, 283–287 (1976).
34. Szeliski, R. et al. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30, 1068–1080 (2008).
35. Ferguson, T. S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209–230 (1973).
36. Rasmussen, C. E. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12, 2 (2000).
37. Anderson, J. R. & Matessa, M. A rational analysis of categorization. In Proc. of 7th International Machine Learning Conference, 76–84 (1990).
38. Griffiths, T. L., Sanborn, A. N., Canini, K. R. & Navarro, D. J. Categorization as nonparametric Bayesian density estimation. The Probabilistic Mind: Prospects for Bayesian Cognitive Science, 303–328 (2008).
39. Imre, A. et al. Majority logic gate for magnetic quantum-dot cellular automata. Science 311, 205–208 (2006).
40. Kschischang, F. R., Frey, B. J. & Loeliger, H.-A. Factor graphs and the sum-product algorithm. Information Theory, IEEE Transactions on 47, 498–519 (2001).
41. Fiser, J., Berkes, P., Orbán, G. & Lengyel, M. Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences 14, 119–130 (2010).
42. Berkes, P., Orbán, G., Lengyel, M. & Fiser, J. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science 331, 83–87 (2011).
43. Pouget, A., Beck, J., Ma, W. J. & Latham, P. E. Probabilistic brains: knowns and unknowns. Nature Neuroscience 16, 1170–1178 (2013).
44. Diaconis, P. The Markov chain Monte Carlo revolution. Bulletin of the American Mathematical Society 46, 179–205 (2009).
45. Dagum, P. & Luby, M. An optimal approximation algorithm for Bayesian inference. Artificial Intelligence 93, 1–27 (1997).
46. Weaver, C., Emer, J., Mukherjee, S. & Reinhardt, S. Techniques to reduce the soft error rate of a high-performance microprocessor. In Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, 264–275 (2004).
47. Shepard, K. L. & Narayanan, V. Noise in deep submicron digital design. In Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, 524–531 (IEEE Computer Society, 1997).
48. Elowitz, M. B. & Leibler, S. A synthetic oscillatory network of transcriptional regulators. Nature 403, 335–338 (2000).
49. Barroso, L. A. & Hölzle, U. The case for energy-proportional computing. Computer 40, 33–37 (2007).
50. Flinn, J. & Satyanarayanan, M. Energy-aware adaptation for mobile applications. ACM SIGOPS Operating Systems Review 33, 48–63 (1999).
51. McAdams, H. H. & Arkin, A. Stochastic mechanisms in gene expression. Proceedings of the National Academy of Sciences 94, 814–819 (1997).
52. LeCun, Y. & Cortes, C. The MNIST database of handwritten digits (1998).