Monte Carlo Algorithms for the Partition Function and Information Rates of Two-Dimensional Channels


Authors: Mehdi Molkaraie, Hans-Andrea Loeliger

Mehdi Molkaraie, Member, IEEE, and Hans-Andrea Loeliger, Fellow, IEEE

Abstract—The paper proposes Monte Carlo algorithms for the computation of the information rate of two-dimensional source/channel models. The focus of the paper is on binary-input channels with constraints on the allowed input configurations. The problem of numerically computing the information rate, and even the noiseless capacity, of such channels has so far remained largely unsolved. Both problems can be reduced to computing a Monte Carlo estimate of a partition function. The proposed algorithms use tree-based Gibbs sampling and multilayer (multi-temperature) importance sampling. The viability of the proposed algorithms is demonstrated by simulation results.

Index Terms—Two-dimensional channels, constrained channels, partition function, Gibbs sampling, importance sampling, factor graphs, sum-product message passing, capacity, information rate.

I. INTRODUCTION

Numerically computing the Shannon information rate for source/channel models with memory can be difficult. In many cases of practical interest, analytical results are not available or hard to evaluate numerically. For a large class of channels, however, Monte Carlo methods as proposed in [1]–[3] have been shown to yield reliable numerical results. In this paper, we consider the extension of such Monte Carlo methods to source/channel models with a two-dimensional (2-D) structure. The focus of the paper is on 2-D binary-input channels with constraints on the allowed input configurations; for example, we consider the channel where adjacent channel input symbols must not both have the value 1. Variations of such channels are of interest in magnetic and optical storage, where the constraints are imposed, e.g., in order to reduce the intersymbol interference or to help in timing control [4]–[8].
We will consider both noiseless and noisy versions of such channels. With suitable modifications (simplifications), the methods of this paper can also be applied to other 2-D source/channel models such as channels with intersymbol interference. In the one-dimensional (1-D) case, computing the capacity of noiseless constrained channels was addressed and solved by Shannon [9], see also [4]. For the noisy case, the Monte Carlo methods of [1]–[3] can be used to compute the information rate. The 2-D case is harder. Even the noiseless capacity is hard to compute numerically: while very tight analytical results are available for a number of special cases (e.g., [10]–[15]), other cases have remained open problems. The noisy case has remained largely unsolved. The capacity of a noiseless constrained channel is essentially the logarithm of the partition function of the corresponding indicator function (see Section II). Moreover, computing the information rate of noisy source/channel models can also be reduced to the computation of a partition function (see Section VI). The heart of the paper, therefore, consists of Monte Carlo algorithms for the computation of partition functions. Several such algorithms are well known [19]–[21], see also [22], [23], but some modifications will be necessary for the problems of interest in this paper.

(Mehdi Molkaraie was with the Dept. of Information Technology and Electrical Engineering, ETH Zürich, CH-8092 Zürich, Switzerland. He is now with the Dept. of Statistics and Actuarial Science, University of Waterloo, Waterloo N2L 3G1, Canada. Email: mmolkaraie@uwaterloo.ca. Hans-Andrea Loeliger is with the Dept. of Information Technology and Electrical Engineering, ETH Zürich, CH-8092 Zürich, Switzerland. Email: loeliger@isi.ee.ethz.ch. Preliminary versions of the material of this paper were presented in [16]–[18].)
In particular, we will find tree-based Gibbs sampling (due to Hamze and de Freitas [24]) extremely useful. We will observe that Monte Carlo estimates of a partition function may actually be obtained as a by-product of tree-based Gibbs sampling, which does not seem to have been noticed before. In related prior work, Monte Carlo algorithms have been used to compute bounds on, or approximations of, the information rate of 2-D source/channel models with memory [25], [26]. Some of this work uses generalized belief propagation [27], which appears to yield very good approximations to the information rate [25], [18], [28], [29]. In contrast to all this prior work, the Monte Carlo methods of this paper are asymptotically unbiased, i.e., in the limit of infinitely many samples, the estimates are guaranteed to converge to the desired quantity (the information rate). Moreover, the focus of this paper is on constrained channels, for which these computational problems are harder than for intersymbol interference channels (cf. Section VI-B). The empirical success of the proposed algorithms is epitomized by Fig. 8, which shows the uniform-input information rate of a binary-input channel with input constraints and additive white Gaussian noise (AWGN). As far as known to the authors, no such figure (for such a channel) has been presented before. If the reader is not familiar with Gibbs sampling, the following comments on the general nature of this work may be in order. First, Gibbs sampling is easily proved (under very mild conditions) to yield samples that are eventually distributed according to the desired distribution and asymptotically independent [23] (i.e., deleting the first t samples and decimating the remaining sample sequence by a factor m results in an i.i.d. sequence in the limit t, m → ∞).
However, the dependencies among the samples may decay extremely slowly, which is the pivotal issue with Gibbs sampling and makes naive Gibbs sampling perfectly useless for the problems of this paper (and for many other problems). The challenge, therefore, is to speed up Gibbs sampling (i.e., to decrease the dependencies of the samples) by various additional tricks and insights so that it becomes useful. Second, the class of problems for which the methods proposed in this paper will work is not easily expressed in exact terms. Again, the issue is not formal applicability (which is quite sweeping), but the speed of convergence, which strongly depends on the particulars of the case and is not easily predicted.

The paper is organized as follows. In Section II, we review partition functions and noiseless 2-D constrained channels, and we introduce the corresponding notation. In Section III, we review several Monte Carlo algorithms that will be used in this paper. However, additional ideas are necessary to make these algorithms work for our applications. In particular, we will need tree-based Gibbs sampling as described in Section IV. The application to noiseless constrained channels is demonstrated in Section V. The application to noisy channels is described and demonstrated in Section VI. The appendix reviews sampling from Markov chains, which is needed in Section IV.

II. PARTITION FUNCTION OF 2-D GRAPHICAL MODELS

Let X_1, X_2, ..., X_N be finite sets, let X be the Cartesian product X ≜ X_1 × X_2 × ... × X_N, and let f be a nonnegative function f : X → R. We are interested in computing (exactly or approximately) the partition function

Z_f \triangleq \sum_{x \in \mathcal{X}} f(x)    (1)

for cases where
• N is large and
• f has a suitable factorization (as will be detailed below).

We will usually assume Z_f ≠ 0. Then

p_f(x) \triangleq \frac{f(x)}{Z_f}    (2)

is a probability mass function on X.
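Definitions (1) and (2) can be made concrete by brute-force enumeration, which is feasible only for tiny N; the following toy sketch (not from the paper, illustrative only) uses the pairwise constraint factor of Section II on two binary variables:

```python
import itertools

def partition_function(f, domains):
    """Brute-force Z_f = sum over all configurations x of f(x), eq. (1).
    Only feasible when the Cartesian product of the domains is small."""
    return sum(f(x) for x in itertools.product(*domains))

def p_f(f, domains, x):
    """Normalized probability mass function p_f(x) = f(x)/Z_f, eq. (2)."""
    return f(x) / partition_function(f, domains)

# Toy example: two binary variables that must not both be 1
# (the factor kappa of eq. (5) for a single adjacent pair).
kappa = lambda a, b: 0 if (a == 1 and b == 1) else 1
domains = [(0, 1), (0, 1)]
Z = partition_function(lambda x: kappa(x[0], x[1]), domains)
print(Z)  # 3 allowed configurations: 00, 01, 10
```

Here Z_f counts the allowed configurations, since f is {0, 1}-valued; this is exactly the situation Z_f = |S_f| discussed in the example of Section II.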
We also define the support of f (and of p_f) as

\mathcal{S}_f \triangleq \{ x \in \mathcal{X} : f(x) > 0 \}    (3)

If f(x) has a cycle-free factor graph representation (and if |X_1|, |X_2|, ..., |X_N| are not too large), then Z_f can be computed efficiently by sum-product message passing [30], [31]. In this paper, however, we consider factor graphs with cycles. In particular, we are interested in examples of the following type.

Example: Simple 2-D Constrained Channel

Consider a grid of N = M × M binary (i.e., {0, 1}-valued) variables x_1, ..., x_N with the constraint that no two (horizontally or vertically) adjacent variables have both the value 1.

[Fig. 1. Forney factor graph of the indicator function (4). The unlabeled boxes represent factors as in (5).]

Let f : {0, 1}^N → {0, 1} be the indicator function of this constraint, which can be factored into

f(x_1, \ldots, x_N) = \prod_{k, \ell \text{ adjacent}} \kappa(x_k, x_\ell)    (4)

where the product runs over all adjacent pairs (k, ℓ) and with factors

\kappa(x_k, x_\ell) = \begin{cases} 0, & \text{if } x_k = x_\ell = 1 \\ 1, & \text{otherwise.} \end{cases}    (5)

The corresponding Forney factor graph of f is shown in Fig. 1, where the boxes labeled "=" are equality constraints [31]. (Note that, in Forney factor graphs, variables are represented by edges. Fig. 1 may also be viewed as a factor graph as in [30] where the boxes labeled "=" are the variable nodes.) Note that, in this example, Z_f = |S_f|. This example is known as the 2-D (1, ∞) run-length limited constrained channel [4]. The quantity

C_M \triangleq \frac{1}{N} \log_2 Z_f    (6)

is known as the (finite-size) noiseless capacity of the channel. For this particular example, upper and lower bounds on the infinite-size noiseless capacity

C_\infty \triangleq \lim_{M \to \infty} C_M    (7)

were first proposed in [10] and improved in [11] and [32]. The bounds in [32] agree on nine decimal digits, which far exceeds the accuracy that can be achieved with the Monte Carlo methods of the present paper.
However, the methods proposed in this paper also work for various generalizations of this example for which no tight bounds are known. □

Later on, in Section VI, we will consider noisy versions of such channels. As it turns out, the computation of the information rates of such channels also requires the computation of partition functions as in (1).

III. MONTE CARLO METHODS FOR PARTITION FUNCTION ESTIMATION

One well-known method to estimate Γ_f ≜ 1/Z_f (and thus Z_f itself) goes as follows.

Ogata-Tanemura Method [19], [21]:
1) Draw samples x^{(1)}, x^{(2)}, ..., x^{(K)} from S_f according to p_f(x) as in (2).
2) Compute

\hat{\Gamma}_f = \frac{1}{K |\mathcal{S}_f|} \sum_{k=1}^{K} \frac{1}{f(x^{(k)})}    (8)

□

It is easy to verify that E(Γ̂_f) = 1/Z_f. However, there are two major issues with this method. First, there is the problem of generating the samples x^{(1)}, x^{(2)}, ..., x^{(K)} according to p_f(x). Ideally, we would like these samples to be independent, but (for the purposes of this paper) this is asking too much. In particular, we will use Gibbs sampling [22], [33], which produces dependent samples. However, with naive Gibbs sampling, the dependencies among the samples decay far too slowly for the estimate (8) to be useful for us (cf. the remarks in the Introduction). We will see in Section IV how this issue is eased by tree-based Gibbs sampling as proposed by Hamze and de Freitas [24]. Second, it is usually assumed that f is strictly positive. In this case, S_f = X and |S_f| = |X| is known. However, this assumption excludes applications to constrained channels as in the example in Section II. Indeed, in that example, we would have f(x^{(k)}) = 1 for all samples x^{(k)}, and |S_f| = Z_f is the desired unknown quantity. We will address this issue in Section IV-B.
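The estimator (8) can be sketched in miniature as follows. This toy (not from the paper) uses a strictly positive f on a small domain so that |S_f| is trivially known, and draws exact i.i.d. samples from p_f by enumeration, standing in for the Gibbs sampler that the paper uses when exact sampling is infeasible:

```python
import itertools
import random

def ogata_tanemura(f, domains, K, rng=random.Random(0)):
    """Estimate Gamma_f = 1/Z_f via eq. (8):
    (1/(K*|S_f|)) * sum_k 1/f(x^(k)), with x^(k) ~ p_f."""
    configs = list(itertools.product(*domains))
    weights = [f(x) for x in configs]
    support_size = sum(1 for w in weights if w > 0)
    # Exact sampling from p_f (toy stand-in for Gibbs sampling):
    samples = rng.choices(configs, weights=weights, k=K)
    return sum(1.0 / f(x) for x in samples) / (K * support_size)

# Strictly positive toy factorization on two ternary variables.
f = lambda x: (1 + x[0]) * (1 + x[1])
domains = [(0, 1, 2), (0, 1, 2)]
Z_true = sum(f(x) for x in itertools.product(*domains))  # = (1+2+3)^2 = 36
gamma_hat = ogata_tanemura(f, domains, K=200000)
print(gamma_hat, 1.0 / Z_true)
```

Unbiasedness is immediate: under p_f, E[1/f(X)] = |S_f|/Z_f, so dividing the empirical mean by |S_f| yields 1/Z_f.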
With suitable modifications, which will address the mentioned issues, the Ogata-Tanemura method will turn out to work well for the capacity of noiseless constrained 2-D channels. However, for our second application (the information rate of noisy 2-D constrained source/channel models), the Ogata-Tanemura method turns out to be inadequate. We will therefore resort to multilayer importance sampling as described below. We first describe the use of standard (single-layer) importance sampling to estimate Z_f.

Importance Sampling [22], [34]:
1) Draw samples x^{(1)}, x^{(2)}, ..., x^{(K)} from X according to some auxiliary probability distribution q(x) = g(x)/Z_g, where g(x) is a nonnegative function defined on X satisfying g(x) ≠ 0 whenever f(x) ≠ 0.
2) Compute

\hat{R} = \frac{1}{K} \sum_{k=1}^{K} \frac{f(x^{(k)})}{g(x^{(k)})}    (9)

□

It is easy to verify that E(R̂) = Z_f / Z_g. The key issue with importance sampling is to find a useful function g(x) such that
• q(x) is a good approximation of p_f(x) (so that f(x)/g(x) does not fluctuate wildly),
• sampling from q(x) is feasible, and
• computing Z_g is feasible.

[Fig. 2. Partition of Fig. 1 into two cycle-free parts (one part inside the two ovals, the other part outside the ovals).]

An obvious choice for g(x) (and thus q(x)) is

g(x) \triangleq f(x)^{\alpha}    (10)

for 0 < α < 1. With this choice, any factorization of f(x) results in a factorization of g(x) that preserves the topology of the corresponding factor graph. (Note, however, that this choice of g(x) is not helpful if f(x) is {0, 1}-valued.) In order to sample from q(x), we will again use tree-based Gibbs sampling (see Section IV-A). In a variation of the algorithm, the estimator (9) of the ratio Z_f / Z_g could be replaced by Bennett's acceptance ratio method [35], which is also known as bridge sampling [36].
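A miniature version of the estimator (9) with the auxiliary function (10) can be sketched as follows (a toy, not from the paper; exact sampling from q stands in for tree-based Gibbs sampling, and f is chosen so that Z_f and Z_g are easy to verify by enumeration):

```python
import itertools
import random

def importance_ratio(f, g, domains, K, rng=random.Random(1)):
    """Estimate Z_f / Z_g via eq. (9): (1/K) * sum_k f(x^(k))/g(x^(k)),
    with x^(k) drawn from q(x) = g(x)/Z_g (exact sampling here)."""
    configs = list(itertools.product(*domains))
    g_weights = [g(x) for x in configs]
    samples = rng.choices(configs, weights=g_weights, k=K)
    return sum(f(x) / g(x) for x in samples) / K

# Toy target f and auxiliary g(x) = f(x)**alpha as in eq. (10).
alpha = 0.5
f = lambda x: (1 + x[0] + x[1]) ** 2
g = lambda x: f(x) ** alpha
domains = [(0, 1, 2), (0, 1, 2)]
Zf = sum(f(x) for x in itertools.product(*domains))  # = 93
Zg = sum(g(x) for x in itertools.product(*domains))  # = 27
R_hat = importance_ratio(f, g, domains, K=200000)
print(R_hat, Zf / Zg)
```

If Z_g is known (or separately estimated), multiplying R̂ by Z_g yields an estimate of Z_f itself.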
A function g(x) with all the required properties may be hard to find, or it may not exist. This problem is addressed by multilayer importance sampling, which uses several auxiliary distributions.

Multilayer (Multi-Temperature) Importance Sampling:
Multilayer importance sampling, as described here, is a variation of annealed importance sampling as proposed by Neal [37], [34]; see also [38]. We use J parallel versions of importance sampling as follows. For j = 0, 1, ..., J, let

g_j(x) \triangleq f(x)^{\alpha_j}    (11)

with 0 ≤ α_J < ⋯ < α_1 < α_0 = 1. Note that Z_{g_0} = Z_f and

\frac{Z_f}{Z_{g_J}} = \frac{Z_{g_0}}{Z_{g_1}} \cdot \frac{Z_{g_1}}{Z_{g_2}} \cdots \frac{Z_{g_{J-1}}}{Z_{g_J}}    (12)

For j = 1, 2, ..., J, compute the ratio Z_{g_{j-1}} / Z_{g_j} by importance sampling as described before:
1) Draw samples x^{(1)}, x^{(2)}, ..., x^{(K)} from q_j(x) ∝ g_j(x).
2) Compute

\hat{R}_j = \frac{1}{K} \sum_{k=1}^{K} \frac{g_{j-1}(x^{(k)})}{g_j(x^{(k)})}    (13)
          = \frac{1}{K} \sum_{k=1}^{K} f(x^{(k)})^{\alpha_{j-1} - \alpha_j}    (14)

□

Noting that E(R̂_j) = Z_{g_{j-1}} / Z_{g_j}, we use ∏_{j=1}^{J} R̂_j as an estimate of Z_f / Z_{g_J}. If the number of layers J is large, g_j(x) is a good approximation of g_{j-1}(x) at each layer j. It remains to find an estimate of Z_{g_J}, which tends to be easier than the original problem of estimating Z_f. In particular, for α_J = 0, we have Z_{g_J} = |S_f|, the cardinality of the support of f. In our application (Section VI), it turns out that Z_{g_J} can be computed efficiently by the tree-based Ogata-Tanemura method of Section IV-B.

IV. TREE-BASED GIBBS SAMPLING AND PARTITION FUNCTION ESTIMATION

A. Tree-Based Gibbs Sampling

Tree-based Gibbs sampling was proposed by Hamze and de Freitas [24]. It works as follows. Let (A, B) be a partition of the index set {1, 2, ..., N} such that
• for fixed x_A, the factor graph of f(x) = f(x_A, x_B) is cycle-free, and
• for fixed x_B, the factor graph of f(x) = f(x_A, x_B) is also cycle-free.
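The telescoping product (12) with the per-layer estimators (13)-(14) can be sketched in miniature as follows (a toy, not from the paper): exact per-layer sampling stands in for tree-based Gibbs sampling, and f is strictly positive so that α_J = 0 gives Z_{g_J} = |X|:

```python
import itertools
import math
import random

def multilayer_estimate(f, domains, alphas, K, rng=random.Random(2)):
    """Estimate Z_f / Z_{g_J} as the product over layers of
    R_j = (1/K) * sum_k f(x^(k))^(alpha_{j-1} - alpha_j), eqs. (12)-(14),
    with x^(k) drawn from q_j proportional to f**alpha_j."""
    configs = list(itertools.product(*domains))
    estimate = 1.0
    for j in range(1, len(alphas)):
        weights = [f(x) ** alphas[j] for x in configs]
        samples = rng.choices(configs, weights=weights, k=K)
        estimate *= sum(f(x) ** (alphas[j - 1] - alphas[j])
                        for x in samples) / K
    return estimate

f = lambda x: math.exp(x[0] + 2 * x[1])        # toy positive target
domains = [(0, 1), (0, 1)]
alphas = [1.0, 0.5, 0.25, 0.0]                 # alpha_0 = 1 > ... > alpha_J = 0
Zf = sum(f(x) for x in itertools.product(*domains))
ZgJ = 2 * 2                                    # alpha_J = 0: Z_{g_J} = |S_f| = |X|
ratio_hat = multilayer_estimate(f, domains, alphas, K=200000)
print(ratio_hat * ZgJ, Zf)
```

Each layer compares two nearby "temperatures," so the integrand f^(α_{j-1} − α_j) fluctuates far less than the single-layer ratio f/g_J would.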
An example of such a partition is shown in Fig. 2. Starting from some initial configuration x^{(0)} = (x_A^{(0)}, x_B^{(0)}), the samples x^{(k)} = (x_A^{(k)}, x_B^{(k)}), k = 1, 2, ..., are created as follows: first, x_A^{(k)} is drawn from

p_f(x_A \mid x_B = x_B^{(k-1)}) \propto f(x_A, x_B^{(k-1)})    (15)

second, x_B^{(k)} is drawn from

p_f(x_B \mid x_A = x_A^{(k)}) \propto f(x_A^{(k)}, x_B)    (16)

The point is that drawing these samples is easy (by means of backward-filtering forward-sampling, see the appendix) since the corresponding factor graphs are cycle-free. As in standard Gibbs sampling, the samples (x_A^{(k)}, x_B^{(k)}) are eventually (i.e., for k → ∞) drawn from p_f (provided that the corresponding Markov chain is irreducible and aperiodic [23]). However, tree-based Gibbs sampling mixes faster than naive Gibbs sampling.

B. Application to Partition Function Estimation

With A and B as above, let

f_A(x_A) \triangleq \sum_{x_B} f(x_A, x_B)    (17)

and

f_B(x_B) \triangleq \sum_{x_A} f(x_A, x_B)    (18)

We then note that

Z_{f_A} = \sum_{x_A} f_A(x_A)    (19)
        = Z_f    (20)

i.e., f_A (and analogously f_B) has the same partition function as f itself. We also note that samples x_A^{(1)}, x_A^{(2)}, ..., from

p_{f_A}(x_A) \triangleq \frac{f_A(x_A)}{Z_f} = \sum_{x_B} p_f(x_A, x_B)    (21)

can be obtained by tree-based Gibbs sampling as in Section IV-A simply by dropping the second component (the B-part) of the samples (x_A^{(1)}, x_B^{(1)}), (x_A^{(2)}, x_B^{(2)}), ....

We can now use the Ogata-Tanemura method (Section III) to estimate Γ_f = 1/Z_f in two different ways. One way is to directly use the estimator (8) with samples x^{(k)} = (x_A^{(k)}, x_B^{(k)}) as in Section IV-A. Another way is to apply the estimator (8) to f_A, i.e., to compute

\hat{\Gamma}_{f_A} \triangleq \frac{1}{K |\mathcal{S}_{f_A}|} \sum_{k=1}^{K} \frac{1}{f_A(x_A^{(k)})}    (22)

which is also an estimate of 1/Z_f.
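The sampling steps (15)-(16) reduce, on each cycle-free part, to sampling from a chain of pairwise factors as in (39) of the appendix. A minimal sketch of backward-filtering forward-sampling for such a chain (a toy, not the paper's 2-D implementation) is:

```python
import random

def backward_filter_forward_sample(g_list, domain, rng=random.Random(3)):
    """Sample x = (x_1, ..., x_n) from p(x) proportional to
    prod_k g_k(x_{k-1}, x_k).  Backward pass: messages mu[k] as in eq. (40);
    forward pass: x_1 ~ mu[0], then x_k ~ g_k(x_{k-1}, .) * mu[k], eq. (44).
    Also returns sum_x prod_k g_k (the chain's partition function, eq. (47))."""
    n = len(g_list) + 1
    mu = [None] * n
    mu[n - 1] = {v: 1.0 for v in domain}
    for k in range(n - 2, -1, -1):            # backward filtering
        mu[k] = {v: sum(g_list[k](v, w) * mu[k + 1][w] for w in domain)
                 for v in domain}
    Z = sum(mu[0].values())                   # eq. (47)
    x = [None] * n                            # forward sampling
    x[0] = rng.choices(domain, weights=[mu[0][v] for v in domain])[0]
    for k in range(1, n):
        w = [g_list[k - 1](x[k - 1], v) * mu[k][v] for v in domain]
        x[k] = rng.choices(domain, weights=w)[0]
    return x, Z

# Chain of 5 binary variables with the (1, inf) constraint between neighbors.
kappa = lambda a, b: 0.0 if (a == 1 and b == 1) else 1.0
sample, Z = backward_filter_forward_sample([kappa] * 4, (0, 1))
print(sample, Z)  # Z = 13: count of length-5 binary strings without "11"
```

Note that Z drops out of the sketch for free, which is precisely the by-product property exploited in Section IV-B.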
The computation of

f_A(x_A^{(k)}) = \sum_{x_B} f(x_A^{(k)}, x_B)    (23)

which is required in (22), is easy since the corresponding factor graph is a tree; in fact, the result of this computation is obtained for free as a by-product of tree-based Gibbs sampling, as is explained in the appendix. By symmetry, we also have an analogous estimate Γ̂_{f_B} defined as

\hat{\Gamma}_{f_B} \triangleq \frac{1}{K |\mathcal{S}_{f_B}|} \sum_{k=1}^{K} \frac{1}{f_B(x_B^{(k)})}    (24)

With the same samples (x_A^{(1)}, x_B^{(1)}), (x_A^{(2)}, x_B^{(2)}), ..., estimating 1/Z_f from (22) and (24) tends to converge faster than (8). More importantly for this paper, the direct Ogata-Tanemura method (8) cannot be used for constrained channels (cf. the example in Section II) where |S_f| = Z_f is the desired quantity. In contrast, the computation of |S_{f_A}| in (22) and |S_{f_B}| in (24) may be easy in such cases, as will be exemplified below.

V. APPLICATION TO THE CAPACITY OF NOISELESS 2-D CONSTRAINED CHANNELS

We demonstrate the estimators (22) and (24) by their application to the example in Section II, the 2-D (1, ∞) runlength-limited constrained channel. We will use factor graphs as in Fig. 1 with a partitioning as in Fig. 2. In this example, the quantities |S_{f_A}| and |S_{f_B}|, which are needed in (22) and (24), respectively, are given by

|\mathcal{S}_{f_A}| = \sum_{x_A} f(x_A, \mathbf{0})    (25)
|\mathcal{S}_{f_B}| = \sum_{x_B} f(\mathbf{0}, x_B)    (26)

since

f(x_A, \mathbf{0}) = \begin{cases} 1, & \text{if } f_A(x_A) > 0 \\ 0, & \text{if } f_A(x_A) = 0 \end{cases}    (27)

and likewise for f(0, x_B). It follows that |S_{f_A}| and |S_{f_B}| are easy to compute by sum-product message passing in the cycle-free factor graphs of f(x_A, 0) and f(0, x_B), respectively.

[Fig. 3. Estimated noiseless capacity (in bits/symbol) vs. the number of samples K for a 10 × 10 grid with a (1, ∞) constraint. The plot shows 10 different sample paths, each with two estimates, one from Γ̂_{f_A} and another from Γ̂_{f_B}. The horizontal dotted line shows the infinite-size capacity for this constraint.]

[Fig. 4. Estimated noiseless capacity (in bits/symbol) vs. the number of samples K for a 60 × 60 grid with a (1, ∞) constraint. The plot shows 10 different sample paths, each with two estimates, one from Γ̂_{f_A} and another from Γ̂_{f_B}. The horizontal dotted line shows the infinite-size capacity for this constraint.]

Some experimental results are shown in Figs. 3 through 6. All figures refer to f as in (4) and (5) and show the estimates of the finite-size capacity C_M (6) obtained from (22) and (24) vs. K for several different experiments. In Fig. 3, we have N = 10 × 10 and we obtain C_10 ≈ 0.6082 bits/symbol. In Fig. 4, we have N = 60 × 60, and there are issues with slow convergence.

In order to speed up the mixing and thus improve the convergence, we now partition the factor graph into vertical strips each containing M × 2 or M × 3 variables (rather than M × 1 variables as in Fig. 2). Exact sum-product message passing is still possible on such strips, e.g., by converting the strip into an equivalent cycle-free factor graph. The computation time is exponential in the width of the strip, but the faster mixing results in a substantial reduction of total computation time for strips of moderate width. Simulation results for strips of width 2 and 3 are shown in Figs. 5 and 6, respectively, both for a grid of size N = 60 × 60.

[Fig. 5. Same conditions as in Fig. 4, but with strips of width two.]

[Fig. 6. Same conditions as in Fig. 4, but with strips of width three.]
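For small grids, the quantity Z_f = |S_f| of this example can also be computed exactly by a standard row-by-row transfer-matrix recursion (a classical technique, not the paper's Monte Carlo method), which provides a ground truth against which estimators like (22) and (24) can be checked on small M:

```python
import math
from itertools import product

def hard_square_count(M):
    """Exact Z_f = |S_f| for the M x M grid with the (1, inf) constraint:
    sweep row by row, keeping a count per valid row pattern (bitmask)."""
    rows = [r for r in range(1 << M) if r & (r >> 1) == 0]  # no horizontal 11
    counts = {r: 1 for r in rows}
    for _ in range(M - 1):
        counts = {r: sum(c for s, c in counts.items() if r & s == 0)
                  for r in rows}                            # no vertical 11
    return sum(counts.values())

def brute_force_count(M):
    """Direct enumeration over all 2^(M*M) configurations (tiny M only)."""
    ok = lambda g: all(not (g[i][j] and ((j + 1 < M and g[i][j + 1]) or
                                         (i + 1 < M and g[i + 1][j])))
                       for i in range(M) for j in range(M))
    return sum(ok(g) for g in product(*[list(product((0, 1), repeat=M))] * M))

print(hard_square_count(3), brute_force_count(3))   # both 63
print(math.log2(hard_square_count(10)) / 100)       # C_10, cf. ~0.6082 (Fig. 3)
```

The recursion runs in time roughly O(M · F_{M+2}^2), where F denotes Fibonacci numbers (the number of valid rows), so it is exact but infeasible for the 60 × 60 grids treated by the Monte Carlo estimators above.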
Note that the convergence is much better than in Fig. 4, and we obtain C_60 ≈ 0.5914 bits/symbol. Also shown in these figures, as a horizontal dotted line, is the infinite-size capacity C_∞ ≈ 0.5879 bits/symbol from [32], which (in this example) is a lower bound on the finite-size capacity.

VI. ESTIMATING THE INFORMATION RATE OF NOISY 2-D SOURCE/CHANNEL MODELS

A. The Problem

We now consider the problem of computing the information rate of noisy 2-D source/channel models. Although the focus of this paper is on constrained channels, the proposed approach can also be applied to other 2-D source/channel models such as 2-D channels with intersymbol interference.

[Fig. 7. Extension of Fig. 1 to a factor graph of p(x) p(y|x) with p(y|x) as in (29).]

For a 2-D grid of size N = M × M (as before), let X = {X_1, X_2, ..., X_N} be the source process (i.e., the input process of the channel) and let Y = {Y_1, Y_2, ..., Y_N} be the output process of the channel; X takes values in X (as defined in Section II) and Y takes values in R^N. We wish to compute the mutual information rate

\frac{1}{N} I(X; Y) = \frac{1}{N} \big( H(Y) - H(Y \mid X) \big)    (28)

for cases where p(x, y) has a suitable factor graph. In particular, we will focus on the case where the channel is memoryless, i.e.,

p(y \mid x) = \prod_{n=1}^{N} p(y_n \mid x_n)    (29)

and where the channel input distribution p(x) has a factorization with a factor graph as in Fig. 1. It then follows that p(x, y) has a factor graph as in Fig. 7. In many cases of practical interest, H(Y|X) is analytically available, see our numerical experiments in Section VI-C.
In such cases, the problem of computing the mutual information rate (28) reduces to computing

H(Y) = \mathrm{E}\big[ -\log_2 p(Y) \big]    (30)

If H(Y|X) is not analytically available, it can be computed by the same method as H(Y), see [2, Section III].

B. The Method

As in [2], we approximate the expectation in (30) by the empirical average

H(Y) \approx -\frac{1}{L} \sum_{\ell=1}^{L} \log_2 p(y^{(\ell)})    (31)

where samples y^{(1)}, y^{(2)}, ..., y^{(L)} are drawn according to p(y). The problem of estimating the mutual information rate is thus reduced to
• creating samples y^{(ℓ)} and
• computing p(y^{(ℓ)}) for each sample.

[Fig. 8. Uniform-input information rate (in bits/symbol) vs. SNR for a 24 × 24 channel with a (1, ∞) constraint and additive white Gaussian noise. The horizontal dotted line shows the noiseless capacity of this channel.]

If p(x, y) has a cycle-free factor graph (and if |X_1|, |X_2|, ..., |X_N| are not too large), then both tasks can be carried out in a single-loop algorithm as in [2]. In this paper, however, we assume that no such factor graph exists and we propose a double-loop algorithm (with an outer loop and an inner loop) to carry out these tasks. In the outer loop, we create samples y^{(1)}, ..., y^{(L)} as follows.
1) Draw samples x^{(1)}, ..., x^{(L)} according to p(x). In simple cases (such as unconstrained channels with i.i.d. input), this may be trivial; in general, however, we do this by tree-based Gibbs sampling (as in Section IV-A) using the factor graph of p(x).
2) For ℓ = 1, ..., L, draw y^{(ℓ)} from p_{Y|X}(y | x^{(ℓ)}), i.e., by simulating the channel with input x^{(ℓ)}.

In the inner loop, we compute an estimate of

p(y^{(\ell)}) = \sum_{x \in \mathcal{X}} p(x)\, p_{Y|X}(y^{(\ell)} \mid x)    (32)

as follows.
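The empirical average (31) can be illustrated in one dimension. Assuming (for this toy only, not from the paper) a single uniform binary input symbol and the AWGN factor (34) of Section VI-C, p(y) is an explicit two-component Gaussian mixture, so the inner loop is exact and only the outer loop is stochastic:

```python
import math
import random

def gauss(y, mean, var):
    """Scalar Gaussian density, matching the factor (34)."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def estimate_h_Y(sigma2, L, rng=random.Random(4)):
    """Estimate h(Y) = E[-log2 p(Y)] as in (30)-(31) for one symbol:
    X uniform on {0, 1}, Y = (-1)^X + AWGN, p(y) an exact mixture."""
    p_y = lambda y: 0.5 * gauss(y, 1.0, sigma2) + 0.5 * gauss(y, -1.0, sigma2)
    total = 0.0
    for _ in range(L):
        x = rng.randrange(2)                  # outer loop, step 1: x ~ p(x)
        y = (-1) ** x + rng.gauss(0.0, math.sqrt(sigma2))  # step 2: channel
        total += -math.log2(p_y(y))           # inner loop (exact in this toy)
    return total / L

sigma2 = 1.0                                  # SNR = 1/sigma^2, i.e., 0 dB
h_Y = estimate_h_Y(sigma2, L=200000)
h_Y_given_X = 0.5 * math.log2(2 * math.pi * math.e * sigma2)  # eq. (35), N = 1
print(h_Y - h_Y_given_X)                      # I(X;Y) estimate in bits
```

In the paper's actual 2-D setting, the inner evaluation of p(y^(ℓ)) is exactly what requires the partition-function machinery, since the mixture then has |S_f| components.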
Note that, for fixed ℓ, the right-hand side of (32) is the partition function Z_{f_ℓ} of

f_\ell(x) \triangleq p(x)\, p_{Y|X}(y^{(\ell)} \mid x)    (33)

which has a suitable factor graph (as, e.g., in Fig. 7). In principle, we can thus estimate (32) by any of the methods of Section III. It turns out, however, that only multilayer importance sampling is able to handle the more demanding cases (as will be explained in our numerical experiments in Section VI-C), while the other methods of Section III suffer from slow and erratic convergence.

C. Numerical Experiments

In our numerical experiments, we consider a noisy version of the example in Section II, i.e., a noisy version of the 2-D (1, ∞) runlength-limited constrained channel.

[Fig. 9. Estimated information rate (in bits/symbol) vs. the number of samples L for a noisy 24 × 24 (1, ∞) constraint at 0 dB. The plot shows 12 different sample paths.]

[Fig. 10. Estimated information rate (in bits/symbol) vs. the number of samples L for a noisy 24 × 24 (1, ∞) constraint at 6 dB. The plot shows 12 different sample paths.]

We assume that the channel input distribution p(x) is uniform over the allowed configurations, i.e., p(x) = p_f(x) with f as in (4), and we assume that the noise is additive white Gaussian (and independent of X), i.e., p(y|x) is a product as in (29) with factors

p(y_n \mid x_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} \big( y_n - (-1)^{x_n} \big)^2 \right)    (34)

and thus

H(Y \mid X) = \frac{N}{2} \log_2 (2\pi e \sigma^2)    (35)

We will use the signal-to-noise ratio (SNR) defined as

\mathrm{SNR} \triangleq \frac{1}{\sigma^2}    (36)

which we will specify in dB, i.e., 10 log_{10}(SNR). The grid size in all the plots is N = 24 × 24 and the parameters α_j in (11) are set to α_j = 2^{-j}, for j = 1, 2, ..., J. Fig. 8 shows the computed information rate vs. the SNR over the interval [−10, 8] dB.
The horizontal dotted line in Fig. 8 shows the capacity of the noiseless version of this channel, which is about 0.596 bits per symbol. Figs. 9 and 10 illustrate the convergence of the outer loop of the proposed double-loop algorithm at 0 dB and at 6 dB, respectively. Both figures show the estimated information rate vs. the number of samples L in (31) for 12 different sample paths.

As for the inner loop, the required number of layers J in (12) depends on the SNR. As the SNR increases (or equivalently, as σ² decreases), the function f_ℓ(x) in (33) tends to have more isolated modes. Therefore, in order to obtain a good approximation at each level of multilayer importance sampling, larger values of J are required for higher SNR. In our numerical experiments at 0 dB and 6 dB, J is set to 3 and 6, respectively. Fig. 11 shows the convergence of log_2 R̂_j as in (14) for j = 1, 2, ..., 6, for a fixed output sample at 6 dB. The value of Z_{g_J} is estimated by the tree-based Ogata-Tanemura method of Section IV-B. Fig. 12 shows the convergence of the estimate of log_2(Z_{g_6})/N according to (22) for four different sample paths.

D. Remarks

In statistical physics, the partition function typically has the form

Z = \sum_{x \in \mathcal{X}} e^{-E(x)/T}    (37)

where T is the temperature and E(x) is the energy of the configuration x. We therefore point out that the noise variance σ² in (34) may be viewed as the temperature parameter of the partition function Z_{f_ℓ} of (33). It is well known in statistical physics that computing the partition function is harder at low temperatures than at high temperatures. Similarly, we observe that computing the partition function Z_{f_ℓ} of (33) is harder at high SNR than at low SNR; in particular, at high SNR, more layers (higher values of J) are required for multilayer (multi-temperature) importance sampling.
We also note that, in the examples of Section VI-C, the choice of the parameters α_j = 2^{-j} in (11) is somewhat arbitrary. It is possible that other choices of these parameters result in faster convergence.

VII. CONCLUSION

Monte Carlo methods have been highly successful in computing the information rate of source/channel models with 1-D memory. The extension of such methods to source/channel models with 2-D memory has been an open research problem. In this paper, we develop such methods with a focus on the (difficult) case of channels with input constraints, with or without noise. In contrast to previous techniques, which either use generalized belief propagation or compute only bounds on the information rate, the Monte Carlo algorithms of this paper are guaranteed to converge (asymptotically) to the desired information rate. A key role in the proposed algorithms is played by tree-based Gibbs sampling by Hamze and de Freitas, which we have shown to yield an estimate of the partition function as a by-product.

[Fig. 11. Computed log_2 R̂_j as in (14), for j = 1, 2, ..., 6, vs. the number of samples K for a noisy 24 × 24 (1, ∞) runlength-limited constraint at 6 dB. The plot shows log_2 R̂_6, log_2 R̂_5, ..., log_2 R̂_1 from top to bottom.]

[Fig. 12. Estimated log_2(Z_{g_6})/N vs. the number of samples K for a noisy 24 × 24 (1, ∞) runlength-limited constraint at 6 dB.]

The success of the proposed methods is exemplified by Fig. 8, which (to the best of our knowledge) is the first such plot for a noisy 2-D channel. We also note that the extension of the proposed methods to computing upper and lower bounds on the information rate as in [2, Section VI] is straightforward.
ACKNOWLEDGEMENT

The authors wish to thank Radford Neal for helpful discussions and advice on annealed importance sampling. The authors also wish to thank David MacKay and Iain Murray for pointing out [24] to us, and the reviewers and the Associate Editor for helpful comments.

APPENDIX
SAMPLING FROM MARKOV CHAINS

We recall some pertinent facts about the simulation of Markov chains and cycle-free factor graphs. Let $p(x) = p(x_1, \ldots, x_n)$ be the probability mass function of a Markov chain. If $p(x)$ is given in the form

$$p(x) = p(x_1) \prod_{k=2}^{n} p(x_k \mid x_{k-1}), \qquad (38)$$

then it is obvious how to draw i.i.d. samples according to $p(x)$. Now consider the case where $p(x)$ is not given in the form (38), but in the more general form

$$p(x) \propto \prod_{k=2}^{n} g_k(x_{k-1}, x_k) \qquad (39)$$

with general factors $g_k$. It is then still easy to draw i.i.d. samples according to $p(x)$, which may be seen as follows. First, a probability mass function of the form (39) can be rewritten in the form (38) (which allows efficient simulation). Second, this reparameterization of $p(x)$ may be efficiently carried out by backward sum-product message passing, as will be detailed below. The resulting algorithm is known as "backward-filtering forward-sampling" (or, in a time-reversed version, as "forward-filtering backward-sampling") [39].

Specifically, let $\overleftarrow{\mu}_{X_k}$ be the backward sum-product message along the edge $X_k$ in the factor graph of (39), as is illustrated in Fig. 13 (cf. [31]). We then have $\overleftarrow{\mu}_{X_n}(x_n) = 1$ and

$$\overleftarrow{\mu}_{X_k}(x_k) \triangleq \sum_{x_{k+1}} g_{k+1}(x_k, x_{k+1}) \, \overleftarrow{\mu}_{X_{k+1}}(x_{k+1}) \qquad (40)$$
$$= \sum_{x_{k+1},\ldots,x_n} \prod_{m=k+1}^{n} g_m(x_{m-1}, x_m) \qquad (41)$$

for $k = n-1, n-2, \ldots, 1$.

[Fig. 13: Forney factor graph of (39) with messages $\overleftarrow{\mu}_{X_k}$ (40).]

Then

$$p(x_1) = \sum_{x_2,\ldots,x_n} p(x_1, \ldots, x_n) \qquad (42)$$
$$\propto \overleftarrow{\mu}_{X_1}(x_1) \qquad (43)$$

and

$$p(x_k \mid x_{k-1}) = \frac{g_k(x_{k-1}, x_k) \, \overleftarrow{\mu}_{X_k}(x_k)}{\overleftarrow{\mu}_{X_{k-1}}(x_{k-1})} \qquad (44)$$

for $k = 2, \ldots, n$. The proof of (44) follows from noting that

$$p(x_{k-1}) = \gamma \, \overrightarrow{\mu}_{X_{k-1}}(x_{k-1}) \, \overleftarrow{\mu}_{X_{k-1}}(x_{k-1}) \qquad (45)$$

and

$$p(x_{k-1}, x_k) = \gamma \, \overrightarrow{\mu}_{X_{k-1}}(x_{k-1}) \, g_k(x_{k-1}, x_k) \, \overleftarrow{\mu}_{X_k}(x_k), \qquad (46)$$

where $\overrightarrow{\mu}_{X_{k-1}}$ is the forward sum-product message along the edge $X_{k-1}$ and where $\gamma$ is the missing scale factor in (39). We also note that

$$\sum_{x_1} \overleftarrow{\mu}_{X_1}(x_1) = \sum_{x} g(x), \qquad (47)$$

where $g(x)$ is defined as the right-hand side of (39). In this paper, this fact is used to compute the marginals (23) as a by-product of the sampling.

The generalization of all this to arbitrary factor graphs without cycles is straightforward.

REFERENCES

[1] D. Arnold and H.-A. Loeliger, "On the information rate of binary-input channels with memory," Proc. 2001 IEEE Int. Conf. on Communications, Helsinki, Finland, June 11–14, 2001, pp. 2692–2695.
[2] D. Arnold, H.-A. Loeliger, P. O. Vontobel, A. Kavčić, and W. Zeng, "Simulation-based computation of information rates for channels with memory," IEEE Trans. Inf. Theory, vol. 52, no. 8, Aug. 2006, pp. 3498–3508.
[3] H. D. Pfister, J.-B. Soriaga, and P. H. Siegel, "On the achievable information rates of finite-state ISI channels," Proc. 2001 IEEE Globecom, San Antonio, USA, Nov. 2001, pp. 2992–2996.
[4] K. A. S. Immink, Codes for Mass Data Storage Systems. Eindhoven: Shannon Foundation Publishers, 2004.
[5] R. Roth, Introduction to Coding Theory. Cambridge University Press, 2006.
[6] B. Vasic and E. M. Kurtas, Coding and Signal Processing for Magnetic Recording Systems. CRC Press, 2005.
[7] K. A. S. Immink, P. H. Siegel, and J. K. Wolf, "Codes for digital recorders," IEEE Trans. Inf. Theory, vol. 44, Oct. 1998, pp. 2260–2299.
[8] R. Wood, M. Williams, A. Kavčić, and J. Miles, "The feasibility of magnetic recording at 10 terabits per square inch on conventional media," IEEE Trans. Magnetics, vol. 45, Feb. 2009, pp. 917–923.
[9] C. E. Shannon, "A mathematical theory of communication," Bell Sys. Tech. Journal, vol. 27, July 1948, pp. 379–423.
[10] N. J. Calkin and H. S. Wilf, "The number of independent sets in a grid graph," SIAM J. Discr. Math., vol. 11, Feb. 1998, pp. 54–60.
[11] W. Weeks IV and R. E. Blahut, "The capacity and coding gain of certain checkerboard codes," IEEE Trans. Inf. Theory, vol. 44, May 1998, pp. 1193–1203.
[12] A. Kato and K. Zeger, "On the capacity of two-dimensional run-length constrained channels," IEEE Trans. Inf. Theory, vol. 45, July 1999, pp. 1527–1540.
[13] H. Ito, A. Kato, Z. Nagy, and K. Zeger, "Zero capacity region of multidimensional run length constraints," The Electronic Journal of Combinatorics, vol. 6(1), 1999.
[14] K. Censor and T. Etzion, "The positive capacity region of two-dimensional run-length-constrained channels," IEEE Trans. Inf. Theory, vol. 52, Nov. 2006, pp. 5128–5140.
[15] I. Tal and R. M. Roth, "Concave programming upper bounds on the capacity of 2-D constraints," IEEE Trans. Inf. Theory, vol. 57, Jan. 2011, pp. 381–391.
[16] H.-A. Loeliger and M. Molkaraie, "Simulation-based estimation of the partition function and the information rate of two-dimensional models," Proc. 2008 IEEE Int. Symp. on Information Theory, Toronto, Canada, July 6–11, 2008, pp. 1113–1117.
[17] H.-A. Loeliger and M. Molkaraie, "Estimating the partition function of 2-D fields and the capacity of constrained noiseless 2-D channels using tree-based Gibbs sampling," Proc. 2009 IEEE Information Theory Workshop, Taormina, Italy, Oct. 11–16, 2009, pp. 228–232.
[18] M. Molkaraie and H.-A. Loeliger, "Estimating the information rate of noisy constrained 2-D channels," Proc. 2010 IEEE Int. Symp. on Information Theory, Austin, USA, June 13–18, 2010, pp. 1678–1682.
[19] Y. Ogata and M. Tanemura, "Estimation of interaction potentials of spatial point patterns through the maximum likelihood procedure," Ann. Inst. Statist. Math., vol. 22, 1981, pp. 315–338.
[20] M. Jerrum and A. Sinclair, "Polynomial-time approximation algorithms for the Ising model," SIAM J. Computing, vol. 11, Oct. 1993, pp. 1087–1116.
[21] G. Potamianos and J. Goutsias, "Stochastic approximation algorithms for partition function estimation of Gibbs random fields," IEEE Trans. Inf. Theory, vol. 43, Nov. 1997, pp. 1948–1965.
[22] D. J. C. MacKay, "Introduction to Monte Carlo methods," in Learning in Graphical Models, M. I. Jordan, ed., Kluwer Academic Press, 1998, pp. 175–204.
[23] P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 1999.
[24] F. Hamze and N. de Freitas, "From fields to trees," Proc. 2004 Conf. on Uncertainty in Artificial Intelligence, Banff, Canada, July 7–11, 2004, pp. 243–250.
[25] O. Shental, N. Shental, S. Shamai (Shitz), I. Kanter, A. J. Weiss, and Y. Weiss, "Discrete-input two-dimensional Gaussian channels with memory: estimation and information rates via graphical models and statistical mechanics," IEEE Trans. Inf. Theory, vol. 54, April 2008, pp. 1500–1513.
[26] J. Chen and P. H. Siegel, "On the symmetric information rate of two-dimensional finite-state ISI channels," IEEE Trans. Inf. Theory, vol. 52, no. 1, Jan. 2006, pp. 227–236.
[27] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Trans. Inf. Theory, vol. 51, July 2005, pp. 2282–2312.
[28] G. Sabato and M. Molkaraie, "Generalized belief propagation algorithm to estimate the capacity of multi-dimensional run-length limited constraints," Proc. 2010 IEEE Int. Symp. on Information Theory, Austin, USA, June 13–18, 2010, pp. 1213–1217.
[29] G. Sabato and M. Molkaraie, "Generalized belief propagation for the noiseless capacity and information rates of run-length limited constraints," IEEE Trans. Comm., vol. 60, March 2012, pp. 669–675.
[30] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, Feb. 2001, pp. 498–519.
[31] H.-A. Loeliger, "An introduction to factor graphs," IEEE Signal Proc. Mag., Jan. 2004, pp. 28–41.
[32] Z. Nagy and K. Zeger, "Capacity bounds for the three-dimensional (0,1) run-length limited channel," IEEE Trans. Inf. Theory, vol. 46, May 2000, pp. 1030–1033.
[33] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Analys. and Machine Intell., vol. 6, 1984, pp. 721–741.
[34] R. M. Neal, "Annealed importance sampling," Statistics and Computing, vol. 11, April 2001, pp. 125–139.
[35] C. H. Bennett, "Efficient estimation of free energy differences from Monte Carlo data," Journal of Computational Physics, vol. 22, Oct. 1976, pp. 245–268.
[36] X.-L. Meng and W. H. Wong, "Simulating ratios of normalizing constants via a simple identity: a theoretical exploration," Statistica Sinica, vol. 6, Oct. 1996, pp. 831–860.
[37] R. M. Neal, "Annealed importance sampling," Techn. Report 9805, Dept. of Statistics, University of Toronto, 1998.
[38] C. Jarzynski, "Nonequilibrium equality for free energy differences," Phys. Rev. Lett., vol. 78, 1997, pp. 2690–2693.
[39] D. Gamerman and H. F. Lopes, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. 2nd ed., CRC Press, 2006.
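To make the backward-filtering forward-sampling recursion of the Appendix concrete, the following sketch (ours, not part of the paper) implements the backward messages (40), the initial marginal (43), and the transitions (44) in Python. The chain length $n = 4$, the binary alphabet, and the choice of toy factors $g_k$ encoding the 1-D $(1,\infty)$ runlength constraint ($g_k(a,b) = 0$ if $a = b = 1$, else $1$) are our illustrative assumptions:

```python
import random

def backward_messages(g, n, alphabet):
    # Backward sum-product messages (40): mu[k][x_k], initialized with mu[n] = 1.
    mu = [None] * (n + 1)
    mu[n] = {x: 1.0 for x in alphabet}
    for k in range(n - 1, 0, -1):
        mu[k] = {
            xk: sum(g[k + 1](xk, xn) * mu[k + 1][xn] for xn in alphabet)
            for xk in alphabet
        }
    return mu

def sample(g, n, alphabet, rng=random):
    # Forward sampling: draw x_1 with p(x_1) ∝ mu[1](x_1), cf. (43),
    # then x_k with p(x_k | x_{k-1}) ∝ g_k(x_{k-1}, x_k) mu[k](x_k), cf. (44).
    mu = backward_messages(g, n, alphabet)
    x = [rng.choices(alphabet, weights=[mu[1][a] for a in alphabet])[0]]
    for k in range(2, n + 1):
        weights = [g[k](x[-1], a) * mu[k][a] for a in alphabet]
        x.append(rng.choices(alphabet, weights=weights)[0])
    return x, mu

# Toy example: n = 4 binary variables, factors forbidding adjacent ones
# (a 1-D (1, ∞) runlength constraint; our choice of example).
n, alphabet = 4, (0, 1)
g = {k: (lambda a, b: 0.0 if (a == 1 and b == 1) else 1.0)
     for k in range(2, n + 1)}

x, mu = sample(g, n, alphabet)
# By (47), sum_{x_1} mu[1](x_1) equals the partition function sum_x g(x);
# for no-adjacent-ones on 4 bits this is the Fibonacci number F_6 = 8.
Z = sum(mu[1][a] for a in alphabet)
```

Every sample drawn this way satisfies the constraint by construction, and the sum of the final backward message yields the partition function as a by-product, mirroring the use of (47) in the paper.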
