Conditional Finite Mixtures of Poisson Distributions for Context-Dependent Neural Correlations


Authors: Sacha Sokoloski and Ruben Coen-Cagli
Systems and Computational Biology, Albert Einstein College of Medicine, The Bronx, NY 10461
sacha.sokoloski@einstein.yu.edu, ruben.coen-cagli@einstein.yu.edu

Abstract

Parallel recordings of neural spike counts have revealed the existence of context-dependent noise correlations in neural populations. Theories of population coding have also shown that such correlations can impact the information encoded by neural populations about external stimuli. Although studies have shown that these correlations often have a low-dimensional structure, it has proven difficult to capture this structure in a model that is compatible with theories of rate coding in correlated populations. To address this difficulty we develop a novel model based on conditional finite mixtures of independent Poisson distributions. The model can be conditioned on context variables (e.g. stimuli or task variables), and the number of mixture components in the model can be cross-validated to estimate the dimensionality of the target correlations. We derive an expectation-maximization algorithm to efficiently fit the model to realistic amounts of data from large neural populations. We then demonstrate that the model successfully captures stimulus-dependent correlations in the responses of macaque V1 neurons to oriented gratings. Our model incorporates arbitrary nonlinear context-dependence, and can thus be applied to improve predictions of neural activity based on deep neural networks.

1 Introduction

Computational neuroscientists have made significant progress in understanding how correlations amongst neural spike counts influence information processing [1, 2]. In particular, correlated fluctuations in the responses to a fixed stimulus (termed noise correlations) depend on stimulus identity, and can both hinder and facilitate information processing in model neural circuits [3-7]. However, due to the challenges in assessing the strength and order of noise correlations empirically, it remains unclear how noise correlations affect computation in biological neural circuits.

Measuring correlations of every order in large neural populations is infeasible, and modelling neural correlations requires knowledge of the order and dimensionality of significant correlations. Most models measure exclusively pair-wise correlations [8, 9], which can often explain most of the variability in spike-count data [10, 11]. Moreover, dimensionality-reduction methods have shown that the complete set of pair-wise neural correlations often has a low-dimensional structure for both total [12] and noise [13-17] correlations. Nevertheless, higher-order correlations become more significant as both stimulus complexity and neural population size increase [8], motivating the use of higher-order models when modelling noise correlations in large populations of neurons.

Dimensionality-reduction methods have provided many insights into neural correlations, but correlations must be studied through models of population codes in order to understand their effect on neural coding. Maximum entropy models of binary neural spiking drive much of the contemporary research on neural correlations in model population codes [8]. The flexibility of the maximum entropy framework allows the techniques for analyzing and fitting pair-wise models to be generalized to both sparse, higher-order models [18] and stimulus-dependent models [11]. Nevertheless, there is still a need for a rate-based model of population coding that can reliably capture the complex yet low-dimensional noise correlations found in neural data.

To address this we propose a conditional maximum entropy model, specifically a conditional finite mixture of independent Poisson distributions (CMP). When conditioned on a context variable (e.g. a stimulus), a CMP is a mixture of component collections of independent Poisson distributions [19]. A CMP with only one component reduces to a standard rate-based population code with Poisson neurons that are conditionally independent given the stimulus [20, 21], and adding components to the CMP introduces noise correlations with dimensionality controlled by the number of components. By combining the properties of maximum entropy models with additional features of Poisson neurons we derive a hybrid expectation-maximization algorithm that can efficiently and reliably fit the CMP to data.

We apply the CMP to neural population data recorded in macaque primary visual cortex (V1). We find that noise correlation structure is best captured with 3-5 components across several datasets, and that while the overall dimensionality of correlations is largely stimulus-independent, the structure of noise correlations (i.e. the relative weights of the mixture components) depends on the stimulus.

2 Conditional Finite Mixtures of Independent Poisson Distributions

Many of the equations that describe how to analyze and train the CMP model can be solved by expressing the parameters of the model in an appropriate coordinate system. Two of the coordinate systems we consider are the mean and natural parameters that arise in the context of exponential families.¹ We thus begin this section by reviewing exponential families, primarily for the purpose of developing notation (for a thorough development see [22, 23]). We continue by showing that the standard weighted-sum form of a finite mixture model is a third parameterization of a kind of exponential family known as an exponential family harmonium [24]. Finally, we introduce and develop conditional finite mixture models and CMPs, and we derive training algorithms for such models, including the hybrid expectation-maximization algorithm for training CMPs.

¹Maximum entropy models are also known as exponential families, and we prefer that name in this paper.

2.1 Exponential Families

Consider a random variable $X \in \mathcal{X}$ with an unknown distribution $P_X$, and suppose we are given an independent and identically distributed sample $\{X_i\}_{i=1}^{m_S}$ from $P_X$, such that every $X_i \sim P_X$. We may model $P_X$ based on the sample $\{X_i\}_{i=1}^{m_S}$ by first defining a statistic $s_X : \mathcal{X} \to \mathcal{H}_X$, where $\mathcal{H}_X$ is the codomain of $s_X$ with dimension $m_X$. We then look for a probability distribution $Q_X$ that satisfies $\mathbb{E}_Q[s_X(X)] = \frac{1}{m_S}\sum_{i=1}^{m_S} s_X(X_i)$, where $\mathbb{E}_Q[f(X)] = \int_{\mathcal{X}} f \, dQ_X$ is the expected value of $f(X)$ under $Q_X$. This is of course an under-constrained problem, but if we also require that $Q_X$ have maximum entropy, then we arrive at a family of distributions which uniquely satisfy the constraints for every value of $\frac{1}{m_S}\sum_{i=1}^{m_S} s_X(X_i)$. An $m_X$-dimensional exponential family $\mathcal{M}_X$ is defined by the sufficient statistic $s_X$, as well as a base measure $\mu_X$ which helps define integrals and expectations within the family.
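To make the moment-matching construction concrete, consider the one-dimensional case in which $s_X$ is the identity and the base measure is $1/n!$; the maximum entropy distribution matching the empirical mean is then Poisson. The following sketch is our own illustration (in Python with NumPy; the paper's implementation is in Haskell), not part of the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.poisson(lam=4.2, size=1000)  # stand-in i.i.d. sample from an unknown P_X

# With sufficient statistic s_X(x) = x, the moment constraint E_Q[s_X(X)] must
# match the empirical mean of the sufficient statistic over the sample.
eta = sample.mean()

# With base measure mu({n}) = 1/n!, the maximum entropy distribution under this
# constraint is Poisson with rate eta; its natural parameter is theta = log(eta),
# and its log-partition function is psi(theta) = exp(theta).
theta = np.log(eta)
print(f"mean parameter eta = {eta:.3f}, natural parameter theta = {theta:.3f}")
```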
An exponential family is parameterized by a set of natural parameters $\Theta_X$, such that each element of the family $Q_X \in \mathcal{M}_X$ may be identified with some parameters $\theta_X \in \Theta_X$. The density $q_X$ of the distribution $Q_X$ is given by $\log q_X(x) = s_X(x) \cdot \theta_X - \psi_X(\theta_X)$, where $\psi_X(\theta_X) = \log \int_{\mathcal{X}} e^{s_X(x) \cdot \theta_X} \mu_X(dx)$ is the log-partition function. Expectations of any $Q_X \in \mathcal{M}_X$ are then given by $\mathbb{E}_Q[f(X)] = \int_{\mathcal{X}} f \, dQ_X = \int_{\mathcal{X}} f \cdot q_X \, d\mu_X$.

Because each $Q_X \in \mathcal{M}_X$ is uniquely defined by $\mathbb{E}_Q[s_X(X)]$, the means of the sufficient statistic also parameterize $\mathcal{M}_X$. The space of all mean parameters is $\mathcal{H}_X$, and we denote them by $\eta_X = \mathbb{E}_Q[s_X(X)]$. Finally, a sufficient statistic is minimal when its component functions are non-constant and linearly independent. If the sufficient statistic $s_X$ of a given family $\mathcal{M}_X$ is minimal, then $\Theta_X$ and $\mathcal{H}_X$ are isomorphic. Moreover, the transition functions between them, $\tau_X : \Theta_X \to \mathcal{H}_X$ and $\tau_X^{-1} : \mathcal{H}_X \to \Theta_X$, are given by $\tau_X(\theta_X) = \partial_{\theta_X} \psi_X(\theta_X)$ and $\tau_X^{-1}(\eta_X) = \partial_{\eta_X} \phi_X(\eta_X)$, where $\phi_X(\eta_X) = \mathbb{E}_Q[\log q_X(X)]$ is the negative entropy of $Q_X$.

Before we conclude this subsection, we highlight the exponential families of categorical distributions $\mathcal{M}_C$ and of independent Poisson distributions $\mathcal{M}_N$. The $m_C$-dimensional categorical exponential family $\mathcal{M}_C$ contains all the distributions over integer values between 0 and $m_C$. The base measure of $\mathcal{M}_C$ is the counting measure, and the $j$th element of its sufficient statistic $s_C(j)$ at $j$ is 1, and 0 for any other elements. The sufficient statistic $s_C$ is thus a vector of all zeroes when $j = 0$, and all zeroes except for the $j$th element when $j > 0$. The $m_N$-dimensional family of independent Poisson distributions $\mathcal{M}_N$ contains distributions over $m_N$-dimensional vectors of natural numbers, where each element of the vector is independently Poisson distributed. The sufficient statistic of $\mathcal{M}_N$ is the identity function, and the base measure is given by $\mu_N(\{n\}) = \prod_{j=1}^{m_N} \frac{1}{n_j!}$.

2.2 Finite Mixture Models

In this subsection we show how finite mixture models can be expressed as latent-variable models such that the complete joint model is a particular kind of exponential family. To begin, let us consider an exponential family $\mathcal{M}_X$. The density of a finite mixture distribution $Q^*_X$ of $m_C + 1$ distributions in $\mathcal{M}_X$ has the form $q^*_X(x) = \sum_{j=0}^{m_C} \eta^*_{C,j} q^*_{X,j}(x)$, where each distribution $Q^*_{X,j} \in \mathcal{M}_X$ is a mixture component with natural parameters $\theta^*_{X,j}$, each $\eta^*_{C,j}$ is a mixture weight, and the weights satisfy the constraints $0 < \eta^*_{C,j} < 1$ and $\eta^*_{C,0} = 1 - \sum_{j=1}^{m_C} \eta^*_{C,j}$.

Building a model out of mixture distributions is theoretically problematic because swapping component indices has no effect on a mixture distribution, and their parameters are therefore not identifiable. Nevertheless, we may express mixture distributions as latent-variable distributions with the identifiable form $\sum_{j=0}^{m_C} \eta^*_{C,j} q^*_{X,j}(x) = \sum_{j=0}^{m_C} q^*_C(j) q^*_{X|C}(x \mid j) = \sum_{j=0}^{m_C} q^*_{XC}(x, j)$, where $\eta^*_{C,j} = q^*_C(j)$ and $q^*_{X,j}(x) = q^*_{X|C}(x \mid j)$. Moreover, $Q^*_C$ is a categorical distribution with mean parameters $\eta^*_C = (\eta^*_{C,j})_{j=1}^{m_C}$, and $Q^*_C$ may thus be described by the natural parameters $\theta^*_C = \tau_C^{-1}(\eta^*_C)$.
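For the two families just highlighted, the transition functions between natural and mean coordinates have closed forms that we use repeatedly below. The following is a minimal sketch of $\tau$ and $\tau^{-1}$ for the independent-Poisson and categorical families; it is our own illustration, and the function names are ours.

```python
import numpy as np

def tau_poisson(theta):
    """tau_N: natural -> mean for independent Poissons (eta = grad psi = exp(theta))."""
    return np.exp(theta)

def tau_poisson_inv(eta):
    """Inverse transition: theta = log(eta)."""
    return np.log(eta)

def tau_categorical(theta_c):
    """tau_C: natural -> mean. theta_c holds the logits of categories 1..m_C;
    category 0 has logit 0 because its sufficient statistic is the zero vector."""
    w = np.exp(np.concatenate(([0.0], theta_c)))
    p = w / w.sum()
    return p[1:]  # mean parameters eta_C = (Q(C = j))_{j=1..m_C}

def tau_categorical_inv(eta_c):
    """Inverse transition: theta_{C,j} = log(eta_{C,j} / eta_{C,0})."""
    p0 = 1.0 - eta_c.sum()
    return np.log(eta_c / p0)
```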
Given $\mathcal{M}_X$ and $m_C$, we therefore define a finite mixture model $\mathcal{M}^*_{XC}$ as the set of all distributions $Q^*_{XC}$, and parameterize it by the mixture weights $\theta^*_C$ and the mixture component parameters $\{\theta^*_{X,j}\}_{j=0}^{m_C}$.

Now, an exponential family harmonium is a kind of product exponential family which includes restricted Boltzmann machines as a special case [24]. We may construct an exponential family harmonium $\mathcal{M}_{XY}$ out of $\mathcal{M}_X$ and $\mathcal{M}_Y$ by defining the base measure of $\mathcal{M}_{XY}$ as the product measure $\mu_X \times \mu_Y$, and by defining the sufficient statistic of $\mathcal{M}_{XY}$ as the vector which contains the concatenation of all the elements in $s_X$, $s_Y$, and the outer product $s_X \otimes s_Y$. Intuitively, $\mathcal{M}_{XY}$ contains all the distributions with densities of the form $q_{XY}(x, y) \propto e^{s_X(x) \cdot \theta_X + s_Y(y) \cdot \theta_Y + s_X(x) \cdot \Theta_{XY} \cdot s_Y(y)}$, where $\theta_X$, $\theta_Y$, and $\Theta_{XY}$ are the natural parameters of $Q_{XY} \in \mathcal{M}_{XY}$. As we now show, a harmonium defined partially by the categorical exponential family is equivalent to a mixture model.

Theorem 1. Let $\mathcal{M}_X$ be an exponential family, let $\mathcal{M}_C$ be the categorical exponential family of dimension $m_C$, let $\mathcal{M}^*_{XC}$ be the mixture model defined by $\mathcal{M}_X$ and $m_C$, and let $\mathcal{M}_{XC}$ be the exponential family harmonium defined by $\mathcal{M}_X$ and $\mathcal{M}_C$. Then $Q_{XC} \in \mathcal{M}_{XC}$ with natural parameters $\theta_X$, $\theta_C$, and $\Theta_{XC}$ is equal to $Q^*_{XC} \in \mathcal{M}^*_{XC}$ with mixture parameters $\theta^*_C$ and $\{\theta^*_{X,j}\}_{j=0}^{m_C}$ if and only if

$$\theta^*_{C,j} = \theta_{C,j} + \psi_X(\theta_{X,j} + \theta_X) - \psi_X(\theta_X), \qquad (1)$$
$$\theta^*_{X,j} = \theta_{X,j} + \theta_X, \qquad (2)$$

where $\theta_{C,j}$ and $\theta^*_{C,j}$ are the $j$th elements of $\theta_C$ and $\theta^*_C$, $\theta_{X,j}$ is the $j$th row of $\Theta_{XC}$, and $\theta_{X,0} = \mathbf{0}$.

Proof. We identify $Q_{XC}$ and $Q^*_{XC}$ by identifying the parameters of their conditionals and marginals. First note that the conditional distribution $Q_{X|C=j}$ at $j$ is an element of $\mathcal{M}_X$ with parameters $\theta_X + \Theta_{XC} \cdot s_C(j)$ [24], and that the sufficient statistic $s_C$ in this expression essentially selects a row from $\Theta_{XC}$, or returns $\mathbf{0}$. Thus, where $\theta_{X,j}$ is the $j$th row of $\Theta_{XC}$, $Q_{X|C=j} = Q^*_{X|C=j}$ if and only if $\theta^*_{X,0} = \theta_X$ and $\theta^*_{X,j} = \theta_{X,j} + \theta_X$ for $j > 0$.

To equate $Q^*_C$ and $Q_C$, first note that the marginal density of a harmonium distribution satisfies $\log q_C(j) = \theta_C \cdot s_C(j) + \psi_X(\theta_X + \Theta_{XC} \cdot s_C(j)) - \psi_{XC}(\theta_X, \theta_C, \Theta_{XC})$ [24]. Because $\mathcal{M}_{XC}$ is partially defined by $\mathcal{M}_C$, we may show with a bit of algebra that $\psi_{XC}(\theta_X, \theta_C, \Theta_{XC}) = \psi_X(\theta_X) + \psi_C(\theta^*_C)$, where $\theta^*_C$ satisfies Equation 1, and by substituting this into the expression for the harmonium marginal, we may conclude that $Q_C = Q^*_C$ if and only if Equation 1 is satisfied.

Figure 1: Training a finite mixture of an independent product of 2 von Mises distributions on synthetic data with EM vs. batch gradient descent (GD). Ellipsoids indicate the precisions of the component densities of the true mixture (black), the EM-trained mixture (red), and the GD-trained mixture (dashed blue). Black circles are synthetic data. EM and GD find the same solutions starting from the same initial conditions and with similar computation time.

The advantage of formulating finite mixture models as exponential family harmoniums is that we may apply the theory of harmoniums and exponential families to analyzing and training them.
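To illustrate Theorem 1 concretely, the following sketch converts harmonium parameters into identifiable mixture parameters. It is our own illustration: it assumes Poisson components, for which $\psi_N(\theta) = \sum_k e^{\theta_k}$, and follows the row convention of the theorem, so $\Theta_{XC}$ is stored with shape $(m_C, m_X)$.

```python
import numpy as np

def psi_poisson(theta):
    """Log-partition function of the independent-Poisson family: sum_k exp(theta_k)."""
    return np.exp(theta).sum()

def harmonium_to_mixture(theta_X, theta_C, Theta_XC):
    """Equations 1 and 2: map harmonium parameters (theta_X, theta_C, Theta_XC)
    to mixture parameters (theta*_C, {theta*_{X,j}}), with theta_{X,0} = 0."""
    # theta*_{X,j} = theta_{X,j} + theta_X, including the j = 0 component.
    comps = np.vstack([theta_X, Theta_XC + theta_X])
    # theta*_{C,j} = theta_{C,j} + psi_X(theta_{X,j} + theta_X) - psi_X(theta_X)
    theta_C_star = theta_C + np.array(
        [psi_poisson(row + theta_X) for row in Theta_XC]) - psi_poisson(theta_X)
    return theta_C_star, comps

# Hypothetical toy dimensions: m_X = 3 Poisson units, m_C = 2 non-zero components.
rng = np.random.default_rng(0)
theta_C_star, comps = harmonium_to_mixture(
    rng.normal(size=3), rng.normal(size=2), rng.normal(size=(2, 3)))
```

The mixture weights themselves can then be recovered from $\theta^*_C$ via the categorical transition function $\tau_C$.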
For example, given a sample $\{X_i\}_{i=1}^{m_S}$ and a harmonium distribution $Q_{XY} \in \mathcal{M}_{XY}$ with parameters $\theta_X$, $\theta_Y$, and $\Theta_{XY}$, an iteration of the expectation-maximization (EM) algorithm may be formulated as:

Expectation step: compute the unobserved means $\eta_{Y,i} = \tau_Y(\theta_Y + s_X(X_i) \cdot \Theta_{XY})$ for every $i$.

Maximization step: evaluate $\tau_{XY}^{-1}\!\left(\frac{1}{m_S}\sum_{i=1}^{m_S} s_X(X_i),\ \frac{1}{m_S}\sum_{i=1}^{m_S} \eta_{Y,i},\ \frac{1}{m_S}\sum_{i=1}^{m_S} s_X(X_i) \otimes \eta_{Y,i}\right)$.

On the other hand, the stochastic log-likelihood gradients of the parameters of $Q_{XY}$ are

$$\partial_{\theta_X} \log q_X(X_i) = s_X(X_i) - \eta_X, \qquad (3)$$
$$\partial_{\theta_Y} \log q_X(X_i) = \tau_Y(\theta_Y + s_X(X_i) \cdot \Theta_{XY}) - \eta_Y, \qquad (4)$$
$$\partial_{\Theta_{XY}} \log q_X(X_i) = s_X(X_i) \otimes \tau_Y(\theta_Y + s_X(X_i) \cdot \Theta_{XY}) - H_{XY}, \qquad (5)$$

where $(\eta_X, \eta_Y, H_{XY}) = \tau_{XY}(\theta_X, \theta_Y, \Theta_{XY})$. Whether or not the various transition functions and their inverses can be evaluated depends on the defining manifolds $\mathcal{M}_X$ and $\mathcal{M}_Y$. When $\mathcal{M}_Y = \mathcal{M}_C$ is the categorical family, its transition function $\tau_C$ is computable, and whether $\tau_{XY}$ or $\tau_{XY}^{-1}$ is computable reduces to whether $\tau_X$ or $\tau_X^{-1}$ is computable, respectively.

EM is typically preferred when maximizing the likelihood of finite mixture model parameters, especially for finite mixtures of normal distributions. However, gradient descent can perform just as well as EM for certain mixture models, as we demonstrate in Figure 1.

2.3 Conditional Finite Mixture Models

According to Equations 1 and 2, we may induce context dependence in all the mixture parameters of a mixture distribution $Q_{XC} \in \mathcal{M}_{XC}$ by allowing $\theta_X$ to depend on additional variables. We thus define a conditional finite mixture model $\mathcal{M}_{XC|Z}$ as the set of all the conditional distributions $Q_{XC|Z}$ with densities $q_{XC|Z}(x, j \mid z) \propto e^{s_X(x) \cdot (\theta_X + \theta_{X|Z}(z)) + \theta_C \cdot s_C(j) + s_X(x) \cdot \Theta_{XC} \cdot s_C(j)}$, where $\theta_{X|Z} : \mathcal{Z} \to \Theta_X$ is defined by an additional set of parameters $\rho$.

In the conditional setting we aim to maximize the conditional log-likelihood $\sum_{i=1}^{m_S} \log q_{X|Z}(X_i \mid Z_i)$ of the parameters $\theta_X$, $\theta_C$, $\Theta_{XC}$, and $\rho$, given a sample $\{(X_i, Z_i)\}_{i=1}^{m_S}$ from a target conditional distribution $P_{X|Z}$. Stochastic gradient ascent on the conditional log-likelihood remains relatively straightforward; given a sample point $(X_i, Z_i)$, one must simply backpropagate the error gradient $s_X(X_i) - \eta_X(Z_i)$ through the function $\theta_{X|Z}$, where $(\eta_X(Z_i), \eta_C(Z_i), H_{XC}(Z_i)) = \tau_{XC}(\theta_X + \theta_{X|Z}(Z_i), \theta_C, \Theta_{XC})$.

The expectation step of the EM algorithm also remains trivial for conditional mixtures. It is easy to show that for any $Q_{XC|Z} \in \mathcal{M}_{XC|Z}$ the conditional distribution $Q_{C|X,Z} = Q_{C|X}$, and we may therefore continue to evaluate the unobserved means as $\eta_{C,i} = \tau_C(\theta_C + s_X(X_i) \cdot \Theta_{XC})$. On the other hand, the maximization step can be expressed as maximizing the sum over $i$ of the function

$$\mathcal{L}(\theta_X, \theta_C, \Theta_{XC}, \rho, \eta_{C,i}, X_i, Z_i) = s_X(X_i) \cdot (\theta_X + \theta_{X|Z}(Z_i; \rho)) + \theta_C \cdot \eta_{C,i} + s_X(X_i) \cdot \Theta_{XC} \cdot \eta_{C,i} - \psi_{XC}(\theta_X + \theta_{X|Z}(Z_i; \rho), \theta_C, \Theta_{XC}) \qquad (6)$$

with respect to the parameters $\theta_X$, $\theta_C$, $\Theta_{XC}$, and $\rho$. It is more difficult to evaluate the maximization step for conditional finite mixtures due to the dependence on $Z_i$, although we may still find maxima of the objective by gradient ascent.
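Since $Q_{C|X,Z} = Q_{C|X}$, the expectation step reduces to a softmax over component logits that does not involve the context at all. A minimal sketch under our conventions (Poisson observations, $\Theta_{XC}$ stored with one row per component), again our own illustration rather than the paper's code:

```python
import numpy as np

def estep_responsibilities(N_i, theta_C, Theta_XC):
    """Unobserved means eta_{C,i} = tau_C(theta_C + s_X(X_i) . Theta_XC).
    For a CMP, s_X is the identity, so N_i is the vector of observed spike counts.
    Returns probabilities over all m_C + 1 components; the context Z_i drops out
    because theta_{X|Z}(z) contributes equally to every component's logit."""
    logits = np.concatenate(([0.0], theta_C + Theta_XC @ N_i))  # component 0 has logit 0
    logits -= logits.max()                                      # for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```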
Nevertheless, when $\mathcal{M}_X = \mathcal{M}_N$ is the family of $m_N$ independent Poisson distributions, we can in fact evaluate the maximization step in closed form with respect to the parameters $\theta_N$ and $\Theta_{NC}$. We refer to the conditional finite mixture model $\mathcal{M}_{NC|Z}$ as the family of conditional finite mixtures of independent Poisson distributions (CMPs).

Theorem 2. Let $\mathcal{M}_{NC|Z}$ be a CMP model defined by $m_C$, $m_N$, and $\mathcal{Z}$. Then

$$\arg\max_{\theta_N} \sum_{i=1}^{m_S} \mathcal{L}(\theta_N, \theta_C, \Theta_{NC}, \rho, \eta_{C,i}, N_i, Z_i) = \theta^\dagger_{N,0}, \quad \text{and} \quad \arg\max_{\theta_{N,j}} \sum_{i=1}^{m_S} \mathcal{L}(\theta_N, \theta_C, \Theta_{NC}, \rho, \eta_{C,i}, N_i, Z_i) = \theta^\dagger_{N,j} - \theta^\dagger_{N,0},$$

where

$$\theta^\dagger_{N,j,k} = \log\left(\frac{\sum_{i=1}^{m_S} \eta_{C,i,j} N_{i,k}}{\sum_{i=1}^{m_S} \eta_{C,i,j} e^{\theta_{N|Z,k}(Z_i)}}\right).$$

Proof. Similar to Theorem 1, consider the change of variables $\theta^\dagger_{N,0} = \theta_N$, $\theta^\dagger_{N,j} = \theta_{N,j} + \theta_N$, and $\theta^\dagger_{C,j}(Z_i) = \theta_{C,j} + \psi_N(\theta_{N,j} + \theta_N + \theta_{N|Z}(Z_i)) - \psi_N(\theta_N + \theta_{N|Z}(Z_i))$. We may express our optimization in this alternative parameterization as

$$\mathcal{L}^\dagger_i(\{\theta^\dagger_{N,j}\}_{j=0}^{m_C}) = \sum_{j=0}^{m_C} \eta_{C,i,j} \left( N_i \cdot \theta^\dagger_{N,j} - \psi_N(\theta^\dagger_{N,j} + \theta_{N|Z}(Z_i)) \right) + N_i \cdot \theta_{N|Z}(Z_i) + \theta^\dagger_C(Z_i) \cdot \eta_{C,i} - \psi_C(\theta^\dagger_C(Z_i)),$$

such that $\max_{\theta_N, \Theta_{NC}} \sum_{i=1}^{m_S} \mathcal{L} = \max_{\{\theta^\dagger_{N,j}\}_{j=0}^{m_C}} \sum_{i=1}^{m_S} \mathcal{L}^\dagger_i$. Moreover, the log-partition function of $\mathcal{M}_N$ is given by $\psi_N(\theta_N) = \sum_{k=1}^{m_N} e^{\theta_{N,k}}$, and we may therefore conclude that

$$\arg\max_{\theta^\dagger_{N,j,k}} \sum_{i=1}^{m_S} \mathcal{L}^\dagger_i(\{\theta^\dagger_{N,j}\}_{j=0}^{m_C}) = \arg\max_{\theta^\dagger_{N,j,k}} \left\{ \sum_{i=1}^{m_S} \eta_{C,i,j} \left( N_{i,k}\, \theta^\dagger_{N,j,k} - e^{\theta^\dagger_{N,j,k} + \theta_{N|Z,k}(Z_i)} \right) \right\} = \arg\max_{\theta^\dagger_{N,j,k}} \left\{ \theta^\dagger_{N,j,k} \frac{\sum_{i=1}^{m_S} \eta_{C,i,j} N_{i,k}}{\sum_{i=1}^{m_S} \eta_{C,i,j} e^{\theta_{N|Z,k}(Z_i)}} - e^{\theta^\dagger_{N,j,k}} \right\}.$$

This is an argmax variant of a Legendre transform, and its solution is $\log\left(\frac{\sum_{i=1}^{m_S} \eta_{C,i,j} N_{i,k}}{\sum_{i=1}^{m_S} \eta_{C,i,j} e^{\theta_{N|Z,k}(Z_i)}}\right)$.

2.4 Training Algorithms

Based on these results, we propose three algorithms for training conditional finite mixture models, the third of which is specific to CMPs; we abbreviate them as EM, SGD, and Hybrid. The first approximates expectation-maximization (EM) by computing the expectation step in closed form, and approximating the maximization step by gradient ascent of the objective in Equation 6. The second is stochastic gradient descent (SGD) of the negative log-likelihood based on Equations 3, 4, and 5. The third is the Hybrid algorithm based on Theorem 2. The Hybrid algorithm alternates between one of two phases every training epoch: the first phase is simply SGD with respect to all the CMP parameters, and the second phase is to update the parameters $\theta_N$ and $\Theta_{NC}$ based on Theorem 2.

3 Applications to Synthetic and Real Data

In this section we model synthetic and real data with a CMP. In both cases we define the context variable as an element of the half-circle, such that $\mathcal{Z} = [0, \pi]$, which represents e.g. the orientation of a grating, as is common in V1 experiments. We define the natural parameter function by $\theta_{N|Z}(z) = \Theta_{NZ} \cdot s_Z(z)$, where $s_Z(z) = (\cos(2z), \sin(2z))$. This choice of $\theta_{N|Z}$ allows us to express the conditional firing rates of the model neurons as $\mathbb{E}_P[N_k \mid Z = z, C = j] = \gamma_{j,k} f_k(z)$, where $\gamma_{j,k}$ is the gain and $f_k(z) \propto e^{\rho_k \cos(2(z - \mu_k))}$ is a von Mises density function with preferred stimulus $\mu_k$ and precision $\rho_k$. The dependence of the model on the categorical variable is expressed solely through the gain components $\gamma_j = (\gamma_{j,k})_{k=1}^{m_N}$, and where $\mathbb{E}_P[N_k \mid Z = z]$ is the tuning curve of the $k$th neuron, this form ensures that each tuning curve retains a von Mises shape regardless of correlation structure.

All simulations were run on a Lenovo P51 laptop with an Intel Xeon processor, and individual simulations of the algorithms in Subsection 2.4 took on the order of tens of seconds to execute. All algorithms were implemented in the Haskell programming language.

Figure 2: Training a CMP model on synthetic data generated from a ground truth CMP with $m_N = 20$ neurons and $m_C = 7$ mixture components. (A) Tuning curves of the ground truth CMP. (B) Stimulus-dependent mixture weights of the ground truth CMP, with two weight curves highlighted in black. (C) Best of 10 descents of the negative log-likelihood by the EM (red), SGD (blue), and Hybrid (purple) algorithms, with lower and upper bounds given by the ground truth CMP and the 1-component model, respectively. (D) Weights of the CMP learned by the Hybrid algorithm, with two weight curves highlighted that are qualitatively similar to those in (B). (E) Stimulus-dependent, paired correlation matrices of the ground truth and Hybrid-trained models. Upper-right and lower-left triangles are correlations of the ground truth and learned CMP, respectively. Bottom, middle, and top matrices are conditioned on $z = 0\pi$, $z = 0.33\pi$, and $z = 0.67\pi$, respectively, matching axis ticks in (D).

3.1 Synthetic Data

We first consider synthetic data generated from a CMP $P_{NC|Z}$ with $m_N = 20$ neurons and $m_C = 7$ mixture components. The preferred stimuli $\mu_k$ of the tuning curves are drawn from a uniform distribution over $\mathcal{Z}$, and the precisions $\rho_k$ and gains $\gamma_{j,k}$ are log-normally distributed with means of 0.8 and 2, respectively. We plot the von Mises part $f_k$ of each CMP tuning curve in Figure 2A. In Figure 2B we plot $p_{C|Z}(j \mid z)$ for every $j$ as a function of $z$. This provides a low-dimensional picture of the stimulus-dependence of the correlations: the number of active components determines the dimensionality of the correlations, and their identity determines the structure of the correlations.

To synthesize a dataset consistent with typical recordings in visual cortex, we synthesize 62 responses to 8 stimuli tiled evenly over the half-circle, for a total of $m_S = 496$ sample points $\{(N_i, Z_i)\}_{i=1}^{m_S}$ from $P_{N|Z}$. In Figure 2C we plot the descent of the resulting negative log-likelihood of the CMP parameters for each of the three algorithms proposed in Subsection 2.4. Each epoch of each algorithm involves exactly one pass over the dataset, ensuring that their computation times are of a similar order. In each epoch the dataset is broken up into randomized minibatches of 50 samples, resulting in ten gradient steps per epoch for each algorithm, except for the closed-form phase of the Hybrid algorithm, which processes the data as a single batch. Gradient descents are implemented with the Adam algorithm [25] with standard momentum parameters and a learning rate of 0.005, and the Adam parameters are reset every epoch. To avoid local minima we run 10 descents in parallel for each algorithm and select the best one.
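To make the parameterization of $\theta_{N|Z}$ concrete, the sketch below constructs $\Theta_{NZ}$ from preferred stimuli and precisions and evaluates the von Mises firing rates of a single mixture component. It is our own illustration; the specific random choices are illustrative and not the paper's exact sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
m_N = 20                                   # number of model neurons
mu = rng.uniform(0.0, np.pi, m_N)          # preferred stimuli on the half-circle
rho = rng.lognormal(0.0, 0.5, m_N)         # precisions (illustrative scale)

def s_Z(z):
    """Context sufficient statistic s_Z(z) = (cos(2z), sin(2z))."""
    return np.array([np.cos(2 * z), np.sin(2 * z)])

# Row k of Theta_NZ is rho_k * (cos(2 mu_k), sin(2 mu_k)), so by the angle
# difference identity, (Theta_NZ @ s_Z(z))_k = rho_k * cos(2 (z - mu_k)).
Theta_NZ = np.column_stack([rho * np.cos(2 * mu), rho * np.sin(2 * mu)])

def component_rates(z, log_gamma_j):
    """Firing rates E[N_k | Z = z, C = j] proportional to gamma_{j,k} f_k(z),
    with the gains gamma_j entering through the component's baseline parameters."""
    return np.exp(log_gamma_j + Theta_NZ @ s_Z(z))

rates = component_rates(0.25 * np.pi, log_gamma_j=np.log(2.0) * np.ones(m_N))
```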
In addition, we plot the true negative log-likelihood $-\sum_{i=1}^{m_S} \log p_{N|Z}(N_i \mid Z_i)$ in Figure 2C to establish a training lower bound. Since a CMP $Q^0_{N|Z}$ with $m_C + 1 = 1$ components reduces to a set of conditionally independent Poisson neurons, and fitting such a CMP is a convex optimization problem, we also plot an upper bound in the form of $-\sum_{i=1}^{m_S} \log q^0_{N|Z}(N_i \mid Z_i)$ for the optimal $Q^0_{N|Z}$. This upper bound tells us how much the addition of correlations to the model improves the quality of the fit. As seen in Figure 2, the Hybrid algorithm converges more quickly and to a lower value than either SGD or EM. All algorithms overfit the data relative to the lower bound, which is unsurprising given the small sample size. We later apply cross-validation in our analysis of real data to help avoid this.

In Figure 2D we plot $q_{C|Z}(j \mid z)$ for the CMP model trained with the Hybrid algorithm. We highlight two curves to emphasize that the weight curves learned by the model are similar to those of the ground truth CMP. In Figure 2E we plot stimulus-dependent, paired correlation matrices of the ground truth and Hybrid-trained models. As can be seen, the learned-model and ground truth correlations are nearly identical. Also note that when one mixture weight dominates, the correlations disappear.

Figure 3: Training a CMP model on synthetic data generated from a ground truth CMP with $m_N = 200$ neurons. (A) Negative log-likelihood descents as per Figure 2C. (B & C) Mixture weights of the ground truth CMP (B) with two weight curves matched qualitatively with weight curves from the Hybrid-trained CMP model (C).

Finally, in Figure 3 we repeat the previous analysis on a CMP target and model with $m_N = 200$ neurons. As can be seen, the results are essentially identical to those in Figure 2. As such, at least for synthetic data, a small number of sample points (e.g. 496) is sufficient for modelling low-dimensional noise correlations in large populations of neurons.

3.2 Response Recordings from Macaque Primary Visual Cortex

In this subsection we repeat our analysis from the previous subsection on multi-electrode recordings from macaque primary visual cortex (V1) in response to oriented grating stimuli (the data were originally presented in [26]). We analyze 8 datasets corresponding to 8 multi-electrode recording sessions of between 28 and 76 well-tuned neurons. Each dataset is composed of 80 repeated trials of 16 stimuli², where the stimuli are spread evenly over the half-circle.

²Each set of 80 trials pools 4 distinct phases of the stimulus, which could inflate measured correlations. Nevertheless, the recorded neurons were found to be roughly phase-independent [26].

In Figure 4A we depict the 10-fold cross-validation of the log-likelihood of each of the 8 datasets as a function of the number of components $m_C + 1$ in the Hybrid-trained CMP model, and we subtract the 1-component value from the value computed for the remaining components. This figure thus quantifies the amount of information (in nats) that is gained about the true data distribution by modelling the data with the given number of components. As can be seen, between 3 and 5 components is optimal for the various datasets before information gain yields to overfitting. In Figure 4B we plot the results of fitting the complete selected dataset with the three proposed algorithms. The Hybrid algorithm continues to perform well and converge quickly, while EM and SGD appear to require significantly more computation to find good solutions.

Figure 4: Training a CMP model on response recordings from macaque V1. (A) 10-fold cross-validated log-likelihood of 8 datasets as a function of the number of mixture components, relative to the 1-component log-likelihood for that dataset. We highlight the curve of a selected dataset which underlies the rest of the plots in this figure. (B) Negative log-likelihood descents on the full selected dataset, as per Figure 2C. (C) Stimulus-dependent mixture weights of the CMP learned on the selected dataset trained with the Hybrid algorithm, with tick marks indicating the conditioning angle of the correlation matrices in (D). (D) Paired correlation matrices, where the upper-right and lower-left triangles are the empirical and learned CMP correlations, respectively. Matrices are conditioned on $z = 0.1\pi$, $z = 0.35\pi$, $z = 0.65\pi$, and $z = 0.85\pi$, respectively, matching axis labels in (C).

In Figure 4C we plot $q_{C|Z}(j \mid z)$ for the Hybrid-trained model $Q_{NC|Z}$, but in this case we colour the four curves to more easily distinguish them. The curves indicate that the correlations of the true data distribution are stimulus-dependent, but less so than those of our random models. Moreover, at no stimulus value does a single component dominate, and the effective dimensionality is largely stimulus-independent. Figure 4D shows that the CMP finds a subtle yet significant amount of stimulus-dependence in the model correlations, which is not clearly visible in the empirical correlations.

4 Conclusion

In this paper we introduced a novel model (CMPs) of context-dependent neural correlations, derived an expectation-maximization method for training it, and demonstrated that CMPs can capture significant correlation structure amongst neurons. We also demonstrated that EM and SGD can often perform equally well in the context of finite mixture models, which stands in contrast with common practice [27]. In future work we will compare the correlations learned with a CMP to the results of alternative dimensionality-reduction techniques for context-dependent neural correlations [11, 13, 12].

At the same time, CMPs are uniquely suited for studying rate-based neural coding. For example, information-limiting correlations [7] can be directly incorporated into a CMP by defining the gain components as identical but shifted with respect to neuron index. In our applications in Section 3, the stimulus-dependent parameters had a simple generalized-linear form. Theoretically, however, these parameters could be modelled by a deep neural network, thereby allowing the output of the deep network to exhibit correlations. When the output of the network has a complex structure, such as when modelling the stimulus-dependent neural responses of mid-level brain regions to naturalistic images [28], incorporating correlations could significantly improve model predictions. Moreover, because CMPs model noise correlations in a manner that is compatible with theories of population coding, such a combined model can extend our understanding of neural computation beyond low-level sensory areas and low-dimensional variables.
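To illustrate the deep-network extension just described, the following sketch replaces the generalized-linear map $\theta_{N|Z}(z) = \Theta_{NZ} \cdot s_Z(z)$ with a small multilayer perceptron. The architecture and names are hypothetical (our illustration, not the paper's); any differentiable map into $\Theta_N$ can play this role, since training only requires backpropagating $s_X(X_i) - \eta_X(Z_i)$ through $\theta_{N|Z}$.

```python
import numpy as np

class MLPContext:
    """Hypothetical nonlinear context map theta_{N|Z}: a one-hidden-layer network
    from context features to the natural parameters of the Poisson component."""

    def __init__(self, dim_in, dim_hidden, m_N, rng):
        self.W1 = rng.normal(0.0, 0.1, (dim_hidden, dim_in))
        self.b1 = np.zeros(dim_hidden)
        self.W2 = rng.normal(0.0, 0.1, (m_N, dim_hidden))
        self.b2 = np.zeros(m_N)

    def __call__(self, s_z):
        h = np.tanh(self.W1 @ s_z + self.b1)   # hidden features of the context
        return self.W2 @ h + self.b2           # theta_{N|Z}(z)

rng = np.random.default_rng(2)
theta_NZ = MLPContext(dim_in=2, dim_hidden=32, m_N=20, rng=rng)
theta = theta_NZ(np.array([np.cos(0.5), np.sin(0.5)]))  # parameters for one context
```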
References

[1] Bruno B. Averbeck, Peter E. Latham, and Alexandre Pouget. Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7(5):358–366, May 2006.

[2] Adam Kohn, Ruben Coen-Cagli, Ingmar Kanitscheider, and Alexandre Pouget. Correlations and Neuronal Population Information. Annual Review of Neuroscience, 39(1):237–256, July 2016.

[3] Larry F. Abbott and Peter Dayan. The effect of correlated variability on the accuracy of a population code. Neural Computation, 11(1):91–101, 1999.

[4] Haim Sompolinsky, Hyoungsoo Yoon, Kukjin Kang, and Maoz Shamir. Population coding in neuronal systems with correlated noise. Physical Review E, 64(5), October 2001.

[5] Maoz Shamir and Haim Sompolinsky. Implications of neuronal diversity on population coding. Neural Computation, 18(8):1951–1986, 2006.

[6] Alexander S. Ecker, Philipp Berens, Andreas S. Tolias, and Matthias Bethge. The Effect of Noise Correlations in Populations of Diversely Tuned Neurons. Journal of Neuroscience, 31(40):14272–14283, October 2011.

[7] Rubén Moreno-Bote, Jeffrey Beck, Ingmar Kanitscheider, Xaq Pitkow, Peter Latham, and Alexandre Pouget. Information-limiting correlations. Nature Neuroscience, 17(10):1410–1417, October 2014.

[8] Elad Schneidman. Towards the design principles of neural population codes. Current Opinion in Neurobiology, 37:133–140, April 2016.

[9] Christophe Gardella, Olivier Marre, and Thierry Mora. Modeling the Correlated Activity of Neural Populations: A Review. Neural Computation, 31(2):233–269, December 2018.

[10] Elad Schneidman, Michael J. Berry, Ronen Segev, and William Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, April 2006.

[11] Einat Granot-Atedgi, Gašper Tkačik, Ronen Segev, and Elad Schneidman. Stimulus-dependent Maximum Entropy Models of Neural Population Codes. PLOS Computational Biology, 9(3):e1002922, March 2013.

[12] Benjamin R. Cowley, Matthew A. Smith, Adam Kohn, and Byron M. Yu. Stimulus-Driven Population Activity Patterns in Macaque Primary Visual Cortex. PLOS Computational Biology, 12(12):e1005185, December 2016.

[13] Alexander S. Ecker, Philipp Berens, R. James Cotton, Manivannan Subramaniyan, George H. Denfield, Cathryn R. Cadwell, Stelios M. Smirnakis, Matthias Bethge, and Andreas S. Tolias. State Dependence of Noise Correlations in Macaque Primary Visual Cortex. Neuron, 82(1):235–248, April 2014.

[14] John P. Cunningham and Byron M. Yu. Dimensionality reduction for large-scale neural recordings. Nature Neuroscience, 17(11):1500–1509, November 2014.

[15] Robbe L. T. Goris, J. Anthony Movshon, and Eero P. Simoncelli. Partitioning neuronal variability. Nature Neuroscience, 17(6):858–865, June 2014.

[16] Michael Okun, Nicholas A. Steinmetz, Lee Cossell, M. Florencia Iacaruso, Ho Ko, Péter Barthó, Tirin Moore, Sonja B. Hofer, Thomas D. Mrsic-Flogel, Matteo Carandini, and Kenneth D. Harris. Diverse coupling of neurons to populations in sensory cortex. Nature, 521(7553):511–515, May 2015.

[17] Robert Rosenbaum, Matthew A. Smith, Adam Kohn, Jonathan E. Rubin, and Brent Doiron. The spatial structure of correlated neuronal variability. Nature Neuroscience, 20(1):107–114, January 2017.

[18] E. Ganmor, R. Segev, and E. Schneidman. Sparse low-order interaction network underlies a highly correlated and learnable neural population code. Proceedings of the National Academy of Sciences, 108(23):9679–9684, June 2011.

[19] Dimitris Karlis and Loukia Meligkotsidou. Finite mixtures of multivariate Poisson distributions with application. Journal of Statistical Planning and Inference, 137(6):1942–1960, June 2007.

[20] Wei Ji Ma, Jeff Beck, Peter Latham, and Alexandre Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432–1438, October 2006.

[21] Srdjan Ostojic and Nicolas Brunel. From Spiking Neuron Models to Linear-Nonlinear Models. PLoS Computational Biology, 7(1):e1001056, January 2011.

[22] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
[23] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[24] Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential Family Harmoniums with an Application to Information Retrieval. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1481–1488. MIT Press, 2005.

[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] Ruben Coen-Cagli, Adam Kohn, and Odelia Schwartz. Flexible gating of contextual influences in natural vision. Nature Neuroscience, 18(11):1648–1655, October 2015.

[27] Geoffrey J. McLachlan, Sharon X. Lee, and Suren I. Rathnayake. Finite Mixture Models. Annual Review of Statistics and Its Application, 6(1):355–378, 2019.

[28] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, March 2016.
