Differentiable probabilistic models of scientific imaging with the Fourier slice theorem



Karen Ullrich∗ (University of Amsterdam), Rianne van den Berg (University of Amsterdam), Marcus Brubaker (York University), David Fleet† (University of Toronto), Max Welling‡ (University of Amsterdam)

∗ mail.karen.ullrich@gmail.com
† CIFAR AI Chair, Vector Institute, CIFAR LMB Program
‡ Qualcomm, CIFAR LMB Program

Abstract

Scientific imaging techniques, e.g., optical and electron microscopy or computed tomography, are used to study 3D structures through 2D observations. These observations are related to the 3D object through orthogonal integral projections. For computational efficiency, common 3D reconstruction algorithms model 3D structures in Fourier space, exploiting the Fourier slice theorem. At present it is somewhat unclear how to differentiate through the projection operator, as required by learning algorithms with gradient-based optimization. This paper shows how back-propagation through the projection operator in Fourier space can be achieved. We demonstrate the approach on 3D protein reconstruction. We further extend the approach to learning probabilistic 3D object models. This allows us to predict regions of low sampling rates or to estimate noise. Higher sample efficiency can be reached by utilizing the learned uncertainties of the 3D structure as an unsupervised estimate of model fit. Finally, we demonstrate how the reconstruction algorithm can be extended with amortized inference on unknown attributes such as object pose. Empirical studies show that joint inference of the 3D structure and object pose becomes difficult when the underlying object contains symmetries, in which case pose estimation can easily get stuck in local optima, inhibiting a fine-grained high-quality estimate of the 3D structure.

Figure 1: Example of electron cryo-microscopy with the GroEL-GroES protein (Xu et al., 1997).
Left: 2D observations obtained by projections with an electron beam. Right: Two different views of the ground-truth 3D protein structure represented by its electron density.

1 Introduction

The main goal of many scientific imaging methods is to reconstruct a (d+1)-dimensional structure v ∈ V ⊆ R^{D^{d+1}} from N d-dimensional observations x_n ∈ I ⊆ R^{D^d}, where d is either one or two. For the sake of simplicity we discuss the case d = 2 in the rest of this work.

The contributions of this paper are:

1. We view the process of image formation through a graphical model in which latent variables correspond to physical quantities, such as the hidden structure v or the relative orientation/pose of a specimen. This enables one to predict errors in the reconstruction of 3D structures through uncertainty estimates, which is especially interesting when objects v are only partially observable, as is the case in certain medical scans, such as breast cancer scans. Moreover, uncertainty prediction enables more data-efficient model validation.

2. Based on the aforementioned innovations, we propose a new method for (unsupervised) reconstruction evaluation. In particular, we demonstrate that learned uncertainties can replace currently used data-inefficient methods of evaluation (see Section 6.2). We thus learn better model fits than traditional methods given the same amount of data.

3. We extend current approaches such as Jaitly et al. (2010) to describe the generative process as a differentiable map, by adopting recent techniques from the deep learning community (Jaderberg et al., 2015; Rezende et al., 2016). We demonstrate that this enables more advanced joint inference schemes over object pose and structure estimation.

Our experimental validation focuses on single-particle electron cryo-microscopy (cryo-EM).
Cryo-EM is a challenging scientific imaging task, as it suffers from complex sources of observation noise, low signal-to-noise ratios, and interference corruption. Interference corruption attenuates certain Fourier frequencies in the observations. Radiation exposure is minimized because electron radiation severely damages biological specimens during data collection. Minimal radiation, however, leads to low signal-to-noise ratios, where sensor cells record relatively low electron counts. Since imaging techniques like CT suffer from a subset of these difficulties, we believe that evaluating and analyzing our method on cryo-EM problems is appropriate.

2 Background and related work

Modelling nano-scale structures such as proteins or viruses is a central task in structural biology. By freezing such structures and subsequently projecting them via a parallel electron beam onto a sensor grid (see Figure 2), cryo-EM enables reconstruction and visualization of such structures. The technique has been described as revolutionary because researchers are capable of observing structures that cannot be crystallized, as required for X-ray crystallography (Rupp, 2009).

The main task of reconstructing the structure from projections in cryo-EM, and in the wider field of medical imaging, is somewhat similar to multi-view scene reconstruction from natural images. There are, however, substantial differences. Most significantly, the projection operation in medical imaging is often an orthogonal integral projection, while in computer vision it is a non-linear perspective projection for which materials exhibit different degrees of opacity. Thus, the generative model in computer vision is more complex. Medical imaging domains, on the other hand, face significant noise and measurement uncertainties, with signal-to-noise ratios as low as 0.05 (Baxter et al., 2009).
Most cryo-EM techniques (De la Rosa-Trevín et al., 2013; Grigorieff, 2007; Scheres, 2012; Tang et al., 2007) iteratively refine an initial structure by matching a maximum a posteriori (MAP) estimate of the pose (orientation and position) under the proposal structure with the image observation. These approaches suffer from a range of problems, such as high sensitivity to poor initialization (Henderson et al., 2012). In contrast to this approach, and closely related to our work, Brubaker et al. (2015) and Punjani et al. (2017) treat poses and structure as latent variables of a joint density model. MAP estimation enables efficient optimization in observation space. Previous work (Sigworth, 1998; Scheres et al., 2007a; Jaitly et al., 2010; Scheres, 2012) has suggested full marginalization; however, due to its cost, it is usually intractable. This paper extends the MAP approach by utilizing variational inference to approximate intractable integrals. Further, reparameterizing posterior distributions enables gradient-based learning (Kingma and Welling, 2013; Rezende et al., 2014a). To our knowledge this is the first such approach that provides an efficient way to learn approximate posterior distributions in this domain.

3 Observation Formation Model

Given a structure v, we consider a generic generative model of observations, one that is common to many imaging modalities. As a specific example, we take the structure v to be a frozen (i.e., cryogenic) protein complex, although the procedure described below applies as well to CT scanning and optical microscopy. v is in a specific pose p_n relative to the direction of the electron radiation beam. This yields a pose-conditional projection, with observation x_n.
Specifically, the pose p_n = (r_n, t_n) consists of r_n ∈ SO(3), corresponding to the rotation of the object with respect to the microscope coordinate frame, and t_n ∈ E(3), the translation of the structure with respect to the origin.

The observations are then generated as follows. The specimen in pose p_n is subjected to radiation (the electron beam), yielding an orthographic integral projection onto the plane perpendicular to the beam. The expected projection can be formulated as

x̄_n = P T_{t_n} R_{r_n} v.  (1)

Here P is the projection operator, T_{t_n} is the linear translation operator for translation t_n, and R_{r_n} is the linear operator corresponding to rotation r_n. Without loss of generality we can choose the projection direction to be along the z-direction e_z.

Figure 2: Top: Image formation in the example of cryo-EM: the parallel electron beam projects the electron densities onto a surface where a grid of DDD sensors records the number of electrons that hit it. Bottom: To detect the projections (outlined in red), an algorithm seeks out areas of interest (Langlois et al., 2014; Zhao et al., 2013) (figure from Pintilie, 2010).

When the projection is recorded with a discrete sensor grid (i.e., sampled), information beyond the Nyquist frequency is aliased. Additionally, the recording is corrupted with noise stemming from the stochastic nature of electron detection events and sensor failures (Egelman, 2016). Low doses are necessary since electron exposure causes damage to sensitive biological molecules; logically, the effect is more severe for smaller objects of study. Many sophisticated noise models have been proposed for these phenomena (Faruqi et al., 2003; Vulović et al., 2013; Scheres et al., 2007b). In this work, for simplicity, we assume isotropic Gaussian noise, i.e.,

p(x_n | p_n, v) = N(x_n | x̄_n, σ_ε² 1),  (2)

where σ_ε models the magnitude of the observation noise.
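The observation model of equation (2) is an ordinary Gaussian log-density around the expected projection. A minimal numpy sketch (the helper name `log_likelihood` and the flattened-image convention are ours, not from the paper):

```python
import numpy as np

def log_likelihood(x, x_bar, sigma_eps):
    """Log of Eq. (2): isotropic Gaussian density of an observed projection x
    around the expected projection x_bar, with noise scale sigma_eps."""
    d = (x - x_bar).ravel()
    return (-0.5 * d.size * np.log(2.0 * np.pi * sigma_eps**2)
            - 0.5 * np.dot(d, d) / sigma_eps**2)
```

For x = x̄ and σ_ε = 1 this reduces to the normalizing constant −(D²/2) ln 2π; everything pose- and structure-dependent enters through x̄_n of equation (1).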
The image formation process is depicted in Figure 2. The final section below discusses how one can generalize to more sophisticated (learnable) noise models. Note that we do not model interference patterns caused by electron scattering, called defocus and modelled with a contrast transfer function (CTF). This leads to a less realistic generative model; however, we see the problem of CTF estimation as somewhat independent of our problem. Ideally, we would like to model the CTF across multiple datasets, but we leave this to future work.

4 Back-propagating through the generative model

In this section we aim to bridge the gap between our knowledge of the generative process p(x_n | p_n, v) and a differentiable mapping that facilitates direct optimization of the hidden variables (p_n, v) with gradient-descent-style schemes. We start with an explanation of a naive differentiable implementation in position space, followed by a computationally more efficient version that shifts the computation to the Fourier domain (momentum space).

4.1 Naive implementation: project in position space

Our goal is to optimize the conditional log-likelihood log p(x_n | p_n, v) with respect to the unobserved p_n and v, maximizing the likelihood of the 2D observations. This requires equation (2) to be a differentiable operator with respect to p_n and v. Note that the dependence on p_n and v is fully determined by equation (1). In order to achieve this, we first need to apply the group action R_{r_n} to v. Matrix representations of the group action, such as the Euler-angle matrix, are defined on the sampling grid G = {(ν_1^(j), ν_2^(j), ν_3^(j))}_{j=1}^{D³} of v rather than on the voxel values {v_j}_{j=1}^{D³}.
For example, the action induced by a rotation around the z-axis by an angle α on the position (ν_1^(j), ν_2^(j), ν_3^(j)) of an arbitrary voxel j can be written as

R_α · ν^(j) = [cos α, −sin α, 0; sin α, cos α, 0; 0, 0, 1] (ν_1^(j), ν_2^(j), ν_3^(j))ᵀ.  (3)

This entails two problems. First, the volume after transformation should be sampled at the same grid points as before; this requires interpolation. Second, to achieve a differentiable map we need to formulate the transformation of position values as a transformation of the voxel values. Jaderberg et al. (2015) offer a solution to both problems, known as differentiable sampling.¹ The j-th voxel v′_j = (v′)_j of the transformed volume v′ = R_{r_n} v, with index vector ζ^(j), can be expressed as a weighted sum of all voxels before transformation, {v_i, ν^(i)}_{i=1}^{D³}. The weights are determined by a sampling kernel k(·), the argument of which is the difference between the back-transformed position R_α^{-1} · ζ^(j) and the sampling grid vectors ν^(i):

v′_j = Σ_{i=1}^{D³} v_i · k(R_α^{-1} · ζ^(j) − ν^(i)).  (4)

A popular kernel in this context is the linear interpolation sampling kernel²

k(R_α^{-1} · ζ^(j) − ν^(i)) = Π_{m=1}^{3} max(0, 1 − |(R_α^{-1} · ζ^(j))_m − ν_m^(i)|).  (5)

Computing one voxel v′_j then only requires a sum over 8 voxels of the original structure, determined by flooring and ceiling the elements of (R_α^{-1} ζ^(j))_m.

¹ Originally invented to learn affine transformations on images to ease the classification task for standard neural networks, the approach has since been extended to 3D reconstruction problems from images (Rezende et al., 2016).
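Equations (4)–(5) can be realized in a few lines of array code. The following numpy sketch (the function name and the centre-of-volume rotation convention are our choices; the paper's implementation uses an autodiff framework) gathers, for each target voxel, the eight neighbours obtained by flooring and ceiling R_α^{-1} ζ^(j) and weights them with the linear kernel:

```python
import numpy as np

def rotate_volume_z(v, alpha):
    """Resample a D^3 volume rotated by angle alpha about the z-axis,
    using the trilinear kernel of Eqs. (4)-(5). Each output voxel is a
    kernel-weighted sum of the 8 voxels around R^{-1} zeta; the same
    arithmetic is differentiable in an autodiff framework."""
    D = v.shape[0]
    c = (D - 1) / 2.0  # rotate about the volume centre (a convention we pick)
    R_inv = np.array([[ np.cos(alpha), np.sin(alpha), 0.0],
                      [-np.sin(alpha), np.cos(alpha), 0.0],
                      [0.0, 0.0, 1.0]])
    ax = np.arange(D)
    # target grid positions zeta^(j), centred on the rotation axis
    zeta = np.stack(np.meshgrid(ax, ax, ax, indexing='ij'), -1).reshape(-1, 3) - c
    src = zeta @ R_inv.T + c          # R^{-1} zeta, back in voxel coordinates
    lo = np.floor(src).astype(int)
    out = np.zeros(len(src))
    for corner_offset in np.ndindex(2, 2, 2):   # the 8 floor/ceil corners
        corner = lo + np.array(corner_offset)
        w = np.prod(np.maximum(0.0, 1.0 - np.abs(src - corner)), axis=1)
        ok = np.all((corner >= 0) & (corner < D), axis=1)
        idx = corner[ok]
        out[ok] += w[ok] * v[idx[:, 0], idx[:, 1], idx[:, 2]]
    return out.reshape(D, D, D)
```

A zero-angle rotation reproduces the volume exactly (all kernel weight falls on one corner), and a 90° rotation about z matches an axis-swap of the voxel array.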
Furthermore, the partial derivatives are provided by

∂v′_j / ∂v_i = k(R_α^{-1} · ζ^(j) − ν^(i)),  (6)

∂v′_j / ∂(R_α^{-1} ζ^(j))_m = Σ_{i=1}^{D³} v_i Π_{l≠m} max(0, 1 − |(R_α^{-1} ζ^(j))_l − ν_l^(i)|) · { 0 if |(R_α^{-1} ζ^(j))_m − ν_m^(i)| ≥ 1; −1 if (R_α^{-1} ζ^(j))_m ≥ ν_m^(i); 1 if (R_α^{-1} ζ^(j))_m < ν_m^(i) }.  (7)

This framework was originally proposed for any differentiable kernel and any differentiable affine position transformation ν → ζ. In our setting, we restrict ourselves to linear interpolation kernels. The group actions represented by R_r are affine. In this work we represent rotations as Euler angles using the Z-Y-Z convention; one could also use quaternions or exponential maps. As with rotation, the translation operation is also a transformation of the voxel grid rather than of the voxel values. Thus, equation (4) can also be used to obtain a differentiable translation operation.

Finally, the orthogonal integral projection operator is applied by summing voxel values along one principal direction. Since the position of the hidden volume is arbitrary, we can fix this direction to be the z-axis, as discussed in Section 3. Denoting a volume rotated and translated according to p = (r, t) by v′ = T_t R_r v, the (ζ_1, ζ_2)-th element of its expected projection is given by

x̄_n[ζ_1, ζ_2] = (P_{3→2} v′)[ζ_1, ζ_2] = Σ_{ζ_3} v′[ζ_1, ζ_2, ζ_3],  (8)

where v′[ζ_1, ζ_2, ζ_3] denotes element (ζ_1, ζ_2, ζ_3) of v′.

² Linear kernels are efficient and yield fairly good results. More complex ones, such as the Lanczos resampling kernel, may actually yield worse results due to smoothing.

This concludes a naive approach to modelling a differentiable map of the expected observation in position space. The approach is not particularly efficient: according to equation (8), we need to interpolate all D³ voxels to compute one D²-dimensional observation.
Moreover, back-propagating through this mapping requires transporting gradients through all D³ voxels. Next, we show how to reduce the cost of this naive approach, without loss of precision, by shifting the problem to the Fourier domain.

4.2 Projection-slice theorem

The projection-slice theorem, or Fourier-slice theorem, states the equivalence between the Fourier transform F_d of the projection P_{d′→d} of a d′-dimensional function f(r) onto a d-dimensional submanifold, F_d P_{d′→d} f(r), and a d-dimensional slice of the d′-dimensional Fourier transform of that function. This slice is a d-dimensional linear submanifold through the origin in the Fourier domain that is parallel to the projection submanifold, S_d F_{d′} f(r).

In our setting, given an axis-aligned projection direction (see equation (1)) and the discrete grid G, the expected observation in 2D Fourier space equals a central slice through the Fourier-transformed 3D structure parallel to the projection direction e_z:

F_2 x̄ = F_2 P_{3→2} v′ = S_2 (F_3 v′) = S_2 v̂′,  (9)

where F_2 and F_3 denote discrete Fourier transforms in 2 and 3 dimensions. The slice operator S is the Fourier equivalent of the projection operator. In our case, it is applied as follows:

(S_2 v̂′)[ω_1, ω_2] = v̂′[ω_1, ω_2, 0],  (10)

where the ω_m are Fourier indices.

This allows one to execute the generative model in position or momentum space. It has proven more efficient for most reconstruction algorithms to do the computation in the Fourier domain (Sigworth, 2016). This also applies to our algorithm: (i) we reconstruct the structure in the Fourier domain, so we only need to apply an inverse Fourier transformation at the end of optimization; (ii) we may compute the Fourier-transformed expected projections a priori, which is easily parallelized.
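The identity in equations (9)–(10) is easy to check numerically for the axis-aligned case: the 2D FFT of the z-projection equals the ω_3 = 0 slice of the 3D FFT. A small numpy sketch (the 16³ random volume is an arbitrary stand-in for v′):

```python
import numpy as np

v = np.random.default_rng(0).normal(size=(16, 16, 16))  # stand-in volume v'
proj = v.sum(axis=2)               # orthographic projection along e_z, Eq. (8)
lhs = np.fft.fft2(proj)            # F2 P_{3->2} v'
rhs = np.fft.fftn(v)[:, :, 0]      # central slice S2 F3 v', Eq. (10)
assert np.allclose(lhs, rhs)       # Fourier slice theorem, Eq. (9)
```

The check holds to numerical precision because, for the discrete transform, summing over z before the 2D FFT and fixing ω_3 = 0 after the 3D FFT are the same linear map.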
Thus, even though in its original formulation we cannot expect a computational benefit, when the computation is shared across many data points to reconstruct one latent structure, the gain is significant. We elaborate on this point with respect to differentiation below.

4.3 Differentiable orthographic integral projection

We incorporate the Fourier slice theorem into our approach to build an efficient differentiable generative model. For this we translate all elements of the model to their counterparts in the Fourier domain. The Gaussian noise model becomes (Havin and Jöricke, 1994)

p(F_2 x_n | p_n, v̂) = N(F_2 x_n | F_2 x̄_n, σ_ε² 1),  (11)

where v̂ = F_3 v. In the following, we aim to determine an expression for the expected Fourier projection F_2 x̄_n = F_2(P_{3→2} T_{t_n} R_{r_n} v) by deriving the operators' Fourier counterparts, F_2(P_{3→2} T_{t_n} R_{r_n} v) = P̂_{3→2} T̂_{t_n} R̂_{r_n} v̂.

We start by noting that it is useful to keep v̂ in memory to avoid computing the 3D discrete Fourier transform multiple times during optimization. The inverse Fourier transform F_3^{-1} v̂ = v is then only applied once, after convergence of the algorithm. Next, we restate that the Fourier transformation is a rotation-equivariant mapping, i.e., R̂_{r_n} = R_{r_n} (Chirikjian and Kyatkin, 2000). This means the derivations from Section 4.1 with respect to the rotation apply in this context as well. A translation in the Fourier domain T̂_{t_n}, however, induces a re-weighting of the original Fourier coefficients:

(T̂_{t_n} R_{r_n} v̂)[ω_1, ω_2, ω_3] = e^{−i2π t_n·ω} (R_{r_n} v̂)[ω_1, ω_2, ω_3].  (12)

Finally, the last section established the equivalence of the slice and projection operators in momentum and position space, P̂_{3→2} = S_2 (see equation (9)).
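Equation (12) can likewise be verified numerically: multiplying v̂ by the phase factor e^{−i2π t·ω} and transforming back reproduces a circular shift of the volume. A numpy sketch for integer shifts (the discrete transform wraps around at the boundary, which np.roll mimics):

```python
import numpy as np

v = np.random.default_rng(1).normal(size=(8, 8, 8))
t = (2, 3, 1)                            # translation t_n in voxels
w = np.fft.fftfreq(8)                    # frequencies omega (cycles per voxel)
phase = np.exp(-2j * np.pi * (t[0] * w[:, None, None]
                              + t[1] * w[None, :, None]
                              + t[2] * w[None, None, :]))
shifted = np.fft.ifftn(phase * np.fft.fftn(v)).real   # T_t v via Eq. (12)
assert np.allclose(shifted, np.roll(v, t, axis=(0, 1, 2)))
```

For non-integer t_n the same phase multiplication performs sub-voxel translation, which is exactly why translation is cheap in the Fourier domain.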
Specifically, for the linear interpolation kernel, we compute the set of interpolation points by flooring and ceiling the elements of the vectors R_{r_n}^{-1} ω^(j), for all ω^(j) = (ω_1^(j), ω_2^(j), 0). This entails 8 interpolation candidates per voxel of the central slice, 8D² in total. Recall that the position-space computation above involved D³ voxels and 8D³ candidates.

We can further improve the efficiency of the algorithm by swapping the projection and translation operators. That is, due to parallel radiation, and hence orthographic projection,

P T_{t_n} R_{r_n} v = T_{τ_n} P R_{r_n} v,  (13)

where τ_n = (t_n · e_x) e_x + (t_n · e_y) e_y. This is more efficient because it reduces the translation to a two-dimensional translation; equation (12) is modified to its two-dimensional equivalent accordingly.

For the naive implementation, the cost of a real-space forward projection is O(D³). In contrast, converting the volume to Fourier space costs O(D³ log D), projecting costs O(D²), and applying the inverse Fourier transform costs O(D² log D). At first glance this implies a higher computational cost. However, for large datasets the cost of transforming the volume is amortized over all data points. For gradient-descent schemes, we iterate over the dataset more than once, so the cost of Fourier-transforming the observations is further amortized. Furthermore, it is often reasonable to consider only a subset of Fourier frequencies, so back-projection becomes O(r²) with r < D. The efficiency of this algorithm in the context of cryo-EM was first recognized by Grigorieff (1998). (We provide code for the differentiable observation model, and in particular for the Fourier operators, at https://github.com/KarenUllrich/pytorch-backprojection.)

5 Variational inference

Here we describe the variational inference procedure for 3D reconstruction in Fourier space, enabled by the Fourier slice theorem.
We assume we have a dataset of observations from a single type of protein with ground-truth structure v, and its Fourier transform v̂ = F_3 v. We consider two scenarios. In the first, both the pose of the protein and the projection are observed, and inference is performed only over the global protein structure. In the second, the pose of the protein for each observation is unknown; therefore, inference is done over poses and the protein structure.

The first scenario is synonymous with the setting in tomography, where we observe a frozen cell or larger complex positioned in known poses. This case is often challenging because the sample cannot be observed from all viewing angles. For example, in cryo-EM tomography the specimen frozen in an ice slice can only be rotated until the edge of the slice comes into view. We find similar problems in CT scanning, for example in breast cancer scans. The second scenario is relevant for cryo-EM single particle analysis. In this case multiple identical particles are confined in a frozen sample and no information on their structure or position is available a priori.

In this work we lay the foundations for doing inference in either of the two scenarios. However, our experiments demonstrate that joint inference over the poses and the 3D structure is very sensitive to getting stuck in local minima that correspond to approximate symmetries of the 3D structure. Therefore, the main focus of this work is the setting where the poses are observed.

5.1 Inference over the 3D structure

Here the data comprise image projections and poses, {(x_n, p_n)}_{n=1}^N, with Fourier-transformed projections denoted x̂_n = F_2 x_n. Our goal is to learn a model q(v̂) of the latent structure that resembles the true posterior p(v̂ | {x̂_n}, {p_n}) as closely as possible.
Figure 3: Graphical model: latent structure v, pose p_n and noise σ_ε can be learned from observations x_n through back-propagation. The latent structure distribution is thereby characterized by a set of parameters, in the Gaussian example μ_v and σ_v.

For this, we assume a joint latent-variable model p({x̂_n}_{n=1}^N, v̂ | {p_n}_{n=1}^N) = p({x̂_n}_{n=1}^N | {p_n}_{n=1}^N, v̂) p(v̂). To avoid clutter below, we use short-hand notations like {x̂_n} for {x̂_n}_{n=1}^N. Specifically, we minimize an upper bound on the Kullback-Leibler (KL) divergence:

D_KL[q(v̂) ‖ p(v̂ | {x̂_n}, {p_n})]
≤ −∫ dv̂ q(v̂) ln [ p({x̂_n} | {p_n}, v̂) p(v̂) / q(v̂) ]
= Σ_{n=1}^N −E_{q(v̂)}[ln p(x̂_n | p_n, v̂)] + D_KL[q(v̂) ‖ p(v̂)].  (14)

Here we have assumed that, given the volume, the data are IID: p({x̂_n} | {p_n}, v̂) = Π_{n=1}^N p(x̂_n | p_n, v̂). The bound differs from the divergence by the data log-likelihood ln p({x̂_n}), a constant with respect to v̂ and {p_n}. This is equivalent to lower-bounding the model evidence by introducing a variational posterior (Jordan et al., 1998).

In this work we focus on modelling q(v̂) as an isotropic Gaussian distribution. The prior is assumed to be a standard Gaussian. In practice, we use stochastic gradient descent-like optimization, with the data organized in mini-batches. That is, we learn the distribution parameters η = {μ_v, σ_v} for q(v̂) = q_η(v̂) through stochastic optimization, efficiently, by using the reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014a). In equation (14), the reconstruction term depends on the number of datapoints; the KL divergence between the prior and approximate posterior does not.
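Both ingredients of the bound in equation (14) have simple closed forms for a diagonal Gaussian q and a standard-Gaussian prior. A numpy sketch (helper names are ours; we ignore the Hermitian-symmetry constraint on v̂ here for brevity, and gradients would come from an autodiff framework in practice):

```python
import numpy as np

def sample_q(mu, log_sigma, rng):
    """Reparameterization trick: v = mu + sigma * eps with eps ~ N(0, I),
    so samples are differentiable w.r.t. the parameters (mu, log_sigma)."""
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form D_KL[ N(mu, sigma^2 I) || N(0, I) ], summed over dims."""
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)
```

The KL term vanishes exactly when q equals the prior (μ_v = 0, σ_v = 1) and grows as the posterior moves away from it, which is the "pressure" referred to in Section 6.3.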
As the sum of the mini-batch objectives should equal the objective on the entire dataset, the mini-batch objective is

Σ_{n∈D_i} −E_{q(v̂)}[ln p(x̂_n | p_n, v̂)] + (|D_i|/N) D_KL[q(v̂) ‖ p(v̂)],  (15)

where D_i is the set of indices of the data in mini-batch i, and |D_i| denotes the size of the i-th mini-batch.

5.2 Joint inference over the 3D structure and poses

In the second scenario the pose of the 3D structure for each observation is unknown. The data thus comprise the observed projections {x_n}_{n=1}^N. Again, we perform inference in the Fourier domain, with transformed projections {x̂_n}_{n=1}^N as data. We perform joint inference over the poses {p_n}_{n=1}^N and the volume v̂. We assume the latent-variable model can be factored as follows: p({x̂_n}, v̂, {p_n}) = p({x̂_n} | {p_n}, v̂) p({p_n}) p(v̂). Upper-bounding the KL divergence, as above, we obtain

D_KL[q(v̂) q({p_n}) ‖ p(v̂, {p_n} | {x̂_n})]
≤ −∫∫ d^N p_n dv̂ q({p_n}) q(v̂) ln [ p({x̂_n} | {p_n}, v̂) p({p_n}) p(v̂) / (q({p_n}) q(v̂)) ]
= Σ_{n=1}^N −E_{q(v̂)} E_{q(p_n)}[ln p(x̂_n | p_n, v̂)] + Σ_{n=1}^N D_KL[q(p_n) ‖ p(p_n)] + D_KL[q(v̂) ‖ p(v̂)].  (16)

The prior, approximate posterior and conditional likelihood all factorize across datapoints: p({p_n}) = Π_{n=1}^N p(p_n), q({p_n}) = Π_{n=1}^N q(p_n), and p({x̂_n} | {p_n}, v̂) = Π_{n=1}^N p(x̂_n | p_n, v̂). As in equation (15), for mini-batch optimization algorithms we use the objective

Σ_{n∈D_i} −E_{q(v̂)} E_{q(p_n)}[ln p(x̂_n | p_n, v̂)] + Σ_{n∈D_i} D_KL[q(p_n) ‖ p(p_n)] + (|D_i|/N) D_KL[q(v̂) ‖ p(v̂)].  (17)

In Section 5.1, we learn structure parameters shared across all data points. Here the pose parameters are unique per observation and therefore require a separate inference procedure per data point.
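The |D_i|/N weighting of the global KL term in equations (15) and (17) is exactly what makes the per-mini-batch objectives sum to the full-dataset objective. A toy numerical check (the per-datapoint terms are random stand-ins, not real reconstruction losses):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
recon = rng.normal(size=N)        # stand-ins for -E_q[ln p(x_n | p_n, v)]
kl_global = 1.23                  # stand-in for D_KL[q(v) || p(v)]

batches = [range(0, 4), range(4, 7), range(7, 10)]
mb_objectives = [sum(recon[n] for n in b) + len(b) / N * kl_global
                 for b in batches]                      # Eq. (15) per batch
full_objective = recon.sum() + kl_global                # Eq. (14)
assert np.isclose(sum(mb_objectives), full_objective)
```

The per-datapoint pose KL terms in equation (17) need no such rescaling, since each appears in exactly one mini-batch per epoch.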
As an alternative, we can also learn a function that approximates the inference procedure; this is called amortized inference (Kingma and Welling, 2013). In practice, a complex parameterized function f_φ, such as a neural network, predicts the parameters of the variational posterior, η = f_φ(·).

6 Experiments

We empirically test the formulation above with simulated data from the well-known GroEL-GroES protein (Xu et al., 1997). To this end we generate three datasets, each with 40K projections onto 128 × 128 images at a resolution of 2.8 Å per pixel. The three datasets have signal-to-noise ratios (SNR) of 1.0, 0.04 and 0.01, referred to below as the noise-free, medium-noise and high-noise cases. Figure 7 shows one sample per noise level. As previously stated in Section 3, we do not model the microscope's defocus or electron scattering effects, as captured by the CTF (Kohl and Reimer, 2008).

Using synthetic data allows us to evaluate the algorithm against the ground-truth structure, e.g., in terms of mean-squared error (MSE). With real data, where ground truth is unknown, the resolution of a fitted 3D structure is often quantified using Fourier Shell Correlation (FSC; Rosenthal and Henderson, 2003): the N observations are partitioned randomly into two sets, A and B, each of which is then independently modeled with the same reconstruction algorithm. The normalized cross-correlation coefficient is then computed as a function of frequency f = 1/λ to assess the agreement between the two reconstructions.

Given two 3D structures in the Fourier domain, F_A and F_B, the FSC at frequency f is given by

FSC(f | F_A, F_B) = Σ_{f_i ∈ S_f} F_A(f_i) · F_B(f_i)* / sqrt( Σ_{f_i ∈ S_f} |F_A(f_i)|² · Σ_{f_i ∈ S_f} |F_B(f_i)|² ),  (18)

where S_f denotes the set of frequencies in a shell at distance f from the origin of the Fourier domain (i.e., with wavelength λ). This yields FSC curves like those in Figure 4.
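Equation (18) amounts to binning Fourier coefficients into frequency shells and computing a normalized correlation per shell. A numpy sketch (binning by rounding |f_i| to the nearest integer is our choice, and we take the real part of the numerator, as is conventional):

```python
import numpy as np

def fsc(FA, FB):
    """Fourier Shell Correlation, Eq. (18), between two D^3 Fourier volumes.
    Returns one correlation value per integer-radius frequency shell."""
    D = FA.shape[0]
    f = np.fft.fftfreq(D) * D          # integer-valued frequency coordinates
    r = np.sqrt(f[:, None, None]**2 + f[None, :, None]**2 + f[None, None, :]**2)
    shell = np.rint(r).astype(int)
    out = []
    for s in range(shell.max() + 1):
        m = shell == s
        num = np.real(np.sum(FA[m] * np.conj(FB[m])))
        den = np.sqrt(np.sum(np.abs(FA[m])**2) * np.sum(np.abs(FB[m])**2))
        out.append(num / den)
    return np.array(out)
```

FSC(f | F_A, F_A) is identically 1; decorrelation of the two reconstructions at high-frequency shells is what drives the curves in Figure 4 below the thresholds τ.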
The quality (resolution) of the fit can be measured by the frequency at which this curve crosses a threshold τ. When one of F_A or F_B is ground truth, τ = 0.5; when both are noisy reconstructions it is common to use τ = 0.143 (Rosenthal and Henderson, 2003). The structure is then said to be resolved to the wavelength λ = 1/f for which FSC(f) = τ.

6.1 Comparison to baseline algorithms

When poses are known, the current state of the art (SOTA) is a conventional tomographic reconstruction (a.k.a. back-projection). When poses are unknown, there are several well-known SOTA cryo-EM algorithms (De la Rosa-Trevín et al., 2013; Grigorieff, 2007; Scheres, 2012; Tang et al., 2007). All provide point estimates of the 3D structure. In terms of graphical models, point estimates correspond to the case in which the posterior q(v̂) is modeled as a delta function, δ(v̂ | μ_v), the parameter of which is the 3D voxel array μ_v. We compare this baseline to a model in which the posterior q(v̂) is a multivariate diagonal Gaussian, N(v̂ | μ_v, σ_v² 1). While the latent structure is modeled in the Fourier domain, the spatial-domain signal is real-valued.

Table 1: Results for modelling the protein structure as a latent variable. Fitting a Gaussian or Dirac posterior distribution with VI leads to similar model fits, as measured by MSE and FSC between fit and ground truth with τ = 0.5.

Posterior                  | δ(v | μ_v) | N(v | μ_v, σ_v² 1)
Time until converged [s]   | ~390       | ~480
MSE [10⁻³/voxel]           | 2.29       | 2.62
Resolution [Å]             | 5.82       | 5.82

We restrict the learnable parameters μ_v and σ_v accordingly³, and use the reparameterization trick, correlating the samples accordingly. Finally, the prior in equation (14) is a multivariate standard Gaussian, p(v) = N(v | 0, 1).

Table 1 shows results for these two models, with known poses (the tomographic setting) and noise-free observations.
Given the sensor resolution of r = 2.8 Å, the highest possible resolution is the Nyquist wavelength of 5.6 Å. Our results show that both models approach this resolution, and in reasonable time.

6.2 Uncertainty estimation leads to data efficiency

In this section we explore how modelling the latent structure with uncertainty can improve data efficiency. Recall that the FSC is computed by comparing reconstructions based on the dataset splits A and B. As an alternative, we propose to utilize the learned model uncertainties σ_v to achieve a similar result; we then need only one model fit, which includes both dataset splits. Specifically, we propose to extend a measure first presented by Unser et al. (1987), the spectral SNR, to the 3D case, and hence refer to it as the spectral shell SNR (SS-SNR). When modelling the latent structure as a diagonal Gaussian N(v | μ_v, σ_v² 1), the SS-SNR is

α(f) = Σ_{f_i ∈ S_f} |μ_v(f_i)|² / Σ_{f_i ∈ S_f} σ_v²(f_i).  (19)

Following the formulation of Unser et al. (1987), we can then express the FSC in terms of the SS-SNR, i.e., FSC ≈ α/(1+α).

³ That is, ∀φ ∈ {μ_v, σ_v}: Re(φ)[ζ] = Re(φ)[−ζ] and Im(φ)[ζ] = −Im(φ)[−ζ], with ζ = (ζ_1, ζ_2, ζ_3).

Figure 4: FSC curves for various model fits. The grey lines indicate resolution thresholds; the higher the resolution of a structure, the better the model fit. We contrast the FSC curves with our proposed measure of model fit.

Figure 4 shows FSC curves based on reconstructions from the medium-noise (top) and high-noise (bottom) datasets. First, we aim to demonstrate that the FSC curves between the Gaussian fit (with all data) vs. ground truth and the MAP-estimate model (with all data) vs. ground truth, i.e., FSC(f | δ, δ_GT) and FSC(f | N, δ_GT), yield the same fit quality. Note that we would not usually have access to the ground-truth structure.
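Equation (19) uses the same shell binning as the FSC, so with a learned diagonal-Gaussian posterior, α(f) and the predicted curve α/(1+α) come almost for free. A numpy sketch (integer-radius shell binning as before; `mu_hat` and `var_hat` denote the posterior mean and variance over Fourier coefficients):

```python
import numpy as np

def ss_snr(mu_hat, var_hat):
    """Spectral shell SNR alpha(f), Eq. (19), and the implied FSC
    approximation alpha / (1 + alpha), per integer-radius Fourier shell."""
    D = mu_hat.shape[0]
    f = np.fft.fftfreq(D) * D
    r = np.rint(np.sqrt(f[:, None, None]**2 + f[None, :, None]**2
                        + f[None, None, :]**2)).astype(int)
    alpha = np.array([np.sum(np.abs(mu_hat[r == s])**2) / np.sum(var_hat[r == s])
                      for s in range(r.max() + 1)])
    return alpha, alpha / (1.0 + alpha)
```

When the posterior mean power equals the posterior variance in every shell, α = 1 and the predicted correlation is 0.5, illustrating how α directly calibrates the fit quality without a dataset split.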
Secondly, because in a realistic scenario we would not have access to the ground truth, we would need to split the dataset in two. For this, we evaluate the FSC between the ground truth and two disjoint splits of the dataset, FSC(f | δ_A, δ_GT) and FSC(f | δ_B, δ_GT). Unsurprisingly, this curve lies below the previous curves. Note also that the measure we would actually consider, FSC(f | δ_A, δ_B), is more conservative.

Finally, we show that the α/(1 + α) curve has the same inflection points as the FSC curve. As one would expect, it lies above the conservative FSC(f | δ_A, δ_B).

Using α, one can thus quantify the model fit with the learned uncertainty rather than with an FSC curve. As a consequence, there is no need to partition the data and perform two separate reconstructions, each with only half the data.

Figure 5: Center slice through the learned Fourier volume uncertainties σ_v. Left: real part; right: imaginary part. We learn the model fit with observations coming only from a 30° cone, a scenario similar to breast cancer scans, where observations are available only from some viewing directions. Uncertainty close to 1 means that the model has no information in these areas; uncertainty close to 0 represents areas of high sampling density. In contrast to other models, our model can identify precisely where information is missing (high variance).

6.3 Uncertainty identifies missing information

Above we discussed how global uncertainty estimation can help estimate the quality of fit. Here we demonstrate how local uncertainty can help evaluate the local quality of fit⁴. In many medical settings, such as breast cancer scans, only limited viewing directions are available. This is an issue in tomography, and it also occurs in single-particle cryo-EM when the distribution of particle orientations over the viewing sphere is highly anisotropic.
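The symmetry constraint of footnote 3 (real part even, imaginary part odd, as visible in the two panels of Figure 5) must also hold for reparameterized samples, which is why the Fourier-domain noise has to be correlated across mirrored frequencies. One simple way to obtain such noise, sketched here under our own conventions rather than as the paper's exact procedure, is to Fourier-transform white real-space noise:

```python
import numpy as np

def hermitian_noise(shape, rng):
    """Complex Fourier-domain noise whose inverse FFT is exactly real-valued:
    the FFT of white real-space noise satisfies Re eps[k] = Re eps[-k] and
    Im eps[k] = -Im eps[-k] by construction."""
    n = np.prod(shape)
    return np.fft.fftn(rng.standard_normal(shape)) / np.sqrt(n)

def sample_volume(mu, sigma, rng):
    """Reparameterized sample v = mu + sigma * eps in Fourier space; the
    Hermitian symmetry is preserved if mu and sigma satisfy it themselves."""
    return mu + sigma * hermitian_noise(mu.shape, rng)
```

Sampling this way keeps every draw of the latent Fourier volume consistent with a real-valued spatial-domain signal.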
To recreate the same effect, we construct a dataset of 40K observations, as before with no noise, to separate the sources of information corruption⁵. We restrict the viewing directions to Euler angles (α, β, γ) with β ∈ (−15°, +15°); i.e., there are no observations outside a 30° cone. As above, we assume a Gaussian posterior over the latent protein structure.

Figure 5 shows the result. The uncertainty the model has learned correlates with the regions in which data has been observed (uncertainty close to 0) and has not been observed (uncertainty close to 1). Due to the pressure from the KL divergence (see equation (14)), the latter areas of the model default to the prior N(v | 0, 𝟙). This approach can thus help identify areas of low local quality of fit.

⁴Other studies recognize the importance of local uncertainty estimation, measuring the FSC locally in a wavelet basis (Cardone et al., 2013; Kucukelbir et al., 2014; Vilas et al., 2018).

⁵For completeness, we present the same experiment with noise in the appendix.

6.4 Limits: Treating poses as random variables

Extending our method from treating structures as latent variables to treating poses as latent variables is difficult. In the following we analyze why this is the case. Note, though, that our method of estimating the latent structure can be combined with common methods of pose estimation, such as branch and bound (Punjani et al., 2017), without losing any of the benefits we offer. Ideally, however, it would be interesting to also learn latent pose posteriors. This would be useful, for example, to detect outliers in the dataset, which are common in real-world scenarios.

In an initial experiment, we fix the volume to the ground truth. We subsequently estimate only the poses, with a simple Dirac model for each data point, p_n ∼ δ(p_n | μ_{p_n}). In Figure 6, we demonstrate the resulting problem for SGD-based learning.
For one example (samples on top), we show its true error surface, the global optimum (red star), and the changes over the course of optimization (red line). We observe that, due to the many symmetries in the structure, the true pose posterior error surface has many symmetries as well. Depending on its starting position, an estimate seems to converge only to the closest local optimum rather than to the global one. We hope to fix this problem in the future by applying more advanced density estimation approaches.

7 Model Criticism and Future Work

This paper introduces practical probabilistic models into the scientific imaging pipeline, where practical refers to scalability through the use of the reparameterization trick. We show how to turn the operators in the pipeline into differentiable maps, as this is required to apply the trick. The main focus of the experiments is to show why this novelty is important, addressing issues such as data efficiency, local uncertainty, and cross-validation.

Specifically, we found that a parameterized distribution, i.e., the Gaussian, achieves the same quality of fit as a point estimate, i.e., the Dirac, while relying on less data. We conclude that our latent-variable model is a suitable building block. It can be plugged seamlessly into many SOTA approaches, such as those of De la Rosa-Trevín et al. (2013), Grigorieff (2007), Scheres (2012), and Tang et al. (2007). We also established that the learned uncertainty is predictive of locations with too few samples. Finally, we demonstrated the limits of our current methods in treating poses as latent variables. This problem, however, does not limit the applicability of our method to latent structures. We thus propose to combine common pose estimation with our latent-variable structure estimation. This method would benefit from the uncertainty measure while also finding globally optimal poses.
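The local-optimum failure mode described above is easy to reproduce in a toy setting. The sketch below (our illustration, not the paper's experiment) runs plain gradient descent on a 1D periodic "pose" loss with a 4-fold symmetry: the endpoint depends entirely on the initialization, and most starts end in a shallower symmetric local minimum rather than the global one.

```python
import numpy as np

def loss(theta):
    # Toy pose error surface with 4-fold symmetry: global minimum at theta = 0,
    # slightly shallower local minima near pi/2, pi, and 3*pi/2.
    return -np.cos(4 * theta) - 0.3 * np.cos(theta)

def grad(theta):
    return 4 * np.sin(4 * theta) + 0.3 * np.sin(theta)

def descend(theta0, lr=0.01, steps=2000):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta % (2 * np.pi)

# A start near the global basin reaches theta = 0, while a start near pi is
# trapped in the shallower local minimum there, mirroring the red trajectory
# of the pose estimate.
```

Symmetric structures produce exactly this kind of multi-basin error surface, so a gradient step can never carry the estimate across the barriers between basins.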
Figure 6: Example of gradient descent failing to estimate the latent pose of a protein. Small images, from left to right: observation, and the real and imaginary parts of its Fourier transform. Large figure: corresponding error surface of the poses for 2 of the 3 Euler angles. The red curve shows the progression of the pose estimate over the course of optimization. The optimization clearly fails to recover the true global optimum (red star).

In future work we hope to find a way to efficiently learn pose posterior distributions as well. A reasonable approach might be to use multi-modal distributions and thus more advanced density estimation techniques. We will also try to incorporate amortized inference, mentioned in Section 5. Amortization would give the additional advantage of being able to transfer knowledge from protein to protein. Transfer could then lead to more advanced noise and CTF models. Bias in transfer will be a key focus of this effort; i.e., we only want to transfer features of the noise and not the latent structure.

Another problem we see in the field of reconstruction algorithms is that model evaluation can only help to detect variance, but not bias, in a model class. This is a problem with FSC comparison, but also with our proposal. We believe that an estimate of the data log-likelihood of a held-out test dataset is generally much better suited. In a probabilistic view, this can be achieved by importance weighting the ELBO (Rezende et al., 2014b).

Acknowledgements

We thank the reviewers for their valuable feedback, in particular AnonReviewer1. This research was funded in part by Google and by the Canadian Institute for Advanced Research.

References

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. (2017). Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.

Baxter, W. T., Grassucci, R. A., Gao, H., and Frank, J. (2009).
Determination of signal-to-noise ratios and spectral SNRs in cryo-EM low-dose imaging of molecules. Journal of Structural Biology, 166(2):126–132.

Brubaker, M. A., Punjani, A., and Fleet, D. J. (2015). Building proteins in a day: Efficient 3D molecular reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3099–3108.

Cardone, G., Heymann, J. B., and Steven, A. C. (2013). One number does not fit all: Mapping local variations in resolution in cryo-EM reconstructions. Journal of Structural Biology, 184(2):226–236.

Chirikjian, G. and Kyatkin, A. (2000). Engineering Applications of Noncommutative Harmonic Analysis: With Emphasis on Rotation and Motion Groups. CRC Press.

De la Rosa-Trevín, J., Otón, J., Marabini, R., Zaldivar, A., Vargas, J., Carazo, J., and Sorzano, C. (2013). Xmipp 3.0: An improved software suite for image processing in electron microscopy. Journal of Structural Biology, 184(2):321–328.

Egelman, E. H. (2016). The current revolution in cryo-EM. Biophysical Journal, 110(5):1008–1012.

Faruqi, A., Cattermole, D., Henderson, R., Mikulec, B., and Raeburn, C. (2003). Evaluation of a hybrid pixel detector for electron microscopy. Ultramicroscopy, 94(3-4):263–276.

Grigorieff, N. (1998). Three-dimensional structure of bovine NADH:ubiquinone oxidoreductase (complex I) at 22 Å in ice. Journal of Molecular Biology, 277(5):1033–1046.

Grigorieff, N. (2007). FREALIGN: High-resolution refinement of single particle structures. Journal of Structural Biology, 157(1):117–125.

Havin, V. and Jöricke, B. (1994). The Uncertainty Principle in Harmonic Analysis. Springer-Verlag.

Henderson, R., Sali, A., Baker, M. L., Carragher, B., Devkota, B., Downing, K. H., Egelman, E. H., Feng, Z., Frank, J., Grigorieff, N., et al. (2012). Outcome of the first electron microscopy validation task force meeting. Structure, 20(2):205–214.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025.

Jaitly, N., Brubaker, M. A., Rubinstein, J. L., and Lilien, R. H. (2010). A Bayesian method for 3D macromolecular structure inference using class average images from single particle electron microscopy. Bioinformatics, 26(19):2406–2415.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1998). An introduction to variational methods for graphical models. In Learning in Graphical Models, pages 105–161. Springer.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint.

Kohl, H. and Reimer, L. (2008). Transmission Electron Microscopy: Physics of Image Formation. Springer.

Kucukelbir, A., Sigworth, F. J., and Tagare, H. D. (2014). Quantifying the local resolution of cryo-EM density maps. Nature Methods, 11(1):63.

Langlois, R., Pallesen, J., Ash, J. T., Ho, D. N., Rubinstein, J. L., and Frank, J. (2014). Automated particle picking for low-contrast macromolecules in cryo-electron microscopy. Journal of Structural Biology, 186(1):1–7.

Marino, J., Yue, Y., and Mandt, S. (2018). Iterative amortized inference. arXiv preprint.

Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C., and Ferrin, T. E. (2004). UCSF Chimera: a visualization system for exploratory research and analysis. Journal of Computational Chemistry, 25(13):1605–1612.

Pintilie, G. (2010). Greg Pintilie's homepage: people.csail.mit.edu/gdp/cryoem.html.

Punjani, A., Rubinstein, J. L., Fleet, D. J., and Brubaker, M. A. (2017). cryoSPARC: Algorithms for rapid unsupervised cryo-EM structure determination. Nature Methods, 14(3):290–296.

Rezende, D. J., Eslami, S. A., Mohamed, S., Battaglia, P., Jaderberg, M., and Heess, N. (2016). Unsupervised learning of 3D structure from images.
In Advances in Neural Information Processing Systems, pages 4996–5004.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014a). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014b). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1278–1286.

Rosenthal, P. B. and Henderson, R. (2003). Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. Journal of Molecular Biology, 333(4):721–745.

Rupp, B. (2009). Biomolecular Crystallography: Principles, Practice, and Application to Structural Biology. Garland Science.

Scheres, S. H. (2012). RELION: Implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology, 180(3):519–530.

Scheres, S. H., Gao, H., Valle, M., Herman, G. T., Eggermont, P. P., Frank, J., and Carazo, J.-M. (2007a). Disentangling conformational states of macromolecules in 3D-EM through likelihood optimization. Nature Methods, 4(1):27.

Scheres, S. H., Núñez-Ramírez, R., Gómez-Llorente, Y., San Martín, C., Eggermont, P. P., and Carazo, J. M. (2007b). Modeling experimental image formation for likelihood-based classification of electron microscopy data. Structure, 15(10):1167–1177.

Sigworth, F. (1998). A maximum-likelihood approach to single-particle image refinement. Journal of Structural Biology, 122(3):328–339.

Sigworth, F. J. (2016). Principles of cryo-EM single-particle image processing. Microscopy, 65(1):57–67.

Tang, G., Peng, L., Baldwin, P. R., Mann, D. S., Jiang, W., Rees, I., and Ludtke, S. J. (2007). EMAN2: An extensible image processing suite for electron microscopy. Journal of Structural Biology, 157(1):38–46.
Unser, M., Trus, B. L., and Steven, A. C. (1987). A new resolution criterion based on spectral signal-to-noise ratios. Ultramicroscopy, 23(1):39–51.

Vilas, J. L., Gómez-Blanco, J., Conesa, P., Melero, R., de la Rosa-Trevín, J. M., Otón, J., Cuenca, J., Marabini, R., Carazo, J. M., Vargas, J., et al. (2018). MonoRes: Automatic and accurate estimation of local resolution for electron microscopy maps. Structure, 26(2):337–344.

Vulović, M., Ravelli, R. B., van Vliet, L. J., Koster, A. J., Lazić, I., Lücken, U., Rullgård, H., Öktem, O., and Rieger, B. (2013). Image formation modeling in cryo-electron microscopy. Journal of Structural Biology, 183(1):19–32.

Xu, Z., Horwich, A. L., and Sigler, P. B. (1997). The crystal structure of the asymmetric GroEL–GroES–(ADP)7 chaperonin complex. Nature, 388(6644):741.

Zhao, J., Brubaker, M. A., and Rubinstein, J. L. (2013). TMaCS: A hybrid template matching and classification system for partially-automated particle selection. Journal of Structural Biology, 181(3):234–242.

Appendices

A Visual impressions

First, to give a visual impression of the observations we learn from, we present in Figure 7 samples from the three datasets we use. All of the datasets are based on the same estimate of the GroEL-GroES protein (Xu et al., 1997). They differ in their signal-to-noise ratio (SNR). The left column represents noise-free data, the middle a moderate, common noise level, and the right an extreme level of noise. For each observation example, we also show the corresponding Fourier transforms, with the real part in the second row and the imaginary part in the third row.

Figure 7: Left to right: samples from the datasets we use: (i) no noise (as in Experiment 6.1), (ii) moderate noise, and (iii) high noise (as in Experiment 6.2). Top to bottom: observation in (a) real space, and the first 20 Fourier shells, (b) real part and (c) imaginary part (log-scaled for better visibility). The latter two are used for the optimization, due to the application of the Fourier slice theorem explained in Section 4.

Further, we present the qualitative results of fitting the medium- and high-noise datasets with our method. We visualize the respective protein fits from Experiment 6.2 in Figure 8, rendered with the ChimeraX software package (Pettersen et al., 2004). The two pictures in the top row show the medium-noise fit; the two in the bottom row show the high-noise fit.

Figure 8: Top: side and top view of the GroEL-GroES protein fit with moderate-noise data. Bottom: side and top view for the high-noise dataset.

B Extension to experiment 6.3

We execute the same experiment as in Section 6.3 on the dataset with intermediate noise. We display the results in Figure 9. While missing information leads to large variance, we also find that the noise induces some variance in the observed area. Again we visualize the result of the fit, in Figure 10.

C Amortized inference and variational EM for pose estimation

We have not used amortized inference in our experiments. In Experiment 6.4 we modelled poses as local variables and trained them by variational expectation maximization. Other work shows that the amortization gap can be significant (Scheres et al., 2007b; Alemi et al., 2017; Marino et al., 2018). Hence, in order to exclude the gap as a reason for failure, we decided to model local variables. We did run experiments with ResNets as encoders, without success either. We believe the core problem in probabilistic pose estimation is the number of local optima. This makes simple SGD a somewhat poor choice, because we rely on finding the global minimum.

Figure 9: Center slice through the learned Fourier volume uncertainties σ_v.
Left: real part; right: imaginary part. We learn the model fit with observations coming only from a 30° cone, a scenario similar to breast cancer scans, where observations are available only from some viewing directions. Uncertainty close to 1 means that the model has no information in these areas; uncertainty close to 0 represents areas of high sampling density. In contrast to other models, our model can identify precisely where information is missing (high variance).

Figure 10: Side and top view of the GroEL-GroES protein fit with moderate-noise data, with all observations stemming from a limited pose space.

D Remarks on the chosen observation model

The Gaussian noise model is common but surely oversimplified (Sigworth, 1998). A better model would be the Poisson distribution. The Gaussian is a good approximation to it given a high rate parameter, i.e., if there is a reasonably high count of radiation hitting the sensors. This is a good assumption for most methods in the field, but it can actually be a poor model in some cases of cryo-electron microscopy. An example of a more elaborate model is presented in Vulović et al. (2013).
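The high-count argument above can be checked numerically. The snippet below is a minimal sketch with an arbitrary rate: for a large rate λ, Poisson(λ) samples have mean and variance close to λ, the two moments of the approximating Gaussian N(λ, λ).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1000.0                                 # high expected count per pixel (arbitrary)
counts = rng.poisson(lam, size=200_000)

# Poisson(lam) has mean lam and variance lam; for large lam it is well
# approximated by the Gaussian N(lam, lam) with the same two moments.
mean_err = abs(counts.mean() - lam) / lam
var_err = abs(counts.var() - lam) / lam

# Fraction of samples within one standard deviation, vs. the Gaussian ~0.68.
frac_1sd = np.mean(np.abs(counts - lam) < np.sqrt(lam))
```

At low counts, by contrast, the Poisson distribution is visibly skewed and the Gaussian substitution breaks down, which is the regime relevant to low-dose cryo-EM.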
