Order-preserving factor analysis (OPFA)
We present a novel factor analysis method that can be applied to the discovery of common factors shared among trajectories in multivariate time series data. These factors satisfy a precedence-ordering property: certain factors are recruited only afte…
Authors: Arnau Tibau Puig, Alfred O. Hero III
Order-preserving factor analysis (OPFA)
Arnau Tibau Puig and Alfred O. Hero III
EECS Department, University of Michigan, Ann Arbor, MI 48109-2122, USA
Communications and Signal Processing Laboratory Technical Report: cspl-396
Date: May 5, 2011 (v2)
A shorter version of this TR was accepted for publication in the IEEE Transactions on Signal Processing in April 2011.

Index Terms: Dictionary learning, structured factor analysis, genomic signal processing, misaligned data processing

I. INTRODUCTION

With the advent of high-throughput data collection techniques, low-dimensional matrix factorizations have become an essential tool for pre-processing, interpreting or compressing high-dimensional data. They are widely used in a variety of signal processing domains including electrocardiogram [1], image [2], or sound [3] processing. These methods can take advantage of a large range of a priori knowledge on the form of the factors, enforcing it through constraints on sparsity or patterns in the factors. However, these methods do not work well when there are unknown misalignments between subjects in the population, e.g., unknown subject-specific time shifts. In such cases, one cannot apply standard patterning constraints without first aligning the data, a difficult task. An alternative approach, explored in this paper, is to impose a factorization constraint that is invariant to factor misalignments but preserves the relative ordering of the factors over the population. This order-preserving factor analysis is accomplished using a penalized least squares formulation with shift-invariant yet order-preserving model selection (group lasso) penalties on the factorization. As a byproduct the factorization produces estimates of the factor ordering and the order-preserving time shifts.
In traditional matrix factorization, the data is modeled as a linear combination of a number of factors. Thus, given an $n \times p$ data matrix $X$, the Linear Factor model is defined as:

$$X = MA + \epsilon, \qquad (1)$$

where $M$ is an $n \times f$ matrix of factor loadings or dictionary elements, $A$ is an $f \times p$ matrix of scores (also called coordinates) and $\epsilon$ is a small residual. For example, in a gene expression time course analysis, $n$ is the number of time points and $p$ is the number of genes in the study; the columns of $M$ contain the features summarizing the genes' temporal trajectories and the columns of $A$ represent the coordinates of each gene on the space spanned by $M$. Given this model, the problem is to find a parsimonious factorization that fits the data well according to selected criteria, e.g., minimizing the reconstruction error or maximizing the explained variance. There are two main approaches to such a parsimonious factorization. One, called Factor Analysis, assumes that the number of factors is small and yields a low-rank matrix factorization [4], [5]. The other, called Dictionary Learning [6], [7] or Sparse Coding [8], assumes that the loading matrix $M$ comes from an overcomplete dictionary of functions and results in a sparse score matrix $A$. There are also hybrid approaches such as Sparse Factor Analysis [1], [9], [2] that try to enforce low rank and sparsity simultaneously. In many situations, we observe not one but several matrices $X_s$, $s = 1, \cdots, S$, and there are physical grounds for believing that the $X_s$'s share an underlying model. This happens, for instance, when the observations consist of different time-blocks of sound from the same music piece [3], [10], when they consist of time samples of gene expression microarray data from different individuals inoculated with the same virus [11], or when they arise from the reception of digital data with code, spatial and temporal diversity [12].
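For concreteness, the low-rank route in model (1) can be sketched with a truncated SVD, which gives the best rank-$f$ fit in reconstruction error. This is an illustrative sketch only, not part of the OPFA method; all names below are ours:

```python
import numpy as np

def linear_factor_fit(X, f):
    """Rank-f least-squares fit of the linear factor model X ~ M A (via SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :f] * s[:f], Vt[:f, :]   # M: n x f loadings, A: f x p scores

rng = np.random.default_rng(0)
M_true = rng.standard_normal((20, 2))    # n = 20 time points, f = 2 factors
A_true = rng.standard_normal((2, 100))   # p = 100 variables
X = M_true @ A_true + 0.01 * rng.standard_normal((20, 100))
M, A = linear_factor_fit(X, 2)
rel_err = np.linalg.norm(X - M @ A) / np.linalg.norm(X)
print(rel_err < 0.05)                    # rank-2 fit captures nearly all energy
```

The same decomposition underlies the PCA and Sparse-PCA special cases discussed later; OPFA departs from it by letting the loadings shift per subject.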
In these situations, the fixed factor model (1) is overly simplistic. An example, which is the main motivation for this work, is shown in Figure 1, which shows the effect of temporal misalignment across subjects in a viral challenge study reported in [11]. Figure 1 shows the expression trajectory for a particular gene that undergoes an increase (up-regulation) after viral inoculation at time 0, where the moment when up-regulation occurs differs over the population. Training the model (1) on this data will produce poor fit due to misalignment of gene expression onset times.

[Fig. 1. Example of temporal misalignment across subjects of upregulated gene CCRL2. Subject 6 and subject 10 show the earliest and the latest up-regulation responses, respectively.]

A more sensible approach for the data in Figure 1 would be to separately fit each subject with a translated version of a common up-regulation factor. This motivates the following extension of model (1), where the factor matrices $M_s$, $A_s$ are allowed to vary across observations. Given a number $S$ of $n \times p$ data matrices $X_s$, we let:

$$X_s = M_s A_s + \epsilon_s, \quad s = 1, \cdots, S. \qquad (2)$$

Following the gene expression example, here $n$ is the number of time points, $p$ is the number of genes in the study, and $S$ is the number of subjects participating in the study. Hence, the $n \times f$ matrices $M_s$ contain the translated temporal features corresponding to the $s$-th subject and the $f \times p$ matrices $A_s$ accommodate the possibility of subjects having different mixing weights.
For different constraints on $M_s$, $A_s$, this model specializes to several well-known paradigms such as Principal Components Analysis (PCA) [4], sparse PCA [1], k-SVD [6], structured PCA [2], Non-Negative Matrix Factorization (NNMF) [13], Maximum-Margin Matrix Factorization (MMMF) [14], Sparse Shift-invariant models [3], Parallel Factor Analysis (PARAFAC) [5], [15] or Higher-Order SVD (HOSVD) [16]. Table I summarizes the characteristics of these decomposition models when seen as different instances of the general model (2).

TABLE I: SPECIAL CASES OF THE GENERAL MODEL (2).

Decomposition | Structure of M_s | Structure of A_s | Uniqueness | Reference
PCA | Orthogonal M_s = F | Orthogonal A_s | Yes | SVD
Sparse-PCA | Sparse M_s = F | Sparse A_s | No | Sparse PCA [1], [17], k-SVD [6], PMD [9]
Structured-PCA | M_s = F | Structured sparse A_s | No | [2]
NNMF | Non-negative M_s = F | Non-negative A_s | No | [13]
Sparse shift-invariant models | M_s = [M(F, d_1) · · · M(F, d_D)], where {d_j}_{j=1}^{D} are all possible translations of the n-dimensional vectors in F | Sparse A_s | No | [18], [3], [10]
PARAFAC/CP | M_s = F | A_s = diag(C_{·,s}) B' | Yes | [15]
HOSVD | Orthogonal M_s = F | A_s = (G ×_3 C_{·,s}) B', where slices of G are orthogonal | Yes | [16]
OPFA | M_s = M(F, d_s), d_s ∈ K, where F is smooth and non-negative and K enforces a consistent precedence order | Non-negative, sparse A_s | No | This work

In this paper, we will restrict the columns of $M_s$ to be translated versions of a common set of factors, where these factors have onsets that occur in some relative order that is consistent across all subjects. Our model differs from previous shift-invariant models considered in [18], [3], [10] in that it restricts the possible shifts to those which preserve the relative order of the factors among different subjects.
We call the problem of finding a decomposition (2) under this assumption the Order Preserving Factor Analysis (OPFA) problem. The contributions of this paper are the following. First, we propose a non-negatively constrained linear model that accounts for temporally misaligned factors and order restrictions. Second, we give a computational algorithm that allows us to fit this model in reasonable time. Finally, we demonstrate that our methodology is able to successfully extract the principal features in a simulated dataset and in a real gene expression dataset. In addition, we show that the application of OPFA produces factors that can be used to significantly reduce the variability in clustering of gene expression responses.

This paper is organized as follows. In Section II we present the biological problem that motivates OPFA and introduce our mathematical model. In Section III, we formulate the non-convex optimization problem associated with the fitting of our model and give a simple local optimization algorithm. In Section IV we apply our methodology to both synthetic data and real gene expression data. Finally we conclude in Section V. For lack of space many technical details are left out of our presentation but are available in the accompanying technical report [19].

II. MOTIVATION: GENE EXPRESSION TIME-COURSE DATA

In this section we motivate the OPFA mathematical model in the context of gene expression time-course analysis. Temporal profiles of gene expression often exhibit motifs that correspond to cascades of up-regulation/down-regulation patterns. For example, in a study of a person's host immune response after inoculation with a certain pathogen, one would expect genes related to immune response to exhibit consistent patterns of activation across pathogens, persons, and environmental conditions.
A simple approach to characterize the response patterns is to encode them as sequences of a few basic motifs such as (see, for instance, [20]):
• Up-regulation: Gene expression changes from low to high.
• Down-regulation: Gene expression changes from a high to a low level.
• Steady: Gene expression does not vary.
If gene expression is coherent over the population of several individuals, e.g., in response to a common viral insult, the response patterns can be expected to show some degree of consistency across subjects. Human immune system response is a highly evolved system in which several biological pathways are recruited and organized over time.

[Fig. 2. Example of gene patterns with a consistent precedence-order across 3 subjects. The down-regulation motif of gene CD1C precedes the peak motif of gene ORM1 across these three subjects.]

Some of these pathways will be composed of genes whose expressions obey a precedence-ordering, e.g., virally induced ribosomal protein production may precede toll-like receptor activation and antigen presentation [21]. This consistency exists despite temporal misalignment: even though the order is preserved, the specific timing of these events can vary across the individuals. For instance, two different persons can have different inflammatory response times, perhaps due to a slower immune system in one of the subjects. This precedence-ordering of motifs in the sequence of immune system response events is invariant to time shifts that preserve the ordering.
Thus if a motif in one gene precedes another motif in another gene for a few subjects, we might expect the same precedence relationship to hold for all other subjects. Figure 2 shows two genes from [11] whose motif precedence-order is conserved across 3 different subjects. This conservation of order allows one to impose ordering constraints on (2) without actually knowing the particular order or the particular factors that obey the order-preserving property. Often genes are co-regulated or co-expressed and have highly correlated expression profiles. This can happen, for example, when the genes belong to the same signaling pathway. Figure 3 shows a set of different genes that exhibit a similar expression pattern (up-regulation motif). The existence of high correlation between large groups of genes allows one to impose a low rank property on the factorization in (2).

[Fig. 3. Example of gene patterns exhibiting co-expression for a particular subject in the viral challenge study in [11].]

In summary, our OPFA model is based on the following assumptions:
• A1: Motif consistency across subjects: Gene expression patterns have consistent (though not necessarily time aligned) motifs across subjects undergoing a similar treatment.
• A2: Motif sequence consistency across subjects: If motif X precedes motif Y for subject $s$, the same precedence must hold for subject $t \neq s$.
• A3: Motif consistency across groups of genes: There are (not necessarily known) groups of genes that exhibit the same temporal expression patterns for a given subject.
• A4: Gene expression data is non-negative: Gene expression on a microarray is measured as an abundance and standard normalization procedures, such as RMA [22], preserve the non-negativity of this measurement.
A few microarray normalization software packages produce gene expression scores that do not satisfy the non-negativity assumption A4. In such cases, the non-negativity constraint in the algorithm implementing (9) can be disabled. Note that in general, only a subset of genes may satisfy assumptions A1-A3.

III. OPFA MATHEMATICAL MODEL

In the OPFA model, each of the $S$ observations is represented by a linear combination of temporally aligned factors. Each observation is of dimension $n \times p$, where $n$ is the number of time points and $p$ is the number of genes under consideration. Let $F$ be an $n \times f$ matrix whose columns are the $f$ common alignable factors, and let $M(F, d)$ be a matrix valued function that applies a circular shift to each column of $F$ according to the vector of shift parameters $d$, as depicted in Figure 4. Then, we can refine model (2) by restricting $M_s$ to have the form:

$$M_s = M(F, d_s), \qquad (3)$$

[Fig. 4. Each subject's factor matrix $M_s$ is obtained by applying a circular shift to a common set of factors $F$ parameterized by a vector $d$.]

where $d_s \in \{0, \cdots, d_{\max}\}^f$ and $d_{\max} \le n$ is the maximum shift allowed in our model. This model is a generalization of a simpler one that restricts all factors to be aligned but with a common delay:

$$M_s = U_s F, \qquad (4)$$

where $U_s$ is a circular shift operator. Specifically, the fundamental characteristic of our model (3) is that each column can have a different delay, whereas (4) is a restriction of (3) with $d^s_i = d^s_j$ for all $s$ and all $i, j$. The circular shift is not restrictive. By embedding the observation into a larger time window it can accommodate transient gene expression profiles in addition to periodic ones, e.g., circadian rhythms [19]. There are several ways to do this embedding. One way is to simply extrapolate the windowed, transient data to a larger number of time points $n_F = n + d_{\max}$.
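A minimal sketch of the operator $M(F, d)$, assuming `np.roll` semantics for the circular shift (the function name is ours, not the paper's):

```python
import numpy as np

def shift_factors(F, d):
    """M(F, d): circularly shift the j-th column of F by d[j] time samples."""
    return np.column_stack([np.roll(F[:, j], int(d[j])) for j in range(F.shape[1])])

F = np.zeros((6, 2))
F[0, :] = 1.0                    # two impulse "factors" with onset at time 0
M_s = shift_factors(F, [1, 3])   # subject-specific delays d_s = (1, 3)
print(np.argmax(M_s, axis=0))    # onsets move to times 1 and 3 -> [1 3]
```

The per-column shift is exactly what distinguishes (3) from the common-delay model (4), where a single roll would be applied to all columns at once.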
This is the strategy we follow in the numerical experiments of Section IV-B. This alignable factor model parameterizes each observation's intrinsic temporal dynamics through the $f$-dimensional vector $d_s$. The precedence-ordering constraint A2 is enforced by imposing the condition

$$d^{s_1}_{j_1} \le d^{s_1}_{j_2} \Leftrightarrow d^{s_2}_{j_1} \le d^{s_2}_{j_2} \quad \forall s_2 \neq s_1, \qquad (5)$$

that is, if factor $j_1$ precedes factor $j_2$ in subject $s_1$, then the same ordering will hold in all other subjects. Since the indexing of the factors is arbitrary, we can assume without loss of generality that $d^s_i \le d^s_{i+1}$ for all $i$ and all $s$. This characterization constrains each observation's delays $d_s$ independently, allowing for a computationally efficient algorithm for fitting model (3).

A. Relationship to 3-way factor models

Our proposed OPFA framework is significantly different from other factor analysis methods and these differences are illustrated in the simulated performance comparisons below. However, there are some similarities, especially to 3-way factor models [15], [23], that are worth pointing out. An $n$-th order tensor or $n$-way array is a data structure whose elements are indexed by an $n$-tuple of indices [23]. $n$-way arrays can be seen as multidimensional generalizations of vectors and matrices: a 1-way array is a vector and a 2-way array is a matrix. Thus, we can view our observations $X_s$ as the slices of a third order tensor $\mathcal{X}$ of dimension $p \times n \times S$: $X_s = \mathcal{X}_{\cdot,\cdot,s}$. Tensor decompositions aim at extending the ideas of matrix (second order array) factorizations to higher order arrays [15], [23] and have found many applications in signal processing and elsewhere [23], [15], [16], [12], [24].
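Condition (5) says the ranking of the factor delays agrees across subjects. A small consistency check can be written as follows (a sketch that ignores ties, which (5) handles through the two-sided implication):

```python
import numpy as np

def order_preserving(delays):
    """Check condition (5): every subject ranks the factor delays identically.

    delays: S x f array, delays[s, j] = shift of factor j for subject s.
    """
    ranks = [tuple(np.argsort(d, kind="stable")) for d in delays]
    return all(r == ranks[0] for r in ranks)

print(order_preserving(np.array([[0, 2, 5], [1, 3, 7]])))  # True: same order
print(order_preserving(np.array([[0, 2, 5], [3, 1, 7]])))  # False: factors swap
```

With the without-loss-of-generality reindexing from the text, the check reduces to verifying that each row of `delays` is non-decreasing.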
Since our data tensor is of order 3, we will only consider here 3-way decompositions, which typically take the following general form:

$$X_{i,j,k} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} G_{pqr} F_{ip} B_{jq} C_{kr}, \qquad (6)$$

where $P$, $Q$, $R$ are the number of columns in each of the factor matrices $F$, $B$, $C$ and $G$ is a $P \times Q \times R$ tensor. This class of decompositions is known as the Tucker model. When orthogonality is enforced among $F$, $B$, $C$ and different matrix slices of $G$, one obtains the Higher Order SVD [16]. When $G$ is a superdiagonal tensor (i.e., $G_{ijk} = 0$ except for $i = j = k$) and $P = Q = R$, this model amounts to the PARAFAC/Canonical Decomposition (CP) model [5], [12], [24]. The PARAFAC model is the closest to OPFA. Under this model, the slices of $X_{i,j,k}$ can be written as:

$$X^{CP}_s = F \,\mathrm{diag}(C_{\cdot,s})\, B'. \qquad (7)$$

This expression is to be compared with our OPFA model, which we state again here for convenience:

$$X^{OPFA}_s = M(F, d_s) A_s. \qquad (8)$$

Essentially, (7) shows that the PARAFAC decomposition is a special case of the OPFA model (8) where the factors are fixed ($M_s = F$) and the scores only vary in magnitude across observations ($A_s = \mathrm{diag}(C_{\cdot,s}) B'$). This structure enhances uniqueness (under some conditions concerning the linear independence of the vectors in $F$, $B$, $C$; see [15]) but lacks the additional flexibility necessary to model possible translations in the columns of the factor matrix $F$. If $d_s = 0$ for all $s$, then the OPFA model (8) and the Linear Factor model (1) also coincide. The OPFA model can therefore be seen as an extension of the Linear Factor and PARAFAC models where the factors are allowed to undergo order-preserving circular translations across different individuals.

B. OPFA as an optimization problem

OPFA tries to fit the model (2)-(5) to the data $\{X_s\}_{s=1}^S$.
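The containment of PARAFAC in OPFA can be verified numerically: with zero delays, so that $M_s = F$, the slices (7) and (8) coincide for any $B$ and $C$. A small sketch with random matrices (all variable names ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S, f = 8, 5, 3, 2
F = rng.random((n, f))     # common temporal factors
B = rng.random((p, f))     # PARAFAC gene-mode factors
C = rng.random((f, S))     # per-subject magnitudes

for s in range(S):
    X_cp = F @ np.diag(C[:, s]) @ B.T     # PARAFAC slice, eq. (7)
    A_s = np.diag(C[:, s]) @ B.T          # OPFA scores when M_s = F
    X_opfa = F @ A_s                      # OPFA slice, eq. (8) with d_s = 0
    assert np.allclose(X_cp, X_opfa)
print("PARAFAC slices are OPFA slices with zero delays")
```

The converse fails as soon as some $d_s \neq 0$: the shifted columns of $M(F, d_s)$ are no longer expressible as a fixed $F$ times a rank-one reweighting.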
For this purpose, we define the following penalized and constrained least squares problem:

$$\min \sum_{s=1}^{S} \| X_s - M(F, d_s) A_s \|_F^2 + \lambda P_1(A_1, \cdots, A_S) + \beta P_2(F) \qquad (9)$$
$$\text{s.t. } \{d_s\}_s \in \mathcal{K}, \quad F \in \mathcal{F}, \quad A_s \in \mathcal{A}_s,$$

where $\|\cdot\|_F$ is the Frobenius norm, $\lambda$ and $\beta$ are regularization parameters, and the set $\mathcal{K}$ constrains the delays $d_s$ to be order-preserving:

$$\mathcal{K} = \left\{ d \in \{0, \cdots, d_{\max}\}^f : d_{i+1} \ge d_i, \ \forall i \right\}, \qquad (10)$$

where $d_{\max} \le n$. The other soft and hard constraints are briefly described as follows. For the gene expression application we wish to extract factors $F$ that are smooth over time and non-negative. Smoothness will be captured by the constraint that $P_2(F)$ is small, where $P_2(F)$ is the squared total variation operator

$$P_2(F) = \sum_{i=1}^{f} \| W F_{\cdot,i} \|_2^2, \qquad (11)$$

where $W$ is an appropriate weighting matrix and $F_{\cdot,i}$ denotes the $i$-th column of matrix $F$. From A4, the data is non-negative and hence non-negativity is enforced on $F$ and the loadings $A_s$ to avoid masking of positive and negative valued factors whose overall contribution sums to zero. To avoid the numerical instability associated with the scale invariance $MA = \frac{1}{\alpha} M \,\alpha A$ for any $\alpha > 0$, we constrain the Frobenius norm of $F$. This leads to the following constraint sets:

$$\mathcal{F} = \left\{ F \in \mathbb{R}^{n \times f}_+ : \|F\|_F \le \delta \right\}, \qquad \mathcal{A}_s = \mathbb{R}^{f \times p}_+, \quad s = 1, \cdots, S. \qquad (12)$$

The parameter $\delta$ above will be fixed to a positive value as its purpose is purely computational and has little practical impact. Since the factors $F$ are common to all subjects, assumption A3 requires that the number of columns of $F$ (and therefore, its rank) is small compared to the number of genes $p$. In order to enforce A1 we consider two different models. In the first model, which we shall name OPFA, we constrain the columns of $A_s$ to be sparse and the sparsity pattern to be consistent across different subjects.
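Evaluating objective (9) with penalties (11) and (13) can be sketched as below. The weighting matrix $W$ is taken here to be a first-difference operator, one natural smoothness choice; the paper leaves $W$ unspecified, so this is an assumption:

```python
import numpy as np

def shift_factors(F, d):
    """M(F, d): per-column circular shift of the factor matrix."""
    return np.column_stack([np.roll(F[:, j], int(d[j])) for j in range(F.shape[1])])

def opfa_objective(Xs, F, ds, As, lam, beta):
    """Penalized least-squares objective (9): fit + lam*P1 + beta*P2."""
    n = F.shape[0]
    W = np.eye(n, k=1)[: n - 1] - np.eye(n)[: n - 1]     # first differences (assumed W)
    fit = sum(np.linalg.norm(X - shift_factors(F, d) @ A, "fro") ** 2
              for X, d, A in zip(Xs, ds, As))
    P1 = np.linalg.norm(np.stack(As), axis=0).sum()      # group lasso over subjects, eq. (13)
    P2 = (np.linalg.norm(W @ F, axis=0) ** 2).sum()      # squared total variation, eq. (11)
    return fit + lam * P1 + beta * P2

# tiny check: constant factor, exact data -> fit = 0, P2 = 0, P1 = 2
F = np.ones((4, 1)); As = [np.ones((1, 2))]; ds = [[0]]
Xs = [shift_factors(F, ds[0]) @ As[0]]
print(opfa_objective(Xs, F, ds, As, lam=1.0, beta=1.0))  # -> 2.0
```

`np.stack(As)` has shape $S \times f \times p$, so the norm along axis 0 forms exactly the groups of (13): the $S$ values of each score across subjects.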
Notice that A1 does not imply that the mixing weights $A_s$ are the same for all subjects, as this would not accommodate magnitude variability across subjects. We also consider a more restrictive model where we constrain $A_1 = \cdots = A_S = A$ with sparse $A$; we call this model OPFA-C, the C standing for the additional constraint that the subjects share the same sequence $A$ of mixing weights. The OPFA-C model has a smaller number of parameters than OPFA, possibly at the expense of introducing bias with respect to the unconstrained model. A similar constraint has been successfully adopted in [25] in a factor model for multi-view learning. Similarly to the approach taken in [26] in the context of simultaneous sparse coding, the common sparsity pattern for OPFA is enforced by constraining $P_1(A_1, \cdots, A_S)$ to be small, where $P_1$ is a mixed-norm group-Lasso type penalty function [27]. For each of the $p \times f$ score variables, we create a group containing its $S$ different values across subjects:

$$P_1(A_1, \cdots, A_S) = \sum_{i=1}^{p} \sum_{j=1}^{f} \left\| \left( [A_1]_{j,i}, \cdots, [A_S]_{j,i} \right) \right\|_2. \qquad (13)$$

Table II summarizes the constraints of each of the models considered in this paper. Following common practice in factor analysis, the non-convex problem (9) is addressed using Block Coordinate Descent, displayed in Algorithm 1, which iteratively minimizes (9) with respect to the shift parameters $\{d_s\}_{s=1}^S$, the scores $\{A_s\}_{s=1}^S$ and the factors $F$ while keeping the other variables fixed. This algorithm is guaranteed to monotonically decrease the objective function at each iteration. Since both the Frobenius norm and $P_1(\cdot)$, $P_2(\cdot)$ are non-negative functions, this ensures that the algorithm converges to a (possibly local) minimum or a saddle point of (9).
The subroutines EstimateFactors and EstimateScores solve the following penalized regression problems:

$$\min_F \sum_{s=1}^{S} \| X_s - M(F, d_s) A_s \|_F^2 + \beta \sum_{i=1}^{f} \| W F_{\cdot,i} \|_2^2 \qquad (14)$$
$$\text{s.t. } \|F\|_F^2 \le \delta, \quad F_{i,j} \ge 0, \ i = 1, \cdots, n, \ j = 1, \cdots, f,$$

Algorithm 1: BCD algorithm for finding a local minimum of (9).
OPFA_Objective(F, {A_s}_{s=1}^S, {d_s}_{s=1}^S) denotes the objective function in (9).
Input: Initial estimates of F and {A_s}_{s=1}^S; ε, λ, β.
Output: F, {A_s}_{s=1}^S, {d_s}_{s=1}^S
  c_0 = ∞
  c_1 ← OPFA_Objective(F, {A_s}_{s=1}^S, {d_s}_{s=1}^S)
  t = 1
  while c_{t-1} - c_t ≥ ε do
    {d_s}_{s=1}^S ← EstimateDelays(F, {A_s}_{s=1}^S)
    {A_s}_{s=1}^S ← EstimateScores(F, {d_s}_{s=1}^S)
    F ← EstimateFactors({A_s}_{s=1}^S, {d_s}_{s=1}^S)
    c_t ← OPFA_Objective(F, {A_s}_{s=1}^S, {d_s}_{s=1}^S)
    t ← t + 1

and

$$\min_{\{A_s\}_{s=1}^S} \sum_{s=1}^{S} \| X_s - M(F, d_s) A_s \|_F^2 + \lambda \sum_{i=1}^{p} \sum_{j=1}^{f} \left\| \left( [A_1]_{j,i}, \cdots, [A_S]_{j,i} \right) \right\|_2 \qquad (15)$$
$$\text{s.t. } [A_s]_{j,i} \ge 0, \ i = 1, \cdots, p, \ j = 1, \cdots, f, \ s = 1, \cdots, S. \qquad (16)$$

Notice that in OPFA-C, we also incorporate the constraint $A_1 = \cdots = A_S$ in the optimization problem above. The first is a convex quadratic problem with a quadratic and a linear constraint over a domain of dimension $fn$. In the applications considered here, both $n$ and $f$ are small and hence this problem can be solved using any standard convex optimization solver. EstimateScores is trickier because it involves a non-differentiable convex penalty and the dimension of its domain is equal to $Sfp$ (in the OPFA-C model, the additional constraint $A_1 = \cdots = A_S = A$ reduces the dimension to $fp$), where $p$ can be very large. In our implementation, we use an efficient first-order method [28] designed for convex problems involving a quadratic term, a non-smooth penalty and a separable constraint set. These procedures are described in more detail in Appendix C and therefore we focus on the EstimateDelays subroutine.
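The outer loop of Algorithm 1 can be sketched generically. The `estimate_*` callables are stand-ins for the EstimateDelays, EstimateScores and EstimateFactors subroutines (the toy demo below uses inert stand-ins, not the real subproblem solvers):

```python
import numpy as np

def bcd_opfa(F0, As0, objective, estimate_delays, estimate_scores,
             estimate_factors, eps=1e-6, max_iter=100):
    """Block Coordinate Descent loop of Algorithm 1.

    Each estimate_* call updates one block with the others fixed; since no
    block update increases the objective, the iterates decrease monotonically.
    """
    F, As = F0, As0
    ds = estimate_delays(F, As)
    c_prev, c = np.inf, objective(F, As, ds)
    t = 0
    while c_prev - c >= eps and t < max_iter:
        ds = estimate_delays(F, As)
        As = estimate_scores(F, ds)
        F = estimate_factors(As, ds)
        c_prev, c = c, objective(F, As, ds)
        t += 1
    return F, As, ds

# toy illustration: the "factors" block drives a scalar F toward 3 while the
# other blocks are inert, so the objective (F - 3)^2 decreases to 0
obj = lambda F, As, ds: (F - 3.0) ** 2
F_hat, _, _ = bcd_opfa(0.0, None, obj,
                       estimate_delays=lambda F, As: None,
                       estimate_scores=lambda F, ds: None,
                       estimate_factors=lambda As, ds: 3.0)
print(F_hat)  # -> 3.0
```

The stopping rule mirrors the pseudocode: iterate until the per-sweep decrease $c_{t-1} - c_t$ falls below $\varepsilon$.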
EstimateDelays is a discrete optimization that is solved using a branch-and-bound (BB) approach [29]. In this approach a binary tree is created by recursively dividing the feasible set into subsets ("branch"). On each of the nodes of the tree lower and upper bounds ("bound") are computed. When a candidate subset is found whose upper bound is less than the smallest lower bound of previously considered subsets, these latter subsets can be eliminated ("prune") as candidate minimizers. Whenever a leaf (singleton subset) is obtained, the objective is evaluated at the corresponding point. If its value exceeds the current optimal value, the leaf is rejected as a candidate minimizer; otherwise the optimal value is updated and the leaf included in the list of candidate minimizers. Details on the application of BB to OPFA are given below. The subroutine EstimateDelays solves $S$ uncoupled problems of the form:

$$\min_{d \in \mathcal{K}} \| X_s - M(F, d) A_s \|_F^2, \qquad (17)$$

where the set $\mathcal{K}$ is defined in (10). The "branch" part of the optimization is accomplished by recursive splitting of the set $\mathcal{K}$ to form a binary tree. The recursion is initialized by setting $\mathcal{S}_o = \{0, \cdots, d_{\max}\}^f$, $\mathcal{I}_o = \{d \in \mathcal{K} \cap \mathcal{S}_o\}$. The splitting of the set $\mathcal{I}_o$ into two subsets is done as follows:

$$\mathcal{I}_1 = \{ d \in \mathcal{K} \cap \mathcal{S}_o : d_{\omega_1} \le \gamma_1 \}, \qquad \mathcal{I}_2 = \{ d \in \mathcal{K} \cap \mathcal{S}_o : d_{\omega_1} > \gamma_1 \}, \qquad (18)$$

and we update $\mathcal{S}_1 = \{ d \in \mathcal{S}_o : d_{\omega_1} \le \gamma_1 \}$, $\mathcal{S}_2 = \{ d \in \mathcal{S}_o : d_{\omega_1} > \gamma_1 \}$. Here $\gamma_1$ is an integer with $0 \le \gamma_1 \le d_{\max}$, and $\omega_1 \in \{1, \cdots, f\}$. $\mathcal{I}_1$ contains the elements $d \in \mathcal{K}$ whose $\omega_1$-th component is smaller than or equal to $\gamma_1$ and $\mathcal{I}_2$ contains the elements whose $\omega_1$-th component is strictly larger than $\gamma_1$. The same kind of splitting procedure is then subsequently applied to $\mathcal{I}_1$, $\mathcal{I}_2$ and their resulting subsets.
After $k-1$ successive applications of this decomposition there will be $2k-1$ subsets and the $k$-th split will be:

$$\mathcal{I}_t := \{ d \in \mathcal{K} \cap \mathcal{S}_t \}, \qquad \mathcal{I}_{t+1} := \{ d \in \mathcal{K} \cap \mathcal{S}_{t+1} \}, \qquad (19)$$

where $\mathcal{S}_t = \{ d \in \mathcal{S}_{\pi_k} : d_{\omega_k} \le \gamma_k \}$, $\mathcal{S}_{t+1} = \{ d \in \mathcal{S}_{\pi_k} : d_{\omega_k} > \gamma_k \}$, and $\pi_k \in \{1, \cdots, 2k-1\}$ denotes the parent set of the two new sets $t$ and $t+1$, i.e., $\mathrm{pa}(t) = \pi_k$ and $\mathrm{pa}(t+1) = \pi_k$. In our implementation the splitting coordinate $\omega_k$ is the one corresponding to the coordinate in the set $\mathcal{I}_{\pi_k}$ with largest interval. The decision point $\gamma_k$ is taken to be the middle point of this interval.

The "bound" part of the optimization is as follows. Denote by $g(d)$ the objective function in (17) and define its minimum over the set $\mathcal{I}_t \subset \mathcal{K}$:

$$g_{\min}(\mathcal{I}_t) = \min_{d \in \mathcal{I}_t} g(d). \qquad (20)$$

A lower bound for this value can be obtained by relaxing the constraint $d \in \mathcal{K}$ in (19):

$$\min_{d \in \mathcal{S}_t} g(d) \le g_{\min}(\mathcal{I}_t). \qquad (21)$$

Letting $X_s = X_s^{\perp} + X_s^{\parallel}$ where $X_s^{\parallel} = X_s A_s^{\dagger} A_s$ and $X_s^{\perp} = X_s (I - A_s^{\dagger} A_s)$, we have:

$$\| X_s - M(F, d) A_s \|_F^2 = \left\| \left( X_s A_s^{\dagger} - M(F, d) \right) A_s \right\|_F^2 + \left\| X_s^{\perp} \right\|_F^2,$$

where $A_s^{\dagger}$ denotes the pseudoinverse of $A_s$. This leads to:

$$\lambda_{A_s A_s^T} \left\| X_s A_s^{\dagger} - M(F, d) \right\|_F^2 + \left\| X_s^{\perp} \right\|_F^2 \le g(d), \qquad (22)$$

where $\lambda_{A_s A_s^T}$ denotes the smallest eigenvalue of the symmetric matrix $A_s A_s^T$. Combining the relaxation in (21) with inequality (22), we obtain a lower bound on $g_{\min}(\mathcal{I}_t)$:

$$\Phi_{lb}(\mathcal{I}_t) = \min_{d \in \mathcal{S}_t} \lambda_{A_s A_s^T} \left\| X_s A_s^{\dagger} - M(F, d) \right\|_F^2 + \left\| X_s^{\perp} \right\|_F^2 \le g_{\min}(\mathcal{I}_t), \qquad (23)$$

which can be evaluated by performing $f$ decoupled discrete grid searches. At the $k$-th step, the splitting node $\pi_k$ will be chosen as the one with smallest $\Phi_{lb}(\mathcal{I}_t)$. Finally, this lower bound is complemented by the upper bound

$$g_{\min}(\mathcal{I}_t) \le \Phi_{ub}(\mathcal{I}_t) = g(d) \quad \text{for any } d \in \mathcal{I}_t. \qquad (24)$$

These bounds enable the branch-and-bound optimization of (17).

C.
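For small $f$ and $d_{\max}$, problem (17) can also be solved by brute force, which is useful as a correctness check against a branch-and-bound implementation. Enumerating the order-preserving set $\mathcal{K}$ is a one-liner, since non-decreasing delay vectors are exactly the combinations with replacement (a sketch with names of our choosing):

```python
import itertools
import numpy as np

def shift_factors(F, d):
    """M(F, d): per-column circular shift of the factor matrix."""
    return np.column_stack([np.roll(F[:, j], int(d[j])) for j in range(F.shape[1])])

def estimate_delays_exhaustive(X, F, A, d_max):
    """Solve (17) for one subject by enumerating K = non-decreasing vectors
    in {0, ..., d_max}^f; feasible only for small f and d_max."""
    best_cost, best_d = np.inf, None
    for d in itertools.combinations_with_replacement(range(d_max + 1), F.shape[1]):
        cost = np.linalg.norm(X - shift_factors(F, d) @ A, "fro") ** 2
        if cost < best_cost:
            best_cost, best_d = cost, d
    return np.array(best_d), best_cost

F = np.zeros((8, 2)); F[0, :] = 1.0      # two impulse factors, onset at time 0
A = np.eye(2)
X = shift_factors(F, (1, 3)) @ A         # subject with true delays (1, 3)
d_hat, cost = estimate_delays_exhaustive(X, F, A, d_max=4)
print(d_hat, cost)                        # -> [1 3] 0.0
```

The search space has size $\binom{d_{\max}+f}{f}$ rather than $(d_{\max}+1)^f$, which is exactly the reduction the order-preserving constraint buys; BB exploits the same structure without full enumeration.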
Selection of the tuning parameters $f$, $\lambda$ and $\beta$

From (9), it is clear that the OPFA factorization depends on the choice of $f$, $\lambda$ and $\beta$. This is a paramount problem in unsupervised learning, and several heuristic approaches have been devised for simpler factorization models [30], [31], [9]. These approaches are based on training the factorization model on a subset of the elements of the data matrix (training set) and subsequently validating it on the excluded elements (test set).

[Fig. 5. Dictionary used to generate the 2-factor synthetic data of Section IV.]

The variational characterization of the OPFA decomposition allows for the presence of missing variables, i.e., missing elements in the observed matrices $\{X_s\}_{s=1}^S$. In such a case, the Least Squares fitting term in (9) is only applied to the observed set of indices (see Appendix C and D for the extension of the EstimateFactors, EstimateScores and EstimateDelays procedures to the case where there exist missing observations). We will hence follow the approach in [9] and train the OPFA model over a fraction $1 - \delta$ of the entries in the observations $X_s$. Let $\Omega_s$ denote the set of $\delta(n \times p)$ excluded entries for the $s$-th observation. These entries will constitute our test set, and thus our Cross-Validation error measure is:

$$CV(f, \lambda, \beta) = \frac{1}{S} \sum_{s=1}^{S} \left\| \left[ X_s - M\left(\hat{F}, \hat{d}_s\right) \hat{A}_s \right]_{\Omega_s} \right\|_F^2,$$

where $\hat{F}$, $\{\hat{d}_s\}_{s=1}^S$, $\{\hat{A}_s\}_{s=1}^S$ are the OPFA estimates obtained on the training set excluding the entries in $\{\Omega_s\}_{s=1}^S$, for a given choice of $f$, $\lambda$, and $\beta$.

IV. NUMERICAL RESULTS

A. Synthetic data: Periodic model

First we evaluate the performance of the OPFA algorithm for a periodic model observed in additive Gaussian white noise:

$$X_s = M(F, d_s) A_s + \epsilon_s, \quad s = 1, \ldots, S. \qquad (25)$$

Here $\epsilon_s \sim \mathcal{N}_{n \times p}(0, \sigma_\epsilon^2 I)$ and $d_s = \mathrm{sort}(t_s)$, where $\sigma_\epsilon^2$ is the variance of $\epsilon_s$ and the $t_s \sim \mathcal{U}\left(0, \sqrt{12 \sigma_d^2 + 1}\right)$ are i.i.d.
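The cross-validation measure restricted to the held-out entries $\Omega_s$ can be sketched with boolean masks (helper names are ours; this is not the paper's implementation):

```python
import numpy as np

def shift_factors(F, d):
    """M(F, d): per-column circular shift of the factor matrix."""
    return np.column_stack([np.roll(F[:, j], int(d[j])) for j in range(F.shape[1])])

def cv_error(Xs, F_hat, ds_hat, As_hat, masks):
    """CV(f, lambda, beta): mean squared residual over the held-out entries.

    masks[s] is a boolean n x p array marking the excluded entries Omega_s.
    """
    total = 0.0
    for X, d, A, omega in zip(Xs, ds_hat, As_hat, masks):
        R = X - shift_factors(F_hat, d) @ A
        total += np.sum(R[omega] ** 2)    # Frobenius norm restricted to Omega_s
    return total / len(Xs)

rng = np.random.default_rng(2)
F = rng.random((6, 2)); A = rng.random((2, 5))
X = shift_factors(F, [0, 1]) @ A          # noiseless observation
omega = rng.random(X.shape) < 0.1         # delta = 0.1 fraction held out
print(cv_error([X], F, [[0, 1]], [A], [omega]))  # -> 0.0 for a perfect fit
```

In practice the estimates would be refit on the training entries for each $(f, \lambda, \beta)$ grid point and the triple with smallest `cv_error` retained.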
The $f = 2$ columns of $F$ are non-random smooth signals from the predefined dictionary shown in Figure 5. The scores $A_s$ are generated according to a consistent sparsity pattern across all subjects and their non-zero elements are i.i.d. normal truncated to the non-negative orthant.

TABLE II: MODELS CONSIDERED IN SECTION IV-A.

Model | M_s | A_s
OPFA | M_s = M(F, d_s), d_s ∈ K, F smooth and non-negative | Non-negative sparse A_s
OPFA-C | M_s = M(F, d_s), d_s ∈ K, F smooth and non-negative | Non-negative sparse A_s, A_1 = · · · = A_S
SFA | M_s = M(F, d_s), d_s = 0, F smooth and non-negative | Non-negative sparse A_s

Here the number of subjects is $S = 10$, the number of variables is $p = 100$, and the number of time points is $n = 20$. In these experiments we choose to initialize the factors $F$ with temporal profiles obtained by hierarchical clustering of the data. Hierarchical clustering [32] is a standard unsupervised learning technique that groups the $p$ variables into increasingly finer partitions according to the normalized Euclidean distance of their temporal profiles. The average expression patterns of the clusters found are used as initial estimates for $F$. The loadings $\{A_s\}_{s=1}^S$ are initialized by regressing the obtained factors onto the data. We compare OPFA and OPFA-C to a standard Sparse Factor Analysis (SFA) solution, obtained by imposing $d_{\max} = 0$ in the original OPFA model. Table II summarizes the characteristics of the three models considered in the simulations. We fix $f = 2$ and choose the tuning parameters $(\lambda, \beta)$ using the Cross-Validation procedure of Section III-C with a $5 \times 3$ grid and $\delta = 0.1$.
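The clustering-based initialization can be sketched with SciPy's agglomerative clustering. This is a simplified version: average linkage with plain Euclidean distance rather than the normalized distance used in the paper, and the function name is ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def init_factors(X, f):
    """Initialize F with the mean trajectories of f hierarchical clusters.

    X: n x p matrix whose columns are p temporal profiles of length n.
    """
    profiles = X.T                                   # p trajectories of length n
    Z = linkage(profiles, method="average")          # agglomerative tree
    labels = fcluster(Z, t=f, criterion="maxclust")  # cut the tree into f groups
    return np.column_stack([profiles[labels == k].mean(axis=0)
                            for k in range(1, f + 1)])

a, b = np.arange(5.0), np.full(5, 10.0)  # two distinct temporal patterns
X = np.column_stack([a, a, a, b, b, b])  # six variables, two tight clusters
F0 = init_factors(X, f=2)
print(F0.shape)                           # -> (5, 2)
```

The loadings would then be initialized by non-negative least-squares regression of `F0` onto each subject's data, as described in the text.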
In these experiments we consider two measures of performance: the Mean Square Error (MSE) with respect to the generated data,

MSE := (1/S) Σ_{s=1}^S E || D_s − D̂_s ||_F^2,

where E is the expectation operator, D_s = M(F, d_s) A_s is the generated noiseless data and D̂_s = M(F̂, d̂_s) Â_s is the estimated data; and the Distance to the True Factors (DTF), defined as

DTF := 1 − (1/f) Σ_{i=1}^f E [ F_{·,i}^T F̂_{·,i} / ( ||F_{·,i}||_2 ||F̂_{·,i}||_2 ) ],

where F, F̂ are the generated and the estimated factor matrices, respectively.

Figure 6 shows the estimated MSE and DTF performance curves as a function of the delay variance σ_d^2 for fixed SNR = 15 dB, where the SNR is defined as

SNR = 10 log_10 ( (1/S) Σ_s E( ||M(F, d_s) A_s||_F^2 ) / (n p σ_ε^2) ).

OPFA and OPFA-C perform at least as well as SFA for zero delay (σ_d = 0) and significantly better for σ_d > 0 in terms of DTF. OPFA-C outperforms OPFA for high delay variances σ_d^2, at the price of a larger MSE due to the bias introduced by the constraint A_1 = ··· = A_S. In Figure 7 the performance curves are plotted as a function of SNR for fixed σ_d^2 = 5. Note that OPFA and OPFA-C outperform SFA in terms of DTF and that OPFA is better than the others in terms of MSE for SNR > 0 dB. Again, OPFA-C shows increased robustness to noise in terms of DTF.

We also performed simulations to demonstrate the value of imposing the order-preserving constraint in (17). This was accomplished by comparing OPFA to a version of OPFA in which the constraints in (17) are not enforced. Data was generated according to the model (25) with S = 4, n = 20, f = 2 and σ_d^2 = 5. The results of our simulations (not shown) were that, while the order-preserving constraints never degrade OPFA performance, the constraints improve performance when the SNR is small (below 3 dB in this example).
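The two performance measures above are straightforward to compute; a minimal numpy sketch (empirical averages stand in for the expectations, and the function names are ours):

```python
import numpy as np

def mse(D_true, D_hat):
    """Empirical MSE: average squared Frobenius error between the
    noiseless data D_s and its reconstruction over the S subjects."""
    return float(np.mean([np.sum((Dt - Dh) ** 2)
                          for Dt, Dh in zip(D_true, D_hat)]))

def dtf(F, F_hat):
    """Distance to the True Factors: one minus the average cosine
    similarity between matching columns of F and F_hat."""
    cos = [F[:, i] @ F_hat[:, i]
           / (np.linalg.norm(F[:, i]) * np.linalg.norm(F_hat[:, i]))
           for i in range(F.shape[1])]
    return 1.0 - float(np.mean(cos))
```

DTF is 0 when every estimated factor is a positive multiple of its true counterpart and grows toward 1 as the estimated columns become orthogonal to the true ones.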
Finally, we conclude this subsection by studying the sensitivity of the final OPFA estimates to the initialization choice. To this end, we initialize the OPFA algorithm with the correct model perturbed by a random Gaussian vector of increasing variance. We analyze the performance of the estimates in terms of MSE and DTF as a function of the norm of the model perturbation relative to the norm of the noiseless data, which we denote by ρ. Notice that larger ρ corresponds to an increasingly random initialization. The results in Table III show that the MSE and DTF of the OPFA estimates are very similar over a large range of values of ρ, and therefore are robust to the initialization.

TABLE III
SENSITIVITY OF THE OPFA ESTIMATES TO THE INITIALIZATION CHOICE WITH RESPECT TO THE RELATIVE NORM OF THE PERTURBATION (ρ).

DTF [ mean (standard deviation) ] × 10^-3
SNR    | ρ = 0.002  | ρ = 1.08  | ρ = 53.94
22.8   | 0.0 (0.0)  | 3.4 (9.4) | 1.9 (3.2)
−2     | 1.3 (0.5)  | 1 (9.4)   | 1.25 (1.5)
−27.1  | 46 (20)    | 58 (17)   | 63 (8)

MSE [ mean (standard deviation × 10^-3) ]
SNR    | ρ = 0.002  | ρ = 1.08  | ρ = 53.94
22.8   | 0.02 (1.5) | 0.05 (69) | 0.11 (99)
−2     | 0.35 (7.9) | 0.36 (22) | 0.38 (32)
−27.1  | 0.96 (19)  | 0.99 (18) | 1.00 (24)

Fig. 6. MSE (top) and DTF (bottom) as a function of the delay variance σ_d^2 for OPFA and Sparse Factor Analysis (SFA). The curves are plotted with 95% confidence intervals. For σ_d^2 > 0, OPFA outperforms SFA both in MSE and DTF, maintaining its advantage as σ_d increases. For large σ_d, OPFA-C outperforms the other two.

B. Experimental data: Predictive Health and Disease (PHD)

The PHD data set was collected as part of a viral challenge study described in [11]. In this study 20 human subjects were inoculated with live H3N2 virus, and Genechip mRNA gene expression in peripheral blood of each subject was measured over 16 time points. The raw Genechip array data was pre-processed using robust multi-array analysis [22] with quantile normalization [33]. In this section we show results for the constrained OPFA model (OPFA-C). While not shown here, we have observed that OPFA-C gives very similar results to unconstrained OPFA but with reduced computation time. Specifically, we use OPFA-C to perform the following tasks:

1) Subject Alignment: Determine the alignment of the factors to fit each subject's response, thereby revealing each subject's intrinsic response delays.

2) Gene Clustering: Discover groups of genes with a common expression signature by clustering in the low-dimensional space spanned by the aligned factors. Since we are using the OPFA-C model, the projection of each subject's data onto this lower-dimensional space is given by the scores A := A_1 = ··· = A_S. Genes with similar scores will have similar expression signatures.

3) Symptomatic Gene Signature discovery: Using the gene clusters obtained in step 2, we construct temporal signatures common to the subjects who became sick.

Fig. 7. Same as Figure 6 except that the performance curves are plotted with respect to SNR for fixed σ_d^2 = 5.

The data was normalized by dividing each element of each data matrix by the sum of the elements in its column. Since the data is non-negative valued, this ensures that the mixing weights of different subjects are within the same order of magnitude, which is necessary to respect the assumption A_1 = ··· = A_S in OPFA-C.
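The column-sum normalization described above is a one-liner; a sketch (the function name is ours):

```python
import numpy as np

def normalize_columns(X):
    """Divide every element of a subject's data matrix by its column sum,
    putting the mixing weights of different subjects on a common scale
    (needed for the shared-score assumption A_1 = ... = A_S)."""
    return X / X.sum(axis=0, keepdims=True)
```

After this step every column of every subject's matrix sums to one, so score magnitudes are comparable across subjects.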
In order to select a subset of strongly varying genes, we applied one-way Analysis of Variance [34] to test for equality of the means of each gene over 4 different groups of time points, and selected the first p = 300 genes ranked according to the resulting F-statistic. We then applied OPFA-C to these gene trajectories for the S = 9 symptomatic subjects in the study. In this context, the columns of F are the set of signals emitted by the common immune system response, and the vector d_s parameterizes each subject's characteristic onset times for the factors contained in the columns of F. To avoid wrap-around effects, we worked with a factor model of dimension n_F = 24 on the temporal axis.

The OPFA-C algorithm was run with 4 random initializations, and we retained the solution yielding the minimum of the objective function (6). For each f = 1, ..., 5 (number of factors), we estimated the tuning parameters (λ, β) following the cross-validation approach described in Section III-C over a 10 × 3 grid. The results, shown in Table IV, led to the selection of β = 1 × 10^-8, λ = 1 × 10^-8 and f = 3. The choice of three factors is also consistent with the expectation that the principal gene trajectories over the period of time studied are a linear combination of increasing, decreasing or constant expression patterns [11].

To illustrate the goodness-of-fit of our model, in Figure 8 we plot the observed gene expression patterns of 13 strongly varying genes and compare them to the OPFA-C fitted responses for three of the subjects, together with the relative residual error. The average relative residual error is below 10%, and the plots demonstrate the agreement between the observed and the fitted patterns.

TABLE IV
CROSS-VALIDATION RESULTS FOR SECTION IV-B.

                                  f = 1 | f = 2 | f = 3 | f = 4 | f = 5
min CV(f, λ, β)                 | 20.25 | 13.66 | 12.66 | 12.75 | 12.72
Relative residual error (×10^-3)|  7.2  |  4.8  |  4.5  |  4.5  |  4.4
λ̂ (×10^-8)                      |  5.99 |  1    |  1    |  1    |  35.9
β̂ (×10^-6)                      |  3.16 |  3.16 |  0.01 |  0.01 |  100

Fig. 8. Comparison of observed (O) and fitted responses (R) for three of the subjects and a subset of genes in the PHD data set. Gene expression profiles for all subjects were reconstructed with a relative residual error below 10%. The trajectories are smoothed while respecting each subject's intrinsic delay.

Figure 9 shows the trajectories for each subject for four genes having different regulation motifs: up-regulation and down-regulation. It is clear that the gene trajectories have been smoothed while conserving their temporal pattern and their precedence order; e.g., the up-regulation of gene OAS1 consistently follows the down-regulation of gene ORM1.

In Figure 10 we show the 3 factors along with the factor delays and factor loadings discovered by OPFA-C. The three factors, shown in the three bottom panels of the figure, exhibit three different motifs: factors 1 and 3 correspond to up-regulation motifs occurring at different times, and factor 2 is a strong down-regulation motif. The three top panels show the onset times of each motif as compared to the clinically determined peak symptom onset time.
Fig. 9. Comparison of observed (O) and fitted responses (R) for four genes (OAS1, CCR1, CX3CR1, ORM1) showing up-regulation and down-regulation motifs and three subjects in the PHD data set. The gene trajectories have been smoothed while conserving their temporal pattern and their precedence order. The OPFA-C model revealed that OAS1 up-regulation occurs consistently after ORM1 down-regulation among all subjects.

Fig. 10. Top plots: motif onset time for each factor and peak symptom time reported by expert clinicians (O). Bottom plots: aligned factors for each subject. Factors 1 and 3 can be interpreted as up-regulation motifs, and factor 2 is a strong down-regulation pattern. The arrows show each factor's motif onset time.

Note, for example, that the strong up-regulation pattern of the first factor coincides closely with the peak symptom onset time. Genes strongly associated with this factor have been closely associated with acute anti-viral and inflammatory host response [11].
Interestingly, the down-regulation motif associated with factor 2 consistently precedes this up-regulation motif.

Finally, we consider the application of OPFA as a pre-processing step preceding a clustering analysis. Here the goal is to find groups of genes that share similar expression patterns and to determine their characteristic expression signatures. In order to obtain gene clusters, we perform hierarchical clustering on the raw data ({X_s}_{s=1}^S) and on the lower-dimensional space of the estimated factor scores ({A_s}_{s=1}^S), obtaining two different sets of 4 well-differentiated clusters. We then compute the average expression signatures of the genes in each cluster, both by averaging over the observed data ({X_s}_{s=1}^S) and by averaging over the data after OPFA correction for the temporal misalignments. Figure 11 illustrates the results. Clustering using the OPFA-C factor scores produces a very significant improvement in cluster concentration as compared to clustering using the raw data {X_s}_{s=1}^S. The first two columns in Figure 11 compare the variation of the gene profiles over each cluster for the temporally realigned data (labeled 'A') to the profile variation of these same genes for the misaligned observed data (labeled 'M'). For comparison, the last column shows the results of applying hierarchical clustering directly to the original misaligned data set {X_s}_{s=1}^S. It is clear that clustering on the low-dimensional space of the OPFA-C scores unveils interesting motifs from the original noisy temporal expression trajectories.

V. CONCLUSIONS

We have proposed a general method of order-preserving factor analysis that accounts for possible temporal misalignments in a population of subjects undergoing a common treatment.
We have described a simple model based on circular-shift translations of prototype motifs and have shown how to embed transient gene expression time courses into this periodic model. The OPFA model can significantly improve the interpretability of complex misaligned data. The method is applicable to other signal processing areas beyond gene expression time course analysis. A Matlab package implementing OPFA and OPFA-C will be available at the Hero Group Reproducible Research page (http://tbayes.eecs.umich.edu).

APPENDIX

A. Circulant time shift model

Using circular shifts in (3) introduces periodicity into our model (2). Some types of gene expression may display periodicity, e.g., circadian transcripts, while others, e.g., transient host response, may not. For transient gene expression profiles such as the ones we are interested in here, we use a truncated version of this periodic model, where we assume that each subject's response arises from the observation of a longer periodic vector within a time window (see Figure 12):

X_s = [ M(F, d_s) A_s + ε_s ]_Ω.   (26)

Fig. 11.
The first two columns show the average expression signatures and their estimated upper/lower confidence intervals for each cluster of genes obtained by: averaging the estimated Aligned expression patterns over the S = 9 subjects (A), and directly averaging the misaligned observed data for each of the gene clusters obtained from the OPFA-C scores (M). The confidence intervals are computed as +/− the estimated standard deviation at each time point. The cluster average standard deviation (σ) is computed as the average of the standard deviations at each time point. The last column shows the results of applying hierarchical clustering directly to the original misaligned data set {X_s}_{s=1}^S. In the first column, each gene expression pattern is obtained by mixing the estimated aligned factors F according to the estimated scores A. The alignment effect is clear, and interesting motifs become more evident.

Here, the factors are of dimension n_F ≥ n and the window size is of dimension n (in the special case n = n_F, we recover the original periodic model). In this model, the observed features are non-periodic as long as the delays d_s are sufficiently small compared to n_F. More concretely, if the maximum delay is d_max, then in order to avoid wrap-around effects the dimension should be chosen as at least n_F = n + d_max. Finally, we define the index set Ω corresponding to the observation window as:

Ω = {ω_1, ω_1 + 1, ω_1 + 2, ..., ω_2}^p,   (27)

where ω_1 and ω_2 are the lower and upper observation endpoints, verifying n = ω_2 − ω_1.

Fig. 12. Right: each subject's factor matrix M_i is obtained by applying a circular shift to a common set of factors F, parameterized by a vector d.
Left: in order to avoid wrap-around effects when modeling transient responses, we consider instead a higher-dimensional truncated model of dimension n_F, from which we only observe the elements within the window characterized by Ω.

B. Delay estimation and time-course alignment

The solution to problem (9) yields an estimate d̂_s of each subject's intrinsic factor delays. These delays are relative to the patterns found in the estimated factors and therefore require conversion to a common reference time. For a given up-regulation or down-regulation motif I, which we call the feature of interest, found in factor g, we choose a time point of interest t_I. See Figure 13(a) for an example of the choice of t_I for an up-regulation feature. Then, given t_I, for each subject s and each factor k we define the absolute feature occurrence time as:

t_{s,k} = ( (d̂_{s,k} + t_I) mod n_F ) − ω_1,   (28)

where d̂_{s,k} is the estimated delay corresponding to factor k and subject s, and ω_1 is the lower endpoint of the observation window (see (27)). Figure 13 illustrates the computation of t_{s,k} in a 2-factor example. The quantities t_{s,k} can be used for interpretation purposes, or to realign temporal profiles in order to find common, low-variance gene expression signatures, as shown in Section IV-B.

C. Implementation of EstimateFactors and EstimateScores

We consider here the implementation of EstimateFactors and EstimateScores in the presence of missing data. Let Ω_s = [ω_1^s, ..., ω_p^s] ∈ {0, 1}^{n×p} encode the set of observed entries in observation X_s.

Fig. 13. (a) Time point of interest (t_I) for the up-regulation feature of factor 1. (b) The absolute time points t_{1,1}, t_{1,2} are shown in red font for two different subjects and have been computed from their corresponding relative delays via the formula in (28).
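The conversion in (28) is a one-liner; a sketch (the function and argument names are ours):

```python
def feature_time(d_hat_sk, t_I, n_F, omega1):
    """Absolute feature occurrence time, eq. (28): shift the time point of
    interest by the subject's estimated delay, wrap modulo the period n_F,
    and express the result relative to the window start omega1."""
    return (d_hat_sk + t_I) % n_F - omega1
```

The modulo accounts for the circularity of the underlying periodic model, so a feature whose shifted position exceeds n_F wraps back to the start of the period before being referenced to the window.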
The objective in (9) is then:

Σ_{s=1}^S || [ X_s − M(F, d_s) A_s ]_{Ω_s} ||_F^2.   (29)

We will show how to reformulate the problems EstimateFactors (14) and EstimateScores (15) as standard quadratic objectives with linear and/or quadratic constraints. First, we rewrite the objective (29) as:

Σ_{s=1}^S Σ_{j=1}^p || diag(ω_j^s) [X_s]_{·,j} − diag(ω_j^s) M(F, d_s) [A_s]_{·,j} ||^2.

Expanding the square we obtain:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 = Σ_{s=1}^S Σ_{j=1}^p ( [A_s]_{·,j}^T M(F, d_s)^T diag(ω_j^s) M(F, d_s) [A_s]_{·,j} − 2 [X_s]_{·,j}^T diag(ω_j^s) M(F, d_s) [A_s]_{·,j} ) + Σ_{s=1}^S ||[X_s]_{Ω_s}||_F^2.   (30)

To obtain the EstimateFactors objective, we first rewrite the OPFA model (26) in matrix notation. Let U_i be the circular shift matrix parameterized by the i-th delay component. Then

M(F, d) = [U_1 F_1, ..., U_f F_f] = H F̃,

where F_j denotes the j-th column of F, H is the concatenation of the U_i matrices, and F̃ is a matrix containing the columns of F with the appropriate padding of zeros. With this notation and (30) we obtain:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝_F Σ_{s=1}^S Σ_{j=1}^p tr( [A_s]_{·,j} [A_s]_{·,j}^T F̃^T H_s^T diag(ω_j^s) H_s F̃ ) − 2 tr( [A_s]_{·,j} [X_s]_{·,j}^T diag(ω_j^s) H_s F̃ ).

We now use the identity ([35], Thm. 3, Sec. 4)

tr( Z X^T Y W ) = vec(W)^T ( Z ⊗ Y^T ) vec(X)

to write:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝_F vec(F̃)^T ( Σ_{s=1}^S Σ_{j=1}^p [A_s]_{·,j} [A_s]_{·,j}^T ⊗ H_s^T diag(ω_j^s) H_s ) vec(F̃) − 2 vec( Σ_{s=1}^S Σ_{j=1}^p H_s^T diag(ω_j^s) [X_s]_{·,j} [A_s]_{·,j}^T )^T vec(F̃).
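The trace/vec identity from [35] invoked above is easy to verify numerically before relying on it; a quick self-contained check (all variable names are ours, and vec(·) is column-major stacking):

```python
import numpy as np

# Numerical check of  tr(Z X^T Y W) = vec(W)^T (Z kron Y^T) vec(X).
# Shapes are chosen so that the product Z X^T Y W is square.
rng = np.random.default_rng(0)
a, b, c, d = 2, 3, 4, 5
Z = rng.normal(size=(a, b))
X = rng.normal(size=(c, b))
Y = rng.normal(size=(c, d))
W = rng.normal(size=(d, a))

vec = lambda M: M.flatten(order="F")  # vec() stacks columns (column-major)

lhs = np.trace(Z @ X.T @ Y @ W)
rhs = vec(W) @ np.kron(Z, Y.T) @ vec(X)
```

The two scalars agree to machine precision, confirming the vectorization used in the derivation.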
Finally, making use of the fact that F̃ is a block-column matrix with the columns of F padded by zeros, we conclude:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝_F vec(F)^T Q_F vec(F) − 2 q_F^T vec(F),

where we have defined

Q_F = [ Σ_{s=1}^S Σ_{j=1}^p [A_s]_{·,j} [A_s]_{·,j}^T ⊗ H_s^T diag(ω_j^s) H_s ]_{J,J}   (31)

q_F = [ vec( Σ_{s=1}^S Σ_{j=1}^p H_s^T diag(ω_j^s) [X_s]_{·,j} [A_s]_{·,j}^T ) ]_J   (32)

and J is the set of indices corresponding to the non-zero elements of vec(F̃). Hence, EstimateFactors can be written as:

min_F vec(F)^T ( Q_F + β diag( W^T W, ..., W^T W ) ) vec(F) − 2 q_F^T vec(F)
s.t. ||F||_F^2 ≤ δ,
     F_{i,j} ≥ 0, i = 1, ..., n, j = 1, ..., f.

The dimension of the variable in this problem is nf. In the applications considered here, both n and f are relatively small, and hence this program can be solved with a standard convex solver such as SeDuMi [36] (upon conversion to a standard conic problem).

On the other hand, we can follow the same procedure to reformulate the objective in EstimateScores (15) as a penalized quadratic form. First we use (30) and (31) to write:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝_{{A_s}_{s=1}^S} Σ_{s=1}^S vec(A_s)^T Q_A^s vec(A_s) − 2 (q_A^s)^T vec(A_s),

where Q_A^s is the block-diagonal matrix

Q_A^s = diag( M(F, d_s)^T diag(ω_1^s) M(F, d_s), ..., M(F, d_s)^T diag(ω_p^s) M(F, d_s) )   (33)

q_A^s = [ [X_s]_{·,1}^T diag(ω_1^s) M(F, d_s), ..., [X_s]_{·,p}^T diag(ω_p^s) M(F, d_s) ].   (34)

Thus EstimateScores can be written as:

min_{{A_s}} Σ_{s=1}^S vec(A_s)^T Q_A^s vec(A_s) − 2 (q_A^s)^T vec(A_s) + λ Σ_{i=1}^p Σ_{j=1}^f || ( [A_1]_{j,i}, ..., [A_S]_{j,i} ) ||_2   (35)
s.t. [A_s]_{j,i} ≥ 0, j = 1, ..., f, i = 1, ..., p, s = 1, ..., S.

This is a convex, non-differentiable and potentially high-dimensional problem.
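As a toy instance of this kind of problem, the sketch below runs first-order proximal updates on a small non-negative penalized quadratic. It is a simplified stand-in: a plain l1 penalty replaces the mixed l1–l2 group penalty of (35), and the helper names `prox_nonneg_l1` and `prox_grad` are ours.

```python
import numpy as np

def prox_nonneg_l1(z, t):
    """Proximal operator of t*||x||_1 restricted to x >= 0: for
    non-negative x this reduces to max(z - t, 0)."""
    return np.maximum(z - t, 0.0)

def prox_grad(Q, q, lam, iters=500):
    """First-order proximal-gradient sketch for
    min_x  x'Qx - 2q'x + lam*||x||_1   s.t. x >= 0."""
    alpha = 2.0 * np.linalg.eigvalsh(Q).max() + 1e-12  # 1/alpha is a safe step
    x = np.zeros_like(q)
    for _ in range(iters):
        grad = 2.0 * (Q @ x - q)           # gradient of the quadratic part
        x = prox_nonneg_l1(x - grad / alpha, lam / alpha)
    return x
```

Each iteration costs one matrix-vector product plus an elementwise threshold, which is what makes this family of methods scale to high-dimensional score problems.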
For this type of optimization problem, there exists a class of simple and scalable algorithms that has recently received much attention [37], [38], [39], [28]. These algorithms rely only on first-order updates of the type:

x^t ← T_{Γ,λ}( v − 2 (αI − Q) x^{t−1} ),   (36)

which involve only matrix-vector multiplications and the evaluation of the operator T, called the proximal operator [39] associated with Γ and C, and defined as:

T_{Γ,λ}(v) := argmin_x (1/2) x^T x + v^T x + λ Γ(x)   s.t. x ∈ C.

This operator takes the vector v as input and outputs a shrunk/thresholded version of it, depending on the nature of the penalty Γ and the constraint set C. For some classes of penalties Γ (e.g., l1, l2, mixed l1-l2) and the positivity constraints considered here, this operator has a closed-form solution [40], [39]. Weak convergence of the sequence (36) to the optimum of (35) is assured for a suitable choice of the constant α [28], [41].

D. Delay estimation lower bound in the presence of missing data

As mentioned earlier, the lower bound (23) no longer holds in the presence of missing data. We derive here another bound that can be used in that case. From expression (30), we first write the objective of EstimateDelays (17) in quadratic form:

Σ_{s=1}^S ||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝_{d_s} Σ_{s=1}^S Σ_{j=1}^p tr( M(F, d_s)^T diag(ω_j^s) M(F, d_s) P_{s,j} ) − 2 tr( Q_{s,j} M(F, d_s) ),

where we have let P_{s,j} := [A_s]_{·,j} [A_s]_{·,j}^T and Q_{s,j} := [A_s]_{·,j} [X_s]_{·,j}^T diag(ω_j^s). Notice that each of the terms indexed by s is independent of the others. Using (31), we obtain:

||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ∝ vec(M(F, d_s))^T ( Σ_{j=1}^p P_{s,j} ⊗ diag(ω_j^s) ) vec(M(F, d_s)) − 2 vec( Σ_{j=1}^p Q_{s,j}^T )^T vec(M(F, d_s)).
We can now minorize the function above by:

||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ≥ λ_min( Σ_{j=1}^p P_{s,j} ⊗ diag(ω_j^s) ) vec(M(F, d_s))^T vec(M(F, d_s)) − 2 vec( Σ_{j=1}^p Q_{s,j}^T )^T vec(M(F, d_s)) + ||[X_s]_{Ω_s}||_F^2,

which, since the circular shifts preserve the Frobenius norm so that vec(M(F, d_s))^T vec(M(F, d_s)) = ||F||_F^2, leads to:

||[X_s − M(F, d_s) A_s]_{Ω_s}||_F^2 ≥ λ_min( Σ_{j=1}^p P_{s,j} ⊗ diag(ω_j^s) ) ||F||_F^2 − 2 vec( Σ_{j=1}^p Q_{s,j}^T )^T vec(M(F, d_s)) + ||[X_s]_{Ω_s}||_F^2.

Using the relaxation in (21), we can now compute a lower bound Φ_lb(I_t) as:

Φ_lb(I_t) = λ_min( Σ_{j=1}^p P_{s,j} ⊗ diag(ω_j^s) ) ||F||_F^2 + ||[X_s]_{Ω_s}||_F^2 − 2 max_{d ∈ S_t} vec( Σ_{j=1}^p Q_{s,j}^T )^T vec(M(F, d_s)).

REFERENCES

[1] I. Johnstone and A. Lu, "On consistency and sparsity for principal components analysis in high dimensions," Journal of the American Statistical Association, vol. 104, no. 486, pp. 682-693, 2009.
[2] R. Jenatton, G. Obozinski, and F. Bach, "Structured sparse principal component analysis," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 9, 2010.
[3] T. Blumensath and M. Davies, "Sparse and shift-invariant representations of music," IEEE Transactions on Speech and Audio Processing, vol. 14, no. 1, p. 50, 2006.
[4] K. Pearson, "LIII. On lines and planes of closest fit to systems of points in space," Philosophical Magazine Series 6, vol. 2, no. 11, pp. 559-572, 1901.
[5] J. Carroll and J. Chang, "Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition," Psychometrika, vol. 35, no. 3, pp. 283-319, 1970.
[6] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
[7] K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski, "Dictionary learning algorithms for sparse representation," Neural Computation, vol. 15, no. 2, pp. 349-396, 2003.
[8] B. Olshausen and D. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Research, vol. 37, no. 23, pp. 3311-3325, 1997.
[9] D. Witten, R. Tibshirani, and T. Hastie, "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis," Biostatistics, vol. 10, no. 3, p. 515, 2009.
[10] B. Mailhé, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Shift-invariant dictionary learning for sparse representations: extending K-SVD," in Proceedings of the 16th European Signal Processing Conference, vol. 4, 2008.
[11] A. Zaas, M. Chen, J. Varkey, T. Veldman, A. Hero, J. Lucas, Y. Huang, R. Turner, A. Gilbert, R. Lambkin-Williams et al., "Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans," Cell Host & Microbe, vol. 6, no. 3, pp. 207-217, 2009.
[12] N. Sidiropoulos, G. Giannakis, and R. Bro, "Blind PARAFAC receivers for DS-CDMA systems," IEEE Transactions on Signal Processing, vol. 48, no. 3, pp. 810-823, 2000.
[13] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[14] N. Srebro, J. Rennie, and T. Jaakkola, "Maximum-margin matrix factorization," Advances in Neural Information Processing Systems, vol. 17, pp. 1329-1336, 2005.
[15] T. Kolda and B. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, 2009.
[16] G. Bergqvist and E. Larsson, "The higher-order singular value decomposition: Theory and an application [lecture notes]," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 151-154, 2010.
[17] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Review, vol. 49, no. 3, pp. 434-448, 2007.
[18] M. Lewicki and T. Sejnowski, "Coding time-varying signals using sparse, shift-invariant representations," Advances in Neural Information Processing Systems, pp. 730-736, 1999.
[19] A. Tibau Puig, A. Wiesel, and A. Hero, "Order-preserving factor analysis," University of Michigan, Tech. Rep., June 2010.
[20] L. Sacchi, C. Larizza, P. Magni, and R. Bellazzi, "Precedence temporal networks to represent temporal relationships in gene expression data," Journal of Biomedical Informatics, vol. 40, no. 6, pp. 761-774, 2007.
[21] A. Aderem and R. Ulevitch, "Toll-like receptors in the induction of the innate immune response," Nature, vol. 406, no. 6797, pp. 782-787, 2000.
[22] R. Irizarry, B. Hobbs, F. Collin, Y. Beazer-Barclay, K. Antonellis, U. Scherf, and T. Speed, "Exploration, normalization, and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, p. 249, 2003.
[23] P. Comon, Tensor Decompositions, ser. Mathematics in Signal Processing V. Oxford University Press, USA, 2002.
[24] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253-1278, 2000.
[25] Y. Jia, M. Salzmann, and T. Darrell, "Factorized latent spaces with structured sparsity," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2010-99, Jun 2010.
[26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2010, pp. 2272-2279.
[27] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 68, no. 1, p. 49, 2006.
[28] N. Pustelnik, C. Chaux, and J. Pesquet, "A constrained forward-backward algorithm for image recovery problems," in Proceedings of the 16th European Signal Processing Conference, 2008, pp. 25-29.
[29] E. Lawler and D. Wood, "Branch-and-bound methods: A survey," Operations Research, vol. 14, no. 4, pp. 699-719, 1966.
[30] A. Owen and P. Perry, "Bi-cross-validation of the SVD and the nonnegative matrix factorization," The Annals of Applied Statistics, vol. 3, no. 2, pp. 564-594, 2009.
[31] S. Wold, "Cross-validatory estimation of the number of components in factor and principal components models," Technometrics, vol. 20, no. 4, pp. 397-405, 1978.
[32] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2005.
[33] B. Bolstad, R. Irizarry, M. Astrand, and T. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, p. 185, 2003.
[34] J. Neter, W. Wasserman, M. Kutner et al., Applied Linear Statistical Models. Irwin, Burr Ridge, Illinois, 1996.
[35] H. Neudecker and J. Magnus, Matrix Differential Calculus with Applications in Statistics and Econometrics, 1999.
[36] J. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1, pp. 625-653, 1999.
[37] M. Zibulevsky and M. Elad, "L1-L2 optimization in signal and image processing," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 76-88, May 2010.
[38] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413-1457, 2004.
[39] P. Combettes and V. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Modeling and Simulation, vol. 4, no. 4, pp. 1168-1200, 2006.
[40] A. Tibau Puig, A. Wiesel, and A. O. Hero, "A multidimensional shrinkage-thresholding operator," in Statistical Signal Processing, 2009. SSP '09. IEEE/SP 15th Workshop on, 2009, pp. 113-116.
[41] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci., vol. 2, pp. 183-202, 2009.