Dependency detection with similarity constraints

Leo Lahti 1,2, Samuel Myllykangas 3, Sakari Knuutila 2 and Samuel Kaski 1

1. Helsinki University of Technology, Department of Information and Computer Science, PO Box 5400, FI-02015 TKK, Finland
2. University of Helsinki and Helsinki University Central Hospital, Haartman Institute and HUSLAB, Department of Pathology, Helsinki, Finland
3. Stanford University School of Medicine, Department of Medicine, Division of Oncology, and Stanford Genome Technology Center, Stanford University, Stanford, USA

Preprint of Lahti et al. 2009 in Tülay Adali, Jocelyn Chanussot, Christian Jutten, and Jan Larsen, editors, Proceedings of the 2009 IEEE International Workshop on Machine Learning for Signal Processing XIX, pages 89–94. IEEE, Piscataway, NJ, USA, 2009.

Derivations presented in more detail in http://lib.tkk.fi/Diss/2010/isbn9789526033686
Implementation of the method available at http://bioconductor.org/packages/devel/bioc/html/pint.html

Abstract

Unsupervised two-view learning, or detection of dependencies between two paired data sets, is typically done by some variant of canonical correlation analysis (CCA). CCA searches for a linear projection for each view, such that the correlations between the projections are maximized. The solution is invariant to any linear transformation of either or both of the views; for tasks with small sample size such flexibility implies overfitting, which is even worse for more flexible nonparametric or kernel-based dependency discovery methods. We develop variants which reduce the degrees of freedom by assuming constraints on similarity of the projections in the two views. A particular example is provided by a cancer gene discovery application where chromosomal distance affects the dependencies between gene copy number and activity levels.
Similarity constraints are shown to improve detection performance of known cancer genes.

1 Introduction

We develop methods for the task of detecting statistical dependencies between multiple sources of co-occurring data. The sources are assumed to share relevant common information, and additionally contain independent but unknown type of noise. The task is to discover the relevant information; both to detect and analyse or interpret it. This is a particular type of a data fusion task, shared by multi-view learning. In multi-view learning each source is interpreted as a different view to the same items, and the task is to enhance classification performance by combining the views. Our task can be interpreted as unsupervised multi-view learning.

The traditional statistical way of finding dependencies between data sources is canonical correlation analysis, CCA, which generalizes correlation to multidimensional sources, retaining some of the nice interpretability of correlation coefficients. While the basic correlation coefficient assumes paired scalar values, canonical correlations assume paired vectorial values. The vectors are projected to scalar components before computing the correlations, using linear projections that maximize the correlations. For multidimensional data there will be many correlation coefficients; the second components are constrained to be uncorrelated with the first, and so on.

CCA is known to have two nice properties: the result is invariant to linear transformations of the data spaces, and the solution for any fixed number of components maximizes mutual information between linear projections for Gaussian data. These insights can be interpreted as motivations for generalizing using nonparametric methods [1, 2] and kernel CCA [3, 4].

The flexibility of CCA can cause overfitting problems that are specifically harmful with small sample sizes that abound in biomedical studies, for instance. When the views are high-dimensional, the completely unconstrained linear projections involve high degrees of freedom; several ways to regularize the CCA solution have been suggested to overcome some of the associated problems [5, 6, 7]. We introduce a complementary approach that is based on bringing in prior knowledge to constrain the model family. Assuming the dimensions of the different views are not completely unrelated but instead are formed of related pairs, it makes sense to search for more constrained projections. In our application, the views are different measurements made on the same locations of the genome, and the dimensions correspond to these particular locations. Constraining the projections to be the same or at least similar in the different views will additionally enhance interpretability of the results, given that relationships between the same components in the two views are natural.

Correlation-based CCA has been shown to correspond to the maximum likelihood solution of a simple generative model [4], where the two views are assumed to stem from a shared Gaussian latent variable and normally distributed data-set-specific noise. This has opened up the road to probabilistic and Bayesian formulations [8, 9] which make it possible to deal rigorously with uncertainty in small sample sizes and to include prior knowledge as Bayesian priors. We suggest also a probabilistic version for constrained dependency search that provides a robust alternative for direct maximization of correlations. While the probabilistic version is slower to compute, it is the recommended choice when prior information of the types of dependency is available, or sample size is small.

The methods will be applied in a very promising application setup for knowledge discovery with dependency detection. The task is to find potential cancer genes by studying the relationship between changes caused by cancer in gene expression and gene copy numbers, that is, amplifications or deletions caused by mutations in cancer samples. Copy number changes are a key mechanism for cancer, and combination of copy number information with gene expression measurements can reveal functional effects of the mutations; gene expression data is informative of gene activity. The rationale goes as follows: Mutations having no functional effect will not cause cancer, and cancer-related gene expression changes may be side effects. Gene expression changes caused by mutations would be strong candidates for cancer mechanisms, and they contribute to the dependencies between the two data sources. While causation can be difficult to grasp, study of the dependencies can provide an efficient proxy for such effects.

2 Canonical correlations with similarity constraints

2.1 Correlation-based approach

Correlation-based CCA searches for a maximally correlated linear projection of the original data sets with paired samples $X$ and $Y$. It maximizes the correlation between the projections, $\mathrm{cor}(X v_x, Y v_y)$, with respect to arbitrary projection vectors $v_x, v_y$. However, this flexibility easily leads to overfitting as demonstrated by the case study in Section 3. In many applications prior information of the potential relationships between the features of the investigated data sets is available. Constraining the projections accordingly can potentially reduce overfitting and help to focus on specific types of dependencies between the two data sets.
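
To make the unconstrained objective concrete, the following is a minimal sketch of the first canonical pair, computed by whitening the two covariance matrices and taking a singular value decomposition. It is an illustration under stated assumptions and not the implementation used for the experiments: the function name `cca_first_component`, the small ridge term `reg` and the synthetic data are all hypothetical. The second half of the demo uses pure noise with 20 samples and 15 dimensions per view, where the unconstrained projections can still reach a near-perfect sample correlation.

```python
import numpy as np

def cca_first_component(X, Y, reg=1e-8):
    """Return (rho, v_x, v_y) maximizing cor(X v_x, Y v_y).

    X, Y: arrays of shape (n_samples, d_x) and (n_samples, d_y), rows paired.
    reg:  small ridge added for numerical stability (assumption of this sketch).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance K = Lx^{-1} Sxy Ly^{-T}
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T
    U, s, Vt = np.linalg.svd(K)
    # Map the leading singular vectors back to the original coordinates
    v_x = np.linalg.solve(Lx.T, U[:, 0])
    v_y = np.linalg.solve(Ly.T, Vt[0])
    return s[0], v_x, v_y

rng = np.random.default_rng(0)

# Two views that share a one-dimensional latent signal z
n = 200
z = rng.normal(size=(n, 1))
X = z @ rng.normal(size=(1, 5)) + rng.normal(size=(n, 5))
Y = z @ rng.normal(size=(1, 4)) + rng.normal(size=(n, 4))
rho, v_x, v_y = cca_first_component(X, Y)
check = np.corrcoef(X @ v_x, Y @ v_y)[0, 1]

# Independent noise, few samples, many dimensions: overfitting
Xn = rng.normal(size=(20, 15))
Yn = rng.normal(size=(20, 15))
rho_noise, _, _ = cca_first_component(Xn, Yn)
print(rho, check, rho_noise)
```

In the noise-only case the two 15-dimensional column spaces must intersect within the 19-dimensional space of centered samples, so the sample correlation is essentially one although the views are independent.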

A particular example of such a model is provided by our cancer gene discovery application, where gene copy number changes are systematically correlated with the gene expression measurements from the same genes. The relationship between the projections can be parametrized with a transformation matrix $T$ such that $v_y = T v_x$. Maximization of the correlations between the projections leads to the following optimization problem:

$$\arg\max_{v, T} \frac{v^T \tilde{\Sigma}_{xy} T v}{\sqrt{v^T \tilde{\Sigma}_{xx} v}\, \sqrt{(T v)^T \tilde{\Sigma}_{yy} T v}}, \tag{1}$$

where the observed covariances of the two data sets are denoted by $\tilde{\Sigma}$. Constraints on $T$ can be used to guide the dependency search. We refer to this model as Similarity-constrained CCA (SimCCA). Suitable constraints depend on the particular applications; the solutions can be made to prefer particular types of dependencies in a soft manner with an appropriate penalty term on $T$. While we consider only one-dimensional projections in the case study, multidimensional projection matrices are also possible. The optimal projection vectors can be sought iteratively as in ordinary CCA.

Direct optimization of the correlations provides a simple and computationally efficient way to detect dependencies between data sources but it lacks an explicit model to deal with the uncertainty in the data and model parameters.

2.2 Probabilistic approach

An explicit model-based approach for the dependency exploration task is provided by the probabilistic modeling framework. We derive a probabilistic approach which should be more robust to small sample sizes. The correlation-based CCA has a direct connection to the maximum likelihood (ML) solution of the generative model [4, 10]:

$$X \sim N(W_x z, \Psi_x), \qquad Y \sim N(W_y z, \Psi_y), \tag{2}$$

assuming normally distributed $z$, and data-set-specific covariances $\Psi_x, \Psi_y$.
The dependency between the data sets is captured by the shared latent variable $z$, and $W_x, W_y$ characterize the relationship between the data sets. The covariances $\Psi_x, \Psi_y$ characterize data set-specific effects. Note that while optimal projections $v$ in the correlation-based CCA (Eq. (1)) operate on the observed data, the parameters of interest, $W_x, W_y$, in probabilistic CCA mediate transformations of the latent variable $z$.

The solutions of the probabilistic CCA can be constrained analogously to the correlation-based approach in Eq. (1), by extending the formulation to include appropriate prior terms. The joint likelihood of the model is given by

\begin{align}
& P(X, Y, W, \Psi) \tag{3} \\
& \quad \sim P(X, Y \mid W_x, W_y, \Psi)\, P(W_y \mid W_x)\, P(W_x)\, P(\Psi) \tag{4} \\
& \quad = \int P(X, Y \mid W_x, W_y, \Psi, z) \tag{5} \\
& \qquad\quad P(W_y \mid W_x)\, P(W_x)\, P(\Psi)\, P(z)\, dz. \tag{6}
\end{align}

Here $\Psi$ denotes the block-diagonal matrix of $\Psi_x$ and $\Psi_y$. While incorporation of prior information of the data set-specific effects through the $W_x$ and $\Psi$ provides promising lines for further work, we focus on the shared latent variables as a probabilistic alternative to the correlation-based SimCCA.

The relation between the transformation matrices for the shared latent variable is encoded by the prior term $P(W_y \mid W_x)$ and can be parametrized with a transformation matrix $T$ such that $W_y = T W_x$. Assuming invertible $W_x^T W_x$, we have $T = W_y (W_x^T W_x)^{-1} W_x^T$. By setting a prior on $T$ it is possible to emphasize certain types of dependencies. With unconstrained $T$ the solution reduces to ordinary probabilistic CCA. In the other extreme $T$ is an identity matrix, $T = I$, and the two shared components, derived from $x$ and $y$ respectively, would be identical. The formulation would also allow tuning of $T$ between these two extremes.
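
The expression for $T$ can be checked numerically. The sketch below is illustrative only: the function names, the dimensions (six features, two latent components) and the random matrices are hypothetical, and both views are assumed to have the same number of features so that comparing $T$ with the identity makes sense. The Frobenius norm is used here as one possible measure of the distance between $T$ and $I$.

```python
import numpy as np

def transformation_matrix(W_x, W_y):
    """T = W_y (W_x^T W_x)^{-1} W_x^T, assuming W_x^T W_x is invertible."""
    gram = W_x.T @ W_x  # shape (k, k)
    # solve(gram, W_x.T) gives (W_x^T W_x)^{-1} W_x^T without forming an inverse
    return W_y @ np.linalg.solve(gram, W_x.T)

def deviation_from_identity(T):
    """Frobenius norm of T - I (T must be square)."""
    return np.linalg.norm(T - np.eye(T.shape[0]), ord="fro")

rng = np.random.default_rng(0)
d, k = 6, 2                                  # features per view, latent dimension
W_x = rng.normal(size=(d, k))
W_y = W_x + 0.1 * rng.normal(size=(d, k))    # a view similar to W_x

T = transformation_matrix(W_x, W_y)
print(np.allclose(T @ W_x, W_y))             # T maps W_x onto W_y
print(deviation_from_identity(T))
```

Since this $T$ has rank at most $k$, it equals the orthogonal projection onto the column space of $W_x$ when $W_y = W_x$; its Frobenius distance from $I$ is then $\sqrt{d - k}$ rather than zero whenever $k < d$.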

We consider the following simple prior for $T$:

$$P(T) = N_+(\|T - I\| \mid 0, \sigma_T^2) = N_+(\|W_y (W_x^T W_x)^{-1} W_x^T - I\| \mid 0, \sigma_T^2).$$

This can be plugged into $P(W_y \mid W_x)$ in Eq. (3). We have used the Frobenius norm, and $N_+$ refers to the truncated normal distribution for positive input values. The $\sigma_T^2$ can tune the deviation of $T$ from the identity matrix; a strict version of probabilistic SimCCA (pSimCCA) is obtained with $\sigma_T^2 \to 0$, while $\sigma_T^2 \to \infty$ yields ordinary probabilistic CCA (pCCA). With uninformative priors $P(W), P(\Psi) \sim 1$ and normally distributed shared latent variable $z \sim N(0, I)$, the model has the negative log-likelihood

$$-\log P(X, Y, W, \Psi) \sim \log|\Sigma| + \mathrm{tr}\, \Sigma^{-1} \tilde{\Sigma} + \frac{\|T - I\|}{\hat{\sigma}_T^2}. \tag{7}$$

Here $\Sigma = W W^T + \Psi$ contains the matrices $W_x, W_y$ and data set-specific covariances $\Psi_x, \Psi_y$. We have added the prior for $T$, which tunes the relationship between $W_y$ and $W_x$. For other details, see [4, 5].

3 Analysis of functional copy number changes in gastric cancer

A promising biomedical application highlights the potential practical value of our approach. Constraints on the potential dependencies between gene expression and copy number are shown to improve the detection of known cancer genes. The advantages of constrained and probabilistic versions become particularly salient when the dimensionality increases and ordinary correlation-based CCA seriously overfits to the data.

3.1 Background and motivation

Copy number changes in chromosomal regions with tumor-suppressor or other cancer-associated genes have important contribution to cancer development and progression. Chromosomal gains and losses are likely to be positively correlated with the expression levels of the affected genes; copy number gain is likely to increase the expression of some of the associated genes whereas deletion will block gene expression.
Identification of cancer-associated regions with functional copy number changes has potential diagnostic, prognostic and clinical impact for cancer studies.

Canonical correlations provide a principled framework for detecting the shared variation in gene expression and copy number data. Systematic copy number changes in a particular chromosomal region are captured by multiple copy number probes, and this is also visible in the expression levels of the genes within the affected region. The dependent signals can be subtle, however, as gene expression and copy number data are affected by high levels of unrelated biological and measurement variation, and the sample sizes are typically small.

Both correlation-based and probabilistic SimCCA combine power over the adjacent genes by capturing the strongest shared signal in gene expression and copy number observations. They can also ignore unrelated signal from poorly performing probes, or probes that measure genes that are not functionally affected by the copy number change. This provides tools to distinguish so-called driver mutations having functional effects from less active passenger mutations, which is an important task in cancer studies. A further advantage of the probabilistic formulation is that the shared latent variable $z$ provides a robust measure of the amplification effects in each patient.

3.2 Implementation

SimCCA is used to study the association between gene expression and copy number in a gastric cancer data set with 41 patients and 10 controls [11]. The gene expression and copy number data sets were matched for the analysis such that the closest probe by genomic location in gene expression data was selected for each copy number probe, and probes with no match between gene expression and copy number within 5000 bp interval were discarded.
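
The probe matching step can be sketched as a nearest-neighbor search along the genome. The following is a minimal illustration, not the preprocessing code used for the study: the function name `match_probes`, the representation of probes as base-pair positions on a single chromosome, and the example coordinates are all hypothetical. It also assumes that a distance of exactly 5000 bp still counts as a match and that one expression probe may be paired with several copy number probes; the text does not specify either point.

```python
from bisect import bisect_left

def match_probes(cn_positions, ge_positions, max_dist=5000):
    """Pair each copy number probe with its closest expression probe.

    Positions are base-pair coordinates on one chromosome. Returns a list of
    (cn_index, ge_index) pairs; copy number probes with no expression probe
    within max_dist bp are dropped.
    """
    # Sort expression probes by position, remembering their original indices
    order = sorted(range(len(ge_positions)), key=lambda i: ge_positions[i])
    sorted_pos = [ge_positions[i] for i in order]

    pairs = []
    for cn_idx, pos in enumerate(cn_positions):
        j = bisect_left(sorted_pos, pos)
        # The closest probe is one of the two neighbors of the insertion point
        candidates = [c for c in (j - 1, j) if 0 <= c < len(sorted_pos)]
        if not candidates:
            continue
        best = min(candidates, key=lambda c: abs(sorted_pos[c] - pos))
        if abs(sorted_pos[best] - pos) <= max_dist:
            pairs.append((cn_idx, order[best]))
    return pairs

# Hypothetical coordinates: the third copy number probe has no close match
cn = [1000, 20000, 90000]
ge = [50000, 1500, 23000]
pairs = match_probes(cn, ge)
print(pairs)
```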

The preprocessed data has gene expression and copy number measurements from 5596 genes from ∼700 chromosomal regions (cytobands). To satisfy the normality assumptions of our model, the data was $\log_2$-transformed and the mean of the signals for each probe was set to 0 before the analysis.

Ordinary and constrained versions of canonical correlation analysis, CCA/SimCCA, were applied to investigate the dependencies between gene expression and copy numbers. The correlations were computed within a specific chromosomal window around each gene. The observed correlations provide a measure of dependency between gene copy number and expression data for each window, or chromosomal region.

Figure 1: Gene expression, copy number signal, and the dependency score for a sliding window of 15 genes along the chromosome arm 17q from the SimCCA method of Eq. (1). Known gastric-cancer associated genes from an expert-curated list are marked with black dots. [Panels: "Expression: 17q" (Signal vs. Mb), "Amplifications: 17q" (Signal vs. Mb), "Correlation-based SimCCA" (Dependency vs. Mb).]

With unconstrained $T$, the models defined by Eqs. (1) and (7) reduce to ordinary correlation-based and probabilistic CCA, respectively. We assume that the constraints for $T$ are provided prior to analysis, i.e. the prior parameter $\sigma_T$ is fixed. Alternatively, $\sigma_T$ could be optimized based on external criteria such as identification of the known cancer genes in our application. Our empirical results show, however, that already a simple prior for $T$ without an explicit optimization procedure can improve the detection of known cancer genes. We consider here the two extreme cases of the model where $T$ is (i) completely unconstrained (ordinary CCA; $\sigma_T = \infty$), and (ii) $T = I$ ($\sigma_T = 0$).
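
For the strict case $T = I$, Eq. (1) reduces to maximizing $\mathrm{cor}(Xv, Yv)$ over a single shared projection $v$. The sketch below illustrates this case together with the sliding window; it is not the implementation in the pint package. A general-purpose BFGS optimizer from SciPy stands in for the iterative scheme mentioned in Section 2.1, and the function names, the ridge term `reg`, the random restarts, the truncation of windows at the ends of the chromosome arm and the synthetic data are all assumptions of this example.

```python
import numpy as np
from scipy.optimize import minimize

def simcca_identity(X, Y, reg=1e-6, n_restarts=3, seed=0):
    """Maximize cor(X v, Y v) over a shared projection v (Eq. (1) with T = I)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, d = X.shape
    Sxx = X.T @ X / (n - 1) + reg * np.eye(d)
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(d)
    Sxy = X.T @ Y / (n - 1)

    def neg_corr(v):
        num = v @ Sxy @ v
        den = np.sqrt((v @ Sxx @ v) * (v @ Syy @ v))
        return -num / den

    # The objective is not convex: try a few starting points, keep the best
    rng = np.random.default_rng(seed)
    starts = [np.ones(d)] + [rng.normal(size=d) for _ in range(n_restarts)]
    results = [minimize(neg_corr, v0, method="BFGS") for v0 in starts]
    best = min(results, key=lambda r: r.fun)
    return -best.fun, best.x / np.linalg.norm(best.x)

def sliding_scores(X, Y, window=15):
    """Dependency score per gene from a window centered on that gene."""
    d = X.shape[1]
    half = window // 2
    scores = np.empty(d)
    for g in range(d):
        lo, hi = max(0, g - half), min(d, g + half + 1)  # truncated at the ends
        scores[g], _ = simcca_identity(X[:, lo:hi], Y[:, lo:hi])
    return scores

# Synthetic example: 51 samples, 30 genes, shared signal in genes 10..19
rng = np.random.default_rng(1)
n, d = 51, 30
z = rng.normal(size=n)
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))
X[:, 10:20] += 2 * z[:, None]
Y[:, 10:20] += 2 * z[:, None]
scores = sliding_scores(X, Y, window=5)
print(np.round(scores, 2))
```

The shared projection has one weight per gene in the window, which is half the number of free parameters of ordinary CCA on the same window.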

Point estimates for the model parameters were estimated with the EM algorithm in the probabilistic version. Strength of the shared signal versus marginal effects is measured with $\mathrm{Tr}(W W^T) / \mathrm{Tr}(\Psi)$, where $\mathrm{Tr}$ denotes matrix trace. This yields a dependency score between copy number and expression data for the investigated chromosomal neighborhood around each gene. High scores highlight regions where the dependent signal between the two data sets is particularly high relative to the data-set-specific variation.

In addition to the correlation-based and probabilistic SimCCA, we tested a simplified probabilistic version with one-dimensional shared component $z$ and isotropic covariances for the data-set-specific effects ($\Psi_x = \sigma_x^2 I$; $\Psi_y = \sigma_y^2 I$). This is a special case of the full probabilistic model, and it reduces to principal component analysis (PCA) for concatenated data $(X, Y)$. We refer to this method as pSimPCA. The simplified model does not distinguish between the shared and marginal effects as effectively as the full probabilistic CCA but it has fewer model parameters. Low-dimensional latent models are also faster to compute, and interpretation of the results is potentially more straightforward.

3.3 Validation

Results from the correlation-based SimCCA are illustrated for chromosome arm 17q in Fig. 1, where SimCCA highlights a known cancer-associated region. The figure shows the dependency score for the correlation-based SimCCA with a sliding window of 15 genes along the chromosome arm.

The correlation-based and probabilistic approaches were compared in various window sizes (10, 15, 20, 25, and 35 genes). In each experiment, the gene list ordered by the dependency measure was compared to an expert-curated list of 59 gastric-cancer associated genes in our investigated data set [11].
The correlation-based and probabilistic models were compared with respect to their ability to detect the known cancer genes, measured with the AUC value of the ROC curve for each method. Results are summarized in Fig. 2. The best AUC value (0.79) was obtained with a chromosomal window of 15 genes for the correlation-based SimCCA that directly maximizes the correlations assuming identical projections (Eq. (1)). The corresponding ROC curve is shown in Fig. 3 and presents the tradeoff between true and 'false' positive findings along the ordered gene list. While a large proportion of the most significant findings are in fact known cancer genes, the remaining findings with no known associations to gastric cancer are promising candidates for further studies; among the 100 genes with highest dependencies between gene expression and copy number in their chromosomal neighborhood, 30% of the corresponding regions had previously known association with gastric cancer, while the proportion in the whole data set is 5%.

Figure 2: AUC comparison. [Horizontal axis: Window size (genes); vertical axis: AUC; methods: SimCCA, CCA, pSimPCA, pPCA, pSimCCA, pCCA.]

Figure 3: ROC curve for the results from correlation-based SimCCA with a 15-gene sliding window. [Horizontal axis: False positive rate; vertical axis: True positive rate.]

The constrained dependency detection methods introduced in this paper outperformed the unconstrained models in most cases. The improved detection performance of the constrained models is likely explained by their ability to reduce overfitting. Interestingly, the most constrained probabilistic model, pSimPCA, outperforms the other approaches in the highest-dimensional case. In contrast, the performance of correlation-based CCA decreases steadily with increasing dimensionality (window size) as the number of samples (patients) remains fixed to 51.
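
The AUC used in this comparison can be computed directly from the dependency scores and the list of known cancer genes. The sketch below uses the rank-sum interpretation of the AUC: the probability that a randomly chosen known cancer gene receives a higher score than a randomly chosen other gene, with ties counted as one half. The function name and the toy scores are hypothetical, and this is only one of several equivalent ways to compute the value.

```python
def auc_from_scores(scores, is_known):
    """AUC of a ranking: P(score of known gene > score of other gene).

    scores:   dependency score per gene.
    is_known: True for genes on the reference list, False otherwise.
    """
    pos = [s for s, k in zip(scores, is_known) if k]
    neg = [s for s, k in zip(scores, is_known) if not k]
    if not pos or not neg:
        raise ValueError("need at least one known and one other gene")
    # Count pairwise wins of known genes over other genes; ties count 1/2
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: five genes, two of them on the reference list
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_known = [True, False, True, False, False]
auc = auc_from_scores(toy_scores, toy_known)
print(auc)
```

In the toy example the known genes win five of the six pairwise comparisons. A value of 0.5 corresponds to a random ordering and 1.0 to a ranking that places all known genes first.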

In our particular application, gene expression and copy number are expected to have strong linear correlations in cancer-associated chromosomal regions. Correlation-based approach is therefore directly suited for the cancer gene detection task and it has also fewer parameters than the probabilistic versions. However, the performance of correlation-based SimCCA reduces with increasing dimensionality. A likely explanation is that the correlation-based version models also some of the data set-specific effects, which is emphasized in higher dimensions. The probabilistic formulations provide an alternative way to bring in prior knowledge of the relationships in a principled framework. A potential advantage of the probabilistic approaches is that they have an explicit model for distinguishing the shared signal from data set-specific variation.

3.4 Biomedical interpretation of the findings

The results obtained using the SimCCA algorithm are in general concordant with the output from signal-to-noise statistics and random permutation method that was applied previously to analyze the same data [11, 12]. The advantage of the current method is that it combines the signal across adjacent genes within a particular chromosomal region already in the modeling step. Probabilistic SimCCA estimates the strongest shared signal between the data sets and ignores other variation using explicit modeling assumptions. Probabilistic versions also provide a measure of the amplification effect for each patient which allows robust identification of small patient groups with profound amplification effects that would be missed in previous permutation-based tests due to low event frequency.

In concordance with the previous analyses, the chromosomal area showing the most significant correlation between the gene copy number and expression was 17q12-q21 (Fig. 1).
There are a number of potential target genes in that region, including ERBB2 and PPP1R1B, which show clinical and biological relevance. The ERBB2 gene encodes a transmembrane tyrosine kinase receptor, which is a target of Herceptin. This monoclonal antibody specifically inactivates the overexpressed ERBB2 protein and is used to treat metastatic breast cancer patients. The expression of PPP1R1B has been shown to be associated with repression of programmed cell death and to increase the survival of the cancer cells in upper gastrointestinal tract cancers [13].

Another genomic region with correlated gene copy number and expression changes is 10q26, and FGFR2 was identified as one of the putative target genes of that region. It was recently shown that in a set of gastric cancer cell lines, FGFR2 amplification is driving the cell proliferation and promoting cancer cell survival. Furthermore, inhibition of the FGFR2 protein by small molecules retained the growth arresting and apoptotically active phenotype [14]. The detected 1q22 region harbors the MUC1 gene, whose expression was shown to be associated with the intestinal subtype of gastric cancer [11]. The 20q is one of the most frequently amplified chromosomal regions in gastric cancer. However, despite the high frequency of the amplifications, the target genes in that area remain to be described. Our analysis pinpointed the strongest correlating loci to 20q13.12 and significantly narrowed the list of putative target genes. Some of the detected chromosomal regions did not have known association with gastric cancer; we are currently investigating these results more closely.

The current application shows promising performance in detecting functional copy number changes, but biomedical studies provide also a number of other potential applications.
For example, an increasing number of paired data sets will be available in the future for studying the relationships between methylation, single-nucleotide polymorphisms, miRNAs, and other genomic features.

4 Discussion

We have introduced methods that regularize CCA solutions by taking into account similarity constraints. The methods assume that the dependencies between the different views are visible in the same dimensions, that is, the projection matrices are similar. We introduced the constraints to standard CCA, resulting in a quick method that helps in solving the “small n large p problem”, where n is the number of samples and p their dimensionality. If n is very small compared to p, even the constrained CCA may not be sufficient, and we introduced a Bayesian variant into which further prior knowledge can be easily inserted, and which is capable of rigorously handling uncertainty in the data. While we only compare SimCCA and CCA in the present work, the probabilistic formulation allows smooth tradeoff between these two extremes, which is potentially useful in many applications.

Importantly, the constrained approaches for dependency detection can be directly applied in practical tasks in knowledge discovery; good results were obtained in a promising medical application on searching for potential cancer genes by detecting dependencies between gene expression and DNA copy number changes of the genes.

Acknowledgements: The project was funded by Tekes MultiBio project. LL and SK belong to the Adaptive Informatics Research Centre and Helsinki Institute for Information Technology HIIT. LL is funded by the Graduate School of Computer Science and Engineering. SK is partially supported by EU FP7 NoE PASCAL2, ICT 216886.

References

[1] J.W. Fisher III, T. Darrell, W.T. Freeman, and P.A. Viola, “Learning joint statistical models for audio-visual fusion and segregation,” in Advances in Neural Information Processing Systems 13, Cambridge, MA, 2000, pp. 772–778, MIT Press.
[2] A. Klami and S. Kaski, “Non-parametric dependent components,” in Proceedings of ICASSP'05, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. V–209–V–212. IEEE, 2005.
[3] C. Fyfe and P.L. Lai, “ICA using kernel canonical correlation analysis,” in Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2000), 2000, pp. 279–284.
[4] F.R. Bach and M.I. Jordan, “A probabilistic interpretation of canonical correlation analysis,” Tech. Rep. 688, Department of Statistics, University of California, Berkeley, 2005.
[5] T. De Bie and B. De Moor, “On the regularization of canonical correlation analysis,” in Proceedings of the International Conference on Independent Component Analysis and Blind Source Separation (ICA2003), S.-I. Amari, A. Cichocki, S. Makino, and N. Murata, Eds. 2003.
[6] L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in ICML '08: Proceedings of the 25th international conference on Machine learning, New York, NY, USA, 2008, pp. 1024–1031, ACM.
[7] H.D. Vinod, “Canonical ridge and the econometrics of joint production,” J. Econometrics, vol. 4, no. 2, pp. 147–166, 1976.
[8] A. Klami and S. Kaski, “Generative models that discover dependencies between data sets,” in Machine learning for signal processing XVI, S. McLoone, T. Adali, J. Larsen, M. Van Hulle, A. Rogers, and S.C. Douglas, Eds., pp. 123–128. IEEE, 2006.
[9] A. Klami and S. Kaski, “Local dependent components,” in Proceedings of ICML 2007, the 24th International Conference on Machine Learning, Zoubin Ghahramani, Ed., pp. 425–432. Omnipress, 2007.
[10] C. Archambeau, N. Delannay, and M. Verleysen, “Robust probabilistic projections,” in Proceedings of the 23rd International conference on machine learning, W.W. Cohen and A. Moore, Eds. 2006, pp. 33–40, ACM.
[11] S. Myllykangas, S. Junnila, A. Kokkola, R. Autio, I. Scheinin, T. Kiviluoto, M.L. Karjalainen-Lindsberg, J. Hollmén, S. Knuutila, P. Puolakkainen, and O. Monni, “Integrated gene copy number and expression microarray analysis of gastric cancer highlights potential target genes,” Int J Cancer, vol. 123, no. 4, pp. 817–25, 2008.
[12] S. Hautaniemi, M. Ringnér, P. Kauraniemi, R. Autio, H. Edgren, O. Yli-Harja, J. Astola, A. Kallioniemi, and O.-P. Kallioniemi, “A strategy for identifying putative causes of gene expression variation in human cancers,” J Franklin Institute, vol. 341, pp. 77–88, 2004.
[13] A. Belkhiri, A. Zaika, N. Pidkovka, S. Knuutila, C. Moskaluk, and W. El-Rifai, “Darpp-32: a novel antiapoptotic gene in upper gastrointestinal carcinomas,” Cancer Res, vol. 65, pp. 6583–92, 2005.
[14] K. Kunii, L. Davis, J. Gorenstein, H. Hatch, M. Yashiro, A. Di Bacco, C. Elbi, and B. Lutterbach, “FGFR2-amplified gastric cancer cell lines require FGFR2 and Erbb3 signaling for growth and survival,” Cancer Res., vol. 68, no. 7, pp. 2340–8, 2008.
