Exploratory Analysis of Functional Data via Clustering and Optimal Segmentation



Georges Hébrail^b, Bernard Hugueney^a, Yves Lechevallier^c, Fabrice Rossi^{∗,b}

^a LAMSADE, Université Paris Dauphine, Place du Maréchal de Lattre de Tassigny, 75016 Paris – France
^b BILab, Télécom ParisTech, LTCI - UMR CNRS 5141, 46, rue Barrault, 75013 Paris – France
^c Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex – France

Abstract

We propose in this paper an exploratory analysis algorithm for functional data. The method partitions a set of functions into K clusters and represents each cluster by a simple prototype (e.g., piecewise constant). The total number of segments in the prototypes, P, is chosen by the user and optimally distributed among the clusters via two dynamic programming algorithms. The practical relevance of the method is shown on two real world datasets.

Keywords: Functional Data, Multiple time series, Exploratory analysis, Clustering, Segmentation, Dynamic programming

1. Introduction

Functional Data Analysis [26] addresses problems in which the observations are described by functions rather than finite dimensional vectors. A well known real world example of such data is given by spectrometry, in which each object is analysed via one spectrum, that is, a function which maps wavelengths to absorbance values. Online monitoring of hardware is also a good example of such data: each object is described by several time series associated to physical quantities monitored at specified sampling rates. Another application domain is related to electric power consumption curves, also called (electric) load curves. Such data describe the electric power consumption over time of one household or of one small or large industry.
Load curves can be very voluminous since there are many consumers (over 35 million in France) which can be observed during a long period (often several years) and at a rate of up to one point every minute. Consequently, there is a strong need for applying unsupervised learning methods to summarize such datasets of load curves. A typical way to do so is to split each curve into daily (or weekly) periods and to perform a clustering of the resulting daily curves. The motivation of such analyses is related to several practical problems: understanding the consumption behaviour of consumers in relation to their equipment or weather conditions, defining new prices, optimizing the production of electric power, and, in the coming years, monitoring household consumption to face high peaks.

∗ Corresponding author. Email addresses: Georges.Hebrail@telecom-paristech.fr (Georges Hébrail), Yves.Lechevallier@inria.fr (Yves Lechevallier), Fabrice.Rossi@telecom-paristech.fr (Fabrice Rossi)

Preprint submitted to Neurocomputing, October 24, 2018

We focus in this paper on the exploratory analysis of a set of curves (or time series). The main idea is to provide the analyst with a summary of the set with a manageable complexity. A classical solution for multivariate data consists in using a prototype based clustering approach: each cluster is summarized by its prototype. Standard clustering methods such as K-means and Self Organizing Maps have been adapted to functional data and could be used to implement this solution [1, 6, 8, 29]. Another possibility comes from symbolization approaches, in which a time series is represented by a sequence of symbols. In the SAX approach [24], the time series is transformed into a piecewise representation with contiguous time intervals of equal size: the value associated with each interval is the average of the actual values in the interval.
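This averaging step (a piecewise aggregate approximation in the spirit of SAX [24], before its quantization stage) can be sketched as follows; the function name and toy series are our own illustration, not part of [24]:

```python
import numpy as np

def piecewise_aggregate(series, n_segments):
    """Approximate a series by the mean value on each of
    n_segments contiguous, (almost) equal-size time intervals."""
    series = np.asarray(series, dtype=float)
    # Split the index range into n_segments contiguous chunks.
    chunks = np.array_split(np.arange(len(series)), n_segments)
    return np.array([series[c].mean() for c in chunks])

values = piecewise_aggregate([1.0, 1.0, 2.0, 2.0, 8.0, 8.0], 3)
# Each value is the average of one third of the series.
```

Quantizing the resulting levels into a small alphabet then yields the symbolic representation.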
This approach is very fast but does not give any guarantee on the associated error. In [15], a piecewise constant approximation of a time series is constructed via a segmentation of the time domain into contiguous intervals on which the series is represented by its average value, which can be turned into a label in a subsequent quantization step. When we are given a set of curves, a unique segmentation can be found in order to represent all the curves on a common piecewise constant basis (see [28] for an optimal solution). This was used as a preprocessing step in e.g. [19, 30].

We propose in this paper to merge the two approaches: we build a K-means like clustering of a set of functions in which each prototype is given by a simple function defined in a piecewise way. The input interval of each prototype is partitioned into sub-intervals (segments) on which the prototype assumes a simple form (e.g., constant). Using dynamic programming, we obtain an optimal segmentation for each prototype, while the number of segments used in each cluster is also optimally chosen with respect to a user specified total number of segments. In the case of piecewise constant prototypes, a set of functions is summarized via 2P − K real values, where K is the number of prototypes and P the total number of segments used to represent the prototypes.

The rest of this paper is organized as follows. Section 2 introduces the problem of finding a simple summary of a single function, links this problem with optimal segmentation and provides an overview of the dynamic programming solution to optimal segmentation. Section 3 shows how single function summarizing methods can be combined with clustering methods to give a summary of a set of functions. It also introduces the optimal resource allocation algorithm that computes an optimal number of segments for each cluster. Section 4 illustrates the method on two real world datasets.

2. Summarizing one function via optimal segmentation: a state of the art

The proposed solution is built on two elements that are mostly independent: any standard clustering algorithm that can handle functional data, and a functional summarizing technique that can build an appropriate low complexity representation of a set of homogeneous functions. The present section describes the summarizing technique for a single function; the extension to a set of functions is described in Section 3.

Building a summary of a function is deeply connected to building an optimal segmentation of the function, a problem which belongs to the general task of function approximation. Optimal segmentation has been studied under different points of view (and different names) in a large and scattered literature (see [2, 3, 16, 21, 22, 25, 32] and references therein). The goal of the present section is not to provide new material but rather to give a detailed exposition of the relations between summary and segmentation on the one hand, and of the optimal segmentation framework itself on the other hand.

2.1. Simple functions

We consider a function s sampled at M distinct points $(t_k)_{k=1}^M$ from the interval $[t_1, t_M]$ (points are assumed to be ordered, i.e., $t_1 < t_2 < \ldots < t_M$). Our goal is to approximate s by a simpler function g on $[t_1, t_M]$. The simplicity concept targeted in this article is not based on smoothness as in regularization [33], or on capacity as in learning theory (see e.g., [9]). It corresponds rather to an informal measure of the simplicity of the visual representation of g for a human analyst. The difference between those three aspects (smoothness, capacity and visual simplicity) can be illustrated by an elementary example: let us assume that g is chosen as a linear combination of N fixed functions, i.e.,

$$g(t) = \sum_{i=1}^{N} \beta_i \phi_i(t). \qquad (1)$$

It is well known that the set of indicator functions based on functions of the previous form has a VC dimension of N, as long as the functions $(\phi_i)_{i=1}^N$ are linearly independent (see e.g., [9]). If we consider now the $L^2$ norm of the second derivative of g as a roughness measure, it is quite clear that the actual smoothness will depend both on the functions $(\phi_i)_{i=1}^N$ and on the values of the coefficients $(\beta_i)_{i=1}^N$. As an extreme case, one can consider $\phi_1(t) = \mathrm{sign}(t)$ while the $(\phi_i)_{i=2}^N$ are smooth functions. Then any g with $\beta_1 \neq 0$ is non smooth, while all other functions are smooth.

The visual complexity of g will also clearly depend on both the basis functions and the coefficients, but in a way that will generally be quite opposite to the smoothness. Let us again consider an extreme case with N = 2, the interval [0, 1] and two choices of basis functions, $(\phi_1, \phi_2)$ and $(\psi_1, \psi_2)$, defined as follows:

$$\phi_1(t) = \begin{cases} 1 & \text{if } t < \tfrac12 \\ 0 & \text{if } t \geq \tfrac12 \end{cases} \qquad \phi_2(t) = \begin{cases} 0 & \text{if } t < \tfrac12 \\ 1 & \text{if } t \geq \tfrac12 \end{cases} \qquad (2)$$

and

$$\psi_1(t) = \exp\!\left(-16\left(t - \tfrac14\right)^2\right) \qquad \psi_2(t) = \exp\!\left(-8\left(\tfrac34 - t\right)^2\right). \qquad (3)$$

Obviously, functions generated by $(\phi_1, \phi_2)$ are far less smooth than functions generated by $(\psi_1, \psi_2)$ (the former are piecewise constant). However, as shown in Figure 1, functions generated by $(\phi_1, \phi_2)$ are much easier to describe and to understand than functions generated by $(\psi_1, \psi_2)$. Indeed, piecewise constant functions admit a simple textual description: the function on the left part of Figure 1 takes the value 0.5 for the first half of the period of study and then switches to 1 for the last part.

[Figure 1: Left: function generated by $(\phi_1, \phi_2)$; right: function generated by $(\psi_1, \psi_2)$]
On the contrary, smooth functions are inherently more complex to describe: the function on the right part of Figure 1 needs a long textual description specifying its evolution through time. In addition, no reasonable textual description will give a way to reconstruct exactly the original function. To summarize, classes of functions with identical capacity (as measured by the VC dimension) can contain functions of arbitrary smoothness and of arbitrary, unrelated, visual complexity. We therefore need a specific solution to induce a simple approximation of a function.

Experimental studies in information visualization (see, e.g., [13]) have shown that different tasks (e.g., clustering, change detection, etc.) have different types of adapted visual features. In our context, we target applications in which a textual (written or spoken) description of the behaviour of the function is useful. For instance, in mass spectrometry, peak positions are very informative and correspond to descriptions such as "there is a peak of this height at that position". In the case of electrical consumption, the mean consumption level during a time frame is an interesting aspect and corresponds to descriptions such as "the outlet consumes this amount of power from time a to time b". Hardware monitoring provides another example, in which quantities of interest are generally constant on some time frames and switch linearly from one stable value to another.

To implement this idea, we rely on a segmentation approach detailed in the next section (and covered in a large body of literature, as recalled in the introduction). A simple function in our context is given by a partition of $[t_1, t_M]$ into a small number of segments on which the function takes a simple, application dependent, parametric form. If the parametric form is reduced to a single value, the simple function is piecewise constant and is easy to describe in text.
Other convenient parametric forms are affine functions (to provide less jumpy transitions) and peaks. This approach is closely related to (and inspired by) symbolization techniques used for instance in time series indexing [15]: the idea is to build a piecewise constant approximation of a function and then to quantize the levels (the constants) into a small number of values. The obtained values are replaced by symbols (e.g., letters) and the time series is therefore translated into a sequence of symbols.

2.2. Segmentation

Let us first analyze an example of the general scheme described in the previous section. We consider a piecewise constant approximation given by P intervals $(I_p)_{p=1}^P$ and the corresponding P values $(a_p)_{p=1}^P$ (we assume that $(I_p)_{p=1}^P$ is a partition of $[t_1, t_M]$). The approximating function g is defined by $g(t) = a_p$ when $t \in I_p$. If we consider the $L^2$ error criterion, we obtain a particular case of the Stone approximation problem [32], as we have to minimize

$$E\left(s, (I_p)_{p=1}^P, (a_p)_{p=1}^P\right) = \int_{t_1}^{t_M} (s(t) - g(t))^2\,dt = \sum_p \int_{I_p} (s(t) - a_p)^2\,dt. \qquad (4)$$

The integral can be approximated by a quadrature scheme. We consider the simplest setting, in which a summation over $(t_k)_{k=1}^M$ provides a sufficient approximation, but the generalization to weighted schemes such as [23] is straightforward. In practice, we therefore optimize

$$E\left(s, (I_p)_{p=1}^P, (a_p)_{p=1}^P\right) = \sum_p \sum_{t_k \in I_p} (s(t_k) - a_p)^2. \qquad (5)$$

The difficulty lies in the segmentation, i.e., in the choice of the partition $(I_p)_{p=1}^P$: given this partition, the optimal values of the $(a_p)_{p=1}^P$ are the means $\mu_p = \frac{1}{|I_p|} \sum_{t_k \in I_p} s(t_k)$, where $|I_p|$ denotes the number of $t_k$ in $I_p$.

More generally, the proposed method is built upon a parametric approximation on each segment. Let us denote by $Q(s, I)$ the minimal error made by this parametric approximation of s on the interval I. The piecewise constant approximation described above corresponds to

$$Q(s, I) = \sum_{t \in I} \left(s(t) - \frac{1}{|I|} \sum_{v \in I} s(v)\right)^2. \qquad (6)$$

Similar expressions can be derived for a piecewise linear approximation or any other reasonable solution. For instance

$$Q(s, I) = \sum_{t \in I} |s(t) - m(s, I)|, \qquad (7)$$

where $m(s, I)$ is the median of $\{s(t_k) \mid t_k \in I\}$, provides a more robust solution. Given Q, the optimal segmentation of s is obtained by minimizing

$$E\left(s, (I_p)_{p=1}^P\right) = \sum_p Q(s, I_p), \qquad (8)$$

over the partitions $(I_p)_{p=1}^P$ of $[t_1, t_M]$. This formulation emphasizes the central hypothesis of the method: the optimal approximation on a segment depends only on the values taken by s in that segment. While a more general segmentation problem, in which this constraint is removed, can easily be defined, its resolution is much more computationally intensive than the one targeted here.

2.3. Optimal segmentation

Bellman showed in [3] that the Stone approximation problem [32] can be solved exactly and efficiently by dynamic programming: this is a consequence of the independence between the approximations used in each segment. This has been rediscovered and/or generalized on numerous occasions (see e.g., [2, 16, 25]). A different point of view is given in [21, 22]. As the function s is known only through its values at $(t_k)_{k=1}^M$, defining a partition of $[t_1, t_M]$ into P intervals $(I_p)_{p=1}^P$ is not needed. A partition $(C_p)_{p=1}^P$ of $\{t_1, t_2, \ldots, t_M\}$ is sufficient, as long as it is ordered: if $t_k \in C_p$ and $t_l \in C_p$, then $\{t_k, t_{k+1}, \ldots, t_l\} \subset C_p$. The goal of the segmentation is then to find an optimal ordered partition of $(t_k)_{k=1}^M$ according to the error criterion defined in equation (5) (where $I_p$ is replaced by $C_p$ and Q is modified accordingly).
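As a concrete illustration (ours, not part of the original algorithms), the two per-segment errors of equations (6) and (7) can be computed directly from the sampled values:

```python
import numpy as np

def q_constant(values):
    """Equation (6): squared error of the best constant model
    (the mean) on one segment of sampled values."""
    values = np.asarray(values, dtype=float)
    return float(np.sum((values - values.mean()) ** 2))

def q_median(values):
    """Equation (7): absolute error of the best constant model
    under the L1 criterion (the median), a more robust choice."""
    values = np.asarray(values, dtype=float)
    return float(np.sum(np.abs(values - np.median(values))))

segment = [1.0, 2.0, 3.0, 10.0]
# The outlier 10.0 inflates q_constant far more than q_median.
```

Comparing the two on the toy segment above shows why the median-based criterion is preferred when outliers are expected.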
As shown in [21, 22], this problem can be solved by dynamic programming as long as the error criterion is additive, i.e., of the general form

$$E\left(s, (C_p)_{p=1}^P\right) = \sum_p Q(s, C_p). \qquad (9)$$

Moreover, the summation operator can be replaced by any commutative aggregation operator, e.g., the maximum. As already mentioned in the previous section, the crucial point is that the error criterion is a combination of values obtained independently on each cluster.

The dynamic programming algorithm for minimizing the cost of equation (9) is obtained as follows. We first define a family of sub-problems:

$$F(s, k, j) = \min_{(C_p)_{p=1}^j \text{ ordered partition of } \{t_k, t_{k+1}, \ldots, t_M\}} \; \sum_p Q(s, C_p). \qquad (10)$$

The basic idea is to compute $F(s, k, j)$ using $F(s, \cdot, j-1)$, as the best partition into j clusters is obtained with

$$F(s, k, j) = \min_{k \leq l \leq M-j+1} \; Q(s, \{t_k, \ldots, t_l\}) + F(s, l+1, j-1). \qquad (11)$$

Indeed, the partition is ordered and therefore, up to a renumbering of the clusters, there is an l such that $C_1 = \{t_k, \ldots, t_l\}$. Moreover, the quality measure is additive, and therefore finding the best partition into j clusters under the constraint that $C_1 = \{t_k, \ldots, t_l\}$ amounts to finding the best partition of $\{t_{l+1}, \ldots, t_M\}$ into j − 1 clusters. This leads to Algorithm 1, in which $W(k, j)$ is the winner split, i.e., the l that realizes the minimum in equation (11). The final loop is the backtracking phase, in which the split positions are assembled to produce the result of the algorithm, C, which gives the last indexes of the first P − 1 clusters of the partition. It should be noted that while Algorithm 1 outputs only the optimal partition into P clusters, an actual implementation will of course also provide the optimal model. Moreover, as the algorithm produces all values of $F(\cdot, \cdot)$, it can provide at once all optimal partitions into p clusters for p ranging from 1 to P.
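A minimal sketch of this dynamic program for the piecewise constant case follows; it is our own illustration (names and interface are not from the paper), and it uses prefix sums so that each $Q(s, \{t_k, \ldots, t_l\})$ is evaluated in O(1), in the spirit of the recursive computation discussed below:

```python
import numpy as np

def optimal_segmentation(s, P):
    """Optimal ordered partition of the samples of s into P segments,
    minimizing the total within-segment squared error (eqs. (9)-(11))."""
    s = np.asarray(s, dtype=float)
    M = len(s)
    # Prefix sums of s and s**2: Q on s[k:l+1] is sum(s**2) - sum(s)**2 / n.
    c1 = np.concatenate(([0.0], np.cumsum(s)))
    c2 = np.concatenate(([0.0], np.cumsum(s ** 2)))

    def Q(k, l):  # squared error of the mean on indices k..l inclusive
        n = l - k + 1
        tot = c1[l + 1] - c1[k]
        return (c2[l + 1] - c2[k]) - tot * tot / n

    INF = float("inf")
    F = np.full((M + 1, P + 1), INF)   # F[k, j]: best cost for s[k:] in j segments
    W = np.zeros((M + 1, P + 1), dtype=int)
    F[M, 0] = 0.0
    for j in range(1, P + 1):
        for k in range(M - 1, -1, -1):
            for l in range(k, M):
                cost = Q(k, l) + F[l + 1, j - 1]
                if cost < F[k, j]:
                    F[k, j] = cost
                    W[k, j] = l       # winner split, as W(k, j) in Algorithm 1
    # Backtracking: recover the (first, last) index of each segment.
    bounds, k = [], 0
    for j in range(P, 0, -1):
        l = int(W[k, j])
        bounds.append((k, l))
        k = l + 1
    return float(F[0, P]), bounds
```

On a step-shaped signal the optimal two-segment split falls exactly at the jump, with zero residual error.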
Extracting all these optimal partitions is done via P backtracking loops, for a total cost in O(P²).

Figures 2 and 3 show an example of the results obtained by the algorithm. The original function on the left side of Figure 2 is a spectrum from the Tecator dataset¹, for which M = 100. The spectrum is segmented into 1 to 10 segments (positions of the optimal segments are given on the right side of Figure 2). The resulting approximations for 4 and 6 segments are given in Figure 3.

¹ Data are available on StatLib at http://lib.stat.cmu.edu/datasets/tecator and consist of near infrared absorbance spectra of meat samples recorded on a Tecator Infratec Food and Feed Analyzer.

Algorithm 1 Dynamic programming summarizing of a single function
1: for k = 1 to M do
2:   F(s, k, 1) ← Q(s, {t_k, t_{k+1}, ..., t_M})
3:   W(k, 1) ← NA {no meaningful value for j = 1}
4: end for
5: for j = 2 to P do
6:   for k = 1 to M − j + 1 do
7:     F(s, k, j) ← Q(s, {t_k}) + F(s, k + 1, j − 1)
8:     W(k, j) ← k
9:     for l = k + 1 to M − j + 1 do
10:      if Q(s, {t_k, ..., t_l}) + F(s, l + 1, j − 1) < F(s, k, j) then
11:        F(s, k, j) ← Q(s, {t_k, ..., t_l}) + F(s, l + 1, j − 1)
12:        W(k, j) ← l
13:      end if
14:    end for
15:  end for
16: end for
17: C ← (NA, ..., NA)
18: C(P − 1) ← W(1, P)
19: for j = P − 2 downto 1 do
20:   C(j) ← W(C(j + 1) + 1, j + 1)
21: end for

[Figure 2: Segmentation of a single function: original function on the left, segment positions on the right]

[Figure 3: Piecewise constant approximations with 4 segments on the left and 6 segments on the right]

2.4. Algorithmic cost

Given all the Q(s, {t_k, ..., t_l}), Algorithm 1 runs in O(PM²). This is far more efficient than a naive approach in which all possible ordered partitions would be evaluated (there are $\binom{M-1}{P-1}$ such partitions). However, an efficient implementation relies on a fast evaluation of the Q(s, {t_k, ..., t_l}). For instance, a naive implementation based on equation (6) leads to an O(M³) cost that dominates the cost of Algorithm 1.

Fortunately, recursive formulations can be used to reduce, in some cases, the computation cost associated with the Q(s, {t_k, ..., t_l}) to O(M²). As an illustration, Algorithm 2 computes Q for equation (6) (piecewise constant approximation) in O(M²). Similar algorithms can be derived for other choices of the approximation used in each segment. The memory usage can also be reduced from O(M²) in Algorithms 1 and 2 to O(M) (see, e.g., [21, 22]).

Algorithm 2 Recursive calculation of Q(s, {t_k, ..., t_l})
1: for k = 1 to M do
2:   µ(s, {t_k}) ← s(t_k)
3:   Q(s, {t_k}) ← 0
4: end for
5: for l = 2 to M do
6:   µ(s, {t_1, ..., t_l}) ← (1/l)[(l − 1) µ(s, {t_1, ..., t_{l−1}}) + s(t_l)]
7:   Q(s, {t_1, ..., t_l}) ← Q(s, {t_1, ..., t_{l−1}}) + (l/(l − 1))[s(t_l) − µ(s, {t_1, ..., t_l})]²
8: end for
9: for k = 2 to M − 1 do
10:  for l = k + 1 to M do
11:    µ(s, {t_k, ..., t_l}) ← (1/(l − k + 1))[(l − k + 2) µ(s, {t_{k−1}, ..., t_l}) − s(t_{k−1})]
12:    Q(s, {t_k, ..., t_l}) ← Q(s, {t_{k−1}, ..., t_l}) − ((l − k + 1)/(l − k + 2))[s(t_{k−1}) − µ(s, {t_k, ..., t_l})]²
13:  end for
14: end for

2.5. Extensions and variations

As mentioned before, the general framework summarized in equation (9) is very flexible and accommodates numerous variations.
Those variations include the approximation model (constant, linear, peak, etc.), the quality criterion (quadratic, absolute value, robust Huber loss [14], etc.) and the aggregation operator (sum or maximum of the pointwise errors).

Figure 4 gives an application example: the spectrum from Figure 2 is approximated via a piecewise linear representation (on the right hand side of the figure). As expected, the piecewise linear approximation is more accurate than the piecewise constant one, while the former uses fewer numerical parameters than the latter. Indeed, the piecewise linear approximation is easier to describe than the piecewise constant one: one needs only to specify the 4 breakpoints between the segments together with 10 extremal values, while the piecewise constant approximation needs 10 numerical values together with 9 breakpoints.

[Figure 4: Segmentation of a single function: piecewise constant with 10 segments (left) versus piecewise linear with 5 segments (right)]

However, this example is particularly favorable to the piecewise linear approximation. Indeed, the independence hypothesis embedded in equation (9) and needed for the dynamic programming approach prevents us from introducing non local constraints in the approximating function. In particular, one cannot request the piecewise linear approximation to be continuous. This is emphasized in Figure 4 by the disconnected drawing of the approximating function: in fact, the function g is not defined between $t_k$ and $t_{k+1}$ if those two points belong to distinct clusters (in the case of Bellman's solution [3], the function is not well defined at the boundaries of the intervals).
Of course, one can use linear interpolation to connect $g(t_k)$ with $g(t_{k+1})$, but this can introduce in practice P − 1 additional, very short, segments, leading to an overly complex model. In the example illustrated in Figure 4, the piecewise approximation does not suffer from large jumps between segments. Were that the case, one could use an enriched description of the function, specifying the starting and ending values on each segment. This would still be simpler than the piecewise constant description. However, if the original function is continuous, using a non continuous approximation can lead to false conclusions.

To illustrate the flexibility of the proposed framework beyond the simple variations listed above, we derive a solution to this continuity problem [22]. Let us first define

$$Q(s, \{t_k, \ldots, t_l\}) = \sum_{j=k}^{l} \left(s(t_j) - \frac{(s(t_k) - s(t_l))\, t_j + (t_k s(t_l) - t_l s(t_k))}{t_k - t_l}\right)^2. \qquad (12)$$

This corresponds to the total quadratic error made by the linear interpolation of s on $\{t_k, \ldots, t_l\}$ based on the two interpolation points $(t_k, s(t_k))$ and $(t_l, s(t_l))$. Then we replace the quality criterion of equation (9) by the following one:

$$E\left((k_l)_{l=0}^P\right) = \sum_{l=1}^{P} Q(s, \{t_{k_{l-1}}, \ldots, t_{k_l}\}), \qquad (13)$$

defined for an increasing series $(k_l)_{l=0}^P$ with $k_0 = 1$ and $k_P = M$. Obviously, $E\left((k_l)_{l=0}^P\right)$ is the total quadratic error made by the piecewise linear interpolation of s on $[t_1, t_M]$ based on the P + 1 interpolation points $(t_{k_l}, s(t_{k_l}))_{l=0}^P$. While $(k_l)_{l=0}^P$ does not correspond strictly to a partition of $\{t_1, \ldots, t_M\}$ (because each interpolation point belongs to two segments, except for the first and the last ones), the criterion E is additive and has to be optimized over an ordered structure. As such, it can be minimized by a slightly modified version of Algorithm 1.
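The interpolation error of equation (12) is straightforward to evaluate; the sketch below (our own illustration, written with the equivalent two-point form of the chord rather than the literal fraction) measures it for a candidate segment:

```python
import numpy as np

def q_interp(t, s, k, l):
    """Equation (12): total squared error of the chord joining
    (t[k], s[k]) and (t[l], s[l]) over the points k..l inclusive."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    # Value of the interpolating line at each t[j], k <= j <= l.
    line = s[k] + (s[l] - s[k]) * (t[k:l + 1] - t[k]) / (t[l] - t[k])
    return float(np.sum((s[k:l + 1] - line) ** 2))

t = [0.0, 1.0, 2.0, 3.0]
# A perfectly linear stretch gives zero interpolation error...
err_linear = q_interp(t, [0.0, 2.0, 4.0, 6.0], 0, 3)
# ...while a kink in the middle does not.
err_kink = q_interp(t, [0.0, 2.0, 2.0, 0.0], 0, 3)
```

Plugging this Q into the dynamic program (with the shared-endpoint indexing of equation (13)) yields the continuous summaries discussed next.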
[Figure 5: Segmentation of a single function: piecewise linear with 5 segments (left) versus linear interpolation with 5 segments (right)]

Figure 5 shows the differences between the piecewise linear approach and the linear interpolation approach. For continuous functions, the latter seems to be more adequate than the former: it does not introduce arbitrary discontinuities in the approximating function, and it provides a simple summary of the original function.

The solutions explored in this section specifically target the summarizing issue by providing simple approximating functions. As explained in Section 2.1, this rules out smooth approximations. In other application contexts, not addressed here, one can trade simplicity for smoothness. In [4], a model based approach is used to build a piecewise polynomial approximation of a function. The algorithm provides a segmentation useful, for instance, to identify underlying regime switching, as well as a smooth approximation of the original noisy signal. While the methods and frameworks are related, they differ in the favored quality: one favors smoothness, the other favors simplicity.

3. Summarizing several functions

Let us now consider N functions $(s_i)_{i=1}^N$ sampled at M distinct points $(t_k)_{k=1}^M$ from the interval $[t_1, t_M]$ (as in Section 2, the points are assumed to be ordered). Our goal is to leverage the function summarizing framework presented in Section 2 to build a summary of the whole set of functions. We first address the case of a homogeneous set of functions and then tackle the general case.

3.1. Single summary

Homogeneity is assumed with respect to the chosen functional metric, i.e., to the error criterion used for segmentation in Section 2, for instance the $L^2$ norm.
Then a set of functions is homogeneous if its diameter is small compared to the typical variations of a function from the set. In the quadratic case, we require, for instance, that the maximal quadratic distance between two functions of the set be small compared to the variances of the functions of the set around their mean values. Then, if the functions $(s_i)_{i=1}^N$ are assumed to be homogeneous, finding a single summary function seems natural. In practice, this corresponds to finding a simple function g that is close to each of the $s_i$ as measured by a functional norm: one should choose a summary type (e.g., piecewise constant), a norm (e.g., $L^2$) and a way to aggregate the individual comparisons of each function to the summary (e.g., the sum of the norms).

Let us first consider the case of a piecewise constant function g defined by P intervals $(I_p)_{p=1}^P$ (a partition of $[t_1, t_M]$) and P values $(a_p)_{p=1}^P$ (with $g(t) = a_p$ when $t \in I_p$). If we measure the distance between g and $s_i$ via the $L^2$ distance and consider the sum of the $L^2$ distances as the target error measure, we obtain the following segmentation error:

$$E\left((s_i)_{i=1}^N, (I_p)_{p=1}^P, (a_p)_{p=1}^P\right) = \sum_{i=1}^{N} \sum_{p=1}^{P} \sum_{t_k \in I_p} (s_i(t_k) - a_p)^2, \qquad (14)$$

using the same quadrature scheme as in Section 2.2. Huygens' theorem gives

$$\sum_{i=1}^{N} (s_i(t_k) - a_p)^2 = N (\mu(t_k) - a_p)^2 + \sum_{i=1}^{N} (s_i(t_k) - \mu(t_k))^2, \qquad (15)$$

where $\mu = \frac{1}{N}\sum_{i=1}^N s_i$ is the mean function. Therefore, minimizing E from equation (14) is equivalent to minimizing

$$E'\left((s_i)_{i=1}^N, (I_p)_{p=1}^P, (a_p)_{p=1}^P\right) = \sum_{p=1}^{P} \sum_{t_k \in I_p} (\mu(t_k) - a_p)^2. \qquad (16)$$

In other words, the problem consists simply in building an optimal summary of the mean curve of the set, and it is solved readily with Algorithm 1. The additional computational cost induced by the calculation of the mean function is in O(NM).
This will generally be negligible compared to the O(PM²) cost of the dynamic programming.

The reasoning above applies more generally to piecewise models used with the sum of $L^2$ distances. Indeed, if g is such that $g(t) = g_p(t)$ when $t \in I_p$, then we have

$$\sum_{i=1}^{N} (s_i(t_k) - g_p(t_k))^2 = N (\mu(t_k) - g_p(t_k))^2 + \sum_{i=1}^{N} (s_i(t_k) - \mu(t_k))^2, \qquad (17)$$

which leads to the same solution: g can be obtained as the summary of the mean curve. The case of the linear interpolation proposed in Section 2.5 is a bit different, in the sense that interpolating several curves at the same time is not possible. Therefore, the natural way to extend the proposal to a set of curves is to interpolate the mean curve. An example of the result obtained by this method is given in Figure 6.

[Figure 6: Group summary via a linear interpolation of the mean; the grey dashed/dotted curves are the original ones, the grey solid curve is the mean curve and the black curve is the piecewise linear interpolation curve that provides the group summary]

In the general case, we assume given an error criterion Q that applies to a set of contiguous evaluation points $\{t_k, \ldots, t_l\}$: it gives the error made by the local model on $\{t_k, \ldots, t_l\}$ as an approximation of all the functions $(s_i)_{i=1}^N$. Then, given an ordered partition $(C_p)_{p=1}^P$ of $\{t_1, \ldots, t_M\}$, we use the sum of the $Q((s_i)_{i=1}^N, C_p)$ as a global error measure and obtain an additive form as in equation (9). Algorithm 1 applies straightforwardly (it also applies when taking, e.g., the maximum of the $Q((s_i)_{i=1}^N, C_p)$ over p). The only difficulty, compared to the case of a unique function, is the efficient calculation of the $Q((s_i)_{i=1}^N, \{t_k, \ldots, t_l\})$. As shown above, the $L^2$ case is very favorable, as it amounts to summarizing the mean curve.
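Under the sum-of-$L^2$ criterion, this reduction means that a whole set of curves can be fed to the single-curve segmenter through its mean. A small numerical check of the idea (toy data and brute-force comparison are ours):

```python
import numpy as np

# N curves sampled at M common points (toy data).
curves = np.array([[1.0, 1.2, 5.0, 5.2],
                   [0.8, 1.0, 4.8, 5.0],
                   [1.2, 1.4, 5.2, 5.4]])
mean_curve = curves.mean(axis=0)

def best_constant_direct(block):
    """Constant a minimizing sum_i sum_k (s_i(t_k) - a)^2 on a block
    of columns: it is the grand mean over curves and points."""
    return float(block.mean())

# By Huygens' theorem (equations (15)-(16)), the optimal constant for
# the whole set on a segment equals the optimal constant for the mean
# curve on that same segment.
segment = slice(0, 2)            # evaluation points t_1, t_2
a_set = best_constant_direct(curves[:, segment])
a_mean = float(mean_curve[segment].mean())
```

The two constants coincide, which is exactly why the L² case costs only an extra O(NM) for the mean.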
However, more complex norms and/or aggregation rules lead to difficulties. If we use for instance

$$Q\left((s_i)_{i=1}^N, \{t_k, \ldots, t_l\}\right) = \min_a \max_{1 \leq i \leq N} \sum_{j=k}^{l} (s_i(t_j) - a)^2, \qquad (18)$$

the problem cannot be formulated as a single curve summarizing problem, even though it simply consists in considering the maximal $L^2$ distance rather than the sum of the $L^2$ distances: if the curves have different variances on $\{t_k, \ldots, t_l\}$, the optimal value a will not be the mean, on the set of evaluation points, of the mean curve. More generally, the computation of all the $Q((s_i)_{i=1}^N, \{t_k, \ldots, t_l\})$ might scale much worse than O(M(M + N)) and thus might dominate the total computation time. Indeed, if there is no closed form expression for Q, then the value has to be computed from scratch for each pair (k, l), which implies solving O(M²) problems. Then, if no special structure or preprocessing can be used to simplify the problem, computing one value of Q clearly requires looking at all the involved functions, i.e., at O(N) observations. Therefore, if no simplification can be done, computing all values of Q implies at least an O(NM²) calculation cost. Care must therefore be exercised when choosing the quality criterion and the summarizing model, to avoid excessive costs. That said, the dynamic programming approach applies and leads to an optimal segmentation according to the chosen criterion.

3.2. Clustering

Assuming that a set of functions is homogeneous is too strong an assumption in practice, as shown in Figure 6: some of the curves have a small bump around 940 nm, but the bump is very small in the mean curve. As a consequence, the summary misses this feature, while the single curve summary used in Figure 5 was clearly picking up the bump.
It is therefore natural to rely on a clustering strategy to produce homogeneous clusters of curves and then to apply the method described in the previous section to summarize each cluster. However, the direct application of this strategy might lead to suboptimal solutions: a better solution should be obtained by optimizing the clusters and the associated summaries at once. We derive such an algorithm in the present section.

In a similar way to the previous section, we assume given an error measure Q(G, {t_k, ..., t_l}) (where G is a subset of {1, ..., N}): it measures the error made by the local model on {t_k, ..., t_l} as an approximation of the functions s_i for i ∈ G. For instance,

    Q\big(G, \{t_k, \ldots, t_l\}\big) = \min_{a} \sum_{i \in G} \sum_{j=k}^{l} (s_i(t_j) - a)^2,    (19)

for a constant local model evaluated with respect to the sum of L² distances.

Let us assume for now that each cluster is summarized via a segmentation scheme that uses a cluster-independent number of segments, P/K (this value is assumed to be an integer). Then, given a partition of {1, ..., N} into K clusters (G_k)_{k=1}^K and given for each cluster an ordered partition (C_p^k)_{p=1}^{P/K} of {t_1, ..., t_M}, the global error made by the summarizing scheme is

    E\Big((s_i)_{i=1}^{N}, (G_k)_{k=1}^{K}, \big((C_p^k)_{p=1}^{P/K}\big)_{k=1}^{K}\Big) = \sum_{k=1}^{K} \sum_{p=1}^{P/K} Q(G_k, C_p^k).    (20)

Building an optimal summary of the set of curves amounts to minimizing this global error with respect to the partition of the functions and, inside each cluster, with respect to the segmentation. A natural way to tackle the problem is to apply an alternating minimization scheme (as used for instance in the K-means algorithm): we optimize successively the partition given the segmentation and the segmentation given the partition.
Given the function partition (G_k)_{k=1}^K, the error is a sum of independent terms: one can apply dynamic programming to build an optimal summary of each of the obtained clusters, as shown in the previous section. Optimizing with respect to the partition given the segmentation is a bit more complex, as the feasibility of the solution depends on the actual choice of Q. In general however, Q is built by aggregating, via the maximum or sum operators, distances between the functions assigned to a cluster and the corresponding summary. It is therefore safe to assume that assigning each function to the cluster with the closest summary, in terms of the distance used to define Q, gives the optimal partition given the summaries. This leads to Algorithm 3.

Algorithm 3 Simultaneous clustering and segmentation algorithm (uniform case)
 1: initialize the partition (G_k)_{k=1}^K randomly
 2: repeat
 3:   for k = 1 to K do
 4:     apply Algorithm 1 to find the optimal summary g_k of the set of curves {s_i | i ∈ G_k} in P/K segments
 5:   end for
 6:   for i = 1 to N do
 7:     assign s_i to the G_k such that the distance between s_i and g_k is minimal over k
 8:   end for
 9: until (G_k)_{k=1}^K is stable

The computational cost depends strongly on the availability of an efficient calculation scheme for Q. If such a scheme is available, then the cost of the K dynamic programming applications grows in O(K(P/K)M²) = O(PM²). With standard L^p distances between functions, the cost of the assignment phase is in O(NKM). Computing the mean function of each cluster (or similar quantities, such as the median function, that might be needed to compute Q efficiently) has a negligible cost compared to the other costs (e.g., O(NM) for the mean functions).
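The alternating scheme of Algorithm 3 can be sketched in the L² piecewise constant case, where each cluster summary reduces to segmenting the cluster's mean curve. This is our own simplified Python rendering (function names, the random initialisation details and the empty-cluster fallback are assumptions, not the paper's code):

```python
import numpy as np

def segment_mean_curve(mu, P):
    """Piecewise constant fit of curve mu in P segments via the standard
    O(P M^2) dynamic programme; returns the fitted curve (a vector)."""
    M = len(mu)
    c1 = np.concatenate(([0.0], np.cumsum(mu)))
    c2 = np.concatenate(([0.0], np.cumsum(mu ** 2)))
    cost = lambda k, l: (c2[l + 1] - c2[k]) - (c1[l + 1] - c1[k]) ** 2 / (l - k + 1)
    E = np.full((P + 1, M), np.inf)
    back = np.zeros((P + 1, M), dtype=int)
    E[1] = [cost(0, l) for l in range(M)]
    for p in range(2, P + 1):
        for l in range(p - 1, M):
            for k in range(p - 1, l + 1):
                e = E[p - 1][k - 1] + cost(k, l)
                if e < E[p][l]:          # strict test: deterministic ties
                    E[p][l], back[p][l] = e, k
    g, l, p = np.empty(M), M - 1, P
    while p >= 1:                        # backtrack, filling segment means
        k = back[p][l] if p > 1 else 0
        g[k:l + 1] = mu[k:l + 1].mean()
        l, p = k - 1, p - 1
    return g

def cluster_and_segment(S, K, P, n_iter=50, seed=0):
    """Alternating minimisation in the spirit of Algorithm 3 (uniform case):
    partition the rows of S (N curves sampled at M points) into K clusters,
    each summarised by a piecewise constant prototype with P // K segments."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(S))
    for _ in range(n_iter):
        # segmentation step: summarise each cluster's mean curve (L2 case)
        protos = np.stack([
            segment_mean_curve(S[labels == k].mean(axis=0), P // K)
            if np.any(labels == k) else np.zeros(S.shape[1])
            for k in range(K)])
        # assignment step: closest prototype in L2 distance
        new = np.argmin(((S[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, protos
```

As in the text, each step can only decrease the criterion of equation (20), so the scheme converges (to a local optimum) once ties are broken deterministically.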
Convergence of Algorithm 3 is obvious as long as both the assignment phase (i.e., the minimization with respect to the partition given the segmentation) and the dynamic programming algorithm use deterministic tie-breaking rules and lead to unique solutions. While the issue of ties between prototypes is generally not relevant in the K-means algorithm, for instance, it plays an important role in our context. Indeed, the summary of a cluster can be very simple, and therefore several summaries might be at identical distances from a given curve. Then, if ties are broken at random, the algorithm might never stabilize. Algorithm 1 is formulated to break ties deterministically by favoring short segments at the beginning of the segmentation (this is implied by the strict inequality test used at line 10 of the algorithm). Then the optimal segmentation is unique given the partition, and the optimal partition is unique given the segmentation. As a consequence, the alternating minimization scheme converges.

Figure 7 gives an example of the results obtained by the algorithm in the case of a piecewise linear summary with five segments per cluster (the full Tecator dataset contains N = 240 spectra with M = 100 points for each spectrum). In practice, we will frequently use a piecewise constant model whose quality is assessed via the sum of the L² distances, as in equation (19).
Figure 7: Summary of the full Tecator dataset with 6 clusters and 5 segments per cluster; grey curves are the original spectra, dark curves are the summaries for each cluster

Then equation (20) can be instantiated as follows:

    E\Big((s_i)_{i=1}^{N}, (G_k)_{k=1}^{K}, \big((C_p^k)_{p=1}^{P/K}\big)_{k=1}^{K}\Big) = \sum_{k=1}^{K} \sum_{i \in G_k} \sum_{p=1}^{P/K} \sum_{t_j \in C_p^k} (s_i(t_j) - \mu_{k,p})^2,    (21)

with

    \mu_{k,p} = \frac{1}{|G_k|\,|C_p^k|} \sum_{i \in G_k} \sum_{t_j \in C_p^k} s_i(t_j).    (22)

Let us denote by s_i = (s_i(t_1), ..., s_i(t_M)) the R^M vector representation of the sampled functions, and by g_k = (g_k(t_1), ..., g_k(t_M)) the vector representation of the piecewise constant functions defined by g_k(t_i) = \mu_{k,p} when t_i ∈ C_p^k. Then equation (21) can be rewritten as

    E\Big((s_i)_{i=1}^{N}, (G_k)_{k=1}^{K}, \big((C_p^k)_{p=1}^{P/K}\big)_{k=1}^{K}\Big) = \sum_{k=1}^{K} \sum_{i \in G_k} \| s_i - g_k \|^2.    (23)

This formulation emphasizes the link between the proposed approach and the K-means algorithm. Indeed, the criterion is the one optimized by the K-means algorithm for the vector data s_i, with a constraint on the prototypes: rather than allowing arbitrary prototypes as in the standard K-means, we use simple prototypes in the sense of Section 2.1. Any piecewise model whose quality is assessed via the sum of the L² distances leads to the same formulation, but with different constraints on the shape of the functional prototypes.
Equation (23) opens the way to many variations on the same theme: most of the prototype-based clustering methods can be modified to embed constraints on the prototypes, as Algorithm 3 does for the K-means. Interesting candidates for such a generalization include the Self-Organizing Map in its batch version [18], Neural Gas, also in its batch version [7], and deterministic annealing based clustering [27], among others.

3.3. Optimal resource allocation

A drawback of Algorithm 3 is that it does not allocate resources in an optimal way: we assumed that each summary uses P/K segments, regardless of the cluster. Fortunately, dynamic programming can be used again to remove this constraint. Let us consider indeed a resource assignment, that is, K positive integers (P_k)_{k=1}^K such that \sum_{k=1}^{K} P_k = P. The error made by the summarizing scheme when cluster G_k uses P_k segments for its summary is given by

    E\Big((s_i)_{i=1}^{N}, (G_k)_{k=1}^{K}, (P_k)_{k=1}^{K}, \big((C_p^k)_{p=1}^{P_k}\big)_{k=1}^{K}\Big) = \sum_{k=1}^{K} \sum_{p=1}^{P_k} Q(G_k, C_p^k).    (24)

As for the criterion given by equation (20) in the previous section, we rely on an alternating minimization scheme. Given the function partition (G_k)_{k=1}^K, the challenge is to optimize E with respect to both the resource assignment and the segmentation. For a fixed partition (G_k)_{k=1}^K, let us denote

    R_k(P_k) = \min_{(C_p^k)_{p=1}^{P_k} \text{ ordered partition of } \{t_1, \ldots, t_M\}} \; \sum_{p=1}^{P_k} Q(G_k, C_p^k).    (25)

Then minimizing E from equation (24) for a fixed partition (G_k)_{k=1}^K corresponds to minimizing

    U\big((P_k)_{k=1}^{K}\big) = \sum_{k=1}^{K} R_k(P_k)    (26)

with respect to the resource assignment. This formulation emphasizes two major points. Firstly, as already mentioned in Section 2.3, given a number of segments P, Algorithm 1 and its variants provide, at a negligible additional cost, all the optimal segmentations in p = 1 to P segments.
Therefore, the calculation of R_k(p) for p ∈ {1, ..., P} can be done with a variant of Algorithm 1 in O(PM²), provided an efficient calculation of Q(G_k, C_p^k) is possible (in fact, one only needs the values of R_k(p) for p ≤ P − K + 1, as we request at least one segment per cluster, but to ease the exposition we assume that values are computed up to P). Secondly, U as defined in equation (26) is an additive measure and can therefore be optimized by dynamic programming. Indeed, let us define S(l, p) as

    S(l, p) = \min_{(P_k)_{k=1}^{l} \text{ such that } \sum_{k=1}^{l} P_k = p} \; \sum_{k=1}^{l} R_k(P_k).    (27)

The S(1, p) are readily obtained from the optimal segmentation algorithm (S(1, p) = R_1(p)). Then the additive structure of equation (26) shows that

    S(l, p) = \min_{1 \le u \le p - l + 1} S(l - 1, p - u) + R_l(u).    (28)

Algorithm 4 Simultaneous clustering and segmentation algorithm with optimal resource assignment
 1: initialize the partition (G_k)_{k=1}^K randomly
 2: repeat
 3:   for k = 1 to K do
 4:     apply Algorithm 1 to compute R_k(p) for all p ∈ {1, ..., P − K + 1}
 5:   end for
 6:   for p = 1 to P − K + 1 do {initialisation of the resource allocation phase}
 7:     S(1, p) ← R_1(p)
 8:   end for
 9:   for l = 2 to K do {dynamic programming for resource allocation}
10:    for p = l to P do
11:      W(l, p) ← 1
12:      S(l, p) ← S(l − 1, p − 1) + R_l(1)
13:      for u = 2 to p − l + 1 do
14:        if S(l − 1, p − u) + R_l(u) < S(l, p) then
15:          W(l, p) ← u
16:          S(l, p) ← S(l − 1, p − u) + R_l(u)
17:        end if
18:      end for
19:    end for
20:  end for
21:  P_opt ← (NA, ..., NA) {backtracking phase}
22:  P_opt(K) ← W(K, P)
23:  available ← P − W(K, P)
24:  for k = K − 1 to 1 do
25:    P_opt(k) ← W(k, available)
26:    available ← available − W(k, available)
27:  end for
28:  for k = 1 to K do {summary calculation}
29:    build g_k with P_opt(k) segments
30:  end for
31:  for i = 1 to N do
32:    assign s_i to the G_k such that the distance between s_i and g_k is minimal over k
33:  end for
34: until (G_k)_{k=1}^K is stable

Given all the R_l(p), the calculation of S(K, P) therefore has a cost of O(KP²). This calculation is plugged into Algorithm 3 to obtain Algorithm 4. Given the values of Q, the total computational cost of the optimal resource assignment and of the optimal segmentation is therefore in O(KP(M² + P)). As we target simple descriptions, it seems reasonable to assume that P ≪ M², so that the cost is dominated by O(KPM²). Then, using the optimal resource assignment multiplies the computational cost by K compared to a uniform use of P/K segments per cluster (whose computational cost is in O(PM²)). The main source of additional computation is the fact that one has to study summaries using up to P segments for each cluster, rather than only P/K in the uniform case. This cost can be reduced by introducing an arbitrary (user-chosen) maximal number of segments in each cluster, e.g., λP/K. Then the running time of the optimal assignment is λ times the running time of the uniform solution. As we aim at summarizing the data, using a small λ such as 2 seems reasonable. The computation of the values of Q remains the same regardless of the way the number of segments is chosen for each cluster. The optimisation of the error with respect to the partition (G_k)_{k=1}^K also follows the same algorithm in both cases.
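The resource-allocation recursion (27)–(28) and its backtracking can be sketched compactly. The following is our own Python rendering of that inner step of Algorithm 4 (names are assumptions; the tables R[k] are assumed to be precomputed by the segmentation dynamic programme, R[k][p-1] holding the optimal p-segment error of cluster k):

```python
def allocate_segments(R, P):
    """Dynamic programme for the resource assignment of equation (26):
    given R[k][p-1] = optimal p-segment error for cluster k (p = 1..P),
    find the allocation (P_1, ..., P_K) summing to P that minimises the
    total error.  O(K P^2) time, as in the text."""
    K = len(R)
    INF = float("inf")
    # S[l][p]: best error for clusters 0..l using p segments in total
    S = [[INF] * (P + 1) for _ in range(K)]
    W = [[0] * (P + 1) for _ in range(K)]      # segments given to cluster l
    for p in range(1, P - K + 2):              # p <= P - K + 1 suffices
        S[0][p] = R[0][p - 1]
        W[0][p] = p
    for l in range(1, K):
        for p in range(l + 1, P + 1):
            for u in range(1, p - l + 1):      # cluster l receives u segments
                e = S[l - 1][p - u] + R[l][u - 1]
                if e < S[l][p]:                # strict test: deterministic ties
                    S[l][p], W[l][p] = e, u
    # backtracking phase
    alloc, p = [0] * K, P
    for l in range(K - 1, -1, -1):
        alloc[l] = W[l][p]
        p -= alloc[l]
    return S[K - 1][P], alloc
```

Unreachable table entries simply stay at infinity and never win the minimisation, which mirrors the "at least one segment per cluster" constraint.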
If we use, for instance, piecewise constant approximations evaluated with the sum of the L² distances then, as already mentioned, the summaries are computed from the cluster means: computing the means costs O(NM), the calculation of Q for all functional clusters is then in O(KM²), and the optimization with respect to (G_k)_{k=1}^K is in O(NKM). In general, the dominating cost will therefore scale in O(KPM²). However, as pointed out in Section 3.1, complex choices for the error criterion Q may induce larger computational costs.

3.4. Two-phase approaches

The main motivation for the introduction of Algorithm 3 was to avoid relying on a simpler but possibly suboptimal two-phase solution. This alternative solution simply consists in applying a standard K-means to the vector representation of the functions (s_i)_{i=1}^N to get homogeneous clusters, and then summarizing each cluster via a simple prototype. While this approach should intuitively give higher errors for the criterion of equation (20), the actual situation is more complex. Algorithms 3 and 4 are both optimal in their segmentation phase, and will generally also be optimal in the partition determination phase if Q is a standard criterion. However, the alternating optimization scheme itself is not optimal (the K-means algorithm is non-optimal for the same reason). We therefore have no guarantee on the quality of the local optimum reached by those algorithms, exactly as in the case of the K-means. In addition, even if the two-phase approach is initialized with the same random partition as, e.g., Algorithm 3, the final partitions can be distinct because of the distinct prototypes used during the iterations. It is therefore difficult to predict which approach is more likely to give the best solution.
In addition, a third possibility should be considered: the K-means algorithm is applied to the vector representation of the functions (s_i)_{i=1}^N to get homogeneous clusters, and is followed by an application of Algorithm 3 or 4 using the obtained clusters as its initial configuration.

To assess the differences between these three possibilities, we conducted a simple experiment: using the full Tecator dataset, we compared the value of E from equation (20) for the final summary obtained by the three solutions using the same random starting partitions, with K = 6, P = 30 and piecewise constant summaries. We used 50 random initial configurations and compared the methods in the case of uniform resource allocation (5 segments per cluster) and in the case of optimal resource allocation.

In the case of uniform resource assignment, the picture is rather simple. For all 50 random initialisations, both two-phase algorithms gave exactly the same results. In 44 cases, Algorithm 3 also gave exactly the same results as the two-phase algorithms; in 5 cases its results were of lower quality, and in the last case its results were slightly better. The three algorithms manage to reach the same overall optimum (for which E = 472).

The case of optimal resource assignment is a bit more complex. For 15 out of the 50 initial configurations, using Algorithm 4 after the K-means improves the results over a simple optimal summary (with optimal resource assignment) applied to the results of the K-means (results were identical in the 35 other cases). In 34 cases, K-means followed by Algorithm 4 gave the same results as Algorithm 4 alone. In the 16 remaining cases, the results are in favor of K-means followed by Algorithm 4, which wins 13 times. Nevertheless, the three algorithms manage to reach the same overall optimum (for which E = 467).
While there is no clear winner in this (limited scope) experimental analysis, considering the computational cost of the three algorithms helps in choosing the most appropriate one in practice. In favorable cases, the cost of an iteration of Algorithm 4 is O(KMN + KPM²), while the cost of an iteration of the K-means on the same dataset is O(KMN). In functional data, M is frequently of the same order as N. For summarizing purposes, P should remain small, around 5K for instance. This means that one can perform several iterations of the standard K-means for the same computational price as one iteration of Algorithm 4. For instance, the Tecator dataset values, P = 30, M = 100 and N = 240, give 1 + PM/N = 13.5. In the experiments described above, the median number of iterations for the K-means was 15, while it was respectively 15 and 16 for Algorithms 3 and 4. We can therefore safely assume that Algorithm 4 is at least 10 times slower than the K-means on the Tecator dataset. However, when initialized with the K-means solution, Algorithms 3 and 4 converge very quickly, generally in two iterations (up to five for the optimal assignment algorithm on the Tecator dataset). In more complex situations, such as the ones studied in Section 4, the number of iterations is larger but remains small compared to what would be needed when starting from a random configuration.

Based on the previous analysis, we consider that using the K-means algorithm followed by Algorithm 4 is the best solution. If enough computing resources are available, one can run the K-means algorithm from several starting configurations, keep the best final partition and use it as an initialization point for Algorithm 4.
As the K-means provides a good starting point for this latter algorithm, the number of iterations remains small and the overall complexity is acceptable, especially compared to using Algorithm 4 alone from different random initial configurations. Alternatively, the analyst can rely on more robust clustering schemes such as Neural Gas [7] and deterministic annealing based clustering [27], or on the Self-Organizing Map [18], to provide the starting configuration of Algorithm 4.

4. Experimental results

We give in this section two examples of the type of exploratory analyses that can be performed on real-world datasets with the proposed method. In order to provide the most readable display for the user, we build the initial configuration of Algorithm 4 with a batch Self-Organizing Map. We optimize the initial radius of the neighborhood influence in the SOM with the help of the Kaski and Lagus topology preservation criterion [17] (the radius is chosen among 20 radii).

4.1. Topex/Poseidon satellite

We study first the Topex/Poseidon satellite dataset². The Topex/Poseidon radar satellite has been used over the Amazonian basin to produce N = 472 waveforms sampled at M = 70 points. The curves exhibit a quite large variability, induced by differences in ground type below the satellite during data acquisition. Details on the dataset can be found in, e.g., [11, 12]. Figure 8 displays 20 curves chosen randomly from the dataset.

Figure 8: 20 curves from the Topex/Poseidon radar satellite dataset

We first conduct a hierarchical clustering (based on the Euclidean distance between the functions and the Ward criterion) to get a rough idea of the potential number of clusters in the dataset. Both the dendrogram (Figure 9) and the evolution of the total within-class variance (Figure 10) tend to indicate a rather small number of clusters, around 12.
As the visual representation obtained via the SOM is more readable than a randomly arranged set of prototypes, we can use a slightly oversized SOM without impairing the analyst's work. We therefore decided to use a 4 × 5 grid: the rectangular shape has been preferred over a square one according to results from [34] that show some superiority in topology preservation of anisotropic maps over isotropic ones.

² Data are available at http://www.lsp.ups-tlse.fr/staph/npfda/npfda-datasets.html

Figure 9: Dendrogram of the hierarchical clustering for the Topex/Poseidon radar satellite dataset

Figure 10: Total within-class variance decrease for the first 20 merges of the dendrogram depicted in Figure 9

Figure 11: Prototypes obtained by the SOM algorithm on the Topex/Poseidon radar satellite dataset (grey curves are the original curves as assigned to the clusters)

The results of the SOM are given in Figure 11, which displays the prototype of each cluster arranged on the SOM grid. Each cell also contains all the curves assigned to the associated cluster. As expected, the SOM prototypes are well organized on the grid: the horizontal axis seems to encode approximately the position of the maximum of the prototype, while the vertical axis corresponds to the width of the peak. However, the prototypes are very noisy, and inferring the general behavior of the curves of a cluster remains difficult.

Figure 12: Summarized prototypes obtained by Algorithm 4 on the Topex/Poseidon radar satellite dataset (grey curves are the original curves as assigned to the clusters)

Figure 12 represents the results obtained after applying Algorithm 4 to the results of the SOM. In order to obtain a readable summary, we set P to 80, i.e., an average of 4 segments for each cluster.
We use the linear interpolation approach described in Section 2.5. Contrary to the results obtained on the Tecator dataset and reported in Section 3.4, Algorithm 4 used a significant number of iterations (17). Once the new clusters and the simplified prototypes were obtained, they were displayed in the same way as for the SOM results. The resulting map is almost as well organized as the original one (with the exception of a large peak on the bottom line), but the simplified prototypes are much more informative than the original ones: they give an immediate idea of the general behavior of the curves in the corresponding cluster. For example, the differences between the two central prototypes of the second line (starting from the top) are more obvious with summaries: in the cluster on the left, one can identify a slow increase followed by a two-phase slow decrease, while the other cluster corresponds to a sharp increase followed by a slow decrease. Those differences are not as obvious on the original map.

4.2. Electric power consumption

The second dataset (available at http://bilab.enst.fr/wakka.php?wiki=HomeLoadCurve) consists of the electric power consumption recorded in a personal home during almost one year (349 days). Each curve consists of 144 measures, which give the power consumption of one day at a 10-minute sampling rate. Figure 13 displays 20 randomly chosen curves.

Figure 13: 20 electric power consumption curves

We analyse the curves in a very similar way as for the Topex/Poseidon dataset. The results of the hierarchical clustering, displayed in Figures 14 and 15, lead to the same conclusion as in the previous case: the number of clusters seems to be small, around 13. As before, we fit a slightly larger SOM (4 × 5 rectangular grid) to the data and obtain the representation provided by Figure 16.
The map is well organized: the vertical axis seems to encode the global power consumption, while the horizontal axis represents the overall shape of the load curve. However, the shapes of the prototypes are again not easy to interpret, mainly because of their noisy nature.

Figure 14: Dendrogram of the hierarchical clustering for the load curves

Figure 15: Total within-class variance decrease for the first 20 merges of the dendrogram depicted in Figure 14

Figure 16: Prototypes obtained by the SOM algorithm for the load curves (grey curves are the original curves as assigned to the clusters)

Figure 17 shows the results obtained after applying Algorithm 4 to the results of the SOM (Algorithm 4 used 9 iterations to reach a stable state). In order to obtain a readable summary, we set P to 80, i.e., an average of 4 segments for each cluster. We depart from the previous analysis by using here a piecewise constant summary: this is better adapted to load curves, as they should be fairly constant over significant time periods, because many electrical appliances have a stable power consumption once switched on. As for the Topex/Poseidon dataset, the summarized prototypes are much easier to analyze than their noisy counterparts. Even the global organization of the map is more obvious: the top right of the map, for instance, gathers days in which the power consumption in the morning (starting at 7 am) is significant, is followed by a period of low consumption, and then again by a consumption peak starting approximately at 7 pm. Those days are week days in which the home remains empty during work hours. Other typical clusters include holidays with almost no consumption (second cluster on the first line), more constant days which correspond to weekends, etc.
In addition, the optimal resource assignment tends to emphasize the differences between the clusters by allocating fewer resources to simple ones (as expected). This is quite obvious in the case of the piecewise constant summaries used in Figure 17 (and, to a lesser extent, in Figure 12).

Figure 17: Summarized prototypes (P = 80) obtained by Algorithm 4 for the load curves (grey curves are the original curves as assigned to the clusters)

In order to emphasize the importance of a drastic reduction in prototype complexity, Algorithm 4 was applied to the same dataset, using the same initial configuration, with P = 160, i.e., an average of 8 segments per cluster. The results are represented in Figure 18.

Figure 18: Summarized prototypes (P = 160) obtained by Algorithm 4 for the load curves (grey curves are the original curves as assigned to the clusters)

As expected, the prototypes are more complex than in Figure 17. While they are not as rough as the original ones (see Figure 16), their analysis is more difficult than that of the prototypes obtained with stronger summarizing constraints. In particular, some consumption peaks are no longer roughly outlined as in Figure 17, but more accurately approximated (see for instance the last cluster of the second row). This leads to series of short segments from which expert inference is not easy. It should be noted, in addition, that because of the noisy nature of the data, increasing the number of segments improves the approximation quality only marginally. To show this, we first introduce a relative approximation error measure given by

    \frac{\sum_{k=1}^{K} \sum_{i \in G_k} \sum_{p=1}^{P_k} \sum_{t_j \in C_p^k} (s_i(t_j) - \mu_{k,p})^2}{\sum_{i=1}^{N} \sum_{j=1}^{M} (s_i(t_j) - \mu_i)^2},    (29)

using the notations of Section 3.2, and where \mu_i denotes the mean of s_i. This relative error compares the total squared approximation error to the total squared internal variability inside the dataset.
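Equation (29) is straightforward to compute once the partition and the prototypes are available. A minimal sketch (our own naming conventions: `S` holds one sampled curve per row, `labels` the cluster index of each curve, `protos` the summary vectors g_k):

```python
import numpy as np

def relative_error(S, labels, protos):
    """Relative approximation error of equation (29): total squared error
    of the summaries divided by the total squared internal variability
    (each curve centred on its own mean)."""
    num = sum(((S[labels == k] - g) ** 2).sum() for k, g in enumerate(protos))
    den = ((S - S.mean(axis=1, keepdims=True)) ** 2).sum()
    return num / den
```

A value near 0 means the constrained prototypes capture almost all of the within-curve variability; a value near 1 means they explain essentially none of it beyond each curve's own mean level.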
T able 1 gives the dependencies b et w een the num b er of segmen ts and the relative appro ximation error. It is quite clear that the relative loss in approximation precision is acceptable even with strong summarizing constraints. This type of numerical quality assessment should be Summary No summary P = 160 P = 80 Relativ e approximation error 0.676 0.696 0.740 T able 1: Approximation errors for the load curves pro vided to the analyst either as a w ay to con trol the relev ance of the summa- rized protot yp es, or as a selection guide for c ho osing the v alue of P . 5. Conclusion W e ha ve proposed in this pap er a new exploratory analysis metho d for func- tional data. This metho d relies on simple approximations of functions, that is on mo dels that are piecewise constant or piecewise linear. Giv en a set of homogeneous functions, a dynamic programming approach computes efficiently and optimally a simple model via a segmentation of the interv al on whic h the functions are ev aluated. Given several groups of homogeneous functions and a total n umber of segmen ts, another dynamic programming algorithm allo cates optimally to each cluster the n umber of segmen ts used to build the summary . Finally , those t wo algorithms can b e embedded in any classical protot yp e based algorithm (suc h as the K means) to refine the groups of functions given the optimal summaries and vic e versa in an alternating optimization scheme. The general framework is extremely flexible and accommo dates different t yp es of qualit y measures ( L 2 , L 1 ), aggregation strategies (maximal error, mean error), approximation model (piecewise constant, linear interpolation) and clus- tering strategies (K means, Neural gas). Exp eriments show that it provides meaningful summaries that help greatly the analyst in getting quic kly a goo d idea of the general b eha vior of a set of curv es. 
While we have listed many variants of the method to illustrate its flexibility, numerous others were not mentioned as they are more distant from the main topic of this paper. It should be noted, for instance, that rather than building a unique summary for a set of curves, one could look for a unique segmentation supporting curve-by-curve summaries. In other words, each curve would be summarized by, e.g., a piecewise constant model, but all models would share the same segmentation. This strategy is very interesting for analyzing curves in which internal changes are more important than the actual values taken by the curves. It is closely related to the so-called best basis problem, in which a functional basis is built to represent a set of curves efficiently (see e.g., [5, 28, 31]). It is also related to variable clustering methods (see e.g., [19]). This final link opens an interesting perspective for the proposed method. It has been shown recently [10, 20] that variable clustering methods can be adapted to supervised learning: variables are grouped in a way that preserves as much as possible the explanatory power of the reduced set of variables with respect to a target variable. It would be extremely useful to provide the analyst with such a tool for exploring a set of curves: she would be able to select an explanatory variable and to obtain summaries of the curves that mask details that are not relevant for predicting the chosen variable.

Acknowledgment

The authors thank the anonymous referees for their valuable comments that helped improve this paper.

References

[1] C. Abraham, P.-A. Cornillon, E. Matzner-Løber, and N. Molinari. Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3):581–595, September 2003.

[2] I. E. Auger and C. E. Lawrence. Algorithms for the optimal identification of segment neighborhoods.
Bulletin of Mathematical Biology, 51(1):39–54, 1989.

[3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6):284, 1961.

[4] F. Chamroukhi, A. Samé, G. Govaert, and P. Aknin. A regression model with a hidden logistic process for signal parametrization. In Proceedings of the XVIth European Symposium on Artificial Neural Networks (ESANN 2009), pages 503–508, Bruges, Belgium, April 2009.

[5] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713–718, March 1992.

[6] M. Cottrell, B. Girard, and P. Rousset. Forecasting of curves using a Kohonen classification. Journal of Forecasting, 17:429–439, 1998.

[7] M. Cottrell, B. Hammer, A. Hasenfuß, and T. Villmann. Batch and median neural gas. Neural Networks, 19(6–7):762–771, July–August 2006.

[8] A. Debrégeas and G. Hébrail. Interactive interpretation of Kohonen maps applied to curves. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'98), pages 179–183, New York, August 1998.

[9] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 21 of Applications of Mathematics. Springer, 1996.

[10] D. François, C. Krier, F. Rossi, and M. Verleysen. Estimation de redondance pour le clustering de variables spectrales. In Actes des 10èmes journées Européennes Agro-industrie et Méthodes statistiques (Agrostat 2008), pages 55–61, Louvain-la-Neuve, Belgium, January 2008.

[11] F. Frappart. Catalogue des formes d'onde de l'altimètre Topex/Poséidon sur le bassin amazonien. Rapport technique, CNES, Toulouse, France, 2003.

[12] F. Frappart, S. Calmant, M. Cauhopé, F. Seyler, and A. Cazenave. Preliminary results of ENVISAT RA-2-derived water levels validation over the Amazon basin.
R emote Sensing of Envir onment , 100(2):252 – 264, 2006. ISSN 0034-4257. [13] C. G. Healey , K. S. Bo oth, and J. T. Enns. Visualizing real-time m ulti- v ariate data using preatten tive pro cessing. ACM T r ansactions on Mo deling and Computer Simulation , 5(3):190–221, July 1995. [14] P . J. Hub er. Robust estimation of a location parameter. A nnals of Math- ematic al Statistics , 35(1):73–101, 1964. [15] B. Hugueney . Adaptive segmentation-based symbolic representations of time series for b etter mo deling and low er bounding distance measures. In Pr o c e e dings of 10th Eur op e an Confer enc e on Principles and Pr actic e of Know le dge Disc overy in Datab ases, PKDD 2006 , volume 4213 of L e ctur e Notes in Computer Scienc e , pages 545–552, September 2006. [16] B. Jackson, J. Scargle, D. Barnes, S. Arabhi, A. Alt, P . Gioumousis, E. Gwin, P . Sangtrakulcharoen, L. T an, and T. T. Tsai. An algorithm for optimal partitioning of data on an in terv al. IEEE Signal Pr o c essing L etters , 12(2):105–108, F eb. 2005. [17] S. Kaski and K. Lagus. Comparing self-organizing maps. In C. von der Malsburg, W. von Seelen, J. V orbr ¨ uggen, and B. Sendhoff, editors, Pr o c e e dings of International Confer enc e on A rtificial Neur al Networks (ICANN’96) , v olume 1112 of L e ctur e Notes in Computer Scienc e , pages 809–814. Springer, Berlin, German y ,, July 16-19 1996. [18] T. Kohonen. Self-Or ganizing Maps , volume 30 of Springer Series in Infor- mation Scienc es . Springer, third edition, 1995. Last edition published in 2001. [19] C. Krier, F. Rossi, D. F ran¸ cois, and M. V erleysen. A data-driven functional pro jection approach for the selection of feature ranges in sp ectra with ica or cluster analysis. Chemometrics and Intel ligent L ab or atory Systems , 91 (1):43–53, Marc h 2008. [20] C. Krier, M. V erleysen, F. Rossi, and D. F ran¸ cois. Sup ervised v ariable clustering for classification of nir spectra. 
In Pr o c e e dings of XVIth Eur op e an Symp osium on Artificial Neur al Networks (ESANN 2009) , pages 263–268, Bruges, Belgique, April 2009. [21] Y. Lec hev allier. Classification automatique optimale sous con trainte d’ordre total. Rapp ort de recherc he 200, IRIA, 1976. 31 [22] Y. Lec hev allier. Recherc he d’une partition optimale sous con trainte d’ordre total. Rapp ort de recherc he RR-1247, INRIA, June 1990. http://www. inria.fr/rrrt/rr- 1247.html . [23] J. A. Lee and M. V erleysen. Generalization of the lp norm for time series and its application to self-organizing maps. In Pr o c e e dings of the 5th Work- shop on Self-Or ganizing Maps (WSOM 05) , pages 733–740, Paris (F rance), Septem b er 2005. [24] J. Lin, E. J. Keogh, S. Lonardi, and B. Y. c hi Chiu. A symbolic represen- tation of time series, with implications for streaming algorithms. In M. J. Zaki and C. C. Aggarw al, editors, DMKD , pages 2–11. ACM, 2003. [25] F. Picard, S. Robin, E. Lebarbier, and J.-J. Daudin. A segmen ta- tion/clustering mo del for the analysis of array cgh data. Biometrics , 63 (3):758–766, 2007. [26] J. Ramsay and B. Silverman. F unctional Data Analysis . Springer Series in Statistics. Springer V erlag, June 1997. [27] K. Rose. Deterministic annealing for clustering, compression,classification, regression, and related optimization problems. Pr o c e e dings of the IEEE , 86 (11):2210–2239, No vem ber 1998. [28] F. Rossi and Y. Lec hev allier. Constrained v ariable clustering for func- tional data representation. In Pr o c e e dings of the first joint me eting of the So ci´ et´ e F r anc ophone de Classific ation and the Classific ation and Data Anal- ysis Gr oup of the Italian Statistic al So ciety (SFC-CLAD AG 2008) , Caserta, Italy , June 2008. [29] F. Rossi, B. Conan-Guez, and A. El Golli. Clustering functional data with the SOM algorithm. 
In Pr o c e e dings of XIIth Eur op e an Symp osium on Ar- tificial Neur al Networks (ESANN 2004) , pages 305–312, Bruges (Belgium), April 2004. [30] F. Rossi, D. F ran¸ cois, V. W ertz, and M. V erleysen. F ast selection of sp ectral v ariables with b-spline compression. Chemometrics and Intel ligent L ab or a- tory Systems , 86(2):208–218, April 2007. [31] N. Saito and R. R. Coifman. Local discriminant bases and their applica- tions. Journal of Mathematic al Imaging and Vision , 5(4):337–358, 1995. [32] H. Stone. Approximation of curves by line segmen ts. Mathematics of Computation , 15(73):40–47, 1961. [33] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-p osed problems. SIAM R eview , 21(2):266–267, April 1979. [34] A. Ultsch and L. Herrmann. The arc hitecture of emergen t self-organizing maps to reduce pro jection errors. In Pr o c e e dings of the 13th Eur op e an Sym- p osium on Artificial Neur al Networks (ESANN 2005) , pages 1–6, Bruges (Belgium), April 2005. 32
