Mining Complex Hydrobiological Data with Galois Lattices

We have used Galois lattices for mining hydrobiological data. These data concern macrophytes, which are macroscopic plants living in water bodies. These plants are characterized by several biological traits, each of which has several modalities. Our aim is t…

Authors: Aurélie Bertaux (CEVH, LSIIT), Agnès Braud (LSIIT)

A. Bertaux (1,2), A. Braud (2), F. Le Ber (1,3)

(1) CEVH UMR MA 101 - ENGEES, 1 quai Koch, BP 61039, F 67070 Strasbourg cedex, {aurelie.bertaux, florence.leber}@engees.u-strasbg.fr
(2) LSIIT UMR 7005, Bd Sébastien Brant, BP 10413, F 67412 Illkirch cedex, agnes.braud@urs.u-strasbg.fr
(3) LORIA UMR 7503, BP 35, F 54506 Vandœuvre-lès-Nancy cedex

Article published in the International Workshop on Advances in Conceptual Knowledge Engineering (ACKE'07)

Abstract

We have used Galois lattices for mining hydrobiological data. These data concern macrophytes, which are macroscopic plants living in water bodies. These plants are characterized by several biological traits, each of which has several modalities. Our aim is to cluster the plants according to their common traits and modalities and to find out the relations between traits. Galois lattices are efficient methods for such an aim, but they apply to binary data. In this article, we detail a few approaches we used to transform complex hydrobiological data into binary data and compare the first results obtained with Galois lattices.

1. Introduction

Water quality is a major problem in Europe, underlined by the recent European Water Framework Directive. Evaluating the physico-chemical quality of a water body has proved insufficient, and new tools are required for evaluating the quality of the whole ecosystem [1]. Furthermore, a comparison of existing tools and approaches is necessary to get a coherent monitoring of water bodies in Europe. There exist several biological indices based on the faunistic and floristic species living in fresh water (e.g. five indices are used in France for qualifying running waters).
These indices are useful, but it is difficult to compare their results from different areas, since the kind of species living in a river also depends on regional characteristics. A promising approach to avoid this drawback is to determine functional traits, shared by different species of different areas, that can be used to characterize water quality [8] or other ecosystems [7]. Currently, these functional traits have still to be defined for most of the categories of aquatic living species. In this project, we focus on biological traits of European macrophytes, collected from the literature, and we try to explore these data with Galois lattices [2,4]. Our aim is to find out sets of biological traits and species which can be interpreted as functional groups by the hydrobiologists.

The paper is organized as follows. The first part is the current introduction, the second part introduces the data, and the third part presents the methods we used to convert the data into a suitable format and the results we obtained with Galois lattices. The fourth part is a discussion of related work, while the fifth part gives some conclusions and perspectives of our work.

2. Biological traits of macrophytes

The data we deal with are about macrophytes (macroscopic plants living in water bodies, e.g. water lily). Each plant is described by a set of traits (or attributes) like potential size, reproduction period or anchorage mode. For each attribute there are several qualitative modalities. For example, the 'potential size' attribute has four modalities: "under 0.08 meter", "between 0.08 and 0.3 meter", "between 0.3 and 1 meter", "between 1 and 5 meters". The 'reproduction period' attribute has eight modalities (months from March to October). Each modality is associated with a value between 0 and 3 that indicates the affinity of the plants toward the modality.
0 means there is no plant having this modality, 1 means that a few plants have it, 2 a bit more, and 3 many. For example, the 'potential size' of Berula erecta (BERE) is given by the 4-tuple (1, 2, 3, 0) while it is (0, 1, 2, 2) for Callitriche obtusangula (CALO), which means, in particular, that you will never find a Berula erecta plant greater than 1 meter and no Callitriche obtusangula plant smaller than 0.08 meter (see Table 1). The triple (trait, modality, affinity) describes the biological characteristics of macrophytes in a qualitative and rather complex way. For example, the data we deal with represent about 50 plants, described by 15 traits and 60 modalities. So, tools are needed to explore these data, and especially to cluster the plants according to their common traits and modalities and to find out the relations between various traits and modalities.

Table 1. Traits data (potential size)

3. Using Galois lattices on biological traits

Galois lattices are able to perform clustering on binary data and to extract implication sets of attributes [2,4,5]. Furthermore, work was done to adapt Galois lattices to more complex data [10]. In a preliminary step of our work, we decided to study the possibilities for using classical algorithms to build Galois lattices before using more complex and costly techniques. As those classical algorithms apply to binary data, we had to transform the original traits data. The transformations we used and the results we obtained (implication sets) are detailed hereafter. We worked with a subset of 15 plants described by 15 traits and 60 modalities.

Before that, let us recall some definitions. Let E and F be two finite sets and R a binary relation on E x F. E is a set of objects, F a set of properties, and xRy means that the object x owns the property y.
Let f be a mapping from 2^E to 2^F such that, if X is an arbitrary element of 2^E, f(X) = {y in F | for all x in X: xRy}. The mapping g is defined dually from 2^F to 2^E such that, if Y is an arbitrary element of 2^F, g(Y) = {x in E | for all y in Y: xRy}. The couple {f, g} is said to be a Galois connection between the sets E and F. From this connection, we get a set of concepts (X, Y), such that g∘f(X) = X and Y = f(X), that are organized within a lattice. Y is a set of attributes, called intension, and X is a set of objects, called extension. Furthermore, the lattice order makes it possible to detect implication sets of properties and association rules [9].

3.1. Complete disjunctive table

Considering the original three-level format of the dataset, we transform it into a complete disjunctive table (or binary table) (Table 2). We denote the new attributes following an 'Lxx' model. The letter 'L' denotes a trait ('S' for potential Size, 'R' for potential of Regeneration ...). The first 'x' is a number which indicates a modality and the second 'x' gives an affinity. For example, S21 means "few plants (1) having a potential size (S) between 0.08 and 0.3 m (2nd modality)". For clarity, we call those new attributes "properties" in the following.

The Galois lattice based on the disjunctive table is shown in Figure 1 (we show a sublattice including three traits: potential size, perennation and potential of regeneration). The whole lattice contains 1401 concepts, i.e. sets of macrophytes sharing the same modalities of the same traits with the same affinity. We have used the ConExp tool (for Concept Explorer [11]) both to build and to analyze the lattice. ConExp makes it possible to edit a context, to draw the associated lattice, to calculate the Duquenne-Guigues basis for implications between attributes, and to give the association rules that are true in this context.
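
As an illustration of the definitions above and of the 'Lxx' encoding, the following Python sketch builds a tiny binary context from the two potential-size profiles quoted in Section 2 (BERE and CALO) and enumerates its concepts by brute force. The function names (`disjunctive_properties`, `f`, `g`, `concepts`) are chosen here for illustration and are not part of ConExp; the exhaustive enumeration is only suitable for toy contexts and is not the algorithm ConExp uses.

```python
from itertools import combinations

# Affinity profiles for the 'potential size' trait (S), as quoted in the text.
AFFINITIES = {
    "BERE": {"S": (1, 2, 3, 0)},  # Berula erecta
    "CALO": {"S": (0, 1, 2, 2)},  # Callitriche obtusangula
}


def disjunctive_properties(traits):
    """Encode each (trait, modality, affinity) triple as one 'Lxx' property."""
    return {
        f"{letter}{modality}{affinity}"
        for letter, profile in traits.items()
        for modality, affinity in enumerate(profile, start=1)
    }


# Binary context: object -> set of properties it owns (the relation R).
CONTEXT = {plant: disjunctive_properties(t) for plant, t in AFFINITIES.items()}
ALL_PROPERTIES = set().union(*CONTEXT.values())


def f(objects):
    """Properties shared by every object in `objects`."""
    shared = set(ALL_PROPERTIES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def g(properties):
    """Objects owning every property in `properties`."""
    return {obj for obj, owned in CONTEXT.items() if properties <= owned}


def concepts():
    """Enumerate all (extent, intent) pairs with g(f(X)) == X (brute force)."""
    found = set()
    plants = sorted(CONTEXT)
    for size in range(len(plants) + 1):
        for subset in combinations(plants, size):
            intent = f(set(subset))
            extent = g(intent)  # closure of the subset
            found.add((frozenset(extent), frozenset(intent)))
    return found


if __name__ == "__main__":
    print(sorted(CONTEXT["BERE"]))
    for extent, intent in sorted(concepts(), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
```

With only these two plants and one trait, the two profiles share no 'Lxx' property, so the only concept containing both plants has an empty intent.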
Table 2. The complete disjunctive table of traits data (potential size)

Figure 1. The Galois lattice built from three traits of the complete disjunctive table

The information provided by the lattice structure is interesting for hydrobiologists since they want to define equivalences between species with regard to their biological traits. For example, the Galois lattice in Figure 1 points out that the three plants ELON (Elodea nuttallii), ELOE (Elodea ernstae) and ELOC (Elodea canadensis) are grouped in the same concept, at the bottom of the lattice, with the following characteristics: P13, P21, P30, P40, S10, S22, S33, S41, R10, R20, R33. Actually, this can be directly read in the original table. The concepts in the middle of the lattice are more interesting. For example, the concept highlighted in Figure 1 is the following: ((R10, R23, R30, P13, P30, P40) (PTNO, PTCO, CALO, MENA, NASO, BERE)). This concept means that the following 6 plants (among 15), Potamogeton nodosus, Potamogeton coloratus, Callitriche obtusangula, Mentha aquatica, Nasturtium officinale, and Berula erecta, share the following traits: potential of regeneration (low = 0, intermediate = 3, high = 0) and perennation (perennial underground organs = 3, biennial = 0, annual = 0). Furthermore, we can extract the implication sets and analyze them, e.g. P13 => P30 (true for 13 individuals); R23 => P30 R10 R30 (true for 8 individuals), for a better interpretation of the concepts. Finally, we can say that the characteristics of these 6 plants are an intermediate potential of regeneration and perennial underground organs. This relationship between the two traits has still to be interpreted by hydrobiologists.

Considering the whole lattice (149 Lxx properties, 1401 concepts), we can extract 430 implication sets.
28 have a support equal to 14, 140 have a support between 5 and 9, and 262 between 1 and 4. Thus we obtain a few representative implications between traits. Let us illustrate this with one of the implication sets whose support is 14: F10 A20 M10 => D30. This rule means: F1≠0 or A2≠0 or M1≠0 or D30. For hydrobiologists it means that 14 plants do not have a weak potential of dispersion (D30), or have a flexibility <10° (F1≠0), or a contact to the ground (A2≠0), or a reproduction period in March (M1≠0). Actually, looking at the original table, you see that none of the 15 species has a reproduction period in March nor a contact to the ground. So, the final interpretation will be that all species (except one, Nuphar lutea) have a flexibility (>10°) and an intermediate or high potential of dispersion. The implication set highlights the mechanical link between flexibility and dispersion.

Nevertheless, the conversion of the original data into a disjunctive table has three main problems. First, 1401 concepts give a lattice too large to be readable. Second, the number of extracted implications is high. Third, it breaks a piece of information which is meaningful for hydrobiologists, namely the distribution of the affinities of a macrophyte among the different modalities of a trait. We tried another approach to overcome this problem and present it in the following section.

3.2. Pattern approach

Before describing the new approach proposed, let us examine an illustrative example of the information we would like to represent. For instance, consider the plant BERE (Berula erecta), whose potential size is (1, 2, 3, 0) according to the four modalities of this trait. This pattern (1, 2, 3, 0) is interesting for the hydrobiologists, because it shows the continuity of the size distribution of Berula erecta.
Actually, having two plants with (almost) the same distribution is more meaningful than having two plants with the same affinity for one modality. Thus, we have tried another conversion of the initial dataset. We have proposed to represent the distribution of the affinities of a plant according to the different modalities of a trait as a unique property, called a pattern. This pattern is composed as follows: first comes a letter that refers to the trait (like 'S' for potential Size) and then n numbers that refer to the affinity value of the modalities. For example S0122 means "the potential size of members of this species is never of the first class (<0.08 m), sometimes of the second class (between 0.08 and 0.3 m), often of the third and fourth classes (between 0.3 and 1 m and between 1 and 5 m)". The corresponding binary table, built manually, is shown in Table 3 for the potential size. Looking at this table, one can see that very few patterns are common to more than two individuals. The lattice built from these data has 76 concepts spread over 6 levels (excluding top and bottom). The lattice built for the three traits potential size, perennation and potential of regeneration is shown in Figure 2. We can see that most of the patterns belong to only one individual.

Table 3. Pattern table of traits data (potential size)

Figure 2. The Galois lattice built from three traits of the pattern table

Furthermore, from the whole lattice, 219 implication sets were extracted with a support under 5. This means only 5 plants (for the best result) support these implications. This is due to the patterns, which are very precise, so that few macrophytes match each of them. To solve this problem we can decrease the precision of the pattern, which can be done simply by grouping affinities.
Either we consider the presence (affinities 1, 2 and 3 grouped together) and the absence (affinity 0) of the modality, or we consider the affinity as low (affinities 0 and 1 grouped together) or high (affinities 2 and 3 grouped together). The implications extracted following those methods have a support of up to 7 for the first solution and 8 for the second, which is much better. Nevertheless, grouping those affinities is not pertinent for the hydrobiologists.

4. Discussion

The two approaches we have studied so far are not very efficient with respect to the hydrobiologists' requirements. The first one gives too much, unstructured information, while the second one gives very little but structured information. To explore this second approach further we will rely on [10], which proposed methods to deal with complex data within the Galois lattice theory. Actually [10] proposes to build and compare two lattices:

• Union lattice: the concept intent contains all the properties of the individuals belonging to the extent.
• Intersection lattice: the concept intent contains the properties belonging to all the individuals of the extent.

These lattices are built on specific Galois connections, depending on the object types (histogram, interval ...). As our application deals with histogram data, θ(x) = [θ_1, θ_2, θ_3 ...], we could use the following Galois connections.

Union:
f(X) = [max_{x in X} θ_1, max_{x in X} θ_2, max_{x in X} θ_3 ...]
g(Y) = {x | for all y in Y, θ(x) ≤ y}

Intersection:
f(X) = [min_{x in X} θ_1, min_{x in X} θ_2, min_{x in X} θ_3 ...]
g(Y) = {x | for all y in Y, θ(x) ≥ y}

This approach makes it possible to compare two species whose trait patterns are different.
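
To make the two connections more concrete, here is a small Python sketch. It assumes that each species is represented by its affinity tuple for a single trait and that ≤ and ≥ are read component-wise; both are our reading of the formulas, not something taken from [10]. The function names are hypothetical, and the two size profiles are those quoted in Section 2.

```python
# Size profiles quoted in the text (affinities for the four size classes).
THETA = {
    "BERE": (1, 2, 3, 0),  # Berula erecta
    "CALO": (0, 1, 2, 2),  # Callitriche obtusangula
}


def union_intent(species):
    """f for the union lattice: component-wise maximum over the extent."""
    return tuple(max(values) for values in zip(*(THETA[s] for s in species)))


def intersection_intent(species):
    """f for the intersection lattice: component-wise minimum over the extent."""
    return tuple(min(values) for values in zip(*(THETA[s] for s in species)))


def union_extent(intent):
    """g for the union lattice: species whose profile is <= intent everywhere."""
    return {s for s, theta in THETA.items()
            if all(t <= y for t, y in zip(theta, intent))}


def intersection_extent(intent):
    """g for the intersection lattice: species whose profile is >= intent everywhere."""
    return {s for s, theta in THETA.items()
            if all(t >= y for t, y in zip(theta, intent))}


pair = ["BERE", "CALO"]
print(union_intent(pair))         # (1, 2, 3, 2)
print(intersection_intent(pair))  # (0, 1, 2, 0)
```

Under these assumptions the pair shares a non-trivial description in both lattices, whereas the exact-match pattern encoding gives the two species no common size property.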
For example, considering the two species Berula erecta and Callitriche obtusangula, whose size patterns are respectively [1, 2, 3, 0] and [0, 1, 2, 2], they could form a union-concept whose intent is [1, 2, 3, 2], and an intersection-concept whose intent is [0, 1, 2, 0]. In the ordinary way, there is no common size property between the two species (see Figure 2).

Other approaches are able to deal with such complex data, by building several lattices and then combining them, or by cutting big lattices (see e.g. [6]). Using fuzzy lattices [3] is another interesting way, since the affinity properties are very similar to probabilities.

5. Conclusions

Our aim is to help hydrobiologists in defining a new evaluation system of the quality of water bodies. In this paper, the main concern with respect to that problem is to extract knowledge from data that do not depend on regional characteristics. This is an important problem in order to be able to compare the quality of water bodies in different regions and to build a coherent evaluation system over Europe. Analyzing biological traits and determining functional groups is a promising approach for such an aim, as they make it possible to evaluate water quality in a more general way than the species themselves. In this paper, we focus on the analysis of biological traits of macrophytes. In order to determine functional groups of macrophytes, we have proposed to use Galois lattices and have tried to extract groups of biological traits shared by groups of species, and to analyze implications between biological traits. We have pointed out the fact that traits data are represented as triples (trait, modality, affinity), which makes them too complex to directly build a lattice from them.
We have thus proposed two conversions from those data to binary ones: building a full disjunctive table and using patterns which represent the distributions of species affinities with respect to the modalities of biological traits. Neither of these approaches is really satisfactory. The first one gives too much, unstructured information, while the second one gives very little but structured information.

As further research, we propose to investigate the benefits of using lattices with a more complex structure, such as those defined in [10]. Those lattices will allow us to extend the second approach studied, by building more general and thus more representative concepts. They should overcome the problems of both the approaches already used: the information on the distributions will be kept, but will be more general, so that more useful concepts can be extracted. Furthermore, we want to explore the association rules provided by these lattices. To validate the approach, the concepts and rules extracted will be shown to the experts, who have to give them a functional interpretation with respect to aquatic ecosystems.

6. Acknowledgments

A. Bertaux and F. Le Ber thank the AERM, Agence de l'Eau Rhin-Meuse, for supporting this project.

7. References

[1] M.-F. Bazerques, "Directive-cadre sur l'eau : le bon état écologique des eaux douces de surface : sa définition, son évaluation", Communication au Ministère de l'Écologie et du Développement Durable, Paris, 2004.
[2] M. Barbut, B. Monjardet, Ordre et classification – Algèbre et combinatoire, Hachette, Paris, France, 1970.
[3] R. Bělohlávek, "Fuzzy Galois Connections", Math. Logic Quarterly, 1999, volume 45, number 4, pages 497-504.
[4] B. Davey, H. Priestley, Introduction to Lattices and Order, Cambridge University Press, Cambridge, UK, 1990.
[5] V. Duquenne, "Contextual implications between attributes and some representational properties for finite lattices", in: Beiträge zur Begriffsanalyse, B. I. Wissenschaftsverlag, Mannheim, 1987, pages 213-239.
[6] V. Duquenne, "Latticial structures in data analysis", Theoretical Computer Science, 1999, volume 217, pages 407-436.
[7] B. Hérault, O. Honnay, "Using life-history traits to achieve a functional classification of habitats", Applied Vegetation Science, 2007, volume 10, pages 73-80.
[8] M. Lafont, P. Breil, P. Namour, J.-C. Camus, F. Malard, P. Le Pimpec, "Concept d'ambiance écologique dans les systèmes aquatiques continentaux (AESYS)", in: Actes du séminaire "État écologique des milieux aquatiques continentaux", Cemagref Éditions, 2001, pages 136-153.
[9] A. Napoli, "A smooth introduction to symbolic methods in knowledge discovery", in: Categorization in Cognitive Science, H. Cohen and C. Lefebvre editors, Elsevier, Amsterdam, 2006.
[10] G. Polaillon, Organisation et interprétation par les treillis de Galois de données de type multivalué, intervalle ou histogramme, PhD thesis, Paris IX Dauphine, 1998.
[11] S. Yevtushenko and contributors, Copyright (c) 2000-2006. http://conexp.sourceforge.net/
