A new selection strategy for selective cluster ensemble based on Diversity and Independency


Authors: Muhammad Yousefnezhad, Ali Reihanian, Daoqiang Zhang, Behrouz Minaei-Bidgoli

Muhammad Yousefnezhad (a), Daoqiang Zhang (a): Department of Computer Science, Nanjing University of Aeronautics and Astronautics, China.
Ali Reihanian (b): Department of Electrical and Computer Engineering, University of Tabriz, Iran.
Behrouz Minaei-Bidgoli (c): Department of Computer Engineering, Iran University of Science and Technology, Iran.

Abstract

This research introduces a new strategy in cluster ensemble selection by using the Independency and Diversity metrics. In recent years, Diversity and Quality, two metrics in the evaluation procedure, have been used for selecting the basic clustering results in cluster ensemble selection. Although Quality can improve the final results in a cluster ensemble, it cannot control the procedures of generating the basic results, which causes a gap in predicting the accuracy of the generated basic results. Instead of Quality, this paper introduces Independency as a supplementary method to be used in conjunction with Diversity. The paper therefore uses a heuristic metric, based on the procedure of converting code to a graph in software testing, in order to calculate the Independency of two basic clustering algorithms. Moreover, a new modeling language, which we call the "Clustering Algorithms Independency Language" (CAIL), is introduced in order to generate graphs which depict the Independency of algorithms. Also, Uniformity, a new similarity metric, is introduced for evaluating the diversity of the basic results.
As supporting evidence, our experimental results on various standard data sets show that the proposed framework dramatically improves the accuracy of the final results in comparison with other cluster ensemble methods.

Keywords: Independency of algorithms, Diversity of primary results, selective cluster ensemble, Algorithm's Graph

1. Introduction

Clustering, one of the main tasks in data mining, is to discover meaningful patterns in non-labeled data sets (Fred and Lourenço, 2008; Strehl and Ghosh, 2002; Topchy et al., 2003). Generally, basic clustering algorithms cannot recognize accurate patterns in a complex data set because they optimize the final clustering results according to their objective functions. In other words, the patterns of each data set are recognized from a special perspective, according to the algorithm's objective function, instead of the natural relations between data points in the data set (Jain et al., 2004). Combining the primary clustering results generated by basic clustering algorithms allows a cluster ensemble to achieve better final results. There are two steps in a cluster ensemble: in the first step, different results are generated from the basic clustering methods by using different algorithms and changing the number of partitions. In the second step, the basic results (the ensemble committee) are combined by an aggregating mechanism, which leads to the generation of the final result (Alizadeh et al., 2011, 2014; Alizadeh et al., 2012; Strehl and Ghosh, 2002). The second step is performed by consensus functions.
Introduced by Fern and Lin (2008), selective cluster ensemble is an approach which combines a selected group of the best primary results from the ensemble committee, according to one or more consensus metrics, in order to improve the accuracy of the final result. The selection strategy aims to select the better partitions of the ensemble committee. In recent years, Diversity and Quality have been used to select the basic clustering results. A proper selection strategy can reflect the implicit features of data sets, and the clustering performance can thereby be improved (Alizadeh et al., 2011, 2014; Alizadeh et al., 2012; Fern and Lin, 2008; Jia et al., 2012; Limin and Xiaoping, 2012). Quality, however, cannot control the procedures of generating the basic results or predict the accuracy of those results. In order to evaluate the basic results according to the process of the basic clustering algorithms, Yousefnezhad (2013) introduced the Independency metric in the D&I (Diversity and Independency) method. With the same idea, Alizadeh et al. (2015) introduced a new method called WOCCE (Wisdom Of Crowds Cluster Ensemble), in which Independency is used as a criterion to map WOC (The Wisdom of Crowds), a theory from social science, to Cluster Ensemble Selection. Although the performances and processes of D&I and WOCCE are the same (Alizadeh et al., 2015; Yousefnezhad et al., 2013), Alizadeh et al. (2015) used the concepts of WOC for heuristically justifying each component of D&I. In addition, they integrated the main assumptions of D&I into a new metric called "Decentralization".
In the process of D&I and WOCCE, two algorithms of two different types are considered completely independent, and the Independency degree of two algorithms of the same type is calculated from a matrix of those algorithms' random values. For instance, the random values of k-means are the random values of the clusters' centers in the first iteration of the algorithm (Alizadeh et al., 2015; Yousefnezhad et al., 2013). Since the performance and many concepts of D&I and WOCCE are the same, only D&I is used in this paper. This paper proposes a new method for calculating Independency which is based on the procedure of converting code to a graph in software testing. In this method, both algorithms of the same type and algorithms of different types can have an Independency degree. The Independency degree is a value between zero and one which shows the probability of the generated result's accuracy, based on analyzing the problem-solving procedures of the algorithms. In addition, a new modeling language, named CAIL (Clustering Algorithms Independency Language), is introduced in this paper, which normalizes clustering algorithms' codes and pseudo-codes. Moreover, this paper proposes a new metric based on APMM (Alizadeh-Parvin-Moshki-Minaei) for evaluating the diversity of the basic results. Also, a new method for combining basic results, called WEAC (Weighted Evidence Accumulation Clustering), which is based on EAC (Evidence Accumulation Clustering; Fred and Jain, 2005), is introduced in this paper. The main contributions of this paper are:

1. A new strategy for evaluating and selecting the best basic results in Cluster Ensemble Selection is introduced. This new strategy is based on the Independency and Diversity metrics.
2. Unlike the previous calculation of Independency, this paper introduces a new method for calculating the real value of the Independency degree between two algorithms of the same type. This method can also calculate the Independency degree between two algorithms of different types, which the previous calculation simply considered completely independent.

3. For evaluating the Independency metric, this paper introduces a new modeling language named CAIL, which is designed for estimating the Independency degree in clustering problems.

4. This paper introduces Uniformity, a greedy metric for evaluating the diversity of two basic results. This metric is based on APMM.

5. This paper introduces WEAC, a new method for combining weighted basic results based on EAC. While this paper uses the Independency degree as the weight in WEAC for generating the final clustering result, any other metric can be used as the weight in WEAC for different clustering solutions in future works.

The main goals of this paper are to improve the performance of D&I (Yousefnezhad, 2013; Yousefnezhad et al., 2013) and WOCCE (Alizadeh et al., 2015) by proposing the new calculation of the Independency and Diversity metrics, and also to omit the thresholding procedures for Independency used in the two mentioned methods (D&I and WOCCE).

This paper is organized as follows. Section 2 describes previous works on selective cluster ensemble.
Section 3 presents our proposed method. In Section 4, our experimental results on 17 standard data sets of varied scale are presented. Finally, conclusions are given in Section 5.

2. Background

2.1. Clustering analysis

The major aim of data clustering is to find groups of patterns (clusters) such that patterns in one cluster are more similar to each other than to patterns of other clusters (Akbari et al., 2015). A clustering algorithm decides which cluster each input datum belongs to (Bahrololoum et al., 2015). Thus, clustering can be considered a powerful tool to reveal and visualize the structure of data (Izakian et al., 2015). Basic clustering algorithms optimize the final clustering results according to their objective functions. In other words, the patterns of each data set are recognized from a special perspective, according to the objective functions of the algorithms, instead of the natural relations between data points in the data set. Analyzing the similarity and the properties of clustering algorithms' objective functions is therefore necessary for generating the best results in cluster ensemble selection. Jain et al. (2004) proposed a taxonomy of clustering algorithms according to their objective functions. They showed that the methods in one group (with the same objective function) have almost the same performance on a particular data set. Moreover, many algorithms have been developed based on a specific algorithm, such as the extensions of k-means (Jain, 2010) or of the linkage methods (Gose, 1997). These facts motivate researchers to propose cluster ensemble methods.
Cluster ensembles can generate better final results by combining the basic results instead of only choosing the best one. Generally, a cluster ensemble has two important steps (Jain et al., 1999; Strehl and Ghosh, 2002):

1. Generating different results from primary clustering methods by using different algorithms and changing the number of their partitions. This step is called generating diversity (or variety).

2. Combining the primary results and generating the final ensemble. This step is performed by consensus functions (aggregating mechanisms).

It is clear that an ensemble with a set of identical models has no advantage, so the aim is to combine models which predict different outcomes. To achieve this goal, there are four components which can be changed: the data set, the clustering algorithms, the evaluation metrics, and the combination methods. A set of models can be created by two approaches: choosing the data representation, and choosing the clustering algorithms or their algorithmic parameters. Strehl and Ghosh (2002) proposed Mutual Information (MI) for measuring the consistency of data partitions; Fred and Jain (2005) proposed Normalized Mutual Information (NMI), which is independent of cluster size. This metric can be used to evaluate clusters and partitions in many applications. For instance, Zhong and Ghosh (2005) used NMI for evaluating clusters in document clustering, and Kandylas et al. (2008) used it for community knowledge analysis. Fern and Lin (2008) developed a method which effectively uses a selection of the basic partitions to participate in the ensemble, and consequently in the final decision.
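The generic two-step scheme above can be sketched end-to-end. The following is a minimal, self-contained illustration (not this paper's method): a toy one-dimensional k-means generates a committee by varying k and the random seed (step 1), and a co-association matrix serves as a simple consensus mechanism (step 2). The function names and the example data are ours.

```python
import random

def kmeans_1d(points, k, seed, iters=20):
    """Toy 1-D k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(p - centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def co_association(partitions, n):
    """Consensus step: fraction of committee partitions that put
    each pair of points in the same cluster."""
    m = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(partitions)
    return m

# Step 1: generate diversity by changing k and the initialization.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2]
committee = [kmeans_1d(data, k, seed) for k in (2, 3) for seed in (0, 1)]

# Step 2: combine the committee into a single co-association matrix.
consensus = co_association(committee, len(data))
```

A final partition can then be read off the matrix, for example by linking pairs whose co-association exceeds a threshold, which is the idea behind EAC-style consensus functions.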
Fern and Lin also used the Sum NMI (SNMI) and Pairwise NMI as quality and diversity metrics between partitions, respectively. Jia et al. (2012) proposed SIM for diversity measurement, which works based on NMI. Azimi and Fern (2009) used cluster ensemble selection to avoid consensus partitions which are excessively different from the base partitions they result from; they demonstrated that their method can produce partitions with enhanced SNMI. Limin and Xiaoping (2012) used Compactness and Separation for choosing the reference partition in cluster ensemble selection; they also used new diversity and quality metrics as a selection strategy. Alizadeh et al. (2011, 2014) and Alizadeh et al. (2012) explored the disadvantages of NMI as a symmetric criterion. They used the APMM and MAX metrics to measure diversity and stability, respectively, and suggested a new method for building a co-association matrix from a subset of the base cluster results. This paper introduces Uniformity for diversity measurement, which works based on the APMM metric. The algorithm Independency degree in cluster ensemble selection was introduced in (Alizadeh et al., 2015; Yousefnezhad, 2013; Yousefnezhad et al., 2013). In their method, primary clustering algorithms of different types are considered completely independent, and the Independency degrees of clustering algorithms of the same type are calculated by the "BPI" function. Also, Yousefnezhad et al. (2013) obtained the final result by thresholding on the generated basic results. Algorithm 1 shows the BPI function's pseudo-code (Yousefnezhad, 2013; Yousefnezhad et al., 2013).
Algorithm 1: Basic Primary Independency function (Yousefnezhad, 2013; Yousefnezhad et al., 2013)

Function BPI(C1, C2, P1, P2) Return [Result]
  If C1 and C2 are equal Then
    Distance-Matrix is the distance between P1 and P2
    Do while Distance-Matrix is not null
      Find the minimum cell of Distance-Matrix
      Store the cell in Temp-Array
      Remove the row and column of the found cell
      Create the new Distance-Matrix
    End loop
    Return Result = Average of Temp-Array
  Else
    Return Result = 1, depicting that the two algorithms are independent
  End If
End Function

In Algorithm 1, C1 and C2 represent the types of the clustering algorithms. According to Algorithm 1, BPI returns "Result = 1" when the algorithms are of two different types; indeed, BPI considers any two algorithms of different types to be fully independent. Also in Algorithm 1, P1 and P2 are basic parameters of the algorithms, such as the initial seed points in k-means. In fact, any random values or parameters which can change the final result of a basic clustering algorithm can be represented by P1 and P2 (Yousefnezhad, 2013; Yousefnezhad et al., 2013). In this paper, two parts of the approach introduced by Yousefnezhad et al. (2013) are improved. First, in order to model and evaluate the Independency of clustering algorithms, a new technique based on the algorithms' graph codes is presented. Second, the algorithms' Independency degrees are used as weights to evaluate diversity in the process of generating the final result. After modifying these two parts, the thresholding for the Independency metric (Yousefnezhad, 2013; Yousefnezhad et al., 2013) in the process of cluster ensemble selection is omitted.
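As a concrete illustration, the BPI function of Algorithm 1 can be rendered in Python roughly as follows. This is a sketch under our own simplifying assumption that the random parameters P1 and P2 are one-dimensional lists of values, so the distance matrix is built from absolute differences; in general, BPI works on a matrix of the algorithms' random values.

```python
def bpi(type1, type2, p1, p2):
    """Sketch of Algorithm 1 (BPI). type1/type2 are the algorithm types;
    p1/p2 are lists of random parameters (e.g. initial k-means seeds)."""
    if type1 != type2:
        return 1.0  # different types are treated as fully independent
    # Distance matrix between the two parameter sets.
    dist = [[abs(a - b) for b in p2] for a in p1]
    rows, cols = list(range(len(p1))), list(range(len(p2)))
    picked = []
    while rows and cols:
        # Find the minimum cell, store it, and drop its row and column.
        r, c = min(((i, j) for i in rows for j in cols),
                   key=lambda rc: dist[rc[0]][rc[1]])
        picked.append(dist[r][c])
        rows.remove(r)
        cols.remove(c)
    return sum(picked) / len(picked)
```

For example, two k-means runs whose initial seeds coincide yield a BPI of 0, while a k-means run and an FCM run always yield 1.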
Since the thresholding is omitted, there is no initial value for the "iT" (Independency Threshold) parameter as an input of the algorithm which will be proposed in the next section. In addition, in this algorithm the Independency degree between each two basic clustering algorithms is calculated by graph-based modeling.

2.2. Graph software testing

Software testing is an important part of software development, to which almost 60% of the total production cost is assigned. Software modeling, one of the main tasks in software testing, can be implemented with the help of syntax, input space, logic, or graphs. Graph-based modeling can provide a graphical representation of the source code, software design, use cases, etc. (Ammann and Offutt, 2008). It can be a useful mechanism for evaluating the procedures of clustering algorithms. This paper introduces the Clustering Algorithms Independency Language (CAIL), a new modeling language for normalizing codes and pseudo-codes, in which the concepts of graph-based modeling are used for calculating the degree of Independency of basic clustering algorithms. Also, a new instruction, based on the requirements of Independency evaluation, is proposed for transforming CAIL codes into graphs.

3. Proposed method

This section introduces a supplementary method, using the Diversity and Independency metrics, for selecting the best partitions in an ensemble committee. Figure 1 illustrates our proposed framework.

Fig. 1. The framework of the proposed method

Figure 1 shows how a final result is generated in our proposed method.
Generally, in the proposed method a data set is divided into non-aligned clusters in three stages: in the first stage, a basic clustering algorithm generates a result from the data set. In the second stage, this generated result is evaluated by the diversity metric, and the evaluated result is added to the ensemble committee only if it has an acceptable diversity degree. These two stages are repeated until the number of ensemble committee members is large enough. Then, the final result is created by using the members of the ensemble committee and their Independency degrees. The rest of this section is organized as follows: first, the diversity metric is introduced. After that, the concept of Independency in clustering algorithms is explained. Then, a new method for transforming clustering algorithms' codes and pseudo-codes into graphs is presented. Next, the CAIL code analyzer, software for automatically comparing the Independency of clustering algorithms, is introduced. After that, the pseudo-code of our proposed method is presented. Finally, a summary of the proposed method is given.

3.1. Diversity

After generating individual clustering results in Cluster Ensemble Selection methods, a consensus function must be used to evaluate them. NMI is used as the consensus function by most of the classical methods. Since NMI is a symmetric method, Alizadeh et al. (2011, 2014) and Alizadeh et al. (2012) analyzed its disadvantages and therefore proposed APMM and MAX for solving the symmetry problem of NMI.
The APMM is calculated as follows (Alizadeh et al., 2011, 2014):

\[ \mathrm{APMM}(C,P) = \frac{2\,n_c \log\left(\frac{n_c}{n}\right)}{n_c \log\left(\frac{n_c}{n}\right) + \sum_{i=1}^{k_p} n_i^p \log\left(\frac{n_i^p}{n}\right)} \tag{1} \]

In Eq. (1), n_c, n_i^p, and n are the size of cluster C, the size of the i-th cluster of partition P, and the number of samples which are available in the partition of cluster C, respectively. k_p is the number of clusters in the partition P. In fact, the only difference between NMI and APMM is that the first one (NMI) compares two partitions, while the second one (APMM) compares a partition with a cluster. To calculate the similarity of partition P with respect to a partition of the reference set (the ensemble committee), this paper uses AAPMM, which is calculated as follows (Alizadeh et al., 2011, 2014):

\[ \mathrm{AAPMM}(P, P^*) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{APMM}(C_i, P^*) \tag{2} \]

In Eq. (2), P^* is a partition from the reference set, C_i is the i-th cluster of partition P, and N is the number of clusters in the partition P. This paper proposes a redefined version of APMM, because the original version only measures the diversity between a cluster in the first partition and all of the clusters in the second partition (Alizadeh et al., 2014). This redefined metric, which is called Uniformity, is used for evaluating the diversity between a partition and a reference set (the ensemble committee). In other words, this metric is used to satisfy the Diversity criterion in the proposed method. Uniformity is defined as follows:

\[ \mathrm{Uniformity}(P) = \max_{1 \le i \le n} \left( \mathrm{AAPMM}(P, P_i) \right) \tag{3} \]

In Eq. (3), P_i is the i-th partition in the ensemble committee, and n is the number of members in the reference set.
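Under the form of Eqs. (1)–(3) given above, the diversity computation can be sketched as follows, working purely with cluster sizes. The function names and the list-of-clusters representation are our own assumptions, not the paper's implementation.

```python
from math import log

def apmm(n_c, ref_sizes, n):
    """Eq. (1) as written above: n_c = |C|, ref_sizes = the sizes n_i^p of
    the clusters of partition P, and n = the number of samples."""
    num = 2.0 * n_c * log(n_c / n)
    den = n_c * log(n_c / n) + sum(s * log(s / n) for s in ref_sizes)
    return num / den

def aapmm(partition, reference, n):
    """Eq. (2): average APMM of each cluster of P against partition P*."""
    ref_sizes = [len(c) for c in reference]
    return sum(apmm(len(c), ref_sizes, n) for c in partition) / len(partition)

def uniformity(partition, committee, n):
    """Eq. (3): maximum AAPMM of P over the ensemble committee."""
    return max(aapmm(partition, ref, n) for ref in committee)
```

A partition would then be appended to the committee only when 1 − uniformity(...) reaches the diversity threshold, in the spirit of Eqs. (4)–(5).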
Uniformity represents the maximum value of similarity between partition P and the other partitions of the ensemble committee. Since Uniformity is normalized between zero and one, we consider 1 − Uniformity to represent the diversity, as follows:

\[ \mathrm{DIV}(P) = 1 - \mathrm{Uniformity}(P) \tag{4} \]

As mentioned before, one of the conditions that should be established in order to append a partition to the ensemble committee (known as the diversity condition) is as follows:

\[ \mathrm{DIV}(P) \ge dT \tag{5} \]

This means that if the diversity of a generated partition satisfies dT (the diversity threshold), it will be added to the reference set.

3.2. Independency

Before explaining the details of Independency, two questions must be answered. First, what is the main goal of using Independency? In the proposed method, the correctness of the generated individual clustering results can be estimated by Independency. In fact, Independency tries to estimate the correctness by comparing the similarity of the clustering algorithms in the process of solving a clustering problem. In other words, this paper considers the correctness of two similar (low-value in the diversity estimation) individual clustering results to be low when they are generated by clustering algorithms with similar objective functions; on the other hand, the correctness of two similar individual clustering results is considered to be high when they are generated by two clustering algorithms with different objective functions, even if those results do not have a significant diversity. In practice, this comparison is considered to be reliable for complex data sets.
Indeed, in real-world data sets there is no class label; therefore, this is one of the best ways of estimating the correctness of the generated results, especially for similar individual clustering results (Fred and Lourenço, 2008). The second question is how this technique can be used with software such as SAS or SPSS, in which the code of the clustering algorithms cannot be found. For each clustering algorithm, the process of solving the problem is unique. Therefore, if the implementations of an algorithm in two different programming languages are converted and normalized based on the proposed method of this paper, the results must be the same; hence, any open-source code of that algorithm can be used. Figure 2 shows how an algorithm's graph array is generated for evaluating the clustering algorithm's Independency degree. According to Figure 2, a clustering algorithm's code is converted to a graph array in four stages. First, the Standard Code Mapping Table (SCMT), which is a consensus table, is prepared by examining the algorithms' codes. In other words, this table contains the mathematical, statistical, heuristic, and other kinds of functions which are used in the algorithms. This table, which is unique for each clustering problem, contains all the mentioned functions which are used in the basic clustering algorithms. After that, the clustering algorithms' codes are manually converted to CAIL scripts by considering the SCMT table. Then, the algorithms' graphs are generated by using the CAIL codes which were generated in the previous stage. Finally, the weighted edges are stored in an array for evaluating the algorithm's Independency degree.
This array is called the graph's array.

Fig. 2. The framework of the clustering algorithms' Independency evaluation

3.2.1. Clustering Algorithms Independency Language

In CAIL modeling, symbols are used instead of the original codes or pseudo-codes of the clustering algorithms. The main reasons are as follows. First, codes or pseudo-codes are usually written in a standard language structure, so they need to be converted into a homogeneous form in order to be compared with each other. What is more, the codes have many irrelevant details, such as the definitions of different variables. Also, many mathematical equations and pseudo-codes used in algorithms are not clear in papers. This paper proposes a modeling method based on the SCMT's symbols; this method is not sensitive to implementation details. The procedure of converting codes to the CAIL format is performed in five stages, listed as follows:

1. First, all additional code, such as the definitions of different variables and constants, descriptions, input and output commands, and any code that is not involved in the clustering process, is omitted. Also, the implementations of the specific functions which are used in the main function are omitted. For example, the implementations of evaluation metrics such as NMI and APMM can be omitted, because they are shown in the SCMT table as symbols.

2. The logical operators in conditions and loops are removed, because they do not affect the shape of the algorithm's graph.

3. All conditions (if, case, etc.) are converted to a unique format, "if, else, end". Also, all loops (for, while, repeat, etc.) are converted to a unique format, "while, break, end".
Indeed, the different forms of conditions and loops exist to let programmers implement algorithms' codes conveniently. They are not important here, because in clustering algorithms' Independency modeling the algorithms' processes are modeled rather than their individual implementations, and it is these processes that must affect the Independency. This paper uses "if, else, end" for all forms of conditions and "while, break, end" for all forms of loops.

4. The keyword "Begin" is added at the beginning, and the keyword "End" is added at the end, of a CAIL code, for clarity of our definitions.

5. The SCMT table is generated. This consensus table contains all the mathematical, statistical, heuristic, and other functions which are used in the basic clustering algorithms. In order to name the symbols in the SCMT table, this paper recommends the following instructions: first, all functions should be grouped according to their types. Each group can be denoted by a single English letter; for instance, R denotes the random function group, M the mathematical function group, and H the heuristic function group. Each function can then be denoted by the name of its group along with a number in brackets (see Table 3 for an example of the SCMT table).

Algorithm 2 and Algorithm 3 show two examples of CAIL scripts, for the k-means and FCM algorithms, respectively. These algorithms are generated according to the SCMT table illustrated in Table 3. As these two algorithms show, one advantage of CAIL code is that it does not contain any implementation details. Another advantage of using CAIL and the SCMT is that codes and pseudo-codes can be used together for modeling clustering algorithms.
Algorithm 2: k-means in the CAIL format

Begin
  R(1)
  While
    F(1)
    M(1)
  End
End

Algorithm 3: FCM in the CAIL format

Begin
  R(1)
  While
    M(2)
    M(3)
  End
End

3.2.2. Converting CAIL to an Independency graph

The Independency graph is a special-purpose application graph, and the CAIL codes model the algorithms' Independency in a standard format. Thus, this paper does not use the same graph-based model structure which is used in software testing for converting CAIL to an algorithm's graph; instead, it uses a custom format of graph-based modeling according to the requirements of evaluating the algorithms' Independency. In this method, the code conjunctions, which are "Begin", "End", conditions, loops, and their sub-sectors, are converted to nodes. The code between each two nodes is considered as their edge. Like software testing approaches, this method uses a directed graph for modeling algorithms. In the graphs generated in software testing, the code of each segment is written inside its node, and the logical operation of a condition or loop is usually written on the edges. Unlike software testing, in our proposed method the code of each segment is placed on the corresponding edge. The main reasons are as follows: first, the logical operations are omitted; second, it is the evaluation process that is important in this method; third, the CAIL codes were pruned in the previous section, and these pruned codes use standard codes according to the SCMT table; finally, the process of each algorithm is clearly visible in this form. In our proposed method, the code on each edge is considered as a non-numerical weight.
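The conversion just described — conjunctions become nodes, and the code between them becomes an edge weight — can be sketched for CAIL token streams such as Algorithms 2 and 3. The tokenized input format and the choice to drop empty segments are our own assumptions.

```python
def cail_to_graph_array(tokens):
    """Split a CAIL token stream at code conjunctions and keep the symbols
    between two consecutive conjunctions as the (non-numerical) weight of
    the edge that joins the corresponding nodes."""
    conjunctions = {"Begin", "End", "While", "Break", "If", "Else"}
    edges, current = [], []
    for tok in tokens:
        if tok in conjunctions:
            if current:  # empty segments carry no symbols (an assumption)
                edges.append(tuple(current))
            current = []
        else:
            current.append(tok)
    return edges

# Algorithm 2 (k-means) and Algorithm 3 (FCM) in tokenized CAIL form.
kmeans_cail = ["Begin", "R(1)", "While", "F(1)", "M(1)", "End", "End"]
fcm_cail = ["Begin", "R(1)", "While", "M(2)", "M(3)", "End", "End"]
```

For Algorithm 2 this yields the graph's array [("R(1)",), ("F(1)", "M(1)")]: one edge for the initialization segment and one for the loop body.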
An array of weighted edges, which is called the Independency graph's array, is used for storing the Independency graph in memory. These arrays are compared in order to calculate the Independency degree of the algorithms. Figure 3 and Figure 4 show two examples of CAIL codes and their converted graphs.

Fig. 3. An example of a CAIL code when it contains two conditions
Fig. 4. An example of a CAIL code when it contains a loop and a condition

3.2.3. Evaluating the Independency Graph
Figure 5 shows the general structure of the Code Dependence Degree Matrix (CDDM), which is used for evaluating the Independency of two clustering algorithms.

Fig. 5. The Code Dependence Degree Matrix (CDDM)

According to Figure 5, each cell of the CDDM matrix is calculated by the "Compare" function. Algorithm 4 gives the pseudocode of the "Compare" function. As the figure shows, this function compares the cells of the Independency graph arrays: each cell of the first algorithm's array is compared with all cells of the second algorithm's array. In this function, the "Count" variable is incremented whenever a symbol in the first algorithm's array has a matching symbol in the second algorithm's array. MSymbol represents the maximum number of symbols (blocks) in cell 1 and cell 2; for instance, if cell 1 contains 5 blocks and cell 2 contains 6 blocks, the value of MSymbol will be 6. The final result is normalized by dividing "Count" by MSymbol. The normalized value, which is called the CDD, is stored in the CDDM matrix cell that represents the intersection of the two mentioned cells.
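A compact Python sketch (ours; the paper's implementation is in MATLAB and C#) of the Compare function just described, together with the MaxCell reduction used in the next step to obtain the AID of Eq. 6. We read Eq. 6 as one minus the averaged MaxCell values, which reproduces the paper's AID[K, F] = 0.5 example:

```python
# Sketch (ours) of the "Compare" function of Algorithm 4 and of the greedy
# MaxCell reduction behind Eq. 6. Cells are sequences of SCMT symbols.
def compare(cell1, cell2):
    count = 0
    for sym1 in cell1:
        if sym1 in cell2:                   # Algorithm 4 breaks at first match
            count += 1
    msymbol = max(len(cell1), len(cell2))   # MSymbol
    return count / msymbol                  # the CDD value

def aid(cddm):
    """Repeatedly take the largest CDD, drop its row and column (Eq. 6)."""
    m = max(len(cddm), len(cddm[0]))
    rows, cols = set(range(len(cddm))), set(range(len(cddm[0])))
    total = 0.0
    while rows and cols:
        r, c = max(((i, j) for i in rows for j in cols),
                   key=lambda rc: cddm[rc[0]][rc[1]])
        total += cddm[r][c]                 # collect MaxCell_i
        rows.discard(r); cols.discard(c)    # remove its row and column
    return 1 - total / m                    # our reading of Eq. 6

# CDDM of k-means vs FCM (Figure 6): MaxCell_1 = 1, MaxCell_2 = 0
cddm = [[compare(["R(1)"], ["R(1)"]), compare(["R(1)"], ["M(2)", "M(3)"])],
        [compare(["F(1)", "M(1)"], ["R(1)"]),
         compare(["F(1)", "M(1)"], ["M(2)", "M(3)"])]]
print(aid(cddm))  # -> 0.5, matching AID[K, F] in the paper's example
```

Under this reading, identical arrays give AID = 0 and fully disjoint arrays give AID = 1.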
This function finds the maximum value of the CDDM's cells (the maximum of the CDDs), which is called MaxCell_i, and stores it for calculating the Independency degree of an algorithm according to its corresponding CDDM matrix. After that, the function removes the MaxCell's row and column from the CDDM matrix. In other words, for each block (each symbol in a cell is called a block) of the first algorithm, the most similar block in the second algorithm is found. Among the second algorithm's remaining blocks, the function then finds the most similar block to the next block of the first algorithm by removing the row and column of the previous MaxCell from the CDDM matrix, and calculates the new MaxCell for the reduced CDDM matrix. The process finishes when the size of the CDDM matrix reaches zero.

Algorithm 4: Compare Function
Function Compare (Cell1, Cell2) Return [CDD]
  Count = 0
  While we have a Symbol in Cell1
    Sym1 = Select a Symbol in Cell1
    Foreach Sym2 in Cell2
      If Sym2 = Sym1 Then
        Count++
        Break
      End If
    End Foreach
  End While
  MSymbol = Max-Sym (Cell1, Cell2)
  Return CDD = Count / MSymbol
End Function

Eq. 6 shows the Algorithms' Independency Degree (AID), which is calculated at each step. In this equation, n is the minimum number of cells in the first and second algorithms' arrays, and m is the maximum number of cells in the first and second algorithms' arrays:

AID(Alg_i, Alg_j) = 1 - (1/m) Σ_{i=1}^{n} MaxCell_i        (6)

The calculated results of Eq. 6 are stored in the Algorithm Independency Degree Matrix (AIDM) in order to be used in cluster ensemble selection. The size of the AIDM matrix is n × n, where n is the number of algorithms in the cluster ensemble. Eq.
7 shows how the AIDM cells are calculated:

a_ij = AID(Alg_i, Alg_j) if i ≠ j, and a_ij = -1 if i = j        (7)

Since this paper uses the BPI function (Yousefnezhad, 2013; Yousefnezhad et al., 2013) for calculating the Independency degree of algorithms of the same type, Eq. 7 assigns -1 to the Independency degree of each algorithm in comparison with itself. The final Independency degree is calculated by Eq. 8 during the running processes of the algorithms:

AI[Alg_i, Alg_j] = AIDM[Alg_i, Alg_j] if i ≠ j, and AI[Alg_i, Alg_i] = (1/m) Σ_{k=1}^{m} BPI[i, k] if i = j        (8)

In Eq. 8, Alg_i is a member of the selected algorithms in the ensemble committee, and m is the number of algorithms in the ensemble committee which are of the same type as Alg_i or Alg_j. Furthermore, AIDM is calculated by Eq. 7, and BPI is calculated by the pseudocode represented in Algorithm 1 (the basic parameters Independency function). Figure 6 illustrates an example of the CDDM matrix for comparing the k-means (K) and FCM (F) algorithms, which are defined by the CAIL scripts in Algorithm 2 and Algorithm 3. Furthermore, the MaxCell values are {MaxCell_1 = 1, MaxCell_2 = 0}, and AI[K, F] = AID[K, F] = 0.5 (see Figure 9).

Fig. 6. The CDDM for comparing k-means and FCM based on Algorithm 2 and Algorithm 3

3.3. Weighted Evidence Accumulation Clustering
In order to select the evaluated individual results in cluster ensemble selection, thresholding is used. Then, by applying the consensus function to the selected results, the co-association matrix is generated. Finally, by applying linkage methods to the co-association matrix, the final result is generated. These methods generate a dendrogram.
After that, they cut the dendrogram based on the number of clusters in the result (Alizadeh et al., 2015; Fred and Jain, 2005). In recent years, Evidence Accumulation Clustering (EAC) has been used in many studies as a high-performance consensus function for combining individual results (Alizadeh et al., 2011, 2014; Alizadeh et al., 2012; Alizadeh et al., 2015; Azimi and Fern, 2009; Fern and Lin, 2008; Fred and Jain, 2005). EAC divides the number of clusters shared by a pair of objects by the number of partitions in which that pair of objects is simultaneously presented. EAC uses Eq. 9 for generating the co-association matrix:

C(i, j) = n_{i,j} / m_{i,j}        (9)

In the above equation, m_{i,j} is the number of partitions in which the pair of objects (i and j) is simultaneously presented, and n_{i,j} represents the number of clusters shared by the objects with indices i and j. In effect, EAC considers the weights of all algorithms' results to be the same. This paper proposes Eq. 10 for generating the co-association matrix while considering the Independency degree of the algorithms as a weight for combining the basic results. In this equation, AI is calculated by Eq. 8:

C(i, j) = AI[Alg_i, Alg_j] × n_{i,j} / m_{i,j}        (10)

Figure 7 shows the process of generating the final result by using WEAC.

Fig. 7. The process of generating the final result

As Figure 7 depicts, the process of generating the AIDM matrix is done before running the algorithm. Indeed, this decreases the runtime of the algorithm, which will be discussed later in Section 4.2.

3.4. Summary of the Proposed Method
Algorithm 5 depicts the pseudocode of the proposed method.
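The weighted co-association update of Eq. 10, which WEAC applies inside the proposed method, can be sketched as follows. This is a simplified Python reading of ours: it assumes every object appears in every partition and attaches one AI weight per selected partition.

```python
# Simplified sketch (ours) of the AI-weighted co-association matrix of
# Eq. 10: each pairwise vote is scaled by the Independency weight of the
# partition that cast it, assuming all objects appear in all partitions.
import itertools

def weighted_co_association(partitions, n_objects):
    """partitions: list of (labels, ai_weight) pairs, one per selected result."""
    m = len(partitions)                      # every pair is present in all m
    C = [[0.0] * n_objects for _ in range(n_objects)]
    for labels, ai in partitions:
        for i, j in itertools.combinations(range(n_objects), 2):
            if labels[i] == labels[j]:       # the pair shares a cluster
                C[i][j] += ai / m            # AI-weighted vote (Eq. 10)
                C[j][i] = C[i][j]
    return C

parts = [([0, 0, 1], 1.0),   # partition from a fully independent algorithm
         ([0, 1, 1], 0.5)]   # partition weighted by a lower AI degree
C = weighted_co_association(parts, 3)
print(C[0][1], C[1][2])  # -> 0.5 0.25
```

With equal weights of 1.0 this reduces to the plain EAC matrix of Eq. 9; the average linkage step is then applied to C.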
In Algorithm 5, Kb is the number of clusters in the final result, and dT is the diversity threshold. Distances are measured by the Euclidean metric. The Generate-Basic-Algorithm function builds the partitions of the base clusterings (basic results); Generate-AI-Matrix builds the AI matrix according to Eq. 8 by using the AIDM matrix and the results of the BPI function. The Average-Linkage and Cluster functions build the final ensemble according to the average linkage method. The parameter Result is the final ensemble result, and nCE is the number of members in the ensemble committee.

Algorithm 5: The Proposed Method
Function CES (Dataset, Kb, dT) Return [Result, nCE]
  Initialize nCE to zero
  While we have a base cluster
    [IDX, Basic-Parameter] = Generate-Basic-Algorithm (Dataset, Kb)
    If (Diversity (IDX) > dT) Then
      Find the algorithm's AID from AIDM
      Insert IDX, AID, and Basic-Parameter into Ensemble-Committee
      nCE = nCE + 1
    End If
  End While
  AI = Generate-AI-Matrix (AIDM, BPI)
  W-Co-Acc = WEAC (Ensemble-Committee, AI)
  Z = Average-Linkage (W-Co-Acc)
  Result = Cluster (Z, Kb)
End Function

Three questions must be answered before this paper starts to explain the empirical results. First, "what is the main goal of using Independency estimation?" In the proposed method, Independency tries to estimate the correctness of the generated individual clustering results by comparing the similarity of the clustering algorithms in the process of solving a clustering problem.
In other words, this paper considers the correctness of two identical (low-value in the diversity estimation) individual clustering results to be low when they are generated by clustering algorithms with similar objective functions, and it considers the correctness of two identical individual clustering results to be high when they are generated by two clustering algorithms with different objective functions, even if those results do not have significant diversity. In practice, this comparison can be reliable for complex data sets. Indeed, there are no class labels in real-world data sets, and this is one of the best ways for estimating the correctness of the generated results, especially identical "individual clustering results" (Alizadeh et al., 2015). The next question to be answered is "how can we use this technique in applications such as SAS or SPSS, which do not expose the code of their clustering algorithms?" The process of solving a problem is unique for each clustering algorithm. So, if one algorithm, implemented in two different programming languages or even with two different implementation structures, is converted and normalized based on the proposed method, the results must be the same. Furthermore, the proposed "Compare" function calculates the same results for two different implementation structures because it only uses the contents of the cells. For instance, the results are the same when you exchange the blocks ("THEN" and "ELSE") in an "IF" condition. As a result, we can use other open-source codes for those algorithms from the Internet. The last question to be answered is "which level of abstraction must be used for converting the codes to CAIL scripts?"
As can be seen in the examples (Algorithm 2, Algorithm 3, and Figure 8), the CAIL codes are generated based on the general structures of the clustering algorithms. We mostly desire to compare the objective functions, distance metrics, general processes of solving the problem, and everything that can mathematically or technically change the performance of the clustering results. Indeed, it is important to use a unique structure (and a common SCMT table) for all clustering algorithms in an individual clustering problem, while a given algorithm can be implemented in different ways. In fact, we can report the employed CAIL scripts like the other parameters of the experiment, such as the distance metric, the types of basic clustering algorithms, etc.

4. Experiments
This section describes a series of empirical studies and reports their results. In the real world, unsupervised methods are used to find meaningful patterns in non-labeled data sets such as web documents. Since real data sets do not have class labels, there is no direct method for evaluating the performance of unsupervised methods. Like many previous studies (Alizadeh et al., 2014; Alizadeh et al., 2012; Alizadeh et al., 2015; Fern and Lin, 2008; Fred and Jain, 2005; Yousefnezhad, 2013; Yousefnezhad et al., 2013), this paper compares the performance of its proposed method with other basic and ensemble methods by using standard data sets and their real classes. Although this evaluation cannot guarantee that the proposed method leads to high performance on all data sets in comparison with other methods, it can be considered an example to demonstrate the superiority of the proposed method.
Table 1. List of data sets and their related information

No. | Name          | Feature | Class | Sample
1   | Half Ring     | 2       | 2     | 400
2   | Iris          | 4       | 3     | 150
3   | Balance Scale | 4       | 3     | 625
4   | Breast Cancer | 9       | 2     | 683
5   | Bupa          | 6       | 2     | 345
6   | Galaxy        | 4       | 7     | 323
7   | Glass         | 9       | 6     | 214
8   | Ionosphere    | 34      | 2     | 351
9   | SA Heart      | 9       | 2     | 462
10  | Wine          | 13      | 2     | 178
11  | Yeast         | 8       | 10    | 1484
12  | Pendigits     | 16      | 10    | 10992
13  | Statlog       | 36      | 7     | 6435
14  | Optdigits     | 64      | 10    | 5620
15  | Arcene        | 10000   | 2     | 900
16  | CNAE-9        | 857     | 9     | 1080
17  | Sonar         | 60      | 2     | 208

4.1. Data sets
The proposed method is applied to 17 different standard UCI data sets. Like many other papers and studies, such as (Alizadeh et al., 2011, 2014; Alizadeh et al., 2012; Alizadeh et al., 2015; Yousefnezhad, 2013; Yousefnezhad et al., 2013), we have used standard data sets to evaluate our numerous experiments. These standard data sets have no negative or positive effect on the performance of an algorithm; in fact, the reason for using them is to conduct an evaluation with no artificial negative/positive bias and to compare different algorithms fairly. We have chosen data sets that are as diverse as possible in their numbers of true classes, features, and samples, because this variety better validates the obtained results. These data sets are described in Table 1. More information about them is available in (Alizadeh et al., 2015; Azimi and Fern, 2009; Jain et al., 2004; Newman et al., 1998; Yousefnezhad, 2013; Yousefnezhad et al., 2013). The features of the data sets are normalized to a mean of 0 and variance of 1, i.e. N(0, 1).

4.2. CAIL code analyzer
As mentioned earlier, this paper develops an application for evaluating the Independency degree by using CAIL codes. Figure 8 shows a snapshot of this application.
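The N(0, 1) feature normalization mentioned above is standard z-scoring; a minimal sketch using only the Python standard library:

```python
# Standardize one feature column to mean 0 and variance 1, i.e. N(0, 1),
# as done for all data set features before clustering.
import statistics

def zscore(column):
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)   # population standard deviation
    return [(x - mu) / sigma for x in column]

print(zscore([2.0, 4.0, 6.0]))  # symmetric values map to -1.22..., 0.0, 1.22...
```

Each feature column is normalized independently; this keeps distance-based algorithms from being dominated by features with large raw scales.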
This tool is developed with Microsoft C# .NET 2013. First, the application converts the CAIL codes to graphs. After that, the graphs' arrays are stored in memory. Finally, the arrays are compared with each other, and the Independency degree is shown in a message box. This application can work with any SCMT table that is prepared in the format described in Section 3.2.1. As is clear in Figure 8, two CAIL codes are given as inputs. In this figure, Code 1 implements k-means while Code 2 implements spectral clustering using a sparse similarity matrix. The message box represents the Independency degree of the two mentioned algorithms.

Fig. 8. The CAIL code analyzer

4.3. Performance Analysis
This paper used MATLAB R2014b (8.4) in order to generate the experimental results. The algorithms described in Table 2 were used to generate the ensemble committee.

Table 2. The standard code mapping table

No. | Algorithm Name                                                     | ID
1   | K-Means                                                            | K
2   | Fuzzy C-Means                                                      | F
3   | Median K-Flats                                                     | M
4   | Gaussian Mixture                                                   | G
5   | Subtract Clustering                                                | SUB
6   | Single-Linkage Euclidean                                           | SLE
7   | Single-Linkage Hamming                                             | SLH
8   | Single-Linkage Cosine                                              | SLC
9   | Average-Linkage Euclidean                                          | ALE
10  | Average-Linkage Hamming                                            | ALH
11  | Average-Linkage Cosine                                             | ALC
12  | Complete-Linkage Euclidean                                         | CLE
13  | Complete-Linkage Hamming                                           | CLH
14  | Complete-Linkage Cosine                                            | CLC
15  | Ward-Linkage Euclidean                                             | WLE
16  | Ward-Linkage Hamming                                               | WLH
17  | Ward-Linkage Cosine                                                | WLC
18  | Spectral clustering using a sparse similarity matrix               | SPS
19  | Spectral clustering using Nystrom method with orthogonalization    | SPN
20  | Spectral clustering using Nystrom method without orthogonalization | SPW

Table 3 illustrates the SCMT table which is used in this paper.
Table 3. The standard code mapping table

No. | Symbol | Description
1   | R(1)   | Generate x random numbers
2   | R(2)   | Random selection
3   | M(1)   | Y = EuclideanDistance(A, B)
4   | M(2)   | c_j = Σ_{i=1}^{N} u_{ij}^m x_i / Σ_{i=1}^{N} u_{ij}^m (FCM centroid update)
5   | M(3)   | u_{ij} = 1 / Σ_{k=1}^{C} (‖x_i - c_j‖ / ‖x_i - c_k‖)^{2/(m-1)} (FCM membership update)
6   | M(4)   | Apply the exp function: S = e^{-A^2 / (2σ^2)}
7   | M(5)   | Apply the Laplacian function: L = D^{-1/2} S D^{-1/2}
8   | M(6)   | Largest magnitude
9   | M(7)   | Smallest magnitude
10  | M(8)   | Normalizing A and B
11  | M(9)   | Y = HammingDistance(A, B)
12  | M(10)  | Y = CosineDistance(A, B)
13  | M(11)  | D_SL(C_i, C_j) = min_{a∈C_i, b∈C_j} d(a, b) (single linkage)
14  | M(12)  | D_CL(C_i, C_j) = max_{a∈C_i, b∈C_j} d(a, b) (complete linkage)
15  | M(13)  | D_AL(C_i, C_j) = (1 / n_i n_j) Σ_{a∈C_i} Σ_{b∈C_j} d(a, b) (average linkage)
16  | M(14)  | D_WL(C_i, C_j) = sqrt(2 n_i n_j / (n_i + n_j)) d(ā, b̄) (Ward linkage)
17  | M(15)  | Q(θ, θ̂^(j)) = E(ℓ_0(θ; z) | t, θ̂^(j)) (EM Q-function)
18  | M(16)  | μ* = argmin_μ Σ d(μ, m) (median update)
19  | M(17)  | Z = X / Y
20  | M(18)  | d_x = distance of x to subspace P_i (median k-flats)
21  | M(19)  | argmax_{i=1..K} P_i* x (closest-subspace assignment)
22  | F(1)   | Assign each object to the closest centroid/subspace
23  | F(2)   | Generate (t-nearest-neighbor) sparse distance matrix
24  | F(3)   | Convert distance matrix to similarity matrix
25  | F(4)   | Do orthogonalization
26  | F(5)   | Restore cluster labels in original order
27  | F(6)   | Compute the proximity matrix
28  | F(7)   | Merge the two closest clusters
29  | F(8)   | Y = Subclass(X)
30  | F(9)   | Update P_i* (subspace update)

Figure 9 shows the AIDM matrix calculated by the SCMT table described in Table 3 and the CAIL code analyzer.

Fig. 9. The AIDM matrix

In this part, the results of the AIDM matrix are analyzed. The results of the linkage-family algorithms participate in the final result based on their Independency degrees.
According to Figure 9, the differences between the Independency degrees of these algorithms are 0.25 or 0.5. The differences are based on the problem-solving mechanisms of the algorithms and the distance metrics. Also, k-means is considered independent where the linkages do not use the Euclidean distance metric. On the other hand, since the spectral algorithms use k-means to generate the final results after the Laplacian transformation, the Independency degrees of the spectral algorithms toward k-means are considered special. As mentioned earlier, the results of the proposed method are compared with well-known base algorithms such as k-means and spectral clustering, as well as MCLA (Strehl and Ghosh, 2002), EAC (Fred and Jain, 2005), MAX (Alizadeh et al., 2011), APMM (Alizadeh et al., 2014; Alizadeh et al., 2012), D&I (Yousefnezhad et al., 2013), and WOCCE (Alizadeh et al., 2015), which are the state-of-the-art cluster ensemble (selection) methods. All of these algorithms are implemented in MATLAB R2014b (8.4) by the authors in order to generate the experimental results. All results are reported by averaging the results of 10 independent runs of the algorithms used in the experiment. In this paper, dT is chosen such that the proposed algorithm reaches a running time of approximately 2 min on a PC with certain specifications.⁸ The experimental results are given in Table 4. The results are in the form of accuracy (percentage) ± standard deviation, achieved over the 10 runs of each algorithm. The best results on each data set are bolded.
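The accuracy ± standard deviation reporting over 10 independent runs can be sketched as follows; the accuracy values below are illustrative, not taken from Table 4:

```python
# Aggregate repeated runs into the "mean ± std" form used in Table 4.
import statistics

def report(accuracies):
    """Format the mean and sample standard deviation of repeated runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)   # sample std over the 10 runs
    return f"{mean:.2f} ± {std:.2f}"

runs = [78.1, 79.4, 77.8, 80.0, 78.6, 79.0, 78.3, 79.7, 78.9, 79.2]
print(report(runs))  # -> 78.90 ± 0.71
```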
Table 4. The accuracies (in percentage) along with the standard deviations achieved in the experiments, based on the 10 runs of each algorithm.

According to Table 4, although basic clustering algorithms have shown high performance on some data sets, they cannot recognize the true patterns in all of them. As mentioned earlier in this paper, in order to solve the clustering problem, each basic algorithm considers a special perspective of a data set, which is based on its objective function. The results of the basic clustering algorithms depicted in Table 4 are good evidence for this claim. Furthermore, the results generated by MCLA and EAC show the effect of the aggregation method on improving the accuracy of the final results. According to Table 4, WOCCE and the proposed algorithm have generated better results in comparison with the other basic and ensemble algorithms. Even though the proposed method was outperformed by a number of algorithms on three data sets (Glass, SA Heart, and Yeast), the majority of the results demonstrate the superior accuracy of the proposed method in comparison with the other algorithms. To clarify the superiority of our proposed method in comparison with its powerful ensemble rivals, the last row of Table 4 (Average) shows the average accuracy achieved by each method. Indeed, as a classic ensemble method, EAC does not have any evaluation and selection in its process. This method cannot omit errors that are made in the process of recognizing the patterns of the basic clustering results by using the correct information of the other basic algorithms' results.

⁸ Apple MacBook Pro, CPU = Intel Core i7 (4 × 2.4 GHz), RAM = 8 GB, OS = OS X 10.10.
The results of EAC given in Table 4 show the effects of evaluation and selection in cluster ensemble selection methods.

4.4. Parameter Analysis
In this section, the effect of the diversity threshold on the performance and runtime is analyzed. The main goal of this experiment is to show the relation between performance and runtime in the proposed method and to illustrate how the optimized values for the diversity threshold are determined. Thus, this paper employs multiple data sets for this experiment: two low-dimensional data sets (Half Ring, Iris) as well as two high-dimensional data sets (Breast Cancer, Wine). Figure 10 illustrates the relationship between the runtime of the proposed method and the diversity thresholds; the vertical axis refers to the runtime and the horizontal axis refers to the diversity threshold.

Fig. 10. The effect of the diversity threshold (dT) on the runtime of the proposed algorithm

Figure 11 illustrates the relationship between the performance of the proposed method, based on the number of correctly classified samples, and the diversity thresholds; the vertical axis refers to the performance while the horizontal axis refers to the diversity threshold.

Fig. 11. The effect of the diversity threshold (dT) on the performance of the proposed algorithm

As can be seen in Figure 10 and Figure 11, although increasing the diversity threshold can improve the performance of the proposed method, it also increases the runtime of the algorithm. Therefore, this paper uses a time constant (2 min) to establish a balance between the performance and the runtime.

4.5.
Noise and Missing-Value Analysis
In this section, a few experiments are conducted in order to analyze the effect of noise and missing values on the performance of the proposed method. This paper employs Arcene and CNAE-9 for this experiment, since these two data sets are high-dimensional, large (in samples) data sets. Figure 12 illustrates the performance of the proposed method, WOCCE, APMM, MAX, and MCLA on the data sets with missing values. For this purpose, some attributes of the mentioned data sets are randomly chosen and their values are set to null. According to Figure 12, WOCCE and the proposed method generate more stable results. As is clear in this figure, the proposed method can effectively handle the missing values. The reason is that it uses the graph-based Independency and Uniformity for diversity evaluation.

Fig. 12. Missing-value analysis: (a) the performance of the proposed method, WOCCE, APMM, MAX, and MCLA on Arcene with missing values; (b) the same methods on CNAE-9 with missing values.

Figure 13 illustrates the performance of the proposed method, WOCCE, APMM, MAX, and MCLA on the data sets which contain noise. For this purpose, some attributes of the mentioned data sets are randomly changed. According to Figure 13, WOCCE and the proposed method generate more stable results.

Fig. 13. Noise analysis: (a) the performance of the proposed method, WOCCE, APMM, MAX, and MCLA on noisy Arcene; (b) the same methods on noisy CNAE-9.

It was claimed earlier that the goal of Independency is to achieve high performance as well as generating robust and stable results.
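The missing-value injection used in this experiment can be sketched as follows; this is a hypothetical Python helper of ours, since the exact sampling scheme is not specified in the paper:

```python
# Hypothetical sketch of the robustness experiment: randomly choose a
# fraction of all attribute values and set them to null before clustering.
import random

def inject_missing(data, fraction, seed=0):
    """Randomly null out `fraction` of all attribute values (in place)."""
    rng = random.Random(seed)            # fixed seed for a repeatable run
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    for i, j in rng.sample(cells, int(fraction * len(cells))):
        data[i][j] = None
    return data

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
inject_missing(X, 0.5)
print(sum(v is None for row in X for v in row))  # -> 3 of 6 values are null
```

The noise experiment is analogous, except the chosen values are perturbed rather than set to null.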
This experiment is good evidence for the mentioned claim.

5. Conclusion
Traditional cluster ensemble methods concentrate on the diversity and quality of the basic results. This paper suggests a new method that employs graph-based modeling, a concept from software testing, for evaluating the Independency of basic clustering algorithms in cluster ensemble selection. The most important advantage of this approach is the addition of new aspects, such as Independency, which is based on the graph of the clustering algorithms, as well as a new framework for selecting high-quality basic clustering results. The degree of Independency obtained from the proposed method is used as a weight alongside the diversity evaluation in the process of generating the final result. Also, this paper proposes a procedure for assessing the Independency of the base algorithms. This procedure is based on CAIL, a new modeling language for calculating the Independency of clustering algorithms. We also introduce the Uniformity criterion to measure the diversity of the basic results. To support the claims of this paper, the results of the proposed method are compared with the results of basic clustering methods, cluster ensemble methods, and cluster ensemble selection methods. The results were achieved by applying the mentioned methods to 17 standard data sets primarily taken from the UCI repository. In our experiments, data sets with different scales (small, average, and large) were used so that the accuracy could be evaluated regardless of the scale of a data set. In addition, in order to ensure the accuracy of all results, each experiment has been repeated 10 times.
Similar to other pioneering ideas, the proposed framework can be improved later. This paper suggests employing more basic clustering algorithms in order to better satisfy the diversity of the basic results.

References
Akbari, E., Dahlan, H.M., Ibrahim, R., Alizadeh, H., 2015. Hierarchical cluster ensemble selection. Engineering Applications of Artificial Intelligence 39, 146-156.
Alizadeh, H., Minaei-Bidgoli, B., Parvin, H., 2011. A new asymmetric criterion for cluster validation. Iberoamerican Congress on Pattern Recognition. Springer, pp. 320-330.
Alizadeh, H., Minaei-Bidgoli, B., Parvin, H., 2014. Cluster ensemble selection based on a new cluster stability measure. Intelligent Data Analysis 18, 389-408.
Alizadeh, H., Parvin, H., Parvin, S., 2012. A framework for cluster ensemble based on a max metric as cluster evaluator. IAENG International Journal of Computer Science 39, 10-19.
Alizadeh, H., Yousefnezhad, M., Bidgoli, B.M., 2015. Wisdom of Crowds cluster ensemble. Intelligent Data Analysis 19, 485-503.
Ammann, P., Offutt, J., 2008. Introduction to software testing. Cambridge University Press.
Azimi, J., Fern, X., 2009. Adaptive cluster ensemble selection. IJCAI, pp. 992-997.
Bahrololoum, A., Nezamabadi-pour, H., Saryazdi, S., 2015. A data clustering approach based on universal gravity rule. Engineering Applications of Artificial Intelligence 45, 415-428.
Fern, X.Z., Lin, W., 2008. Cluster ensemble selection. Statistical Analysis and Data Mining 1, 128-141.
Fred, A., Lourenço, A., 2008. Cluster ensemble methods: from single clusterings to combined solutions. Supervised and Unsupervised Ensemble Methods and Their Applications.
Springer, pp. 3-30.
Fred, A.L., Jain, A.K., 2005. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 835-850.
Gose, E., 1997. Pattern recognition and image analysis.
Izakian, H., Pedrycz, W., Jamal, I., 2015. Fuzzy clustering of time series data using dynamic time warping distance. Engineering Applications of Artificial Intelligence 39, 235-244.
Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31, 651-666.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Computing Surveys (CSUR) 31, 264-323.
Jain, A.K., Topchy, A., Law, M.H., Buhmann, J.M., 2004. Landscape of clustering algorithms. Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004). IEEE, pp. 260-263.
Jia, J., Xiao, X., Liu, B., 2012. Similarity-based spectral clustering ensemble selection. 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD). IEEE, pp. 1071-1074.
Kandylas, V., Upham, S.P., Ungar, L.H., 2008. Finding cohesive clusters for analyzing knowledge communities. Knowledge and Information Systems 17, 335-354.
Limin, L., Xiaoping, F., 2012. A new selective clustering ensemble algorithm. 2012 IEEE Ninth International Conference on e-Business Engineering (ICEBE). IEEE, pp. 45-49.
Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J., 1998. UCI Repository of machine learning databases.
Strehl, A., Ghosh, J., 2002. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583-617.
Topchy, A., Jain, A.
K., Punch, W., 2003. Combining multiple weak clusterings. Third IEEE International Conference on Data Mining (ICDM 2003). IEEE, pp. 331-338.
Yousefnezhad, M., 2013. Cluster ensemble selection based on the wisdom of crowds. M.Sc. final thesis, Mazandaran University of Science and Technology, Babol, Iran.
Yousefnezhad, M., Alizadeh, H., Minaei-Bidgoli, B., 2013. New cluster ensemble selection method based on diversity and independent metrics. 5th Conference on Information and Knowledge Technology (IKT'13), pp. 22-24.
Zhong, S., Ghosh, J., 2005. Generative model-based document clustering: a comparative study. Knowledge and Information Systems 8, 374-384.
