An Artificial Immune System as a Recommender System for Web Sites

Artificial Immune Systems have been used successfully to build recommender systems for film databases. In this research, an attempt is made to extend this idea to web site recommendation. A collection of more than 1000 individuals web profiles (alter…

Authors: Tom Morrison, Uwe Aickelin

Proceedings of the 1st International Conference on Artificial Immune Systems (ICARIS-2002), pp 161-169, Canterbury, UK, 2002.

Tom Morrison, University of the West of England, Frenchay Campus, Bristol BS16 1QY, tom.morrison@uwe.ac.uk
Uwe Aickelin, School of Computer Science, University of Nottingham, NG8 1BB, UK, uxa@cs.nott.ac.uk

Abstract

Artificial Immune Systems have been used successfully to build recommender systems for film databases. In this research, an attempt is made to extend this idea to web site recommendation. A collection of more than 1000 individuals’ web profiles (alternatively called preferences / favourites / bookmarks file) will be used. URLs will be classified using the DMOZ (Directory Mozilla) database of the Open Directory Project as our ontology. This classification, rather than the actual addresses, will then be used as the data for the Artificial Immune System. The first attempt will involve using a simple classification code number coupled with the number of pages within that classification code. However, this implementation does not make use of the hierarchical tree-like structure of DMOZ. Consideration will then be given to the construction of a similarity measure for web profiles that makes use of this hierarchical information to build a better-informed Artificial Immune System.

1 INTRODUCTION

This research is concerned with using Artificial Immune Systems as a recommender of web sites for new database members. Thus, a new member of the database system would be able to export their bookmark / favourites file and receive a small number of recommendations of web site addresses (URLs or Uniform Resource Locators). Unlike a search engine that will only return specific items a user searches for, our recommender system should be capable of providing the user with surprising items of interest.
Artificial Immune Systems are adaptive search algorithms based on the biological immune system with the central task of pattern matching between antigens and antibodies. Thus, in our opinion, they are particularly well suited to data-mining tasks that involve sifting through large databases and finding matches to other items. This has been confirmed in recent research by Cayzer and Aickelin [5], who used Artificial Immune Systems to recommend films to new members of a database based on their rating of at least five films.

As in the research by Cayzer and Aickelin, the type of Artificial Immune System developed here will be based on Jerne’s idiotypic network ideas [13]. Hence, we will build an Artificial Immune System that will find a group of users in the database who are similar to the target user in their web site preferences. At the same time, the idiotypic effects will ensure that this group is as diverse as possible. Thus, we will have created an ideal base for predicting and recommending web sites. To do this successfully two steps are necessary: building a database that models individuals’ web profiles using a suitable ontology, and constructing a suitable measure of how similar two web profiles are.

The remainder of this paper is organised as follows: In the next section, a very brief overview of the immune system is given with particular emphasis on those features that we intend to exploit here. Section 3 will summarise the research into film prediction and explain differences and similarities to this piece of research. The following section describes the data and ontology used and gives further details about the task of web site recommendation. Section 5 presents a description of the intended Artificial Immune System with an emphasis on the discussion of a suitable similarity measure. The paper is concluded with a summary.
2 THE IMMUNE SYSTEM

The human body is protected against foreign invaders by a multi-layered immune system. This protection is composed of physical barriers such as the skin and respiratory system; physiological barriers such as destructive enzymes and stomach acids; and the immune system proper, which has two complementary parts, the innate and adaptive immune systems. The innate immune system is an unchanging mechanism that detects and destroys certain invading organisms, whilst the adaptive immune system responds to previously unmet foreign cells and builds a response to them that can remain in the body over time.

The immune system is composed of a number of different agents performing different functions at a number of different locations in the body. The precise interaction of these agents is still a topic for debate [10]. In order to present the important aspects of the system from a mathematical viewpoint it is necessary to simplify and present a selective description.

The immune system’s job is to detect antigens, which are foreign molecules from a bacterium or similar invader. The innate immune system helps in the detection process but the main response is through the adaptive immune system. Two of the most important cells in this process are the white blood cells called T cells and B cells. Both of these originate in the bone marrow, but T cells pass on to the thymus to develop before, as with B cells, they circulate the body in the blood and lymphatic vessels. B cells are responsible for the production and secretion of antibodies, which are specific proteins that bind to the antigen. Each B cell can only produce one particular antibody. The antigen is found on the surface of the invading organism and the binding of an antibody to the antigen is a signal to destroy the invading cell. A diagram from de Castro and Von Zuben [4] of this process is shown in Figure 1.
Figure 1: Some of the processes involved in the adaptive immune system.

Whilst there is more than one mechanism at work (see [8], [10] or [15] for more details), the essential process for the sake of this research is the matching of antigen and antibody leading to increased concentrations of more closely matched antibodies. In particular, two processes, known as the ‘clonal selection theory’ by Burnet [3] and the ‘idiotypic network theory’ by Jerne [13] and [14], are important to us.

The former can be explained as follows: When an antibody strongly matches an antigen the corresponding B cell is stimulated to produce clones of itself that then produce more antibodies. This selection of B cells for cloning on the basis of the antibody match is called the ‘clonal selection principle’ and will result in increasing concentrations of that antibody in the body. However, when the B cells clone themselves they do not do so exactly, but mutate slightly. Similarly, B cells may be stimulated when the antibody-antigen match is not perfect. By allowing mutation, the match could become better. However, a number of poorer matches will also be created, and furthermore, some of the newly produced antibodies could even be harmful to our own cells. Such cells will die out under what is known as the ‘negative selection principle’ [10].

The mutation mentioned above is quite rapid, often as much as “one mutation per cell division”, as de Castro and Von Zuben state in [4]. This allows a very quick response to the antigens. This rapid mutation, known as ‘somatic hypermutation’ [10], may be linked to the ‘fitness’ of the antibody. Hence, those B cells producing antibodies that are a good match would be subject to less mutation and vice versa for those that are not such a good match.
The idiotypic network theory, introduced by Jerne in [13] and [14], maintains that interactions in the immune system do not just occur between antibodies and antigens, but that antibodies may interact with each other. Hence, an antibody may be matched by other antibodies, which in turn may be matched by yet other antibodies. This activation can continue to spread through the population. However, this interaction can have positive or negative effects on a particular antibody-producing cell. The idiotypic network has been formalised by a number of theoretical immunologists in [15]. This theory could help explain how the memory of past infections is maintained. Furthermore, it could result in the suppression of similar antibodies, thus encouraging diversity in the antibody pool.

This last possibility was used in the research by Cayzer and Aickelin [5] in order to preserve diversity. The Artificial Immune System in their research produced a pool of users who were similar to the new entrant to the database, but dissimilar to each other. Whilst this method produced similar performance in predicting film ratings to a k-nearest neighbour approach, the diversity in the pool of recommenders was found to yield statistically significantly improved recommendations. Given the sparseness of the web site search space it may be that suppression of antibodies on similarity grounds is unnecessary. This will be investigated.

There are a number of successful Artificial Immune System implementations. However, even in the most complex artificial systems only a fraction of the functionality of the biological immune system is exploited. Typically, the antibody-antigen interaction coupled with somatic hypermutation forms the basis for many Artificial Immune System applications.
Examples are Timmis et al [18], who used an Artificial Immune System for clustering multivariate data, and Hajela and Yoo [11], who combined a genetic algorithm and an Artificial Immune System to optimise the design of a 10 bar truss. The research by Timmis et al also applied the idiotypic network theory and was successful in both classifying data and “generalising to cover a larger region of the input space”. However, the article does not comment on the effect of modelling a suppression factor between antibodies. Some of the most promising research to date has been conducted in the area of computer security, for instance by Hofmeyr and Forrest in computer network security [12] and by Kim and Bentley for fraud detection [15] and [16].

3 ARTIFICIAL IMMUNE SYSTEMS AS RECOMMENDERS

Whilst most of the applications described above involve somatic hypermutation, Cayzer and Aickelin [5] had only identical cloning, not mutation, in their algorithm. This was because the potential antibodies were actual users of the film database (EachMovie database provided by the Compaq Research Centre [6]). There the task was to find users that were similar to new entrants to the database. Somatic hypermutation was not used, since it is not immediately obvious how to mutate users sensibly such that these artificial entities still represent plausible profiles. For the same reasons, cloning in our intended Artificial Immune System will make exact copies, too. Future work might include making inexact copies to create novel profiles once appropriate rules for doing so have been established. This could be particularly beneficial when data gathering is expensive or data is otherwise sparse, perhaps due to its sensitive nature, leading to few users being willing to share their information with others.

The main loop of the recommender algorithm is shown in Figure 2 below and is the core of our Artificial Immune System.
The aim of this algorithm is to increase the concentrations of those antibodies (database users) that are similar to the antigen (target user). This process is subject to the suppression of similar antibodies following Jerne’s idiotypic ideas mentioned above. Thus, over time the Artificial Immune System contains high concentrations of a diverse set of users who have similar film preferences to the target user.

    Initialise AIS
    Encode user for whom to make predictions as antigen Ag
    WHILE (AIS not stabilised) & (More data available) DO
        Add next user as an antibody Ab
        Calculate matching score between Ab and Ag
        Calculate matching scores between Ab and antibodies
        WHILE (AIS at full size) & (AIS not stable) DO
            Iterate AIS
        OD
    OD

Figure 2: Main loop of the Artificial Immune System’s (AIS) algorithm for recommendation.

The diagrams in Figure 3 show the idiotypic effect. In the top diagram, antibodies Ab1 and Ab3 are very similar and they would have their concentrations reduced in the ‘Iterate AIS’ stage of the algorithm above. However, in the lower diagram, the four antibodies are well separated from each other as well as being close to the antigen and so would have their concentrations increased.

Figure 3: Illustration of the idiotypic effect.

At each iteration of the film recommendation Artificial Immune System the concentration of the antibodies changes according to the formula given below. This will increase the concentration of antibodies that are similar to the antigen and can allow either the stimulation, suppression, or both, of antibody-antibody interactions to have an effect on the antibody concentration. More detailed discussion of these effects on recommendation problems is contained within Cayzer and Aickelin’s paper [5].
The following is a formal equation for the idiotypic effect adapted from Equation 3 in Farmer [8]:

\[
\frac{dx_i}{dt} = c\left[\begin{pmatrix}\text{antibodies}\\ \text{recognised}\end{pmatrix} - \begin{pmatrix}\text{I am}\\ \text{recognised}\end{pmatrix} + \begin{pmatrix}\text{antigens}\\ \text{recognised}\end{pmatrix}\right] - \begin{pmatrix}\text{death}\\ \text{rate}\end{pmatrix}
\]
\[
\phantom{\frac{dx_i}{dt}} = c\left[k_0\sum_{j=1}^{N} m_{ji}\,x_i x_j \;-\; k_1\sum_{j=1}^{N} m_{ij}\,x_i x_j \;+\; k_2\sum_{j=1}^{N} m_{ji}\,x_i\,y\right] - k_3\,x_i
\]

Where:
N is the number of antibodies
x_i is the concentration of antibody i
m_i is the correlation between antibody i and the antigen
m_ij is the correlation between antibodies i and j
y is the concentration of the antigen
k_1 is suppression, k_2 stimulation and k_3 death rate
k_0 is set to zero in our system, i.e. we do not reward antibody-antibody recognition.

The algorithm is terminated when the Artificial Immune System is said to have stabilised, i.e. if it has not changed in consistency for more than ten iterations. The concentrations and correlations of the users in the final neighbourhood, i.e. the final immune system iteration, are then used to calculate a weighted sum of the ratings of web sites. This would be done either for a specific web site unseen by the target user, in order to predict its rating, or to produce a general top 10 list of new web sites that the target user might enjoy.

4 THE CHALLENGE OF WEB SITE RECOMMENDATION

There are a number of algorithms that recommend items to users. One of the best-known examples is Amazon.com’s [1] book recommender based on similar items bought. Generally, these recommenders use what is termed “collaborative filtering” or “social filtering” by Billsus and Pazzani [2]. With the exponential growth of available information on the internet, the need for automated techniques to winnow down the possibilities has also grown but “only a few different algorithms have been proposed in the literature thus far” [2].
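Looking back at the concentration equation of Section 3, a minimal Python sketch of one ‘Iterate AIS’ step and of the final weighted prediction may help to make the mechanics concrete. It is an illustration under stated assumptions, not the authors’ implementation: the function names, the explicit Euler step `dt`, all parameter values, the clamping at zero, the omission of the self-term j = i, the use of the antibody-antigen correlation m_i in the antigen term, and the choice of concentration × correlation as the prediction weight are all assumptions made for this example.

```python
def iterate_concentrations(x, m_ag, m_ab, y=1.0, c=1.0,
                           k1=0.1, k2=1.0, k3=0.05, dt=0.1):
    """One explicit Euler step of the concentration update with k0 = 0.

    x    : antibody concentrations x_i
    m_ag : antibody-antigen correlations m_i
    m_ab : antibody-antibody correlations m_ij (square list of lists)
    All default parameter values are arbitrary illustration values.
    """
    n = len(x)
    updated = []
    for i in range(n):
        # "I am recognised": suppression by the other antibodies
        # (the self-term j == i is skipped; this is an assumption).
        suppression = k1 * sum(m_ab[i][j] * x[i] * x[j]
                               for j in range(n) if j != i)
        # "Antigens recognised": stimulation by the single antigen.
        stimulation = k2 * m_ag[i] * x[i] * y
        dx_dt = c * (stimulation - suppression) - k3 * x[i]
        # Clamp at zero so concentrations stay non-negative (assumption).
        updated.append(max(0.0, x[i] + dt * dx_dt))
    return updated


def predict_rating(x, m_ag, ratings):
    """Weighted average of neighbours' ratings for one item.

    The weight of each neighbour is concentration * correlation, which is
    an assumed reading of "weighted sum"; neighbours with no rating for
    the item (None) are ignored.
    """
    pairs = [(xi * mi, r) for xi, mi, r in zip(x, m_ag, ratings)
             if r is not None]
    total = sum(w for w, _ in pairs)
    if total == 0:
        return None
    return sum(w * r for w, r in pairs) / total
```

With two near-identical antibodies and one weakly matching antibody, the first two are stimulated strongly by the antigen but suppress each other slightly, which is the idiotypic effect the equation is meant to capture.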
Many of the current collaborative filtering techniques use the Pearson correlation coefficient to compare the item ratings of different users. This suffers from several limitations. For example, due to the extremely large amount of information to be rated, two users may only have a very small number of items in common, causing the correlation measure to be unduly influenced by those items. Further, there is potentially no difference between the correlation between two users with three items in common and the measure for two users with 30 items in common, in terms of their “influence on the final prediction” [2]. The sparseness of the information space also implies that two users might have no items in common. Can we therefore conclude that they have completely dissimilar tastes, or does the fact that they have not rated particular items imply a similar view of the importance of those items?

For these reasons, alternative approaches to both current collaborative filtering algorithms and to the use of the Pearson correlation coefficient should be investigated. More information about traditional and enhanced collaborative filtering is provided by Gokhale [9]. The Artificial Immune System presented here is another example.

In our problem of web site recommendation, the original data consists of sets of web site addresses or URLs taken from bookmark collections, such as http://www.cs.ucl.ac.uk/staff/Kim/ComputerImmune. It is extremely unlikely that many people will have many exact addresses in common within their web profiles. Because of this, it is necessary to transform or translate the addresses into a different form. To do this a number of steps are necessary and a widely used web site classification tree ontology called DMOZ [7] will be used. Let us look at the issues involved in the classification of URLs systematically.
Typically, an individual web profile in raw form might consist of a list of bookmarks as shown in Figure 4 (in this case taken from the Opera browser; only a small section is shown).

    #URL
    NAME=ODP - Open Directory Project
    URL=http://dmoz.org/
    CREATED=1017158736
    VISITED=1023875733
    #URL
    NAME=Open Directory RDF Dump
    URL=http://dmoz.org/rdf.html
    CREATED=1017159133
    VISITED=1023875759

Figure 4: Part of a raw web profile taken from the Opera browser.

This data has to be pre-processed in order to remove unwanted information and superfluous characters. This also includes removing any categories the user might have assigned to some of the bookmarks. Unfortunately, such categorisation of information cannot be kept, as it is arbitrary and individual to the person that owns the bookmarks. For instance, www.bbc.co.uk could be classified under ‘media’ by one person and under ‘news’ by another. In addition, misclassifications and duplications might be present in the raw data. Hence, this filtering typically yields a file such as the one partially shown in Figure 5.

    www.bbc.co.uk/weather/
    www.bbc.co.uk/
    www.bbc.co.uk/sport/english/football/default.stm
    www.guardian.co.uk/
    football.guardian.co.uk/

Figure 5: Part processed data with superfluous information deleted.

As can be seen from the third line in Figure 5, some of the URLs will have long addresses. Another web profile might contain a very similar address such as www.bbc.co.uk/sport/english/football/en/default.stm. If we were to use the raw addresses within the Artificial Immune System, these two would be considered different. However, it is clear that the two users have bookmarked different pages within the same part of the same site, i.e. ‘BBC online - football’, and thus have very similar interests. Therefore, it is still necessary to process the data before it can be used. This presents considerable problems.
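The pre-processing step from the raw form of Figure 4 to the cleaned form of Figure 5 could be sketched roughly as follows. This is only an illustration: the function name `extract_urls` is hypothetical, and the assumptions that every bookmark address sits on its own `URL=` line, that the protocol prefix is dropped, and that exact duplicates are removed are inferred from the two figures rather than stated in the text.

```python
def extract_urls(raw_profile: str) -> list[str]:
    """Reduce a raw bookmark file to a de-duplicated list of addresses.

    Only lines of the form 'URL=...' are kept; names, timestamps and any
    user-assigned folders are discarded. The protocol prefix is removed
    so the output resembles Figure 5 (an assumption for this sketch).
    """
    seen = set()
    urls = []
    for line in raw_profile.splitlines():
        line = line.strip()
        if not line.startswith("URL="):
            continue  # skips '#URL', 'NAME=', 'CREATED=', 'VISITED=' lines
        url = line[len("URL="):].strip()
        if "://" in url:
            url = url.split("://", 1)[1]
        if url and url not in seen:  # drop exact duplicates, keep order
            seen.add(url)
            urls.append(url)
    return urls


RAW_SAMPLE = """\
#URL
NAME=ODP - Open Directory Project
URL=http://dmoz.org/
CREATED=1017158736
VISITED=1023875733
#URL
NAME=Open Directory RDF Dump
URL=http://dmoz.org/rdf.html
CREATED=1017159133
VISITED=1023875759
"""

print(extract_urls(RAW_SAMPLE))
```

Note that this step alone does nothing about near-duplicate addresses within the same part of a site; that is the problem taken up next.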
A program will need to be devised which will truncate the URLs in such a way that the two addresses discussed above would be considered the same. However, looking again at Figure 5, a simple truncation of the addresses would lead to the first three items occupying the same category. At the same time, it might not lead to the last two being picked together despite the fact that both addresses refer to pages from the same site. Furthermore, it might not put items 3 and 5 together despite the fact that they are both concerned with football.

To overcome these difficulties, two strategies are used within the DMOZ ontology: normalisation and reverse partial look-up. First, all URLs undergo a kind of normalisation when pre-formatting the data, as well as when doing look-ups. The protocol and host part are mapped to lowercase characters and host-only URLs are always terminated with a “/”. During the actual look-up, the category information is gained from DMOZ by employing a reverse truncation search. That is, at first, we try to match the full URL, and then we try to match up to the last “/”, then to the last but one “/”, etc. For instance, we would first try to match item three from above by looking for the full URL in DMOZ. If we cannot find that, we would look for www.bbc.co.uk/sport/english/football/; if this fails, we would search for www.bbc.co.uk/sport/english/, etc. Alternatively, we could try to find the closest match in DMOZ defined by the number of consecutive characters that are identical, counted from the beginning of the URL. Together, this normalisation and intelligent matching should overcome the first problem mentioned above.

To overcome problems of misclassification and to have a common standard we decided to use the DMOZ open directory ontology as a classification system [7]. Figure 6 shows part of the structure of this directory.
    Arts […]
    468769 Coolplay's Cool for Kids Includes animated nursery rhymes, crafts, alphabet and spelling games, and colouring book.

Figure 6: Part of the DMOZ open directory structure.

The first half of Figure 6 shows part of the ‘Arts’ category, which is located immediately below the root of the tree (called Top). Each category has a unique identifier number (2 in this case). This category has a number of subcategories that in turn have several subcategories of their own. In total, there are some 5 million URLs in 428,590 categories spread over 16 levels in the directory. Categories can also be referred to using an address showing the parent categories in a way that preserves the tree structure information. For example, a category address might read ‘1.3.9’, meaning that it is the ninth subcategory of category 3, which is the third subcategory of category 1.

The second half of Figure 6 shows how URLs are represented in DMOZ and gives an example of a more detailed description of one URL as provided by an anonymous referee. The complete DMOZ database is roughly one GB in size and updated regularly. All specifications in this paper refer to DMOZ as of 1 June 2002. Overall, the version of DMOZ that we use has the following tree structure, with the deepest branch being 16 levels below the top:

    Level below Top    Number of categories
    0 (Top)                  1
    1                       18
    2                      621
    3                     6675
    4                    30754
    5                    61042
    6                    68901
    7                   101567
    8                    82802
    9                    51454
    10                   20592
    11                    3467
    12                     616
    13                      69
    14                       8
    15                       2
    16                       1

Figure 7: Full DMOZ structural tree.

The final stage of processing the data is to turn each of the URLs, as shown in Figure 5, into a file containing either the category identification numbers or the category addresses, coupled with the number of items in each category. The choice about which version to use will be discussed in the next section.
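The normalisation, the reverse truncation look-up and the final counting step described above can be sketched together in a few lines of Python. This is a hedged illustration rather than the procedure actually used: the function names are hypothetical, and `TOY_DIRECTORY` is an invented three-entry stand-in for DMOZ whose category addresses are made up for the example.

```python
from collections import Counter

# Invented stand-in for DMOZ: URL -> category address (not real DMOZ data).
TOY_DIRECTORY = {
    "www.bbc.co.uk/": "1.12.3",
    "www.bbc.co.uk/sport/english/football/": "1.15.4.2",
    "football.guardian.co.uk/": "1.15.4.7",
}


def normalise(url):
    """Lowercase protocol and host; terminate host-only URLs with '/'."""
    url = url.strip()
    scheme, sep, rest = url.partition("://")
    if not sep:  # no protocol present
        rest = url
    host, _, path = rest.partition("/")
    prefix = scheme.lower() + "://" if sep else ""
    return prefix + host.lower() + "/" + path


def truncation_candidates(url):
    """Yield the full (normalised) URL, then cut back to each earlier '/'."""
    scheme_end = url.find("://")
    host_end = url.find("/", scheme_end + 3 if scheme_end != -1 else 0)
    yield url
    current = url
    while True:
        trimmed = current.rstrip("/")
        cut = trimmed.rfind("/")
        if cut < host_end:  # nothing left but the host
            break
        current = trimmed[:cut + 1]
        yield current


def look_up(url, directory):
    """Return the category of the longest matching truncation, or None."""
    for candidate in truncation_candidates(normalise(url)):
        if candidate in directory:
            return directory[candidate]
    return None


def build_profile(urls, directory):
    """Count bookmarks per category; unclassifiable URLs are skipped."""
    counts = Counter()
    for url in urls:
        category = look_up(url, directory)
        if category is not None:
            counts[category] += 1
    return counts
```

With this toy directory, the two BBC football addresses discussed earlier resolve to the same category, while the weather page falls back to the category of the site root. Skipping URLs that cannot be classified is a choice made for the sketch; the text does not say how such URLs are treated.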
There are a number of possible pitfalls with this process. For example, many profiles will contain a set of URLs that were created by the browser program the user employs. Few users are likely to delete all of these links, reasoning that they may be useful at some stage. This may create a situation of artificial similarity between users, which would prevent the Artificial Immune System from functioning effectively.

Secondly, the process of placing URLs into categories is likely to involve some truncation if at first there is no clear category involved. This could lead to several subtly different addresses being classified into the same category due to the truncation look-up. Depending on whether the truncated sites are from genuinely different URLs or not, this could be good or bad. In the first case, the category may appear to be more popular than it should be, whereas in the second case the number in the category is a clear indication of interest in that category. Until the data is fully assembled and individual examples are checked, it will not be possible to judge how critical some of these problems will be.

5 BUILDING THE ARTIFICIAL IMMUNE SYSTEM RECOMMENDER

In the film recommender research described in Cayzer and Aickelin [5], each user was coded as a user identification number followed by pairs of film identification numbers with the corresponding rating of the film. The target user became the antigen, whilst the current database members were potential antibodies. In each iteration, antibodies were added to the Artificial Immune System. Those judged to be more similar to the antigen in their film ratings had their concentration increased. A unique feature of that particular approach was the application of the idiotypic network theory by Jerne [13]. This was implemented such that antibodies that were very similar to each other had their concentration reduced.
This has the effect of creating a set of users who are similar to the new user but quite different to each other, thus enhancing the recommendation accuracy of the system. We intend to use the same mechanism for our web site recommender to build an Artificial Immune System as described in section 3. In order to do this, we also have to decide on the encoding of a user’s web profile, for which there are two possibilities. In both cases, a user is encoded as a list of category IDs and the number of bookmarks within each category. The difference is in the category IDs; they can be either an integer or a reference to the tree structure. To illustrate the difference, Figure 8 shows the same user’s bookmarks for both encodings. The figure after each colon indicates how many bookmarks fall into a particular category:

    Encoding with the tree structure:
    1.13.12.1.5: 5 ; 1.13.12.1.6: 3 ; 1.16.3.2.11.5: 1 ; 1.18.1.2: 1 ;

    Encoding with integer category IDs:
    22343: 5 ; 495771: 3 ; 334921: 1 ; 3409: 1 ;

Figure 8: Integer versus Tree Encoding.

If the second encoding is used together with the number of sites within each category as a rating of the popularity of that category, then the problem becomes similar to the film recommendation problem. However, here we have a considerably sparser search space. In the film database, there were approximately 20,000 entries whereas in the DMOZ directory there are over 400,000 categories. This sparseness may prevent the system from working since many users might have nothing in common or, at best, some categories that are common to the vast majority of the data. Furthermore, many users will have only one entry in a number of categories, leading to increased similarity since the ‘rating’ of that category will be the same. These problems may prevent an Artificial Immune System based on this encoding from being successful in identifying a group of similar users.
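As a small illustration of the two encodings in Figure 8, the sketch below reads either form into the same dictionary structure and shows what the tree addresses preserve that the integer IDs do not. The function names are hypothetical, and the text format (entries separated by ‘;’, counts after ‘:’) is simply the one printed in the figure.

```python
def parse_profile(text):
    """Parse 'category: count ; category: count ; ...' into a dict."""
    profile = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate the trailing ';'
        category, count = entry.rsplit(":", 1)
        profile[category.strip()] = int(count)
    return profile


def shared_prefix_depth(address_a, address_b):
    """Number of leading levels two tree-encoded addresses have in common."""
    depth = 0
    for part_a, part_b in zip(address_a.split("."), address_b.split(".")):
        if part_a != part_b:
            break
        depth += 1
    return depth


tree = parse_profile("1.13.12.1.5: 5 ; 1.13.12.1.6: 3 ; 1.16.3.2.11.5: 1 ; 1.18.1.2: 1 ;")
ints = parse_profile("22343: 5 ; 495771: 3 ; 334921: 1 ; 3409: 1 ;")

# The first two tree addresses are siblings (four shared levels), which is
# not visible from the corresponding integer IDs 22343 and 495771.
print(shared_prefix_depth("1.13.12.1.5", "1.13.12.1.6"))
```

Under the integer encoding two profiles can only be compared by exact key overlap, which is why sparseness matters so much there.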
There is another problem with using integer category IDs. Because DMOZ is an evolving classification system, categories are added and removed regularly. This can have the effect that two very similar categories end up with very different integer IDs, as these are handed out consecutively. For instance, Star Wars part four might have ID 20,004 because it was classified years ago, but Star Wars part two might end up with ID 420,012 because it has only recently entered the DMOZ system. A similar effect can be seen in Figure 8 for the first two bookmarks. Figure 8 also shows how the tree structure IDs might prevent some of these problems, as similar categories still end up near each other in the tree.

The alternative to the integer encoding is to use an encoding that includes the tree structure in the form of a category address. What is required then is a similarity measure that carefully recognises categories that are ‘close’ within the structure of the tree. For example, it would need to judge the parent / child or the sibling relationship as being more similar than a first cousin or grandparent type relationship. However, constructing such a measure is far from simple. Consider the two trees in Figure 9.

[Tree diagrams for User 1 and User 2, each drawn over the same nodes C, D, E, F, G, H, I, J, K and L]

Figure 9: Simple tree structure showing two web profiles.

User 1 has entries at categories G, E, J and L, whilst user 2 has entries at D, I, J and F. Clearly, matches should be scored more highly the lower down the tree they are because this indicates a more precise match. Additionally, ‘close’ relationships within the tree structure should count more towards the match than ones separated by several ‘generations’ (to continue the family tree metaphor). Whilst it is easy to see that these users should have their similarity measure increased, since both have an entry in category J, a question remains as to what to do with J afterwards.
Should this match be discarded once it has been counted by the measure, or should the entries at I and J for user 2 be counted as two entries at the parent branch (E) for comparison with user 1? The danger with discarding matches once counted is that two users might have ‘perfect’ matches for all of the 10 categories that the first user has in their profile, whilst the second user has another 100 entries. However, if one does not discard categories that have already been matched with another category then it is possible that one quite high-level category might be ‘matched’ with all the different entries at subcategories for another user. This might not matter since the ‘strength’ of the match would have been reduced by the generational distance and the weakness of the high-level category’s contribution.

6 SIMILARITY MEASURES

Let us now construct a suitable similarity measure for the Artificial Immune System that will produce a value on a 0–1 scale, with answers closer to 1 indicating a closer match. Following the discussion in the previous section, the measure will be built according to the following five principles.

1. Matching at categories lower down the tree structure should contribute more to the measure than matching higher up.
2. Matches at the top level of the tree (i.e. the ‘Top’ category in the DMOZ database) should have a contribution of zero.
3. Matching contribution should be reduced for ‘imperfect matches’, i.e. those not in exactly the same category. The reduction in contribution should be proportional to the generational distance (i.e. a grandparent / child relationship has a generational distance of two).
4. The matching metric should be scaled (averaged) so that it ranges from 0 to 1.
5. The matching metric should take into account all possible matches between the entries in each web profile, i.e.
if there are 10 entries in one and 20 in the other then all 10 × 20 = 200 potential matches should contribute to the measure.

Suppose that we wish to calculate the matching coefficient for the category addresses 1.3.1.1 and 1.3 in the sample tree diagram in Figure 10 below. We need to define an ‘edge distance’ as the number of ‘steps’ apart any two addresses are. For example, 1.1 and 1.1.2.2.1 have an edge distance of three, as do 1.2.2.2 and 1.2.1. This equates the relationship between grandparent and grandchild as the same strength as that between siblings.

Figure 10: Sample Tree diagram.

By staged truncation of the longer category address (CA) until they are the same we obtain a match at CA 1.3 with two numbers (edge distances) discarded (but counted). This match would have a strength determined by the category level (level 2) of the matching CA, and by the edge distance (ED).

How should the edge distance affect the value of the overall match? One possibility would be to use 1 / ED as this would be a smaller value as the ED increases. However, this would not work when the CAs match perfectly as we would be dividing by zero. Therefore using 1 / (ED + 1) is better.

How should the depth of the matching level affect the value of the overall match? It seems useful to make the level number the same as the number of integers in the CA. In the example above, there are six levels. However, the tree is not of uniform depth. In principle, matches at lower levels should score higher since they show a more precise agreement in the topic matter. However, does this mean that a perfect match at the bottom of one set of branches (e.g. 1.1.2.2.2) should score less highly than a perfect match at the bottom of another lower set, say 1.3.2.2.1.1? The DMOZ database is a human classification of human knowledge.
To some extent, the classifications are arbitrary because they are the result of pragmatic as well as epistemological considerations. Therefore, it seems incorrect to allow only a perfect match score when it occurs at the lowest level. In the example above it might be advisable to allow perfect matches to contribute fully at levels 4, 5 and 6. Remembering that a match at the top level should count as zero, a formula to give the level effect factor would be (L - 1) / (4 - 1), i.e. level 4 would have a value of 1, level 3 a value of 2/3 and level 2 a value of 1/3, whilst the top level would have a value of zero. However, this would not work for values of L greater than 4. To solve this we could use a value of 1 in those cases. Thus, the general matching formula becomes min{1, (L - 1) / (ML - 1)}, where ML stands for the level at which the maximum contribution starts. In the case of DMOZ, a reasonable choice for the cut-off point might be level 8, based on the structure in Figure 7.

[Figure 10 nodes, listed by level:
Level 1: 1
Level 2: 1.1, 1.2, 1.3
Level 3: 1.1.1, 1.1.2, 1.2.1, 1.2.2, 1.3.1, 1.3.2
Level 4: 1.1.1.1, 1.1.1.2, 1.1.1.3, 1.1.2.1, 1.1.2.2, 1.1.2.3, 1.2.1.1, 1.2.2.1, 1.2.2.2, 1.3.1.1, 1.3.1.2, 1.3.2.1, 1.3.2.2, 1.3.2.3
Level 5: 1.1.2.2.1, 1.1.2.2.2, 1.2.2.2.1, 1.2.2.2.2, 1.3.2.2.1
Level 6: 1.2.2.2.2.1, 1.2.2.2.2.2, 1.3.2.2.1.1]

A disadvantage of the measure just described is the inherent simplification of using a cut-off point after which all matches are equally ‘perfect’. The smaller the cut-off value, the more inaccurate the result will become. However, if it is set too large, then some branches of the tree might be too shallow to ever achieve a perfect match. It is furthermore questionable whether a linear measure is appropriate. Hence, we propose the following alternative.
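As a reference point for that alternative, the edge distance and the linear, capped level factor discussed above might be sketched as follows. This is a minimal illustration only: the function names are hypothetical, category addresses are assumed to be dotted strings such as "1.3.1.1" that all share the root "1", and the cut-off level ML is passed in as `max_level` (assumed to be greater than 1).

```python
def edge_distance_and_level(ca1, ca2):
    """Return (edge distance, matching level) for two dotted category addresses.

    Assumption: both addresses start at the same root, e.g. "1".
    The matching level is the length of the longest common prefix; the edge
    distance is the number of address components discarded from both
    addresses to reach that prefix.
    """
    a = ca1.split(".")
    b = ca2.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    edge_distance = (len(a) - common) + (len(b) - common)
    return edge_distance, common


def linear_level_factor(level, max_level):
    """Linear level effect min{1, (L - 1) / (ML - 1)}; the top level scores 0."""
    return min(1.0, (level - 1) / (max_level - 1))


# The worked example from the text: 1.3.1.1 against 1.3 matches at level 2
# with an edge distance of 2.
ed, level = edge_distance_and_level("1.3.1.1", "1.3")
contribution = linear_level_factor(level, max_level=4) / (ed + 1)
```

With the cut-off at level 4, the example pair contributes (1/3) × 1/(2 + 1), i.e. a fairly weak match, which is in line with the first three principles.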
The matching score increases monotonically from level 1 to level 16 (in DMOZ’s case), but gets close to 1 relatively quickly, say at level 8, and then approaches 1 asymptotically, as shown in Figure 11.

Figure 11: Shape of proposed matching function.

The following equation describes such a function. Let

webprofile1 contain the category addresses $ca_i$ ($i = 1 \ldots n$),
webprofile2 contain the category addresses $ca_j$ ($j = 1 \ldots m$),
$ed_{i,j}$ be the edge distance from $ca_i$ to $ca_j$,
$l_{i,j}$ be the matching level for $ca_i$ and $ca_j$.

Proposed matching function:

$$\frac{-l_{i,j}^{2} + 33\,l_{i,j} - 32}{240}$$

This function takes the value 0 at level 1 and reaches 1 at level 16. This measure still agrees with the principle that matches at lower levels should score higher, but does not unduly penalise the branches that do not go down to the full 16 levels. Assuming we sum the contributions of all the potential matches, the total would have to be divided by the total number of matches to transform the metric to a 0–1 scale. Hence, the similarity measure s becomes:

$$s = \frac{1}{n \times m} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{1}{ed_{i,j} + 1} \times \frac{-l_{i,j}^{2} + 33\,l_{i,j} - 32}{240} \right)$$

One further factor should be considered when calculating the match between two web profiles. It is the validity of the match if the web profiles have very different numbers of URLs within them (which we will call the disparity correction factor). If one web profile has only 10 items whilst the other has 100, then a match from these two people would seem to be less valid than one based on web profiles containing 50 and 60 items. This is because in the first case the 10 entries from the first profile have been used proportionately more in calculating the match. Assuming that web profile 1 (n entries) is smaller than web profile 2 (m entries), finding the fraction n / m would give a higher result to those pairs of profiles which have similar numbers of entries (see column 3 in Figure 12).
However, it would also give a perfect score to two profiles with a very small number of URLs, say 2 URLs each. Clearly, the measure should ‘reward’ web profiles that have a larger number of entries. One way to do this would be to include the sum of the number of entries. However, some profiles contain a very large number of entries. Analysis of the data shows that users with more than 100 bookmarks are likely to be outliers. Hence, in order to produce a measure in a range from 0 to 1, profiles with more than 100 entries are counted as though they have 100 entries. Column 4 in Figure 12 shows the calculation of such a measure under the assumptions above. The fifth column in Figure 12 contains the proposed disparity factor. However, if the raw values in column 5 were used, the correction effect would probably be stronger than the original matching score. Therefore, a scaling parameter a is introduced to reduce the range of the disparity factor. This parameter determines the lowest value in the range (a, 1) which the disparity factor can take.

  n     m    n/m   (n+m)/200   n/m*(n+m)/200   a+(1-a)*n/m*(n+m)/200
100   100   1.00     1.00          1.00                1.00
 80   100   0.80     0.90          0.72                0.89
 60   100   0.60     0.80          0.48                0.79
 40   100   0.40     0.70          0.28                0.71
 20   100   0.20     0.60          0.12                0.65
 80    80   1.00     0.80          0.80                0.92
 60    80   0.75     0.70          0.53                0.81
 40    80   0.50     0.60          0.30                0.72
 20    80   0.25     0.50          0.13                0.65
 60    60   1.00     0.60          0.60                0.84
 40    60   0.67     0.50          0.33                0.73
 20    60   0.33     0.40          0.13                0.65
 40    40   1.00     0.40          0.40                0.76
 20    40   0.50     0.30          0.15                0.66
 20    20   1.00     0.20          0.20                0.68
 10    20   0.50     0.15          0.08                0.63
 10    10   1.00     0.10          0.10                0.64
  5   100   0.05     0.53          0.03                0.61
  1   100   0.01     0.51          0.01                0.60

Figure 12: Disparity correction using a disparity scaling factor of a = 0.6.
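The pieces introduced in this section can be combined in a short sketch. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, category addresses are assumed to be dotted strings sharing the root "1", the level function is assumed to read (-l² + 33l - 32) / 240 (which is 0 at level 1 and 1 at level 16), both profiles are assumed to be non-empty, and votes are ignored. The disparity factor follows the last column of Figure 12, with profile sizes capped at 100.

```python
def edge_distance_and_level(ca1, ca2):
    """Return (edge distance, matching level) for two dotted category addresses."""
    a, b = ca1.split("."), ca2.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common), common


def level_score(level):
    """Assumed quadratic level function: 0 at level 1, rising to 1 at level 16."""
    return (-level ** 2 + 33 * level - 32) / 240


def similarity(profile1, profile2):
    """Average of level_score / (ED + 1) over all n x m pairs of addresses."""
    total = 0.0
    for ca_i in profile1:
        for ca_j in profile2:
            ed, level = edge_distance_and_level(ca_i, ca_j)
            total += level_score(level) / (ed + 1)
    return total / (len(profile1) * len(profile2))


def disparity_factor(n, m, a=0.6, cap=100):
    """a + (1 - a) * n/m * (n + m) / 200, with n the smaller (capped) size."""
    n, m = sorted((min(n, cap), min(m, cap)))
    return a + (1 - a) * (n / m) * (n + m) / (2 * cap)


# Two small, made-up profiles taken from the sample tree in Figure 10.
profile_a = ["1.3.1.1", "1.1.2.2.1"]
profile_b = ["1.3", "1.1.2.2.2", "1.2.1"]
score = similarity(profile_a, profile_b) * disparity_factor(len(profile_a), len(profile_b))
```

Because every pair contributes, including pairs that only meet at the root (level 1, score 0), the averaged value stays between 0 and 1, and the disparity factor then scales it by a number between a and 1.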
Using the same notation as before, with a being the scaling parameter for the disparity correction factor, the final similarity measure becomes:

$$s = \frac{\displaystyle\sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{1}{ed_{i,j} + 1} \times \frac{-l_{i,j}^{2} + 33\,l_{i,j} - 32}{240} \times \left( vote_i + vote_j \right) \right)}{\displaystyle\sum_{i=1}^{n} \sum_{j=1}^{m} \left( vote_i + vote_j \right)} \times \left( a + (1 - a)\,\frac{n(n + m)}{200\,m} \right)$$

7 CONCLUSIONS

There are a number of steps in the process of preparing the database for use in the Artificial Immune System. These may have an effect on the performance of the system. It will not be possible to tell how critical these issues are until the project is near completion. Having constructed the web profile database, the choice of encoding must be made. Again, this could have a critical effect on the success of the Artificial Immune System. It is clear that the construction of a similarity measure that will allow the use of the tree structure is not a trivial task. It may be that this is not necessary, and exploration of the potential of the first encoding will be undertaken first, since there is already a successful precedent in this case. However, the sparseness of the data set may prevent this, and the creation of a tree comparison similarity measure is an interesting challenge.

To conclude, we believe that with the correct matching metric an idiotypic network based Artificial Immune System should be well suited to supplying interesting yet surprising URLs based on a user’s bookmarks. Preliminary results show that with the aid of DMOZ we can map between 60% and 80% of users’ bookmarks to votes for suitable categories. We feel confident that this gives us a strong basis for an Artificial Immune System recommender, and subsequent results will be published in due course.

Acknowledgements

The authors would like to thank the many volunteers donating their bookmarks and David Banks for his help with the DMOZ system.
