Relating Web pages to enable information-gathering tasks
We argue that relationships between Web pages are functions of the user's intent. We identify a class of Web tasks - information-gathering - that can be facilitated by a search engine that provides links to pages which are related to the page the use…
Authors: Amitabha Bagchi, Garima Lahoti
Relating W eb pages to enable infor mation-gathering task s Amitabha Bagc hi Dept of Computer Scie nce & Engg Indian Institute of T ec hnology New Delhi 110016 Garima Lahoti Cazo o dle Inc. 60 Hazelw oo d Driv e, Suite 122, Champaign, IL 61820- 7460 Octob er 30, 2018 Abstract W e argue that relationships b etw een W eb pages are functions of the user’s in tent . W e identify a class of W eb tasks - inf ormatio n-gathering - that can be fa c ilita ted b y a sea rch engine that provides links to pages which are related to the page the user is curre nt ly vie w ing . W e define three kinds of int entional r elationships that corresp o nd to whether the user is a) seeking so ur ces of information, b) reading pages whic h provide infor mation, o r c) surfing through pages as par t of an extended information-ga thering pr o cess. W e show that these three r elationships can b e pr o ductively mined using a combination of textual and link information and provide three scoring mec hanisms that corres p o nd to them: Se ekR el , F actRel and SurfR el . These scoring mec ha nisms incorp ora te b oth textual and link information. W e build a set o f capacitated subnetw orks - each co rresp onding to a particular keyword - that mirr or the interconnection structure o f the W o rld Wide W eb. The scores are computed by computing flows on these subnetw orks. The capacities o f the links ar e deriv ed from the hub a nd authority v alues of the no des they connect, following the work of K le inber g (1998) on ass igning authority to pages in hyperlinked environmen ts. W e ev a luated our scoring mechanism by running expe r iments on four data s ets taken from the W eb. W e pr esent user ev aluations of the relev ance of the top r e sults returned by our scoring mechanisms and compar e those to the top results returned by Go ogle’s Similar Pages feature, and the Comp anion a lgorithm pr op osed by Dean and Henzinger (1 9 99). 1 In tro duction The tremend ou s success of collab orativ e filtering-based r ecommendation systems (see e.g. [20]) in online retail settings (e.g. Netflix) has demonstrated that users w elcome guidance while lo oking for b o oks to buy or fi lm s to rent i.e. where they are not lo oking for a pro du ct which satisfies a general sp ecificatio n rather than a sp ecific pro duct. I n the enterprise searc h space, the increa sing imp ortance of faceted searc h - essentially a metho d of providing r ecommendations to satisfy a us er’s searc h needs by creating m ultiple taxonomies - pioneered by Endeca [12], under the name “guided na vigation”, sho ws that businesses are recognizing that they can impr o ve profitabilit y by effectiv ely helping their emp loy ees and customers br o w s e through large databases by pro vidin g searc h results r elated to th e ones users express a p reference for. But searc h engines for the W orld Wide W eb hav e b een largely un s uccessful in p ro v id ing accurate and helpf ul recommendations to their users. Researc h h as found that most W eb u s ers are not u sing adv anced features p r o vid ed by searc h engines; it has b een shown that they barely u nderstand what these features do [29]. F urther, it has b een seen that W eb u sers searc h less and br o w se more [2 ]. An d, in fact, th e pro cess of dev eloping exp ertise in using the W eb coincides with an increase in bro wsing and decrease in searc hing [8]. 1 Despite this bleak scenario it is our conten tion th at search engines ha v e the resources to effectiv ely pro vide users with related pages that can help in information-gathering tasks. F urth er we b eliev e that a searc h engine whic h can do this will increase its v alue for its users , w ith consequen t increases in rev en ue. There is a gro win g u nderstand in g in the searc h domain that u ser int en t is crucial to th e searc h pro cess [27, 15]. Extendin g this understandin g to th e domain of relationships b et ween pages can mak e information-gathering tasks easier. In other words, to know the relationship b et ween t w o pages, we must first kno w what fu nction these t wo pages serve f or the u s ers w ho visit them. It is only by characte rizing the task the user is engaged in that we can offer relate d pages and hop e that these will actually facilitate the task. In this p ap er w e p r o vid e a suite of s coring mec hanism s that relate W eb pages: Se ekR el , F actR el and SurfR el . The purp ose of these scoring m echanisms is to help identify pages which may b e related to the current page dep end ing on whether the user is reading p ages wh ic h pr o vide information, s eeking lin k s to sources of information, or su r fing thr ough p ages as part of an extended inform ation-gat hering pro cess. A brief o verview of our metho d is as follo ws: W e compute our scores u s ing flo w calculatio ns in a set of su bnet wo rks of the W eb. Eac h subnetw ork corresp onds to a single ke ywo rd. The set of subnetw orks used to score a pair of pages is d ecided by fin d ing the k eywords relev ant to the pair of pages b eing scored. Then the ed ges of these n etw orks are capacitat ed u sing the h ub v alues [18] of their originating pages. Finally flow is sen t along these edges tow ards sp ecial no des we call witnesses . Th e amount of flo w that can b e routed is us ed as a m easure of the relationship. Organization. In Section 2 we fo cus on inform ation-gathering tasks and iden tify ho w our recommen- dations can facilita te them. W e survey previous w ork done in relating W eb pages in Section 3. Our sp ecific scoring mec hanisms are d escrib ed in Sections 4.1 and 4.2. A comparison of our results for a small to y net wo rk with the results pr o duced by t wo recen t prop osals PageSim [21] and SimR ank [16] is presen ted in Sectio n 4.3. In S ection 5 we presen t exp erimen ts conducted on real data take n fr om the W eb. W e discu s s the results of user surveys that compared the top results pro d uced by our s cor- ing mechanisms with the results given by the Comp anion algorithm of Dean and Henzinger [10] and Go ogle’s Similar P ages feature. Finally in Section 6, we conclude by discuss ing the m erits of ou r scheme in comparison to these t wo prop osals and by arguing that our sc heme is a b etter candidate for inclusion in a searc h engine than the Comp anion algorithm. 2 Information-gathering on the W eb The tasks th at W eb users undertak e were classified by Br o der [6] to generally fall into three categories: na v igational (find ing sp ecific p ages), inform ational (seeking facts) and transactional (p erforming some in teractiv e s et of tasks.) Kellar [17] fur ther classifies in formational tasks into information-seeking, information exc hange and information maint enance tasks. It is in the first of these classes, information seeking, th at searc h engines mak e their ma jor contributions. Kellar differentiat es b et ween three t yp es of information seeking tasks: fact-finding e.g. d irections to a friend’s h ouse or exam d ates, information- gathering e.g. tectonic mo v emen ts, Mac laptops, br owsing news, f riend’s h omepage. See Figure 1 for a sc h ematic represen tation of K ellar’s classification (Source [17].) Our fo cu s is on information-gathering which we distinguish fr om fact finding by using Kellar’s definition: Information-Gather ing c onsists of tasks in which a user is c ol le cting information, often fr om multiple sour c es, in or der to write a r ep ort, make a de cisi on, or b e c ome mor e informe d ab out a p articular top ic [17, pp 67]. With this definition in h and let u s try to c h aracterize this class of tasks. 2 Information Seeking Fact Finding Information Exchange Information Maintenence Browsing Transactions Communications Information Gathering Maintenence Web Information Goals Tasks Figure 1: Kellar’s classification of W eb inf ormation. Sou r ce: Kellar, 2007. 2.1 The iterative nature of information-gathering tasks In their s eminal w ork Belkin, Od dy and Bro oks [3, 4 ] p oin ted out that assumin g that a user w ith a need for information will b e able to sp ecify that n eed exactly is a mistak e. In stead, they p ointed out that a user comes to an inform ation r etriev al system with an anomal ous state of k now le dge , whic h she then attempts to express as a query . Based on the in formation receiv ed the anomaly is rectified to some exten t, the u ser’s image of the world is altered somewhat. But this is not the en d of the pro cess. The altered state of the us er ’s kno wledge generates n ew anomali es whic h she th en tak es bac k to the information retriev al engine in the form of requests or qu eries. This iterativ e pro cess con tin ues till the user is satisfied with the exten t of the c hange in his state of kno wledge. Belkin et. al.’s work h as b een refined in sev eral directions (w e refer th e r eader to Marc hioni’s b o ok for a surve y [24]) bu t the general view r emains a p o werful organizing principle f or in formation retriev al. This general view is b orne out in the W orld Wide W eb setting by Rose’s conclusion [27] - based on earlier studies he and Levinson conducted [28] - that information searc h w as an iterativ e pro cess. O ther supp orting evidence in this regard is Aula et. al.’s fin ding that exp erienced W eb user’s sho w ed a pattern of searc hing, then browsing, the searc hing aga in [2], which was s im ilar to Cothey’s finding in a longitudinal study that follo we d su b jects ev olving fr om n ovice to exp ert [8]. With all of this as backg round we c haracterize the information gathering task on the W orld Wide W eb as an iterativ e one (see Figure 2). T h is task is initiated by an un derstanding that the user’s kno wledge needs to b e augmente d, pro ceeds by lo oking for W eb pages (either fr om s earc h engines or other sources) and then b ro wsing them when they are found. T h e pro cess of b ro w sing rectifies the kno wledge anomaly p artially or w holly and has the additional effect of p ro viding links to other p ages whic h migh t aid the pro cess. The user ma y then c ho ose to follo w those other links (immediate ly or later) or fin ish bro wsing the current page. Wh en a p age is bro wsed completely the us er is either s atisfied en tirely and terminates the task, or is partially satisfied and resumes th e task by again lo oking for W eb pages. This iterativ e c haracterizatio ns of the in f ormation-gathering tasks suggests w a ys in wh ic h search engines could aid user s in p erforming th em. I t is to this that we now turn our atten tion. 2.2 Enabling information-gathering tasks In Figure 2 we notice that a user b ro wsing a W eb p age in the pro cess of gathering information treats the page either as a source of information or as a sour ce of links to other pages. It is therefore appropr iate to p ro vide users with links to t wo kin ds of pages: 3 certain domain my knowledge in a I look for pages that may help me augment promising This link looks I browse a Web page I am satisfied ... for now I am not satisfied with possess. the concepts I currently looks interesting This page That page was not much use I learned something but not enough Figure 2: A sc hematic of information-gathering on the W eb. 4 P1. P ages wh ic h con tain inform ation s im ilar to the current page. P2. P ages wh ic h pro vide links similar to the links pro vided b y th e cur ren t p age. Additionally it is our con ten tion that as u ser exp erience w ith the W eb im p ro v es, there will b e the realization that p eople wh o create W eb con ten t and W eb links h a ve an unders tand ing of the in terrelationships b et ween v arious pages. And so we suggest that a third kind of p age could b e useful in th e information-gathering pro cess: P3. P ages whic h can b e reac hed b y follo w ing a sequence of links from the curr en t p age and pages from whic h the curr en t page can b e reac hed b y f ollo wing a sequence of links. The sp ecifications P1 and P2 are general and op en to inte rpretation as to ho w they might b e satisfied. The pages of P3 are m ore sp ecific b ut there is still the question of which one of these pages to choose to present to the user from among the several thousand that might satisfy this criterion. In Section 4 we will provide thr ee s coring mec h anisms: F actR el will accoun t for pages of t yp e P1 , Se ekR el for P2 and SurfR el for P3 . Ho w s hould th ese link s b e provided to the u ser? While this is a question b etter addressed b y exp erts in human-computer int eraction, we would suggest that the “to olbars” pro vided by man y searc h engines could b e augmen ted to pro vide page recommendations. These already p ro vide in formation ab out the “rank” of th e page cu r rent ly b eing view ed an d v arious other pieces of information from the search engine’s bag, and this additional use can fit in seamlessly . F or searc hes that tak e place on a searc h engine’s W eb page, eac h “similar pages” link can con tain these related pages. In either case it is of paramount imp ortance that the links pro vided b e appropr iately categorize d so that the user can c ho ose to f ollo w them (or not) d ep end ing on the particular function the page b eing view ed (or the initial searc h result) pla ys in her information-gathering pro cess. 3 Related w ork The r elationship b et we en W eb pages has b een studied fairly extensiv ely . Most researc hers use the term “similarit y” and defi ne it v ariously . In this p ap er we hav e consciously av oided the u se of this term since w e b eliev e th at th e relationship b et we en W eb pages is not intrinsic to the pages but dep end s on the fun ctionalit y of the pages for the user. Ho w ev er, in order to survey previous con tributions to this area, in th is section w e will use the term “similarit y .” There are t wo br oad categories of approac h es to the W eb page similarit y problem. The firs t relies mainly on the textual conten t of W eb pages. The pair of pages b eing compared is either seen j ust as tw o groups of items (k eyw ords, anchor text, p atterns in the text) whic h o v erlap significantl y or as structured en tities that r esem b le eac h other in their organization e.g. whic h tags app ear next to whic h ones. Since this is not our appr oac h we do n ot review th is v ast literature here, referring the reader to [30] for a succinct surve y . The second approac h inv olv es taking link and interco nnection in f ormation in to accoun t. An im- p ortant wa y in wh ic h inte rconnection has b een used is to ascrib e authority to a p age based on which pages link to it. This idea forms the basis of the P agerank algorithm emplo y ed by Google [5]. In a similar vein, Kleinberg [18] describ ed t wo attributes of W eb pages: they can b e authorities on a topic, or th ey can b e hubs , linking to p ages wh ich are authoritativ e. He describ ed an iterativ e alg orithm to compute measures of th ese t wo attribu tes for eac h W eb page. One line along which Kleinber g’s work w as dev elop ed inv olv ed u sing anc hor text as a d escriptiv e sum m ary of the p age b eing linke d to [7, 13]. But, more relev an t to our metho ds wa s the fo cus on the link stru cture seen in Dean and Henzinger’s pap er [10 ] where they gav e tw o algorithms for fin ding similar W eb pages. One of their algorithms used 5 the idea of co-citat ion earlier seen in Pitk ow and Pirolli’s wo rk [26]. The second algorithm, called Com- p anion shares an imp ortan t feature with our scoring mec h anisms: it u ses Klein b erg’s algorithm. T heir metho d of b u ilding a f o cused graph extends K leinb erg’s id eas. W e hav e u sed their ideas for building capacitat ed sub net works in our algorithm. While Dean and Henzinger giv e a list of pages similar to a giv en p age, Huang et. al. [14] ga ve measures of similarit y based on the predecessor and successor set, and also on th e b asis of all the vertices reac hable f rom the t w o p ages. An other sort of “c losure” of co-citati on was used to define a similarit y measure called PageSim in [21]. In this measure, pages propagate their similarit y measure to their neigh b ors, the imp ortance of a particular p ropagation b eing decided b y the P ageRank of the page. This measure was an improv emen t on the SimR ank measure prop osed in [16] which work ed on the principle that t w o pages are similar if they are lin ked to b y similar pages. Th e SimR ank m easure wa s sho w n to b e sp ecific ins tance of a general framework for computing similarit y b etw een heterogeneous data ob jects by Xi et. al. [31] wh o pr op osed the SimF usion algorithm. In Section 4.3 w e w ill compare the results P ageSim and SimR ank pro duce on a small to y example with the results given by our scoring m ec hanism s. In a different use of link stru ctur e related to our o wn Lu et. al. [23, 22] claimed that t wo p ages were said to b e similar if flow could b e routed from one of them to the other. Ho w ev er , u nlik e our work, their capacit y assignmen ts w ere n ot based on any notion of authorit y . T o th e b est of our kno wledge this is the only other m en tion of u sing flo w to score s imilarit y in the literature. As mentio ned, the literature on find ing similarit y is v ast and has seen con trib utions from man y differen t areas. Th e p ap ers we ha ve discussed ab o v e are the ones whose tec hniqu es are closely related to our o wn. With these in view we now pro ceed to d escrib ing our scoring mec hanisms. 4 Computing relationship scores In this section we describ e our algorithms for s coring the relationships b et ween a pair of pages u and v . It is our cont en tion that tw o pages should ha ve a high F actR el s core if it is p ossible to find paths from m ultiple p ages to b oth of them. W e s a y that a page z witnesses the F actR el r elationship b et ween u and v if it is p ossible to reac h b oth u and v fr om z by follo wing a ser ies of links. F or example, in Figure 3, the page G witnesses F actR el b et ween A and B . Similarly w e sa y that a page z witnesses the Se ekR el relationship if it is p ossib le to r eac h z b y follo wing a series of lin ks from u to z and v to z . F or example, in Figure 3, C witnesses Se ekR el b et ween G and I . F is another such witness for these t wo. Somewhat differen tly , the SurfR el relat ionship do es n ot require explicit witnesses. W e sa y that the SurfR e l r elationship exists b et ween u and v if it is p ossible to reac h u from v by follo w ing a series of links, or vice versa. In Figure 3, we see that H and B are related this w a y , w hile A and I , are not related by SurfR el . Hence our scores are based on fi nding paths in the W orld Wide W eb. Bu t which paths should b e giv en more w eigh t than others in the scoring p r o cess? T o answe r this question we r ely on Klein b erg’s notion [18] of authorities (pages w hic h ha v e credible in formation) and hubs (pages which link to go o d authorities) in order to deriv e a relev an t set of fo cus ed capacitated subn et works fr om the stru cture of the W orld Wide W eb. W e then determine relationship scores by sending flo w to the t wo pages fr om witnesses (for F actR e l ), fr om the tw o pages to witnesses (for Se ekR el ) or b etwe en the p ages (for SurfR el ) in th ese capacitated n et works. W e will no w d iscuss these metho d s in d etail. In Section 4.1 we will describ e ho w to build our s et of capacitat ed su bnet w orks using Kleinberg’s metho d for assigning authorit y to W eb p ages. In Section 4.2 w e describ e h o w to fi nd witnesses, r oute flo w and compute scores. W e illustrate the wo rking of the algorithm on a to y example in S ection 4.3 and also present a comparison b y scoring Simr ank [16] and PageSim [21 ] on the same example. 6 C A B E H D G I F Figure 3: A simple h yp er lin k ed net w ork 4.1 Building capacitated subnet works W e b egin by assumin g that we ha v e a database of significan t k eyw ords, call it K . W e also assume that there is some fu n ction γ : K → (0 , 1) which assigns r elativ e imp ortance to these k eywo rds. W e do not sp ecify ho w to bu ild K , pr eferr ing to rely on th e w ealth of to ols av ailable for this pur p ose (see, e.g. [32]). F or eac h of these keyw ords we apply Kleinb erg’s [18] HITS algorithm to compute hub and authorit y v alues for all the pages asso ciated with that keyw ord. Finally , at the end of th is prepro cessing step eac h page h as h ub and authorit y v alues computed for the sub set of the d atabase of keyw ords it is asso ciated w ith . When we are give n a pair of p ages to score w e id en tify sets of significant keyw ords K u and K v for u and v . W e then merge these t wo sets to get the top k k eyw ords for some tunable parameter k . This set of c hosen k eywo rds w e call K . Once we hav e th is set of k eywords w e pro ceed b y making a set of netw orks N = { N w | w ∈ K } . F or eac h N w w e fi rst find th e set of W eb p ages P w whic h con tains the keyw ord w . Since we will b e usin g K lein b erg’s hub and authorit y v alues [18] to capacitate the netw ork (in Section 4.2) we grow N w b y fir st taking all the pages that link to th e pages in P w and the pages that are link ed to by the pages in P w (as d escrib ed in [18]). Then w e add the refin emen ts to this structur e prop osed by Dean and Henzinger [10]. Only the main f eature of these refinements has b een men tioned in S teps 2c and 2d. The reader is referred to [10] for fu rther details. Finally , for eac h net w ork N w w e run Kleinberg’s algorithm for assigning hub and authority v alues to eac h page and lab el a no d e z of netw ork N w with these v alues: h ub w ( z ) and auth w ( z ). A d irected edge f rom n o de x to no d e y is assigned capacit y hub w ( x ). A su mmary of the alg orithm is in Figure 4. W e p ostp one a discussion of the motiv ation for this construction to S ection 4.3. 4.2 Using flo ws to compute relationship scores Finding witnesses. F or b oth Se ekR el and F actR el we hav e to fi nd witnesses in eac h N w . In Figure 5 w e describ e a simple algorithm that uses br eadth-fi rst searc h fr om b oth u an d v u pto d lev els for some v alue of d to retur n a sorted list, S w , of witnesses for Se ekR el . Note that w e do n ot ju st create a set of witnesses, bu t actually make an ordered list of w itnesses. The s ignificance of this will b ecome clear shortly . In order to constru ct a list of witnesses for F actR e l w e simply reverse the direction of all the 7 Algorithm co nstructSubnetw orks ( k ) 1. Iden tify k eywords K u and K v for which u and v resp ectively hav e high hub and authority v alues. Pick the to p k elements of K u ∩ K v . Call this set K . 2. F or ea ch w ∈ K build N w as follows (a) T a ke the set of pages P w containing keyw o r d w . (b) Grow P w int o G w by ta king pag es that link to P w and pa ges that are link ed from P w . (c) Augment G w by including pag es that share an outlink with the pages in P w . (d) Augmen t G w by adding o ther pag es linked to by pa g es that link to the pages of P w . (e) N w = ( G w , E w ) wher e E w is the set of (dir ected) e dg es co nnecting pag es in G w . (f ) Run Kleinber g’s algorithm [1 8] to as s ign hub and authority v alues to each no de x ∈ G w , hub w ( x ) a nd auth w ( x ) r esp ectively . (g) Assig n (directed) edge ( x, y ) (where x is the origin of the edge and y is the endpo int) capacity c w ( x, y ) = hub w ( x ). Figure 4: Building a set of capacitated subn et works edges of N w and execute exactly th e same algorithm. W e denote the set of witn esses for F actR el b y F w ( u, v ). Note that th e rev ersal of edges is only to find witnesses, not to compute flo ws, that tak es place on the same graph for b oth Se ekR el and F actR el . Algorithm makeSeekWitnessLi s t ( N w , d, u, v ) 1. Perform a BFS for d levels star ting from u . Put a ll vertices encountered in S d ( u ). 2. Similarly co nstruct S d ( v ) by per forming a BFS for d levels from v . 3. Let S w ( u, v ) ← S d ( u ) ∩ S d ( v ). 4. F or ea ch x ∈ S w ( u, v ) compute min { h ( v , x ) , h ( u, x ) } where h ( x, y ) is the num b er o f hops in a directed path from x to y . 5. Return S w ( u, v ) so rted in ascending order of min { h ( v , x ) , h ( u, x ) } , brea king ties betw een tw o witnesses x and y by compa ring the v alue of max { h ( v , x ) , h ( u, x ) } with max { h ( v , y ) , h ( u, y ) } , and further br eaking ties ar bitr arily if they are the same. Figure 5: Findin g Se ekR el witnesses for u and v . Computing flo w s. Having constructed the ord ered list of witnesses we go d o w n the list one vertex at a time usin g an y standard single-source maximum flo w algorithm for computing the max flo w fi rst from u to or from the w itness as required, then f rom v . Note that when computing the flo w from u to a witness w e eliminate v fr om the net wo rk and vice ve rsa. Th is is to ensure that the one page do esn’t piggybac k on the other in ord er to r oute flo w to a w itness i.e. all the flo w v send s to a witness is indep endent of u and vice v ersa. After computing th e flo w for a particular witness, w e r educe the capacit y of the edges into or out of the witness. Let us p ostp one discussing the rationale and metho d for this to the end of this section. W e then mo v e on to th e n ext witness in the list. S ee Figure 6 for a formal description of the algorithm for co mpu tin g flo w s for Se ek R el . The algorithm for computing 8 flo w s f or F actR el is sym metric. In this case w e den ote the flo w f rom a witness x as factFlo w w ( x ). Algorithm flowSeek ( N w , d, u, v ) 1. L ← makeSeekWitnessList ( N w , d, u, v ). 2. While L 6 = ∅ (a) Extract the first element x fro m L . (b) flow w ( u, x ) ← max flow from u to x in N w \ { v } . (c) flow w ( v , x ) ← max flow from v to x in N w \ { u } . (d) seekFlow w ( x ) ← min { flow w ( u, x ) , flo w w ( v , x ) } . (e) Call reduceSeek Capacity ( x ). (f ) L ← L \ { x } . Figure 6: Findin g the witness flo w for Se ekR el The scores. In order to compute the relat ionship scores w e hav e to b e able to combine fl o ws from differen t subnetw orks. In order to do this we n ormalize all flows b y dividing b y the weig ht of the maxim u m ed ge in th e netw ork, a quantit y w e denote b y maxwt( w ). Also, w e factor in the relativ e imp ortance of th e v arious ke ywo rds in K using the f unction γ . Hence our first t wo scores are Se ekR el ( u, v ) = X w ∈ K γ ( w ) maxwt( w ) · X x ∈ S w ( u,v ) seekFlo w w ( x ) . F actR el ( u, v ) = X w ∈ K γ ( w ) maxwt( w ) · X x ∈ F w ( u,v ) factFlo w w ( x ) . The rationale b ehind F actR el is that p ages p ro viding similar information will b e identified by witnesses in the n et work, which act collaborative ly to identify go o d sources of in formation and link to th em. Similarly , the in tu ition b et wee n Se ekR el is that tw o p ages which allo w a user to reac h more or less the same pages are related in terms of their abilit y to guide the user as she n aviga tes th e W eb. The third m easure, SurfR el is easier to compute sin ce it requires n o witnesses. W e simply compute the max fl o w from u to v in N w , denoting it flow w ( u → v ) and the fl o w f rom v to u , denoted flo w w ( v → u ). Now, we can sa y that SurfR e l ( u → v ) = X w ∈ K γ ( w ) maxwt( w ) · flo w w ( u → v ) . SurfR e l encapsulates the idea that if the W eb allo ws one p age to reac h another through a simple sequence of clic ks, these t wo p ages m ust b e related b ecause they are b oth likel y to b e visited in a single browsing session. Th is id ea of t wo pages b eing r elated b y the pr esence of a path b etw een them is ve ry in tuitiv e and the concept of flo w generalizes th e idea of paths. By capacitat ing edges with the h ub v alue of the no des they originate at, we differenti ate b et ween no des and their abilit y to allo w information to propagate b y lo oking at their credibilit y as hubs in the hyp er lin k ed en vironment . T he more the credibility of the no de as a hub, the more the flow it can forw ard . Reducing witness capacity . In the case of Se ekR el and F actR el , the presence of many w itnesses that can sin k or s ource a lot of fl o w f rom the t wo pages b eing scored (or send a lot of flo w to them) leads 9 to a h igher score for the pair. But there are cases where this score m ay b e artificially high. C on s ider the n et work in Fig ure 7. E , B , C and G all w itness Se ekR el f or H and I . But the flow to B , C and G all go es through E . S o these three are redund an t, in the sense that th e inf ormation they provide is already con tained in the fact th at E is a witness for H and I . G B E H I C 10 50 50 50 20 50 50 Figure 7: Redun dant witnesses. Capacities are mark ed on edges. It is to preven t these r edundant witnesses fr om artificially inflating the relationship score that we reduce the capacit y asso ciated with the witness in S tep 2e of the flo w compu ting algorithm of Figure 6. F or Se ekR el when w e are done compu ting flo w to a witness w e r educe its incoming capacit y b efore mo vin g on to the next witness in the list. F or F actR el the outgoing capacit y is reduced. Before w e describ e the algorithm f ormally in Figure 8 let us d efine some n otation. F or a v ertex x let the s et of incoming edges b e I ( x ) and the set of outgoing edges b e O ( x ). Let th e flo w r ou ted for v ertex u on edge e b e f u ( e ). The capacit y of edge e is c ( e ). Algorithm reduce See kCapacity ( x ) if flow w ( u, x ) ≤ flow w ( v , x ) for ea ch e ∈ I ( x ) c ( e ) ← max { c ( e ) − f u ( e ) , 0 } . for ea ch e ∈ I ( x ) c ( e ) ← max { c ( e ) − f v ( e ) · flow w ( u,x ) flow w ( v, x ) , 0 } . else for ea ch e ∈ I ( x ) c ( e ) ← max { c ( e ) − f v ( e ) , 0 } . for ea ch e ∈ I ( x ) c ( e ) ← max { c ( e ) − f u ( e ) · flow w ( v, x ) flow w ( u,x ) , 0 } . Figure 8: Reducing capacities for Se e kR el Essen tially what reduceSeekCapacit y ( x ) do es is remov e the amount of fl ow w itn essed at x . Since w e tak e the minimum of flo w w ( u, x ) and fl o w w ( v , x ) as the amount of flow b eing witnessed, we remov e this amount from the incoming capacit y of x . And to ensu re w e d o this fairly for b oth u and v , w e p en alize the incoming edges used by b oth the flows flo w w ( u, x ) and flow w ( v , x ) equally b y scaling do w n the larger fl o w to the sm aller one b efore su btracting it from the capacit y of the in coming edge. 10 The algorithm reduceFlowCapacit y ( x ) is s ymmetric to this, only it r emo ves capacit y fr om the ou tgoing edges of the witness. In b oth these cases, with th e witness capacit y reduced, the abilit y of redund an t witnesses to s k ew the flow reduces. The reason for fixing an ordering for our w itnesses b ecomes clear no w since the capacities of the net work decrease after eac h flo w calculation. Clearly the order in which the witnesses are pro cessed will mak e a difference to the flo w that is r outed to them. Recall the function h ( u, v ) defined in Figure 5 as the num b er of h ops in a directed path from u to v . Let us s ee what the v alues of ( h ( x, v ) , h ( x, u )) are for v = H and u = I . F or G this is (3,2), for C it is (2,1), for B it is (1,2) and for E it is (1,1). S o the sorted ord er according to our alg orithm should b e E , B , C , G . F or this the total fl o w witnessed will b e 10 units for E plus 20 u nits for B = 30 units. Because of capacit y redu ction C and G will n ot b e ab le to witness an y flow. But is this correct giv en that I h as a direct edge to C whic h d o esn’t go through B or E ? W e argue it is sin ce the flow of 30 that could b e witnessed at C is the same as the flo w of 30 u nits that is b eing witnessed for H at B and E . In Figure 7 for example, if w e c h o ose the witnesses in the reverse order G, C, B , E the total flo w is 30 witnessed b y E + 30 w itnessed by C + 30 witn essed by B + 10 witnessed b y E , totalling 100, of which 70 u nits are red undant. Th us it follo ws th at distant witnesses along a c hain of n o des are more likely to b e red u ndant since decreasing their capacit y will n ot affect the flow to nearer witnesses. With this in mind , and also n oting that more in formation ab out a W eb page is lik ely to b e found in its n ear neighborh o o d rather than far a wa y from it, we ord er witnesses in increasing ord er of the distance of the witness fr om one of the pages in (Step 5 of algorithm makeSeek WitnessList ( N w , d, u, v ) in Figure 5. If a distant witness still witnesses flo w after the reduction of capacit y of nearer witnesses, we can b e su re that this is not red u ndant fl o w. 4.3 Discussion W e to ok the simple subnetw ork of Figure 9 and r an our scoring algorithms on it. The table of scores obtained is in Figure 10. F or cleanness of presentat ion all hub v alues ha v e b een scaled by 1000. The flo w v alues hav e b een scaled up by maxwt = 815 since we are only considerin g one sub n et work. 0 5 2 3 4 6 1 368 368 254 815 368 368 815 815 Figure 9: A simp le su bnet wo rk with edge capacitie s according to the hub v alues of the originating no de. In Figur e 9 we see that no de 2 is related by Se ekR el to no des 0 an d 3. This is as it s hould b e since they b oth p oin t to similar parts of the net wo rk. Node 0 sends 368 units of fl o w to the witness 11 0 1 2 3 4 5 6 0 (253,0,0,0) (368,0,368,0) (368 ,0,368,0) ( 0,0,368,0) (0,0,736,0) (0,0,368,0) 1 (253,0,0,0) (253,0,0,0) (0,0,253,0) (0,0,253,0) (0,0,0,0) (0,0,253,0) 2 (368,0,0,368) (253,0,0,0) (368 ,0,815,0) ( 0,0,368,0) (0,368,815,0) (0,0,11 83,0) 3 (368,0,0,368) (0,0,0,253) (368,0,0,815) (0,0,368,0) (0,815,0,0) (0,815,36 8,0) 4 (0,0,0,368) (0,0,0,253) (0,0,0,368) (0,0,0,368) (0,736,0,0) (0,368,0,0) 5 (0,0,0,736) (0,0,0,0) (0,368,0,815) (0,815,0,0) (0,736,0,0) (0,1183,0,0) 6 (0,0,0,368) (0,0,0,253) (0,0,0,1183) (0,815,0,368) (0,368,0,0) (0,1183,0,0) Legend: ( SeekR el ( x, y ) , F actR el ( x, y ) , SurfR el ( x → y ) , Su rfR el ( y → x )) Figure 10: Relationship scores for th e n etw ork of Figure 9 (scaled up b y 1000 × maxwt ) Se ekR el F actR el SurfR el → SurfR el ← 0 2,3 None 5 None 1 0,2 None 3,4,6 None 2 0,3 5 6 0 3 0,2 5,6 4,6 2 4 None 5 None 0,2,3 5 None 6 None 2 6 None 5 None 2 Figure 11: High scorers f or Figure 9 5 whic h is matc hed by 2. No de 3 send s 368 units of flo w to witness 6, wh ic h is also matc hed by 2. And although the n o de 1 shares many witnesses with 0, th e flow it can send is limited by its outgoing capacit y (which is lo w b ecause it is not a go o d hub) and so its Se ekR el score is lo w, though non-zero, and 2 and 3 b eat it out in scoring. If we lo ok at the second column of th e table in Figure 11, w e see that 5 and 6 are the top scorers for F actR el f or several n o des. No de 2’s high credibilit y as a hub helps draw atten tion to 5 and 6. Th e no de 0’s goo d hub v alue helps relate 2 to 5, in what can seen to b e an example of pure co-citation. In th e case of SurfR e l , the example of 0 is inte resting. While 0 is str on gly related to 5 by th is measure, its score with 6 is relativ ely less. In this case our score captur es the fact that there are m ultiple indep endent paths fr om 0 to 5 wh ile all paths from 0 to 6 go through 2. The dive rgence in th e paths after 2 do es not help b o ost SurfR el for 0 and 6. T o fu rther illustrate the p ow er of our metho d s, w e implemen ted the Si mR ank [16] and PageSim [21] scoring algorithms and s cored our simple subn etw ork using them. The results are in Figures 12 an d 13. 0 1 2 3 4 5 6 High Sco re 0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 None 1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 None 2 0.0 0.0 1.0 0.0 0.0 0.5 0.0 5 3 0.0 0.0 0.0 1.0 0.0 0.25 0.25 5,6 4 0.0 0.0 0.0 0.0 1.0 0.0 0.5 6 5 0.0 0.0 0.5 0.25 0.0 1.0 0.25 2 6 0.0 0.0 0.0 0.25 0.5 0.25 1.0 4 Figure 12: SimR ank scores for Figure 9. SimR ank related n one of the pages to either 0 or 1 w h ereas our Se ekR el is able to detec t the f act that 0 can aid in helpin g the user find links to pages that 2 and 3 can also lead to. Ev en 1 sh ares this prop ert y as a naviga tional aid with some of the other pages, a fact that comes u p in our scoring. 12 The case of n o de 1 is p articularly in teresting b ecause PageSim , that giv es non-zero scores in many cases w here SimR ank fails, do es not deduce 1’s r elationship to 0 and 2 that our Se ekR el is able to fi nd. A u s er cu rrentl y viewing 1 w ould come to b eliev e that only 3 and 4 are related to 1 if she relied on PageSim or SimR ank . Th is would b e erroneous b ecause a knowledge that 0 is related in terms of lin k s it p ro vides could lead that user to 2, wh ic h she w ould not fin d if sh e relied on these other t wo m easures. PageSim is somewhat more sophisticated than SimR ank so it detects 0’s relationship to 5, just lik e our SurfR e l → do es, b ut it misses 0’s r elationship to 3 that we find through Se ekR el . 0 1 2 3 4 5 6 High Score 0 0.08 0.0 0 .04 0.01 0.01 0.05 0.02 5 1 0.0 0.08 0.0 0 .08 0 .04 0.0 0.04 3 2 0.04 0.0 0 .16 0 .0 5 0.0 3 0.0 8 0.0 8 5,6 3 0.01 0 .08 0.05 0.33 0 .16 0 .05 0 .19 6 4 0.01 0 .04 0.03 0.16 0 .33 0 .03 0 .16 3,6 5 0.05 0.0 0 .08 0 .0 5 0.0 3 0.2 5 0.0 6 2 6 0.02 0 .04 0.08 0.19 0 .16 0 .06 0 .42 3 Figure 13: PageSim scores for Figure 9. PageSim almost misses 5’s relationship to 4 an d also scores 5’s relationship to 6 quite lo w. SimR ank completely misses the relationship to 4 and scores the relationship to 6 lo wer than the relationship to 2. On the other h and, a high F actR el score for b oth of these allo ws a user to tell that the in formation a v ailable at 4 an d 6 are b oth relev ant to p eople who are in terested in 5. Since our F actR el score b et ween 5 and 2 is relativ ely lo w er and our SurfR el score b et ween them is high, a user can ded u ce the natur e of the relationship b et ween 5 and 2, a fact also detected by SimR ank . W e n o w mov e on to exp erimen ts on r eal data tak en fr om the W eb. 5 Exp erimen tal ev aluation 5.1 Exp erimen tal setup W e p erformed our exp erimen ts on fou r data sets tak en from the W eb. Cr eating these data sets wa s a m ulti-stage pr o cess that b egan b y quer y in g AltaVista [1 ] with a searc h string and taking the top 100 results to form a core set. W e did not use Google since we compare our resu lts to Go ogle’s Similar P ages f eature. W e then used the op en source W eb cra w ler Nutc h [25] to retriev e the pages link ed from the core set. Then we found th e top 1000 pages that link to these new pages usin g Alta vista’s adv anced f eature pro v id ing in links for a qu er ied p age. Finally , w e found the inlinks of the pages in the core using Alta vista then wen t bac k to Nutch to find the outlinks of these p ages. W e follo wed Dean and Henzinger [10 ] and to ok only the top 10 outlinks in the mann er they sp ecified i.e. if w e w ere lo oking at th e outlinks of a page u wh ic h p oin ted to a core page v , we to ok only the link s on u whic h w ere “around” th e link to v in the sense that w e to ok the 5 links imm ed iately preceding the link to v on the page and the 5 links immediately follo wing v . Ha ving obtained this d ata set we pr epro cessed it by computing the hub and authorit y v alues of all the pages in it. Our four data sets we re generated using th e keyw ord strings “automobile” (54952 p ages), “motor compan y ” (14973 p ages), “clothes shopp ing” (37724 pages) and “guess” (12101 ur ls). F or rep eatabilit y purp oses, these data sets hav e b een made av ailable online. 1 W e conducted extensiv e exp erimen ts on all these data sets by taking one p age out of them as a query , then scoring all three relationships for this 1 http://w ww.cse.iitd.ernet.in/ ˜ b agc hi/relationship-scores/ 13 page w ith all the other pages in the data set. Note that this do es not exactly corresp ond to our claim in Section 4 that w e pr o ceed by extracting k eyw ords from pairs of pages and then compu ting flows on the subnetw orks obtained f rom th ose keyw ords. Limitations on the amoun t of data we were equipp ed to handle in a universit y setting preve nte d us fr om p erforming these steps in th e full. W e pr esent h er e these stripp ed down exp erimen ts as indicativ e of wh at a full implement ation of our scoring mec hanisms migh t b e able to ac hiev e. W e compared our 10 top scoring pages for F actR el and Se ekR el with the top 10 pages return ed by Go ogle’s Similar Page s feature. W e also implemented the Comp anion algorithm d escrib ed in [10] and compared our results to the top 10 results r eturned by it. F or SurfR el we simp ly to ok our top 10 results and ev aluated them. The ev aluation in all these cases w as done by conducting user surveys. F or eac h of the target URLs scored using F actR el we aske d th e user to imagine they had visted it in the course of an information-gathering task and found it relev ant . W e then assembled a set of 30 URLs ( F actR el ’s top 10, Go ogle’s top 10 and Comp anion ’s top 10). W e p resen ted th ese 30 URLs in a random ord er and ask ed users to answe r three ye s/no questions: 1) Would you visit this p age if you had alr e ady visite d the tar get p age? 2) Do es this p age pr ovide similar information to the tar get p age? and 3) Is this p age r elevant to your information-gathering task? Eac h such sur v ey was given to b et w een 5 and 8 users. F or eac h of the 30 pages, and eac h of these 3 questions a relev ance score w as computed. F or a giv en target URL t , and a result page i : Rel( t, i ) = Num b er of YES answers Num b er of user s who to ok the su rv ey . F or eac h target page an d eac h algorithm, the Pr e cision at r of the r ank ed r esults was computed using the form ula Precision-at- r ( t, r ) = P r i =1 Rel( i ) r . The pr ecision at r for an algorithm w as computed b y taking the av erage of the p recision at r v alues o ver all the target pages ev aluated. F or Se ekR el the fi rst and the third question r emained the same. Th e second qu estion was replaced with Do e s this p age pr ovide links similar to those in the tar get p age? . Precision at r wa s calculated similarly for Se ekR el . F or SurfR el we only ask ed one question: Is this p age r elevant to your task? and w e did not present results from Go ogle or Comp anion . The co d e for all scoring mec hanism s was written in Jav a (JDK-1.5.0). The op en source Nutc h cra wler w as d o w n loaded and run and a parser was written in C++ to parse its output. A total of 9 target pages were ev aluated for F actR el , 7 for Se e kR el and 5 for SurfR el . On th e 14973 page “mot or compan y ” data set findin g all three relationship scores b et we en a give n query page and all the other query pages on a desktop PC with a 3.4 GHz Inte l Pen tium pro cessor with 1GB RAM to ok ab out 8 min utes on av erage f or th is data set. On the 37,72 4 page “clothes sh opping” data set it to ok ab out 1 hour on a v erage to calculate all three scores of a giv en page with all other pages. Let us no w see w hat the exp eriments rev ealed. 5.2 Exp erimen tal results In Figur e 14 w e list the 10 URLs that scored the highest on F actR el for the page www .honda.com. The relationship rev ealed is exp ected: other ma jor car companies. More in teresting is a list of pages related b y F actR el to w ww.cngv ehicle.com. Not only d o we get pages related to other alternate fuels (Biod iesel F orum (forums.bio d iesel.com), Electric Driv e T ransp ortation Asso ciation (ev aa.org) and go ve rnment agencies dealing w ith renewa ble energy p olicy , w e also get links to pr iv ate car ind ustry pla yers who are pursu ing th e dev elopment of energy efficien t cars. 14 www.honda.com www.cngvehicle.com 1 www.ford.com www.ev aa.or g 2 www.toy o ta .com forums.bio dieselnow.com 3 www.landrov er .c om www.eere.energ y .gov/cleancities 4 www.audi.com www.ford.com 5 www.gm.com www.nrel.gov 6 www.mer curyvehicles.com www.eere.ener gy .gov/cleancities/ – 7 www.cadillac.com www.gsa.gov/Portal/gsa/ep/— 8 www.c hevrolet.co m www.mercuryvehicles.com 9 www.lincoln.com www.gm.com 10 ww w .p o rsche.com www.honda.com Figure 14: F actR el top scorers for t wo pages of the “motor company” data set. F actR el Goo g le’s “similar pages” 1 www.macys.com www.saksfifthav enue.com 2 www.neimanmarcus.co m www.macys.com 3 www.jcp enney .com www.nor dstrom.com 4 www.a be r crombie.com www.neimanmarcus.co m 5 www.bananare public.co m www.barneys.co m 6 ww w .bluefly .com www.fds.com 7 www.spiegel.com www.starbucks.com 8 www.saksfifthav enue.com www.walmart.com 9 www.targ et.com www.nycvisit.com 10 www.eddiebauer.com www.ritzcarlton.com Figure 15: F actR el vs Go ogle’s “similar pages” for www.b lo omingdales.com F or the W eb page of the clothing store Blo omingdales 2 , F actR el and Go ogle’s “similar pages” returned more or less ident ical lists righ t at th e top, bu t F actR el ’s high scorers remained v ery fo cused (other app arel stores) while Go ogle pro vided links to coffee shops and h otels which could p ossibly b e appropriate in some con texts but deviate from wh at is arguably the main fo cus of a u ser visiting th e Bloomin gdale’s W eb p age (see Figure 15). As another demonstration of F actR el ’s reliabilit y in provi ding alternate sources of information, we present its top scorers for w ww.m ysimon.com, a comparison shopp ing site, in Figure 16. F actR el scored W eb p ages for ma jor comparison s hopping sites v ery h igh. T o test the robustness of F actR el we scored th e h ome page of Guess Jeans (w w w.guess.com) using not just the “clothes shopping” data set but also the “guess” data set. Despite the presence of th e am b iguous ke ywo rd “guess”, F actR el ’s top 10 resu lts we re close ly related to the original page: pages relating to clothing and accessories. Go ogle’s similar pages, on the other hand, app eared to get sev erely misled b y the ke ywo rd “guess” (see Figure 17). The precision at r graph for www.guess.com in Figure 18 rev eals that Google do es ve ry p o orly while F actR el and Comp anion p r o vid e goo d r esults. In general w e fou n d that F actR el ’s results we re su bstan tially b etter than Go ogle’ s but not b etter than th ose r eturned b y Comp anion . In Figure 19 w e see that users preferred to visit the top 10 pages present ed b y F actR el ov er those of Go ogle after ha vin g visited the target page. F or Se ekR el the picture was more complex. While u sers generally felt that the pages p resen ted by Se ekR el were far b etter than the resu lts presented by Go ogle and Comp anion in terms of the sim ilarity 2 www.bloomingdales.com 15 F actR el for www.mysimon.com 1 www.dealtime.com 2 shopping.yaho o .com 3 www.ebay .com 4 www.bizra te.com 5 www.pr icegra bb er .com 6 www.nextag.com 7 www.b ec o me.com 8 www.alibris.co m 9 www.buy .com 10 www.www.b estbuy .com Figure 16: F actR el h igh scorers for www.mysimon.co m F actR el Go ogle’s “s imilar pa ges” 1 www.gap.com www.guessthename.co m 2 www.g ucci.com www.onlinesho es.co m/... 3 www.marciano.co m www.son ypictures.com/... 4 www.guessinc.com www.imdb.co m/title/tt0372 237 5 www.jcrew.co m www.amazon.co m/Guess/... 6 www.gbyguess.com www.bizrate.com/...g uess+bag s.htm l 7 www.hugob oss.com www.learner.o rg/... 8 www.giv ench y .c o m po pular.ebay .com/...Guess+ Jeans.html 9 www.g ianfrancofer re.com www.a nswers.com/topic/gues s-inc 10 www.diesel.com www.guessfinancial.com Figure 17: F actR el vs Go ogle’s “similar pages” for www.guess.com 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 10 Precision for Relevance r FactRel Companion Google’s Similar Pages Figure 18: Precision at r for the relev ance question for ww w .guess.com. 16 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 10 Precision for Visit r FactRel Companion Google’s Similar Pages Figure 19: Pr ecision at r for the visit question for the F actR el target pages. 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 10 Precision for Similarity r SeekRel Companion Google’s Similar Pages Figure 20: Precision at r for the similar links question for the Se ekR el target pages. 17 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 10 Precision for Visit r SeekRel Companion Google’s Similar Pages Figure 21: Precision at r for the visit question for the Se ekR el target pages. www.shoppingcolumn.com/ p ersonal- shopping.html www.misslist.com/sto res/sho es.h tml www.arlingtonca rds.com/a roundtown/bizshop1.htm www.allwi.com/wipr esents.h tml www.digital-libra rian.com/s hopping.html www.co ol10 5 5.com/lc/ features/s ho pping www.ersys.co m/usa/1 3/134 9000/mall.ht m Figure 22: Se ek R el top scorers for ersys.com’s m all in formation page for Santa Clara, CA. of th e link s on them to those on the target page (see Figure 20), they preferred to vist the pages present ed by the other tw o algorithms (see Figure 21). Th is is strong ind ep end en t evidence in fav our of Au la et. al.’s conclusion [2] that W eb users pr efer to browse rather than searc h . Rather than visit another page with links similar to a giv en page they w ould rather visit a page with actual information on it. Another p ossible d ra wbac k in Se e k R el w as r eveal ed wh en we scored for a page listing all the malls in the San ta C lara, California area 3 (see Figure 22). W e found that p ages with lin ks to online shopping resources and ev en p ers onal shopping options scored high. But w e also found that pages with lo cal information for places as far afield from S an ta Clara as Macon, Georgia app eared near the top of the list. Th e absence of geographical domain kn o w ledge in our system sho ws up here. The u ser resp onse to SurfR el w as fairly go o d . Almost half of our top ten results w er e found relev ant b y th e resp ondents (see Figure 23). 3 www.ersys. com/usa/06/06 69084/mall.htm. 18 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 7 8 9 10 Precision r SurfRel Figure 23: Precision at r for SurfR el . 6 Discussion Our scoring mec hanisms are designed with a view to integrate the t w o broad streams of thought on w eb page relationships: textual and link-based. This go es some wa y in addressin g La wrence and Giles’ criticism that searc h engines are biased to wards pages whic h are w ell-linked [19] and is hence an adv antag e ov er algorithms lik e Dean and Henzinger’s Comp anion which tak e only link information in to consid eration. Although us er s ranke d our r esults in the same ballpark as Comp anion , it is our cont en tion that our algorithms are m uc h less r esource intensiv e and m uc h more suited to inclusion in a real-w orld searc h engine. W e mainta in a dec k of subn etw orks, one corresp ond ing to eac h signifi cant keyw ord. It is difficult to estimate the num b er of significan t ke ywo rds b ut if we tak e the widely used lexical database W ordNet [11 ] as an indicator, the n umb er is of the order of 100,000. 4 The Comp anion algorithm, on the other hand, create s a su bnet w ork for eac h queried p age. Creating the subn etw ork at query time is difficult b ecause of the o v erhead inv olv ed in crawling the W eb (or ev en an image of the w eb stored on disk) and pr epro cessing and storin g these net w orks app ears infeasible giv en th at the size of the W orld Wide W eb is estimated to b e in the tens of billions of pages [9]. Ev en if it were feasible to store structural in formation on such a scale, the p roblem of up dation is hard to solv e. Th e W eb is constant ly c h an ging, and up dating our k eyword-based subnet wo rks will b e an ord er of magnitude less resource in tensiv e than up d ating billions of page-sp ecific subnetw orks. Our app roac h is further v in dicated by the observ ation that the results pr o vid ed by F actR el and Se ekR el clearly outp erform Go ogle’s Similar P ages. W e ha v e tried to gracefully br in g together textual and link information in a common framewo rk where one can comp ensate for the sh ortcomings of th e other. T h e Guess Jeans example presen ted ab o ve demonstrates that our scores can leverag e link information to h andle ambiguous ke ywo rds in a manner b etter than Go ogle can. The main contribution of this pap er, in our view, is the lo cation of our th inking on how to relate 4 W ordNet 3.0 contains 155,287 distinct strings. 19 pages in the con text of user int en t. As part of our futu re researc h agenda w e wan t to form ulate relationships b et ween p ages that can service user inten t outside the d omain of information-gathering. W e also wan t to test the applicabilit y of our metho ds in so cial netw orking situations an d u ser-generated con tent scenarios. References [1] Alta vista. h ttp://www.alta vista.com/ . [2] A. Aula, N. Jhav eri, and M. K¨ aki. Information searc h and re-access strategies of exp erienced web users. I n Pr o c. of 14th Intl. World Wide Web Confer enc e (WW W 2005) , 2005. [3] N. J. Belkin, R. N. Odd y , and H. M. Bro oks. ASK for information r etreiv al: Part I . Bac kground and theory. Journal of Do cumentation , 38(2) :61–71, 1982. [4] N. J. Belkin, R. N. Od dy , and H. M. Bro oks. ASK for inf ormation retreiv al: P art I I. Results of a design stu dy. Journal of Do cumentation , 38(3):14 5–164, 1982. [5] S. Brin and L. P age. The anatom y of a large-scale h yp ertextual W eb searc h engine. Computer Networks and ISDN Systems , pages 107–117, 1998. [6] A. Z. Bro der. A taxonom y of W eb searc h. In ACM SIGIR F orum , pages 3–10, 2002. [7] S. Ch akrabarti, B. Dom, P . Ragha v an, S. Ra jagopalan, D. Gibson, and J . Kleinberg. Automatic resource compilation b y analyzing h yp erlink structure and asso ciated text. In Pr o c. 7th Intl. Confer enc e on the World Wide Web (WW W ’98) , pages 65–74 , 1998. [8] V. Cothey . A longitudinal study of W orld Wide W eb users’ in f ormation-searc hing b eha vior. Jour- nal of the Americ an So ci e ty for Information Scienc e and T e chnolo gy , 53(2):67– 78, 2002. [9] M. de Ku n der. The size of the W orld Wide W eb. h ttp://www.w orldwidewebsize.c om/. Retrieve d on 29th F ebruary 2008. [10] J. Dean and M. Henzinger. Finding related pages in the W orld Wide W eb. In Pr o c e e dings of the 8th W W W Confer enc e , pages 1467–1 47, 1999. [11] C. F ellbaum, editor. Wor dnet: An ele ctr onic lexic al datab ase . Bradford Bo oks, 1998. [12] A. J. F errari, D. Gourley , K. Johnson, F. C. Knab e, D. T un k elang, and J. S. W alter. Hierar- c h ical data-driv en na vigation sy s tem and metho d for inf orm ation retriev al. U.S. P aten t n u m b er 7,035, 864, Ap ril 2006. [13] T. H. Ha v eliw ala, A. Gionis, D. Klein, and P . In dyk. Ev aluating str ategies for similarity searc h on the W eb. In P r o c. 11th Intl. Confer enc e on the W orld Wide We b (W WW 2002) , pages 157–16 3, 2002. [14] S.-H. S. Huang, C. H. Molina-Ro d riguez, J. U. Quevedo- T orrero, and M. F. F onseca-Lozada. Exploring similarit y among W eb pages u sing th e h yp erlin k structure. In Pr o c. International Confer enc e on Information T e chnolo gy: Co ding and Computing (ITCC’04) , pages 344– 348, 2004. [15] B. Jansen, D. Bo oth, and A. Spink. Determining the user inte nt of W eb searc h engine queries. In Pr o c e e dings of the 16th International Confer enc e on World Wide Web , pages 1149– 1150, 2007. 20 [16] G. Jeh and J. Widom. S imrank: A measure of structural-con text similarit y . In P r o c e e dings of the 8th ACM SIGKDD International Confer enc e on K now le dge Disc overy and Data M ining , 2002. [17] M. Kellar. A n E xamination of U ser Behaviour during Web Information T asks . PhD thesis, Dalhousie Un iv ers it y , Halifax, C an ad a, 2007. [18] J. K lein b erg. Au thoritativ e sources in a h yp erlink ed environmen t. In Pr o c e e dings of the 9th Annual ACM-SIAM Symp osium on Discr ete Algorithms , p ages 668–677, 1998. [19] S. La wr ence and C. L. Giles. Accessibilit y of information on the W eb. N atur e , 400:1 07–109 , 1999. [20] T.-P . Liang, H.-J. Lai, and Y.-C. Ku. P ersonalized conten t recommendation and user satisfac- tion: Th eoretical synthesis and emp irical fi ndings. Journal of Management Information Systems , 23(3): 45–70, 2006-7 . [21] Z. Lin, M. Lyu, and I. K ing. P agesim: A no vel link-based measure of W eb page s im ilarity . In Pr o c e e dings of the 15th International Confer enc e on World Wide Web , pages 1019– 1020, 2006. [22] W. Lu, J. C. M. J anssen, E. E. Milios, N. J apk o wicz, and Y. Zhang. No de similarit y in th e citation graph. Know l. Inf. Syst. , 11(1):1 05–129, 2007. [23] W. Lu, J. C. MJanssen, E. E. Milios, and N. Japko wicz. No de similarity in n et work ed informa- tion spaces. In Pr o c. Confer enc e of the Centr e for A dvanc e d Studies on Col lab or ative R ese ar ch (CASCON) , 2001. [24] G. Marc hionini. Information Se e k ing in Ele ctr onic Envir onments . Cam brid ge Univ ers it y P r ess, 1995. [25] Nutc h . http://luce ne.apac h e.org/n utch/ . [26] J. Pitk o w and P . Pir olli. Life, d eath, and la wfulness on the electronic fron tier. In Pr o c e e dings of ACM SIGCHI Confer enc e on Human F actors in Computing , 1997. [27] D. E. Rose. Reconciling in f ormation-seeking b ehavio r with search u ser interfaces for the W eb. Journal of the Americ an So ciety for Information Scienc e and T e chnolo gy , 57(6):797– 799, 2006. [28] D. E. Rose an d D. Levinson. Unders tanding us er goals in W eb search. In P r o c. of 13th Intl. Confer enc e on World Wide Web (WWW 2004) , pages 13–19, 2004. [29] A. Sp ink, D. W olfram, M. B. Jans en , and T. Saracevic. S earc hin g the W eb: Th e pu blic and their queries. Journal of the Americ an So ciety for Information Scienc e and T e chnolo gy , 52(3) :226–23 4, 2001. [30] A. T om bros and Z. Ali. F actors affecting W eb p age similarit y . In 27th Eur op e an Confer enc e on Information R etrieval (ECIR) , 2005. [31] W. Xi, E . F o x, W. F an, B. Z hang, Z . Ch en, J . Y an, an d D. Zhuang. Sim f usion: Measuring similarit y using unified relat ionship matrix. In Pr o c e e dings of the 28th ACM SIGIR Confer enc e on R ese ar ch and Development in Information R etrieval , p ages 130–137, 2005. [32] Y ahoo! Con ten t analysis w eb s er v ices: T erm extraction. h ttp://dev elop er.yahoo.com/searc h/cont ent/V1/termExtraction.h tml . 21
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment