Visualization of association graphs for assisting the interpretation of classifications

COLLNET 2006, p. 480

SANJUAN Eric (1), ROCHE Ivana (2)
(1) LITA, Université de Metz & SRDI, INIST – CNRS, IUT de Metz, Ile du Saulcy, 57045 Metz, CEDEX 1, FRANCE
(2) SRDI, INIST – CNRS, 2, Allée du Parc de Brabois, CS 10310, 54519 Vandoeuvre-lès-Nancy, FRANCE

Abstract

Given a query on the PASCAL database maintained by the INIST, we design user interfaces to visualize and browse two types of graphs extracted from abstracts: 1) the graph of all associations between authors (co-author graph), 2) the graph of strong associations between authors and terms automatically extracted from abstracts and grouped using linguistic variations. We adapt for this purpose the TermWatch system, which comprises a term extractor, a relation identifier which yields the terminological network, and a clustering module. The results are output on two interfaces: a graphic one mapping the clusters in a 2D space and a terminological hypertext network allowing the user to interactively explore results and return to source texts.

1. Introduction

As pointed out by Ley and Reuther (2006), traditional bibliographic databases like PASCAL evolved from a small specialized bibliography to a digital library covering most scientific fields. The collection is maintained with massive human effort. In the long term, this investment is only justified if data quality remains much higher than that of search-engine-style collections. In this paper we show that high-quality co-author and terminological graphs can be easily extracted from the PASCAL database, visualized and browsed. In particular, we found that the special problems raised by person names can be managed using simple heuristics. Moreover, we show that it is possible and useful to display the complete co-author graph of several hundred PASCAL abstracts resulting from a request. This differs from Klink et al.
(2004)'s approach, which focuses on local extracts of the graph in order to show only the surroundings of a single author in the DBLP database.

We focus on valued graphs of terms (words, names or keywords), which constitute the input to co-word analysis as defined in (Michelet, 1988). Courtial (1990) observed that, in this type of analysis applied to scientific and technical databases, short paths of strong associations can reveal potential new connections between separated fragments of the network and point to innovative situations. Following these observations, graph clustering methods have been proposed that do not focus on homogeneous clusters but highlight heterogeneous clusters formed along a path of strong associations.

The problem of graph clustering is well studied in a large literature. We refer to Flake et al. (2004) for a complete review. The best known graph clustering algorithms attempt to optimize specific criteria such as k-median, minimum sum, minimum diameter, etc., whereas simple, fast variants of single link clustering (SLC) give satisfactory results in Courtial (1990)'s approach to co-word analysis. One of these variants can be found in the SDOC system (Polanco et al. 1995), currently integrated into the STANALYST (http://stanalyst.inist.fr/) on-line interface that gives access to the PASCAL and FRANCIS databases (http://www.inist.fr) for information analysis purposes. It uses a threshold to fix the maximal cardinality of clusters. One of the most important qualities of SDOC is that it presents the content of clusters as subgraphs of associations that can be analyzed in a very intuitive way, which minimizes the risk of observing correlations where they do not exist.
In this paper we revisit this main idea of co-word analysis, applied to two other types of graphs extracted from bibliometric data: 1) the graph of all associations between authors, 2) the graph of strong associations between authors and terms automatically extracted from abstracts and grouped using linguistic variations. We adapt for this purpose the TermWatch system, which comprises three main modules: a term extractor, a relation identifier which yields the terminological network, and a clustering module. The results are output on two interfaces: a graphic one mapping the clusters in a 2D space and a terminological hypertext network allowing the user to interactively explore results, return to source texts or re-execute the system's modules. Based on the interactive graph visualisation toolkit AiSee (http://www.aisee.com) and a different clustering algorithm called CPCL, introduced in (Ibekwe-SanJuan 1998) and implemented in TermWatch (SanJuan et al. 2005), we show how to reduce and explore informative paths between nodes.

The article is organized as follows. Section 2 is devoted to a precise description of our methodology, which involves the definition of our text data representation (§2.1), the graphs we extract from this data (§2.2 to §2.4) and the way we mine informative short paths (§2.5) using a simple clustering algorithm. In §3 we experiment with our approach on a small and a medium-size corpus (the whole extracted graph fits in RAM). We proceed to the clustering of this non SWG in §3.3. §4 is devoted to discussion and future work.

2. Graph extraction from documents

2.1. Association graphs

We carry out the usual extraction of co-author and co-term association graphs from the collection of separate documents. From a formal point of view, the input is a hypergraph (a finite family H = {h1, h2, ...} of finite subsets called edges) having as many hyper-edges as documents.
Each hyper-edge hi is a set of units extracted from document i; hi can contain authors, terms, bibliometric attributes, etc. From H, we derive the valued graph of associations G = (V, E, a) where:
• V is the union of all extracted items.
• E is the family of all pairs (dyads or edges) of elements included in hyper-edges (any edge e in E is of the form {u,v}, and there exists a hyper-edge hi containing both u and v).
• a is a valuation of E. In this experiment we chose the equivalence coefficient, which is defined for any edge e = {u,v} as the product of the conditional probabilities of finding one of the vertices in a hyper-edge knowing the presence of the other.

2.2. Reducing and Visualizing Association Graphs

Usually, when a co-occurrence matrix is used, a threshold is set on keyword frequency in order to obtain a less sparse matrix (Feldman et al., 1998). In our approach we prefer to set the threshold on the association index, after observing that low-frequency items from PASCAL can represent valuable, prospective rare information without inducing too much noise. Consequently, every value s in ]0,1[ induces a sub-graph Gs = (V, Es, a) where Es is the set of pairs of vertices {i,j} such that a(i,j) > s.

We use a variant of single link clustering (SLC), called CPCL, originally introduced in Ibekwe-SanJuan (1998) and described hereafter, to form clusters of keywords related by geodesic paths made of relatively high associations. However, any variant of SLC that reduces its chain effect can produce interesting results in this context. In this experiment we use CPCL instead of SDOC because it does not require fixing the maximal size of a cluster, and because relations between extracted clusters are symmetric. For these reasons, CPCL is better adapted to the task of reducing a valued graph while preserving its structure.

We use two types of interfaces to browse the resulting clustered network.
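Before turning to the interfaces, the graph construction of §2.1 and the thresholding step of this section can be illustrated with a minimal Python sketch. It assumes that the conditional probabilities are estimated by document counts, so that a(u,v) = n_uv^2 / (n_u * n_v), where n_u is the number of hyper-edges containing u and n_uv the number containing both u and v. The function names `association_graph` and `threshold`, as well as the toy items, are illustrative only and are not part of TermWatch.

```python
from collections import Counter
from itertools import combinations


def association_graph(hyperedges):
    """Build the valued graph {frozenset({u, v}): a(u, v)} from hyper-edges.

    Assumption: conditional probabilities are estimated by counts, so
    a(u, v) = P(u | v) * P(v | u) = n_uv**2 / (n_u * n_v).
    """
    occurrences = Counter()    # n_u: number of hyper-edges containing u
    cooccurrences = Counter()  # n_uv: number of hyper-edges containing u and v
    for edge in hyperedges:
        items = sorted(set(edge))
        occurrences.update(items)
        cooccurrences.update(frozenset(pair) for pair in combinations(items, 2))

    graph = {}
    for pair, n_uv in cooccurrences.items():
        u, v = pair
        graph[pair] = n_uv ** 2 / (occurrences[u] * occurrences[v])
    return graph


def threshold(graph, s):
    """Keep only the edges of G_s, i.e. those with a(u, v) > s."""
    return {pair: value for pair, value in graph.items() if value > s}


# Toy hyper-edges (hypothetical data): one set of extracted units per document.
documents = [
    {"author A", "author B", "term x"},
    {"author A", "author B"},
    {"author A", "term y"},
]
graph = association_graph(documents)
strong = threshold(graph, 0.4)
```

With these three toy documents, only the pairs {author A, author B} and {author B, term x} survive the threshold s = 0.4. The rare item "term x" is kept because its single occurrence coincides with one of the two occurrences of author B, which illustrates why the threshold is set on the association index rather than on frequency.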
The interactive graph visualization interface AiSee (http://www.aisee.com) is based on energy minimization models. It displays the graph in the form of a set of masses related by spiral springs, represented by straight edges. Vertices represent units that can be interactively folded, or clusters that can be unfolded to observe their internal structure and their external links to surrounding clusters or external items. Since clusters are labeled by their central terms, this interface reveals units (authors or terms) with high betweenness centrality scores. Moreover, by opening a cluster the user visualizes the geodesic paths of strong associations that cross the label of the cluster. These short paths reveal potential new interactions between document attributes.

In the case of very large graphs, like the graphs including authors' terminology explained hereafter, a supplementary hypertext browsing interface is required to retrieve and explore informative associations. We used the navigator interface included in the TermWatch system to check details of the associations between authors and terms automatically extracted from raw text.

2.3. Clustering Algorithm

There exist many graph clustering algorithms. In particular, Matsuda et al. (1999) proposed algorithms extracting high-density subsets of vertices that could be applied to the valued graph G, but not to the Gs graphs of textual data, which are especially sparse. We chose the CPCL (Classification by Preferential Clustered Link) algorithm, which tends to form clusters along short geodesic paths of strong associations. It consists in iteratively merging clusters of keywords related by an association stronger than any other in the external neighborhood. In other words, it works on locally maximal edges instead of absolute maximal values as in standard SLC. We briefly recall this algorithm hereafter and refer the reader to (Berry et al. 2004) for a detailed description in the graph formalism.
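Before the formal statement, the following Python sketch gives an informal reading of the three CPCL steps. It is not the TermWatch implementation. It assumes that "stronger than any other" means strictly greater than every other edge incident to either endpoint, so that ties are never merged, and that the graph has no self-loops. The function names and the `max_rounds` parameter are hypothetical additions, the latter used only to inspect intermediate reductions.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cpcl(vertices, weights, max_rounds=None):
    """Sketch of CPCL. `weights` maps a pair (u, v) to its association a(u, v).

    Returns (clusters, reduced_edges); every cluster is a frozenset of
    original vertices. `max_rounds` is a hypothetical knob, not in the paper.
    """
    nodes = [frozenset([v]) for v in vertices]
    edges = {frozenset(frozenset([x]) for x in pair): w
             for pair, w in weights.items()}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # Step 1: edges that are the unique strongest edge at both endpoints.
        top, ties = {}, {}
        for edge, w in edges.items():
            for n in edge:
                if w > top.get(n, float("-inf")):
                    top[n], ties[n] = w, 1
                elif w == top[n]:
                    ties[n] += 1
        selected = [e for e, w in edges.items()
                    if all(top[n] == w and ties[n] == 1 for n in e)]

        # Step 2: connected components of the sub-graph (V, S).
        parent = {n: n for n in nodes}
        for e in selected:
            a, b = e
            parent[find(parent, a)] = find(parent, b)
        merged = {}
        for n in nodes:
            root = find(parent, n)
            merged[root] = merged.get(root, frozenset()) | n
        if len(merged) == len(nodes):  # V == C: nothing was merged, stop
            break

        # Step 3: reduced graph, valued by the strongest edge between components.
        comp = {n: merged[find(parent, n)] for n in nodes}
        reduced = {}
        for e, w in edges.items():
            a, b = e
            if comp[a] != comp[b]:
                key = frozenset([comp[a], comp[b]])
                reduced[key] = max(w, reduced.get(key, w))
        nodes, edges = list(merged.values()), reduced
        rounds += 1
    return nodes, edges


# Toy valued graph (hypothetical data): a path a - b - c - d.
toy_vertices = ["a", "b", "c", "d"]
toy_weights = {("a", "b"): 0.9, ("b", "c"): 0.5, ("c", "d"): 0.8}
first_round = cpcl(toy_vertices, toy_weights, max_rounds=1)
converged = cpcl(toy_vertices, toy_weights)
```

On this toy path, the first round groups {a, b} and {c, d}, because the edge of weight 0.5 is dominated at both of its endpoints, and leaves a single reduced edge of weight 0.5 between the two clusters. Run to convergence, that remaining edge is itself locally maximal in the reduced graph, so the two clusters merge.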
Program CPCL(V, E, a)
1) Compute the set S of edges {i,j} such that a(i,j) is greater than a(i,z) and a(j,z) for any other vertex z.
2) Compute the set C of connected components of the sub-graph (V, S).
3) Compute the reduced valued graph (C, E_C, a_C) where E_C is the set of pairs of components {I,J} such that there exists {i,j} in E with i in I and j in J, and a_C(I,J) = max{a(i,j): i in I, j in J}.
If V <> C, go to phase 1 with (C, E_C, a_C) in place of (V, E, a); else return (C, E_C, a_C).

2.4. Term extraction for topic mapping

TermWatch performs multi-word term extraction based on shallow NLP, using the LTPOS tagger and LTChunk software (C) Andrei Mikheev 2000 of the University of Edinburgh. LTPOS is a probabilistic part-of-speech tagger based on Hidden Markov Models. LTChunk identifies simplex noun phrases (NPs), i.e., NPs without prepositional attachments. In order to extract more complex terms, contextual rules are used to identify complex terminological NPs.

Terms undergo variations, which lead to the creation of new neighboring concepts. This process, called variation in the computational terminology field, has been well studied (Ibekwe-SanJuan 1998). Variations occur at different linguistic levels: morphological, lexical, syntactic, semantic. TermWatch identifies the different variants of the same term, going from close semantic variants like morphological variants (gender and spelling) and synonyms (using WordNet) to generic-specific relations using syntactic criteria (expansions and structural changes).

Then, given a collection D = {d1,...,dn} of documents, we consider the hypergraph T = {t1,...,tn} where each hyper-edge ti contains all author names of document di, together with all terms extracted from document di and all their variants found in the corpus of documents. From this hypergraph we derive an association graph G = (V, E, f) as explained in subsection 2.1 but with a different valuation function f.
Indeed, for each edge {u,v} we set f(u,v) to:
• 1 if u and v are terms and u is a variant of v,
• a(u,v) otherwise.

The graph Gs for s = 0.8 is then clustered using CPCL and visualized in AiSee, while we use the hypertext TermWatch interface to browse all association links between clustered elements, as already explained.

2.5. The STANALYST® analysis station

The STANALYST® analysis station is composed of a set of modules for searching the bibliographic databases of the INIST and for their statistical, terminological and thematic analysis (Polanco et al. 2001). These modules are integrated within a common graphical user interface accessible from a web browser, as follows. RECEPTION is a static HTML page giving access to the application; the user declares a name and defines a password. PROJECT makes it possible to define a work environment, i.e. a directory in which all the results concerning the project are stored. The user is the owner of the project and can also give access to it to associated users.

The modules CORPUS, BIBLIOMETRICS, INDEXING and INFORMETRICS constitute the working modules of this information analysis station. The CORPUS module manages the creation of corpora by executing requests built by the user. The corpora can then be exported to the following modules. The BIBLIOMETRICS module manages the creation of descriptive statistical analyses. The INDEXING module makes it possible either to revise the indexing or to carry out an automatic indexing of the corpus, relying on tools that perform terminological extraction on the basis of several terminological references. The result of the INDEXING module is the input for thematic classification. The INFORMETRICS module manages classification using unsupervised automatic classification methods: two programs are available.
The first performs a hierarchical classification based on the co-word analysis method, and the second applies the axial K-means method to obtain a non-hierarchical classification. The modules use a set of working directories containing the programs, scripts and parameters necessary to their operation, as well as all the projects created by the users.

3. Results

3.1. Corpus extracted from PASCAL database

We extracted two experimental corpora from the PASCAL database: a corpus on South-American nanotechnologies, which we refer to as SAN for short, and a corpus on chordal graphs, which we refer to as CG.

The SAN corpus consists of 939 bibliographic references from the PASCAL database. The query asks for scientific and technical papers related to nanotechnologies, published in the last twelve years (1994 to the present) and having at least one author affiliated to a South-American organization. The principal scientific domains covered by the corpus references are Physics with 68% of the references, Engineering Sciences with 16% and Chemistry with 11%. 85% of the bibliographic references were published in the last 6 years. Almost all the documents (99.7%) are in English. The 2,574 authors come from 1,984 affiliations located in 51 different countries. The five best represented are Brazil, USA, Argentina, France and Spain with, respectively, 40%, 12%, 9%, 6% and 4% of documents. The documents come from 180 journals published in 11 different countries. The first three are USA, The Netherlands and United Kingdom with, respectively, 35%, 31% and 16% of documents. The 4 most represented journals (2% of the cover) produce 23% of the references, and 75% of the references are produced by 28% of the cover.

The CG corpus is composed of 155 bibliographic references from the PASCAL database. The query asks for scientific papers dealing with chordal graphs and explores 22 years of the PASCAL database.
The principal scientific domains covered by the corpus references are Mathematics (56%) and Engineering Sciences (43%). 67% of the bibliographic references were published in the last 6 years. All the documents are in English. The 261 authors come from 240 affiliations situated in 32 different countries. USA, Canada, France, Germany and Taiwan with, respectively, 27%, 10%, 9%, 9% and 4% of documents are the best represented. The documents come from 29 journals published in 8 different countries. The Netherlands, Germany and USA with, respectively, 57%, 29% and 29% are the first ones. 25% of the references are produced by a single journal, and the first five journals produce 75% of the cover.

3.2. Co-Author graphs

We generated two levels of visualization of author graphs. The first level is the whole graph of associations between authors: all pairs of authors that wrote at least one paper together are represented. For sets with fewer than 300 documents this graph can be browsed using an appropriate toolkit like AiSee, as shown in Fig. 1 (left side) for the SAN corpus and Fig. 3 for the corpus on chordal graphs.

Fig. 1. Co-author graph extracted from the SAN corpus. The left figure shows the general shape of the graph. Vertices represent authors. Neighboring authors in the same cluster share the same color. The right figure shows the contents of clusters Knobel and Souza.

In the case of the SAN corpus, the co-author graph reveals a set of clusters representing important international collaborations of South-American academic institutions. It is possible to observe in cluster J. Jiang a great number of exchanges with Japan and in cluster M. S. Dresselhaus a strong collaboration with the USA. Some authors have a central position, such as M. Knobel, A. F. Craievich, P. C. Morais, A. G. Souza or D. Ugarte, respectively in clusters M. Knobel, A. F. Craievich, P. C. Morais, A. G. Souza and J. Jiang. These authors come from internationally known academic institutes.
It is also possible to find author cliques: an interesting example can be observed in cluster P. Levy, whose authors are related to a national atomic energy institute. Clustering highlights an underlying structure, as shown in Fig. 2 for the SAN corpus, and the authors that relate different groups, as shown in Fig. 1 (right side).

Fig. 2. Reduced co-author graph on SAN. Vertices represent clusters, each labeled by the author having the highest number of links towards other clusters.

In the case of the CG corpus, the co-author graph reveals a central author (cut vertex), Dieter Kratsch, and a dense cluster (unfolded in the figure) formed around Heggernes and Berry, who is related to all authors in this cluster. Another feature of this image is that it shows the evolution of authors Berry and Paul since their PhDs with Bordat and Habib respectively.

Fig. 3. Co-author graph extracted from the CG corpus. Clusters are wrapped together.

3.3. Terminology Author graphs

We proceed to term extraction in both corpora. The resulting graphs are huge and a hypertext browsing interface is required. However, in the case of the CG corpus, the graph display confirms the central position of Dieter Kratsch in this corpus, since a cluster labeled by his name appears. By unfolding it, we see the terminology used in his publications.

Fig. 4. Cluster labeled Kratsch and associations to the cluster of open questions.

Fig. 5. Contents of cluster k log in the CG graph.

Fig. 4 and Fig. 5 show features that are difficult to detect using the hypertext browsing interface alone. In the case of the CG corpus, while the centrality of author Kratsch is revealed by the fact that his name is the label of the biggest cluster, the graph display interface makes it possible to point out the multiple links between terms associated with open questions and other terms, by folding the huge cluster labeled Kratsch and unfolding the cluster labeled open question.
By unfolding the two clusters of terminology respectively related to authors Simonet and Berry, the graphical interface makes it possible to point out the multiple graph problems targeted by Berry, whereas Simonet appears to be more specialized in a single application of these algorithms.

The terminology-author graph obtained from the SAN corpus confirms some statements made from the co-author graph. For example, it is possible to validate the central position of A. F. Craievich and to determine the associated terminology by examining the cluster contents: doped film, grazing-incidence small angle X-ray reflectivity, GISAXS pattern, film surface. The cluster K. D. Machado presents the same clique of authors observed in the co-author graph (J. C. de Lima, T. A. Grandi, C. E. M. Campos and K. D. Machado) and its association with two other clusters: P. S. Pizani and binary mixture. The cluster binary mixture is composed of: hexagonal Co Se, nominal composition Co, amorphous selenium.

4. Discussion

Relying on the data quality of the PASCAL database, we have designed and experimented with an interface that can extract and efficiently display, in real time, the co-author and terminology networks from documents returned by a user query through the STANALYST system. Further human-computer interface studies should allow us to better adapt this interface to users unfamiliar with AiSee, by automatically selecting clusters to fold or unfold for different scientific watch tasks. Moreover, in our experiment we did not use the keyword field of PASCAL abstracts, since we focused on the feasibility of displaying the vocabulary used in text abstracts. A further experiment will be to add this data to the hypergraph representation of documents.

References

1. Berry A., Kaba B., Nadif M., SanJuan E., Sigayret A. (2004) Classification et désarticulation de graphes de termes. In: JADT 2004, Leuven, Belgium, 10-12 March, 160-170.
2. Courtial J-P. (1990) Introduction à la scientométrie. Anthropos – Economica, Paris, 135 p.
3. Flake G. W., Tarjan R. E., Tsioutsiouliklis K. (2004) Graph Clustering and Minimum Cut Trees. Internet Mathematics, Vol. 1, No. 4, A K Peters Ltd., pp. 385-408.
4. Ibekwe-SanJuan F. (1998) A linguistic and mathematical method for mapping thematic trends from texts. In: Proceedings of the 13th European Conference on Artificial Intelligence (ECAI), Brighton, U.K., pp. 170-174.
5. Jain A. K., Dubes R. C. (1988) Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ.
6. Klink S., Ley M., Rabbidge E., Reuther P., Walter B., Weber A. (2004) Browsing and Visualizing Digital Bibliographic Data. VisSym 2004: 237-242.
7. Ley M., Reuther P. (2006) Maintaining an Online Bibliographical Database: The Problem of Data Quality. EGC 2006: 5-10.
8. Michelet B. (1988) L'Analyse des Associations. Thèse de doctorat, Université de Paris 7.
9. Matsuda H., Ishihara T., Hashimoto A. (1999) Classifying Molecular Sequences Using a Linkage Graph with Their Pairwise Similarities. Theor. Comput. Sci., 210(2): 305-325.
10. Polanco X., François C., Royauté J., Besagni D., Roche I. (2001) STANALYST: An Integrated Environment for Clustering and Mapping Analysis on Science and Technology. 8th International Conference on Scientometrics and Informetrics, July 16-20 2001, Sydney, Australia.
11. Polanco X., Grivel L., Royauté J. (1995) How to do things with terms in informetrics: terminological variation and stabilization as science watch indicators. In: Michael E. D. Koenig, Abraham Bookstein (Eds), 5th International Conference of the International Society for Scientometrics and Informetrics, 435-444, Learned Information Inc., Medford NJ.
12. SanJuan E., Dowdall J., Ibekwe-SanJuan F., Rinaldi F. (2005) A symbolic approach to automatic multiword term structuring. In: Computer Speech and Language, Vol. 19, issue 4, October 2005, pp. 524-542.
