Applying an XML Warehouse to Social Network Analysis, Lessons from the WebStand Project
In this paper we present the state of advancement of the French ANR WebStand project. The objective of this project is to construct a customizable XML based warehouse platform to acquire, transform, analyze, store, query and export data from the web,…
Authors: Benjamin Nguyen, Antoine Vion, Francois-Xavier Dudouet
In this paper, we describe our platform, WebStand, currently under development, to be used by sociologists when studying information found on the Web, and in particular analyzing social behavior on mailing lists, forums or any place in which (tracked) discussions take place on the Web. Our current focus is the analysis of the W3C standardization mechanism around the XQuery recommendation. Indeed, Information Technology is only just receiving attention from sociologists, and our goal is to create new tools for sociologists to assess and analyze this domain.
Our approach, when designing our initial platform architecture, was to consider, in conjunction with sociologists, what sort of information they wished to obtain and what sort of analysis they wanted to run. A preliminary study led us to the following conclusions:
Traditionally, sociological data consist of reports, questions and interviews. In contrast, in the Web context, the data manipulated are electronic: mailing lists, homepages, and institution or company pages. Our goal is to discover, extract, and analyze the actors of this field, their positions, their relationships, and their influence. All this data is particularly well suited to automatic processing.
The WebStand approach is based on the use of a semi-structured temporal XML content warehouse to store the data, and graphically generated XQueries to analyze it. Let us stress that our warehouse aims to cover the whole Extract Transform and Load (ETL) scope of a sociological application. Our goal in this short paper is to focus on the architecture and temporal model of our application, briefly present the modules already developed, and give some sociological results that illustrate the sort of information that we can calculate easily.
The WebStand architecture is shown in Figure 1. WebStand is implemented in Java and runs on the JDBC-compatible MonetDB-XQuery. Although we use the MonetDB-XQuery database to store the data, in some cases where queries cannot be run on it (such as queries using temporal functions) we use Saxon-B to compute the result.
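The fallback between the two engines can be pictured with the following minimal sketch. It does not call MonetDB or Saxon-B: `store_engine` and `processor_engine` are hypothetical stand-ins, and the `temporal:` prefix used to mark unsupported queries is an assumption made purely for illustration.

```python
class UnsupportedQuery(Exception):
    """Raised by an engine that cannot evaluate a given query."""


def run_with_fallback(query, primary, fallback):
    """Try the primary engine first; use the fallback only if it refuses the query."""
    try:
        return primary(query)
    except UnsupportedQuery:
        return fallback(query)


def store_engine(query):
    # Hypothetical stand-in for the database engine: it rejects any query
    # that uses the (assumed) "temporal:" function prefix.
    if "temporal:" in query:
        raise UnsupportedQuery(query)
    return f"store:{query}"


def processor_engine(query):
    # Hypothetical stand-in for the secondary XQuery processor.
    return f"processor:{query}"


plain = run_with_fallback("count(//email)", store_engine, processor_engine)
temporal = run_with_fallback("temporal:valid-at(//function)", store_engine, processor_engine)
print(plain)
print(temporal)
```

Routing on an explicit exception keeps the choice of engine out of the query-building code, so the same generated query can be submitted unchanged to either engine.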
The global use case is the following: a social scientist defines the concepts he is interested in, choosing from already existing concepts (such as person or email) that can be extended with his own. This sociological model is (for the moment manually) translated into an XML Schema, used to store information extracted from the web by the acquisition modules. This XML Schema is also used to help the sociologist generate graphical queries, using a QBE-like interface developed in our visualization and query tool. We used QBE rather than XQBE [1] due to the widespread use of Microsoft Access by the sociologists we work with, but we are considering alternative query interfaces based on XQBE. WebStand also provides simple XSL to export XML data in many formats used in the sociology world, although in the foreseeable future we expect these applications all to be compatible with a simple XML format.
A preliminary study using our tool on 8 public mailing lists related to XQuery and XML Schema has been performed. We are currently working on analyzing the data provided by all the public mailing lists of the W3C working groups.
Information used to create Table 4 was entered by hand using our temporal model detailed in section 4. We are currently in the process of automating the extraction of authoring information from the versions (from WD to REC) of one W3C technical report found on the Web.
Other results produced by our system are social graphs, which indicate common participation in a thread, and answering profiles, which indicate the other list participants with whom a given person preferentially discusses. We cannot provide them here because these graphs take up too much space, but we give one example in the Appendix, and we refer to [4] for more examples of these graphs.
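As an illustration of the first kind of graph, the sketch below counts, for each pair of participants, the number of threads in which both posted. The input layout (a mapping from a thread identifier to the list of senders) and the sample names are assumptions made for the example, not the format used by WebStand.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: thread identifier -> senders of the messages in that thread.
threads = {
    "thread-1": ["alice", "bob", "alice", "carol"],
    "thread-2": ["bob", "carol"],
    "thread-3": ["alice"],
}


def co_participation(threads):
    """Count, for every pair of participants, the threads they both posted in."""
    edges = Counter()
    for senders in threads.values():
        # A participant counts once per thread, whatever the number of posts.
        for pair in combinations(sorted(set(senders)), 2):
            edges[pair] += 1
    return edges


graph = co_participation(threads)
for (a, b), weight in sorted(graph.items()):
    print(f"{a} -- {b}: {weight} shared thread(s)")
```

The resulting weighted edge list can be handed to any graph drawing tool; threads with a single participant contribute no edge.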
The data stored in the warehouse respects a given sociological schema that is generated by our tool (i.e., for the moment, we support person, institution, and email schemas). This data has specific temporal aspects, due to its inherent sociological nature. First of all, any information in the database (e.g., John Doe is XML Corp. CEO) is temporal. We want to store the fact that John became CEO at a certain point in time t1 (2001-01-01) and changed position at time t2 (2003-12-31). Moreover, this information was given to us at time t3 (2004-06-06).
To represent this, we add two TemporalInformation nodes to the data, as shown in Figure 3b. The event element conveys the semantics attached to the temporal interval. This allows our model to capture the traditional validity time or transaction time aspects [5], but also to be fully flexible (any type of custom event can be defined).
If we receive new and contradictory information, for instance if we learn at date t4 (2005-4-10) that John left his job at t2' (2003-11-30), this is simply captured by adding a new temporal annotation to the function node, annotating this annotation to indicate the validity time's validity, and changing the value of the previous validity node (no example is shown due to space limitations).
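To make the annotation scheme more concrete, the sketch below builds the John Doe example with two temporal annotations and checks whether the function is valid on a given day. Only the `TemporalInformation` and `event` names come from the text; the other element names (`person`, `name`, `function`, `title`, `institution`, `start`, `end`), the event labels `validity` and `transaction`, and the reading of t2 as 31 December 2003 are assumptions, since the figure is not reproduced here. The correction scenario at t4 is not covered.

```python
import xml.etree.ElementTree as ET
from datetime import date


def add_temporal(parent, event, start=None, end=None):
    """Attach a TemporalInformation node carrying an event label and an interval."""
    node = ET.SubElement(parent, "TemporalInformation")
    ET.SubElement(node, "event").text = event
    if start is not None:
        ET.SubElement(node, "start").text = start.isoformat()
    if end is not None:
        ET.SubElement(node, "end").text = end.isoformat()
    return node


# Hypothetical layout of the annotated fact "John Doe is XML Corp. CEO".
person = ET.Element("person")
ET.SubElement(person, "name").text = "John Doe"
function = ET.SubElement(person, "function")
ET.SubElement(function, "title").text = "CEO"
ET.SubElement(function, "institution").text = "XML Corp."

# Validity interval [t1, t2], and the date t3 at which the fact was recorded.
add_temporal(function, "validity", date(2001, 1, 1), date(2003, 12, 31))
add_temporal(function, "transaction", date(2004, 6, 6))


def valid_on(function_node, day):
    """Return True if some validity annotation of the node covers the given day."""
    for info in function_node.findall("TemporalInformation"):
        if info.findtext("event") != "validity":
            continue
        start_text = info.findtext("start")
        end_text = info.findtext("end")
        # A missing bound is treated as an open-ended interval.
        start = date.fromisoformat(start_text) if start_text else date.min
        end = date.fromisoformat(end_text) if end_text else date.max
        if start <= day <= end:
            return True
    return False


print(ET.tostring(person, encoding="unicode"))
print(valid_on(function, date(2002, 6, 1)))
```

Because the meaning of each interval is carried by the `event` label rather than by a fixed pair of attributes, a custom event type can be added without changing the structure of the annotation.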
Temporal data is accessed using XQuery, and we are currently implementing temporal query plan optimizations over MonetDB, which does not currently support temporal data. We are also extending our sociological XML model to include sourcing and quality (i.e., where we found the information, and how much credit to give it). Current work also involves automatically extracting temporal information from technical reports, cross-referenced with authors and their affiliation at the time the report was published, and storing it in XML using our temporal model.
In this short paper, we present a brief overview of the architecture and functionalities of the WebStand platform and give some brief results of a study of the W3C. For more details on the sociological results, we refer to [4]. Our current experience shows that using XML and XQuery through simple graphical interfaces makes XQuery more accessible to novice users, such as sociologists. The flexibility of our temporal model has allowed us to capture all the data collection situations that we have encountered so far, and it is our belief that such software can be used in various other sociological applications to analyse behaviors.
The corpus we focused on consists of 20,697 emails posted over the course of 4 years (from April 2002 to April 2006) by about 3000 different "physical" people (i.e., after grouping emails together based on our heuristics, we identified 2923 different "entities"), analyzed according to activism on the lists and their participation in the writing of working drafts or recommendations. These emails originated from approximately 2000 different domains (institutions or Internet Service Providers; our heuristics led us to 2076 different domains). Any query can be run on this data; we show here simple aggregate results that illustrate simple yet nonetheless valuable participation information. Table 1 illustrates activism within the W3C. It contains anonymized data showing the number of posts made by a single person: the top poster made 1077 different posts.
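The aggregates reported here (posts per person, number of distinct domains) amount to simple counting once emails have been grouped into entities. The sketch below shows that counting on a made-up sample; the grouping heuristics themselves are not described in this paper, so the entity identifiers are given as hypothetical input rather than computed.

```python
from collections import Counter

# Hypothetical sample: (sender address, entity id assigned by the grouping step).
emails = [
    ("jdoe@xmlcorp.example", "person-1"),
    ("john.doe@isp.example", "person-1"),
    ("asmith@univ.example", "person-2"),
    ("jdoe@xmlcorp.example", "person-1"),
]

# Number of posts per grouped entity (the kind of activism figure shown in Table 1).
posts_per_entity = Counter(entity for _, entity in emails)

# Distinct sender domains, taken as the part of the address after the last "@".
domains = {address.rsplit("@", 1)[1].lower() for address, _ in emails}

top_entity, top_posts = posts_per_entity.most_common(1)[0]
print(f"{len(posts_per_entity)} entities, {len(domains)} domains")
print(f"top poster: {top_entity} with {top_posts} posts")
```

Counting by entity rather than by address matters here: the same person posting from two addresses would otherwise appear as two separate, less active participants.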
This work is partially funded by the French ANR-JCJC-05 "WebStand".
We selected 28 technical reports in the recommendation process that appeared in the discussions on the list.