The Generation of Large Networks from Web-of-Science Data
During the 1990s, one of us developed a series of freeware routines (http://www.leydesdorff.net/indicators) that enable the user to organize downloads from the Web-of-Science (Thomson Reuters) into a relational database and then to export matrices in various formats for further analysis (for example, co-author analysis). In the basic matrix format, each document is a case in a row, and the variables attributed to it occupy the columns. Hitherto, one limitation of this approach has been that relational databases typically impose an upper limit on the number of variables, such as 256 or 1024 columns. In this brief communication, we report a way to circumvent this limitation using txt2Pajek.exe, available as freeware from http://www.pfeffer.at/txt2pajek/.
💡 Research Summary
The paper addresses a practical bottleneck in the construction of large‑scale scientific networks from Web‑of‑Science (WoS) data. Since the 1990s, Leydesdorff and colleagues have offered a suite of freeware tools that organize downloaded WoS records in a relational database and export them as binary incidence matrices, where each row represents a document and each column a variable such as an author, institution, keyword, or cited reference. While this workflow is straightforward and works well for modestly sized datasets, relational database management systems (RDBMS) impose hard limits on the number of columns, typically 256 or 1024. Consequently, when dealing with multidisciplinary collaborations, large consortia, or bibliometric studies involving thousands of distinct authors, keywords, or classification codes, the column limit forces researchers either to truncate the variable set or to perform ad hoc aggregations, both of which lose information and can bias the resulting network analyses.
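The document‑by‑variable layout described above can be sketched as follows; all records and names here are hypothetical, for illustration only:

```python
# A minimal sketch (hypothetical data) of the document-by-variable layout:
# each row is a document, each column a variable such as an author or
# keyword, and a 1 marks that the variable occurs in that document.

# Hypothetical records: document id -> the variables extracted from it.
records = {
    "doc1": ["Smith J", "Jones A", "networks"],
    "doc2": ["Jones A", "bibliometrics"],
    "doc3": ["Smith J", "networks", "bibliometrics"],
}

# Collect the full variable set; in a real WoS download this can run into
# the thousands, exceeding typical RDBMS column limits (256 or 1024).
variables = sorted({v for vs in records.values() for v in vs})

# Build the binary incidence matrix: one row per document, one column per variable.
matrix = {
    doc: [1 if v in vs else 0 for v in variables]
    for doc, vs in records.items()
}

print(variables)
print(matrix["doc1"])
```

Each additional distinct author or keyword adds one more column, which is precisely where the RDBMS ceiling bites.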
To overcome this limitation, the authors propose txt2Pajek.exe, a free utility that converts plain‑text link data directly into the Pajek .net format. Pajek is a dedicated network‑analysis program capable of handling hundreds of thousands of vertices and millions of edges, thus eliminating the column‑count restriction inherent to relational databases. The proposed workflow consists of three main steps. First, the existing Leydesdorff routines extract the WoS records and write them to a CSV or TSV file in the conventional document‑by‑row, variable‑by‑column layout. Second, this file is reformatted to meet txt2Pajek's input specification: a simple delimited list in which each line records one document–variable link. During this step each variable is assigned a unique identifier, and non‑ASCII characters, spaces, and punctuation are sanitized to avoid parsing errors. Third, txt2Pajek is executed, producing a .net file (a "*Vertices" list followed by an edge section) that can be opened directly in Pajek, Gephi, UCINET, or any other tool that accepts the Pajek format.
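As a rough illustration of what the conversion amounts to, the following sketch writes a small document–variable edge list in the Pajek .net format by hand; the data, output file name, and sanitization rule are hypothetical, and in the actual workflow txt2Pajek performs this step on much larger inputs.

```python
# A minimal sketch: turn a two-column document-variable edge list into the
# Pajek .net format (a *Vertices list followed by an *Edges section).
# All data and the file name are hypothetical.
import re

edges = [("doc1", "Smith J"), ("doc1", "networks"), ("doc2", "networks")]

def sanitize(name):
    # Replace spaces and punctuation with underscores to avoid parsing errors.
    return re.sub(r"[^0-9A-Za-z]+", "_", name)

# Assign each distinct label a numeric identifier (Pajek numbers vertices from 1).
labels = []
index = {}
for pair in edges:
    for label in map(sanitize, pair):
        if label not in index:
            labels.append(label)
            index[label] = len(labels)

with open("network.net", "w", encoding="ascii") as f:
    f.write(f"*Vertices {len(labels)}\n")
    for i, label in enumerate(labels, start=1):
        f.write(f'{i} "{label}"\n')
    f.write("*Edges\n")
    for a, b in edges:
        f.write(f"{index[sanitize(a)]} {index[sanitize(b)]}\n")
```

The resulting network.net file can then be loaded in Pajek or Gephi like any converter output.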
The authors validate the approach with a case study involving roughly 10,000 articles and 5,000 distinct variables (authors, keywords, institutions, and cited references). Using a standard laptop, txt2Pajek transformed the full incidence matrix into a Pajek network in under five minutes, consuming only modest RAM. Subsequent analyses in Pajek demonstrated that standard network metrics (degree centrality, betweenness, clustering coefficient) and community‑detection algorithms (e.g., modularity optimization) could be applied without any loss of variable granularity. Visualization in Gephi revealed dense co‑authorship clusters and thematic keyword co‑occurrence structures that would have been invisible if the variable set had been trimmed to fit the RDBMS column limit.
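For instance, the simplest of the metrics mentioned above, degree, can be computed directly from such an edge list with no column limit involved; the links below are hypothetical:

```python
# A minimal sketch (hypothetical data): degree computed directly from a
# document-variable edge list, as one of the metrics applied after conversion.
from collections import Counter

edges = [
    ("doc1", "Smith_J"), ("doc1", "networks"),
    ("doc2", "Smith_J"), ("doc2", "bibliometrics"),
    ("doc3", "networks"),
]

# Each link contributes one degree to the document and one to the variable.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(degree["Smith_J"])  # the author "Smith_J" appears in two documents
```

Heavier analyses (betweenness, modularity) would be run inside Pajek or Gephi rather than by hand, but they operate on the same vertex and edge lists.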
The paper also discusses practical considerations. Because txt2Pajek operates on plain text, variable names containing spaces, special symbols, or non‑Latin characters must be pre‑processed, typically by replacing them with underscores or numeric codes; failure to do so can misalign vertices and edges. Moreover, the resulting networks can become extremely dense, especially when many documents share common keywords or references. The authors therefore recommend post‑processing steps such as edge‑weight thresholding, removal of isolated nodes, or network‑simplification techniques such as backbone extraction to improve readability. Despite these preprocessing requirements, the authors argue that the gain in data completeness and analytical flexibility outweighs the added effort.
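The edge‑weight thresholding recommended above can be sketched as follows: count how many documents each keyword pair co‑occurs in, then keep only pairs above a chosen threshold. The keyword lists and threshold are hypothetical.

```python
# A minimal sketch (hypothetical data) of edge-weight thresholding: keep only
# keyword co-occurrence links that appear in at least `threshold` documents,
# thinning an overly dense network before visualization.
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per document.
doc_keywords = [
    ["networks", "bibliometrics", "WoS"],
    ["networks", "bibliometrics"],
    ["networks", "WoS"],
]

# Edge weight = number of documents in which the two keywords co-occur.
weights = Counter()
for kws in doc_keywords:
    for pair in combinations(sorted(kws), 2):
        weights[pair] += 1

threshold = 2
kept = {pair: w for pair, w in weights.items() if w >= threshold}
print(kept)
```

Raising the threshold trims the long tail of incidental co‑occurrences while preserving the strongest thematic links.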
In conclusion, the integration of txt2Pajek into the existing Leydesdorff workflow effectively removes the column‑count ceiling imposed by relational databases, enabling researchers to retain the full richness of WoS metadata when constructing large‑scale bibliometric networks. The method is entirely free, platform‑independent, and can be executed on ordinary personal computers, making it accessible to a broad community of scholars. The authors suggest future extensions such as automated preprocessing pipelines, parallelized conversion for truly massive datasets (hundreds of thousands of records and tens of thousands of variables), and direct interfacing with cloud‑based storage solutions to further streamline the pipeline for big‑data bibliometrics.