Retrieval of very large numbers of items in the Web of Science: an exercise to develop accurate search strategies


This communication presents a simple exercise aimed at solving a specific problem: retrieving extremely large numbers of items through the Web of Science interface. As is well known, the Web of Science interface returns at most 100,000 items for a single query. But what about queries whose results exceed 100,000 items? The exercise develops one possible way to achieve this objective. The case study is the retrieval of the entire scientific production of the United States in a specific year. Different subsets of items were retrieved using the Source field of the database. A simple Boolean statement was then constructed to eliminate overlap and improve the accuracy of the search strategy. The importance of teamwork in the development of advanced search strategies is also noted.


💡 Research Summary

The paper addresses a practical problem that many researchers encounter when using the Web of Science (WoS) platform: the system’s hard limit of 100,000 records returned for any single query. This restriction becomes a serious obstacle when the research goal is to retrieve a nation’s entire scientific output for a given year, which often exceeds several hundred thousand items. The authors present a step‑by‑step exercise that demonstrates how to overcome this limitation by partitioning the search space, executing multiple sub‑queries, and then recombining the results while eliminating overlaps.

The case study focuses on the United States’ scientific production for a specific year. The authors chose the “Source” field (journal title) as the primary partitioning attribute because journal titles are relatively evenly distributed across the alphabet and can be used to create logical groups that stay under the 100,000‑record ceiling. For example, they constructed a query such as “Source: (A* OR B* OR C* OR D* OR E* OR F*)” to capture all records from journals whose titles begin with those letters. After running the initial set of queries, they discovered that some groups (e.g., G‑L) still exceeded the limit. To further reduce the size of these groups, they added secondary filters such as document type or publication year, thereby bringing each sub‑query’s result set below the threshold.
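The partitioning step can be sketched programmatically. The snippet below is a minimal illustration, not the authors’ actual procedure: it assumes that `SO` is the advanced‑search field tag for Source and that the alphabet is split into fixed‑size groups of journal‑title initials; in practice group boundaries would be tuned so each sub‑query stays under the 100,000‑record ceiling.

```python
from string import ascii_uppercase

def partition_queries(group_size=6):
    """Split the alphabet into groups of journal-title initials and
    build one advanced-search clause (Source field) per group."""
    letters = list(ascii_uppercase)
    queries = []
    for i in range(0, len(letters), group_size):
        group = letters[i:i + group_size]
        clause = " OR ".join(f"{ch}*" for ch in group)
        queries.append(f"SO=({clause})")
    return queries

for q in partition_queries():
    print(q)
# first clause: SO=(A* OR B* OR C* OR D* OR E* OR F*)
```

A group that still exceeds the limit (such as G–L in the paper’s example) would be further split, or narrowed with secondary filters like document type or publication year.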

A critical challenge in this approach is the potential for duplicate records, because a single article may appear under multiple journal name variants or be indexed in more than one source field. To address this, the authors designed Boolean statements that systematically subtract overlapping records. The logic follows a “NOT” pattern: the results of the first group are taken, then any records that also appear in the second group are excluded, and so on for subsequent groups. By applying this sequential exclusion, they assembled a final, de‑duplicated dataset that closely matches the known total output for the United States in that year.
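The sequential‑exclusion logic can be mirrored in a few lines of code. This is a sketch of the idea only; the record identifiers (`UT:` accession numbers) and the helper name are illustrative, and in WoS itself the same effect is achieved with Boolean `NOT` clauses between saved result sets.

```python
def deduplicate_groups(groups):
    """Keep each record only in the first group where it appears,
    mirroring the sequential Boolean 'NOT' pattern: group 1 is kept
    whole, group 2 minus group 1, group 3 minus groups 1-2, and so on."""
    seen = set()
    deduped = []
    for group in groups:
        unique = [rec for rec in group if rec not in seen]
        seen.update(unique)
        deduped.append(unique)
    return deduped

g1 = ["UT:001", "UT:002"]
g2 = ["UT:002", "UT:003"]  # UT:002 overlaps with g1
print(deduplicate_groups([g1, g2]))
# the second group retains only UT:003
```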

The paper also highlights the importance of teamwork. The development of the search strategy, execution of the many queries, verification of results, and management of duplicate removal were distributed among specialists in information science, library science, and domain experts. Regular communication allowed the team to spot syntax errors early, monitor API call limits, and ensure that the final dataset met quality standards. The authors suggest that automation—using scripts written in Python or similar languages to call the WoS API—can dramatically reduce the manual workload and minimize human error when dealing with thousands of sub‑queries.
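The automation the authors suggest might look like the following sketch. It is not taken from the paper: `fetch_page` stands in for whatever client function wraps the WoS API (injecting it keeps the sketch testable without network access), and the 100‑record page size and fixed delay are assumed values standing in for the platform’s actual paging and rate limits.

```python
import time

def batch_offsets(total_records, page_size=100):
    """Yield (first_record, count) pairs covering one sub-query's results."""
    for start in range(0, total_records, page_size):
        yield start + 1, min(page_size, total_records - start)

def harvest(fetch_page, query, total_records, delay=0.5):
    """Page through one sub-query. fetch_page(query, first, count)
    is a caller-supplied function that performs the actual API call."""
    records = []
    for first, count in batch_offsets(total_records):
        records.extend(fetch_page(query, first, count))
        time.sleep(delay)  # crude throttle to respect per-second call limits
    return records
```

Running `harvest` once per sub‑query and feeding the results into the de‑duplication step would replace thousands of manual query executions with a single scripted loop.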

Limitations of the method are acknowledged. Relying on the “Source” field works well when journal titles are alphabetically balanced, but in fields where journals cluster around certain letters, additional partitioning criteria (e.g., ISSN ranges or subject categories) may be required. Moreover, journal name changes, multilingual titles, and the continual addition of new journals can complicate the grouping process, necessitating dynamic adjustments to the query sets.

In conclusion, the authors provide a reproducible, low‑cost solution for extracting very large bibliometric datasets from WoS despite the platform’s built‑in record cap. Their approach—divide, query, exclude duplicates, and recombine—offers a template that can be adapted to other large‑scale retrieval tasks, such as global citation analyses, policy impact studies, or comprehensive meta‑research projects. By sharing both the technical details and the collaborative workflow, the paper contributes a valuable methodological tool to the bibliometrics community.

