LScDC-new large scientific dictionary

LScDC-new large scientific dictionary
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In this paper, we present a scientific corpus of abstracts of academic papers in English – Leicester Scientific Corpus (LSC). The LSC contains 1,673,824 abstracts of research articles and proceeding papers indexed by Web of Science (WoS) in which publication year is 2014. Each abstract is assigned to at least one of 252 subject categories. Paper metadata include these categories and the number of citations. We then develop scientific dictionaries named Leicester Scientific Dictionary (LScD) and Leicester Scientific Dictionary-Core (LScDC), where words are extracted from the LSC. The LScD is a list of 974,238 unique words (lemmas). The LScDC is a core list (sub-list) of the LScD with 104,223 lemmas. It was created by removing LScD words appearing in not greater than 10 texts in the LSC. LScD and LScDC are available online. Both the corpus and dictionaries are developed to be later used for quantification of meaning in academic texts. Finally, the core list LScDC was analysed by comparing its words and word frequencies with a classic academic word list ‘New Academic Word List (NAWL)’ containing 963 word families, which is also sampled from an academic corpus. The major sources of the corpus where NAWL is extracted are Cambridge English Corpus (CEC), oral sources and textbooks. We investigate whether two dictionaries are similar in terms of common words and ranking of words. Our comparison leads us to main conclusion: most of words of NAWL (99.6%) are present in the LScDC but two lists differ in word ranking. This difference is measured.


💡 Research Summary

This paper introduces the Leicester Scientific Corpus (LSC), a large-scale collection of 1,673,824 English abstracts from research articles and conference papers indexed by the Web of Science (WoS) for the publication year 2014. Each abstract is annotated with at least one of 252 subject categories and citation counts. Using this corpus, the authors construct two lexical resources: the Leicester Scientific Dictionary (LScD) and its core subset, the Leicester Scientific Dictionary‑Core (LScDC).

The LScD contains 974,238 unique lemmas derived after tokenisation, stemming, and removal of standard stop‑words. For each lemma the dictionary records the number of documents in which it appears and its total frequency across the entire corpus. To create a more compact and informative resource, the authors apply a frequency threshold: any lemma appearing in ten or fewer documents (≤10) is excluded. This yields the LScDC, a core list of 104,223 lemmas. The threshold was chosen because roughly 60 % of the lemmas in LScD occur in only a single document, many of which are misspellings or otherwise non‑informative tokens. By discarding these rare items, the authors aim to reduce noise for downstream text‑mining algorithms while retaining the majority of meaningful academic vocabulary.

Both dictionaries are sorted by descending document frequency, providing a ready‑to‑use ranking for researchers interested in term importance across disciplines. The paper then compares LScDC with the New Academic Word List (NAWL), a well‑known academic vocabulary set comprising 963 word families extracted from a 288‑million‑token corpus that includes the Cambridge English Corpus, spoken academic corpora, and textbooks. After stemming NAWL entries, the authors obtain 895 unique lemmas. Of these, 891 (99.6 %) are present in LScDC, with only four NAWL lemmas missing. However, the ranking of shared terms differs markedly because NAWL’s source corpus blends general, spoken, and textbook language, whereas LScDC is derived exclusively from scholarly abstracts, leading to a higher proportion of domain‑specific terminology.

The authors discuss the strengths of their approach: the sheer size and disciplinary breadth of LSC, the transparent preprocessing pipeline, and the explicit rarity filter that enhances the utility of LScDC for quantitative meaning analysis. They also acknowledge limitations: the corpus is confined to a single year (2014) and to WoS‑indexed publications, potentially omitting recent trends and non‑English‑centric research. The stemming process may also collapse distinct lexical senses, and the reliance on document frequency alone does not capture term specificity within sub‑domains.

Future work is outlined to address these issues. Extending the corpus across multiple years and integrating additional bibliographic databases (e.g., Scopus, PubMed) would improve coverage and temporal relevance. Incorporating semantic network analysis, word embeddings, or information‑theoretic measures could enrich the representation of term meanings beyond raw frequencies. Finally, the authors suggest combining LScDC with general‑purpose lists such as the New General Service List (NGSL) to create a hybrid resource that balances academic specificity with broader language coverage.

In conclusion, the Leicester Scientific Dictionary‑Core provides a high‑coverage, discipline‑spanning lexical resource that aligns closely with established academic word lists while offering a more granular, frequency‑based view of scholarly vocabulary. Its public availability positions it as a valuable foundation for research on meaning quantification, document classification, clustering, and other natural language processing tasks within the scientific literature domain.


Comments & Academic Discussion

Loading comments...

Leave a Comment