Title: Good practices for a literature survey are not followed by authors while preparing scientific manuscripts
ArXiv ID: 1005.3063
Date: 2010-05-17
Authors: D. R. Amancio, M. G. V. Nunes, O. N. Oliveira, L. da F. Costa
📝 Abstract
The number of citations received by authors in scientific journals has become a major parameter to assess individual researchers and the journals themselves through the impact factor. A fair assessment therefore requires that the criteria for selecting references in a given manuscript should be unbiased with respect to the authors or the journals cited. In this paper, we advocate that authors should follow two mandatory principles to select papers (later reflected in the list of references) while studying the literature for a given research: i) consider similarity of content with the topics investigated, lest very related work should be reproduced or ignored; ii) perform a systematic search over the network of citations including seminal or very related papers. We use formalisms of complex networks for two datasets of papers from the arXiv repository to show that neither of these two criteria is fulfilled in practice.
📄 Full Content
The advance of knowledge is founded on, and critically dependent on, the broad dissemination of novel approaches and results, which allows other scientists and practitioners to analyze reported results to validate and complement their investigations. The primary objective of any scientific publication is therefore to be read, tried, and cited by as many people as possible. Indeed, articles have been evaluated in terms of the citations they motivate, while journals are typically rated according to impact factors reflecting the number of citations to their articles. Citations have been a major factor since the 1920s [1], and subjects such as this are now analyzed in scientometrics, which studies the relationship between areas of knowledge and the evolution of science [2]. Of course, the success of a paper in being read and cited varies enormously owing to several factors, including the renown of the journal and the eminence of the authors and their institutions. Strictly speaking, such success should depend not only on the quality, originality, completeness and clarity of a specific paper, but also on its degree of relationship and overlap with the investigation being reported. Indeed, all papers strongly similar or related to a current investigation should be read, and potentially cited. However, given the limited time any researcher has for searching and reading, related works must somehow be filtered using limiting criteria. Though unavoidable, this implies that potentially important publications are overlooked [3,4], which may undermine the efficiency of the whole system, in the sense that painstaking, costly efforts are repeated or ignored.
We take the view that little attention has been given to the procedures of selecting publications for guiding the research and preparing a list of references [5]. In this paper we suggest two criteria for such a selection. The first is that similar, strongly related works should be selected, and the second is that the authors should do a systematic search over related publications and their citations. We check whether these criteria are fulfilled by using complex networks [6,7,8] and natural language processing formalisms (for the use of complex networks in natural language processing, see [9,10,11,12]). Two datasets containing 700 articles each from the arXiv‡ repository, for the areas of complex networks and genetics, were used to obtain two networks for each area: (i) the traditional citation networks, where each article is a node and citations become directed edges between them; and (ii) a network obtained by the overlap between the contents of pairs of articles. These networks are henceforth referred to as citation and overlap, being directed and undirected, respectively. Concerning the overlap network, each article was modeled as a complex network in order to extract the relations of similarity. The model used (see methodology), which basically connects adjacent words after a pre-processing step, was chosen because of its success in other studies on Natural Language Processing, such as automatic text assessment [13], automatic summarization strategies [14] and automatic machine translation assessment [15]. After defining these two networks one may quantify the number of: (a) articles which are related and cited;
‡ http://arXiv.org
(b) articles that are related but not cited; and (c) articles that are loosely related but are cited nonetheless. We shall show that the analysis of these numbers indicates that the similarity criterion for selecting references is not obeyed. We also perform a random walk through the citation networks to simulate a systematic search by an author, whose results are used to infer that the second criterion is not obeyed either. In addition to discussing the possible causes and implications of these results, we suggest a virtual citation approach for complementing the relationships between articles, which gives rise to a virtual scientometry.
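The word-adjacency model mentioned above (connecting adjacent words after a pre-processing step) can be sketched as follows. The specific pre-processing here (lowercasing, punctuation stripping, a tiny stopword list) is an illustrative assumption, not the paper's exact pipeline:

```python
from collections import defaultdict

def word_adjacency_network(text, stopwords=None):
    """Build a word-adjacency network from a text: after a simple
    pre-processing step, each remaining word becomes a node and
    each pair of consecutive words is linked by a weighted edge
    (the weight counts how often the pair co-occurs)."""
    stopwords = stopwords or {"the", "a", "of", "and", "is"}
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in stopwords]
    weights = defaultdict(int)
    for w1, w2 in zip(words, words[1:]):
        weights[(w1, w2)] += 1
    return dict(weights)

# Toy example: stopwords drop "the"/"of", leaving
# [citation, network, citation, graph] as adjacent words.
net = word_adjacency_network("The citation network of the citation graph")
```

Pairwise overlap between two articles could then be measured by comparing the edge sets of their networks; the exact similarity measure is defined in the paper's methodology section.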
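The systematic search over the citation network can be simulated with a random walk along outgoing citation edges, in the spirit of an author following references from paper to paper. This minimal sketch uses a toy citation dictionary and a simple stopping rule, both of which are assumptions for illustration rather than the paper's exact procedure:

```python
import random

def citation_random_walk(citations, start, max_steps=10, seed=0):
    """Simulate an author following references: from the current
    paper, move to a randomly chosen cited paper, stopping when a
    paper with no outgoing citations (within the dataset) is
    reached or max_steps is exhausted. Returns the visited papers
    in order."""
    rng = random.Random(seed)
    visited = [start]
    current = start
    for _ in range(max_steps):
        refs = citations.get(current, [])
        if not refs:
            break
        current = rng.choice(refs)
        visited.append(current)
    return visited

# Toy citation network: paper -> list of papers it cites.
toy = {"A": ["B", "C"], "B": ["C"], "C": []}
walk = citation_random_walk(toy, "A")
```

Comparing the papers reachable by such walks with the papers actually cited is one way to probe whether authors traverse the citation network systematically.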
In our experiments, the relationships (similarity and citation) between two articles are modeled as complex networks. A network is defined as a data structure comprising a set of nodes linked by edges. The sets of nodes and edges can be represented as a matrix W_ij, where the presence of an edge of weight p between two nodes i and j implies W_ij = p, and the absence of an edge implies W_ij = 0. If there is no order distinction in linking two nodes (i → j is the same as j → i), then W_ij = W_ji always holds. If two nodes are connected by an edge, they are said to be adjacent. If two edges are associated with the same node, they are called adjacent edges. A sequence of adjacent edges defines a walk over the network, and the length of a walk is defined as its number of edges. The networks were built using a corpus comprising 700 articles about complex networks (or scale-free networks) and 700 articles about genetics from the arXiv publications base. The ar
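The definitions above (weight matrix, adjacency, walk length) can be made concrete with a small sketch; the three-node network is an arbitrary example:

```python
import numpy as np

# Weight matrix for an undirected network with 3 nodes:
# W[i][j] = p encodes an edge of weight p between nodes i and j,
# and W[i][j] = 0 encodes the absence of an edge. Symmetry
# (W == W.T) reflects that i -> j and j -> i are not distinguished.
W = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]])

def are_adjacent(W, i, j):
    """Two nodes are adjacent if an edge connects them."""
    return W[i][j] != 0

def walk_length(edges):
    """The length of a walk is its number of edges."""
    return len(edges)

# Nodes 0 and 1 are adjacent; edges (0,1) and (1,2) share node 1,
# so they are adjacent edges and form a walk of length 2.
```

A directed citation network would simply drop the symmetry requirement, so that W_ij = 1 when article i cites article j, without implying W_ji = 1.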