Bibliometric and usage-based analyses and tools highlight the value of information about scholarship contained within the network of authors, articles and usage data. Less progress has been made on populating and using the author side of this network than the article side, in part because of the difficulty of unambiguously identifying authors. I briefly review a sample of author identifier schemes, and consider use in scholarly repositories. I then describe preliminary work at arXiv to implement public author identifiers, services based on them, and plans to make this information useful beyond the boundaries of arXiv.
Deep Dive into Author Identifiers in Scholarly Repositories.
Bibliometric and usage-based analyses and tools highlight the value of information about scholarship contained within the network of authors, articles and usage data. Less progress has been made on populating and using the author side of this network than the article side, in part because of the difficulty of unambiguously identifying authors. I briefly review a sample of author identifier schemes, and consider use in scholarly repositories. I then describe preliminary work at arXiv to implement public author identifiers, services based on them, and plans to make this information useful beyond the boundaries of arXiv.
Author Identifiers in Scholarly Repositories
Simeon Warner
Cornell Information Science and
Cornell University Library
Ithaca, NY 14850, USA
simeon.warner@cornell.edu
Submitted: 2009-10-09
Abstract
Bibliometric and usage-based analyses and tools highlight the value of information about scholarship
contained within the network of authors, articles and usage data. Less progress has been made
on populating and using the author side of this network than the article side, in part because of
the difficulty of unambiguously identifying authors. I briefly review a sample of author identifier
schemes, and consider use in scholarly repositories. I then describe preliminary work at arXiv to
implement public author identifiers, services based on them, and plans to make this information
useful beyond the boundaries of arXiv.
1
Context
In an ideal scholarly communication system there would be tools to browse, navigate, make recom-
mendations and assess influence based on the complete graph of all actors (people, collaborations,
institutions) and all communication artifacts (articles, comments, blog posts, usage data1). As a
shorthand I will call this complete graph the publication network. Contained within it are the famil-
iar citation, usage, co-authorship, and co-citation graphs. In recent bibliometric and usage-based
work, significant progress has been made with the artifact part of this graph (see, for example the
work of the MESUR project [3]). Much less progress has been made with the actor part of the
graph, in part because it is much harder to unambiguously identify authors than articles.
Consider table 1 which shows the most frequently occurring lastname, initial pairs in arXiv user
accounts. This illustrates one facet of the name disambiguation problem, namely that there are
many authors with the same name. This is compounded by inconsistent spellings, use of initials
or full first names, and even name changes. Within a single repository such as arXiv it is not
usually possible to accurately answer the question “show me all the articles by this Zhang, Y”.
In recent years there has been considerable work on unsupervised and supervised author name
disambiguation using many different heuristic, machine learning and clustering techniques, and
many different properties including co-authorship, citations and subjects/topics. While much better
than naive approaches, these techniques are still far from perfect.
In a recent Nature Correspondence, Raf Aerts asked “If it is possible to have DOIs for objects (or,
so they say, enough IPv6 addresses for every molecule on Earth), why is it so difficult to implement
1Logically usage data would be links between actors and artifacts. However, for historical, cultural and practical
reasons most usage data is treated as anonymous even though co-usage information may be extracted.
1
arXiv:1003.1345v1 [cs.DL] 6 Mar 2010
Lastname, Initial
Count
Zhang, Y
100
Lee, J
97
Wang, Y
89
Wang, J
84
Chen, Y
77
Kim, J
77
Wang, X
76
Lee, S
74
Kim, S
69
Liu, Y
69
Table 1: Most frequently occurring lastname, initial pairs in arXiv user accounts. There may be a
few duplicate accounts but this indicates that nearly 100 different people named “Zhang, Y” have
created user accounts at arXiv (as of May 2009).
DAIs [Digital Author Identifiers] for authors?” [1]. Raf had earlier hinted at part of the answer by
pointing out that he has more than one identifier in Scopus [6]. As we have already discussed, it is
difficult to mine existing data to disambiguate references to authors. The more fundamental part
of the answer is that it is much easier to create DOIs for articles when the one owner for an article
creates the one DOI for it and presents it with the article (ignoring the issue of multiple versions
of articles). As authors, we are not owned by a single authority and even if an identifier were
created for us at birth by the appropriate government, there would be significant privacy concerns
about using it for everything. Consider, for example, concerns over the uses and misuses of social
security numbers in the USA. While we want to link a single author’s works together, do we want
that identity to immediately link us to all other digital information about the private life of the
individual?
2
Author Identifiers
To illustrate the diversity of currently used author identifiers, table 2 shows several example schemes
used in the scholarly domain. A more detailed inventory is provided on the repinf wiki [5]. The
OpenID and ISNI schemes are not limited to the scholarly domain. OpenID is aimed primarily
at authentication, however, if it continues to see growing acceptance it may well be a useful open
system that repositories could use. It is not clear whether ISNI will develop into a widely used
system. The largest efforts to create author identifiers specifically for the scholarly domain, Scopus
Author Identifiers and ResearcherID, come from commercial entities and are clearly motivated by
the desire to provide improved services b
…(Full text truncated)…
This content is AI-processed based on ArXiv data.