What country, university or research institute, performed the best on COVID-19? Bibliometric analysis of scientific literature

Annals of Data Science https://doi.org/10.1007/s40745-022-00406-8 1 3 Abstract In this article, we conduct data mining and statistical analysis on the most eec - tive countries, universities, and companies, based on their output (e.g., produced or collaborated) on COVID-19 during the rst wave of the pandemic. Hence, the focus of this article is on the rst wave of the pandemic. While in later stages of the pandemic, US and UK performed best in terms of vaccine production, the focus in this article is on the initial few months of the pandemic. The article presents nd - ings from our analysing of all available records on COVID-19 from the W eb of Science Core Collection. The results are compared with all available data records on pandemics and epidemics from 1900 to 2020. This has created interesting nd - ings that are presented in the article with visualisation tools. Keywords COVID-19 · Pandemic · Epidemic · Disease · V irus · Statistics · Computation · Data mining · Big data analytics 1 Introduction Since the COVID-19 pandemic started, we have seen an increasing number of sci - entic research articles, on a wide variety of related topics to disease, pandemics, Received: 6 October 2020 / Revised: 5 April 2022 / Accept ed: 9 April 2022 © The Author(s) 2022 What c ountry , university , or research institute , per formed the best on C OVID-19 during the ﬁrst wav e of the pandemic? Bibliometric analysis of scientiﬁc liter ature – analy sing a ‘ snapshot in time’ of the ﬁrst wav e of COVID-19 P etarRadanliev 1 · DavidDeRoure 1 · Rob Walton 1 · Max VanKleek 2 · OmarSantos 3 · La ’ T reallMaddo x 3  Petar Radanliev petar .radanliev@eng.ox.ac.uk; petar .radanliev@oerc.ox.ac.uk 1 Department of Engineering Sciences, University of Oxford, Oxford, England, UK 2 Department of Computer Science, University of Oxford, Oxford, England, UK 3 Cisco Research Centre, Research T riangle Park, North Carolina, USA Annals of Data Science 1 3 viruses, etc. Some of these topics are closely related to technological advancements and data sciences solutions. For example, the research on tracking and monitoring of cases, is closely related to digital solutions, e.g. mobile apps. In this study , we use W eb of Science data records from a ‘snapshot in time’ (pub - lished until 16th of May 2020) with a computable statistical method, to investigate the correlations between, dierent scientic research records on the COVID-19 pan - demic. Apart from investigating the connections between dierent topics, we also investigate the data records for patterns in the response from dierent countries. Our objective is that by investigating individual responses, we can provide scientic insights on specic organisations performances, e.g. the W orld Health Organisation (WHO) speed of response. There are topics that we consider beyond the scope of this study , such as the concerns on the origin of the disease. Our aim is to provide statisti - cal analysis, that can assist other researchers in answering these dicult questions. W ith the global focus on the pandemic, the data records are changing dramatically . Since research data records are often categorised by year and not by months, it could be challenging for researchers to nd scientic data and to model, with precision, the research response at dierent stages of the pandemic. W e considered this study to be of signicant relevance because it provides statistical results that can be seen as a snapshot in time. Our rationale was based on the fact that at the time of the ‘snapshot’ the pandemic had been in existence for a few of months, and the scientic per -review process last few months. Hence, the data that we collected during the ‘snapshot’, is from research papers that have been produced at the very beginning of the pandemic. W e applied semi-automatic and automatic analysis of big data, to extract unusual and unknown patterns, from data records on COVID-19, published until 16th of May 2020 . W e analysed 3094 data records, which constituted all data records in existence at the time of the snapshot - from the W eb of Science Core Collection on COVID- 19. T o compliment this, we conducted a second analysis of 138,624 historical data records from the W eb of Science Core Collection, on pandemics and epidemics, covering the time period from 1900 to 2020. W e used the historical data records, to compare with the current scientic research on COVID-19, and we use quantita - tive analysis to derive unexpected conclusions on the speed of response, from the most prominent organisations in pandemic research. In the investigation, we applied cluster analysis, anomaly detection analysis, association rule mining, and sequential pattern mining, among other methods. The ndings of this study are presented in groupings of data records, and categorisations of patterns from the input data, which can be used or reproduced in future studies for predictive analytics, e.g. forecasting, monitoring and management of future pandemics. 1.1 Research questions Our objectives are to use computable statistical methods, to conduct bibliometric data mining on scientic research records and to answer some emer ging questions on COVID-19. In the study , we investigate: 1. What country produced the most research papers on Covid-19 since the pan - demic started? Annals of Data Science 1 3 2. What universities and companies are publishing research on Covid-19? 3. Which countries/universities collaborated most in research papers on Covid-19? After identifying the answers to these research questions, we focus on a new set of research questions: 4. What country produced the most research papers on pandemic and epidemics from 1900 to 2020? 5. What universities and companies have published most research on pandemic and epidemics from 1900 to 2020? 6. Which countries/universities collaborated most in research papers on pandemic and epidemics from 1900 to 2020? W e use a variety of statistical methods (e.g. three-elds plot, factorial analyse, his - torical analysis, network map analysis, etc.) to compare the ndings from these questions. 1.2 Disc ussion on data scienc e Data science consists of (1) design for data; (2) collection of data; and (3) analysis on data; and can be described as ‘synthetic concept to unify statistics, data analysis and their related methods’ [ 1 ]. Data science practitioners apply integrated techniques to analyse real-world big data problems. Some of the integration concepts of big data analytics and/or data science include ‘multi-criteria optimization for learning, expert and rule-based data analysis, support vector machines for classication, feature selection, data stream analysis, learning analysis, sentiment analysis, link analysis, and evaluation analysis’ [ 2 ]. Data mining practical applications in various elds (e.g., nancial analysis, credit management, health insurance, network intrusion detection, internet services analysis) are often enhanced with optimisation techniques, such as (1) Support V ector Machines for Classication; (2) LOO Bounds for Support V ec - tor Machines; (3) Unsupervised and Semi-supervised Support V ector Machines; (4) Feature Selection via l p - Norm Support V ector Machines; (5) Multiple Criteria Programming; etc. [ 3 ]. Data science is strongly represented in business data mining [ 4 ], for real-time decision making with a combination of internet-of-things (IoT) and articial intelligence (AI) technologies [ 5 ]. More recently , the Covid-19 pandemic has been analysed with various data science tools, e.g., 1 for outbreak prediction of the top 10 highly and densely populated countries, using Auto-Regressive Moving A verage [ 6 ]; e.g., 2 for ‘age-specic social contact characterization. of the underly - ing transmission patterns’ [ 7 ]. The data mining in this article is more closely related to ‘culture vs. policy’ [ 8 ] with the aim of promoting more global collaborations to combatting global pandemics with technological solutions. 1.2.1 Data mining vs. data analy sis In this study , we dierentiate our data mining approach from data analysis. W e con - sider data analysis to be related to testing the eectiveness of specic models or Annals of Data Science 1 3 hypothesis. W e dierentiate this from the data mining in our study , which we con - sider as using computerised statistical models to uncover interesting, unusual and unknown patterns from big data. Therefore, any reference to analysis in this study , e.g. historical analysis, bibliometric analysis, etc., refers to data mining and not data analysis. In addition, we understand that our data mining is based on keywords which were representative of the pandemic in the ‘snapshot in time’ that we analysed. W ith time, these keywords will change and evolve, and the characteristics of the future analysis should also evolve with the characteristics of the new data records. Nevertheless, this evolution will happen in the future, and with this article, we wanted to preserve the information as it was during the ‘snapshot in time’ which was taken during the rst wave of the Covid-19 pandemic – snapshot was taken in May 2020 and represents the time period from December 2019 when Covid-19 emerged, to May 2020 when the rst wave ended – although, there might be various dierent interpretations as to the exact end date of the rst wave. 2 Litera ture revie w W e conducted a brief literature review , to identify the current gaps in knowledge and to structure our research questions around these gaps. W e found a related study on scientometric analysis of COVID-19 and coronaviruses [ 9 ]. Hence, we structured our questions on bibliometric analysis of COVID-19, compared with historic data on pandemics and epidemics. The main dierences in this article in the approach. Scien - tometric analysis is focused on the performance of dierent authors, or journals. The bibliometric analysis in this article is focused on analysing national responses, insti - tutional outputs, and the correlations between research ndings. Similar research is present from March 2020 [ 10 ], and presents analysis of 564 data records. Since then, the number of scientic research data records has increased to 3094. In addition, we use dierent statistical methods in our data mining and visualisation, which enables us to compare the COVID-19 analysis, with 138,624 historical data records on pan - demics and epidemics. This is signicantly dierent that a bibliometric analysis of 564 data records. The third study we reviewed to structure our research questions, was a study from March 2020, based on 183 data records from PubMed and analysed Identied and analysed the title, author, language, time, type and focus [ 11 ]. T o dif - ferentiate our focus on looking at the same problem, from a dierent aspect, we used W eb of Science data records, and we focused on clustering, classication, associa - tion, regression, summarisation, and anomaly detection. Prior to conducting this review study , we consulted earlier articles on bibliometric analysis and review on articial intelligence in health care [ 12 ], on roles and research trends analysis with bibliometric mapping analysis and systematic review [ 13 ], and on the role of bibliometric and review in dierent research areas e.g., in operations environment [ 14 ]. The innovation of the bibliometric analysis in this paper is the categorisations of one research keyword (Covid-19) in a separate analysis with its main research area (pandemics and epidemics). This innovation enables the review to derive with two Annals of Data Science 1 3 postulates on what country , university , or research institute, performed the best on COVID-19 during the rst wave of Covid-19. The postulates are analysed in great debt with bibliometric analysis of scientic literature from a ‘snapshot in time’ of the rst wave of Covid-19. 3 Methodology In this study , we applied computable methods for statistical analysis, including R Studio, ‘Biliometrix’ package [ 15 ], and VOSviewer [ 16 ]. T o extract big data from established scientic databases, we used the W eb of Science Core Collection, which contains data records from over 21,100 peer-reviewed, high-quality scholarly jour - nals published worldwide, in over 250 disciplines 1 . 3.1 Data mining on COVID-19 Data mining represents a process of discovering new knowledge from patterns in big data. Usual methods applied include a combination of machine learning and statis - tics, on analysing big database systems. Data mining is considered a research eld that combines computer science and statistics, in designing intelligent methods for extracting new information and for knowledge discovery from existing databases. The data mining in this study involved data management, data pre-processing, model inference and complexity considerations, postprocessing of discovered results considerations, visualisation, and interestingness metrics. 4 Bibliometric analysis Bibliometric analysis, or bibliometrics, is the practice of using statistical methods to analyse research publications, books, articles, and other scientic communications. In this bibliometric analysis, we extracted data records from the W eb of Science Core Collection, and we analysed the records with three dierent data mining tools, (1) the W eb of Science analyse results built-in tool; (2) data mining with VOSviewer; (3) data mining with the R Studio ‘Bibliometrix’ package. 4.1 Data records The rst search for data records was on the W eb of Science Core Collection. W e searched for all data records on COVID-19 and we extracted 3094 data records (search performed on the date: 17th of May 2020). W e also conducted a second search for TOPIC: (pandemic) OR TOPIC: (epidemic), which resulted with 138,624 data records. Both data sets were analysed with the W eb of Science analyse results built- in tool. Only the smaller data set was analysed with computerised statistical analysis, using the R Studio ‘Bibliometrix’ package. This was not by choice, but because of 1 https://clarivate.com/webofsciencegroup/solutions/web-of-science-core-collection/ . Annals of Data Science 1 3 practicality . The W eb of Science has data extraction limit of 500 records, to download the 3094 data records, we extracted 7 dierent les, and we merged the les using the ‘Sublime T ext’ program. T o repeat this process on 138,624 data records, we would need to extract 277 separate data les, and merge into one. This was considered tire - some and not practical. Instead, for the VOSviewer data mining, we used the W eb of Science tool to identify the top 1000 most relevant data records and we used this data set as representative sample of the 138,624 data records. Only the 1000 most relevant data records are used for the VOSviewer data mining on pandemic and epidemics. 4.2 Automatic data mining using the w eb of Science analyse results built-in t ool T o analyse all data records available on the W eb of Science Core Collection, we used the built-in result analysis tool. First, we categorised the data records in researcher areas Fig. 1 . From Fig. 1 , we can see that current research is focused on the medical aspect of COVID-19. There is very little scientic research on the digital aspect of monitoring and managing the pandemic. Other relevant research areas are also missing, such as guidance on privacy preserving mobile app design for pandemic management, the role of internet of things in pandemic management, philosophical perspective on long term societal changes caused by the pandemic. In the rst wave of the pandemic, the focus seems to have been predominantly on the medical aspects. Learning from this result, we can conclude that all other research areas become secondary in the immedi - ate threat of pandemics - death. Therefore, scientic research on these topics should be ongoing and constantly advancing, in anticipation of similar pandemics happening Fig . 2 W eb of Science result analysis tool – research by country Fig . 1 W eb of Science result analysis tool – research areas Annals of Data Science 1 3 in the future, without notice. T o analyse if such preparations were happening in the past, we analyse the data records on COVID-19, and we compare the results with a historical analysis of data records on pandemic management. In Fig. 2 , we categorise the data records by country . From Fig. 2 , we can see that most scientic research is happening in the US and China, followed by the UK and Italy . W ould be interesting to compare these results after few weeks, and see if the countries produce more output as the infec - tions spreads. W e would suggest focusing on India, because of how the virus spread. W ith time, we could see the output from India increasing, if we are correct in our assumption that output increases as countries are faced with the deadly pandemic. The leading countries in Fig. 2 , are some of the worst aected countries at the time we collected our data records. Although Spain and Iran are also in the hardest hit countries, the scientic research from these two countries is not showing as strong. Therefore, it is indicative, but not conclusive that countries that are most aected, are also most productive in terms of scientic research. T o advance this analysis, we categorised the data records by organisations (enhanced) in Fig. 3 . What becomes visible from the categorisation in Fig. 3 , is that among the most reputable universities, which usually predominate such categori - sations, we now have Wuhan University , where the pandemic originated (was rst detected). The organisations (enhanced) in Fig. 3 categorises the 3094 data records, to include research from associated organisations. W e compare the 3094 data records on COVID-19, with the second data le on pandemics and epidemics records from 1900 to 2020, containing 138,624 data records in Fig. 4 . Fig . 4 W eb of Science result analysis tool – research on pandemics and epidemics by organisation (enhanced) Fig . 3 W eb of Science result analysis tool – research on COVID-19 by organisation (enhanced) Annals of Data Science 1 3 What becomes clear when we compare the two classications from Figs. 3 and 4 , is that some of the best performing universities on COVID-19, are not even present on the list of best performing research or ganisations on global pandemics and epi - demics from the historical analysis. This indicates that there is either a global shift in scientic research, or the early aected regions e.g. W uhan, have been most produc - tive in scientic research on COVID-19. The second seems more likely . Since the Figs. 3 and 4 are classifying organisation-enhanced categories, to get a dierent perspective on organisations own research production, we did a second categorisation of the 3094 data records, by organisations own research Fig. 5 . By categorising the organisations own research, we present a dierent result from the same data records. In the simple categorisation Fig. 5 , we can see that Chines universities are currently in the lead, and we can also see that University of T eheran is also working on this topic, and its much higher ranked in terms of productivity from the previous categorisations - top performing organisation enhanced categorisation in Fig. 3 . When we compare the Fig. 5 - which is visualising the 3094 data record le, with Fig. 6 - which is visualising the 138,624 data record le, we can see a further conr - mation that the top performing institutions by output on COVID-19 (Fig. 5 ), are not representative of the best performing research institutions (Fig. 6 ). This could signify that the world, for unclear reasons, was slow in responding with scientic research on COVID-19. W e could speculate that the world didn’t take COVID-19 seriously , or that Chinese knew something that the rest of the world didn’t, but we have no data to conrm such speculations. What we can conrm with certainty , is that the Chinese Fig . 6 W eb of Science result analysis tool – research on pan - demics and epidemics by organ- isation (simple categorisation) Fig . 5 W eb of Science result analysis tool – research on COVID-19 by organisation (simple categorisation) Annals of Data Science 1 3 research institutes acted much faster than the rest of the world, including the leading research organisation on pandemics and epidemics. Finally , the last categorisation in this part of the analysis, we investigated the sci - entic research published by funding agencies Fig. 7 . What we can see in the categorisation by funding (based on the 3094 data record le) in Fig. 7 , is that China is in the lead, but the US has more distributed funding programme, and if we sum up all the funding, we could get a dierent result. What is surprising however , is the weak performance of EU funding agencies. There are only 6 data records from the EU funds. When we compare the Fig. 7 - which is visualising the 3094 data record le, with Fig. 8 - which is visualising the 138,624 data record le, we can see that the or ganisa - tions that have historically provided most of the funding on pandemics and epidem - ics, are not in the lead. Since COVID-19 is a global pandemic, the classications in Fig. 7 , should be similar with Fig. 8 . The results are very dierent, and it is uncertain why the global response was so much slower than the Chinese response. But when we compare the institutions in Fig. 5 , with countries that got most aected in the early stages of COVID-19, we can see the connection between countries aected early , and increased data records. Such assumptions from the categorisations, based on the automatic data mining using the W eb of Science analyse results built-in tool, are speculative. W e need more specic data mining methods to analyse this data records further . In the following section, we apply semi-automated data mining to look for association rule learning, anomaly detection, and regression to accompany and enhance our cluster - ing and classication. Fig . 8 W eb of Science result analysis tool – research by fund - ing agencies on pandemics and epidemics (historic records) Fig . 7 W eb of Science result analysis tool – research on COVID-19 by funding agencies Annals of Data Science 1 3 Fig . 10 Density visualisation by country of the top 1000 most relevant data records on pandemic and epidemic Fig . 9 T op 1000 most relevant data records on pandemic and epidemic - visualisation by country Annals of Data Science 1 3 4.3 S emi-automa tic data mining with VOSviewer W e continued our data mining with using a computerised statistical analysis, using the VOSviewer computer program. From the 138,624 data records on pandemic and epidemics on the W eb of Science Core Collection, for the VOSviewer data mining, we used the top 1000 most relevant data records and we considered this data set as representative sample of the 138,624 data records. W e exported two separate text les and we used the two les for the data mining with VOSviewer . In Fig. 9 , we can see the VOSviewer visualisation by country and collaborations between countries. In VOSviewer , we can select specic relationships of one country , and we can zoom in the image for more detailed data mining. It is however relatively easy to identify the US, England, Australia and China as the leading countries in the top 1000 historical data records on pandemics and epidemics. The Figs. 9 and 10 are both based on Lin/Log modularity normalisation. W e conducted normalisation by association strength, and by fractionalisation normali - sation, but the Lin/Log modularity normalisation presented better visualisation of the research collaborations between countries. For data mining on collaborations between institutions, we used associated strength normalisation, and since we wanted to investigate the collaborations, we set a limit on data records that included collabo - rations, this reduced our data records from 1000 to 90 analysed in Fig. 11 . Although it’ s dicult to see in the Fig. 11 visualisation, the VOSviewer identi - ed 8 clusters, with the US universities predominating the biggest two clusters, and Chinese universities appearing in the third cluster . W e continued our data mining in Fig . 11 Collaborations between institutions on pandemics and epidemics historically and globally - nor - malisation based on association strength Annals of Data Science 1 3 the next section with using a computerised statistical analysis, using the R Studio ‘Bibliometrix’ package. 4.4 S emi-automa tic data mining with R Studio ‘Bibliometrix’ package Since the W eb of Science has data extraction limit of 500 records, to download the 3094 data records, we extracted 7 dierent les, and we merged the les using the ‘Sublime T ext’ program. Then we downloaded the le in the ‘Bibliometrix’ package, ‘Biblioshiny’ function. Our data mining was based on association rule and cluster - ing, using thee-elds plots, factorial analysis, collaboration network, conceptual map design, etc. The rst graph we present (Fig. 12 ) is based on association rule, investigating the relationship between variables, e.g. from all records on COVID-19, we used asso - ciation rule to determine which other keywords are related in research, like SARS, infection, virus, etc. Apart from association rule, to design the three-elds plot in Fig. 12 , we also applied clustering to discover and associate data records by countries of origin. The three-elds plot in Fig. 12 , is similar to the research by country using the W eb of Science result analysis tool, in Fig. 2 . The dierence in the visualisation is that in Fig. 12 we can see the keywords associations between data records from individual country . While in Fig. 2 , we can only see classications of data records by country . T o nd a regression function, that estimates the relationship between data records, with the smallest amount of error, we developed a collaboration network map (Fig. 13 ), using country in the network parameters, with equivalence normalisation, in a circle network layout, using Louvain clustering algorithm and the minimum number of edges set at 2. Fig . 12 Three-elds plot of classication by country , research area and research keywords from all records on COVID-19 Annals of Data Science 1 3 Although very detailed, the collaboration network in Fig. 13 seems a bit cluttered. T o present a better visualisation of the data records, in Fig. 14 , we kept the same parameters, but we reduced the minimum number of edges at 7. Since density is the proportion of present edges from all possible edges in the collaboration network, in Fig. 14 , we can see the strongest collaboration networks, in edge connections and colour coding. Just to clarify these connections in the col - laboration network map, edge density equals number of edges divided by maximal number of edges. Hence, an edge density in Fig. 14 is dened of overlapping and weighted in graph communities. However , it is possible that edge variations in mul - tiple keywords mainly reect the variations in few underlying keywords. Hence, in Fig. 15 , we applied factorial analysis as a statistical method to identify joint key - words in response to unnoticed (concealed) keywords. The parameters we applied in the factorial analysis (Fig. 15 ), included ‘multi- correspondence analysis’, with eld of analysis being the keywords of the records, with automatics clustering and a maximum number of terms 50. In Fig. 15 we describe variability among the correlated keywords with potentially lower number of unobserved keywords (factors), aiming to identify independent latent keywords. In Fig . 13 Collaboration network map by Country in a circle network layer with a minimum of 2 edges Annals of Data Science 1 3 other words, we wanted to reduce the number of keywords in the data records. Our objective was to nd the latent factors that create a commonality in the data records, and we applied factorial analysis because it is a statistical method that can identify smaller number of underlying variables, within large numbers of observed variables 2 . The factorial analysis derives two classications of keywords (in Fig. 15 ). The classication in blue, represents keywords like management, care, exposure, response, therapy , health, impact, risk, etc. The classication in red, represents more specic keywords, like respiratory syndrome, functional receptor, acute respiratory syndrome, etc. What we can see in the Fig. 15 conceptual structure map, is factorial analysis of 3094 records, presenting classication of common keywords from all data records, in two classications. 5 Discussion The most interesting ndings from this study was that institutions that are estab - lished as leaders in scientic research on pandemic and epidemics, have responded much slower than organisations that are located in the areas where COVID-19 rst 2 https://en.wikipedia.org/wiki/Factor_analysis . Fig . 14 Collaboration network map by Country in a circle network layer with a minimum of 7 edges Annals of Data Science 1 3 occurred. W uhan University is currently very high in the classication of scientic research on COVID-19. In the alternative classication, using historical records on most prominent organisations in this eld, the Wuhan and T eheran Universities are not present in the statistical classication. This triggers many questions on have the leading organisations on pandemic and epidemic management reacted as the appro - priate speed? Is so, why are they behind in the production of scientic journal? Has there been a gap in communications and data sharing? Leaving these organisations oblivious to what was happening? Or was it that our global alert mechanisms failed to act? Did we ignore the warning signs? These are just some of the emer ging ques - tions from this study . Until these questions are answered, conspiracy theories would continue to spread. W ith the COVID-19 pandemic slowing down, we should also be seeking answers to these questions to prevent a second wave, and most importantly , to prevent the same mistakes happening in future pandemics. Our research ndings can serve as the background work and starting point for future studies on developing spatial indices, such as large spatio-temporal datasets, and multidimensional objects, for computer decision support systems based on arti - cial intelligence. Fig . 15 Factorial analysis - conceptual structure map Annals of Data Science 1 3 5.1 Disc ussion on results As we can see from Fig. 2 , the USA produced most scientic records on Covid-19 during the rst wave, followed by China and the UK. In terms of best performing university – during the rst wave, in Fig. 3 we can see that University of London was leading in the research eorts, followed by Harvard University and Huazhong University of Science T echnology . However , it is worth mentioning that University of London (similarly to Harvard and Huazhong) is an umbrella organisation that represents many dierent universities. W e tried to rectify this by separating the data by individual institutions in Fig. 4 and expanding the search to pandemics and epi - demics. While the University of California systems emerged as the best performing university on a global level, the data is still partially representative of the umbrella organisations. W e can see in Fig. 4 that University of London still appears on the list. Although Fig. 4 provides valuable insights on research by dierent institutions on pandemics and epidemics, we wanted to determine the best performing instruction on Covid-19 during the rst wave, without the umbrella organisation. In Fig. 5 , we man - aged to separate the data into individual organisations, and we can see that Huazhong University of Science T echnology produced most research on Covid-19 during the rst wave, followed by Wuhan University , Harvard Medical School, and University of T eheran Medical School. This changes the picture signicantly from the analysis in Fig. 3 . While it’ s dicult to conrm with certainty the connection between increased research output by individual institution, it is quite clear that the best performing institutions are based in countries / areas that were rst impacted by the rst Covid- 19 wave. It could be that these institutions were best preforming, because of the urgency and the severity of the impact – in the ‘snapshot in time’ analysed. In the next step of our analysis, we wanted to compare this (rst) postulate (we would need more date to call this a hypothesis) and we investigated if the same or ganisations would be expected to perform the best in an event of a global pandemic. In Fig. 6 , we analysed scientic data records from 1900 to 2020 on the topic of pandemics and epidemics and not on Covid-19 specically . The objective of this analysis was that grounded on the idea that the term ‘Covid-19’ was coined only after the pandemic occurred. In other words, this term (word) didn’ t exist before Covid-19 happened. Since this term didn’t exist as a word, it should not be present in scientic data records prior to 2019 (the actual term/word was announced by WHO in 2020). In Fig. 6 , we can see the analysis of the data records on pandemics and epidemics from 1900 to 2020, and it’ s quite clear that the organisations in Figs. 3 and 5 are not the same as the organisations in Fig. 6 (with the exception of Harvard University that preserved its second place). This supports the (rst) postulate and conrms that the organisations that performed best, are not the organisations that have traditionally performed best in this eld of research. The second postulate we present is that countries that got worst aected in the rst wave, invested most money in research on Covid-19. This can be seen from Fig. 7 , where the National Natural Science Foundation of China emer ges as the largest funder of research on Covid-19. W orth mentioning that the data in Fig. 7 is categorised by organisation and not categorised by nation, and we can see that mul - tiple organisations from China are in the top organisations that provided funding for Covid-19 research – during the rst wave. This categorisation was done to compare Annals of Data Science 1 3 the total research funding with or ganisations that are considered as largest funders in the more general eld of pandemics and epidemics, which are analysed in Fig. 8 . By comparing Fig. 7 with Fig. 8 , we can clearly see that the leading organisations didn’t allocate the most funding on Covid-19 during the rst wave. This conrms the second postulate - that the worst aected countries in the rst wave, invested most money in the initial research eorts on Covid-19. W e understand that further research is required to prove these postulates as hypothesis. Hence, we are making our data records available (in open access) for future researchers to use the data sets that we collected as a ‘snapshot in time’ from the rst wave of the Covid-19 pandemic. T o eliminate bias in our analysis, we continued our analysis with dierent biometrical tools and software. W e used the VOSviewer to present visualisations of the data records by country , with records mapping (in Fig. 9 ), by density (in Fig. 10 ), by collaborations (Fig. 11 ), with three-elds plot of classication by country , research area and research keywords (from all records on COVID-19) (Fig. 12 ), with a circle network of collaborations (Figs. 13 and 14 ), and with Factorial Analysis (Fig. 13 ). 6 Conclusions As the scientic research on COVID-19 continues to expand, the publications are becoming more fragmented, which creates challenges in navigating through the accumulation of new knowledge - on global pandemics. In this article, we present the results from bibliometric science mapping based on three dierent data mining methods. The process can be replicated by other scientist seeking to analyse research records from the rst response on the COVID-19 pandemic. W e found individual tools being restrictive, and we propose a multi-tool approach that enables faster results from statistical and graphical packages, aligned to bibliographical databases. W ith the use of these statistical methods, we presented visualisations of the research connections between areas and countries, on the emer ging patterns from national responses, and we provide scientic insights on the speed of response. Our aim was to provide statistical ‘snapshot in time’, and to assist other researchers to reassess the response in the initial stages of the pandemic and prepare for future global pandemics. In the article, we presented two conclusions: 1. The best performing institutions are based in countries / areas that were rst impacted (and most severely) by the rst Covid-19 wave, 2. Countries that got worst aected in the rst wave, invested most money in research on Covid-19 – during the rst wave. While there is signicant evidence for these conclusions to be conrmed in this arti - cle, we believe this topic will be further investigated and analysed for many years to come. Hence, we make our datasets publicly available (in open access), for other researchers to reuse in future analysis. There can be various interpretations in practice about these ndings. The fact remains that the world was not prepared for a global pandemic. The research insti - tutes that were expected to react as rst responders, didn’t respond as fast as the Annals of Data Science 1 3 institutes and organisations in the most aected areas. In the end, we have seen that organisations that were preparing for a Disease X event, produced the most output. But during the rst wave, most of the output was produced by organisations and insti - tutes that had access to data on the Covid-19 pandemic. This brings into question the value of sharing medical data (at speed and low latency) in preventing and managing future Disease X events. 6.1 Research limitations There are obvious limitations in interestingness metrics, such as lack of insights into negative relationships, lack of statistical base on COVID-19. In addition, since we can only present results that emer ge from the data, this study lacks an objective cri - terion for assessment. By lack of objective criterion, we refer to the lack of clearly dened research objectives, in specic terms that can be used to conrm if the terms of the objective criterion denitions are met. W e didn’t have a predened problem or a research question that we tried to answer , such as; is one country or organisation better than other . Instead, the visualisations in this article are representative of the statistical data records, as described in our search parameters, available on the 17th of May 2020. In the spirit of reproducible research, we include our data records in this submission. Supplementary information The online version contains supplementary material available at https://doi. org/10.1007/s40745-022-00406-8 . Acknowledgements Eternal gratitude to the Fulbright V isiting Scholar Project. Author contribution Dr Petar Radanliev: main author; Prof. Dave De Roure, Prof. Max V an Kleek: super - vision; Rob W alton, Omar Santos, La’Treall Maddox: review and corrections. Funding This work was funded by the UK EPSRC [grant number: EP/S035362/1] and by the Cisco Research Centre [grant number 1525381]. Availability of data and ma terials all data and materials included in the submission. Code A vailability N/A. Declarations Conﬂict of interest On behalf of all authors, the corresponding author states that there is no conict nor competing interest. This article is licensed under a Creative C ommons Attribution 4.0 International Lic ense, which permits use, sharing, adapta tion, distribution and reproduction in any medium or forma t, as long as you giv e appropriate credit t o the original author(s) and the source, pr ovide a link to the Creative Commons lic ence, and indicate if changes were made . The images or other third party material in this article are included in the article’ s Creativ e Commons licence, unless indicated otherwise in a credit line to the mat erial. If material is not included in the article’ s Cr eative C ommons licence and your int ended use is not permitted by statutory regulation or ex ceeds the permitted use, you will need to obtain permission dir ectly from the copyright holder . T o view a copy of this licence, visit http://crea tivecommons .org/licenses/by/4.0/. Annals of Data Science 1 3 Referenc es 1. Hayashi C (1998) “What is Data Science ? Fundamental Concepts and a Heuristic Example,” pp.40–51, 2. Shi Y (2022) Advances in Big Data Analytics. Springer Singapore 3. Jain L, W u, Xindong Advanced Information and Knowledge Processing .Springer 4. Olson DLouis, Shi Y , Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin New Y ork 5. T ien JM (2017) “Internet of Things, Real-Time Decision Making, and Articial Intelligence,” Ann. Data Sci. , vol. 4, no. 2, pp. 149–178, Jun. 6. Khakharia A, Shah V , Jain S, Shah J, Tiwari A, Daphal P , W arang M, Mehendale N (2021) “Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning,” Ann. Data Sci. , vol. 8, no. 1, p. 1, Mar . 7. Liu Y , Gu Z, Xia S, Shi B, Shi ZXiaoN, Liu, Jiming (2020) “What are the underlying transmission patterns of COVID-19 outbreak? An age-specic social contact characterization,”EClinicalMedic ine, vol. 22, 8. Li J, Guo K, V iedma EHerrera, Lee H, Liu J, Zhong N (2020) Autran Monteiro Gomes, Luiz Flavio., … Shi, Y ong, “Culture versus Policy: More Global Collaboration to Eectively Combat COVID- 19,” Innov . , vol. 1, no. 2, p. 100023, Aug. 9. Haghani M, Bliemer, Michiel CJ, Goerlandt F , Li J (2020) “The scientic literature on Coronavi - ruses, COVID-19 and its associated safety-related research dimensions: A scientometric analysis and scoping review .,”Saf. Sci., p.104806, 10. Chahrour M, Assi S, Bejjani M, Nasrallah AA, Salhab H, Fares MY , Khachfe, Hussein H (2020) “A Bibliometric Analysis of COVID-19 Research Activity: A Call for Increased Output,” Cureus , vol. 12, no. 3, Mar . 1 1. Lou J, T ian SJ, Niu SM, Kang XQ, Lian HX, Zhang LX, Zhang JJ (2020) Coronavirus disease 2019: A bibliometric analysis and review . Eur Rev Med Pharmacol Sci 24(6):3411–3421 12. Guo Y , Hao Z, Zhao S, Gong J, Y ang F (2020) “Articial intelligence in health care: Bibliometric analysis,” J. Med. Internet Res. , vol. 22, no. 7, p. e18228, Jul. 13. Hwang Gwo-Jen, T u Y -F (2021) “Roles and Research T rends of Articial Intelligence in Mathemat - ics Education: A Bibliometric Mapping Analysis and Systematic Review ,” Mathematics , vol. 9, no. 6, p. 584, Mar . 14. Dhamija P , Bag, Surajit “Role of articial intelligence in operations environment: a review and bibliometric analysis,” TQM Journal , vol. 32, no. 4. Emerald Group Publishing Ltd., pp. 869–896, 06-Mar-2020 15. Aria M, Cuccurullo, Corrado (2017) “bibliometrix: An R-tool for comprehensive science mapping analysis,” J. Informetr . , vol. 11, no. 4, pp. 959–975, Nov . 16. van Jan N, W altman Ludo, “Software survey: VOSviewer, a computer program for bibliometric mapping.” Publisher’ s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional aliations.

What country, university or research institute, performed the best on COVID-19? Bibliometric analysis of scientific literature

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment