Delineating Knowledge Domains in the Scientific Literature Using Visual Information

Sean T. Yang, University of Washington, Seattle, Washington, tyyang38@uw.edu
Po-shen Lee, University of Washington, Seattle, Washington, sephon@uw.edu
Jevin D. West, University of Washington, Seattle, Washington, jevinw@uw.edu
Bill Howe, University of Washington, Seattle, Washington, billhowe@cs.washington.edu

ABSTRACT

Figures are an important channel for scientific communication, used to express complex ideas, models and data in ways that words cannot. However, this visual information is mostly ignored in analyses of the scientific literature. In this paper, we demonstrate the utility of using scientific figures as markers of knowledge domains in science, which can be used for classification, recommender systems, and studies of scientific information exchange. We encode sets of images into a visual signature, then use distances between these signatures to understand how patterns of visual communication compare with patterns of jargon and citation structures. We find that figures can be as effective for differentiating communities of practice as text or citation patterns. We then consider where these metrics disagree to understand how different disciplines use visualization to express ideas. Finally, we further consider how specific figure types propagate through the literature, suggesting a new mechanism for understanding the flow of ideas apart from conventional channels of text and citations. Our ultimate aim is to better leverage these information-dense objects to improve scientific communication across disciplinary boundaries.

KEYWORDS

VizioMetrics, science of science, bibliometrics, scientometrics

ACM Reference format:
Sean T. Yang, Po-shen Lee, Jevin D. West, and Bill Howe. 2016. Delineating Knowledge Domains in the Scientific Literature Using Visual Information. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 10 pages.
DOI: 10.475/123 4

1 INTRODUCTION

Increased access to publication data has contributed to the emergence of the Science of Science (SciSci) as a field of study. SciSci studies metrics of knowledge production and the factors contributing to this production [14]. Citations and text are the primary data types for measuring influence and tracking the evolution of scientific disciplines in this field. Dong et al. [9] use citations to study the growth of science and observe the globalization of scientific development within the past century. Vilhena et al. [43] characterize cultural holes of scientific communication embedded in citation networks. However, among the studies in SciSci, the use of visualization has received little attention, despite being widely recognized as a significant communication channel within disciplines, across disciplines, and with the general public [28].

Humans perceive information presented visually better than information presented textually [35], owing to the highly developed visual cortex [44]. As a result, figures play a significant role in academic communication. The information density of a visualization or diagram allows it to represent complex ideas in a compact form.
For example, a neural network architecture diagram conveys an overview of the method used in a paper without requiring code listings or significant text. Moreover, the presence of a neural network diagram can be a better indicator that the paper involves the use of a neural network than any simple text feature such as the presence of the phrase "neural network."

Despite the importance of figures in the scientific literature, they have received relatively little attention in the SciSci community. Viziometrics [28] is the analysis of visual information in the scientific literature. The term was adopted to distinguish this analysis from bibliometrics and scientometrics, while still conveying the common objectives of understanding and optimizing patterns of scientific influence and communication. Lee et al. [28] have shown the relationship between visual information and the scientific impact of a paper. In this paper, we demonstrate that visual information can serve as an effective measure of similarity that can demarcate areas of knowledge in the scientific literature.

Different scientific communities use visual information differently, and one can use these differences to understand communities of practice across traditional disciplines and show how ideas flow between these communities.

We consider three hypotheses: H1) Sub-disciplines use distinguishable patterns of visual communication just as they use distinguishable jargon, H2) these patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph, and H3) by classifying and analyzing use of specific types of figures, we can track the propagation and popularity of certain ideas and methods that are difficult to discern using text or citations alone (e.g., inclusion of neural network diagrams suggests contributions of new neural network architectures).

To test these hypotheses, we extract over 5 million scientific figures from papers on arXiv.org, process the images into low-dimensional vectors, then build a visual signature for each field by clustering the vectors and computing the frequency distribution across clusters for each discipline. We use these signatures to reason about the similarity between fields, and compare these measures to prior work in understanding scientific community structure using text [43] and the citation graph [10, 43]. Citations and text have been used to circumscribe knowledge domains, but this is the first study that shows that figures can also delineate fields.

We compare the three resulting distance matrices pairwise using the Mantel test [32], a common statistical test of the correlation between two distance matrices. We find that the visual distance is moderately correlated to citation-based metrics (r = 0.706, p = 0.0001, z score = 5.103) and text-based metrics (r = 0.531, p = 0.0002, z score = 5.019). We also perform hierarchical clustering on all distance matrices to provide a qualitative comparison of the results, finding that the hierarchical structure of the fields largely agrees, but with some significant exceptions. We then consider pairs of fields that are visually distinct but similar in either text distance or citation distance, suggesting differences in the visual style of how ideas are presented. For example, we find that Computation and Language is visually distinct from other Computer Science disciplines despite being quite similar in citation distance, because the former includes far more tables of data.
Finally, we consider specific cases where the use of particular types of figures can indicate a common method or idea in a way that text and citation similarity do not. We conduct a case study on two popular types of visualizations: neural network diagrams and embedding visualizations used to show clusters. The analysis indicates that visualizations can be used to make inferences about concept adoption within scientific communities. We also observe that the figures reveal the uptake of neural networks earlier than citation analysis, since citation counts take years to accrue. With this case study, we show the significance of visualizations in the scientific literature, suggesting that the integration of figures into systems for bibliometric analysis, document summarization, information retrieval, and recommendation can improve performance and afford new applications. Our focus is on the scientific literature, but our methods are directly applicable to other domains, including patents, web pages [2], and news.

In this paper, we make the following contributions:

• We present a method for delineating scholarly disciplines based on the figures and visualizations in the literature.
• We compare this method to prior results based on citations and text and find that different fields and sub-disciplines exhibit discernible patterns of visual communication (H1).
• We find instances of fields that use similar jargon and cite similar sources, but are visually distinct, suggesting that visual patterns of communication are not redundant with other forms of communication (H2).
• We present a method for identifying specific figure types and show that the presence of these figures in a paper can be used to understand concept adoption and can serve as a potential marker for tracking the evolution of scientific ideas (H3).

2 RELATED WORK

Citations have been extensively studied and utilized as a measure of similarity among scientific publications.
Marshakova proposed co-citation analysis [33], which uses the frequency with which papers are cited together as a measure of similarity. Citations are also utilized to delineate the emerging nanoscience fields in [30, 47] and are applied to design recommendation systems [21]. However, citations only reveal structural information within the scholarly literature and ignore the rich content in the articles.

Text has also received significant attention for analyzing the connections among scientific disciplines and documents, especially in citation recommendation [20, 42]. Vilhena et al. [43] proposed a text-based metric to characterize the jargon distance between disciplines. However, the ambiguity and synonymy of text make text-based models less ideal [24].

Researchers have explored other aspects of a research paper for measuring the distance between disciplines. The frequency of mathematical symbols in papers is used to delineate fields by West et al. [45], but mathematical symbols are not as ubiquitous as other components.

Visual communication is a significant channel for conveying scientific knowledge, but is relatively less explored. A number of studies have focused on mining scientific figures. Chart classification was well studied by Futrelle et al. [15], Shao et al. [39], and Lee et al. [29]. Recent studies have focused on the extraction of quantitative data from scientific visualizations, including line charts [31, 40], bar charts [3], and tables [13]. Researchers have also investigated techniques to understand the semantic messages of scientific figures. Kembhavi et al. [22] utilized a convolutional neural network (CNN) to study the problem of diagram interpretation and reasoning. Elzer et al. [12] studied the intended messages in bar charts. Several visualization-based search engines have also been presented. DiagramFlyer [7], introduced by Chen et al., is a search engine for data-driven diagrams.
VizioMetrix [27] and NOA [6] are both search engines for scientific figures built on large scholarly datasets, but both work by examining the captions around the figures. We see visual-based models for demarcating knowledge domains as a next step in this area of research.

3 METHOD

3.1 Data

The data for this study comes from the arXiv. The arXiv is an open access repository for pre-prints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, systems science, and economics. The variety of disciplines allows consideration of information flow between fields, in contrast to more specialized repositories such as PubMed. There are 1,343,669 research papers, which include 5,009,523 figures, on arXiv through December 31st 2017.

Figure 1: Overall pipeline. Figures are mapped to vectors using ResNet-50, dimension-reduced, then organized into a histogram for each field. The distances between these histograms are used to infer relationships and information flow.

3.2 Processing Pipeline

Fig. 1 shows the pipeline to characterize scientific disciplines using visual information. Each step will be explained in the corresponding numbered paragraph.

3.2.1 Convert Figures Into Feature Vectors. We first embed each figure into a 2048-d feature vector using the pre-trained ResNet-50 [18] model. The figures are resized and padded with white pixels to be 224 x 224 before being embedded by the pre-trained ResNet-50. ResNet-50 was trained on the ImageNet [8] corpus of 1.2M natural images. Even though the model was trained on natural images, we find that the early layers of the network identify simple patterns (lines, edges, corners, curves) that are sufficiently general for the overall network to represent the combinations of edges and shapes that comprise artificial images as well.
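The resize-and-pad step can be illustrated with a small geometry helper. This is only a sketch: the text states that figures are resized and padded with white pixels to 224 x 224, but preserving the aspect ratio and centering the figure inside the padding are assumptions made here, and `fit_and_pad` is a hypothetical name. The actual pixel operations would be done by an image library, which is not shown.

```python
def fit_and_pad(width, height, target=224):
    """Scale the longer side to `target` and pad the shorter side evenly.

    Returns (new_width, new_height, pad_left, pad_top, pad_right, pad_bottom).
    Preserving the aspect ratio and centering are assumptions of this sketch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    # Clamp to [1, target] so rounding can never overshoot the canvas.
    new_width = min(target, max(1, round(width * scale)))
    new_height = min(target, max(1, round(height * scale)))
    pad_w = target - new_width
    pad_h = target - new_height
    pad_left, pad_top = pad_w // 2, pad_h // 2
    # Any odd leftover pixel goes to the right/bottom edge.
    return (new_width, new_height, pad_left, pad_top,
            pad_w - pad_left, pad_h - pad_top)


# A wide figure is shrunk to full width and padded above and below.
print(fit_and_pad(448, 224))
# A tall figure is shrunk to full height and padded left and right.
print(fit_and_pad(100, 300))
```

Padding with white rather than black matches the typical background of scientific figures, so the added border blends into the figure instead of introducing a strong artificial edge.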
Although we posit that a custom neural network architecture could be designed to incrementally improve performance on artificial images, we do not further consider that direction in this paper.

3.2.2 Dimension Reduction. We reduce the dimension of each figure vector using Principal Component Analysis (PCA). The high-dimensional vectors produced by ResNet-50 contain more information than is necessary for our application of computing the visual similarity between fields, and we seek to make the pipeline as efficient as possible. In addition, the ResNet model is pre-trained on natural images, whereas scientific figures contain far more white area, which makes their embedding vectors sparser than those of natural images. Distances tend to be inflated in high-dimensional space, reducing clustering performance [4]. We follow the typical practice of applying dimension reduction prior to clustering. Our original hypothesis was that a very low number of dimensions (10) would be sufficient to capture the differences between fields, but in our evaluation higher values (200+) produced stronger correlations with other methods of delineating fields. We considered different values of this parameter using a sample of 1.5M figures from the 5M figure corpus. The results of the experiment are presented in Section 5.1.

3.2.3 Cluster the Figure Corpus. The distribution of different types of figures carries significant information about how visual communication differs in each discipline and can further be used to represent each category. We cluster our figure corpus with K-Means clustering to aggregate similar figures. Although more advanced methods of clustering could provide better results, we aim to demonstrate that the approach can work even with very simple methods. The objective of this paper is to show the utility of the figures for potential applications, rather than to propose a specialized framework for a specific task. The experimental results are shown in Section 5.1.
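Once every figure carries a cluster label, the per-discipline distribution over clusters can be computed as a normalized histogram, and two disciplines can be compared by the distance between their histograms. The following is a minimal sketch of that idea using only the standard library. The function names, the toy labels, and the field names "A" and "B" are all hypothetical; the cluster labels are assumed to come from an earlier K-Means run that is not shown.

```python
from collections import Counter
from math import sqrt


def visual_signatures(cluster_labels, disciplines, k):
    """Build one normalized k-bin histogram of cluster labels per discipline."""
    counts = {}
    for label, field in zip(cluster_labels, disciplines):
        counts.setdefault(field, Counter())[label] += 1
    signatures = {}
    for field, counter in counts.items():
        total = sum(counter.values())
        # Clusters a field never uses get a share of 0.0.
        signatures[field] = [counter[c] / total for c in range(k)]
    return signatures


def euclidean(a, b):
    """Euclidean distance between two equal-length signatures."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data: eight figures, four clusters, two hypothetical fields.
labels = [0, 0, 1, 2, 1, 1, 3, 3]
fields = ["A", "A", "A", "A", "B", "B", "B", "B"]
sigs = visual_signatures(labels, fields, k=4)
print(sigs["A"], sigs["B"])
print(euclidean(sigs["A"], sigs["B"]))
```

Normalizing each histogram to sum to one keeps large disciplines from appearing distant from small ones merely because they contain more figures.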
3.2.4 Visual Signatures for Each Discipline. We cluster the figures with k = 4 centroids and generate a normalized histogram over clusters for each discipline; this histogram is the visual signature of the discipline. After the visual signature of each discipline is generated, we calculate the Euclidean distance between each pair of disciplines. We evaluate the computed visual similarity between disciplines by comparing it to the citation-based and text-based metrics described in previous work, which are explained in Section 4.

3.3 Classifying Figure Types

In this section, we describe the process used to train the classifier that identifies specific figure types, which we will use to understand how the use of particular styles of visualizations and diagrams propagates through the literature. We consider two specific examples: neural network diagrams (associated with the rapid increase of neural network methods in the literature) and clustering plots (associated with the use of unsupervised learning). Examples of these visualizations are shown in Figure 2.

Figure 2: Examples of a neural network diagram and an embedding visualization. (a) An example of a neural network diagram. The diagram is borrowed from the AlexNet paper [23]. (b) An example of an embedding visualization. The plot is borrowed from the MultiDEC paper [46].

Sethi et al. [38] characterize six different figure types used to depict neural network architectures. We label 10,651 figures from arXiv, which include 1,503 neural network diagrams, 1,057 embedding visualizations, and 8,091 negative examples. For neural network diagrams, we label them according to the taxonomy suggested by Sethi et al. [38], but we exclude figures in table format. We consider a figure to be an embedding visualization if it is used to visualize the distribution of the data representation. The annotators make use of images and captions to label the images. We extract visual features from the fully connected layer of a ResNet-50 [18] model, which is pre-trained on the 1M ImageNet dataset [8]. The figures are resized to 224x224 and a 2048-d numeric vector is acquired for each figure. The labeled image set is then split into training, validation, and test sets with an 8:1:1 ratio to train a deep neural network (DNN) classifier. We tune the depth of the model, the dimension of the layers, the dropout rate, the learning rate, the decay ratio, and the number of training epochs. The architecture of the final model is shown in Figure 3 and the implementation details are shown in Table 1.

Figure 3: The architecture of the neural network diagram and embedding visualization classifier.

Table 1: Implementation details for training the neural network diagram and embedding visualization classifier.

Learning Rate    Decay    Epoch    Batch Size    Loss
0.001            0.001    150      256           Categorical Cross Entropy

4 COMPARISON WITH CITATION- AND TEXT-BASED METHODS

We use the Mantel test [32], a standard statistical test of the correlation between two matrices, to compare visual distance with the distance matrices created by (1) average shortest citation distance [10, 43] and (2) natural language jargon distance [43]. Citations and text have been extensively analyzed and employed to measure the similarity among research articles, and both measures have had success in information retrieval and recommendation systems for scholarly documents. Therefore, we consider citation distance as the benchmark for our task and text distance as an alternative comparison.

4.1 Average Shortest Citation Path

We compute the average shortest path between each pair of fields as a measure of similarity. Average shortest path [10] is one of the three most robust measures [5] of network topology, in addition to its clustering coefficient and its degree distribution.
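A minimal sketch of this measure between two fields, using breadth-first search on an unweighted graph, is given below. Several choices are assumptions of the illustration rather than details from the paper: the citation graph is stored as an undirected adjacency dictionary, unreachable paper pairs are skipped rather than penalized, and the names `bfs_distances` and `average_shortest_path` are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop counts from `source` to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_shortest_path(adjacency, field_i, field_j):
    """Mean shortest-path length over all reachable (v_i, v_j) pairs."""
    total, pairs = 0, 0
    for v_i in field_i:
        dist = bfs_distances(adjacency, v_i)
        for v_j in field_j:
            if v_j in dist:  # assumption: unreachable pairs are ignored
                total += dist[v_j]
                pairs += 1
    return total / pairs if pairs else float("inf")


# Toy citation chain a - b - c - d, split into two hypothetical fields.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(average_shortest_path(graph, ["a", "b"], ["c", "d"]))
```

Running one BFS per paper in the first field costs time proportional to the number of edges each time, so on a graph with millions of edges a sampled subset of source papers would likely be needed in practice.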
Vilhena et al. [43] used this method to measure distance in the citation network to compare with their text-based metric. Average shortest path is computed as follows:

$$D_{ij} = \frac{1}{n_i n_j} \sum_{n_i} \sum_{n_j} d(v_i, v_j)$$

where $n_i$ is the number of vertices in field $i$ and $n_j$ is the number of vertices in field $j$. The average shortest path between field $i$ and field $j$, $D_{ij}$, is the average of the shortest paths $d(v_i, v_j)$ over all vertex pairs $v_i$ and $v_j$.

Our citation graph is obtained from the SAO/NASA Astrophysics Data System (ADS) [11], a digital library portal maintaining three bibliographic databases containing more than 13.6 million records covering publications in Astronomy and Astrophysics, Physics, and the arXiv e-prints. Citations in ADS [1] are created by first scanning the full text of the paper to retrieve a bibcode for each reference string in the article, then computing the similarity score between the ADS record and the bibcode. A citation pair is generated if the similarity is higher than a threshold. This data has been extensively used in several bibliographic studies [16, 25]. There are 14,555,820 citation edges within our arXiv data corpus.

4.2 Jargon Distance

We also compare our results to text metrics based on cultural information as represented by patterns of discipline-specific jargon. Jargon distance was first proposed by Vilhena et al. [43], where the authors quantitatively measure the communication barrier between fields using n-grams from full text.
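Measures of this kind compare the word distribution of one field against that of another. As a rough sketch, assume the quantity of interest is the ratio of the entropy of field i's word distribution to the cross entropy between the distributions of fields i and j. Add-one smoothing over the shared vocabulary is an assumption introduced here so that the logarithm is always defined, and `word_distribution` and `entropy_ratio` are hypothetical names. How the ratio is finally turned into a distance is left open.

```python
from collections import Counter
from math import log2


def word_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def entropy_ratio(tokens_i, tokens_j, alpha=1.0):
    """H(X_i) / Q(p_i || p_j); equals 1.0 when both fields use words alike."""
    vocabulary = set(tokens_i) | set(tokens_j)
    p_i = word_distribution(tokens_i, vocabulary, alpha)
    p_j = word_distribution(tokens_j, vocabulary, alpha)
    entropy = -sum(p * log2(p) for p in p_i.values())
    cross_entropy = -sum(p_i[w] * log2(p_j[w]) for w in vocabulary)
    if cross_entropy == 0.0:  # single-word vocabulary: nothing to decode
        return 1.0
    return entropy / cross_entropy


# Toy unigram streams from two hypothetical fields.
biology = ["gene", "protein", "cell", "cell"]
computing = ["graph", "node", "edge", "cell"]
print(entropy_ratio(biology, biology))
print(entropy_ratio(biology, computing))
```

Because cross entropy is never smaller than entropy, the ratio lies in (0, 1] and is generally asymmetric: swapping the writer and reader fields can give a different value.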
The jargon distance ($E_{ij}$) between field $i$ and field $j$ is defined as the ratio of (1) the entropy $H$ of a random variable $X_i$ with a probability distribution of the jargon or mathematical symbols within field $i$ and (2) the cross entropy $Q$ between the probability distributions in field $i$ and field $j$:

$$E_{ij} = \frac{H(X_i)}{Q(p_i \,\|\, p_j)} = \frac{-\sum_{x \in X} p_i(x) \log_2 p_i(x)}{-\sum_{x \in X} p_i(x) \log_2 p_j(x)}$$

Imagine a writer from field $i$ trying to communicate with a reader from field $j$. The writer has a codebook $P_i$ that maps the natural language or mathematical symbols to codewords that the reader has to decode using the codebook $P_j$ from field $j$. A small jargon distance means high communication efficiency between the two fields, indicating that they are closely related. This metric can easily be applied to natural language jargon to explore how communication varies through these two channels across disciplines. We compute the jargon distance between two different disciplines by applying the metric to unigrams from abstracts.

5 RESULTS

We show that the distance between visual signatures can be used to determine the overall relationships between fields in a manner similar to prior methods, but that this approach also exposes information that prior methods cannot. In Section 5.1, we present the experimental results on picking the number of dimensions and clusters. In Section 5.2, we show the capacity of visual distance to reveal the relationships across scientific disciplines by showing global agreement between visual distance and citation distance (H1). In Section 5.3, we examine each cluster to understand the visual composition and find that each cluster is dominated by a certain type of visualization, extending prior work in the life sciences that used coarse-grained labeling of figure types [28].
In Section 5.4, we show that citation distance and visual distance disagree in certain cases, and consider one case in particular (H2). Finally, in Section 5.5, we consider cases where the presence of a particular type of figure can indicate the use of a method or concept in a way that text and citation similarity do not (H3). We demonstrate that the figures in the scientific literature can serve as an indicator of concept adoption that travels faster than citation count.

5.1 Choosing the number of dimensions and clusters

Our pipeline involves two hyperparameters: the number of dimensions to retain via PCA and the number of clusters to assume when constructing visual signatures. We determine these parameters experimentally. The results of our analysis of PCA dimensions appear in Table 2. The explained variance ratio shows the percentage of variance explained by the selected components. The variance explained grows only marginally after 256 components. The average correlation with citation distance shows the average of the correlations between visual distance and citation distance across all the numbers of centroids k (from 2 to 30). We evaluate our method by conducting the Mantel test [32] to compare the correlation between visual distance and citation distance. It confirms our hypothesis that the correlation increases when more components are used, but it converges once sufficient information is preserved. The maximum correlation to citation distance shows the maximum correlation at the specified dimension among the different numbers of centroids k, and the k yielding the maximum correlation is shown in "Maximum at k = ?". Surprisingly, with low-dimensional figure vectors the maximum correlation occurs at a larger number of centroids. Our interpretation is that a low-dimensional space does not preserve sufficient information.

We ran a second experiment to determine the number of centroids k.
Initially, we expected the correlation with other measures to be higher using larger values of k, since the diversity of figures in the literature appears vast. However, considering k = 100, 200, and 400, we found that larger values of k generate lower correlations with citation distance (correlation coefficient around 0.4), due to overfitting to rare, low-confidence clusters. Lowering k to the range of 2 to 30 performed better; these results appear in Table 2. The relatively low values of k suggest that there are relatively few modalities of visual communication in use across fields. The maximum correlation occurred at k = 4 in most of the experiments. We further discuss the interpretation of these results in Section 5.3.

5.2 Delineating Disciplines

In this section, we demonstrate the ability of visual distance to characterize the relationships between fields, quantitatively and qualitatively. Quantitatively, we conduct the Mantel test [32] with the Spearman rank correlation method to compare two different distance matrices and reveal the similarity between the two structures. Qualitatively, we perform hierarchical clustering using the UPGMA algorithm [36] to visualize the hierarchical relationships across disciplines. Vilhena et al. [43] used a similar technique to qualitatively visualize how disciplines are delineated, but the data they used was from JSTOR, which focuses on biological science and social science, so their results are not comparable with our task.

Table 3 shows the correlation results between different distances. The first two columns indicate the methods being compared and the Results column shows the correlations. The correlation between visual distance and citation distance (r = 0.706, p value = 0.0001, z score = 5.103) is higher than the correlation between jargon distance and citation distance (r = 0.697, p value = 0.0001, z score = 5.989), providing evidence for our hypothesis that styles of visual communication are a stronger indicator of communication and influence than the terminology used by a field. Visual distance is also moderately correlated to jargon distance with r = 0.531, p value = 0.0002, and z score = 5.019. This result is expected. It verifies our first hypothesis: sub-disciplines use distinguishable patterns of visual communication. The correlation between visual distance and citation distance is sufficient to show that visual distance is capable of characterizing general relationships between disciplines, but it also reveals that there are still differences between citation distance and visual distance. We will elaborate on the different connections that visual distance exposes in Section 5.3.

We then perform hierarchical clustering, using the UPGMA algorithm [36], to qualitatively visualize how different methods group similar disciplines together and separate dissimilar disciplines. The hierarchical clustering results for visual distance, citation distance, and jargon distance are shown in Fig. 4. We observe similar patterns between visual distance and citation distance, where Computer Science, Statistics, Math, and Mathematical Physics are isolated from other physics-related fields of study. There is an inconsistency between visual distance and citation distance in the field of Quantitative Biology, which is the outlier in citation distance, but is assigned to the physics-related cluster in visual distance.

5.3 Analyzing Clusters

We classify the figures in each cluster to understand the visual composition of each cluster. We use the convolutional neural network classifier in [29] to categorize figures into five categories: (1) Diagram, (2) Plot, (3) Table, (4) Photo, and (5) Equation. The classification results are shown in Fig. 5.
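The tabulation behind such a composition chart can be sketched in a few lines. The sketch below assumes each figure already has a cluster id and a predicted figure type; the toy labels and the function names `cluster_composition` and `dominant_type` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, figure_types):
    """Share of each predicted figure type inside every cluster."""
    tallies = defaultdict(Counter)
    for cluster, figure_type in zip(cluster_ids, figure_types):
        tallies[cluster][figure_type] += 1
    composition = {}
    for cluster, counter in tallies.items():
        total = sum(counter.values())
        composition[cluster] = {t: n / total for t, n in counter.items()}
    return composition


def dominant_type(composition):
    """Most frequent figure type per cluster."""
    return {c: max(shares, key=shares.get) for c, shares in composition.items()}


# Toy example: six figures spread over three clusters.
clusters = [0, 0, 0, 1, 1, 2]
types = ["Diagram", "Diagram", "Plot", "Table", "Table", "Plot"]
shares = cluster_composition(clusters, types)
print(shares)
print(dominant_type(shares))
```

Reporting shares rather than raw counts makes clusters of very different sizes directly comparable.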
Surprisingly, each cluster is prominently associated with a certain type of visualization: Cluster#0 is primarily composed of diagrams (Diagram), Cluster#1 is primarily composed of tables (Table), Cluster#2 is primarily composed of plots of quantitative information (Plot), and Cluster#3 is primarily composed of photos (Photo). These results corroborate previous work that used supervised methods and manual labeling to categorize figures into five classes (Diagram, Plot, Table, Photo, and Equation) [28].

The distribution of figures helps to reveal the properties of each discipline. For instance, Cluster Plot is dominant in Quantitative Biology (48%) and Nuclear Experiment (60%), which may indicate the degree to which these fields can be considered experimental and data-driven. The distribution could further be used to group similar disciplines and separate dissimilar fields, as we show in the previous section.

Table 2: Choosing the number of clusters (k).

Dimension    Explained Variance Ratio    Average of Correlations to Citation Distance    Maximum Correlation to Citation Distance    Maximum at k=?
16           52.0%                       0.661                                           0.737                                       15
32           63.7%                       0.631                                           0.768                                       3
64           73.9%                       0.660                                           0.769                                       4
128          82.3%                       0.662                                           0.770                                       4
256          88.9%                       0.672                                           0.793                                       4
320          90.7%                       0.674                                           0.793                                       4

Figure 4: The hierarchical clustering dendrogram of visual distance (left), citation distance (middle), and jargon distance (right). Citation distance is the benchmark in our task. It shows a pattern similar to that of visual distance, where Computer Science, Statistics, Math, and Mathematical Physics are separated from the rest of the disciplines. The inconsistency between citation distance and visual distance is Quantitative Biology, which is clustered with physics-related disciplines in visual distance while it is isolated in citation distance.
On the other hand, jargon distance segregates disciplines differently from visual distance and citation distance at the high level: High Energy Physics and Nuclear are separated from the rest, while Quantitative Biology, Computer Science, and Statistics are isolated in a sub-cluster.

Table 3: The correlation results between distance matrices.

Distance pair                            r      p       z
Visual distance vs. citation distance    0.706  0.0001  5.103
Visual distance vs. jargon distance      0.531  0.0002  5.019
Jargon distance vs. citation distance    0.697  0.0001  5.989

5.4 Visuals delineate differently than citations

In this section, we focus on the cases in computer science where visual distance and citation distance disagree, and we validate our second hypothesis: visual patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph. The analysis aims to answer the following questions: (1) Where are there visual differences in the disciplinary landscape when compared to citation differences? (2) What is revealed about the fields where visual differences occur?

We normalize visual distance and citation distance, then subtract visual distance from citation distance to expose the discrepancies. Fig. 6 shows that there is a significant disagreement between visual distance and citation distance for the subfield Computation and Language. Red cells show the disagreements where fields are visually distinct but similar in citation distance. Green cells, in contrast, indicate disciplines that are visually similar but far apart in citation distance. We observe that Computation and Language is generally close to all other categories in Computer Science in terms of citation distance, but visually distinct. We further examine the visual profile of Computation and Language in order to better understand the reasons for the divergence between these two distances.

Figure 5: The visual composition of each cluster. It appears that each cluster has one dominant visualization type.

Figure 6: Heat map of differences between visual and citation distance. We normalize visual distance and citation distance and subtract visual distance from citation distance to expose the discrepancies. Red indicates that two subfields are visually distant but near in citation distance. Green indicates that two subfields are distinct in citation distance but visually similar. Computation and Language is visually different from the other subfields in Computer Science but relatively close to them in terms of citation distance.

Fig. 7 shows the distribution of figure usage in Computation and Language (CL) and Computer Science (CS) over the past ten years. We make two observations from this stacked bar chart: (1) Cluster Table dominates the visual communication style, accounting for over 50% of figures in Computation and Language in 2017 compared to approximately 30% in Computer Science, and its share has been growing over the past few years. (2) Researchers in Computation and Language use very few figures associated with Cluster Photo.

We further investigate why tables are so heavily used in Computation and Language by analyzing the cluster textually. We conduct topic modeling on the captions of the figures in Cluster Table using Non-negative Matrix Factorization (NMF) [26] with five topics. In Table 4, we display the top 10 keywords of each topic along with the ratio of the count of figures in each topic to the total count in the cluster over the past 10 years. We also look at the images in each topic to help us understand the purpose of each topic.
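The caption topic modeling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace-tokenized toy captions, raw term counts, two topics instead of five, and the multiplicative update rules for the Frobenius objective commonly associated with Lee and Seung's NMF [26]. All function names (build_term_matrix, nmf, top_keywords) are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter


def build_term_matrix(captions):
    """Return (vocabulary, document-term count matrix) for whitespace-tokenized captions."""
    vocab = sorted({word for caption in captions for word in caption.lower().split()})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for caption in captions:
        row = [0.0] * len(vocab)
        for word, count in Counter(caption.lower().split()).items():
            row[column[word]] = float(count)
        matrix.append(row)
    return vocab, matrix


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def reconstruction_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Multiplicative update for H with W held fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # Multiplicative update for W with the new H held fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_keywords(h, vocab, k=3):
    """Return the k highest-weighted vocabulary terms for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:k]]
            for row in h]


# Toy captions (invented for illustration, not taken from the paper's corpus).
captions = [
    "results accuracy comparison models",
    "accuracy results models baseline",
    "dataset statistics training test",
    "training test dataset sets",
]
vocab, v = build_term_matrix(captions)
w, h = nmf(v, n_topics=2)
for topic_id, words in enumerate(top_keywords(h, vocab)):
    print(topic_id, words)
```

Each row of W gives a caption's weight on every topic, so a per-year topic ratio of the kind reported in Table 4 could be obtained by assigning each caption to its highest-weighted topic and counting by year; that assignment rule is likewise an assumption of this sketch.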
Based on the keywords and the images, we can infer that Topic 0 mostly contains tables comparing results against other models; Topic 1 includes examples of language and words; Topic 2, which is similar to Topic 0, also involves comparing results between different models; Topic 3 consists of statistics about the datasets; and Topic 4 is a mix of tables and diagrams that are mostly used to illustrate the architecture of LSTM models. Tables that compare the accuracy of different models (Topics 0 and 2) have grown significantly, from 46.4% (28.6% + 17.8%) in 2008 to 60.0% (47.6% + 12.4%) in 2017, suggesting that an empirical regime of research is dominant, perhaps due to improved access to advanced computational infrastructure, easy access to data and code, and the rapid growth of the field itself.

Table 4: Top 10 keywords for each topic in Cluster Table, along with the ratio of figures in each topic over time.

Topic 0 keywords: results, table, models, different, performance, best, scores, dataset, comparison, accuracy
Topic 1 keywords: words, figure, word, number, example, table, sentence, example, sentences, used
Topic 2 keywords: et, al, 2015, 2016, 2014, 2017, 2013, results, 2011, taken
Topic 3 keywords: set, test, training, data, development, sets, table, dev, used, statistics
Topic 4 keywords: model, language, trained, baseline, lstm, proposed, models, attention, layer, performance

Year  Topic 0  Topic 1  Topic 2  Topic 3  Topic 4
2008  28.6%    25.0%    17.8%    22.9%    5.7%
2009  31.1%    26.8%    16.1%    18.6%    7.4%
2010  31.2%    24.2%    16.9%    21.0%    6.7%
2011  34.2%    25.1%    17.2%    16.3%    7.2%
2012  39.1%    22.7%    16.0%    15.5%    6.7%
2013  37.3%    21.7%    17.3%    15.9%    7.8%
2014  39.4%    19.8%    16.0%    14.9%    9.9%
2015  43.9%    18.7%    14.2%    12.2%    11.0%
2016  45.3%    18.5%    13.4%    10.4%    12.4%
2017  47.6%    17.4%    12.4%    10.1%    12.5%

Figure 7: The chart shows how the distribution of the clusters evolves in Computation and Language and Computer Science over the past ten years. Cluster Table has been growing in Computation and Language, and researchers in Computation and Language use relatively few figures in Cluster Photo.

5.5 Fine-grained Figure Analysis

The classifier achieves an accuracy of 0.902 on the validation set and 0.868 on the test set, with a precision of 0.741 and a recall of 0.827 on neural network diagrams. The confusion matrix of the classifier is shown in Fig. 8. The classifier tends to misclassify flow charts, bar charts, and diagrams with multiple circles as neural network diagrams, and it is also often confused between embedding visualizations and scatter plots (which are indeed quite similar). The classifier appears sufficiently effective at identifying neural network diagrams and embedding visualizations to conduct the following analysis.

Figure 8: The confusion matrix of the figure type classifier. The classifier achieves 0.868 overall accuracy.

We use the trained classifier to label 60k figures in computer science papers on arXiv and analyze the count of neural network diagrams (top line chart in Fig. 9) and embedding visualizations in computer science disciplines over time. We select four categories: Artificial Intelligence, Machine Learning, Computer Vision, and Computation and Language. These disciplines are known to be strongly involved in neural network research. We also include Computational Complexity, which has less involvement in neural learning research, as a control. The usage profile by field for embedding visualizations is similar to that of neural network diagrams. We also compute the count of papers whose abstracts include "neural network" or "deep learning" in the selected categories over time; this trend is shown in the middle line chart in Fig. 9. Finally, we select six influential papers in deep learning research: AlexNet [23], GAN [17], LSTM [19], ResNet [18], RNN [37], VGG [41], and Word2Vec [34].
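The abstract-based count described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the record fields (category, year, abstract) are hypothetical, the sample records are invented, and matching is done by case-insensitive substring search for either keyword, which the text does not specify.

```python
from collections import Counter

# Keywords named in the text; a paper is counted once if either appears.
KEYWORDS = ("neural network", "deep learning")


def count_keyword_papers(papers, keywords=KEYWORDS):
    """Count, per (category, year), the papers whose abstract mentions any keyword."""
    counts = Counter()
    for paper in papers:
        abstract = paper["abstract"].lower()
        if any(keyword in abstract for keyword in keywords):
            counts[(paper["category"], paper["year"])] += 1
    return counts


# Invented toy records; the category labels follow arXiv's cs.* naming.
papers = [
    {"category": "cs.CV", "year": 2014, "abstract": "A deep learning approach to image recognition."},
    {"category": "cs.CV", "year": 2014, "abstract": "Convolutional Neural Networks for object detection."},
    {"category": "cs.CL", "year": 2014, "abstract": "A phrase-based model for machine translation."},
    {"category": "cs.CC", "year": 2014, "abstract": "Lower bounds for circuit complexity."},
    {"category": "cs.CL", "year": 2015, "abstract": "Neural network language models with deep learning."},
]

counts = count_keyword_papers(papers)
for (category, year), n in sorted(counts.items()):
    print(category, year, n)
```

Substring matching also catches plural forms such as "neural networks"; a stricter tokenized match would be an equally reasonable choice.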
We calculate the received citation count of each paper for each year to show the growth of influence of these papers (bottom line chart in Fig. 9). We compare these results with our visualization-based metrics to study our third hypothesis: we can use specific types of figures to track the propagation of ideas and methods in the literature.

Figure 9: The three line charts demonstrate the trend of recent studies in deep learning using three different media: figures (top), text (middle), and citation (bottom). Top: The number of papers that include neural network diagrams over time. Middle: The count of papers that have "neural network" or "deep learning" in their abstracts over time. Bottom: The citation count of six selected influential papers in deep learning. The annotation of each influential paper indicates the publication time. Citation counts of the most influential papers and use of the term "neural network" in abstracts increase quickly (yellow area), but the effect is small. The use of relevant figures increases only once authors start to truly adopt the concept in their research.

From the three plots, we make the following observations. First, the three line charts demonstrate the same tendency: a rapid rise in recent years. It is not surprising to see this common trend; increased interest in a topic leads to both increasing citations and an increasing number of relevant diagrams across the literature.

Second, the count of papers that include "neural network" in their abstracts steadily increases from 2012 to 2014 (yellow background), as does the citation count of one particular paper, AlexNet. But there is no increase in the use of figures during this period. The cost of mentioning "neural networks" or citing a relevant paper is low, but the cost of developing a relevant figure is high.
We interpret this result as evidence that the use of a figure is better correlated with the true adoption of a concept or method, as opposed to simply acknowledging the relevance of a concept or method. After a novel idea is published, the community rapidly begins to discuss the work and, potentially, cites a relevant paper. But it takes time for the community to integrate the concept into their own research. Once they have done so, the cost of developing a figure is justified, and the number of figures increases. When the community is adopting the concept, visualizations begin to emerge in the literature.

Third, the number of neural network diagrams increases dramatically in 2015 in the four relevant disciplines, while, except for AlexNet, we do not see such rapid growth in received citation counts until 2017 (ResNet and VGG). There is a two-year gap between the emergence of neural network diagrams and the rise in received citation counts. Figures, as well as text, react faster to the introduction of new ideas than aggregate citation counts do. These results both validate the use of figures as a signal of scientific communication and show that figures expose patterns not otherwise discernible.

6 CONCLUSION

In this study, we demonstrate the feasibility of using visual information as a measure of similarity. We show that visual distance is able to determine the overall relationships between fields, achieving a moderately high correlation (0.706) with citation distance. In addition, we show that visual distance still delivers valuable information when it disagrees with citation distance. We further conduct a case study on two specific types of figures: neural network diagrams and embedding visualizations.
We find that the upward trend of neural network diagrams and embedding visualizations predates the growth in citation counts of influential papers in recent years. This provides evidence that figures in the scientific literature are leading indicators of citations. We plan to extend our study to more fine-grained figure labels. This extension will afford better interpretation of the correlations between figures, text, and citations and help us better refine our groupings. In addition, we plan to apply these visual demarcation techniques to tasks in information retrieval and recommendation systems.

7 ACKNOWLEDGEMENT

This research has made use of NASA's Astrophysics Data System Bibliographic Services.

REFERENCES

[1] Alberto Accomazzi, Gunther Eichhorn, Michael J Kurtz, Carolyn S Grant, Edwin Henneken, Markus Demleitner, Donna Thompson, Elizabeth Bohlen, and Stephen S Murray. 2006. Creation and Use of Citations in the ADS. arXiv preprint cs/0610011 (2006).
[2] Bram van den Akker, Ilya Markov, and Maarten de Rijke. 2019. ViTOR: Learning to Rank Webpages Based on Visual Features. arXiv preprint (2019).
[3] Rabah A Al-Zaidy and C Lee Giles. 2015. Automatic extraction of data from bar charts. In K-CAP. ACM, 30.
[4] Richard E Bellman. 1961. Adaptive control processes: a guided tour. Vol. 2045. Princeton University Press.
[5] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-U Hwang. 2006. Complex networks: Structure and dynamics. Physics Reports 424, 4-5 (2006), 175–308.
[6] Jean Charbonnier, Lucia Sohmen, John Rothman, Birte Rohden, and Christian Wartena. 2018. NOA: A Search Engine for Reusable Scientific Images Beyond the Life Sciences. In ECIR. Springer, 797–800.
[7] Zhe Chen, Michael Cafarella, and Eytan Adar. 2015. DiagramFlyer: A search engine for data-driven diagrams. In The Web Conference. ACM, 183–186.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 248–255.
[9] Yuxiao Dong, Hao Ma, Zhihong Shen, and Kuansan Wang. 2017. A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations. In KDD. ACM, 1437–1446.
[10] Stuart E Dreyfus. 1969. An appraisal of some shortest-path algorithms. Operations Research 17, 3 (1969), 395–412.
[11] Guenther Eichhorn. 1994. An overview of the astrophysics data system. Experimental Astronomy 5, 3-4 (1994), 205–220.
[12] Stephanie Elzer, Sandra Carberry, and Ingrid Zukerman. 2011. The automated understanding of simple bar charts. Artificial Intelligence 175, 2 (2011), 526–555.
[13] Jing Fang, Prasenjit Mitra, Zhi Tang, and C Lee Giles. 2012. Table Header Detection and Classification. In AAAI. 599–605.
[14] Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. 2018. Science of science. Science 359, 6379 (2018), eaao0185.
[15] Robert P Futrelle, Mingyan Shao, Chris Cieslik, and Andrea Elaina Grimes. 2003. Extraction, layout analysis and classification of diagrams in PDF documents. In ICDAR. IEEE, 1007–1013.
[16] Eugene Garfield. 2006. The history and meaning of the journal impact factor. JAMA 295, 1 (2006), 90–93.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[20] Wenyi Huang, Zhaohui Wu, Prasenjit Mitra, and C Lee Giles. 2014. RefSeer: A citation recommendation system. In JCDL. IEEE Press, 371–374.
[21] J.D. West, I. Wesley-Smith, and C.T. Bergstrom. 2016. A recommendation system based on hierarchical clustering of an article-level citation network. IEEE Transactions on Big Data 2, 2 (June 2016), 113–123. https://doi.org/10.1109/TBDATA.2016.2541167
[22] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In ECCV. Springer, 235–251.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[24] Onur Küçüktunç, Erik Saule, Kamer Kaya, and Ümit V Çatalyürek. 2012. Direction awareness in citation recommendation. (2012).
[25] Michael J Kurtz and Edwin A Henneken. 2017. Measuring metrics - a 40-year longitudinal cross-validation of citations, downloads, and peer review in astrophysics. Journal of the Association for Information Science and Technology 68, 3 (2017), 695–708.
[26] Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (1999), 788.
[27] Poshen Lee, Jevin West, and Bill Howe. 2016. VizioMetrix: A Platform for Analyzing the Visual Information in Big Scholarly Data. In The Web Conference Workshop on BigScholar.
[28] Poshen Lee, Jevin West, and Bill Howe. 2017. Viziometrics: Analyzing Visual Patterns in the Scientific Literature. IEEE Transactions on Big Data (2017).
[29] Poshen Lee, T. Sean Yang, Jevin West, and Bill Howe. 2017. PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms. (2017).
[30] Loet Leydesdorff and Ping Zhou. 2007. Nanotechnology as a field of science: Its delineation in terms of journals and patents. Scientometrics 70, 3 (2007), 693–713.
[31] Xiaonan Lu, J Wang, Prasenjit Mitra, and C Lee Giles. 2007. Automatic extraction of data from 2-d plots in documents. In ICDAR, Vol. 1. IEEE, 188–192.
[32] Nathan Mantel. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27, 2 Part 1 (1967), 209–220.
[33] IV Marshakova. 1973. Co-Citation in Scientific Literature: A New Measure of the Relationship Between Publications. Scientific and Technical Information Serial of VINITI 6 (1973), 3–8.
[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[35] Douglas L Nelson, Valerie S Reed, and John R Walling. 1976. Pictorial superiority effect. Journal of Experimental Psychology: Human Learning and Memory 2, 5 (1976), 523.
[36] F James Rohlf and David R Fisher. 1968. Tests for hierarchical structure in random data sets. Systematic Biology 17, 4 (1968), 407–412.
[37] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533.
[38] Akshay Sethi, Anush Sankaran, Naveen Panwar, Shreya Khare, and Senthil Mani. 2018. DLPaper2Code: Auto-generation of code from deep learning research papers. In AAAI.
[39] Mingyan Shao and Robert P Futrelle. 2005. Recognition and classification of figures in PDF documents. In International Workshop on Graphics Recognition. Springer, 231–242.
[40] Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, and Ali Farhadi. 2016. FigureSeer: Parsing result-figures in research papers. In ECCV. Springer, 664–680.
[41] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[42] Trevor Strohman, W Bruce Croft, and David Jensen. 2007. Recommending citations for academic papers. In SIGIR. ACM, 705–706.
[43] D. Vilhena, J. Foster, M. Rosvall, J.D. West, J. Evans, and C. Bergstrom. 2014. Finding Cultural Holes: How Structure and Culture Diverge in Networks of Scholarly Communication. Sociological Science 1 (2014), 221–238. https://doi.org/10.15195/v1.a15
[44] Colin Ware. 2012. Information visualization: perception for design. Elsevier.
[45] J.D. West and J. Portenoy. 2016. Delineating Fields Using Mathematical Jargon. In JCDL Workshop on BIRNDL.
[46] Sean Yang, Kuan-Hao Huang, and Bill Howe. 2019. MultiDEC: Multi-Modal Clustering of Image-Caption Pairs. arXiv preprint arXiv:1901.01860 (2019).
[47] Michel Zitt and Elise Bassecoulard. 2006. Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences. Information Processing & Management 42, 6 (2006), 1513–1531.
