Delineating Knowledge Domains in the Scientific Literature Using Visual Information
Sean T. Yang, University of Washington, Seattle, Washington, tyyang38@uw.edu
Po-shen Lee, University of Washington, Seattle, Washington, sephon@uw.edu
Jevin D. West, University of Washington, Seattle, Washington, jevinw@uw.edu
Bill Howe, University of Washington, Seattle, Washington, billhowe@cs.washington.edu

ABSTRACT

Figures are an important channel for scientific communication, used to express complex ideas, models, and data in ways that words cannot. However, this visual information is mostly ignored in analyses of the scientific literature. In this paper, we demonstrate the utility of using scientific figures as markers of knowledge domains in science, which can be used for classification, recommender systems, and studies of scientific information exchange. We encode sets of images into a visual signature, then use distances between these signatures to understand how patterns of visual communication compare with patterns of jargon and citation structure. We find that figures can be as effective for differentiating communities of practice as text or citation patterns. We then consider where these metrics disagree to understand how different disciplines use visualization to express ideas. Finally, we further consider how specific figure types propagate through the literature, suggesting a new mechanism for understanding the flow of ideas apart from conventional channels of text and citations. Our ultimate aim is to better leverage these information-dense objects to improve scientific communication across disciplinary boundaries.

KEYWORDS

VizioMetrics, science of science, bibliometrics, scientometrics

ACM Reference format:
Sean T. Yang, Po-shen Lee, Jevin D. West, and Bill Howe. 2016. Delineating Knowledge Domains in the Scientific Literature Using Visual Information. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 10 pages.
DOI: 10.475/123_4

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference'17, Washington, DC, USA
© 2016 ACM. 123-4567-24-567/08/06...$15.00
DOI: 10.475/123_4

1 INTRODUCTION

Increased access to publication data has contributed to the emergence of the Science of Science (SciSci) as a field of study. SciSci studies metrics of knowledge production and the factors contributing to this production [14]. Citations and text are the primary data types for measuring influence and tracking the evolution of scientific disciplines in this field. Dong et al. [9] use citations to study the growth of science and observe the globalization of scientific development within the past century. Vilhena et al. [43] characterize cultural holes of scientific communication embedded in citation networks. However, among the studies in SciSci, the use of visualization has received little attention, despite being widely recognized as a significant communication channel within disciplines, across disciplines, and with the general public [28]. Humans perceive information presented visually better than textually [35] due to the highly developed visual cortex [44]. As a result, figures play a significant role in academic communication. The information density of a visualization or diagram can represent complex ideas in a compact form.
For example, a neural network architecture diagram conveys an overview of the method used in a paper without requiring code listings or significant text. Moreover, the presence of a neural network diagram can be a better indicator that the paper involves the use of a neural network than any simple text feature such as the presence of the phrase "neural network."

Despite the importance of figures in the scientific literature, they have received relatively little attention in the SciSci community. Viziometrics [28] is the analysis of visual information in the scientific literature. The term was adopted to distinguish this analysis from bibliometrics and scientometrics, while still conveying the common objectives of understanding and optimizing patterns of scientific influence and communication. Lee et al. [28] have shown the relationship between visual information and the scientific impact of a paper. In this paper, we demonstrate that visual information can serve as an effective measure of similarity that can demarcate areas of knowledge in the scientific literature. Different scientific communities use visual information differently, and one can use these differences to understand communities of practice across traditional disciplines and show how ideas flow between these communities.

We consider three hypotheses: H1) sub-disciplines use distinguishable patterns of visual communication just as they use distinguishable jargon; H2) these patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph; and H3) by classifying and analyzing the use of specific types of figures, we can track the propagation and popularity of certain ideas and methods that are difficult to discern using text or citations alone (e.g., inclusion of neural network diagrams suggests contributions of new neural network architectures).
To test these hypotheses, we extract over 5 million scientific figures from papers on arXiv.org, process the images into low-dimensional vectors, then build a visual signature for each field by clustering the vectors and computing the frequency distribution across clusters for each discipline. We use these signatures to reason about the similarity between fields, and compare these measures to prior work in understanding scientific community structure using text [43] and the citation graph [10, 43]. Citations and text have been used to circumscribe knowledge domains, but this is the first study to show that figures can also delineate fields. We compare the pairwise distances between these three matrices using the Mantel test [32], a common statistical test of the correlation between two distance matrices. We find that visual distance is moderately correlated with citation-based metrics (r = 0.706, p = 0.0001, z score = 5.103) and text-based metrics (r = 0.531, p = 0.0002, z score = 5.019). We also perform hierarchical clustering on all distance matrices to provide a qualitative comparison of the results, finding that the hierarchical structure of the fields largely agrees, but with some significant exceptions. We then consider pairs of fields that are visually distinct but similar in either text distance or citation distance, suggesting differences in the visual style of how ideas are presented. For example, we find that Computation and Language is visually distinct from other Computer Science disciplines despite being quite similar in citation distance, because the former includes far more tables of data. Finally, we consider specific cases where the use of a particular type of figure can indicate a common method or idea in a way that text and citation similarity do not. We conduct a case study on two popular types of visualizations: neural network diagrams and embedding visualizations used to show clusters.
The analysis indicates that visualizations can be used to make inferences about concept adoption within scientific communities. We also observe that the figures reveal the uptake of neural networks earlier than citation analysis, since citation counts take years to accrue. With this case study, we show the significance of visualizations in the scientific literature, suggesting that the integration of figures into systems for bibliometric analysis, document summarization, information retrieval, and recommendation can improve performance and afford new applications. Our focus is on the scientific literature, but our methods are directly applicable to other domains, including patents, web pages [2], and news.

In this paper, we make the following contributions:
• We present a method for delineating scholarly disciplines based on the figures and visualizations in the literature.
• We compare this method to prior results based on citations and text and find that different fields and sub-disciplines exhibit discernible patterns of visual communication (H1).
• We find instances of fields that use similar jargon and cite similar sources, but are visually distinct, suggesting that visual patterns of communication are not redundant with other forms of communication (H2).
• We present a method for identifying specific figure types and show that the presence of these figures in a paper can be used to understand concept adoption and as a potential marker for tracking the evolution of scientific ideas (H3).

2 RELATED WORK

Citations have been extensively studied and utilized as a measure of similarity among scientific publications. Marshakova proposed co-citation analysis [33], which uses the frequency with which papers are cited together as a measure of similarity. Citations are also utilized to delineate the emerging nanoscience fields in [30, 47] and are applied to design recommendation systems [21].
However, citations only reveal the structural information within the scholarly literature and ignore the rich content of the articles. Text has also received significant attention for analyzing the connections within scientific disciplines and documents, especially in citation recommendation [20, 42]. Vilhena et al. [43] proposed a text-based metric to characterize the jargon distance between disciplines. However, the ambiguity and synonymity of text make text-based models less ideal [24]. Researchers have explored other aspects of a research paper for measuring the distance between disciplines. The frequency of mathematical symbols in papers is used to delineate fields by West et al. [45], but mathematical symbols are not as ubiquitous as other components. Visual communication is a significant channel for conveying scientific knowledge, but it is relatively less explored. A number of studies have focused on mining scientific figures. Chart classification was well studied by Futrelle et al. [15], Shao et al. [39], and Lee et al. [29]. Recent studies have focused on the extraction of quantitative data from scientific visualizations, including line charts [31, 40], bar charts [3], and tables [13]. Researchers have also investigated techniques to understand the semantic messages of scientific figures. Kembhavi et al. [22] utilized a convolutional neural network (CNN) to study the problem of diagram interpretation and reasoning. Elzer et al. [12] studied the intended messages in bar charts. Several visualization-based search engines have also been presented. DiagramFlyer [7], introduced by Chen et al., is a search engine for data-driven diagrams. VizioMetrix [27] and NOA [6] are both search engines for scientific figures in big scholarly data; both work by examining the captions around the figures. We see visual-based models for demarcating knowledge domains as a next step in this area of research.
3 METHOD

3.1 Data

The data for this study comes from the arXiv. The arXiv is an open-access repository for pre-prints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, systems science, and economics. The variety of disciplines allows consideration of information between fields, in contrast to more specialized repositories such as PubMed. There are 1,343,669 research papers, which include 5,009,523 figures, on arXiv through December 31st, 2017.

Figure 1: Overall pipeline. Figures are mapped to vectors using ResNet-50, dimension-reduced, then organized into a histogram for each field. The distances between these histograms are used to infer relationships and information flow.

3.2 Processing Pipeline

Fig. 1 shows the pipeline to characterize scientific disciplines using visual information. Each step is explained in the corresponding numbered paragraph.

3.2.1 Convert Figures Into Feature Vectors. We first embed each figure into a 2048-d feature vector using the pre-trained ResNet-50 [18] model. The figures are re-sized and padded with white pixels to be 224 x 224 before being embedded by the pre-trained ResNet-50. ResNet-50 was trained on the ImageNet [8] corpus of 1.2M natural images. Even though the model was trained on natural images, we find that the early layers of the network identify simple patterns (lines, edges, corners, curves) that are sufficiently general for the overall network to represent the combinations of edges and shapes that comprise artificial images as well. Although we posit that a custom neural network architecture could be designed to incrementally improve performance on artificial images, we do not further consider that direction in this paper.

3.2.2 Dimension Reduction. We reduce the dimension of each figure vector using Principal Component Analysis (PCA).
The high-dimensional vectors produced by ResNet-50 contain more information than is necessary for our application of computing the visual similarity between fields, and we seek to make the pipeline as efficient as possible. In addition, the ResNet model is pre-trained on natural images, while scientific figures contain far more white area, which makes their embedding vectors sparser than those of natural images. Distances tend to be inflated in high-dimensional space, reducing clustering performance [4]. We follow the typical practice of applying dimension reduction prior to clustering. Our original hypothesis was that a very low number of dimensions (10) would be sufficient to capture the differences between fields, but in our evaluation higher values (200+) produced stronger correlations with other methods of delineating fields. We considered different values of this parameter using a sample of 1.5M figures from the 5M figure corpus. The results of the experiment are presented in Section 5.1.

3.2.3 Cluster the Figure Corpus. The distribution of different types of figures carries significant information about how visual communication differs in each discipline and could further represent each category. We cluster our figure corpus with K-Means clustering to aggregate similar figures. Although more advanced methods of clustering could provide better results, we aim to demonstrate that the approach can work even with very simple methods. The objective of this paper is to show the utility of the figures for potential applications, rather than to propose a specialized framework for a specific task. The experimental results are shown in Section 5.1.

3.2.4 Visual Signatures for Each Discipline. We cluster the figures with number of centroids k = 4 and generate the normalized histogram for each discipline to acquire the visual signature of each discipline. After the visual signature of each discipline is generated, we calculate the Euclidean distance between each pair of disciplines.
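The signature-building steps (3.2.2–3.2.4) can be sketched with scikit-learn. The ResNet-50 feature vectors are replaced here by random stand-ins, and the PCA dimensionality is reduced so the sketch runs on a tiny sample; the field names are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

# Stand-ins for 2048-d ResNet-50 figure vectors, each tagged by discipline.
fields = ["cs", "math", "physics"]
X = rng.normal(size=(600, 2048))
labels = rng.choice(fields, size=600)

# 1) Dimension reduction (the paper finds 256 components sufficient; we use
#    16 here so the sketch runs on a tiny synthetic sample).
Z = PCA(n_components=16, random_state=0).fit_transform(X)

# 2) Cluster the figure corpus (the paper uses k = 4).
k = 4
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)

# 3) Visual signature = normalized histogram of cluster memberships per field.
signatures = np.vstack([
    np.bincount(clusters[labels == f], minlength=k) / np.sum(labels == f)
    for f in fields
])

# 4) Pairwise Euclidean distances between signatures.
D_visual = squareform(pdist(signatures, metric="euclidean"))
print(D_visual.shape)  # (3, 3)
```

The resulting `D_visual` plays the role of the visual distance matrix compared against citation and jargon distance in Section 4.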
We evaluate the computed visual similarity between disciplines by comparing it to the citation-based and text-based metrics described in previous work, which are explained in Section 4.

3.3 Classifying Figure Types

In this section, we describe the process of training the classifier to identify specific figure types, which we use to understand how particular styles of visualization and diagram propagate through the literature. We consider two specific examples: neural network diagrams (associated with the rapid increase of neural network methods in the literature) and clustering plots (associated with the use of unsupervised learning). Examples of these visualizations are shown in Figure 2. Sethi et al. [38] characterize six different figure types used to demonstrate neural network architectures. We label 10,651 figures from arXiv, which include 1,503 neural network diagrams, 1,057 embedding visualizations, and 8,091 negative examples. For neural network diagrams, we label them according to the taxonomy suggested by Sethi et al. [38], but we exclude figures in table format. We consider a figure an embedding visualization if the figure is used to visualize the representation distribution of the data. The annotators make use of images and captions to label the images.

Figure 2: Examples of a neural network diagram and an embedding visualization. (a) An example of a neural network diagram, borrowed from the AlexNet paper [23]. (b) An example of an embedding visualization, borrowed from the MultiDEC paper [46].

Figure 3: The architecture of the neural network diagram and embedding visualization classifier.

Table 1: Implementation details for training the neural network diagram and embedding visualization classifier.

Learning Rate | Decay | Epoch | Batch Size | Loss
0.001 | 0.001 | 150 | 256 | Categorical Cross Entropy
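A minimal PyTorch sketch of such a classifier head over pre-extracted 2048-d features follows. The layer widths and dropout rate are invented placeholders (the paper's tuned architecture is in Figure 3), the batch is synthetic, and Table 1's "Decay" is applied here as weight decay, which is an assumption.

```python
import torch
import torch.nn as nn

# Three classes: neural-network diagram, embedding visualization, negative.
NUM_CLASSES = 3

# Illustrative head over frozen 2048-d ResNet-50 features; the paper tunes
# depth, layer widths, and dropout (final architecture shown in Figure 3).
head = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, NUM_CLASSES),
)

# Table 1: lr = 0.001, decay = 0.001 (assumed weight decay), cross entropy.
opt = torch.optim.Adam(head.parameters(), lr=0.001, weight_decay=0.001)
loss_fn = nn.CrossEntropyLoss()

# Synthetic stand-in batch (real inputs are ResNet-50 feature vectors).
torch.manual_seed(0)
feats = torch.randn(256, 2048)
targets = torch.randint(0, NUM_CLASSES, (256,))

for _ in range(5):  # the paper trains for 150 epochs
    opt.zero_grad()
    loss = loss_fn(head(feats), targets)
    loss.backward()
    opt.step()

logits = head(feats)
print(logits.shape)  # torch.Size([256, 3])
```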
We extract visual features from the fully connected layer of a ResNet-50 [18] model, pre-trained on the 1.2M-image ImageNet dataset [8]. The figures are resized to 224x224 and a 2048-d numeric vector is acquired for each figure. The labeled image set is then split into training, validation, and test sets with an 8:1:1 ratio to train a deep neural network (DNN) classifier. We tune the depth of the model, the dimension of the layers, the dropout rate, the learning rate, the decay ratio, and the training epochs. The architecture of the final model is shown in Figure 3 and the implementation details are shown in Table 1.

4 COMPARISON WITH CITATION- AND TEXT-BASED METHODS

We use the Mantel test [32], a standard statistical test of the correlation between two matrices, to compare visual distance with the distance matrices created by (1) average shortest citation distance [10, 43] and (2) natural language jargon distance [43]. Citations and text have been extensively analyzed and employed to measure the similarity among research articles, and both measures have had success in information retrieval and recommendation systems for scholarly documents. Therefore, we consider citation distance as our benchmark for the task and text distance as an alternative comparison.

4.1 Average Shortest Citation Path

We compute the average shortest path between each pair of fields as a measure of similarity. Average shortest path [10] is one of the three most robust measures [5] of network topology, alongside the clustering coefficient and the degree distribution. Vilhena et al. [43] used this method to measure distance in the citation network to compare with their text-based metric. The average shortest path is computed as follows:

D_{ij} = \frac{1}{n_i n_j} \sum_{n_i} \sum_{n_j} d(v_i, v_j)

where n_i is the number of vertices in field i and n_j is the number of vertices in field j. The average shortest path between field i and field j, D_{ij}, is the average over all paths between all vertex pairs, v_i and v_j.
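The average shortest citation path can be sketched in a few lines over a toy citation graph; the graph, node names, and field labels below are invented for illustration, and the BFS assumes an unweighted, undirected graph.

```python
from collections import deque
from itertools import product

def shortest_path_length(adj, src, dst):
    """Unweighted BFS distance between two papers in the citation graph."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    raise ValueError("no path between nodes")

def avg_shortest_path(adj, field_of, i, j):
    """D_ij = (1 / (n_i * n_j)) * sum over all pairs of d(v_i, v_j)."""
    nodes_i = [v for v, f in field_of.items() if f == i]
    nodes_j = [v for v, f in field_of.items() if f == j]
    total = sum(shortest_path_length(adj, u, v)
                for u, v in product(nodes_i, nodes_j))
    return total / (len(nodes_i) * len(nodes_j))

# Toy undirected citation graph: a1--a2, a1--b1.
adj = {"a1": ["a2", "b1"], "a2": ["a1"], "b1": ["a1"]}
field_of = {"a1": "cs", "a2": "cs", "b1": "bio"}

print(avg_shortest_path(adj, field_of, "cs", "bio"))  # (1 + 2) / 2 = 1.5
```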
Our citation graph is obtained from the SAO/NASA Astrophysics Data System (ADS) [11], a digital library portal maintaining three bibliographic databases containing more than 13.6 million records covering publications in Astronomy and Astrophysics, Physics, and the arXiv e-prints. The creation of the citations in ADS [1] starts by scanning the full text of the paper to retrieve a bibcode for each reference string in the article, followed by computing a similarity score between the ADS record and the bibcode. Citation pairs are generated if the similarity is higher than a threshold. This data has been used extensively in several bibliographic studies [16, 25]. There are 14,555,820 citation edges within our arXiv data corpus.

4.2 Jargon Distance

We also compare our results to text metrics based on cultural information as represented by patterns of discipline-specific jargon. Jargon distance was first proposed by Vilhena et al. [43], where the authors quantitatively measure the communication barrier between fields using n-grams from full text. The jargon distance (E_{ij}) between field i and field j is defined as the ratio of (1) the entropy H of a random variable X_i with a probability distribution of the jargon or mathematical symbols within field i and (2) the cross entropy Q between the probability distributions in field i and field j:

E_{ij} = \frac{H(X_i)}{Q(p_i \,||\, p_j)} = \frac{-\sum_{x \in X} p_i(x) \log_2 p_i(x)}{-\sum_{x \in X} p_i(x) \log_2 p_j(x)}

Imagine a writer from field i trying to communicate with a reader from field j. The writer has a codebook P_i that maps the natural language or mathematical symbols to codewords that the reader has to decode using the codebook P_j from field j. A small jargon distance means high communication efficiency between two fields that are closely related. This metric can easily be applied to natural language jargon to explore how communication varies through these two channels across disciplines.
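The ratio above can be computed directly from unigram counts. A minimal sketch, with invented toy counts; the add-alpha smoothing is a choice of this sketch to keep the cross entropy finite when field j never uses one of field i's terms.

```python
import numpy as np

def jargon_distance(counts_i, counts_j, alpha=1.0):
    """E_ij = H(X_i) / Q(p_i || p_j) over a shared unigram vocabulary.
    Add-alpha smoothing (a choice of this sketch) avoids log(0)."""
    p_i = (counts_i + alpha) / (counts_i + alpha).sum()
    p_j = (counts_j + alpha) / (counts_j + alpha).sum()
    H = -np.sum(p_i * np.log2(p_i))   # entropy of field i
    Q = -np.sum(p_i * np.log2(p_j))   # cross entropy from field i to field j
    return H / Q

# Toy unigram counts over a shared vocabulary of four terms.
physics = np.array([40.0, 30.0, 20.0, 10.0])
biology = np.array([5.0, 10.0, 30.0, 55.0])

print(jargon_distance(physics, physics))        # 1.0: no communication barrier
print(jargon_distance(physics, biology) < 1.0)  # True: Q >= H (Gibbs' inequality)
```

Since the cross entropy is never smaller than the entropy, E_{ij} lies in (0, 1], with 1 meaning the two fields share an identical term distribution.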
We compute the jargon distance between two different disciplines by applying the metric to unigrams from abstracts.

5 RESULTS

We show that the distance between visual signatures can be used to determine the overall relationships between fields in a manner similar to prior methods, but that this approach also exposes information that prior methods cannot. In Section 5.1, we present the experimental results on picking the number of dimensions and clusters. In Section 5.2, we show the capacity of visual distance to reveal the relationships across scientific disciplines by showing global agreement between visual distance and citation distance (H1). In Section 5.3, we examine each cluster to understand the visual composition and find that each cluster is dominated by a certain type of visualization, extending prior work in the life sciences that used coarse-grained labeling of figure types [28]. In Section 5.4, we show that citation distance and visual distance disagree in certain cases, and consider one case in particular (H2). Finally, in Section 5.5, we consider cases where the presence of a particular type of figure can indicate the use of a method or concept in a way that text and citation similarity do not (H3). We demonstrate that the figures in the scientific literature can serve as an indicator of concept adoption that travels faster than citation count.

5.1 Choosing the number of dimensions and clusters

Our pipeline involves two hyperparameters: the number of dimensions to retain via PCA and the number of clusters to assume when constructing visual signatures. We determine these parameters experimentally. The results of our analysis of PCA dimensions appear in Table 2. The explained variance ratio shows the percentage of variance explained by the selected components.
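The explained variance ratio column of Table 2 corresponds to the cumulative share of variance captured by the retained PCA components. A minimal sketch on synthetic stand-in vectors (the matrix sizes are illustrative, not the paper's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, roughly low-rank stand-ins for the figure vectors; the paper
# sweeps the number of retained components on a 1.5M-figure sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64)) @ rng.normal(size=(64, 256))

pca = PCA(n_components=32).fit(X)

# Explained variance ratio for a given dimension d = cumulative share of
# variance captured by the first d components.
cum = np.cumsum(pca.explained_variance_ratio_)
for d in (8, 16, 32):
    print(f"{d} components explain {cum[d - 1]:.1%} of the variance")
```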
The variance explained grows insignificantly after 256 components. The average correlation to citation distance shows the average of the correlations between visual distance and citation distance across all numbers of centroids k (from 2 to 30). We evaluate our method by conducting the Mantel test [32] to compare the correlation between visual distance and citation distance. It confirms our hypothesis that the correlation increases when more components are used, but it converges after sufficient information is preserved. The maximum correlation to citation distance shows the maximum correlation for the specified dimension among the different choices of the number of centroids k, and the k contributing the maximum correlation is shown in "Maximum at k = ?". Surprisingly, the maximum correlation occurs at a larger number of centroids when the dimension of the figure vector is low. Our interpretation is that there is not sufficient information preserved in the low-dimensional space.

We ran a second experiment to determine the number of centroids k. Initially, we expected the correlation with other measures to be higher using larger values of k, since the diversity of figures in the literature appears vast. However, considering k = 100, 200, and 400, we found that larger values of k generate lower correlations with citation distance (correlation coefficient around 0.4), due to overfitting to rare, low-confidence clusters. Lowering k to the range of 2 to 30 performed better; these results appear in Table 2. The relatively low values of k suggest that there are relatively few modalities of visual communication in use across fields. The maximum correlation occurred at k = 4 in most of the experiments. We further discuss the interpretation of these results in Section 5.3.

5.2 Delineating Disciplines

In this section, we demonstrate the ability of visual distance to characterize the relationships between fields, quantitatively and qualitatively.
Quantitatively, we conduct the Mantel test [32] with the Spearman rank correlation method to compare two different distance matrices and reveal the similarity between the two structures. We also perform hierarchical clustering using the UPGMA algorithm [36] to visualize the hierarchical relationships across disciplines qualitatively. Vilhena et al. [43] used a similar technique to qualitatively visualize how disciplines are delineated, but the data they used was from JSTOR, which focuses on biological science and social science, so it is not comparable with our task. Table 3 shows the correlation results between different distances. The first two columns indicate the methods being compared and the Results column shows the correlations. The correlation between visual distance and citation distance (r = 0.706, p value = 0.0001, z score = 5.103) is higher than the correlation between jargon distance and citation distance (r = 0.697, p value = 0.0001, z score = 5.989), providing evidence for our hypothesis that styles of visual communication are a stronger indicator of communication and influence than the terminology used by a field. Visual distance is also moderately correlated with jargon distance, with r = 0.531, p value = 0.0002, and z score = 5.019. This result is expected. It verifies our first hypothesis: sub-disciplines use distinguishable patterns of visual communication. The correlation between visual distance and citation distance is sufficient to show that visual distance is capable of characterizing general relationships between disciplines, but it also reveals that there are still differences between citation distance and visual distance. We elaborate on the different connections visual distance exposes in Section 5.3. We then perform hierarchical clustering, using the UPGMA algorithm [36], to qualitatively visualize how different methods group similar disciplines together and separate dissimilar disciplines.
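The permutation form of the Mantel test with Spearman correlation can be sketched as follows; the matrix here is a random toy stand-in, and testing a matrix against itself should give r = 1 with a small p-value.

```python
import numpy as np
from scipy.stats import spearmanr

def mantel(D1, D2, n_perm=999, seed=0):
    """One-tailed permutation Mantel test with Spearman correlation on the
    upper triangles of two symmetric distance matrices."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)
    r_obs = spearmanr(D1[iu], D2[iu]).correlation
    count = 0
    n = D1.shape[0]
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns of D2 simultaneously.
        r = spearmanr(D1[iu], D2[p][:, p][iu]).correlation
        if r >= r_obs:
            count += 1
    return r_obs, (count + 1) / (n_perm + 1)

# Toy symmetric distance matrix with zero diagonal.
rng = np.random.default_rng(1)
A = rng.random((10, 10))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)

r, p = mantel(D, D, n_perm=199)
print(r)  # 1.0
```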
The hierarchical clustering results for visual distance, citation distance, and jargon distance are shown in Fig. 4. We observe similar patterns between visual distance and citation distance, where Computer Science, Statistics, Math, and Mathematical Physics are isolated from the other, physics-related fields of study. There is an inconsistency between visual distance and citation distance in the field of Quantitative Biology, which is the outlier in citation distance but is assigned to the physics-related cluster in visual distance.

5.3 Analyzing Clusters

We classify the figures in each cluster to understand the visual composition of each cluster. We use the convolutional neural network classifier in [29] to categorize figures into five categories: (1) Diagram, (2) Plot, (3) Table, (4) Photo, and (5) Equation. The classification results are shown in Fig. 5. Surprisingly, each cluster is prominently associated with a certain type of visualization: Cluster #0 is primarily composed of diagrams (Diagram), Cluster #1 is primarily composed of tables (Table), Cluster #2 is primarily composed of plots of quantitative information (Plot), and Cluster #3 is primarily composed of photos (Photo). These results corroborate previous work that used supervised methods and manual labeling to categorize figures into five classes (Diagram, Plot, Table, Photo, and Equation) [28]. The distribution of figures helps to reveal the properties of each discipline. For instance, Cluster Plot is dominant in Quantitative Biology (48%) and Nuclear Experiment (60%), which may indicate the degree to which these fields can be considered experimental and data-driven. The distribution could further be used to group similar disciplines and separate dissimilar fields, as we show in the previous section.

Table 2: Choosing the number of clusters (k).
Dimension | Explained Variance Ratio | Average of Correlations to Citation Distance | Maximum Correlation to Citation Distance | Maximum at k=?
16 | 52.0% | 0.661 | 0.737 | 15
32 | 63.7% | 0.631 | 0.768 | 3
64 | 73.9% | 0.660 | 0.769 | 4
128 | 82.3% | 0.662 | 0.770 | 4
256 | 88.9% | 0.672 | 0.793 | 4
320 | 90.7% | 0.674 | 0.793 | 4

Figure 4: The hierarchical clustering dendrograms of visual distance (left), citation distance (middle), and jargon distance (right). Citation distance is a benchmark in our task. It shows a similar pattern to visual distance, where Computer Science, Statistics, Math, and Mathematical Physics are separated from the rest of the disciplines. The inconsistency between citation distance and visual distance is Quantitative Biology, which is clustered with physics-related disciplines in visual distance while it is isolated in citation distance. On the other hand, jargon distance segregates disciplines differently from visual distance and citation distance at the high level. High Energy Physics and Nuclear are separated from the rest, while Quantitative Biology, Computer Science, and Statistics are isolated in the sub-cluster.

Table 3: The correlation results between distance matrices.

Comparison | Results
Visual Distance vs. Citation Distance | r = 0.706, p = 0.0001, z = 5.103
Visual Distance vs. Jargon Distance | r = 0.531, p = 0.0002, z = 5.019
Jargon Distance vs. Citation Distance | r = 0.697, p = 0.0001, z = 5.989

5.4 Visuals delineate differently than citations

In this section, we focus on the cases in computer science where visual distance and citation distance disagree, and we validate our second hypothesis: visual patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph. The analysis aims to answer the following questions: (1) Where are there visual differences in the disciplinary landscape when compared to citation differences? (2) What is revealed about the fields where visual differences occur?
We normalize visual distance and citation distance, then subtract visual distance from citation distance to expose the discrepancies. Fig. 6 shows that there is a significant disagreement between visual distance and citation distance for the subfield Computation and Language.

Figure 5: The visual composition of each cluster. It appears that each cluster has one dominant visualization type.

Figure 6: Heat map of differences between visual and citation distance. We normalize visual distance and citation distance and subtract visual distance from citation distance to expose the discrepancies. Red indicates that two subfields are visually distant but near in citation distance. Green indicates that two subfields are distinct in citation distance but visually similar. Computation and Language is visually different across the subfields in Computer Science but relatively close in terms of citation distance.

Red cells show the disagreements where fields are visually distinct but similar in citation distance. Green cells, in contrast, indicate disciplines that are visually similar but far apart in citation distance. We observe that Computation and Language is generally close to all other categories in Computer Science, but visually distinct. We further examine the visual profile of Computation and Language in order to better understand the reasons for the divergence between these two distances. Fig. 7 shows the distribution of figure usage in Computation and Language (CL) and Computer Science (CS) over the past ten years. We make two observations from this stacked bar chart: (1) Cluster Table dominates the visual communication style with over 50% in Computation and Language in 2017, compared to approximately 30% in Computer Science, and it has been growing over the past few years.
(2) Researchers in Computation and Language use very few figures associated with Cluster Photo. We further investigate why tables are so heavily used in Computation and Language by analyzing the cluster textually. We conduct topic modeling on the captions of the figures in Cluster Table using Non-negative Matrix Factorization (NMF) [26] with five topics. In Table 4, we display the top 10 keywords of each topic, along with the ratio of the count of figures in each topic to the total count in the cluster over the past 10 years. We also inspect the images in each topic to help us understand its purpose. Based on the keywords and the images, we infer that Topic 0 mostly contains tables with comparison data against other models; Topic 1 includes examples of language and words; Topic 2, which is similar to Topic 0, also involves comparing results between different models; Topic 3 consists of statistics about the datasets; and Topic 4 is a mix of tables and diagrams, mostly used to illustrate the architecture of LSTM models. It appears that tables comparing the accuracy of different models have been growing significantly, from 46.4% (28.6% + 17.8%) in 2008 to 60% (47.6% + 12.4%) in 2017, suggesting that an empirical regime of research is dominant, perhaps due to improved access to advanced computational infrastructure, easy access to data and code, and the rapid growth of the field itself.

Figure 7: The chart shows how the distribution of the clusters evolves in Computation and Language and Computer Science over the past ten years. We observe that Cluster Table has been growing in Computation and Language, and that researchers in Computation and Language use relatively few figures from the Photo Cluster.

5.5 Fine-grained Figure Analysis

The classifier achieves an accuracy of 0.902 on the validation set and 0.868 on the test set, with a precision of 0.741 and recall of 0.827 on neural network diagrams.
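The precision and recall quoted above derive mechanically from a confusion matrix. A generic sketch follows; the 3x3 matrix below is invented for illustration and is not the paper's actual confusion matrix, and we assume the common convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

def precision_recall(cm, cls):
    """Per-class precision and recall from a confusion matrix whose
    rows are true labels and columns are predicted labels."""
    tp = cm[cls, cls]
    precision = tp / cm[:, cls].sum()   # tp / everything predicted as cls
    recall = tp / cm[cls, :].sum()      # tp / everything truly cls
    return precision, recall

# Hypothetical 3-class matrix: [nn_diagram, flow_chart, bar_chart]
cm = np.array([[80,  5,  5],
               [10, 70,  0],
               [ 5,  5, 90]])
p, r = precision_recall(cm, 0)          # class 0 = neural network diagrams
accuracy = np.trace(cm) / cm.sum()      # overall accuracy
```

Low precision with high recall, as reported for neural network diagrams, corresponds to a column that collects spurious hits from other true classes (here, flow charts mistaken for network diagrams) while the class's own row has few misses.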
The confusion matrix of the classifier is shown in Fig. 8. The classifier tends to misclassify flow charts, bar charts, and diagrams with multiple circles as neural network diagrams, and it is also often confused between embedding visualizations and scatter plots (which are indeed quite similar). The classifier appears sufficiently effective at identifying neural network diagrams and embedding visualizations to conduct the following analysis.

We use the trained classifier to label 60k figures from computer science papers on arXiv and analyze the counts of neural network diagrams (top line chart in Fig. 9) and embedding visualizations in computer science disciplines over time. We select four categories: Artificial Intelligence, Machine Learning, Computer Vision, and Computation and Language. These disciplines are known to be strongly involved in neural network research. We also include Computational Complexity, which has less involvement in neural network research, as a control. We also compute the count of papers

Table 4: Top 10 keywords for each topic in Cluster Table, along with the ratio of figures in each topic over time.
Topic 0: results, table, models, different, performance, best, scores, dataset, comparison, accuracy
Topic 1: words, figure, word, number, example, table, sentence, example, sentences, used
Topic 2: et, al, 2015, 2016, 2014, 2017, 2013, results, 2011, taken
Topic 3: set, test, training, data, development, sets, table, dev, used, statistics
Topic 4: model, language, trained, baseline, lstm, proposed, models, attention, layer, performance

Year | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4
2008 | 28.6%   | 25.0%   | 17.8%   | 22.9%   |  5.7%
2009 | 31.1%   | 26.8%   | 16.1%   | 18.6%   |  7.4%
2010 | 31.2%   | 24.2%   | 16.9%   | 21.0%   |  6.7%
2011 | 34.2%   | 25.1%   | 17.2%   | 16.3%   |  7.2%
2012 | 39.1%   | 22.7%   | 16.0%   | 15.5%   |  6.7%
2013 | 37.3%   | 21.7%   | 17.3%   | 15.9%   |  7.8%
2014 | 39.4%   | 19.8%   | 16.0%   | 14.9%   |  9.9%
2015 | 43.9%   | 18.7%   | 14.2%   | 12.2%   | 11.0%
2016 | 45.3%   | 18.5%   | 13.4%   | 10.4%   | 12.4%
2017 | 47.6%   | 17.4%   | 12.4%   | 10.1%   | 12.5%

Figure 8: The confusion matrix of the figure type classifier. The classifier achieves 0.868 overall accuracy.

whose abstracts include "neural network" or "deep learning" in the selected categories over time. The usage profile by field for embedding visualizations is similar to that of neural network diagrams; the trend is shown in the middle line chart in Fig. 9. Finally, we select seven influential papers in deep learning research: AlexNet [23], GAN [17], LSTM [19], ResNet [18], RNN [37], VGG [41], and Word2Vec [34]. We calculate the citation count received by each paper in each year to show the growth of these papers' influence (bottom line chart in Fig. 9). We compare these results with our visualization-based metrics to study our third hypothesis: we can use specific types of figures to track the propagation of ideas and methods in the literature. From the three plots, we make the following observations. First, the three line charts demonstrate the same tendency: a rapid rise in recent years. It is not surprising to see this common trend; increased interest in a topic leads to both increasing citations and an increasing number of relevant diagrams across the literature.
Second, the count of papers that include "neural network" in their abstracts steadily increases from 2012 to 2014 (yellow background), as does the citation count of one particular paper, AlexNet. But there is no increase in the use of figures during this period. The cost of mentioning "neural networks" or citing a relevant paper is low, but the cost of developing a relevant figure is high. We interpret this result as evidence that the use of a figure is better correlated with the true adoption of a concept or method, as opposed to simply acknowledging the relevance of a concept or method. After a novel idea is published, the community rapidly begins to discuss the work and, potentially, cites a relevant paper. But it takes time for the community to integrate the concept into their own research. Once they have done so, the cost of developing a figure is justified, and the number of figures increases: only when the community truly adopts the concept do the corresponding visualizations begin to emerge in the literature.

Figure 9: The three line charts demonstrate the trend of recent studies in deep learning using three different media: figures (top), text (middle), and citations (bottom). Top: the number of papers that include neural network diagrams over time. Middle: the count of papers that have "neural network" or "deep learning" in their abstracts over time. Bottom: the citation counts of seven selected influential papers in deep learning; the annotation on each influential paper indicates its publication time. Citation counts of the most influential papers and use of the term "neural network" in abstracts increase quickly (yellow area), but the effect is small. The use of relevant figures increases only once authors start to truly adopt the concept in their research.
Third, the number of neural network diagrams increases dramatically in 2015 in the four relevant disciplines, while, except for AlexNet, we do not see such rapid growth in received citation counts until 2017 (ResNet and VGG). There is a two-year gap between the emergence of neural network diagrams and the rise of the received citation counts. Figures, like text, react faster to the introduction of new ideas than aggregate citation counts do. These results both validate the use of figures as a signal of scientific communication and show that figures expose patterns not otherwise discernible.

6 CONCLUSION

In this study, we demonstrate the feasibility of using visual information as a measure of similarity. We show that visual distance can recover the overall relationships between fields, achieving a moderately high correlation (0.706) with citation distance. In addition, we show that visual distance still delivers valuable information when it disagrees with citation distance. We further conduct a case study on two specific types of figures: neural network diagrams and embedding visualizations. We find that the upward trend of neural network diagrams and embedding visualizations predates the citation counts of influential papers in recent years. This provides evidence that figures in the scientific literature are leading indicators of citations.

We plan to extend our study to more fine-grained figure labels. This extension will afford better interpretation of the correlations between figures, text, and citations and help us better refine our groupings. In addition, we plan to apply these visual demarcation techniques to tasks in information retrieval and recommender systems.

7 ACKNOWLEDGMENTS

This research has made use of NASA's Astrophysics Data System Bibliographic Services.
REFERENCES
[1] Alberto Accomazzi, Gunther Eichhorn, Michael J Kurtz, Carolyn S Grant, Edwin Henneken, Markus Demleitner, Donna Thompson, Elizabeth Bohlen, and Stephen S Murray. 2006. Creation and Use of Citations in the ADS. arXiv preprint cs/0610011 (2006).
[2] Bram van den Akker, Ilya Markov, and Maarten de Rijke. 2019. ViTOR: Learning to Rank Webpages Based on Visual Features. arXiv preprint (2019).
[3] Rabah A Al-Zaidy and C Lee Giles. 2015. Automatic extraction of data from bar charts. In K-CAP. ACM, 30.
[4] Richard E Bellman. 1961. Adaptive control processes: a guided tour. Vol. 2045. Princeton University Press.
[5] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-U Hwang. 2006. Complex networks: Structure and dynamics. Physics Reports 424, 4-5 (2006), 175–308.
[6] Jean Charbonnier, Lucia Sohmen, John Rothman, Birte Rohden, and Christian Wartena. 2018. NOA: A Search Engine for Reusable Scientific Images Beyond the Life Sciences. In ECIR. Springer, 797–800.
[7] Zhe Chen, Michael Cafarella, and Eytan Adar. 2015. DiagramFlyer: A search engine for data-driven diagrams. In The Web Conference. ACM, 183–186.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 248–255.
[9] Yuxiao Dong, Hao Ma, Zhihong Shen, and Kuansan Wang. 2017. A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations. In KDD. ACM, 1437–1446.
[10] Stuart E Dreyfus. 1969. An appraisal of some shortest-path algorithms. Operations Research 17, 3 (1969), 395–412.
[11] Guenther Eichhorn. 1994. An overview of the astrophysics data system. Experimental Astronomy 5, 3-4 (1994), 205–220.
[12] Stephanie Elzer, Sandra Carberry, and Ingrid Zukerman. 2011. The automated understanding of simple bar charts. Artificial Intelligence 175, 2 (2011), 526–555.
[13] Jing Fang, Prasenjit Mitra, Zhi Tang, and C Lee Giles. 2012. Table Header Detection and Classification. In AAAI. 599–605.
[14] Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. 2018. Science of science. Science 359, 6379 (2018), eaao0185.
[15] Robert P Futrelle, Mingyan Shao, Chris Cieslik, and Andrea Elaina Grimes. 2003. Extraction, layout analysis and classification of diagrams in PDF documents. In ICDAR. IEEE, 1007–1013.
[16] Eugene Garfield. 2006. The history and meaning of the journal impact factor. JAMA 295, 1 (2006), 90–93.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[20] Wenyi Huang, Zhaohui Wu, Prasenjit Mitra, and C Lee Giles. 2014. RefSeer: A citation recommendation system. In JCDL. IEEE Press, 371–374.
[21] J.D. West, I. Wesley-Smith, and C. T. Bergstrom. 2016. A recommendation system based on hierarchical clustering of an article-level citation network. IEEE Transactions on Big Data 2, 2 (June 2016), 113–123. https://doi.org/10.1109/TBDATA.2016.2541167
[22] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In ECCV. Springer, 235–251.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[24] Onur Küçüktunç, Erik Saule, Kamer Kaya, and Ümit V Çatalyürek. 2012. Direction awareness in citation recommendation. (2012).
[25] Michael J Kurtz and Edwin A Henneken. 2017. Measuring metrics - a 40-year longitudinal cross-validation of citations, downloads, and peer review in astrophysics. Journal of the Association for Information Science and Technology 68, 3 (2017), 695–708.
[26] Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (1999), 788.
[27] Poshen Lee, Jevin West, and Bill Howe. 2016. VizioMetrix: A Platform for Analyzing the Visual Information in Big Scholarly Data. In The Web Conference Workshop on BigScholar.
[28] Poshen Lee, Jevin West, and Bill Howe. 2017. Viziometrics: Analyzing Visual Patterns in the Scientific Literature. IEEE Transactions on Big Data (2017).
[29] Poshen Lee, T. Sean Yang, Jevin West, and Bill Howe. 2017. PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms. (2017).
[30] Loet Leydesdorff and Ping Zhou. 2007. Nanotechnology as a field of science: Its delineation in terms of journals and patents. Scientometrics 70, 3 (2007), 693–713.
[31] Xiaonan Lu, J Wang, Prasenjit Mitra, and C Lee Giles. 2007. Automatic extraction of data from 2-d plots in documents. In ICDAR, Vol. 1. IEEE, 188–192.
[32] Nathan Mantel. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27, 2 Part 1 (1967), 209–220.
[33] IV Marshakova. 1973. Co-Citation in Scientific Literature: A New Measure of the Relationship Between Publications. Scientific and Technical Information Serial of VINITI 6 (1973), 3–8.
[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[35] Douglas L Nelson, Valerie S Reed, and John R Walling. 1976. Pictorial superiority effect. Journal of Experimental Psychology: Human Learning and Memory 2, 5 (1976), 523.
[36] F James Rohlf and David R Fisher. 1968. Tests for hierarchical structure in random data sets. Systematic Biology 17, 4 (1968), 407–412.
[37] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533.
[38] Akshay Sethi, Anush Sankaran, Naveen Panwar, Shreya Khare, and Senthil Mani. 2018. DLPaper2Code: Auto-generation of code from deep learning research papers. In AAAI.
[39] Mingyan Shao and Robert P Futrelle. 2005. Recognition and classification of figures in PDF documents. In International Workshop on Graphics Recognition. Springer, 231–242.
[40] Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, and Ali Farhadi. 2016. FigureSeer: Parsing result-figures in research papers. In ECCV. Springer, 664–680.
[41] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[42] Trevor Strohman, W Bruce Croft, and David Jensen. 2007. Recommending citations for academic papers. In SIGIR. ACM, 705–706.
[43] D. Vilhena, J. Foster, M. Rosvall, J.D. West, J. Evans, and C. Bergstrom. 2014. Finding Cultural Holes: How Structure and Culture Diverge in Networks of Scholarly Communication. Sociological Science 1 (2014), 221–238. https://doi.org/10.15195/v1.a15
[44] Colin Ware. 2012. Information visualization: perception for design. Elsevier.
[45] J.D. West and J. Portenoy. 2016. Delineating Fields Using Mathematical Jargon. In JCDL Workshop on BIRNDL.
[46] Sean Yang, Kuan-Hao Huang, and Bill Howe. 2019. MultiDEC: Multi-Modal Clustering of Image-Caption Pairs. arXiv preprint arXiv:1901.01860 (2019).
[47] Michel Zitt and Elise Bassecoulard. 2006. Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences. Information Processing & Management 42, 6 (2006), 1513–1531.