A Statistical Framework for Detecting Emergent Narratives in Longitudinal Text Corpora
Narratives about economic events and policies are widely recognised as influential drivers of economic and business behaviour. Yet the statistical identification of narrative emergence remains underdeveloped. Narratives evolve gradually, exhibit subt…
Authors: Cynthia Medeiros, John Quigley, Matthew Revie
Department of Management Science, University of Strathclyde, UK
cynthia.medeiros@strath.ac.uk

Abstract

Narratives about economic events and policies are widely recognised as influential drivers of economic and business behaviour. Yet the statistical identification of narrative emergence remains underdeveloped. Narratives evolve gradually, exhibit subtle shifts in content, and may exert influence disproportionate to their observable frequency, making it difficult to determine when observed changes reflect genuine structural shifts rather than routine variation in language use.

We propose a statistical framework for detecting narrative emergence in longitudinal text corpora using Latent Dirichlet Allocation (LDA). We define emergence as a sustained increase in a topic’s relative prominence over time and articulate a statistical framework for interpreting such trajectories, recognising that topic proportions are latent, model-estimated quantities.

We illustrate the approach using a corpus of academic publications in economics spanning 1970-2018, where Nobel Prize-recognised contributions serve as externally observable signals of influential narratives. Topics associated with these contributions display sustained increases in estimated prevalence that coincide with periods of heightened citation activity and broader disciplinary recognition. These findings indicate that model-based topic trajectories can reflect identifiable shifts in economic discourse and provide a statistically grounded basis for analysing thematic change in longitudinal textual data.
Keywords: Latent Dirichlet Allocation; Topic models; Longitudinal text analysis; Narrative emergence

1 Introduction

A narrative is understood as a structured account that organises events into a coherent sequence and assigns them meaning within a broader context, that is, a construction through which actors interpret and situate experience (Somers, 1994). At the individual level, narratives shape how agents interpret their environment, assess risks, and evaluate the plausibility of alternative courses of action. Beyond this interpretative role, narratives may operate as dynamic social forces. As stories that circulate within public discourse, they can capture attention and shape collective understandings of economic and political events (Shiller, 2020). When particular narratives become prominent, they influence expectations and coordinate beliefs across individuals. Through this process, shifts in dominant narratives may translate into changes in financial, political, and societal decision-making, with aggregate consequences.

Shiller (2020) argues that narratives evolve from individual discussions to collective prominence through a process of emergence, increase in traction, and eventual decline, often resembling a hump-shaped pattern. Once a narrative reaches its peak, it can be easily identified through the frequency with which it appears in discourse, and its decline can similarly be observed as references to it become less frequent over time.

The early stages of emergence, however, are more difficult to observe. During this phase, narratives typically constitute only a small fraction of overall discourse, may be partially obscured by dominant stories, and often evolve through subtle shifts in emphasis or framing. As such, their identification requires formal methods capable of detecting structured change within discourse.
This article proposes a statistical framework for identifying the emergence of narratives in longitudinal data. Since narratives are expressed and transmitted through language, we treat text as observable data from which latent thematic structure can be inferred. We employ Latent Dirichlet Allocation (LDA), a probabilistic topic model, to recover the underlying themes (‘topics’) in a collection of documents (Blei et al., 2003), and to estimate their relative prominence over time.

A range of extensions to LDA has been developed, including the Dynamic Topic Model (Blei and Lafferty, 2006), which allows topic–word distributions to evolve over time; the Structural Topic Model (Roberts et al., 2013), which incorporates document-level metadata; and more recent neural or embedding-based approaches, such as the BERTopic model (Grootendorst, 2022). These frameworks are often motivated by predictive performance or representational richness. However, they also introduce additional layers of modelling and tuning choices that can complicate interpretation when the primary object of interest is a statistically interpretable measure of thematic prominence over time.

The present study therefore adopts baseline LDA deliberately. Our objective is not to maximise predictive accuracy, but to examine what can be inferred from the posterior distribution of a generative model about the structure of discourse over time, a central research stream in topic modelling identified by Blei (2012). LDA is particularly well suited to this goal because it yields an explicit probabilistic representation of document-level topic proportions that can be aggregated across time periods to quantify changes in thematic prominence within discourse.
Moreover, fixing topics over the sample avoids the comparability issues that may arise when topic definitions themselves evolve, thereby providing a stable reference frame for interpreting temporal change. Recent work on emergence detection also suggests that probabilistic topic models such as LDA can provide reliable signals for retrospective identification of emerging themes, in some settings outperforming embedding-based alternatives such as BERTopic (Li et al., 2025). Embedding-based approaches remain valuable when the objective is to identify semantically novel concepts that may not yet have stable lexical realisations (Ma and Nyarko, 2025), and hybrid pipelines combining topic models with large language models (LLMs) have been proposed to detect and interpret narrative shifts (Lange et al., 2025).

Our approach is complementary. The contribution of this article lies not in proposing a new topic model, but in developing a statistical framework for interpreting changes in estimated topic prevalence as signals of narrative emergence. LDA serves as a probabilistic measurement model that produces coherent estimates of thematic prominence, whose temporal dynamics can then be evaluated against external indicators of influence.

This perspective aligns naturally with Shiller’s (2020) account of narratives as evolving thematic constructions that gain prominence within discourse. If narratives manifest as structured re-allocations of thematic attention, then posterior estimates of topic proportions provide a principled statistical object through which such re-allocations can be measured. By focusing on LDA, we isolate the inferential properties of a well-understood generative model and clarify how sustained changes in its posterior summaries may be interpreted as evidence of narrative emergence.
The main goals of this article are threefold. First, to formalise narrative emergence as a sustained increase in estimated topic proportions within longitudinal text data. Second, to clarify how changes in these estimated topic proportions can be interpreted as evidence of narrative formation. Third, to assess whether shifts in topic proportions are associated with measurable indicators of influence.

In Section 2, we present the statistical framework and define narrative emergence in terms of estimated topic proportions. Section 3 describes the data and empirical strategy. In Section 4, we report the empirical results, examining the temporal dynamics of topic proportions and their relationship with indicators of influence. Section 5 concludes with a discussion of interpretation, limitations, and broader implications.

2 Statistical Framework for Narrative Emergence

LDA, introduced by Blei et al. (2003), builds upon earlier mixture models for text, most notably probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999). In pLSA, each document is represented as a mixture of latent topics, but the topic proportions for each document are estimated independently as free parameters. As the number of documents increases, the number of such parameters grows accordingly, which may lead to overfitting and limits the model’s ability to generalise to new documents. LDA addresses this limitation by introducing a Dirichlet prior over document-level topic proportions, inducing a shared probabilistic structure across the corpus (the collection of documents).

Under LDA, each document is therefore modelled as a mixture of latent topics, where a topic is defined as a probability distribution over a fixed vocabulary (the set of unique words present in the corpus), and documents differ in the proportions with which they express these topics. Let $D$ denote the number of documents in the corpus and $K$ the number of topics.
For document $d$, let $\theta_d = (\theta_{d1}, \cdots, \theta_{dK})$ denote the vector of topic proportions, where $\theta_{dk}$ represents the share of document $d$ devoted to topic $k$. Each topic $k$ is associated with a word distribution $\beta_k$, a probability vector over the vocabulary. The generative structure of LDA assumes that topic proportions $\theta_d$ are drawn from a Dirichlet distribution with parameter $\alpha$, and topic-specific word distributions $\beta_k$ are drawn from a Dirichlet distribution with parameter $\eta$. Conditional on $\theta_d$, each word in document $d$ is generated by first sampling a topic assignment and then sampling a word from the corresponding topic distribution. More formally, the generative process can be described as (Blei et al., 2003):

1. For each topic $k = 1, \cdots, K$:
   – Draw the per-topic word distribution $\beta_k \sim \mathrm{Dirichlet}(\eta)$.
2. For each document $d = 1, \cdots, D$:
   – Draw the topic-proportion vector $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
   – For each word $n = 1, \cdots, N_d$:
     (a) Draw the topic assignment $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$.
     (b) Draw the observed word $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$.

Formally, the joint distribution of the latent and observed variables can be written as

$$p(\beta, \theta, z, w) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} \left( p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}}) \right) \tag{1}$$

where $N_d$ is the number of words in document $d$, $z_{dn}$ denotes the latent topic assignment for word $n$ in document $d$, and $w_{dn}$ the observed word.

Exact posterior inference under this model is intractable (Blei et al., 2003). In practice, approximate methods such as variational inference (Blei et al., 2003) or Gibbs sampling (Steyvers and Griffiths, 2007) are used to obtain posterior summaries. In this study, we use the posterior mean of the document-level topic proportions $\hat{\theta}_{dk}$ as our quantity of interest.
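To make the generative process concrete, the following minimal sketch simulates a small synthetic corpus using only the Python standard library. It is not the estimation procedure used in this study: it samples forward from the model rather than inferring a posterior. The function names (`dirichlet`, `categorical`, `generate_corpus`), the symmetric hyperparameter values, and the corpus dimensions are assumptions chosen purely for illustration.

```python
import random


def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def categorical(probs, rng):
    """Sample an index 0..len(probs)-1 with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, eta=0.1, seed=0):
    """Simulate documents from the LDA generative process.

    alpha and eta are symmetric Dirichlet parameters (illustrative values).
    Returns (beta, thetas, docs), where docs holds word indices.
    """
    rng = random.Random(seed)
    # Step 1: one word distribution beta_k per topic.
    beta = [dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # Step 2: topic proportions theta_d for this document.
        theta = dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            z = categorical(theta, rng)             # (a) topic assignment z_dn
            words.append(categorical(beta[z], rng)) # (b) observed word w_dn
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs


beta, thetas, docs = generate_corpus(num_docs=5, doc_length=20,
                                     num_topics=3, vocab_size=10)
```

Estimation reverses this direction: the words are observed, and $\theta_d$ and $\beta_k$ are inferred from them, with the posterior means $\hat{\theta}_{dk}$ retained as document-level summaries.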
These posterior mean estimates represent the model-based share of document $d$ devoted to topic $k$ and serve as summaries of thematic composition.

When documents are indexed by time, the estimated topic proportions can be aggregated within time periods to construct a time series of thematic composition. Let $t = 1, \cdots, T$ index discrete time periods, and let $\mathcal{D}_t$ denote the set of documents observed in period $t$. For each topic $k$, we define the average topic proportion in period $t$ as

$$\bar{\theta}_{k,t} = \frac{1}{|\mathcal{D}_t|} \sum_{d \in \mathcal{D}_t} \hat{\theta}_{dk} \tag{2}$$

$\bar{\theta}_{k,t}$ represents the estimated share of discourse in period $t$ associated with topic $k$. It provides a comparable measure of thematic prominence across time: since all documents are modelled as mixtures of the same set of topics, changes in $\bar{\theta}_{k,t}$ reflect shifts in the relative emphasis placed on recurring thematic patterns within the corpus.

For each topic $k$, the aggregated quantity $\bar{\theta}_{k,t}$ defines a time series of estimated topic proportions. We interpret narrative emergence as a persistent upward shift in this series relative to its earlier level. In practical terms, emergence occurs when a theme that previously occupied a small share of discourse exhibits a sustained increase in its estimated topic proportion, corresponding to the rising phase of the rise-peak-decline dynamic described by Shiller (2020).

3 Data and Empirical Strategy

To illustrate the proposed framework, we examine the evolution of discourse within academic economics. Journal publications provide a record of disciplinary debate spanning several decades, making it possible to follow how thematic concerns develop over time.
Unlike public or media discussions, which may shift quickly in response to immediate events, academic research typically evolves more gradually: new lines of research are introduced through foundational contributions, adopted by other scholars, debated and refined, and eventually incorporated into mainstream research. As this process unfolds, it reshapes the distribution of attention within the literature. Early contributions may occupy a relatively small share of discussion; over time, related terminology becomes more common, associated questions expand, and the research agenda broadens. The resulting transformation is not a single break, but a sustained reallocation of thematic emphasis across the corpus.

Economics offers a particularly structured setting for examining these dynamics. The discipline is organised around a well-established classification system, JEL codes, that delineates subfields and provides a consistent framework for categorising research over time. Economic research is highly heterogeneous, spanning theoretical, empirical, and methodological traditions that evolve at different rates, and often respond to distinct intellectual and policy influences (Hamermesh, 2013). As such, innovations that reshape discourse in one subfield may not translate directly into others. Analysing subfield-specific corpora defined by JEL codes therefore provides a more coherent thematic context, reducing cross-domain heterogeneity and improving the interpretability of temporal changes in discourse.

Additionally, economics exhibits a highly centralised publication hierarchy. A small number of general-interest journals (the ‘Top-5’: American Economic Review (AER), Quarterly Journal of Economics (QJE), Journal of Political Economy (JPE), Econometrica (ECMA), and Review of Economic Studies (REStud)) play a disproportionate role in shaping mainstream research agendas (Ellison, 2002; Card and DellaVigna, 2013; Heckman and Moktan, 2020). As a result, shifts in thematic emphasis within these outlets tend to reflect broader changes in disciplinary attention rather than localised debates. Intellectual developments are therefore traceable both within subfields and at the core of the discipline.

Finally, the discipline also provides identifiable points at which particular lines of work receive broad professional recognition. The awarding of the Nobel Prize in Economic Sciences acknowledges contributions that have substantially influenced the direction of the discipline, marking a stage at which that body of work has become widely established. This makes it possible to examine whether thematic prominence increases in the period preceding such recognition. Figure 1 illustrates this process.

Fig. 1. Flow chart illustrating the stages of a Narrative Shift in academic thought. In our case study, we are mainly concerned with using NLP to identify the changes that occurred in Step 3.

Full-text articles were obtained from JSTOR’s Data for Research platform (Constellate), which provided access to digitised journal archives and associated metadata. The corpus consists of 17,877 research articles published in AER, QJE and JPE, which are consistently identified as the most influential among the ‘Top-5’ general-interest journals (Card and DellaVigna, 2013; Hamermesh, 2013).
The sample period begins in 1970, when JEL classification was introduced across the field, allowing discourse to be analysed within a consistent subfield structure, and concludes in 2018, the last complete publication year available in the JSTOR database at the time of data extraction. From this archive, subfield-specific corpora were constructed using JEL classification codes, and topic models were estimated separately within each subfield.

The full text of each article, indexed by publication year, was standardised using conventional natural language pre-processing procedures. The JSTOR archive provided tokenised text; from this, non-alphabetic characters, numerical expressions, and stop words (e.g. ‘the’, ‘and’, ‘of’) were removed. Generic academic terms with limited semantic content were excluded to reduce noise. All tokens were lowercased. Stemming or lemmatisation was not applied, in order to preserve distinctions between conceptually meaningful variants (for example, ‘institution’ and ‘institutional’).

After preprocessing, the corpus was transformed into a document-term matrix, where each row corresponds to a document and each column to a term in the resulting vocabulary. Matrix entries record within-document word frequencies. This representation defines the observed data on which LDA is estimated, yielding document-specific topic proportions that can subsequently be aggregated by publication year.

To examine whether thematic prominence increases prior to formal professional recognition, we focus on Nobel Prize-recognised contributions introduced in or after 1980. This restriction reflects both empirical and design considerations.
Empirical evidence indicates a lag between initial publication of a contribution and its subsequent Nobel recognition in economics, estimated at approximately 23 ± 7.5 years (Bjørk, 2020). Given that the corpus begins in 1970, selecting 1980 as the earliest introduction year ensures that a meaningful pre-recognition period is observable within the data for each contribution. In particular, contributions introduced in the 1980s and 1990s are typically recognised in the 2010s and 2020s, allowing the analysis to capture both the diffusion phase preceding formal recognition and the consolidation phase that follows. This restriction therefore ensures that narrative trajectories can be examined over a sufficiently long temporal horizon within the 1970-2018 window.

In our analysis, we exclude awards primarily associated with econometric methodology, as their diffusion reflects technical developments rather than shifts in substantive research discourse. The selected contributions span major subfields defined by JEL classification, enabling narrative trajectories to be examined within relatively coherent thematic domains. A complete list of included laureates and associated publications is provided in the Appendix.

To complement the analysis of topic trajectories, we compare the annual evolution of $\bar{\theta}_{k,t}$ with citation counts for the corresponding contributions. Citation counts are a conventional bibliometric indicator of scholarly influence and provide an observable benchmark against which changes in thematic prominence can be interpreted (Hamermesh, 2013; Card and DellaVigna, 2013).

4 Empirical Results

4.1 Topic Identification

We begin by examining whether the LDA models recover thematically coherent structures that align with Nobel Prize-recognised contributions introduced after 1980. Within each subfield-specific corpus, the number of topics $K$ was selected through iterative estimation.
While likelihood-based criteria such as perplexity assess statistical fit, they do not guarantee interpretability of topic-word distributions; prior work has shown that human evaluation more reliably captures thematic coherence in topic modelling applications (Chang et al., 2009). We therefore retained specifications that yielded clearly distinguishable and interpretable clusters of terms.

Topic identification proceeds by examining the highest-probability words within each topic given the $\beta_k$ posterior and evaluating whether the resulting lexical clusters correspond to the conceptual or methodological orientation of the recognised contributions. A topic is considered aligned with a Nobel-recognised contribution only if its dominant vocabulary reflects a coherent research strand rather than isolated keywords. Table 1 summarises the topics identified as most closely associated with the selected laureates.

Results indicate that the model recovers several Nobel-associated research strands as lexically distinct topics, but not all. In development economics, one topic is characterised by terms such as ‘treatment’, ‘experiment’, ‘school’, ‘child’, ‘health’, and ‘intervention’. The concentration of terminology related to field experimentation and programme evaluation reflects the empirical paradigm associated with the introduction of randomised controlled trials by laureates Abhijit Banerjee, Esther Duflo and Michael Kremer (Miguel and Kremer, 2004; Duflo et al., 2007; Banerjee et al., 2015). A separate development-related topic includes terms such as ‘institution’, ‘democracy’, ‘government’, and ‘property’, corresponding to research emphasising the long-run role of political institutions in shaping economic outcomes, introduced by Daron Acemoglu, Simon Johnson and James Robinson (Acemoglu et al., 2001, 2005; Robinson and Acemoglu, 2012).
The lexical separation of these clusters suggests that the model distinguishes between alternative intellectual approaches within the same subfield: one centred on methodological innovation and micro-empirical evaluation, the other on theoretical and historical analysis.

In macroeconomic research, a further topic associated with financial crises and banking instability is identified, characterised by vocabulary related to credit markets, banking systems, liquidity constraints, and crisis transmission. This cluster aligns with the literature surrounding financial intermediation and systemic risk, including contributions by Ben Bernanke (Bernanke and Gertler, 2001) and by Douglas Diamond and Philip Dybvig (Diamond and Dybvig, 1983; Diamond, 1984). Unlike foundational asset-pricing terminology, which is pervasive within financial economics, this topic reflects concentrated discourse around banking fragility and crisis dynamics, indicating the presence of a coherent research strand.

Within labour economics, the model identifies a distinct topic centred on gender and labour market dynamics, characterised by terms such as ‘female’, ‘wage’, ‘participation’, ‘education’, and ‘gender’. The concentration of this vocabulary distinguishes it from broader human capital or earnings-related clusters. The topic aligns closely with research on female labour force participation and gender wage inequality by Claudia Goldin (Goldin, 1990, 2006, 2014).

By contrast, the contributions of Paul Romer, David Card, and Robert Shiller do not correspond clearly to distinguishable topics recovered by LDA. In Romer’s case, his contribution on endogenous growth theory (Romer, 1986, 1990) extends and reformulates the neoclassical growth theory pioneered by Solow (1957), drawing heavily on an already established macroeconomic vocabulary.
As a result, the associated discourse remains embedded within dominant growth-related themes rather than appearing as a separate lexical cluster. Similarly, Card’s contributions to labour economics are primarily methodological, advancing quasi-experimental approaches to causal inference within applied microeconomics (Card and Krueger, 1993; Card, 2001). These methods are applied across diverse contexts and do not rely on a narrow subject-matter lexicon. As a consequence, the language associated with this research appears dispersed across broader applied topics, rather than within a distinct thematic grouping. Shiller’s recognised work spans multiple domains within financial economics, including asset return predictability, behavioural finance, and housing market measurement (Shiller, 1990a,b, 2003). The dispersion of these contributions across related but distinct areas makes it difficult for the model to associate them with a single coherent vocabulary. In this case, the absence of a dedicated topic reflects the distributed lexical expression of the work within the corpus.

Table 1. Nobel-associated topics identified by LDA

| Subfield | Topic ID | Representative Words | Interpreted theme | Associated laureate(s) |
|---|---|---|---|---|
| Development | T8 | experiment, treatment, design, school, health | Experimental development / RCT | Banerjee, Duflo, Kremer |
| Development | T9 | institution, democracy, government, electoral, vote | Institutional political economy | Acemoglu, Johnson, Robinson |
| Labour | T7 | female, wage, labor, education, school | Gender and labour markets | Goldin |
| Macroeconomics | T6 | effect, control, coefficient, design, treatment | Experimental development / RCTs | Banerjee, Duflo, Kremer |
| Macroeconomics | T12 | bank, asset, risk, borrow, credit | Asset pricing / market dynamics | Ben Bernanke, Diamond & Dybvig |
| Finance | T4 | vote, election, control, institution, government | Institutional political economy | Acemoglu, Johnson, Robinson |
| Finance | T7 | dynamic, aggregate, shock, investment, borrow | Financial intermediation and macro-financial shocks | Ben Bernanke, Diamond & Dybvig |

Notes: Top terms shown are the ten highest-probability words within each topic; ellipses indicate omitted terms.

Taken together, these cases illustrate that LDA can capture contributions as distinct topics when they are expressed through concentrated and recognisable thematic vocabularies. Where influential research extends existing paradigms, operates methodologically across fields, or spans multiple substantive domains, its influence may be reflected in shifts within broader topics rather than through the formation of a separate cluster.

4.2 Temporal Dynamics and Trend Detection

To examine whether the trajectories of the topics in Section 4.1 exhibit sustained temporal change consistent with narrative emergence, we analyse the evolution of the aggregated topic proportions $\bar{\theta}_{k,t}$ defined in Equation 2.
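The aggregation in Equation 2 is a per-period average of document-level estimates. The following is a minimal sketch, assuming each document comes with a publication year and an already estimated vector of topic proportions; the function name `average_topic_proportions` and the toy values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def average_topic_proportions(doc_years, doc_topics):
    """Average document-level topic proportions within each year (Equation 2).

    doc_years:  one publication year per document.
    doc_topics: one list of K estimated topic proportions per document.
    Returns {year: [mean proportion of topic 0, ..., topic K-1]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, theta in zip(doc_years, doc_topics):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, value in enumerate(theta):
            sums[year][k] += value
        counts[year] += 1
    return {year: [s / counts[year] for s in sums[year]]
            for year in sorted(sums)}


# Toy input: three documents, two topics (illustrative values only).
doc_years = [1970, 1970, 1971]
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
series = average_topic_proportions(doc_years, doc_topics)
```

Reading off one topic index across the sorted years gives the time series for that topic. Because each document contributes equally regardless of length, the result is an unweighted mean over documents, as in Equation 2.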
Figure 2 presents the resulting time series for seven selected topics associated with the Nobel-recognised research strands across finance, macroeconomics, labour, and growth, between 1970 and 2018. Visual inspection suggests heterogeneous patterns of increase across topics. Finance - Topic 7 and Macroeconomics - Topic 12 display particularly pronounced upward trajectories, while the remaining topics exhibit more gradual but consistently rising trajectories.

Consistent with the definition of narrative emergence as a sustained increase in $\bar{\theta}_{k,t}$, we use the Mann-Kendall test to evaluate whether each trajectory exhibits a monotonic upward trend over time. Alongside statistical significance, we report Kendall’s $\tau$ as a measure of the strength of monotonic association between time and topic prevalence. To characterise the magnitude of change, we complement the test with Sen’s slope estimator, which provides a robust estimate of the median annual rate of increase together with a confidence interval.

Fig. 2. Time series of $\bar{\theta}_{k,t}$ for selected Nobel-associated topics across subfields, 1970-2018. The series indicate sustained increases in Finance-Topic 7 and Macroeconomics-Topic 12, alongside more moderate upward movements in the other topics.

The results of the trend analysis are reported in Table 2. All seven trajectories exhibit statistically significant positive trends over the study period. Kendall’s $\tau$ values range from 0.47 to 0.84, indicating moderate to very strong monotonic increases in topic prominence. The strongest trend is observed for Finance - Topic 7 ($\tau = 0.835$), which also displays the largest estimated annual increase in prevalence (Sen’s slope = 0.0076, 95% CI [0.0061, 0.0091]). Macroeconomics - Topic 12 similarly exhibits a strong upward trajectory ($\tau = 0.721$), with a median annual increase of approximately 0.0038.
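The trend statistics used in this section can be sketched with the Python standard library alone. The code below is an illustrative implementation under stated assumptions, not the software used to produce the reported results: it uses the normal approximation with continuity correction for the Mann-Kendall p-value, computes Kendall's $\tau$ without adjusting its denominator for ties, and omits the confidence interval for Sen's slope. The function names and the toy series are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def mann_kendall(series):
    """Two-sided Mann-Kendall trend test for an equally spaced series.

    Returns (tau, z, p_value).
    """
    n = len(series)
    # S = number of increasing pairs minus number of decreasing pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend null, with the usual tie correction.
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(series).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18.0
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    tau = s / (n * (n - 1) / 2.0)
    return tau, z, p_value


def sens_slope(series):
    """Sen's slope: median of all pairwise slopes (change per time step)."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)


# Toy annual prevalence values (illustrative only, not from the paper).
toy_prevalence = [0.02, 0.03, 0.03, 0.05, 0.06, 0.08, 0.07, 0.10, 0.12, 0.13]
tau, z, p_value = mann_kendall(toy_prevalence)
slope = sens_slope(toy_prevalence)
```

Both statistics depend only on the ordering and pairwise differences of the observations, which is why they are robust to occasional outlying years in an estimated prevalence series.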
The remaining topics show smaller but consistently positive slopes, with confidence intervals excluding zero in all cases, indicating that the estimated rates of increase are statistically distinguishable from zero.

Table 2. Mann–Kendall trend tests for annual aggregated topic prevalence (1970–2018) for selected Nobel-associated topics across subfields. Kendall’s $\tau$ measures the strength of the monotonic trend, and Sen’s slope estimates the median annual increase in topic prevalence. All trajectories exhibit statistically significant positive trends.

| Subfield – Topic | $\tau$ | p-value | Sen’s slope | 95% CI |
|---|---|---|---|---|
| Finance – Topic 7 | 0.835 | $< 10^{-6}$ | 0.0076 | [0.0061, 0.0091] |
| Macroeconomics – Topic 6 | 0.621 | $< 10^{-6}$ | 0.0035 | [0.0027, 0.0043] |
| Macroeconomics – Topic 12 | 0.721 | $< 10^{-6}$ | 0.0038 | [0.0029, 0.0047] |
| Labour – Topic 7 | 0.607 | $< 10^{-6}$ | 0.0034 | [0.0028, 0.0041] |
| Growth – Topic 8 | 0.628 | $< 10^{-6}$ | 0.0025 | [0.0019, 0.0032] |
| Growth – Topic 9 | 0.558 | $< 10^{-6}$ | 0.0011 | [0.0009, 0.0014] |
| Finance – Topic 4 | 0.469 | $2.0 \times 10^{-6}$ | 0.0013 | [0.0008, 0.0018] |

The magnitude of these slope estimates provides additional substantive insight into the pace of thematic change. For example, the estimated growth rate for Finance - Topic 7 implies an increase of approximately 0.3 in average topic prevalence over four decades (0.0076 × 40 ≈ 0.30), consistent with the rise observed in Figure 2. Macroeconomics - Topic 12, which exhibits similarly strong growth, is also associated with closely related research on financial intermediation and macro-financial instability, linked to the contributions of Bernanke and of Diamond and Dybvig. The prominence of these trajectories is consistent with the increasing attention to financial crises and systemic risk following the global financial crisis of 2007-2009. In contrast, the growth trajectories associated with Topics 8 and 9 in the growth subfield are more gradual, reflecting slower consolidation of thematic prominence. These differences suggest heterogeneity in the speed with which research narratives gain prominence across subfields, consistent with variation in intellectual diffusion and adoption dynamics across areas of economics.

4.3 Relationship with citations

To examine whether statistically detected narrative emergence corresponds to measurable scholarly influence, we analyse the temporal association between annual aggregated topic prevalence $\bar{\theta}_{k,t}$ and citation counts for the corresponding Nobel-recognised contributions.

Figure 3 presents the joint trajectories of topic prevalence and citation counts for the selected topics. In several cases, notably Finance - Topic 7 and Macroeconomics - Topic 12, the two series appear to move closely together, with pronounced increases in topic prevalence occurring alongside sharp increases in citations. In other cases, such as Labour - Topic 7 and Growth - Topic 9, the increase in thematic prominence appears to precede the rapid acceleration of citations. These visual patterns suggest heterogeneous temporal relationships across subfields and motivate a more formal lead-lag analysis.

Fig. 3. Topic prevalence and citation trajectories for selected Nobel-associated topics, 1970-2018. Upper panels show annual aggregated topic prevalence $\bar{\theta}_{k,t}$; lower panels show citation counts for the corresponding Nobel-recognised contributions. For jointly associated topic-contribution pairs, Bernanke citations are shown in grey and Diamond-Dybvig citations in the topic colour. The panels illustrate both contemporaneous alignment (e.g. Finance Topic 7) and apparent lead–lag patterns (e.g. Labour Topic 7), motivating the formal lag analysis.
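A lead-lag comparison of this kind can be sketched as follows. This is a minimal illustration rather than the authors' code: it computes the Pearson correlation between a prevalence series at year $t$ and a citation series at year $t + \ell$ for each integer lag $\ell$ from $-10$ to $10$, using only the overlapping years, and returns the lag with the highest correlation. A positive lag means prevalence leads citations. The function names and the synthetic series are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(prevalence, citations, max_lag=10):
    """Return {lag: corr(prevalence[t], citations[t + lag])} on overlapping years."""
    n = len(prevalence)
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = prevalence[:n - lag], citations[lag:]
        else:
            x, y = prevalence[-lag:], citations[:n + lag]
        if len(x) >= 3:  # skip lags with too little overlap
            profile[lag] = pearson(x, y)
    return profile


def best_lag(profile):
    """Lag with the highest (non-nan) correlation."""
    valid = {lag: r for lag, r in profile.items() if not math.isnan(r)}
    return max(valid, key=valid.get)


# Synthetic example: citations reproduce the prevalence path three years later.
toy_prevalence = [math.exp(-((t - 20) / 6.0) ** 2) for t in range(40)]
toy_citations = [0.0] * 3 + toy_prevalence[:-3]
toy_profile = lagged_correlations(toy_prevalence, toy_citations)
toy_best = best_lag(toy_profile)
```

Note that longer lags use fewer overlapping years, so correlations at the edges of the window rest on shorter series and are correspondingly noisier.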
Let $C_t$ denote the annual citation count for a given contribution. For each topic-contribution pair, we compute lagged correlations between $\bar{\theta}_{k,t}$ and $C_{t+\ell}$ for lags $\ell \in [-10, 10]$. The symmetric ten-year window is chosen to capture medium-run diffusion dynamics within the observable sample. Empirical evidence indicates that recognition in economics unfolds over extended horizons (Bjørk, 2020). Positive values of $\ell$ indicate that increases in topic prevalence precede citation growth; negative values indicate the reverse. The optimal lag is defined as

\[
\ell^{*} = \arg\max_{\ell} \operatorname{corr}\left(\bar{\theta}_{k,t},\, C_{t+\ell}\right) \tag{3}
\]

Figure 4 provides the full lag-correlation profiles for each topic-contribution pair. Table 3 summarises the maximum correlation, the corresponding lag, and the contemporaneous correlation for each case.

Three distinct temporal structures emerge. First, several topics exhibit very high correlations at or near lag zero. Finance - Topic 7 and Macroeconomics - Topic 12 both display correlations exceeding 0.92 in absolute value, with peak alignment occurring either contemporaneously or within one year. In these cases, thematic prominence and citation accumulation evolve in close synchrony.

Second, Labour - Topic 7 and Growth - Topic 9 exhibit peak correlations at positive lags of approximately 9-10 years. In these cases, the contemporaneous correlations are materially lower than the maximum lagged correlations. For example, Growth - Topic 9 displays a modest correlation at lag zero, but a substantially stronger association at longer positive lags. This pattern indicates that increases in thematic prominence precede citation acceleration, suggesting a gradual consolidation process in which a research strand first expands within discourse before being reflected in formal citation metrics.

Third, Finance - Topic 4 displays weaker and negative correlations across lags, despite exhibiting a statistically significant upward trend in Section 4.2. The absence of strong citation alignment in this case is informative. Not all sustained increases in topic prevalence correspond to concentrated citation dynamics. Thematic expansion may reflect broader integration into multiple research strands rather than diffusion centred on a small number of highly cited contributions.

Fig. 4. Lagged correlations between annual topic prevalence $\bar{\theta}_{k,t}$ and citation counts for selected Nobel-associated topics. The horizontal axis denotes lag in years; positive values indicate that topic prevalence precedes citation growth. Finance–Topic 7 and Macroeconomics–Topic 12 exhibit high correlations near lag zero, while Growth–Topic 9 displays increasing correlation at positive lags, consistent with topic prevalence leading subsequent citation accumulation.

Taken together, the lag analysis strengthens the interpretation of narrative emergence as more than lexical drift. In most cases, statistically significant increases in topic prevalence align with, or precede, measurable shifts in scholarly recognition. Moreover, the variation in lag structure suggests that narrative diffusion is not uniform across subfields. Some narratives rise in tandem with citation growth, while others exhibit a lead-lag structure consistent with delayed institutional consolidation.

The framework therefore provides two complementary layers of inference: (i) internal statistical evidence of sustained thematic change, and (ii) external validation through alignment with citation trajectories.

Table 3. Peak lag correlations between annual topic prevalence $\bar{\theta}_{k,t}$ and citation counts $C_t$. The best lag $\ell^{*}$ maximises $\operatorname{corr}(\bar{\theta}_{k,t}, C_{t+\ell})$ over $\ell \in [-10, 10]$.

| Subfield – Topic | Laureate(s) | Best Lag | Max Corr. | Corr. (Lag 0) | Pattern |
|---|---|---|---|---|---|
| Finance – Topic 7 | Bernanke | 1 | 0.947 | 0.944 | Near-contemporaneous |
| Finance – Topic 7 | Diamond & Dybvig | -1 | 0.954 | 0.951 | Near-contemporaneous |
| Macroeconomics – Topic 12 | Bernanke | 0 | 0.922 | 0.922 | Contemporaneous |
| Macroeconomics – Topic 12 | Diamond & Dybvig | -1 | 0.925 | 0.923 | Near-contemporaneous |
| Macroeconomics – Topic 6 | Banerjee, Duflo & Kremer | 2 | 0.591 | 0.590 | Slight topic lead |
| Labour – Topic 7 | Goldin | 10 | 0.668 | 0.593 | Topic precedes citations |
| Growth – Topic 9 | Acemoglu et al. | 9 | 0.587 | 0.214 | Topic clearly precedes citations |
| Growth – Topic 8 | Banerjee, Duflo & Kremer | 0 | 0.601 | 0.601 | Contemporaneous |
| Finance – Topic 4 | Acemoglu et al. | 10 | -0.636 | -0.462 | Weak / misaligned |

Positive lags indicate that increases in topic prevalence precede citation growth. Negative lags indicate the reverse.

5 Discussion and Conclusion

This article has developed a statistical framework for identifying narrative emergence in longitudinal text corpora. By defining emergence as a sustained increase in aggregated topic proportions estimated via LDA, the framework translates a qualitative theoretical concept into an empirically testable statistical quantity.

First, the framework clarifies what constitutes evidence of emergence. Topic prevalence is a latent, model-estimated quantity. Interpreting changes in $\bar{\theta}_{k,t}$ requires explicit statistical testing rather than visual inspection. The combination of Mann-Kendall trend detection and Sen's slope estimation provides a method for distinguishing persistent structural shifts from short-term fluctuation.

Second, the empirical application demonstrates that influential research strands are detectable when they are expressed through concentrated thematic vocabularies.
Financial intermediation, institutional political economy, experimental development economics, and gender and labour markets emerge as lexically coherent topics with sustained upward trajectories. By contrast, contributions that extend existing paradigms or operate methodologically across domains may diffuse within broader thematic structures rather than forming distinct clusters. Narrative detectability therefore depends, in this context, not only on influence but also on lexical concentration.

Third, the citation alignment analysis shows that narrative emergence often coincides with, or precedes, formal markers of disciplinary recognition. In some subfields, thematic prominence and citation growth evolve contemporaneously; in others, topic prevalence leads citation accumulation by nearly a decade. This heterogeneity highlights the value of combining internal model-based diagnostics with external validation metrics.

The framework also has limitations. LDA assumes a fixed number of topics and exchangeability at the document level. Estimated topic proportions depend on preprocessing choices, prior settings, and the selected number of topics. Citation counts capture only one dimension of influence and may lag behind broader intellectual change.

Future research could extend the framework to examine how exogenous events influence the emergence and evolution of narratives. External shocks such as major publications, policy interventions, or economic crises may induce abrupt reallocations of thematic attention. Modelling such interventions explicitly could help distinguish endogenous narrative growth from externally triggered shifts, for example by incorporating approaches that treat variation in textual communication itself as a source of shocks (Handlan, 2020).
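The lead-lag comparison of Section 4.3 can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions rather than the procedure actually used: `lag_profile` and `best_lag` are hypothetical helper names, each lag uses only the overlapping years of the two series, and the inputs are synthetic (a citation series constructed as the topic series delayed by three years). It follows Eq. (3) in maximising the signed Pearson correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lag_profile(theta, cites, max_lag=10):
    """Map each lag in [-max_lag, max_lag] to corr(theta[t], cites[t + lag]).

    Only years where both theta[t] and cites[t + lag] exist are used,
    so larger |lag| means fewer paired observations.
    """
    n = len(theta)
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = theta[: n - lag], cites[lag:]
        else:
            x, y = theta[-lag:], cites[: n + lag]
        profile[lag] = pearson(x, y)
    return profile


def best_lag(profile):
    """Lag with the largest signed correlation, as in Eq. (3)."""
    return max(profile, key=profile.get)


# Synthetic illustration only: citations are the topic series shifted 3 years later.
theta = [math.exp(-(((t - 20) / 5) ** 2)) for t in range(49)]
cites = [math.exp(-(((t - 23) / 5) ** 2)) for t in range(49)]

profile = lag_profile(theta, cites)
print(best_lag(profile), round(profile[0], 3))
```

With real data, `theta` would hold the annual $\bar{\theta}_{k,t}$ values and `cites` the matching $C_t$ counts. A positive best lag means the topic series leads citations, matching the sign convention of Table 3. Ranking lags by absolute rather than signed correlation would only require changing the key in `best_lag`; which convention is appropriate for a negatively correlated pair is a modelling choice.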
More broadly, the contribution lies in demonstrating that narrative emergence can be operationalised as a statistically tractable phenomenon. Rather than treating narratives as purely interpretative constructs, the framework shows that their rise can be measured as a persistent reallocation of thematic attention within discourse. When combined with external indicators, topic trajectories provide a reproducible lens for studying intellectual transformation over time.

While illustrated using economics journals, the methodology is applicable to any longitudinal corpus in which thematic change unfolds gradually, including scientific fields, policy debates, and media discourse. In this sense, the framework contributes to bridging narrative theory and statistical text analysis, offering a principled basis for analysing how ideas rise, diffuse, and consolidate within structured discursive systems.

Acknowledgements

The authors thank Kevin Wilson for helpful suggestions during the revision of this work, particularly regarding Section 4.3. Any errors are our own.

Bibliography

Acemoglu, D., Johnson, S., and Robinson, J. A. (2001). The colonial origins of comparative development: An empirical investigation. American economic review, 91(5):1369–1401.

Acemoglu, D., Johnson, S., and Robinson, J. A. (2005). Institutions as a fundamental cause of long-run growth. Handbook of economic growth, 1:385–472.

Banerjee, A., Duflo, E., Glennerster, R., and Kinnan, C. (2015). The miracle of microfinance? evidence from a randomized evaluation. American economic journal: Applied economics, 7(1):22–53.

Bernanke, B. S. and Gertler, M. (2001). Should central banks respond to movements in asset prices? American economic review, 91(2):253–257.

Bjørk, R. (2020). The journals in physics that publish nobel prize research. Scientometrics, 122(2):817–823.

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Card, D. (2001). Immigrant inflows, native outflows, and the local labor market impacts of higher immigration. Journal of labor economics, 19(1):22–64.

Card, D. and DellaVigna, S. (2013). Nine facts about top journals in economics. Journal of Economic literature, 51(1):144–161.

Card, D. and Krueger, A. B. (1993). Minimum wages and employment: A case study of the fast food industry in new jersey and pennsylvania.

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22.

Diamond, D. W. (1984). Financial intermediation and delegated monitoring. The review of economic studies, 51(3):393–414.

Diamond, D. W. and Dybvig, P. H. (1983). Bank runs, deposit insurance, and liquidity. Journal of political economy, 91(3):401–419.

Duflo, E., Glennerster, R., and Kremer, M. (2007). Using randomization in development economics research: A toolkit. Handbook of development economics, 4:3895–3962.

Ellison, G. (2002). The slowdown of the economics publishing process. Journal of political Economy, 110(5):947–993.

Goldin, C. (1990). Understanding the gender gap: An economic history of American women. New York.

Goldin, C. (2006). The quiet revolution that transformed women's employment, education, and family. American economic review, 96(2):1–21.

Goldin, C. (2014). A grand gender convergence: Its last chapter. American economic review, 104(4):1091–1119.

Grootendorst, M. (2022). Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.

Hamermesh, D. S. (2013). Six decades of top economics publishing: Who and how? Journal of Economic literature, 51(1):162–172.

Handlan, A. (2020). Text shocks and monetary surprises: Text analysis of fomc statements with machine learning. Working paper.

Heckman, J. J. and Moktan, S. (2020). Publishing and promotion in economics: The tyranny of the top five. Journal of economic Literature, 58(2):419–470.

Hoffman, T. (1999). Probabilistic latent semantic analysis. In Proc. 22nd Annual ACM Conference on Research and Development in Information Retrieval, California, 1999.

Lange, K.-R., Schmidt, T., Reccius, M., Müller, H., Roos, M., and Jentsch, C. (2025). Narrative shift detection: A hybrid approach of dynamic topic models and large language models. arXiv preprint arXiv:2506.20269.

Li, X., Esposito, C. D., Groth, P., Sitruk, J., Szatmari, B., and Wijnberg, N. (2025). Evaluation of unsupervised static topic models' emergence detection ability. PeerJ Computer Science, 11:e2875.

Ma, S. and Nyarko, J. (2025). Identifying emerging concepts in large corpora. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6760–6778.

Miguel, E. and Kremer, M. (2004). Worms: identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72(1):159–217.

Roberts, M. E., Stewart, B. M., Tingley, D., Airoldi, E. M., et al. (2013). The structural topic model and applied social science. In Advances in neural information processing systems workshop on topic models: computation, application, and evaluation, volume 4, pages 1–20. Harrahs and Harveys, Lake Tahoe.

Robinson, J. A. and Acemoglu, D. (2012). Why nations fail: The origins of power, prosperity and poverty. Profile London.

Romer, P. M. (1986). Increasing returns and long-run growth. Journal of political economy, 94(5):1002–1037.

Romer, P. M. (1990). Endogenous technological change. Journal of political Economy, 98(5, Part 2):S71–S102.

Shiller, R. J. (1990a). Market volatility and investor behavior. The American Economic Review, 80(2):58–62.

Shiller, R. J. (1990b). Speculative prices and popular models. Journal of Economic perspectives, 4(2):55–65.

Shiller, R. J. (2003). From efficient markets theory to behavioral finance. Journal of economic perspectives, 17(1):83–104.

Shiller, R. J. (2020). Narrative economics: How stories go viral and drive major economic events.

Solow, R. M. (1957). Technical change and the aggregate production function. The review of Economics and Statistics, 39(3):312–320.

Somers, M. R. (1994). The narrative constitution of identity: A relational and network approach. Theory and Society, 23(5):605–649.

Steyvers, M. and Griffiths, T. (2007). Probabilistic topic models. In Handbook of latent semantic analysis, pages 439–460. Psychology Press.