ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs
ORACLE turns daily news into week-over-week, decision-ready insights for a Finnish University of Applied Sciences. The platform crawls and versions news, applies university-specific relevance filtering, embeds content, classifies items into PESTEL dimensions, and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed, or changed, then groups the differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production, and present a curriculum-intelligence use case with an evaluation plan.
💡 Research Summary
ORACLE is a production‑grade foresight platform built for a Finnish University of Applied Sciences that continuously transforms daily Finnish news streams into decision‑ready, week‑over‑week intelligence. The system begins with a daily RSS crawler that extracts the main article content, normalizes the HTML, and computes a stable hash of the normalized page. If the hash changes for a known URL, a new version is stored and re‑embedded, providing fine‑grained version control and auditability.
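The versioning rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary only says that a "stable hash of the normalized page" is used, so the choice of SHA-256, the whitespace-collapsing `normalize` function, and the in-memory `VersionStore` class are all assumptions made for the example.

```python
import hashlib
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace so cosmetic changes do not alter the hash (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """Hash the normalized text; SHA-256 is an assumption, not stated in the source."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class VersionStore:
    """Hypothetical in-memory stand-in for the platform's versioned storage."""

    def __init__(self) -> None:
        # Maps each URL to its ordered chain of content hashes.
        self.versions: Dict[str, List[str]] = {}

    def ingest(self, url: str, text: str) -> bool:
        """Return True when a new version was stored and should be re-embedded."""
        digest = content_hash(text)
        chain = self.versions.setdefault(url, [])
        if chain and chain[-1] == digest:
            return False  # unchanged content: skip storage and re-embedding
        chain.append(digest)
        return True
```

Under these assumptions, re-crawling a page whose only difference is whitespace returns `False`, while an edited article appends a new entry to the URL's version chain.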
A two‑stage relevance filter tailors the incoming stream to the university’s strategic interests. The first lexical stage removes obviously unrelated items using expanded keyword queries that cover university names, education‑related domains, and geographic terms. The second semantic stage compares each article’s embedding against a curated set of exemplar documents (e.g., funding announcements, curriculum reforms) and retains items that exceed a similarity threshold. Irrelevant items are cold‑stored for possible future re‑activation.
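The two stages can be illustrated with a small sketch. It assumes that article embeddings are already available as plain lists of floats; the function names, the keyword list, and the threshold value of 0.5 are hypothetical, since the summary does not report the actual queries or threshold.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_pass(text: str, keywords: Sequence[str]) -> bool:
    """Stage 1: keep the item only if any keyword occurs in the text."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


def semantic_pass(vec: Sequence[float],
                  exemplars: Sequence[Sequence[float]],
                  threshold: float) -> bool:
    """Stage 2: keep the item if it is close enough to any exemplar embedding."""
    return any(cosine(vec, e) >= threshold for e in exemplars)


def route(text: str,
          vec: Sequence[float],
          keywords: Sequence[str],
          exemplars: Sequence[Sequence[float]],
          threshold: float = 0.5) -> str:
    """Return "keep" or "cold_storage"; the 0.5 default is an illustrative value."""
    if lexical_pass(text, keywords) and semantic_pass(vec, exemplars, threshold):
        return "keep"
    return "cold_storage"
```

Running the cheap lexical check first means the embedding comparison is only needed for items that survive stage one, which is one plausible reason for ordering the stages this way.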
All retained articles are embedded with an OpenAI text-embedding-3 model and stored in a Milvus vector database together with metadata (source, publication date, PESTEL label, version chain). A lightweight supervised classifier assigns a single PESTEL dimension (Political, Economic, Social, Technological, Environmental, Legal) to each article; the authors note that multi‑label extensions are feasible.
The core of ORACLE is the Time‑Dependent Recursive Summary Graph (TRSG), a two‑level hierarchical representation that is recomputed weekly. In the L0→L1 step, a cosine‑similarity graph of the week’s items is built and clustered with the Leiden algorithm. Each resulting community is fed to a large language model (Gemini 2.0 Flash) with a “factual” prompt that forces the model to list the main theme, key entities, dates, figures, policy changes, and regional details. The generated summary text is re‑embedded and stored as an L1 node. In the L1→L2 step, the L1 summaries themselves are clustered, and a second, more abstract “strategic synthesis” prompt produces an L2 node that captures cross‑domain trends and implications. When a cluster’s combined text exceeds the model’s context window, the system recursively splits the text, summarizes each chunk, and then re‑summarizes the interim outputs, preserving completeness while respecting token limits.
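The split-then-resummarize fallback described above could look roughly like the following. This is a sketch under stated assumptions: the `summarize` callable stands in for the LLM call, the budget is measured in characters as a crude proxy for tokens, and the `max_depth` guard is added here for safety and is not mentioned in the source.

```python
from typing import Callable, List

# Hypothetical signature for the LLM call: a list of texts in, one summary out.
Summarizer = Callable[[List[str]], str]


def split_oversized(texts: List[str], budget: int) -> List[str]:
    """Slice any single text longer than the budget into budget-sized pieces."""
    pieces: List[str] = []
    for text in texts:
        pieces.extend(text[i:i + budget] for i in range(0, len(text), budget))
    return pieces


def pack(pieces: List[str], budget: int) -> List[List[str]]:
    """Greedily group pieces into chunks whose total length fits the budget."""
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for piece in pieces:
        if current and size + len(piece) > budget:
            chunks.append(current)
            current, size = [], 0
        current.append(piece)
        size += len(piece)
    if current:
        chunks.append(current)
    return chunks


def recursive_summarize(texts: List[str], summarize: Summarizer,
                        budget: int, max_depth: int = 5) -> str:
    """Summarize directly if the input fits; otherwise summarize chunks and recurse."""
    pieces = split_oversized(texts, budget)
    if max_depth == 0 or sum(len(p) for p in pieces) <= budget:
        return summarize(pieces)
    interim = [summarize(chunk) for chunk in pack(pieces, budget)]
    return recursive_summarize(interim, summarize, budget, max_depth - 1)
```

The recursion terminates as long as each summary is shorter than its input; the depth guard covers the case where a summarizer fails to compress, at the cost of one possibly over-budget final call.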
Week‑to‑week change detection compares consecutive weeks’ L1 and L2 node embeddings. For each new summary, the best‑matching old summary is identified: a similarity of at least 0.90 is labeled “Stable”, a similarity from 0.70 up to (but not including) 0.90 is labeled “Changed”, and a similarity below 0.70 is labeled “Added”. Old nodes left without a match are marked “Removed”. The resulting delta set is turned into human‑readable themes by first generating short micro‑labels with an LLM, then canonicalizing them via TF‑IDF‑based agglomerative clustering. Each theme is further analyzed by a schema‑constrained PESTEL module that returns a structured object containing title, analysis, level, group, and an importance score (0–1). Results are cached in MySQL, keyed by week pair and perspective, which makes repeated queries reproducible and fast.
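The labeling rule can be made concrete with a short sketch. The 0.90 and 0.70 thresholds come from the text; everything else is an assumption for illustration, in particular the function name `detect_changes`, the use of plain float lists as embeddings, and the reading that an old node counts as matched only when some new node selects it with a "Stable" or "Changed" label.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_changes(
    old: Dict[str, Sequence[float]],
    new: Dict[str, Sequence[float]],
    stable: float = 0.90,
    changed: float = 0.70,
) -> Tuple[Dict[str, Tuple[str, Optional[str]]], List[str]]:
    """Label each new node and list the old nodes that no new node matched."""
    labels: Dict[str, Tuple[str, Optional[str]]] = {}
    matched_old = set()
    for new_id, vec in new.items():
        # Find the most similar node from the previous week.
        best_id, best_sim = None, -1.0
        for old_id, old_vec in old.items():
            sim = cosine(vec, old_vec)
            if sim > best_sim:
                best_id, best_sim = old_id, sim
        if best_id is not None and best_sim >= stable:
            labels[new_id] = ("Stable", best_id)
            matched_old.add(best_id)
        elif best_id is not None and best_sim >= changed:
            labels[new_id] = ("Changed", best_id)
            matched_old.add(best_id)
        else:
            labels[new_id] = ("Added", None)
    # Assumed interpretation: old nodes never selected as a match are "Removed".
    removed = [old_id for old_id in old if old_id not in matched_old]
    return labels, removed
```

For example, with old nodes `a = [1, 0]`, `b = [0, 1]` and a new node `[0.8, 0.6]`, the best match is `a` at a similarity of about 0.8, so the new node is labeled "Changed" and `b` ends up in the removed list.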
A concrete use case, Curriculum Intelligence, demonstrates the platform’s value. An analyst comparing weeks 23 and 28 sees two new L2 themes: EU digital‑skills funding and quantum‑computing policy momentum. The L1 summaries provide concrete facts such as program names, funding amounts, and involved institutions. The PESTEL analysis then recommends (i) aligning elective modules with EU skill frameworks, (ii) adding a quantum‑fundamentals track, and (iii) exploring partnerships with local industry labs. Because every recommendation links back to the underlying news items and cluster summaries, stakeholders can justify decisions from a shared, auditable evidence base instead of ad‑hoc reports.
The authors discuss several engineering choices that enhance stability: hash‑based versioning to avoid duplicate processing, a hybrid similarity search that uses direct cosine for small graphs and FAISS for larger ones, snapshotting weekly graphs as pickled objects for quick reload, and deterministic clustering parameters (Leiden modularity, cosine thresholds). They also acknowledge limitations: coverage bias due to the selected news sources, potential hallucination in LLM summaries despite factual prompts, domain‑specific tuning of the PESTEL classifier, and the weekly granularity that may miss rapid bursts. Future work includes multi‑label PESTEL classification, multilingual source integration, and richer policy‑science‑industry link analysis.
In sum, ORACLE presents a coherent, end‑to‑end pipeline that ingests, filters, embeds, clusters, hierarchically summarizes, and tracks changes in a fast‑moving news stream, delivering structured, traceable intelligence aligned with institutional strategic planning. The system’s design balances scalability (vector databases, FAISS), interpretability (two‑level graph, deterministic prompts), and auditability (versioned storage, cached analyses), making it a practical blueprint for other public institutions seeking continuous foresight capabilities.