GLiSE: A Prompt-Driven and ML-Powered Tool for Automated Grey Literature Extraction in Software Engineering

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Grey literature is essential to software engineering research because it captures practices and decisions that rarely appear in academic venues. However, collecting and assessing it at scale remains difficult: its heterogeneous sources, formats, and APIs impede reproducible, large-scale synthesis. To address this issue, we present GLiSE, a prompt-driven tool that turns a research-topic prompt into platform-specific queries, gathers results from common software-engineering web sources (GitHub, Stack Overflow) and Google Search, and uses embedding-based semantic classifiers to filter and rank results by relevance. GLiSE is designed for reproducibility: all settings are configuration-based, and every generated query is accessible. In this paper, we (i) present the GLiSE tool, (ii) provide a curated dataset of software-engineering grey-literature search results classified by semantic relevance to their originating search intent, and (iii) conduct an empirical study of the tool's usability.


💡 Research Summary

The paper introduces GLiSE, a prompt‑driven, machine‑learning‑enhanced tool designed to automate the discovery, acquisition, and curation of grey literature in software engineering. Grey literature—such as blog posts, GitHub issues, Stack Overflow discussions, and official documentation—captures real‑world practices that are rarely represented in academic publications, but its heterogeneous sources, inconsistent metadata, and lack of standardized APIs make large‑scale, reproducible collection difficult.

GLiSE addresses these challenges through a three‑step pipeline.

  1. Query Generation: Users input a free‑text research intent. The system calls an OpenAI large language model (LLM) to translate this intent into platform‑specific search queries for GitHub, Stack Overflow, and Google Search. Users can configure temperature, language restrictions, time windows, and the number of generated queries, ensuring reproducibility. Queries are exportable/importable as JSON.
  2. API Retrieval: The generated queries are executed against the public APIs of the selected platforms. GLiSE handles pagination, retries, and provenance logging. For each result it extracts core metadata (URL, title, snippet) and platform‑specific content (e.g., README files for GitHub repositories, meta‑descriptions for Google). Near‑duplicate detection based on URL, title, and snippet reduces redundancy.
  3. Relevance Classification: Both the search intent and each result’s textual fields are embedded using OpenAI’s text‑embedding‑3‑small or text‑embedding‑3‑large models. Various similarity features—cosine distance, Euclidean distance, L1 distance, element‑wise absolute differences, and element‑wise products—are computed and concatenated. Five machine‑learning classifiers (Gaussian Naïve Bayes, Logistic Regression, XGBoost, Linear SVC, Ridge) are trained on a manually curated dataset of 1,137 labeled items (relevant vs. irrelevant) spanning GitHub repositories, GitHub issues, Stack Overflow posts, and Google results. The best‑performing model for each source‑embedding combination is selected via a 50/50 train‑test split and GridSearchCV hyper‑parameter tuning.
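The query-generation step (step 1) can be sketched as follows. The prompt wording, model name, and JSON schema below are illustrative assumptions, not GLiSE's actual implementation; the live API call is shown commented out because it requires an OpenAI API key.

```python
import json

def build_query_prompt(intent: str, n: int = 3) -> str:
    """Build the instruction sent to the LLM (wording is a hypothetical example)."""
    return (
        f"Translate the research intent below into {n} search queries each for "
        "GitHub, Stack Overflow, and Google Search. Respond as JSON with keys "
        '"github", "stackoverflow", and "google".\n\n'
        f"Intent: {intent}"
    )

# The live call (requires OPENAI_API_KEY; the model name is an assumption):
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(
#     model="gpt-4o-mini",
#     temperature=0.2,  # user-configurable, per the paper, for reproducibility
#     response_format={"type": "json_object"},
#     messages=[{"role": "user",
#                "content": build_query_prompt("flaky test mitigation")}],
# )
# queries = json.loads(resp.choices[0].message.content)
```

Keeping the prompt builder separate from the API call makes the generated instructions easy to log and export, in the spirit of the paper's emphasis on every query being accessible.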

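Step 3's similarity-feature construction can be illustrated with a minimal sketch. The embeddings below are random toy vectors rather than real text-embedding-3 outputs, and the feature set simply mirrors the list in the summary; only one of the paper's five classifiers (Logistic Regression) is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def similarity_features(intent_vec: np.ndarray, result_vec: np.ndarray) -> np.ndarray:
    """Concatenate the features named in the summary: cosine distance, Euclidean
    distance, L1 distance, element-wise absolute differences, and element-wise
    products of the two (unit-normalised) embeddings."""
    a = intent_vec / np.linalg.norm(intent_vec)
    b = result_vec / np.linalg.norm(result_vec)
    cosine_dist = 1.0 - float(a @ b)
    euclidean = float(np.linalg.norm(a - b))
    l1 = float(np.abs(a - b).sum())
    return np.concatenate([[cosine_dist, euclidean, l1], np.abs(a - b), a * b])

# Toy data: "relevant" results sit close to their intent embedding,
# "irrelevant" ones are independent random vectors.
rng = np.random.default_rng(0)
dim = 8
intents = rng.normal(size=(40, dim))
relevant = intents + rng.normal(scale=0.1, size=(40, dim))
irrelevant = rng.normal(size=(40, dim))

X = np.vstack(
    [similarity_features(i, r) for i, r in zip(intents, relevant)]
    + [similarity_features(i, r) for i, r in zip(intents, irrelevant)]
)
y = np.array([1] * 40 + [0] * 40)  # 1 = relevant, 0 = irrelevant

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Each pair yields three scalar distances plus two element-wise vectors, so the feature dimensionality is 3 + 2 × (embedding dimension).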
The authors evaluated several embedding dimensions (512, 1024, 1536) and input feature sets, ultimately choosing configurations that maximized balanced accuracy, precision, recall, and F1 score. For example, GitHub repositories achieved the highest F1 (0.63) with text‑embedding‑3‑large, XGBoost, and L1 distance; Stack Overflow reached F1 0.76 using text‑embedding‑3‑small, GaussianNB, and element‑wise product vectors; Google Search performed best with text‑embedding‑3‑large, GaussianNB, and element‑wise absolute differences (F1 0.79). An LLM‑only baseline (GPT‑4o) was tested but proved slower, more expensive, and less accurate, so it was not adopted.
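The model-selection protocol (50/50 split plus GridSearchCV) might look roughly like the sketch below. The toy data and parameter grids are assumptions, and XGBoost is omitted to keep the example dependency-free; the paper's actual grids and features differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Toy features standing in for the embedding-derived similarity features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

# 50/50 train-test split, mirroring the protocol described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

candidates = {
    "GaussianNB": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7]}),
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "LinearSVC": (LinearSVC(), {"C": [0.1, 1.0]}),
    "Ridge": (RidgeClassifier(), {"alpha": [0.5, 1.0]}),
}

# Tune each candidate on the training half, then keep the one with the
# best F1 on the held-out half.
best_name, best_f1 = None, -1.0
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="f1", cv=3).fit(X_tr, y_tr)
    f1 = f1_score(y_te, search.predict(X_te))
    if f1 > best_f1:
        best_name, best_f1 = name, f1
print(best_name, round(best_f1, 2))
```

In GLiSE this selection is run once per source-embedding combination, which is why different platforms end up with different winning classifiers.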

A usability study with five participants (four software engineers, one researcher) compared manual grey‑literature search against GLiSE‑assisted search across two comparable research tasks. Objective metrics showed substantial gains: Time‑to‑First‑Relevant dropped from 158 s (manual) to 96 s (GLiSE), and total screening time to collect ten relevant items fell from 20 minutes to 2.5 minutes. The System Usability Scale (SUS) score for GLiSE was 81 ± 7.6, exceeding the standard acceptability threshold of 68 and entering the “excellent” range. Perceived usefulness and intention to reuse both averaged 6 out of 7. Qualitative feedback praised the integration of multiple sources into a single workflow and suggested a cleaner, less dense interface.
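For reference, the SUS score cited above follows the standard scoring rule for the System Usability Scale (ten 1-5 Likert items); a small helper makes the 0-100 scale and the 68 acceptability threshold concrete:

```python
def sus_score(responses):
    """System Usability Scale score from ten 1-5 Likert responses.
    Odd-numbered (positively worded) items contribute response - 1;
    even-numbered (negatively worded) items contribute 5 - response.
    The sum of contributions (0-40) is scaled by 2.5 onto 0-100."""
    assert len(responses) == 10
    total = sum(
        (r - 1) if i % 2 == 1 else (5 - r)
        for i, r in enumerate(responses, start=1)
    )
    return total * 2.5

# Neutral answers (all 3s) land exactly in the middle of the scale.
print(sus_score([3] * 10))  # 50.0
# A score above 68 is conventionally read as above-average usability,
# which puts GLiSE's reported 81 well into the favourable range.
```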

The paper’s contributions are threefold: (i) the GLiSE tool itself, providing a reproducible, configurable pipeline for SE‑specific grey‑literature extraction; (ii) a publicly released, manually labeled dataset of 1,137 search results linked to their originating intents; and (iii) an empirical usability evaluation demonstrating that GLiSE markedly reduces effort while maintaining high relevance.

Limitations include the current focus on only three sources (GitHub, Stack Overflow, Google) and a relatively modest labeled dataset, which may affect generalizability to other domains (e.g., Reddit, internal wikis). Future work is planned to extend source coverage, incorporate active learning for continuous model improvement, and explore user‑customizable feedback loops for on‑the‑fly model refinement.

In summary, GLiSE offers a practical, open‑source solution that automates grey‑literature retrieval in software engineering, combines LLM‑driven query synthesis with embedding‑based relevance filtering, and delivers measurable efficiency and usability benefits over traditional manual approaches.

