ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Systematic reviews and mapping studies are critical to synthesize research, identify gaps, and guide future work, but are often labor-intensive and time-consuming. Existing tools provide partial support for specific steps, leaving much of the process manual and error-prone. We present ProfOlaf, a semi-automated tool designed to streamline systematic reviews while maintaining methodological rigor. ProfOlaf supports iterative snowballing for article collection with human-in-the-loop filtering and uses large language models to help select articles, extract key topics, and answer queries about the content of articles. By combining automation with guided manual effort, ProfOlaf enhances the efficiency, quality, and reproducibility of systematic reviews across research fields. ProfOlaf can be used both as a CLI tool and as a web application. A video demonstrating ProfOlaf is available at: https://youtu.be/R-gY4dJlN3s


💡 Research Summary

Systematic literature reviews (SLRs) and mapping studies are indispensable for synthesizing research, identifying gaps, and guiding future work, yet they remain labor‑intensive and prone to human error. Existing tools only support isolated steps such as reference management or database search, leaving the bulk of the workflow manual. In response, the authors present ProfOlaf, a semi‑automated, open‑source platform that aims to streamline the entire SLR pipeline while preserving methodological rigor.

ProfOlaf’s workflow is divided into two major phases: collection and analysis. In the collection phase, users start by providing a seed list of article titles (e.g., from a prior review). The tool creates a local database and then performs iterative snowballing: for each article it retrieves forward citations, backward references, or both, pulling bibliographic metadata from Google Scholar, Semantic Scholar, and DBLP. The system automatically detects duplicate or variant records and lets the user decide which version to keep. After gathering the raw set, ProfOlaf applies metadata screening based on optional criteria such as venue ranking, publication year, and language. Venue ranking is assisted by a cosine‑similarity search against existing rankings (SCImago, CORE), dramatically reducing the manual effort required to assign quality scores to venues.
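The iterative snowballing loop described above can be sketched in a few lines. This is a minimal illustration, not ProfOlaf's actual implementation: `get_neighbors` is a hypothetical stand-in for the metadata lookups against Google Scholar, Semantic Scholar, or DBLP, and duplicate detection is reduced to case-insensitive title matching (the real tool handles variant records interactively).

```python
from typing import Callable

def snowball(seeds: list[str],
             get_neighbors: Callable[[str], list[str]],
             max_iterations: int = 7) -> list[str]:
    """Iteratively expand a seed set of article titles.

    `get_neighbors` stands in for a lookup returning the forward
    citations and/or backward references of one article.
    """
    seen = {title.lower() for title in seeds}   # crude duplicate detection
    collected = list(seeds)
    frontier = list(seeds)
    for _ in range(max_iterations):
        next_frontier = []
        for title in frontier:
            for candidate in get_neighbors(title):
                key = candidate.lower()
                if key not in seen:             # skip already-seen records
                    seen.add(key)
                    collected.append(candidate)
                    next_frontier.append(candidate)
        if not next_frontier:                   # fixed point: nothing new found
            break
        frontier = next_frontier
    return collected

# Toy citation graph standing in for the scholarly databases.
graph = {
    "paper a": ["Paper B", "Paper C"],
    "paper b": ["Paper C", "Paper D"],
}
result = snowball(["Paper A"], lambda t: graph.get(t.lower(), []))
# result == ["Paper A", "Paper B", "Paper C", "Paper D"]
```

The loop terminates either after a fixed iteration budget or once an iteration contributes no new articles, mirroring the point at which a manual snowballing process would naturally stop.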

The next step is article screening, which follows the progressive approach advocated by Wohlin: title → abstract → full‑text. At each level, multiple reviewers can record inclusion/exclusion decisions; ProfOlaf visualizes disagreements and facilitates consensus discussions. An LLM (configured in the paper as GPT‑5.2) can act as an auxiliary rater, providing a second opinion based on concise inclusion/exclusion prompts. The authors evaluate this auxiliary rater against two human raters and a third independent human, finding that the LLM’s precision slightly exceeds that of humans while its recall is lower, indicating a more conservative bias.
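The auxiliary-rater idea can be illustrated with a small sketch. Everything here is an assumption for illustration: `screen_article` and `ask_llm` are hypothetical names, the prompt wording is invented, and the stubbed model below merely keyword-matches; ProfOlaf's real prompts and model configuration may differ.

```python
def screen_article(title: str, abstract: str, criteria: str, ask_llm):
    """Ask an LLM for an auxiliary include/exclude vote.

    `ask_llm` is a placeholder for any chat-completion call;
    this is a sketch, not ProfOlaf's actual prompt.
    """
    prompt = (
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer INCLUDE or EXCLUDE, then give a one-line justification."
    )
    reply = ask_llm(prompt).strip()
    return reply.upper().startswith("INCLUDE"), reply

# Stubbed model for demonstration: keyword-matches instead of reasoning.
fake_llm = lambda p: ("INCLUDE: applies machine learning to code"
                      if "machine learning" in p.lower()
                      else "EXCLUDE: off-topic")

verdict, rationale = screen_article(
    "Learning to Repair Programs",
    "We apply machine learning to automated program repair.",
    "Include empirical studies of ML techniques applied to source code.",
    fake_llm,
)
# verdict == True
```

Keeping the LLM's decision separate from the human votes, as ProfOlaf does, lets the tool surface disagreements between raters rather than silently overriding anyone.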

Once the final set of papers is established, the analysis phase begins. ProfOlaf downloads the PDFs, parses their text, and feeds the content to two LLM‑driven modules. The first module, TopicGPT, uses prompt‑based generation to produce human‑readable topic labels and descriptions, then assigns each paper to one or more topics. This approach yields interpretable clusters that are more transparent than traditional bag‑of‑words topic models. In the authors’ case study on machine‑learning‑for‑code, TopicGPT generated 19 topics in its first pass, correctly matching 12 of the 22 ground‑truth topics (≈54%). After refinement, it produced 14 topics with a 45% match rate, showing that low‑frequency topics tend to be dropped. The second module, the Task Assistant, can answer arbitrary queries about a paper, extract key information, or produce concise summaries. Summaries were evaluated on faithfulness, structure, salience, and conciseness, receiving average Likert scores between 4.33 and 4.91 out of 5, indicating high quality with only minor coverage gaps.
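The topic-assignment step can be sketched as follows. This is a simplified stand-in for TopicGPT's assignment prompt, assuming hypothetical names (`assign_topics`, `ask_llm`) and a stubbed model; the real pipeline generates the topic labels themselves via prompting before this assignment pass.

```python
def assign_topics(paper_text: str, topics: dict[str, str], ask_llm) -> list[str]:
    """Assign a paper to one or more pre-generated topic labels.

    `topics` maps each human-readable label to its description;
    `ask_llm` is a placeholder for a chat-completion call.
    """
    topic_list = "\n".join(f"- {name}: {desc}" for name, desc in topics.items())
    prompt = (f"Candidate topics:\n{topic_list}\n\n"
              f"Paper text:\n{paper_text}\n\n"
              "List every matching topic name, comma-separated.")
    reply = ask_llm(prompt)
    # Keep only labels the model actually echoed back.
    return [name for name in topics if name in reply]

topics = {
    "Code Generation": "Models that synthesize source code from specifications",
    "Program Repair": "Automatic localization and fixing of software bugs",
}
stub = lambda p: "Program Repair"   # stand-in for a real model call
assigned = assign_topics("We automatically fix bugs in Java programs.", topics, stub)
# assigned == ["Program Repair"]
```

Because the labels and descriptions are generated in natural language, each assignment is directly inspectable, which is what makes this style of topic modeling more transparent than bag-of-words clusters.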

Quantitatively, the authors conducted a seven‑iteration snowballing process that examined 1,009 candidate articles, ultimately retaining 108 after screening—a final efficiency of 0.11 (included/total examined). Automated screening of 183 titles and 125 full texts achieved accuracy comparable to human reviewers (F1 scores around 0.88–0.93). Topic modeling precision and recall were 0.645 and 0.850 respectively, while programming‑language identification achieved 0.590 precision and 0.710 recall, revealing a tendency of the LLM to over‑assign common languages such as Python.
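The headline numbers above follow from standard definitions; a quick check, using only figures reported in the summary:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Snowballing efficiency: articles retained / articles examined.
efficiency = 108 / 1009            # ~0.107, reported as 0.11

# F1 implied by the reported topic-modeling and language-ID scores.
topic_f1 = f1(0.645, 0.850)        # ~0.733
lang_f1 = f1(0.590, 0.710)         # ~0.644
```

The gap between `topic_f1` and `lang_f1` quantifies the summary's point: language identification suffers more from the LLM's over-assignment of common languages than topic assignment does.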

The discussion emphasizes that while LLMs cannot fully replace human judgment—especially for nuanced tasks like fine‑grained topic assignment—they are valuable as supplemental raters that stimulate discussion, surface borderline cases, and reduce manual workload. The authors acknowledge current limitations: occasional over‑assignment, hallucination in language detection, and reduced recall in screening due to conservative LLM behavior. They propose future work on prompt engineering, domain‑specific model fine‑tuning, and tighter integration of additional scholarly databases to improve both precision and recall.

ProfOlaf is released under an open‑source license on GitHub (https://github.com/sr‑lab/ProfOlaf) and as a Docker container via Zenodo, ensuring reproducibility. It offers both a command‑line interface and a web application, making it accessible to a wide range of researchers. In sum, ProfOlaf represents a pragmatic blend of automation and human‑in‑the‑loop oversight, delivering a more efficient, transparent, and reproducible workflow for systematic literature reviews across disciplines.

