Detecting Malicious Entra OAuth Apps with LLM-Based Permission Risk Scoring
This project presents a unified detection framework that constructs a complete corpus of Microsoft Graph permissions, generates consistent LLM-based risk scores, and integrates them into a real-time detection engine to identify malicious OAuth consent activity.
Research Summary
The paper addresses the growing problem of malicious OAuth applications in Microsoft Entra ID by introducing a unified detection framework that leverages large language models (LLMs) to assign risk scores to Microsoft Graph permissions. The authors first construct a comprehensive corpus of 769 Graph permissions, enriching each entry with metadata such as read/write scope, global versus resource-specific access, and functional category. Using eight open-source LLMs, including GPT-OSS-120B, GPT-OSS-Safeguard-120B, and Qwen-3-235B, they generate a consistent risk score (1 = low, 5 = high) and a natural-language justification for each permission. The resulting dataset, which is publicly released, reveals that broad ".Read.All" and ".Write.All" scopes consistently receive the highest scores, while narrowly scoped permissions like "User.Read" are rated low.
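The corpus structure described above can be sketched as a simple record plus a scoring rule. The class and function names below are hypothetical (the paper does not publish its schema), and the heuristic merely mimics the reported pattern that ".Read.All"/".Write.All" scopes score high while narrow scopes score low; in the actual framework this score comes from an LLM, not a rule:

```python
from dataclasses import dataclass

@dataclass
class PermissionEntry:
    """One row of the Graph-permission corpus with an LLM-assigned risk."""
    name: str            # e.g. "Mail.Read.All"
    access: str          # "read" or "write"
    scope: str           # "global" or "resource-specific"
    category: str        # functional category, e.g. "Mail", "Directory"
    risk_score: int      # 1 (low) to 5 (high)
    justification: str   # natural-language rationale from the model

def heuristic_risk(name: str) -> int:
    """Illustrative stand-in for the LLM scorer, reflecting the paper's
    finding: broad write scopes highest, broad read scopes high,
    narrowly scoped permissions low."""
    if name.endswith(".Write.All"):
        return 5
    if name.endswith(".Read.All"):
        return 4
    return 1
```

In the real pipeline, `heuristic_risk` would be replaced by a prompt to one of the eight models, with the returned score and justification stored in a `PermissionEntry`.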
The detection pipeline consists of five stages. Stage 1 collects application registration events and consent logs from Entra ID. Stage 2 aggregates the LLM-derived scores for all permissions requested by an app, producing an Aggregated Application Risk (Rapp) metric. Stage 3 applies a stateful spike-logic algorithm that monitors short-term fluctuations in Rapp; a rapid increase triggers a spike state stored in an in-memory buffer. Stage 4 sends real-time alerts via Slack webhooks and persists the event, risk scores, and metadata in a SQLite database for auditability. Stage 5 updates the system state and cleans up temporary data.
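Stages 2 and 3 can be illustrated with a minimal sketch. The aggregation here uses a plain mean and the spike rule compares the newest Rapp value against the mean of a short in-memory window; the paper's exact aggregation formula and spike thresholds are not specified, so both are assumptions for illustration:

```python
from collections import deque
import statistics

def aggregate_app_risk(permission_scores):
    """Stage 2 (sketch): collapse per-permission LLM scores into a single
    Rapp value. A mean is assumed here; the paper may weight differently."""
    return sum(permission_scores) / len(permission_scores)

class SpikeDetector:
    """Stage 3 (sketch): stateful spike logic over an in-memory buffer.
    Flags a spike when the new Rapp exceeds the recent mean by `threshold`."""

    def __init__(self, window: int = 10, threshold: float = 1.5):
        self.buffer = deque(maxlen=window)  # short-term Rapp history
        self.threshold = threshold

    def observe(self, r_app: float) -> bool:
        spiked = bool(self.buffer) and (
            r_app - statistics.mean(self.buffer) > self.threshold
        )
        self.buffer.append(r_app)
        return spiked
```

A spike returned by `observe` would then feed Stage 4 (Slack alert plus a SQLite audit record) before Stage 5 clears the temporary state.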
Empirical evaluation shows clear separation between high-risk and low-risk permission sets, with statistical analysis of mean, standard deviation, and distribution across models. The authors also perform n-gram and tri-gram analyses of the LLM reasoning texts to assess consistency and identify model-specific biases. Limitations include dependence on prompt design, the need for periodic corpus updates as new permissions are introduced, and the current focus on static permission risk without modeling inter-permission correlations or user behavior.
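The tri-gram consistency check can be sketched as a word-level n-gram count over the justification texts; the function name and the whitespace tokenization are illustrative assumptions, not the paper's exact method:

```python
from collections import Counter

def top_trigrams(texts, k=3):
    """Count word tri-grams across LLM justification texts; frequently
    shared tri-grams suggest consistent phrasing, while tri-grams unique
    to one model hint at model-specific bias."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:], words[2:]))
    return counts.most_common(k)
```

Running this per model and comparing the resulting rankings is one way to surface the consistency and bias effects the evaluation describes.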
Future work proposes multi-model ensemble scoring, correlation-aware risk adjustment, extension to other identity platforms (Azure AD, Google Workspace), and integration with dynamic threat-intelligence feeds to enable adaptive risk scoring and automated response playbooks. In summary, the study demonstrates that LLM-based permission risk scoring, coupled with a real-time detection engine, provides a practical and explainable method for identifying malicious Entra OAuth applications, filling a critical gap in cloud identity security.