Malicious Or Not: Adding Repository Context to Agent Skill Classification

[PREPRINT]

Florian Holzbauer (1), David Schmidt (2,3), Gabriel Gegenhuber (1), Sebastian Schrittwieser (2,3), and Johanna Ullrich (1)

(1) Interdisciplinary Transformation University (IT:U), Linz, Austria
(2) University of Vienna, Vienna, Austria
(3) Christian Doppler Laboratory AsTra

Abstract. Agent skills extend local AI agents, such as Claude Code or OpenClaw, with additional functionality, and their popularity has led to the emergence of dedicated skill marketplaces, similar to app stores for mobile applications. Simultaneously, automated skill scanners were introduced that analyze the skill description available in SKILL.md to verify benign behavior. The results for individual marketplaces mark up to 46.8% of skills as malicious. In this paper, we present the largest empirical security analysis of the AI agent skill ecosystem, questioning this high rate of malicious classifications. To this end, we collect 238,180 unique skills from three major distribution platforms and GitHub to systematically analyze their type and behavior. This approach substantially reduces the number of skills flagged as non-benign by security scanners to only 0.52% that remain in repositories flagged as malicious. Consequently, our methodology substantially reduces false positives and provides a more robust view of the ecosystem's current risk surface. Beyond that, we extend the security analysis from the mere investigation of the skill description to a comparison of its congruence with the GitHub repository the skill is embedded in, providing additional context. Furthermore, our analysis also uncovers several previously undocumented real-world attack vectors, namely the hijacking of skills hosted on abandoned GitHub repositories.
Keywords: Agent Skills, Security, Skill Scanner, False Positives, Claude Code, Codex, OpenClaw, GitHub, Classification

1 Introduction

Autonomous AI agents, such as Claude Code [4] or OpenClaw [36], extend large language models (LLMs) from standalone text generation systems into truly autonomous, closed-loop assistants that can plan, act, and learn complex tasks. In a nutshell, the LLM interprets user requests, invokes adequate routines to serve them, and eventually integrates the results into subsequent reasoning steps. A key concept are skills: reusable, modular components that extend an agent's capabilities, e.g., with access to an external API, code execution, or data retrieval. Following a standardized and open format [2], skills consist of natural language descriptions, informing the user and the LLM about their capabilities, in combination with executable logic implementing the functionality. Skills are found on dedicated skill marketplaces, e.g., ClawHub [35], Skills.sh [38], and SkillsDirectory [34], or in traditional repositories like GitHub. Agents might even discover them on their own, e.g., when reading on moltbook.com, a Reddit-like platform for bots. On the one hand, the tight integration of LLM reasoning with execution capabilities poses unique risks, as anecdotal evidence of undesired email deletion emphasizes [10]. A recurring pattern, on the other hand, are supply chain risks introduced by integrating external resources from marketplaces or other sources to extend a system's functionality. Examples are machine images in compute clouds [9], Docker Hub for containers [33], mobile applications for iPhones [30], package managers like npm and PyPI [40], and nowadays, it appears, also skills for AI agents. Among others, malicious skills attempted to steal private information from macOS [26] or to redirect cryptocurrency assets [22].
In consequence, skill marketplaces nowadays automatically scan the provided skills for security issues and provide the results to their users for orientation. The total share of malicious skills, however, varies significantly among marketplaces: 46.8% (ClawHub), 23.0% (Skills.sh), and 6.0% (SkillsDirectory). In this paper, we present the largest empirical security study of the AI agent skill ecosystem, collecting and analyzing 238,180 skills from three distribution platforms and GitHub. Our research questions the high risk currently associated with marketplaces and is guided by the following research questions:

RQ1 What skills are shared on marketplaces, and which new attack vectors emerge from the skill ecosystem?
RQ2 How do marketplaces classify skills as malicious?
RQ3 Can repository context improve existing security classification of skills?

In detail, we analyze our skill collection, e.g., regarding included scripts or endpoints, to infer their characteristics. Then, we focus on the skills classified as malicious, either by the marketplaces or by the Cisco skill scanner [11]. We reevaluate these results by integrating the skills' repositories and analyzing them for congruence with the skill's specification. To this end, we apply static analysis to extract security- and privacy-relevant artifacts such as embedded secrets, external endpoints, and dependencies. Finally, we analyze malicious behavior and uncover previously undocumented attack vectors within the AI skill ecosystem. Our overall results show that analyzing agent skills without repository context substantially overestimates the ecosystem's risk. By means of repository-aware analysis, we are able to significantly reduce the rate of false positives while simultaneously uncovering structural weaknesses in skill distribution platforms that allow adversaries to hijack existing skills.
Summarizing, our paper contributes the following aspects:

– Large-scale ecosystem measurement. With 238,180 unique skills, we construct the largest cross-platform dataset of agent skills to date by collecting skills from three official marketplaces as well as GitHub repositories. The dataset does not only facilitate the analysis at hand, but also provides the basis for future longitudinal studies on the AI skill ecosystem.
– Repository-aware skill analysis. Existing security scanners classify a large share of offered skills as malicious. We develop a semantic analyzer that, beyond the skill description, also incorporates the repository context. This approach reduces the number of flagged skill repositories to only 0.52% of the 2,887 scanner-flagged skills that reside in repositories flagged as malicious.
– Discovery of new attack vectors. We identify previously undocumented attack vectors in the AI skill ecosystem. In particular, we demonstrate the hijacking of 7 abandoned repositories referenced by skill indexes, affecting 121 skills. One of them even has more than 1,000 recorded installations, a significant number considering the recency of the investigated ecosystem.

To enable reproducibility and future work, we publish our code: https://anonymous.4open.science/r/agent_skills/.

2 Background and Related Work

Agents and Skills. AI agents autonomously execute tasks in interaction with external services and operate in a reasoning-action loop. The LLM interprets a user request, selects and invokes an adequate capability, and eventually incorporates the results into subsequent reasoning. Many agents support modular extensions, such as API access, code execution, or data retrieval, that are referred to as skills. Skills typically come in a repository, combining a specification file (SKILL.md) with optional scripts, configuration files, or static assets.
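For illustration, a minimal skill following this open format might look as shown below. The skill name, description, and script path are invented for this example; the YAML frontmatter fields `name` and `description` follow the openly specified packaging format, while the body is free-form instructions for the agent.

```markdown
---
name: weather-lookup
description: Fetch the current weather for a city when the user asks about it.
---

# Weather Lookup

When the user asks about the weather, run `scripts/lookup.py <city>`
and summarize the returned temperature for the user.
```

The SKILL.md file sits at the root of the skill directory, typically next to an optional scripts/ folder that holds the executable logic.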
The specification file describes the skill's capabilities and its invocation context in natural language, enabling the autonomous agent to decide when to use it, while the remaining files represent the executable logic. For interoperability, Anthropic recently specified a skill packaging format [25].

Skill Marketplaces. Agent skills are distributed over dedicated marketplaces such as ClawHub [35], Skills.sh [38], and SkillsDirectory [34]. As of March 9, 2026, they provide 18,412, 86,800, and 36,109 skills, respectively. Yet, the platforms differ substantially, leading to different levels of control for the operators: ClawHub curates and reviews uploaded skills, and eventually hosts them itself. Skills.sh adopts an open Git-based distribution model and indexes skills in external repositories. SkillsDirectory also refers to external repositories, but moderates submissions and performs rule-based security scanning.

Security on Marketplaces. Malicious skills are known to manipulate agent behavior, and after reports on their distribution over marketplaces [26, 22, 6, 31], marketplaces nowadays automatically scan the offered skills for security issues. The results, typically a classification of whether a skill is benign or malicious in combination with a short explanation of the reasoning, are provided as metadata to their users. To this end, ClawHub relies on VirusTotal [39] and a custom LLM-based detection system, Skills.sh integrates several third-party scanners, and SkillsDirectory reports the use of more than 50 rule-based detection mechanisms. The share of skills reported as suspicious varies substantially across marketplaces, namely 46.8% (ClawHub), 23% (Skills.sh), and 6% (SkillsDirectory). Across all marketplaces, the share remains high, which indicates either a large number of malicious skills or a high false positive rate.

Empirical Studies on the Skill Ecosystem.
Studies by third parties arrive at a similar variety in the share of malicious skills on marketplaces. An analysis of 3,984 skills on ClawHub and Skills.sh [6] found 13.4% of them to have a critical-level security flaw like malware distribution, prompt injection, or exposed secrets, and 36.82% to show (more minor) security pitfalls like hard-coded API keys or insecure credential handling. Another analysis investigated 31,132 skills [24] from skills.rest and skillsmp.com and found that 26.1% of the skills contain a security vulnerability such as prompt injection, data exfiltration, privilege escalation, or supply chain risks. The largest study so far investigated 40,285 skills from Skills.sh [23]. While predominantly focusing on their publication behavior over time as well as prompt length, the author also assessed their security and concluded that 9% of them have critical flaws. In this context, multiple skill scanners also emerged, e.g., SkillScan [24], SkillFortify [8], Snyk [7], and the Cisco skill scanner [11]. Multiple works promise annotated datasets of skills for security benchmarking [8, 24, 3]. However, upon closer inspection, we were unable to find them, impeding a direct comparison of our approach with those from previous work on the same datasets.

Security of AI Agents. Instead of skills, vulnerabilities might also directly affect the agent. The (meanwhile fixed) ClawJacked vulnerability enabled an attacker to gain control over OpenClaw instances using a WebSocket to localhost [22]. Alternatively, attackers could use WebSockets to modify log files that the AI agents eventually rely on for troubleshooting [37]. Via prompt injection, OpenClaw was persuaded to reveal private keys [12]. In the manner of social engineering, moltbook appears to be exploited to extract metadata for reconnaissance.
The programming agen t Claude Co de was meanwhile also vulnerable to remote co de execution [ 13 ], and revealed API keys [ 14 ]. Finally , Sho dan currently discov ers 55,561 of such Op enCla w instances on the Internet [ 32 ], and a honeyp ot provides a glimpse in to the attack ers’ current strategies [ 18 ]. Scientific Liter atur e. Due to the recency of the topic, most rep orts are either published in non-scien tific ven ues suc h as blogs or newsp ost, or as yet, unaccepted preprints. Y et, there is also scien tific literature on autonomous AI agen ts a v ailable [ 1 , 28 , 5 ]. Security is considered b oth, a p otential application of and a ma jor challenge for these systems. On the one hand, autonomous AI allo ws to contin uously monitor for malicious activities, and immediately blo c k them – ev en in the case of just emerging threats – with the p otential to increase security . On the other hand, these systems are considered security threats themselves due to their autonomy in combination with access to large, and p oten tially sensitiv e F rom Capability to V ulnerability 5 data. A dedicated surv ey on security [ 15 ] classified the threat landscape of AI agen ts, also including Agent2T o ol and supply c hain threats. 3 Metho dology W e study the s ecurit y of agent skills using a three-stage measurement pip eline as illustrated in Figure 1 . First, we collect skills from ma jor mark etplaces and public rep ositories to obtain a comprehensiv e view of the ecosystem. Second, w e classify skill b eha vior using existing marketplace scanners, an op en-source securit y scanner, and an LLM-based feature-extraction pip eline, and highlight inconsistencies in their predictions. Finally , we incorp orate rep ository con text in to the analysis to determine whether skills flagged as malicious remain sus- picious when ev aluated within their surrounding co debase, whic h can provide v aluable context for impro ved security ev aluation. 
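The three stages can be outlined roughly as follows. This is a schematic sketch only; the function and class names are ours and not part of the released artifact, and the scanner and scoring callbacks stand in for the concrete components described in the following subsections.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A collected skill; `flags` accumulates scanner verdicts."""
    skill_id: str
    source: str                      # e.g. "clawhub", "skills.sh", "github"
    flags: list = field(default_factory=list)

def collect(sources):
    """Stage 1: gather skills from marketplaces and GitHub (stubbed here)."""
    return [Skill(skill_id=s_id, source=src) for src, s_id in sources]

def classify(skills, scanner):
    """Stage 2: run a scanner over each skill, record its verdict,
    and return the skills flagged as malicious."""
    for skill in skills:
        skill.flags.append(scanner(skill))
    return [s for s in skills if "malicious" in s.flags]

def contextualize(flagged, repo_score, threshold=50):
    """Stage 3: re-evaluate flagged skills with a repository context
    score; only skills in low-scoring repositories remain suspicious."""
    return [s for s in flagged if repo_score(s) < threshold]
```

In this shape, stage 3 acts as a filter on stage 2's output, which mirrors how the repository context is used to prune scanner false positives rather than to flag new skills.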
3.1 Cross-Platform Skill Collection

To study the agent skill ecosystem for RQ1, we collect all skills from three major marketplaces, ClawHub [35], Skills.sh [38], and SkillsDirectory [34], and complement this dataset with a custom search on GitHub archive snapshots [20] to index skills beyond the known marketplaces. Skills on ClawHub target OpenClaw, while those on SkillsDirectory target Claude. Skills from Skills.sh and GitHub are not platform-specific. ClawHub directly hosts skill files, whereas Skills.sh and SkillsDirectory index GitHub repositories. Skills.sh links to specific skill folders, while SkillsDirectory references entire (parent) repositories. We download skills from ClawHub via its API and extract all repositories indexed by Skills.sh and SkillsDirectory.

Skills Beyond Marketplaces. To identify skills outside official marketplaces, we search GitHub repositories for SKILL.md files. The GitHub REST API supports file-based search but limits results to 1,000 entries, preventing ecosystem-scale analysis. We therefore use GHArchive [20], which provides historical GitHub activity snapshots. We analyze daily snapshots from October 2025, as skills were introduced during this month, and apply keyword-based heuristics to repository titles and activity metadata to identify repositories that likely contain skills. This step reduces the candidate set to 546,512 repositories. We shallow-clone the repositories⁴ and scan for SKILL.md files, identifying 25,014 candidates. To avoid duplicate measurements, we deduplicate skills by computing a SHA-256 hash over the complete skill artifact and retaining only unique instances.

Static Skill Analysis. To study skill contents for RQ1, we perform static analysis. For each skill, we enumerate all files and store metadata, including paths, types, and suffixes, in a database to infer properties such as the programming languages used.
We then analyze all files using regular expressions to extract URLs and IP addresses. We include all files, including documentation and configuration files, since agents may interpret artifacts such as README or SKILL.md as instructions. Extracted endpoints reveal potential data sources and sinks. Finally, we detect embedded secrets using a modified version of TruffleHog [19], adapted from prior work on mobile app secret detection [29]. The modifications enable automated validation of detected credentials against remote APIs, allowing us to distinguish inactive strings from valid credentials and measure the prevalence of exposed secrets in the skill ecosystem.

⁴ For scalability, cloning is limited to two minutes and skill directories to 200 MB.

Fig. 1. Overview of our repository-aware skill analysis approach to reduce the high rate of malicious claims. The approach is tested in a three-stage pipeline encompassing cross-platform skill collection, malicious classification, and repository context analysis.
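The deduplication and endpoint-extraction steps above can be sketched as follows. This is a simplified illustration: the directory walk and the regular expressions are our own approximations, not the authors' exact implementation.

```python
import hashlib
import re
from pathlib import Path

# Simplified patterns; a production scanner would be more careful
# about trailing punctuation and non-routable addresses.
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def skill_hash(skill_dir: str) -> str:
    """Deduplicate skills by hashing the complete artifact:
    all file paths and contents in a deterministic order."""
    digest = hashlib.sha256()
    for path in sorted(Path(skill_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(skill_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def extract_endpoints(text: str):
    """Extract candidate URLs and IPv4 addresses from any file,
    including README and SKILL.md, since agents may follow them."""
    return sorted(set(URL_RE.findall(text)) | set(IPV4_RE.findall(text)))
```

Hashing relative paths together with contents ensures that two skills only collapse into one dataset entry if both their layout and their files are identical.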
3.2 Malicious Classification

To study RQ2, we analyze how skills are classified by existing security scanners and develop platform-independent classification methods that can be applied to skills beyond marketplaces.

Platform-based Scanners. We analyze skill classification rates reported by the studied marketplaces. For ClawHub and Skills.sh, we collect the skill-level scanner reports displayed on each platform. ClawHub relies on VirusTotal and OpenClaw-specific checks, while Skills.sh reports results from GenAgent Trust Hub, Socket, and Snyk. We compare these reports with results from the Cisco skill scanner and use them as input for our repository-aware analysis.

Cisco Skill Scanner. To obtain a platform-independent assessment of skill behavior, we rely on the open-source Cisco Skill Scanner [11], which analyzes AI agent skills for security risks such as prompt injection, data exfiltration, or malicious command execution using static and behavioral techniques. The scanner comprises four deterministic modules: Static Analysis, which scans files using YAML parsing and YARA rule matching to detect known malicious patterns [11]; Bytecode Integrity Analysis, which inspects compiled Python .pyc files for tampering or hidden logic; Pipeline Analysis, which identifies dangerous shell command pipelines; and Behavioral Analysis, which performs AST-based dataflow analysis to detect suspicious behavior such as unauthorized data access or external communication.

LLM-based Feature Extraction. To complement the previous static analysis, we conduct an LLM-based behavioral analysis using a local Codex instance (OpenAI GPT-5.3). The model evaluates each skill using a structured questionnaire with 25 predefined features derived from existing skill scanners and expert knowledge obtained during the static analysis of skills.
We make the full feature set and the prompt publicly available in our code repository. The prompt instructs the model to analyze the SKILL.md file located in the skill directory and to assess the skill's behavior based on its executable logic and plausible runtime paths. We evaluate each skill in an isolated session to prevent context carryover between samples. During the analysis, the LLM autonomously generates auxiliary scripts to extract relevant information from the skill contents. The resulting feature vector captures behavioral indicators such as system interaction, network communication, credential exposure, persistence mechanisms, detection evasion, and financially motivated abuse. We encode features either as boolean indicators or bounded numeric values. For network-related features, we additionally record the number of contacted domains and unique IP addresses.

3.3 Repository-Aware Analysis

To answer RQ3, we analyze the repository environments of skills previously flagged as malicious. Since existing scanners typically evaluate skills in isolation, which often leads to false positives, we incorporate repository context to assess whether a skill's functionality aligns with its language description, surrounding codebase, and development history. Our analysis focuses on a subset of flagged skills identified by both the Cisco Skill Scanner [11] and our LLM-based behavioral analysis. Specifically, we consider skills labeled HIGH or CRITICAL by the Cisco scanner and assigned an LLM risk score above three on a five-point scale. Platforms hosting standalone skills without repository context are excluded. For each selected skill, we collect repository metadata and code artifacts and derive two complementary signals capturing repository context: a codebase score and a metadata score.

Codebase Alignment.
The codebase score measures how well a skill's described functionality aligns with the surrounding repository. Using LLM-based analysis, we evaluate whether the repository domain matches the purpose of the skill, whether the source code implements functionality consistent with the skill description, and whether the repository documentation supports the intended use of the skill. Each aspect is rated as low, medium, or high alignment and combined into a single score.

Metadata Analysis. The metadata score estimates the credibility and maturity of the repository hosting a skill. Legitimate skills are typically embedded in established and actively maintained projects, whereas malicious or misleading skills often appear in newly created or low-credibility repositories. To estimate maturity, we analyze repository-level signals including age, development activity, popularity indicators (e.g., stars, forks, issues), and overall repository size.

Repository Context Score. We compute the final repository context score by combining the two signals with greater weight on the codebase score. Since semantic alignment between a skill and its repository is a stronger indicator of legitimacy than repository maturity alone, the codebase score contributes 70% and the metadata score 30%. We additionally apply a penalty if a repository exhibits suspicious characteristics despite apparent alignment. Some skills appear in multiple repositories. To avoid bias, we compute the final skill-level score as the average repository context score across the three most relevant repositories. Limiting aggregation to three repositories ensures representative contexts while preventing numerous low-quality mirrors or forks from dominating the result.
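The combination described above can be sketched as follows. The 70/30 weighting and the top-three averaging follow the text; the 0-100 score range, the penalty value, and the assumption that repositories arrive pre-sorted by relevance are illustrative choices of ours.

```python
def repository_context_score(codebase: float, metadata: float,
                             suspicious: bool = False,
                             penalty: float = 15.0) -> float:
    """Combine per-repository signals (each assumed on a 0-100 scale):
    70% codebase alignment, 30% metadata maturity, minus an optional
    penalty for suspicious characteristics; clamped to [0, 100]."""
    score = 0.7 * codebase + 0.3 * metadata
    if suspicious:
        score -= penalty
    return max(0.0, min(100.0, score))

def skill_level_score(repos: list) -> float:
    """Average the context score over at most the three most relevant
    repositories (assumed pre-sorted by relevance) hosting the skill."""
    top = [repository_context_score(**r) for r in repos[:3]]
    return sum(top) / len(top)
```

Capping the aggregation at three repositories means that a flood of low-quality mirrors cannot drag the skill-level score down (or up), matching the bias argument above.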
This repository-aware scoring helps distinguish skills that are legitimate components of larger projects from those that remain suspicious in context. The score is intended to complement existing skill scanners and support users in assessing whether to trust a given skill.

4 RQ1: Cross-platform Skill Analysis

We first present results on the differences between the marketplaces and the additional skills that we discovered on GitHub. We then provide an overview of the content of the published skills and demonstrate how attackers can hijack skills referenced in two marketplaces.

Collected Skill Dataset. Table 1 summarizes the skills that we crawled from the different marketplaces and the subset that we successfully downloaded and analyzed. In total, we indexed 16,755 skills from ClawHub, 157,191 skills from Skills.sh, 32,896 skills from SkillsDirectory, and 142,822 skills referenced on GitHub. During collection, we encountered several issues that limited the number of retrievable skills. For SkillsDirectory, a large repository subdirectory that effectively represents a separate skill collection was omitted, resulting in 13,304 skills not being retrieved. In addition, changes in repository structures caused skill paths referenced by the marketplaces to become invalid. Specifically, 21,800 skills referenced by marketplace indexes were no longer present in the repositories, and another 3,991 repositories no longer exist for skills referenced by Skills.sh. Furthermore, 188 repositories had become private or required authentication at the time of collection. These issues highlight the dynamic nature of the ecosystem and the challenges of reliably archiving skill datasets.

Table 1. Overview of agent skills distributed by ClawHub [35], Skills.sh [38], SkillsDirectory [34], and skills indexed on GitHub. The analyzed skills denote the number of successfully downloaded skills from the platforms. The unique skills denote the number of skills that we added from the marketplaces to our dataset after deduplication.

| Metric              | ClawHub | Skills.sh | SkillsDir. | GitHub  |
|---------------------|---------|-----------|------------|---------|
| Skills              | 16,755  | 157,191*  | 32,896     | 142,822 |
| Analyzed Skills     | 16,755  | 125,928   | 17,611     | 136,095 |
| Unique Added Skills | 16,755  | 125,919   | 3,923      | 91,583  |
| Skill Owners        | n/a     | 7,950     | 709        | 14,197  |
| Repositories        | n/a     | 9,431     | 766        | 16,413  |

Total: 238,180 (sum of cross-marketplace unique counts)
* We downloaded 55,366 skills listed on Skills.sh. The repositories referenced by the marketplace contained an additional 77,456 skills, which we also collected.

Fig. 2. All investigated platforms (ClawHub.ai, GHArchive.org, Skills.sh, SkillsDirectory.com) show an increase in the number of weekly added agent skills over time.

Fig. 3. Overlap of skills published to the marketplaces.

Figure 2 further shows the number of skills added to these platforms each week since November 2025. In total, we successfully downloaded and analyzed 16,755 skills from ClawHub, 125,928 from Skills.sh, 17,611 from SkillsDirectory, and 136,095 from GitHub. In Figure 3, we further show the overlap of skills across the platforms. The figure illustrates that a substantial fraction of skills appears on multiple marketplaces, indicating that many platforms reference the same underlying repositories. In particular, GitHub acts as the primary hosting platform for most skills, while the marketplaces serve as indexing and discovery layers. At the same time, each marketplace contributes a number of unique skills not listed elsewhere, demonstrating differences in indexing scope and update frequency.
Overall, aggregating multiple marketplaces increases coverage of the skill ecosystem and results in a final dataset of 238,180 skills available for further analysis.

The marketplaces also differ substantially in their ecosystem structure. Among the marketplaces, the largest number of skills and repositories is referenced by Skills.sh, with 7,950 owners and 9,431 repositories, while the GitHub dataset spans 14,197 owners and 16,413 repositories. SkillsDirectory is comparatively smaller with 709 owners and 766 repositories. These numbers indicate that many repositories host multiple skills, suggesting that developers frequently group related skills within a single project rather than publishing them individually.

Skill Content. We further analyzed which scripts skills contain and provide a table in our artifact.⁵ Across all marketplaces, Python scripts appear most frequently, followed by shell scripts, JavaScript, and TypeScript. However, the share of skills that include at least one script differs across marketplaces. While Skills.sh, SkillsDirectory, and GitHub have a similar range of skills containing scripts (11.8% to 15.7%), ClawHub shows a substantially higher share of skills including at least one script (44.1%). One explanation for this difference could be that ClawHub more specifically targets OpenClaw instead of general agents. We further evaluate whether scripts reside in a directory named scripts/, as defined by the specification [25]. Again, ClawHub represents an outlier. Among skills from ClawHub that include scripts, 13.2% lack a scripts/ directory. In contrast, only 2.9% to 3.4% of skills from the other marketplaces contain scripts without the corresponding directory.

Endpoint Classification. Additionally, we analyzed where the endpoints included in skills are located at the continent level; the complete table is provided in our artifact.⁶
For this purpose, we resolved the effective top-level domain plus one (eTLD+1) of each statically detected endpoint using Google's DNS server 8.8.8.8 from within the European Union. Across all marketplaces, we attribute most endpoints to North America, followed by countries within Europe, which is particularly relevant due to the GDPR [17], and Asia. All other regions contribute only a minor share. Interestingly, the share of skills that contain endpoints varies across marketplaces. For example, 54.3% of skills on ClawHub contain at least one eTLD+1 located in North America, while only 34% of skills on SkillsDirectory and GitHub contain such an eTLD+1.

We further analyzed whether skills use tracking endpoints listed in the Exodus tracking list.⁷ We identify such endpoints in 4,963 skills. Specifically, 503 skills on ClawHub include such endpoints (3%), 2,889 on Skills.sh (2.29%), 364 on SkillsDirectory (2.07%), and 2,267 skills collected from GitHub (1.67%). However, most findings originate from google.com and facebook.com (95.63%), while only 217 skills include additional tracking domains.

⁵ https://anonymous.4open.science/r/agent_skills/tables/skill_content.pdf
⁶ https://anonymous.4open.science/r/agent_skills/tables/locations.pdf
⁷ https://reports.exodus-privacy.eu.org/en/trackers/

Secrets. We further analyzed whether skills contain valid tokens and credentials. In total, we discovered 12 functional credentials, including four for the NVIDIA API, three for ElevenLabs, two Gemini tokens, two MongoDB credentials, and one credential each for Amplify, Postgres, and xAI. Attackers could abuse these credentials to access third-party services and perform actions on behalf of the credential owner. For example, the NVIDIA, ElevenLabs, Gemini, and xAI tokens allow access to AI services, which attackers could use to issue requests that incur costs for the owner.
One reason for the relatively small number of discovered secrets may be that most skills are hosted on GitHub. In contrast to mobile apps [29] or accessible storage buckets [16], developers are likely more aware that the code is publicly visible and that attackers could access any embedded secret.

Skill Provisioning. For the security of distributed skills, similar considerations as for dependency management systems apply. Attackers can hijack dependencies if the system does not host the dependency itself and the URL hosting it can be taken over, for example because a username on GitHub was renamed [29]. In addition, the authentication mechanism used to publish a skill plays an important role, as weak or missing authentication can enable attackers to hijack existing dependencies or skills. We therefore looked into the currently implemented authentication mechanisms for publishing skills and their distribution. For authentication, ClawHub and SkillsDirectory rely on GitHub authentication, while Skills.sh provides no authentication. Instead, the marketplace adds skills when users download them using the command line tool with telemetry enabled. In this sense, the system resembles Go modules, which also do not implement a separate authentication mechanism but cache dependencies [21]. However, instead of caching or redistributing the skills, Skills.sh directly downloads them from GitHub. This design can enable attackers to hijack existing repositories if the previous owner renames their account and the repository has not yet reached the required download threshold that would cause GitHub to retire the repository name [30]. The same issue also affects SkillsDirectory. Although SkillsDirectory provides the option to download skills from its website, the command line tool currently attempts to download the skill from GitHub. In contrast, ClawHub directly distributes the skill.
ClawHub's direct distribution reduces the dependency on third-party URL management and therefore decreases the risk of repository hijacking.

Skills Vulnerable to Hijacking. To test whether GitHub mitigates repository hijacking for vulnerable skills, we created test accounts using the associated usernames and entered the respective repository names without creating the repositories. This approach keeps existing redirects functional while revealing whether an attacker could recreate the repository under the same name [30]. To prevent attackers from hijacking vulnerable skills, we keep the account names associated with vulnerable skills reserved. We performed this step for all cases that forward to repositories with five or more stars, as the process of registering GitHub accounts cannot be automated.

12 Holzbauer et al.

Overall, we discovered 121 skills that forward to seven vulnerable repositories. Among them, 77 skills indexed by Skills.sh reference five vulnerable repositories, while 44 skills listed on SkillsDirectory reference two additional vulnerable repositories. One hijackable repository referenced by SkillsDirectory has 159 stars, whereas the maximum star count among vulnerable repositories referenced by Skills.sh is 48. Using the download statistics provided by Skills.sh, we further assessed how frequently hijackable skills were downloaded in the past. The median number of downloads is 25, while the most frequently downloaded skill reached 2,032 downloads. We responsibly disclosed this attack vector to the affected platforms and recommended switching to a direct distribution model similar to OpenClaw.

New Ecosystems, Old Issues. Based on the source code of ClawHub [27], we implemented a crawler for skills and security reports. During this process, we discovered that the associated endpoint returns additional owner metadata.
In particular, the API exposes the email address associated with each user's GitHub account. This information is not shown by default on GitHub profiles and is also not visible through the ClawHub website. Therefore, we did not expect the ClawHub API to disclose this data.

Takeaway. Individual marketplaces index only a subset of available skills. Combining ClawHub, Skills.sh, SkillsDirectory, and GitHub results in more than 238K unique skills for analysis. 44.1% of skills published on ClawHub contain additional scripts, while the share of skills including scripts on other marketplaces is substantially smaller, ranging from 11.8% to 15.7%. Skill marketplaces rely on GitHub repositories for distribution, which enables potential hijacking attacks. Overall, we discover 121 skills from Skills.sh and SkillsDirectory that point to seven vulnerable repositories.

5 RQ2: Malicious Classification

To answer RQ2, we study how existing security scanners classify agent skills and evaluate the consistency of their maliciousness assessments. We compare scanner reports from skill marketplaces with the results of our own analysis pipeline, which includes the Cisco Skill Scanner and an LLM-based behavioral classifier. This comparison allows us to quantify how different scanning approaches interpret the same skills and to identify potential overclassification of malicious behavior. Understanding these differences is important because high false positive rates may reduce trust in the ecosystem, confuse end-users, and motivate the need for repository-aware analysis.

Table 2. Comparison of security scanners used in the skill ecosystem. The table contrasts scanners deployed on ClawHub and Skills.sh (highlighted in gray) with Cisco's skill scanner and our LLM-based feature set.
Scanner              Skills Scanned    Pass     Fail    Fail Rate
ClawHub
  VirusTotal             12,213        7,792    4,421    36.20%
  OpenClaw Scanner       14,244        8,271    5,973    41.93%
  GPT 5.3-based          16,424       10,050    6,374    38.80%
  Cisco Skill Scan       16,745       13,941    2,804    16.74%
Skills.sh
  agent-trust-hub        62,163       53,611    8,552    13.76%
  snyk                   46,414       42,843    3,571     7.69%
  socket                 56,695       54,544    2,151     3.79%
  GPT 5.3-based          52,577       38,234   14,343    27.28%
  Cisco Skill Scan       52,577       45,196    7,381    14.04%

Malicious Classification Rates. We compare the malicious classification rates reported by existing marketplace scanners with the results of our own analysis pipeline. Across both platforms, we observe substantial differences between scanners. On ClawHub, the OpenClaw scanner flags up to 41.93% of skills as suspicious, while VirusTotal reports a similar rate of 36.20%. Our GPT-5.3-based approach produces comparable results, classifying 38.8% of skills as malicious. In contrast, the Cisco Skill Scanner reports a significantly lower fail rate of 16.74%. These differences highlight that the perceived security of the skill ecosystem strongly depends on the chosen scanning approach.

The discrepancy between scanners is even more pronounced when comparing the two marketplaces. On ClawHub, fail rates range from 16.7% to 41.9%, suggesting that a substantial fraction of skills may exhibit potentially suspicious behavior. In contrast, scanners deployed on Skills.sh report much lower fail rates, ranging from 3.79% to 13.76%. Our own analysis tools show similar trends: the GPT-5.3-based analysis classifies 27.28% of Skills.sh skills as suspicious, while the Cisco scanner flags 14.04%. These large differences indicate that existing skill scanners produce inconsistent classifications when analyzing skills in isolation. In particular, scanners that rely heavily on behavioral heuristics or language-model reasoning tend to flag substantially larger portions of the ecosystem as suspicious.
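The fail rates in Table 2 follow directly from the scanned and failed counts; a quick sketch of the arithmetic that reproduces two of the reported percentages:

```python
def fail_rate(scanned: int, failed: int) -> float:
    """Share of scanned skills a scanner flags, as a percentage."""
    return round(100 * failed / scanned, 2)

# Values taken from Table 2.
virustotal_clawhub = fail_rate(12_213, 4_421)   # VirusTotal on ClawHub
gpt_skillssh = fail_rate(52_577, 14_343)        # GPT 5.3-based on Skills.sh
```

The same calculation applied to any row of the table recovers the listed fail rate, and the pass column is simply the complement of the fail count.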
Such high fail rates can reduce trust in the ecosystem and suggest that many skills may be misclassified as malicious. This observation indicates that further context is needed to classify the behavior of skills more reliably.

Cross-Scanner Agreement. To evaluate how consistently scanners classify skills as malicious, we compare the results of five scanners on the subset of 27,111 Skills.sh skills analyzed by all tools. Figure 4 shows the conditional overlap between scanners, expressed as the probability that a skill flagged by scanner A is also flagged by scanner B. Overall, agreement between scanners is low and often asymmetric.

Fig. 4. Conditional scanner agreement on Skills.sh common skills, shown as P(B flags | A flags); rows list scanner A, columns scanner B:

                       GPT5.3   Cisco    ATH     Snyk    Socket
GPT 5.3-based          100.0%   10.2%   19.7%   19.4%     6.5%
Cisco Skill Scanner     33.0%  100.0%   23.2%   24.6%     7.5%
Agent Trust Hub         28.7%   10.4%  100.0%   21.4%     8.2%
Snyk                    49.0%   19.2%   37.1%  100.0%    14.8%
Socket                  44.6%   15.8%   38.7%   40.2%   100.0%

Fig. 5. Number of Skills.sh common skills flagged by exactly k ∈ {1, ..., 5} scanners: 6,032 (71.8%), 1,629 (19.4%), 540, 168, and 33 skills.

For example, 33% of skills flagged by the Cisco Skill Scanner

Table 3.
are also flagged by the GPT-5.3-based analysis, whereas only 10.2% of GPT-5.3 detections overlap with Cisco. Similar patterns appear across other scanner pairs, indicating that scanners frequently identify different sets of skills as suspicious. Figure 5 illustrates the distribution of detections across scanners. Among the 8,402 skills flagged by at least one scanner, the majority are flagged by only a single scanner (6,032 skills), while 1,629 are flagged by two scanners and 540 by three scanners. Only 168 skills are flagged by four scanners, and just 33 skills are flagged by all five scanners. Overall, the limited overlap shows that scanner consensus is rare and that most detections are not corroborated by other tools.

Table 3. Skills.sh malicious detection rates for skills and repositories (a skill is flagged if at least one scanner flags it; a repository is malicious if at least one of its skills is malicious).

Category                         Total    Flagged   Malicious Rate
Skills (overall)                 62,219   12,004    19.29%
Skills (repo stars > 1000)        7,725    1,656    21.44%
Skills (installs > 1000)            755      122    16.16%
Repositories (overall)            8,451    3,878    45.89%
Repositories (stars > 1000)         528      268    50.76%
Repositories (installs > 1000)      171      105    61.40%

Repository-level Classification Rates. To better understand how scanner results translate from individual skills to repositories, we aggregate skill-level detections at the repository level. Table 3 shows the resulting malicious classification rates for both skills and repositories on Skills.sh. A skill is considered malicious if at least one scanner flags it, while a repository is classified as malicious if any of its contained skills is flagged. Overall, 19.29% of skills are flagged as malicious. When focusing on popular skills, the rates remain comparable: 21.44% for skills hosted in repositories with more than 1,000 stars and 16.16% for skills with more than 1,000 installs.
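Both the conditional co-flag rates of Figure 4 and the repository-level aggregation behind Table 3 reduce to set operations over per-scanner flag sets. A minimal sketch with invented flag data (the scanner names come from the paper, the skill and repository identifiers do not):

```python
# Hypothetical per-scanner sets of flagged skill identifiers.
flags = {
    "cisco": {"s1", "s2", "s3"},
    "gpt53": {"s1", "s9"},
}

def co_flag_rate(flags, a, b):
    """P(B flags a skill | A flags it): |A ∩ B| / |A|."""
    a_set, b_set = flags[a], flags[b]
    return len(a_set & b_set) / len(a_set) if a_set else 0.0

def flagged_repos(repo_to_skills, flags):
    """Repositories counting as malicious: at least one contained skill
    is flagged by at least one scanner (the aggregation rule of Table 3)."""
    any_flag = set().union(*flags.values())
    return {repo for repo, skills in repo_to_skills.items()
            if any_flag & set(skills)}

repos = {"r1": ["s1", "s4"], "r2": ["s5"], "r3": ["s9", "s7"]}
```

Note how the rate is asymmetric by construction, mirroring the 33% vs. 10.2% pair in the text. The any-flag rule also explains the amplification: if each of n skills were independently flagged with probability p, the repository would be flagged with probability 1 - (1 - p)^n, e.g. roughly 0.66 for a five-skill repository at the observed p = 0.193.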
How ever, the picture c hanges when aggregating these detections at the rep ository lev el. Nearly half of all rep ositories (45.89%) contain at least one flagged skill. This prop ortion increases further for p opular rep ositories, reaching 50.76% for rep ositories w ith more than 1,000 stars and 61.40% for rep ositories asso ciated with highly installed skills. These results suggest that even well-kno wn rep ositories are frequently classified as malicious when applying strict aggregation rules. The effect is largely driven by rep osito- ries containing multiple skills: as the num b er of skills p er rep ository increases, the probabilit y that at least one skill is flagged also increases. Consequen tly , rep ository-lev el aggregation can substantially amplify skill-level detections and ma y ov erstate the prev alence of malicious rep ositories in the ecosystem. T ake away Malicious classification rates v ary widely across scanners, ranging from 3.8% to 41.9%. Scanner consensus is extremely limited: only 33 out of 27,111 skills (0.12%) are flagged as malicious b y all five scanners. Aggregating skill-lev el flags on Skills.sh to rep ositories substantially inflates malicious classifications. While only 19.3% of skills are flagged, 45.9% of rep ositories con tain at least one flagged skill, rising to ov er 50% for rep ositories with more than 1,000 stars. 6 R Q3: Rep ository-A w are Analysis T o answer R Q3 , we analyze whether skills flagged as malicious b y automated scanners remain suspicious when ev aluated within the context of their surround- ing GitHub rep ositories. Existing scanners t ypically analyze skills in isolation, whic h can lead to false p ositives when b enign functionalit y app ears suspicious without considering the broader con text. 
Our analysis shows that repository context plays an important role, as many flagged skills reside in repositories whose documentation and codebase align with the skill functionality and do not exhibit malicious behavior. We focus on skills that are flagged as malicious by both the Cisco Skill Scanner (severity high or critical) and our GPT-5.3-based analyzer with a score greater than 3, resulting in 8,153 flagged (skill, repository) combinations. We exclude two groups from our analysis. First, we exclude ClawHub skills, as no GitHub repository context exists. Second, we exclude skills located in the repository root, as this setup only allows metadata-based analysis but not codebase scoring. Among the skills, we find only ~100 hosted as standalone GitHub repositories. From the remaining set of skill–repository pairs, we randomly sample 3,000 combinations and collect the corresponding repository metadata together with full repository clones (cloning failed in 113 cases).

Metadata Score. To understand the repository context of flagged skills, we analyze repository metadata such as size, age, activity, popularity (stars and forks), and issue activity using predefined bucket ranges. Figure 6 shows the distribution of these metadata scores for the 1,431 repositories containing flagged skills.

Fig. 6. Metadata scores for unique repositories across the dimensions repo size, repo age, activity, stars, forks, and open issues (buckets: zero, low, medium, high, very high).

Fig. 7. Metadata scores of repositories on different marketplaces (skills.sh, skillsdirectory, gharchive).

Overall, these repositories tend to be small and have limited popularity. Almost half of the repositories (47.6%) are smaller than 2 MB, while only 11.1% exceed 50 MB.
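The bucketed metadata scoring can be sketched as follows. The bucket boundaries below are our own illustrative choices rather than the paper's predefined ranges; only the 2 MB and 50 MB size cut-offs are taken from the text:

```python
# Map each raw metadata signal to an ordinal bucket
# (0 = zero, 1 = low, 2 = medium, 3 = high, 4 = very high),
# then average the buckets into a 0-100 metadata score.
BUCKETS = {
    "size_mb":     [0, 2, 10, 50],       # 2 MB / 50 MB cut-offs per the text
    "age_days":    [14, 120, 365, 730],  # illustrative
    "stars":       [0, 10, 50, 100],     # illustrative
    "forks":       [0, 5, 25, 100],      # illustrative
    "open_issues": [0, 5, 20, 50],       # illustrative
}

def bucket(value, bounds):
    """Number of bucket boundaries the value exceeds."""
    return sum(value > b for b in bounds)

def metadata_score(repo):
    levels = [bucket(repo[k], BUCKETS[k]) for k in BUCKETS]
    max_level = len(next(iter(BUCKETS.values())))  # 4 buckets above zero
    return 100 * sum(levels) / (max_level * len(levels))

tiny = {"size_mb": 1, "age_days": 30, "stars": 0,
        "forks": 0, "open_issues": 0}
mature = {"size_mb": 60, "age_days": 800, "stars": 150,
          "forks": 120, "open_issues": 60}
```

A young, unstarred repository thus lands near the bottom of the scale, which matches the compressed metadata-score distribution reported later for the mostly recent agent-skill repositories.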
Popularity signals are generally low: 43.4% of repositories have no stars, 66.5% have no forks, and none of the repositories exceed 100 stars or forks. Similarly, 64.0% of repositories have no open issues. Most repositories are relatively young and moderately maintained. The majority (73.0%) were created between two weeks and four months before the crawl, while 17.1% are older than one year. Despite their limited popularity, many repositories remain active: 47.4% were updated within the last week. To evaluate whether repositories with flagged skills differ from typical skill repositories, we repeat the analysis using a random set of 1,500 repositories with a matching marketplace distribution. The resulting distributions are similar, suggesting that repositories containing flagged skills do not exhibit distinct metadata characteristics. Instead, the observed differences are driven by marketplace composition rather than the presence of suspicious skills. Figure 7 illustrates these differences across marketplaces. Repositories originating from SkillsDirectory generally achieve higher metadata scores, reflecting more established projects, whereas Skills.sh and GitHub Archive repositories show lower scores, which correlates with the lack of moderation of the hosted skills.

Codebase Score. To evaluate whether a flagged skill aligns with its surrounding repository context, we design an evidence-based prompt that assesses the relationship between the skill and the repository codebase. The evaluation considers domain alignment, code similarity, README consistency, repository support signals, and a maliciousness-adjusted score. To control analysis cost, we limit the repository context to up to 200 lines from the SKILL.md and README files and up to three repository files (100 lines each). Most repositories provide sufficient context for evaluation.
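The cost-capped context construction can be sketched as a simple truncation step before prompting. The section headers and the file-selection order below are our own assumptions; only the 200-line document limit, the 100-line file limit, and the three-file cap come from the text:

```python
def truncate(text: str, max_lines: int) -> str:
    """Keep at most max_lines lines of a file."""
    return "\n".join(text.splitlines()[:max_lines])

def build_context(skill_md, readme, repo_files,
                  doc_lines=200, file_lines=100, max_files=3):
    """Assemble the bounded repository context fed to the LLM judge."""
    parts = ["## SKILL.md\n" + truncate(skill_md, doc_lines)]
    if readme:
        parts.append("## README\n" + truncate(readme, doc_lines))
    # Keep at most three repository files, each capped at 100 lines.
    for path, content in list(repo_files.items())[:max_files]:
        parts.append(f"## {path}\n" + truncate(content, file_lines))
    return "\n\n".join(parts)

# Oversized inputs to exercise the truncation.
ctx = build_context(
    "line\n" * 500,
    "readme",
    {f"f{i}.py": "x = 1\n" * 300 for i in range(5)},
)
```

Bounding the prompt this way is what makes the per-analysis token budget, and hence the cost figures reported below, predictable.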
Among the analyzed repositories, 94.1% contain a README and 65.7% contain repository code, with 61.9% containing both. Repository context frequently aligns with the skill purpose.

Fig. 8. Codebase score category results (low/medium/high shares for domain match, code match, README match, repository maliciousness*, security-tooling signal, and confidence); * means lower is better.

Domain matching is high for 45.9% of skills and medium for 26.1%, indicating that roughly 72% of repositories exhibit at least moderate thematic alignment with the skill description. Direct code alignment is less common, with 9.9% showing high similarity and 28.6% medium similarity. Importantly, very few repositories exhibit malicious repository-level behavior: 98.0% of repositories fall into the lowest maliciousness category, with only two repositories showing high maliciousness signals. Security-tooling related signals appear in a small subset of repositories (5.2%). These results suggest that the majority of flagged skills are embedded in repositories whose documentation and structure are broadly consistent with the skill purpose, providing strong contextual evidence that many scanner alerts represent false positives.

Economic Feasibility. We evaluate the cost of repository-context analysis using a sample of 20 skill–repository pairs. On average, each analysis consumes 6,725 input tokens and 125 output tokens. Using the current GPT-5 pricing model ($1.25 per million input tokens and $10 per million output tokens), a single analysis costs approximately $0.0097 without caching. With prompt caching enabled, the cost drops to roughly $0.0021 per analysis. Applying this approach to the full sample of 3,000 skills resulted in a total cost of approximately $24, including occasional retries.
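The per-analysis cost follows directly from the quoted token counts and prices; a sketch of the arithmetic (token prices as stated in the text):

```python
IN_PRICE = 1.25 / 1_000_000   # USD per input token (GPT-5 pricing per the text)
OUT_PRICE = 10.0 / 1_000_000  # USD per output token

def analysis_cost(in_tokens: int, out_tokens: int) -> float:
    """Uncached cost of one repository-context analysis in USD."""
    return in_tokens * IN_PRICE + out_tokens * OUT_PRICE

cost = analysis_cost(6_725, 125)  # average token counts from the text
```

This yields about $0.0097 per uncached analysis, or roughly $29 for 3,000 analyses as an upper bound; the reported total of about $24 is plausible once prompt caching (about $0.0021 per analysis) covers part of the requests.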
While continuous daily scans of the entire ecosystem would remain expensive, periodic scans (e.g., weekly) are economically feasible for skill marketplaces such as SkillsDirectory.

Repository Context Score. Figure 9 shows the distribution of repositories across the individual scoring components.

Fig. 9. Repository-context score and its weighted subcomponents, with a weighted 70% codebase score and 30% metadata score.

The distributions reflect an important observation: the majority of repositories in which skills reside were flagged as non-malicious. This is particularly visible in the codebase score, which already assigns a baseline score of 40 when a repository is classified as non-malicious. Consequently, the codebase score distribution spans a wide range, with a mean of 65.1 and values extending up to 97, reflecting varying degrees of alignment between the skill implementation and the surrounding repository code. Only a small fraction of repositories achieves near-perfect alignment, with most scores distributed between 40 and 70. In contrast, the metadata score is lower and more compressed (mean 42.9). This is driven by the large number of repositories that were created within the past six months specifically around the emerging topic of agent skills. As the ecosystem matures and more skills become embedded within established and actively maintained codebases, this score component is expected to increase. The current distribution therefore reflects the early stage of the agent skill ecosystem, where repository maturity indicators such as project age and contributor activity are still limited.
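The baseline behavior of the codebase score can be sketched as follows. The alignment components and their point values are hypothetical; only the non-malicious baseline of 40 comes from the text:

```python
def codebase_score(malicious: bool, domain_pts: int,
                   code_pts: int, readme_pts: int) -> int:
    """Hypothetical scoring: a non-malicious verdict alone yields the
    baseline of 40; alignment evidence adds up to 60 further points."""
    if malicious:
        return 0  # assumption: malicious repositories score at the bottom
    return min(100, 40 + domain_pts + code_pts + readme_pts)

# A benign repository with no supporting evidence sits at the baseline.
baseline = codebase_score(False, 0, 0, 0)
```

A floor of this kind explains why the observed distribution starts at 40 and why most scores cluster between 40 and 70: many repositories are benign but provide only partial alignment evidence.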
The repository-context score, combining both signals, results in a smoother distribution (mean 58.5) without significant outliers. The majority of repositories already reach moderate trust levels, primarily driven by the codebase analysis, which evaluates alignment between the skill and the surrounding code and is largely independent of repository age. Overall, this combined scoring approach allows us to contextualize the large number of initially flagged agent skills.

Score Interpretation. The repository-context score combines the codebase score (70%) and the repository metadata score (30%). As shown in Figure 9, the resulting distribution is relatively smooth (mean 58.5) and does not exhibit significant outliers. Most repositories already reach moderate trust levels, primarily driven by the codebase analysis, which evaluates structural alignment between the skill and the surrounding code and is largely independent of repository age. To help interpret this metric, we divide the repository-context score into three categories. The lowest category (< 40) represents skills that appear weakly embedded in their repository context or originate from repositories with low maturity. This group contains 121 cases (4.2%) with an average context score of 36.2. The majority of skills fall into the intermediate category (40 to < 60), representing skills that are connected to a repository but whose surrounding context provides only moderate support. This group includes 1,373 cases (47.6%) with an average context score of 50.8. Finally, 1,393 skills (48.3%) fall into the highest category (≥ 60), indicating strong alignment between the skill and the repository codebase and documentation. In this group, the average codebase score reaches 77.2, suggesting that the repository content strongly supports the intended functionality of the skill.
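The weighting and the three interpretation bands can be written down directly; a sketch using the weights and band boundaries given in the text (the band labels are our paraphrase):

```python
def repository_context_score(codebase: float, metadata: float) -> float:
    """Weighted combination: 70% codebase score, 30% metadata score."""
    return 0.7 * codebase + 0.3 * metadata

def interpret(score: float) -> str:
    if score < 40:
        return "weakly embedded"   # 4.2% of flagged skills
    if score < 60:
        return "moderate support"  # 47.6%
    return "strong alignment"      # 48.3%

# By linearity, the weighted combination of the reported component means
# (65.1 codebase, 42.9 metadata) approximates the reported overall mean.
mean_score = repository_context_score(65.1, 42.9)
```

The combination of the two component means gives 58.44, consistent with the reported overall mean of 58.5 up to rounding.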
Overall, only a small fraction of flagged skills (4.2%) appear poorly embedded in their repository context, while the vast majority show moderate to strong alignment with their surrounding repositories.

Suspicious Repositories. We identify a small subset of repositories where the skill is aligned with the codebase but the repository itself appears suspicious and is not categorized as a security tool. In total, 15 cases fall into this category, corresponding to 0.52% of the 2,887 skill–repository combinations with API/codebase evaluation. While the metadata score differs only slightly (mean 41.1 vs. 42.9), these repositories show substantially lower codebase scores (51.5 vs. 65.2). Consequently, their repository-context score is also lower (48.4 vs. 58.5), indicating that the combined scoring effectively reacts to suspicious environments and separates them from the rest of the ecosystem.

Cross-Repository Scores. To understand whether repository context influences the evaluation of individual skills, we analyze skills that appear in multiple repositories. In total, 354 skills are observed across more than one repository. Across these skills, the median cross-repository variance of the codebase score is 42.0, corresponding to a median standard deviation of 6.48 and a median score range of 15.0 points. This indicates moderate context sensitivity, where the same skill can receive noticeably different scores depending on the surrounding repository. The distribution also exhibits a substantial tail of highly variable skills: the 90th percentile range reaches 33.75 score points. This suggests that repository context can significantly influence the interpretation of a skill's behavior.

Score Validation. Currently, no publicly available dataset exists that labels repositories as malicious with respect to agent skills.
Instead, we follow an inverse validation approach: we evaluate whether repositories containing scanner-flagged skills appear benign when inspected manually and whether our repository-aware scoring reflects these observations. To obtain qualitative validation, we randomly sample 20 repositories from the flagged set and provide them to two independent researchers for manual assessment. Reviewers were asked to determine whether the repository appeared benign or suspicious based on its documentation, code structure, and functionality. Both reviewers classified all 18 reviewed repositories as benign; the two remaining repositories were no longer available. This indicates that the repositories themselves do not exhibit obvious malicious intent. Reviewers also assessed repository maturity. Here, judgments were more heterogeneous: Reviewer 1 classified 5/18 repositories as medium maturity and 1/18 as high, while Reviewer 2 classified maturity as medium for 14/18 and high for 3/18. The two reviewers agreed on maturity in 5 of the 18 comparable cases. In Figure 10, we compare manual scores, computed as the average score of both reviewers, with the scores produced by our analysis.

Fig. 10. Comparison of maturity scores assigned by manual reviewers (low=1, medium=2, high=3) and our repository-aware scoring approach.

The figure reveals a positive correlation, as higher maturity scores assigned by the metadata score coincide with higher scores from human reviewers. Overall, the manual review supports our repository-aware scoring approach: repositories containing scanner-flagged skills generally appear benign, even under manual expert review, while manual maturity assessments match the automatically calculated metadata score.
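The exact-match maturity agreement reported above (5 of 18 cases) is a simple proportion of matching labels; a sketch with invented reviewer labels:

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two reviewers assign the same label."""
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# Hypothetical maturity votes for four repositories.
r1 = ["low", "medium", "low", "high"]
r2 = ["low", "medium", "medium", "high"]
rate = percent_agreement(r1, r2)
```

Applied to the study's 18 comparable cases, the same calculation yields 5/18 ≈ 27.8% exact agreement; chance-corrected statistics such as Cohen's kappa would additionally need the full label distribution, which the text does not report.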
Takeaway. With repository context, only 0.52% of 2,887 scanner-flagged skills remain in repositories flagged as malicious. Repository context helps agent users to better judge skills: only 4.2% of skills connect poorly to their repository context, while the remaining 96% of flagged skills are embedded in repositories whose documentation and codebase align with the skill functionality.

7 Conclusion

In this study, we presented the largest empirical security analysis of the AI agent skill ecosystem. We collected 238,180 unique skills from three marketplaces and GitHub. We characterized the content of these skills by classifying their endpoints and searching for embedded secrets. Further, we compared the results of existing security scanners deployed by two marketplaces and observed substantial differences in their classifications. Malicious classification rates vary widely across scanners, ranging from 3.8% to 41.9%. At the same time, scanner consensus remains extremely limited: only 33 out of 27,111 skills (0.12%) are flagged as malicious by all five scanners.

To improve the current situation, we proposed a repository-aware scanning methodology. By incorporating repository context into the analysis, we substantially reduced the number of false positives produced by existing scanners. When we re-evaluated 2,887 skills flagged by scanners using repository context, only 0.52% remained in repositories flagged as malicious. In most cases, repository information allows users and analysts to better judge the behavior of a skill. Only 4.2% of skills show weak connections between the skill description and the referenced repository. In contrast, 96% of flagged skills are embedded in repositories whose documentation and codebase align with the claimed skill functionality.
In addition to our analysis of skill security, we revealed structural weaknesses in current marketplace ecosystems. In particular, we identified new attack vectors that arise from the way skills reference external repositories. Specifically, abandoned repositories referenced by skill indexes enable adversaries to hijack existing skills, potentially affecting 121 skills hosted on two marketplaces. Furthermore, we showed that ClawHub leaks the email addresses of skill owners, as well as additional GitHub-based account metadata.

References

1. Acharya, D.B., Kuppan, K., Divya, B.: Agentic AI: Autonomous Intelligence for Complex Goals—A Comprehensive Survey. IEEE Access 13, 18912–18936 (2025). https://doi.org/10.1109/ACCESS.2025.3532853
2. Agent Skills: Agent Skills: A Simple, Open Format for Giving Agents New Capabilities. https://agentskills.io/home (2025), accessed: 2026-03-07
3. Anonymous: "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills (2026)
4. Anthropic: GitHub – Claude Code. https://github.com/anthropics/claude-code (2026), accessed: 2026-03-07
5. Basu, M.: OpenClaw AI chatbots are running amok — these scientists are listening in. Nature 650, 533 (2026)
6. Beurer-Kellner, L., Kudrinskii, A., Milanta, M., Nielsen, K.B., Sarkar, H., Tal, L.: Snyk Finds Prompt Injection in 36%, 1467 Malicious Payloads in a ToxicSkills Study of Agent Skills Supply Chain Compromise. https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/, accessed: 2026-02-26
7. Beurer-Kellner, L., Kudrinskii, A., Milanta, M., Nielsen, K.B., Sarkar, H., Tal, L.: Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem. https://github.com/snyk/agent-scan/blob/main/.github/reports/skills-report.pdf
8. Bhardwaj, V.P.: Formal Analysis and Supply Chain Security for Agentic AI Skills. arXiv preprint arXiv:2603.00195 (2026)
9.
Bugiel, S., Nürnberger, S., Pöppelmann, T., Sadeghi, A.R., Schneider, T.: AmazonIA: when elasticity snaps back. In: Proc. of ACM CCS (2011). https://doi.org/10.1145/2046707.2046753
10. Chandonnet, H.: Meta AI alignment director shares her OpenClaw email-deletion nightmare: 'I had to RUN to my MAC mini'. https://www.businessinsider.com/meta-ai-alignment-director-openclaw-email-deletion-2026-2, accessed: 2026-03-08
11. Cisco AI Defense: Skill Scanner: Security Scanner for Agent Skills. https://github.com/cisco-ai-defense/skill-scanner (2026), accessed: 2026-03-07, V2.0.1
12. Cruz, J.: OpenClaw (ex-Moltbot (ex-Clawdbot)): The AI Butler With Its Claws On The Keys To Your Kingdom. https://www.bitsight.com/blog/openclaw-ai-security-risks-exposed-instances, accessed: 2026-03-04
13. CVE-2025-59536: Claude Code's startup trust dialog could lead to Command Execution attack. https://www.cve.org/CVERecord?id=CVE-2025-59536, accessed: 2026-03-08
14. CVE-2026-21852: Claude Code Leaks Data via Malicious Environment Configuration Before Trust Confirmation. https://www.cve.org/CVERecord?id=CVE-2026-21852, accessed: 2026-03-08
15. Deng, Z., Guo, Y., Han, C., Ma, W., Xiong, J., Wen, S., Xiang, Y.: AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways. ACM Comput. Surv. 57(7) (Feb 2025). https://doi.org/10.1145/3716628
16. El Yadmani, S., Gadyatskaya, O., Zhauniarovich, Y.: The File That Contained the Keys Has Been Removed: An Empirical Analysis of Secret Leaks in Cloud Buckets and Responsible Disclosure Outcomes. In: Proc. of the Symposium on S&P (2025). https://doi.org/10.1109/SP61157.2025.00009
17.
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016). http://data.europa.eu/eli/reg/2016/679/oj
18. Fogel, A., Cohen, E.: Caught in the Wild: Real Attack Traffic Targeting Exposed Clawdbot Gateways. https://www.pillar.security/blog/caught-in-the-wild-real-attack-traffic-targeting-exposed-clawdbot-gateways, accessed: 2026-03-07
19. GitHub: TruffleHog. https://github.com/trufflesecurity/trufflehog/
20. Grigorik, I.: GH Archive. https://www.gharchive.org/, accessed: 2026-03-08
21. Gu, Y., Ying, L., Pu, Y., Hu, X., Chai, H., Wang, R., Gao, X., Duan, H.: Investigating Package Related Security Threats in Software Registries. In: Proc. of the Symposium on S&P. IEEE (2023). https://doi.org/10.1109/SP46215.2023.10179332
22. Lakshmanan, R.: ClawJacked Flaw Lets Malicious Sites Hijack Local OpenClaw AI Agents via WebSocket. https://thehackernews.com/2026/02/clawjacked-flaw-lets-malicious-sites.html, accessed: 2026-03-04
23. Ling, G., Zhong, S., Huang, R.: Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. arXiv preprint arXiv:2602.08004 (2026)
24. Liu, Y., Wang, W., Feng, R., Zhang, Y., Xu, G., Deng, G., Li, Y., Zhang, L.: Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale. arXiv preprint arXiv:2601.10338 (2026)
25. Mintlify: Specification – Agent Skills. https://agentskills.io/specification, accessed: 2026-03-04
26. Oliveira, A., Tancia, B., Fiser, D., Lin, P., Reyers, R.: Malicious OpenClaw Skills Used to Distribute Atomic macOS Stealer. https://www.trendmicro.com/en_us/research/26/b/openclaw-skills-used-to-distribute-atomic-macos-stealer.html, accessed: 2026-03-08
27.
op encla w: Skill Directory for Op enCla w. https://github.com/openclaw/clawhub , accessed: 2026-03-07 28. P ati, A.K.: Agen tic AI: A Comprehensive Survey of T ec hnologies, Applications, and So cietal Implications. IEEE A ccess 13 , 151824–151837 (2025). https://doi .org/10.1109/ACCESS.2025.3585609 F rom Capability to V ulnerability 23 29. Sc hmidt, D., Sc hrittwieser, S., W eippl, E.: Leaky Apps: Large-scale Analysis of Secrets Distributed in Android and iOS Apps. In: Pro c. of A CM CCS (2025). https://doi.org/10.1145/3719027.3765033 30. Sc hmidt, D., Schritt wieser, S., W eippl, E.: Supply Chain Insecurity: Exposing V ulnerabilities in iOS Dependency Managemen t Systems (2026), https://arxi v.org/abs/2601.20638 31. Sc hmotz, D., Beurer-Kellner, L., Abdelnabi, S., Andriushchenk o, M.: Skill- Inject: Measuring Agent V ulnerability to Skill File Attac ks. arXiv preprint arXiv:2602.20156 (2026) 32. Sho dan: Sho dan Search Enging – Op enCla w, https://www.shodan.io/search/r eport?query=product:openclaw , accessed: 2026-03-07; Archiv ed at: https://ar chive.ph/3TNgq 33. Sh u, R., Gu, X., Enc k, W.: A Study of Securit y V ulnerabilities on Do ck er Hub. In: Pro c. of the ACM on Conference on Data and Application Security and Priv acy (2017). https://doi.org/10.1145/3029806.3029832 34. Skills Directory: Agen t Skills Directory , https://www.skillsdirectory.com/ , accessed: 2026-03-04 35. Stein b erger, P .: ClawHub, the skill do c k for sharp agents, https://clawhub.ai/ , accessed: 2026-02-26 36. Stein b erger, P .: Op enCla w – Personal AI Assistan t. https://openclaw.ai/ , ac- cessed: 2026-03-07 37. Stein b erger, P .: OpenClaw log p oisoning (indirect prompt injection) via W ebSo ck et headers, https://github.com/openclaw/openclaw/security/advisories/GHSA- g27f- 9qjv- 22pm , accessed: 2026-03-04 38. V ercel Labs: The Agent Skills Directory, https://skills.sh , accessed: 2026-02-26 39. VirusT otal. https://www.virustotal.com , accessed: 2026-03-09 40. 
Zhang, J., Huang, K., Huang, Y., Chen, B., W ang, R., W ang, C., Peng, X.: Killing T wo Birds with One Stone: Malicious Pac k age Detection in NPM and PyPI using a Single Mo del of Malicious Behavior Sequence. ACM T ransactions on Softw are Engineering and Metho dology 34 (4) (Apr 2025). https://doi.org/ 10.1145/3705304 , https://doi.org/10.1145/3705304