An Investigation into the Use of Common Libraries in Android Apps
The packaging model of Android apps requires all code necessary for the execution of an app to be shipped in a single APK file. Thus, an analysis of Android apps often visits code which is not part of the functionality delivered by the app. Such code is often contributed by the common libraries which are used pervasively across apps. Unfortunately, Android analyses, e.g., for piggybacking detection and malware detection, can produce inaccurate results if they do not account for library code, which constitutes noise in app features. Despite some efforts on investigating Android libraries, Android research has not yet produced a complete set of common libraries to further support in-depth analysis of Android apps. In this paper, we leverage a dataset of about 1.5 million apps from Google Play to harvest potential common libraries, including advertisement libraries. After several refinement steps, we collect by far the largest set of 1,113 libraries supporting common functionalities and 240 libraries for advertisement. We use the dataset to investigate several aspects of Android libraries, including their popularity and their proportion in Android app code. Based on these datasets, we have further performed several empirical investigations to confirm the motivations behind our work.
💡 Research Summary
The paper addresses a fundamental obstacle in Android security research: the pervasive presence of common third‑party libraries—especially advertising SDKs—within every APK. Because Android’s packaging model bundles all code, static analyses, repackaging (piggy‑backing) detection, and machine‑learning‑based malware detection often process large amounts of code that are unrelated to the app’s core functionality. This “library noise” can lead to high false‑positive rates, missed detections, and excessive computational overhead.
To overcome this, the authors collected a massive dataset of approximately 1.5 million applications from Google Play and designed an automated three‑step pipeline to harvest, verify, and label common libraries.
Step 1 – Candidate Extraction: All package names from the APKs are extracted. Packages that appear frequently across many apps are considered candidates. To reduce redundancy, only the first three name segments are used (e.g., com.facebook.ads) and framework packages such as android.support are excluded.
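The candidate extraction step can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names, the in-memory `apps` mapping, and the toy package names are all hypothetical, and the only framework prefix excluded is the one named above.

```python
from collections import Counter

# Assumption: the only framework prefix excluded is the one named in the text.
FRAMEWORK_PREFIXES = ("android.support",)


def truncate_package(name, depth=3):
    """Keep only the first `depth` dot-separated segments of a package name."""
    return ".".join(name.split(".")[:depth])


def is_framework(pkg):
    """True if the package is, or lives under, an excluded framework prefix."""
    return any(pkg == p or pkg.startswith(p + ".") for p in FRAMEWORK_PREFIXES)


def candidate_counts(apps):
    """Count, for each truncated package, how many apps contain it.

    `apps` maps a (hypothetical) app identifier to the package names
    found in its APK. Each app contributes at most once per package.
    """
    counts = Counter()
    for packages in apps.values():
        truncated = {truncate_package(p) for p in packages if not is_framework(p)}
        counts.update(truncated)
    return counts


# Toy input; real input would come from parsing the DEX files of each APK.
apps = {
    "app1": ["com.facebook.ads.internal", "com.example.one.ui", "android.support.v4.app"],
    "app2": ["com.facebook.ads", "com.example.two"],
    "app3": ["com.facebook.ads.internal.view", "com.google.gson"],
}
counts = candidate_counts(apps)
for pkg, n in counts.most_common():
    print(pkg, n)
```

Counting apps rather than raw occurrences keeps a library with many sub-packages from being over-weighted by a single app.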
Step 2 – Library Confirmation: Frequency alone is insufficient because unrelated code can share a package name. The authors therefore apply several heuristics: (i) retain only packages present in at least ten apps, (ii) discard single‑segment names (unlikely to be widely distributed libraries), (iii) filter out obviously obfuscated packages, i.e., those containing single‑letter segments, and (iv) remove packages that are simple prefixes of longer ones. For the remaining candidates, they perform code similarity analysis at the method level. For each candidate, ten randomly selected pairs of apps containing it are compared; if ≥90 % of methods match, the package is deemed a common library. This process yields 1,113 verified general‑purpose libraries.
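The confirmation step can be sketched as follows, under several stated assumptions. The summary does not say how the method match ratio is computed or whether every sampled pair must pass, so this sketch uses the overlap divided by the larger method set and requires all sampled pairs to reach the threshold. Heuristic (iv) is interpreted as dropping a candidate when a longer candidate extends it. All names here are hypothetical.

```python
import random
from itertools import combinations

MIN_APPS = 10             # heuristic (i)
PAIRS_PER_CANDIDATE = 10  # pairs of apps compared per candidate
MATCH_THRESHOLD = 0.9     # share of methods that must match


def passes_heuristics(pkg, app_count, all_candidates):
    """Apply heuristics (i)-(iv) to one candidate package name."""
    segments = pkg.split(".")
    if app_count < MIN_APPS:                 # (i) too rare
        return False
    if len(segments) < 2:                    # (ii) single-segment name
        return False
    if any(len(s) == 1 for s in segments):   # (iii) looks obfuscated
        return False
    # (iv) assumed reading: drop a name that a longer candidate extends
    if any(other.startswith(pkg + ".") for other in all_candidates):
        return False
    return True


def method_match_ratio(methods_a, methods_b):
    """Assumed metric: shared methods divided by the larger method set."""
    if not methods_a or not methods_b:
        return 0.0
    return len(methods_a & methods_b) / max(len(methods_a), len(methods_b))


def is_common_library(methods_by_app, rng):
    """methods_by_app: app id -> set of method fingerprints under the package."""
    pairs = list(combinations(sorted(methods_by_app), 2))
    sample = rng.sample(pairs, min(PAIRS_PER_CANDIDATE, len(pairs)))
    return bool(sample) and all(
        method_match_ratio(methods_by_app[a], methods_by_app[b]) >= MATCH_THRESHOLD
        for a, b in sample
    )


# Toy data: the same 20 methods in 12 apps versus app-specific code
# that merely shares a package name.
same_code = {f"app{i}": {f"m{j}" for j in range(20)} for i in range(12)}
diff_code = {f"app{i}": {f"app{i}_m{j}" for j in range(20)} for i in range(12)}
rng = random.Random(0)
print(is_common_library(same_code, rng), is_common_library(diff_code, rng))
```

Sampling a fixed number of pairs keeps the cost per candidate constant regardless of how many apps contain the package.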
Step 3 – Advertising Library Identification: Advertising SDKs have characteristic patterns (network calls, UI widgets, tracking APIs). Using a combination of automated heuristics and manual inspection, the authors label 240 of the verified libraries as ad libraries.
The paper then conducts several empirical studies. First, it quantifies library prevalence: on average, about 60 % of an app’s code belongs to common libraries, confirming earlier findings. Moreover, malicious apps tend to embed certain ad SDKs more heavily and also include obscure third‑party libraries not typically seen in benign apps.
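As a rough illustration of how a library proportion could be measured, the sketch below computes the share of an app's methods that fall under whitelisted package prefixes. Measuring by method count is an assumption (the summary does not state the unit), and the package names and counts are invented for the example.

```python
def in_library(pkg, library_prefixes):
    """True if `pkg` equals or sits under any whitelisted library prefix."""
    return any(pkg == p or pkg.startswith(p + ".") for p in library_prefixes)


def library_share(method_counts, library_prefixes):
    """Fraction of methods that belong to whitelisted libraries.

    `method_counts` maps a package name to its number of methods.
    """
    total = sum(method_counts.values())
    if total == 0:
        return 0.0
    lib = sum(n for pkg, n in method_counts.items() if in_library(pkg, library_prefixes))
    return lib / total


# Hypothetical app: 400 methods of its own, 600 contributed by two libraries.
whitelist = {"com.facebook.ads", "com.google.gson"}
method_counts = {
    "com.example.app": 400,
    "com.facebook.ads.internal": 350,
    "com.google.gson": 250,
}
share = library_share(method_counts, whitelist)
print(f"{share:.0%} of methods come from whitelisted libraries")
```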
Second, the authors evaluate the impact on piggy‑backing detection. They present two real‑world pairs of apps. In the first pair, similarity computed on full code (including libraries) is 86 %, which would incorrectly flag a legitimate app as piggy‑backed using a standard 80 % threshold. After removing common libraries, similarity drops to 0 %, eliminating the false positive. In the second pair, a piggy‑backed app replaces one ad SDK with another, causing raw similarity to fall to 47 % (a false negative). Excluding libraries raises similarity to 84 %, correctly identifying the repackaged relationship. These examples demonstrate that library filtering dramatically reduces both false‑positive and false‑negative rates in similarity‑based detection.
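The effect of library filtering on a similarity score can be reproduced in miniature. The sketch below uses Jaccard similarity over method names, which is an assumption: the summary does not specify the paper's similarity metric. The apps and method names are invented, so the resulting values illustrate the direction of the effect and are not the 86 %, 0 %, 47 %, and 84 % figures reported above.

```python
THRESHOLD = 0.8  # similarity threshold quoted in the text


def jaccard(a, b):
    """Jaccard similarity of two method-name sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def strip_libraries(methods, library_prefixes):
    """Drop fully qualified method names that sit under a whitelisted prefix."""
    return {m for m in methods if not any(m.startswith(p + ".") for p in library_prefixes)}


def methods(prefix, n):
    """Build `n` made-up method names under `prefix`."""
    return {f"{prefix}.m{i}" for i in range(n)}


whitelist = {"com.ads.one", "com.ads.two"}

# Case 1: unrelated apps that share one large library (false positive risk).
app_a = methods("com.alpha.app", 2) | methods("com.ads.one", 18)
app_b = methods("com.beta.app", 2) | methods("com.ads.one", 18)
fp_raw = jaccard(app_a, app_b)
fp_clean = jaccard(strip_libraries(app_a, whitelist), strip_libraries(app_b, whitelist))

# Case 2: repackaged app that swaps the ad library and adds a small
# payload (false negative risk).
original = methods("com.game.core", 8) | methods("com.ads.one", 10)
repackaged = methods("com.game.core", 8) | methods("com.ads.two", 10) | {"com.game.payload.run"}
fn_raw = jaccard(original, repackaged)
fn_clean = jaccard(strip_libraries(original, whitelist), strip_libraries(repackaged, whitelist))

print(f"case 1: raw {fp_raw:.2f}, filtered {fp_clean:.2f}")
print(f"case 2: raw {fn_raw:.2f}, filtered {fn_clean:.2f}")
```

In the first case the raw score crosses the threshold only because of shared library code; in the second, the swapped library hides the shared core until it is filtered out.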
Third, the authors show that integrating the harvested whitelist into machine‑learning malware detectors reduces feature noise, leading to higher classification accuracy and lower computational cost, since taint‑analysis tools like FlowDroid no longer waste resources on irrelevant library code.
All identified libraries are released publicly on GitHub (https://github.com/servalsnt-uni-lu/CommonLibraries.git), providing the research community with a comprehensive, up‑to‑date whitelist for Android static analysis.
In summary, the paper makes four key contributions: (1) an automated, scalable method for extracting common libraries from a market‑scale Android corpus, (2) a systematic discrimination of advertising libraries, (3) extensive empirical evidence that library filtering improves repackaging detection and malware classification, and (4) the release of the largest publicly available library whitelist to date. The work not only clarifies the extent of library usage in the Android ecosystem but also offers a practical tool that can be immediately adopted to enhance the precision and efficiency of future Android security analyses.