Reflections

This literature review found several datasets of Android applications. They collect and provide executables, market and distribution data, source code, and even analysis results at various levels of detail.

App store data

One common problem faced by many datasets is the lack of documented access to data from app stores, especially . Google does not provide a public API, and other market places actively block crawlers from collecting data. Tools do exist to gather app store data from many sources, but they rely heavily on regular maintenance and updates to keep working. Future work could include creating one dataset with comprehensive access to market place data to facilitate research on Android applications.

Updated data

Researchers have poured a lot of work into creating diverse datasets of Android apps. The information in these datasets can shed light on interesting questions in the field of Android research. Unfortunately, many of these datasets have not received updates in years, and the information in them has turned stale. Researchers facing an ever-changing environment of application development cannot rely on these old datasets to perform current research. This leads to a gap in possible research, since newer Android app datasets may not include the information necessary to answer some research questions. Future efforts should be directed at updating existing datasets and setting up new datasets in such a way that they are easier to maintain and keep up to date. Releasing the tooling used to create a dataset is already a step in the right direction. Regularly performing the data collection process and making the results available in a versioned format or as a timeline should be the next step.

Accessibility of data

Worse than the problem of outdated data is inaccessible data. Many datasets of Android applications have not been released publicly, or their authors stopped sharing them after some time. It is unfortunate that potentially useful data is not shared with the research community. Instead of re-creating datasets from scratch, building upon previous work and complementing existing data would benefit authors of both old and new publications. Therefore, researchers should make sure they share data in widely accessible formats and on open platforms, so that availability does not depend on individual maintenance. Including permanent links to the data could also help keep it accessible years after publication.

Source code

Previous studies and datasets provide different levels of access to data of Android applications. However, none of the datasets combines all potential data. Martin et al. also highlight a key shortcoming of the literature in its current state: There are few mining tools and datasets which combine source code with application metadata from app stores and development tools for large sets of apps. One tool that combines access to all sources mentioned above is . To ease access to app market data, Avdiienko et al. developed a toolchain for mining Android apps. It has modules for data retrieval from various sources. This design allows Avdiienko et al. to combine app metadata, user reviews, executables, and source code where applicable. Modules include crawlers and metadata analysis as well as static program analysis and post-processing. can retrieve source code for Android apps limited to those listed on but does not seem to be publicly available. Some datasets have increased the number of Android applications for which source code is available. Unfortunately, this number is still low and the sample of apps is likely biased. Finding additional means to get access to source code should be on the agenda for future work.

Combining existing data

Finally, future research could benefit more from existing datasets if the information they contain could be related to information in other datasets. Various efforts have been undertaken to gather, process, and present relevant data. The information on Android apps in different datasets is complementary. New insights could be gained from combining datasets and drawing connections between the existing data points. Future work could facilitate this kind of research by creating a meta-dataset which links data on Android applications in existing datasets.
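The idea of a meta-dataset can be illustrated with a minimal sketch. It assumes that the package name serves as the shared key between datasets, which is our assumption and not something the reviewed datasets guarantee. The dataset names, field names, and records below are hypothetical.

```python
from collections import defaultdict


def link_by_package(datasets):
    """Map each package name to the records that mention it, keyed by dataset."""
    index = defaultdict(dict)
    for dataset_name, records in datasets.items():
        for record in records:
            # Assumption: every record carries a "package" field.
            index[record["package"]][dataset_name] = record
    return dict(index)


# Hypothetical records; no real dataset contents are reproduced here.
datasets = {
    "store_metadata": [
        {"package": "org.example.notes", "rating": 4.5},
        {"package": "org.example.maps", "rating": 3.9},
    ],
    "source_code": [
        {"package": "org.example.notes", "repository": "https://example.org/notes.git"},
    ],
}

linked = link_by_package(datasets)
# Apps for which every dataset contributes a record.
in_all = sorted(p for p, hits in linked.items() if len(hits) == len(datasets))
print(in_all)
```

In practice, a package name alone may not identify an app unambiguously across market places, so a real meta-dataset would likely need additional keys such as version codes or file hashes.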

Allix et al. created crawlers for several app stores to collect a comprehensive and up-to-date sample of executable Android app packages, called AndroZoo. The crawlers are customized for each app store to collect as many apps as possible. At the same time, the authors took measures to minimize the load on the market places they crawl, to avoid losing access and jeopardizing the long-term integrity of the dataset. The sources from which AndroZoo draws include major market places , Anzhi, and AppChina, as well as smaller directories 1mobile, AnGeeks, Slideme, ProAndroid, HiApk, and F-Droid. The applications from these app stores were complemented with additional artifacts from peer-to-peer distributed torrents and the Android Malware Genome Project. Candidate apps are downloaded by dedicated crawlers for each source, and each download is recorded with a unique identifier and a checksum of the file for deduplication. Most crawlers are based on the scrapy framework. However, Allix et al. created special software to overcome restrictions of , an undocumented API, rate limits, and the need for an Android device. A central dispatcher spreads the workload to download agents in several locations and over different protocols. With this setup it was possible to eliminate the backlog of old applications. Subsequently, fewer agents were necessary to keep up with new additions to . A web service is tasked with organizing and storing received APKs. This unit also handles authentication for downloads of the dataset and publicly displays statistics. When creating AndroZoo, Allix et al. encountered several data collection challenges. They list unexpected downtime of markets, HTML instability, monitoring of crawlers, protocol changes, and information loss. Overall, the authors were able to collect more than three million Android applications initially. The current count is more than five million.
The majority of these apps stem from , Anzhi, and AppChina, with the other market places contributing much lower numbers. The dataset is available for download to the research community as a regularly updated list of APKs. This list contains SHA256 hashes as identifiers and additional metadata, such as compilation date, malware status, package name, and version. Individual apps can be downloaded with the SHA256 hash as index. One defining feature of AndroZoo is that all apps in the dataset are tested for malware by over 60 security products hosted by VirusTotal. Allix et al. report that 22 percent of apps in are flagged as malware by at least one product, while 50 percent or more are found to be malware in the two major Chinese market places. When counting only APKs which at least ten security products recognize as malware, this number drops to around 1 percent of detected malware in and 33 percent and 17 percent in Anzhi and AppChina respectively. All samples of the Android Malware Genome Project are successfully recognized by at least 10 antivirus products. The dataset lends itself to security research since the metadata of all samples contains the malware detection status. Examples of such research based on AndroZoo are . Other uses leverage the fact that the dataset contains several versions of many apps and the availability of compiled bytecode. AndroZoo also contains many Android applications which are not marketed in . This facilitates analysis of marketed and non-marketed apps. Limitations of AndroZoo mostly stem from the fragility of the data collection process. Collection was not continuous but was resumed irregularly when issues occurred. Additionally, some app market maintainers have blocked crawlers and thus caused outages and incomplete sets of data.
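The effect of the detection threshold on the reported malware share can be shown with a small sketch. It assumes that each metadata record carries a count of security products that flagged the APK; the field name `detections` and the sample records are hypothetical and do not reproduce AndroZoo's actual list format or contents.

```python
def flagged_share(records, min_detections):
    """Percentage of records flagged by at least `min_detections` products."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r["detections"] >= min_detections)
    return 100.0 * flagged / len(records)


# Hypothetical sample: placeholder hashes and invented detection counts.
sample = [
    {"sha256": "a" * 64, "detections": 0},
    {"sha256": "b" * 64, "detections": 1},
    {"sha256": "c" * 64, "detections": 3},
    {"sha256": "d" * 64, "detections": 12},
]

share_any = flagged_share(sample, 1)   # flagged by at least one product
share_ten = flagged_share(sample, 10)  # flagged by at least ten products
print(share_any, share_ten)
```

A stricter threshold can only lower the share or leave it unchanged, which matches the direction of the figures reported above.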

Conclusions

Researchers of Android applications have a vast amount of data at hand. There are already many datasets containing executable artifacts. App store metadata is plentiful and public, albeit difficult to access. Many studies report this problem, especially in accessing data from . However, insight into source code is limited because the vast majority of apps are proprietary. Several studies tried to gather and combine source code with other app metadata.

Datasets of app store data and executables have the advantage that they are independent of the licensing of the application source code. Data from marketplaces can be scraped for free, while APK archives can be downloaded from app stores. On the other hand, source code for proprietary applications is to a large extent not available at all. Unfortunately, a comprehensive dataset of (almost) all available apps, as with AndroZoo, cannot be reconciled with access to source code.

Introduction

Mobile phones and tablets have become the most widely used computing devices. Consequently, development of mobile applications has surged and become a major field of study. Additionally, mobile platforms bring their particular set of constraints. For instance, energy on small devices is scarce and power management is paramount. Privacy of users and software security are other highly studied topics. These kinds of challenges call for dedicated solutions, tools, and datasets.

This survey reviews datasets of Android applications. However, not all research needs the same set of data. Martin et al. provide an extensive survey of studies and datasets of app store analysis for various platforms. They identified seven key subfields (API Analysis, Feature Analysis, Release Engineering, Review Analysis, Security, Store Ecosystem, and Size and Effort Prediction) as well as a category of Closely Related Work (among which is Mining Tools). Research may be interested in technical attributes such as API usage or platform version, as well as non-technical attributes such as reviews and number of downloads. For dynamic analysis of applications, executable artifacts are necessary. Bytecode from APKs can be decompiled to learn information about data flow and other code metrics. To analyze apps for programming practices and project management, source code and data from source code management programs such as version control and bug trackers are helpful. The latter category of information is not readily available for the vast number of proprietary applications. Studies that need this kind of data have to rely on open-source Android apps.

Therefore, we review existing literature for various characteristics which may facilitate different sub-fields and studies. This survey thus focuses on these main traits of datasets of Android applications:

Does the dataset facilitate access to source code of applications?
Is the source code available in version control (Git)?
Are installable APKs included?
Does the dataset link to app stores where additional information (such as ratings and reviews) is accessible?

This literature survey is structured as follows: First, in Section 6, we explain the iterative literature review process, from keyword search and snowballing to a concise view of important information in a table. Following that, we review and summarize datasets and studies resulting from the search process (Section 11). Lessons from the review results are detailed in Section 8, where we argue that too few datasets include access to source code and that those which link source code contain too few applications. Finally, we conclude this survey in Section 9.

Another dataset of Android applications is the Android Malware Genome Project. Zhou and Jiang collected samples of malicious Android apps from August 2010 to October 2011 to advance the understanding of malware on mobile platforms. They present a dataset of 1,260 apps in 49 different malware categories. Furthermore, the authors analyze and characterize the collected malware samples to trace behavior and major outbreaks of certain types. Zhou and Jiang report that most of the samples are repackaged versions of legitimate applications containing a malicious payload. Other vectors for infecting Android devices are update attacks and drive-by downloads. Types of malware include root-level exploits, botnet clients, apps incurring costs through calls or messages to premium-rate numbers, and apps harvesting users’ information. In their evolution-based study, Zhou and Jiang describe how Android malware rapidly evolves. Thus, malware authors are able to keep ahead of existing anti-malware solutions by applying sophisticated obfuscation and evasion techniques.

Literature Review Process

Literature presented in this review was collected with a combination of keyword search and snowballing, walking the graph of references in both directions. All queries were run against the database in winter 2017/18. The review is concerned with datasets of Android applications in general but also more specifically with datasets that allow access to source code of apps. Figure 1 shows the iterative search process, which followed four steps, repeating phase 2 and phase 3 until search results were exhausted. The four steps are (1) an initial keyword search, (2) filtering of relevant publications by title and abstract, (3) finding candidate publications by following the graph of citations from new search results to both citing and cited articles, and finally, (4) summarizing all relevant publications found in textual and tabular form.

Phase 1: Keyword search

Initially we searched for “Android app dataset”, “Android app collection”, and “Android app mining”. The search results were complemented by replacing the keyword app with application in each search term. Filtering the search results for relevant publications showed one major group of publications around the topic of Android application security. These publications are largely centered around AndroZoo and the Android Malware Genome Project. To broaden the search scope and find datasets including source code, the search terms “android app” “source code” repository dataset were included.
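As an illustration of how the search terms were expanded, the following minimal sketch generates the variant of each term in which app is replaced by application. The function name `expand_terms` is hypothetical; the actual queries were entered manually.

```python
def expand_terms(terms, word="app", replacement="application"):
    """Return each term followed by its variant with `word` replaced, if any."""
    expanded = []
    for term in terms:
        expanded.append(term)
        # Replace whole words only, so that e.g. "apple" would stay untouched.
        variant = " ".join(replacement if w == word else w for w in term.split())
        if variant != term:
            expanded.append(variant)
    return expanded


terms = ["Android app dataset", "Android app collection", "Android app mining"]
queries = expand_terms(terms)
for query in queries:
    print(query)
```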

Figure 1: Literature review process

Phase 2.1: Title filter

The search results at this point were filtered to exclude publications that are obviously out of scope for this review by looking at their titles.

Phase 2.2: Abstract filter

After reducing the scope by title, we read through the abstracts of all search results and filtered out those which do not create a dataset of Android applications. We looked for indicators that the paper actually gathers data on Android applications or uses a dataset to study Android apps. Only in the former case did we include the publication in our set of relevant work. In the latter case, we did not deem the paper itself relevant to our review but included it in the snowballing phase to find further links to existing datasets.

If the filter of Phase 2 yielded new results, Phase 3 was revisited. Otherwise, the collection phases were concluded and we continued with Phase 4.

Phase 3.1: Cited publications

In the next step, we followed links from the new papers collected so far to find relevant publications which are cited by them. This allowed us to find previous works which the authors of already identified publications deem relevant to the subject.

Phase 3.2: Citing publications

We also searched for publications which refer to papers already in our set of relevant works. While looking at cited publications offers a glance into the past of related literature, searching for articles which cite already known papers gives information about the future from the time of these papers.

This new list of candidate articles was then fed into the filtering process (Phase 2).
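The loop between Phase 2 and Phase 3 can be summarized in a short sketch. It is a simplified model under stated assumptions: the citation graph is given as two hypothetical dictionaries (`cites`, `cited_by`), and the manual title and abstract filters are represented by two predicates. In the actual review, these steps were performed by hand.

```python
def snowball(seeds, cites, cited_by, creates_dataset, uses_dataset):
    """Repeat filtering (Phase 2) and citation following (Phase 3) until no new papers appear."""
    relevant, seen = set(), set()
    frontier = set(seeds)
    while frontier:
        seen |= frontier
        # Phase 2: papers that create a dataset are relevant; papers that only
        # use one are not kept but are still followed during snowballing.
        keep = {p for p in frontier if creates_dataset(p)}
        follow = keep | {p for p in frontier if uses_dataset(p)}
        relevant |= keep
        # Phase 3: gather cited (backward) and citing (forward) publications.
        neighbours = set()
        for paper in follow:
            neighbours |= set(cites.get(paper, ()))
            neighbours |= set(cited_by.get(paper, ()))
        frontier = neighbours - seen
    return relevant


# Hypothetical citation graph: A cites B and X, U cites A and C.
cites = {"A": ["B", "X"], "U": ["A", "C"]}
cited_by = {"A": ["U"], "B": ["A"], "C": ["U"], "X": ["A"]}
creators, users = {"A", "B", "C"}, {"U"}

found = snowball(["A"], cites, cited_by,
                 lambda p: p in creators, lambda p: p in users)
print(sorted(found))
```

In this toy graph, paper C is only reachable through U, a paper that uses a dataset without creating one, which is why such papers are followed even though they are not kept.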

Phase 4.1: Summaries

Phase 4 started once the data collection process was complete, with 28 relevant publications, and repeating phases 2 and 3 did not return any new relevant publications. We read all search results and briefly summarized them (Section 11).

Phase 4.2: Tabular data overview

Data from these summaries was then processed into a table (Appendix 12).

Phase 4.3: Categories of datasets

Finally, we categorized datasets of Android apps which we found in the literature into (1) datasets which use data from app markets (Section 11.1), (2) datasets providing executable APKs (Section 11.2), and (3) datasets with access to source code based on (Section 11.3).