On building an automated responding system for app reviews: What are the characteristics of reviews and their responses?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recent studies showed that dialogs between app developers and app users on app stores are important for increasing user satisfaction and apps' overall ratings. However, the large volume of reviews and limited resources discourage app developers from engaging with customers through this channel. One solution to this problem is to develop an Automated Responding System that helps developers respond to app reviews in a manner most similar to a human response. Toward designing such a system, we have conducted an empirical study of the characteristics of mobile app reviews and their human-written responses. We found that an app review can contain multiple sentence-level fragments with different topics and intentions. Similarly, a response can also be divided into multiple fragments, each with a unique intention, answering certain parts of the review (e.g., complaints, requests, or information seeking). We also identified several characteristics of reviews (rating, topics, intentions, and quantitative text features) that can be used to rank reviews by their priority of needing a response. In addition, we identified that the degree of re-usability of past responses depends on their context (a single app, apps of the same category, or their common features). Last but not least, a response can be reused for another review if some parts of it can be replaced by a placeholder that is either a named entity or a hyperlink. Based on these findings, we discuss the implications of developing an Automated Responding System to help mobile app developers write responses to user reviews more effectively.


💡 Research Summary

The paper investigates the characteristics of mobile app store reviews and the human‑written developer responses that could inform the design of an Automated Responding System (ARS). The authors collected 649,645 reviews from 164 Google Play apps over a three‑month period in early 2019, identifying 3,212 review‑response pairs from 33 top‑trending apps (approximately 0.5% of reviews received a response). They manually annotated each sentence in both reviews and responses for topic and intention, yielding 18,301 sentences (7,095 review sentences, 11,032 response sentences). Review intentions were categorized into eight types (comparison, complaint, request, information giving, information seeking, praise, ultimatum, unknown) and response intentions into eleven types (solution, customer support, greeting, promise, appreciation, apology, farewell/signature, information seeking, re‑rating request, information giving, unknown). Inter‑rater agreement was 99% (Cohen’s κ).

Key findings include: (1) Reviews are often multi‑fragmented; a single review can contain several sentences with distinct topics and intentions. Correspondingly, responses are also multi‑fragmented, each fragment addressing a specific review fragment (e.g., promises, appreciation, signature). (2) Rating influences response likelihood but is insufficient alone: half of responded reviews have ratings ≤3, yet 5‑star reviews are the second most frequently answered. (3) The most common topics in responded reviews are Feature/Functionality (52%) and App/Content (38%). (4) Re‑usability of past responses depends on context: responses can be reused at the level of the same app, apps within the same category, or apps sharing common features, but category labels alone are unreliable. (5) Fully reusable responses are rare; however, templated responses containing placeholders (average 1.6 per response, such as named entities or hyperlinks) enable partial reuse across similar reviews. (6) A Markov Chain model that captures the transition probabilities between review intentions and response intentions proved feasible for generating response structures.
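The Markov Chain idea in finding (6) can be sketched as a simple transition model estimated from annotated (review intention, response intention) pairs. This is a minimal illustration, not the paper's implementation; the intention labels follow the taxonomy above, but the training pairs and probabilities below are fabricated for demonstration.

```python
from collections import defaultdict

def train_transitions(pairs):
    """Estimate P(response intention | review intention) from
    annotated (review_intention, response_intention) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for review_int, response_int in pairs:
        counts[review_int][response_int] += 1
    probs = {}
    for review_int, resp_counts in counts.items():
        total = sum(resp_counts.values())
        probs[review_int] = {r: c / total for r, c in resp_counts.items()}
    return probs

# Illustrative (fabricated) training pairs -- not the paper's data.
pairs = [
    ("complaint", "apology"),
    ("complaint", "solution"),
    ("complaint", "apology"),
    ("praise", "appreciation"),
]
model = train_transitions(pairs)
# model["complaint"] now holds the estimated transition probabilities,
# e.g. {"apology": 2/3, "solution": 1/3}
```

Given a review fragment's intention, the highest-probability response intention can then be used to pick the next fragment template when assembling a multi-part response.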

Implications for ARS design are articulated: (i) a ranking module should prioritize reviews based on a combination of quantitative features (rating, length, sentiment) and qualitative features (identified intentions and topics); (ii) the generation engine must support multi‑segment output, selecting appropriate intention‑specific templates and filling placeholders dynamically; (iii) a hierarchical repository of past responses indexed by app, category, and shared functionality can accelerate template retrieval; (iv) sequence‑to‑sequence or Markov models can be employed to predict the next response fragment given the current review fragment, ensuring coherence and human‑like flow.
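The ranking module in implication (i) can be sketched as a scoring function that mixes the quantitative and qualitative features named above. The weights and the exact feature combination below are hypothetical assumptions for illustration; the paper identifies the features but does not prescribe a formula.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    rating: int                 # 1-5 stars
    length: int                 # word count
    sentiment: float            # -1.0 (negative) .. 1.0 (positive)
    intentions: set = field(default_factory=set)  # sentence-level labels

# Intentions the paper flags as needing a response (complaints, requests,
# information seeking); other labels (praise, etc.) get no bonus here.
PRIORITY_INTENTIONS = {"complaint", "request", "information seeking"}

def priority_score(review: Review) -> float:
    """Combine quantitative and qualitative signals into a single
    response-priority score (higher = answer sooner). Weights are
    illustrative, not taken from the paper."""
    score = (5 - review.rating) * 2.0        # low ratings ranked first
    score += min(review.length, 100) / 50.0  # longer reviews carry more detail
    score -= review.sentiment                # negative sentiment raises priority
    score += 3.0 * len(review.intentions & PRIORITY_INTENTIONS)
    return score

reviews = [
    Review(rating=1, length=80, sentiment=-0.8, intentions={"complaint"}),
    Review(rating=5, length=10, sentiment=0.9, intentions={"praise"}),
]
ranked = sorted(reviews, key=priority_score, reverse=True)
# ranked[0] is the 1-star complaint, queued ahead of the 5-star praise
```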

Overall, the study provides a granular, data‑driven foundation for building an ARS that can reduce developer workload while maintaining the quality and personalization of human responses.

