Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

What is your messaging data used for? While many users rarely consider the information companies can gather based on their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to its walled-garden ecosystem, providing iMessage users on Mac with a single file storing all their messages and attached metadata. With knowledge of this locally stored file, the question becomes: What can our data do for us? In creating our iMessage text message analyzer, we set out to answer five main research questions spanning topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered with our analyzer and to demonstrate its potential for future studies of iMessage data.


💡 Research Summary

The paper presents a practical framework for extracting, preprocessing, and analyzing iMessage data stored locally on macOS devices. By leveraging the chat.db SQLite file that Apple makes available to end‑users, the authors—Alan Gerber, Sam Cooperman, and an anonymous Emory student—collected between 422,801 and 525,655 messages per participant, spanning from 2018 to 2022. The raw database was cleaned by stripping 78 Apple‑specific metadata tokens (e.g., “bplist”, “tdate”), converting text to lowercase, and removing punctuation. No lemmatization or stop‑word removal was performed, a deliberate choice to preserve the short, informal nature of text messages. The cleaned data were exported to CSV and made publicly available via a GitHub repository, allowing full reproducibility.
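The extraction and cleaning steps above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the `message` table and its `text`, `date`, and `is_from_me` columns reflect the standard chat.db schema, but the two-entry token list stands in for the paper's 78 Apple-specific tokens, and the punctuation regex is an assumption.

```python
import re
import sqlite3
from pathlib import Path

# Default location of the locally stored iMessage database on macOS.
DB_PATH = Path.home() / "Library" / "Messages" / "chat.db"

# Two of the Apple-specific metadata tokens the paper strips; the full
# list of 78 tokens is in the authors' GitHub repository.
METADATA_TOKENS = ["bplist", "tdate"]

def clean_message(text: str) -> str:
    """Lowercase, strip metadata tokens, and remove punctuation.

    Mirrors the paper's preprocessing: deliberately no lemmatization
    or stop-word removal, to preserve the informal register.
    """
    text = text.lower()
    for token in METADATA_TOKENS:
        text = text.replace(token, "")
    return re.sub(r"[^\w\s]", "", text).strip()

def load_messages(db_path=DB_PATH):
    """Read non-empty messages from chat.db and clean their text."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT text, date, is_from_me FROM message WHERE text IS NOT NULL"
    ).fetchall()
    conn.close()
    return [(clean_message(t), d, f) for t, d, f in rows]
```

Note that `message.date` is stored as an offset from Apple's 2001-01-01 epoch, so a timestamp conversion step would be needed before computing response times.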

For content analysis, the authors employed Gensim’s implementation of Latent Dirichlet Allocation (LDA) to generate 30 topics per user from the aggregated message corpus. Recognizing the well‑known difficulty of interpreting raw LDA word lists, they introduced a “reluctance” metric that combines response latency with topic probability. Specifically, each inbound message received a reluctance score equal to its response time in minutes divided by 1440 (the number of minutes in a day), capped at 1.0. This score was multiplied by the message’s topic probability, and the resulting values were averaged across all messages to yield a mean reluctance per topic. To prevent low‑frequency topics from appearing artificially high, the authors further weighted the average reluctance by the logarithm of the number of messages with a topic probability above 0.3.
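The reluctance metric can be written down compactly. A minimal sketch under stated assumptions: the paper does not specify the logarithm's base or how a zero count is handled, so the natural log and a `+1` smoothing term below are assumptions.

```python
import math

def reluctance(response_minutes: float, topic_prob: float) -> float:
    """Per-message reluctance: response latency as a fraction of a day,
    capped at 1.0, weighted by the message's topic probability."""
    return min(response_minutes / 1440.0, 1.0) * topic_prob

def topic_reluctance(messages, prob_threshold=0.3):
    """Average reluctance for one topic, weighted by the log of the
    number of messages whose topic probability exceeds the threshold.

    `messages` is a list of (response_minutes, topic_prob) pairs
    for messages assigned to this topic.
    """
    if not messages:
        return 0.0
    avg = sum(reluctance(m, p) for m, p in messages) / len(messages)
    n_confident = sum(1 for _, p in messages if p > prob_threshold)
    # +1 avoids log(0) for topics with no high-probability messages
    # (a smoothing choice not specified in the paper).
    return avg * math.log(n_confident + 1)
```

For example, a reply sent twelve hours after an inbound message in a topic with probability 0.5 scores `(720 / 1440) * 0.5 = 0.25`.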

Sentiment was assessed using VADER, a rule‑based sentiment analyzer that handles emojis and punctuation without additional preprocessing. Each message received a polarity score from –1.0 (most negative) to +1.0 (most positive). Scores above 0.05 were classified as positive, below –0.05 as negative, and the remainder as neutral. The authors aggregated sentiment over time and by direction (inbound vs. outbound) to examine trends.
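The classification thresholds map directly to code. A minimal sketch: the function below takes a precomputed VADER compound score, and the commented lines show how such a score is obtained with the `vaderSentiment` package (assumed installed).

```python
def classify_sentiment(compound: float) -> str:
    """Map a VADER compound score in [-1.0, 1.0] to a label using
    the thresholds described in the paper: > 0.05 positive,
    < -0.05 negative, everything in between neutral."""
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

# With VADER installed (pip install vaderSentiment), the compound
# score for a message would come from:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(msg)["compound"]
```

Note that scores of exactly ±0.05 fall into the neutral band under the paper's wording ("above 0.05" / "below –0.05").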

The study also explored responsiveness in group versus one‑to‑one chats. A reply was counted if a subsequent message from the user appeared within 1440 minutes, or if a tapback or threaded response was recorded. Reply rates and median response times were computed for four group‑size categories: 1‑on‑1, 2‑4 participants, 5‑8 participants, and 9+ participants. Across all three participants, larger groups exhibited lower reply rates and longer median response times, confirming prior findings that group size negatively impacts responsiveness. However, the authors noted considerable individual variation, especially for the anonymous participant whose median response times in medium‑sized groups were markedly higher, likely due to a single hyper‑active chat skewing the aggregate.
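The responsiveness computation above can be sketched as follows. This is an illustrative implementation, not the authors' code: the per-message dict layout is an assumption, and tapbacks/threaded responses are assumed to be pre-resolved into a `response_minutes` value upstream.

```python
import statistics

REPLY_WINDOW_MIN = 1440  # a reply must land within one day to count

def group_bucket(n_participants: int) -> str:
    """Bucket a chat into the paper's four group-size categories."""
    if n_participants == 1:
        return "1-on-1"
    if n_participants <= 4:
        return "2-4"
    if n_participants <= 8:
        return "5-8"
    return "9+"

def responsiveness(messages):
    """Reply rate and median response time per group-size bucket.

    `messages` is a list of dicts with keys:
      'bucket'           - group-size category of the chat
      'response_minutes' - minutes until the user replied, or None
                           if no reply was recorded
    """
    stats = {}
    for bucket in {m["bucket"] for m in messages}:
        times = [m["response_minutes"] for m in messages if m["bucket"] == bucket]
        replied = [t for t in times if t is not None and t <= REPLY_WINDOW_MIN]
        stats[bucket] = {
            "reply_rate": len(replied) / len(times),
            "median_response_min": statistics.median(replied) if replied else None,
        }
    return stats
```

Using the median rather than the mean matters here: as the authors observed with the anonymous participant, a single hyper-active chat can otherwise skew the aggregate.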

To identify “conversation starter” topics, the authors devised a starter_score that blends three components: (1) reply_rate (weighted 0.4), (2) speed_score = 1 – (average response time for the topic / 1440) (weighted 0.3), and (3) starter_prob = proportion of messages in the topic that are conversation initiators (weighted 0.3). The top ten topics by this composite metric were visualized for each user. For Alan, a travel‑related topic ranked highest; for Sam, a sports/fantasy‑football topic emerged; for the anonymous participant, academic‑project topics were prominent.
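The composite metric is a straightforward weighted sum. One assumption in the sketch below: the paper does not say what happens when average response time exceeds 1440 minutes (which would drive `speed_score` negative), so the value is clamped to [0, 1] here.

```python
def starter_score(reply_rate: float, avg_response_min: float,
                  starter_prob: float) -> float:
    """Composite 'conversation starter' score from the paper:
    0.4 * reply_rate + 0.3 * speed_score + 0.3 * starter_prob,
    where speed_score = 1 - avg_response_min / 1440.

    Clamping speed_score to [0, 1] is an assumption not stated
    in the paper.
    """
    speed_score = max(0.0, min(1.0, 1.0 - avg_response_min / 1440.0))
    return 0.4 * reply_rate + 0.3 * speed_score + 0.3 * starter_prob
```

For example, a topic with a 50% reply rate, a 12-hour average response time, and no conversation-initiating messages scores `0.4 * 0.5 + 0.3 * 0.5 + 0.3 * 0.0 = 0.35`.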

Key findings include: (i) each user’s most “reluctant” topics differed—Alan’s were time‑sensitive scheduling messages, Sam’s involved fantasy football, and the anonymous participant’s centered on school project group chats; (ii) topic prevalence over time remained relatively stable for most topics, with only minor fluctuations, suggesting that iMessage usage patterns are consistent once established; (iii) sentiment analysis revealed a dominance of neutral messages, a modest but consistent upward trend in positive sentiment across all participants, and a higher proportion of positive outbound messages for the anonymous participant; (iv) the expected negative sentiment bias observed in public social‑media platforms (e.g., Twitter) was not replicated in private iMessage conversations.

The authors released an open‑source web application (textmessageanalyzer.com) that automates the entire pipeline: data import, LDA topic modeling, reluctance calculation, sentiment scoring, and visualization. The tool also allows users to export results as CSV for custom analysis in Excel, R, or Python, and optionally integrates with the OpenAI API to generate human‑readable topic labels via ChatGPT.

In the discussion, the authors acknowledge several limitations: a sample size of only three users restricts generalizability; the reluctance metric relies solely on elapsed time, ignoring contextual factors such as time of day, work schedules, or message content complexity; LDA hyper‑parameters (number of topics, α, β) were not systematically tuned or cross‑validated; and VADER’s English‑centric lexicon may misclassify non‑English or code‑mixed messages. They propose future work that expands the cohort, applies more sophisticated topic models (e.g., BERTopic, neural LDA), incorporates multilingual sentiment tools, and employs mixed‑effects statistical models to robustly test the relationship between group size and responsiveness.

Overall, the paper demonstrates that iMessage metadata, when processed locally, can yield rich behavioral insights while preserving user privacy. The combination of topic modeling, response latency, and sentiment analysis offers a multi‑dimensional portrait of personal communication habits, and the publicly released analyzer provides a reproducible platform for both individual users and researchers interested in mobile communication studies.

