Classification of Smartphone Users Using Internet Traffic
Today, smartphone devices are owned by a large portion of the population and have become a very popular platform for accessing the Internet. Smartphones provide the user with immediate access to information and services. However, they can easily expose the user to many privacy risks. Applications that are installed on the device and entities with access to the device’s Internet traffic can reveal private information about the smartphone user and steal sensitive content stored on the device or transmitted by the device over the Internet. In this paper, we present a method to reveal various demographics and technical computer skills of smartphone users by their Internet traffic records, using machine learning classification models. We implement and evaluate the method on real life data of smartphone users and show that smartphone users can be classified by their gender, smoking habits, software programming experience, and other characteristics.
💡 Research Summary
The paper investigates whether a smartphone user’s demographic attributes and technical competence can be inferred solely from the network traffic generated by the device. Recognizing that modern smartphones act as gateways to a wealth of personal data, the authors propose a machine‑learning pipeline that transforms raw packet captures into a rich set of behavioral features and then classifies users into categories such as gender, smoking habit, programming experience, and education level.
Data collection was performed over a two‑month period with more than 200 voluntary participants. Each participant routed all mobile traffic through a dedicated VPN server, allowing the researchers to capture full packet traces (pcap files) while preserving privacy by hashing IP and MAC addresses. In parallel, participants completed a detailed questionnaire that supplied ground‑truth labels for gender, age, smoking status, programming skill (none, beginner, intermediate, advanced), and education. The resulting dataset comprises roughly 30 days of traffic per user, amounting to several terabytes of raw data.
Feature engineering is the core of the methodology. The authors extract 158 variables grouped into four families: (1) quantitative usage metrics (total bytes per day, average packet size, peak hour traffic, night‑time activity ratio); (2) domain‑level semantics obtained by parsing TLS Server Name Indication (SNI) and HTTP Host headers, then mapping each domain to a pre‑defined category (social, news, development, gaming, shopping, health, etc.); (3) protocol and port distribution (counts of TCP, UDP, DNS, QUIC, and specific ports such as 22, 443, 8080); and (4) temporal patterns (weekday vs. weekend ratios, holiday spikes, diurnal rhythms). All features are standardized before feeding them to classifiers.
Four families of classifiers are evaluated: Random Forest, Gradient Boosting Machine (GBM), Support Vector Machine (linear kernel), and a shallow Multi‑Layer Perceptron (MLP). Hyper‑parameters are tuned via 5‑fold cross‑validation, and performance is reported using accuracy, precision, recall, and F1‑score. Random Forest consistently outperforms the other models, achieving the highest macro‑averaged F1 across all tasks. Specific results include:
- Gender classification – 85 % accuracy (F1 = 0.84). Feature importance shows that visits to gaming and sports sites are more frequent among male users, while shopping and fashion domains dominate female traffic.
- Smoking status – 78 % accuracy (F1 = 0.77). Night‑time traffic proportion and visits to health‑related or tobacco‑related domains are the strongest predictors.
- Programming experience – 71 % accuracy across four skill levels (F1 = 0.69). The presence of development‑oriented domains (GitHub, Stack Overflow, developer forums) and the usage of SSH (port 22) are decisive; advanced programmers also exhibit lower UDP‑heavy gaming traffic.
- Education level – 66 % accuracy (F1 = 0.64). High‑education users consume more news and scholarly sites, whereas lower‑education users spend more time on entertainment and social media.
Feature‑importance analysis confirms that domain‑category frequencies, specific port usage, and night‑time activity are the most informative signals. The authors also conduct ablation studies showing that removing domain semantics drops gender and programming‑skill performance by more than 10 percentage points, underscoring the value of semantic traffic profiling.
The study acknowledges several limitations. The participant pool is skewed toward young adults and urban residents, limiting external validity. VPN‑based collection masks local‑network traffic (e.g., Bluetooth or Wi‑Fi direct transfers), potentially omitting privacy‑relevant signals. Ground‑truth labels rely on self‑reported questionnaire data, which may contain recall bias. Moreover, the models exhibit some over‑fitting to the specific time window; longitudinal validation is needed.
Future work is outlined along four axes: (1) privacy‑preserving model training using federated learning combined with differential privacy to eliminate the need for central raw traffic storage; (2) real‑time streaming analytics that can flag privacy‑risk behaviors as they occur; (3) expansion to multinational, culturally diverse datasets to assess cross‑regional generalizability; and (4) integration with threat‑intelligence pipelines to study how inferred user profiles could be exploited by malicious actors (e.g., targeted phishing).
In conclusion, the paper demonstrates that smartphone Internet traffic contains sufficient granularity to infer a range of personal attributes with respectable accuracy. While this opens opportunities for personalized services, it also raises serious privacy concerns. Network operators, app developers, and policymakers must therefore consider robust safeguards, transparent data‑use policies, and ethical frameworks to prevent misuse of such profiling capabilities.
Comments & Academic Discussion
Loading comments...
Leave a Comment