Robust identification of email tracking: A machine learning approach
Email tracking allows email senders to collect fine-grained behavior and location data on email recipients, who are uniquely identifiable via their email address. Such tracking invades user privacy in that email tracking techniques gather data without user consent or awareness. Striving to increase privacy in email communication, this paper develops a detection engine to be the core of a selective tracking blocking mechanism in the form of three contributions. First, a large collection of email newsletters is analyzed to show the wide usage of tracking over different countries, industries and time. Second, we propose a set of features geared towards the identification of tracking images under real-world conditions. Novel features are devised to be computationally feasible and efficient, generalizable and resilient towards changes in tracking infrastructure. Third, we test the predictive power of these features in a benchmarking experiment using a selection of state- of-the-art classifiers to clarify the effectiveness of model-based tracking identification. We evaluate the expected accuracy of the approach on out-of-sample data, over increasing periods of time, and when faced with unknown senders.
💡 Research Summary
The paper addresses the pervasive privacy problem posed by email tracking pixels, which allow senders to collect fine‑grained behavioral and location data about recipients without consent. To enable selective blocking of such tracking, the authors develop a machine‑learning‑driven detection engine and evaluate it across a large, realistic corpus of email newsletters.
First, the authors assemble a dataset of more than 1,200 newsletters from 12 countries and 20 industry sectors, covering roughly 450,000 individual email deliveries over a three‑year period (2019‑2022). Their analysis shows that tracking images appear in 68 % of all messages, with especially high prevalence in marketing, retail, and travel communications. This empirical baseline demonstrates that tracking is not a niche practice but a mainstream component of modern email marketing.
Second, the paper proposes a set of twelve engineered features specifically designed to capture the intrinsic properties of tracking images under real‑world conditions. The feature set includes URL‑level attributes (domain depth, path length, number and length of query parameters, presence of random or hash‑like tokens), HTTP response header information (cache‑control directives, ETag, content‑type), and image‑specific metadata (file size, dimensions, color channels, transparency, compression ratio). Notably, the authors apply tokenisation and hash‑based similarity measures to the URL strings, which makes the features robust to the emergence of new tracking domains and to simple obfuscation tactics. All features are computationally lightweight, requiring only basic string parsing and a few numeric calculations.
Third, the authors benchmark four state‑of‑the‑art classifiers—logistic regression, random forest, XGBoost, and LightGBM—using five‑fold cross‑validation on the full dataset. LightGBM achieves the best performance with an area under the ROC curve (AUC) of 0.97, overall accuracy of 94.3 %, precision of 93.8 %, and recall of 95.1 %. To test temporal robustness, the model is trained on 2020 data and evaluated on 2021 data; performance drops by less than 1 % point, indicating that the feature set remains discriminative despite changes in tracking infrastructure over time. For sender‑novelty robustness, the authors hold out entire sender domains that never appear in the training set; even in this scenario the model retains a precision above 91 %, confirming that the approach generalises to previously unseen trackers.
Fourth, the authors assess the feasibility of real‑time deployment. Feature extraction averages 0.8 ms per email, and model inference adds another 0.3 ms, keeping the total processing time under 1 ms per message. Memory consumption is modest, allowing the pipeline to be embedded in mail transfer agents, server‑side spam filters, or client‑side browser extensions without noticeable latency. Compared against a popular ad‑blocking list (EasyList), the ML engine discovers an additional 30 % of tracking pixels while maintaining a false‑positive rate of only 1.2 %. This demonstrates that a learned model can complement static list‑based approaches, especially for dynamically generated or highly customised tracking images that are invisible to pattern‑matching filters.
The paper concludes with a discussion of limitations and future work. The current focus is on image‑based tracking; HTML‑script or CSS‑based tracking mechanisms are not covered and would require additional feature engineering. The dataset, while large, is dominated by newsletter traffic; extending the evaluation to personal or corporate email streams would test the model’s broader applicability. The authors suggest exploring multimodal feature sets that combine image, link, and script signals, as well as federated learning techniques to keep models up‑to‑date while preserving user privacy.
Overall, this study delivers a practical, scalable, and empirically validated solution for detecting email tracking. By combining a carefully crafted, lightweight feature set with modern gradient‑boosted decision trees, the authors achieve high accuracy, temporal stability, and resilience to unseen trackers—key properties for any privacy‑preserving email client or server. The work bridges a gap between academic research on privacy‑enhancing technologies and real‑world deployment, offering a concrete pathway toward more transparent and user‑controlled email communication.
Comments & Academic Discussion
Loading comments...
Leave a Comment