$A^{4}NT$: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation
Text-based analysis methods can reveal privacy-relevant author attributes such as the gender, age, and identity of a text’s author. Such methods can compromise the privacy of an anonymous author even when the author tries to remove privacy-sensitive content. In this paper, we propose an automatic method, called Adversarial Author Attribute Anonymity Neural Translation ($A^4NT$), to combat such text-based adversaries. We combine sequence-to-sequence language models used in machine translation and generative adversarial networks to obfuscate author attributes. Unlike machine translation techniques, which need paired data, our method can be trained on unpaired corpora of text containing different authors. Importantly, we propose and evaluate techniques to impose constraints on $A^4NT$ to preserve the semantics of the input text. $A^4NT$ learns to make minimal changes to the input text to successfully fool author attribute classifiers, while aiming to maintain the meaning of the input. We show through experiments on two different datasets and three settings that our proposed method is effective in fooling the author attribute classifiers and thereby improving the anonymity of authors.
💡 Research Summary
The paper addresses the growing privacy threat posed by stylometric and author‑attribute classifiers that can infer gender, age, or identity from short texts. Existing defenses are either semi‑automatic tools that suggest edits to users, demanding manual effort, or generic machine‑translation pipelines that fail to sufficiently hide the targeted attributes and often alter the meaning of the text. To overcome these limitations, the authors propose A⁴NT (Adversarial Author Attribute Anonymity Neural Translation), a fully automatic system that treats attribute obfuscation as a style‑transfer problem akin to neural machine translation (NMT).
A⁴NT’s core architecture is a sequence‑to‑sequence (Seq2Seq) model that maps an input sentence from a source attribute distribution (e.g., female‑written) to a target attribute distribution (e.g., male‑written). Because paired sentences with identical semantics but different styles are practically unavailable, the authors cast the problem as distribution matching and train the generator within a Generative Adversarial Network (GAN) framework. The discriminator is an attribute classifier (implemented as a word‑level LSTM encoder followed by a linear‑softmax layer) that predicts the attribute label of a given sentence. The generator learns to produce sentences that the discriminator misclassifies as belonging to the target class, thereby achieving attribute anonymity.
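To make the adversarial objective concrete, here is a minimal toy sketch (not the paper's implementation): a hypothetical bag‑of‑words logistic "discriminator" scores the probability that a sentence belongs to the target attribute class, and the generator's adversarial loss is the negative log of that probability, so sentences that fool the classifier incur lower loss. The vocabulary, weights, and example sentences below are all illustrative assumptions.

```python
import math

def adversarial_generator_loss(disc_prob_target: float) -> float:
    """Generator loss: negative log-probability that the (toy) attribute
    discriminator assigns to the *target* class. Minimizing this pushes
    generated sentences toward the target style distribution."""
    eps = 1e-12
    return -math.log(disc_prob_target + eps)

# Toy discriminator: bag-of-words logistic model over a tiny vocabulary.
# Weights are hand-picked for illustration, not learned from any corpus.
WEIGHTS = {"hello": 0.4, "mate": 1.2, "dear": -0.9, "lovely": -1.1}
BIAS = 0.0

def discriminator_prob_target(tokens):
    """P(target attribute | sentence) under the toy logistic model."""
    score = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

original = ["hello", "dear", "friend"]
obfuscated = ["hello", "mate", "friend"]

# The obfuscated sentence scores higher under the target class, so the
# generator's adversarial loss on it is lower than on the original.
loss_orig = adversarial_generator_loss(discriminator_prob_target(original))
loss_obf = adversarial_generator_loss(discriminator_prob_target(obfuscated))
```

In the full system the discriminator is a learned LSTM classifier rather than a fixed word list, and both networks are updated in alternation as in standard GAN training.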
Preserving the original meaning is crucial for a usable defense. Since exact semantic distance is hard to compute, the authors introduce two proxy losses: (1) an embedding‑based semantic similarity loss that measures cosine similarity between the original and generated sentences using pretrained sentence embeddings (e.g., BERT), and (2) a cycle‑consistency loss that translates the generated sentence back to the source style and penalizes deviation from the original. These losses are combined with the adversarial loss, encouraging minimal yet effective stylistic changes.
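The two proxy losses above can be sketched in simplified form. In this hypothetical sketch, the semantic loss is one minus the cosine similarity of averaged word vectors (a stand‑in for a pretrained sentence embedding), and the cycle loss is a crude token‑mismatch rate between the source and its back‑translation (standing in for a reconstruction likelihood); the embedding table and weights are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_embedding(tokens, emb):
    """Average of per-token vectors; a toy stand-in for a sentence embedding."""
    dims = len(next(iter(emb.values())))
    acc, n = [0.0] * dims, 0
    for t in tokens:
        if t in emb:
            n += 1
            for i, x in enumerate(emb[t]):
                acc[i] += x
    return [x / n for x in acc] if n else acc

def semantic_loss(src, gen, emb):
    """1 - cosine similarity between source and output sentence embeddings."""
    return 1.0 - cosine(mean_embedding(src, emb), mean_embedding(gen, emb))

def cycle_loss(src, reconstructed):
    """Token-level mismatch between the source and its back-translation."""
    mism = sum(a != b for a, b in zip(src, reconstructed))
    mism += abs(len(src) - len(reconstructed))
    return mism / max(len(src), 1)

def total_loss(adv, sem, cyc, w_sem=1.0, w_cyc=1.0):
    """Weighted sum of adversarial, semantic, and cycle-consistency terms."""
    return adv + w_sem * sem + w_cyc * cyc

# Tiny illustrative embedding table.
EMB = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "dear": [0.0, 1.0]}
```

The weights `w_sem` and `w_cyc` trade off attribute obfuscation against meaning preservation; in the paper this balance is what keeps the stylistic edits minimal.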
Experiments are conducted on two publicly available corpora (a blog dataset and Reddit comments) covering three attribute‑obfuscation scenarios: gender, age, and identity. For each scenario, the source and target attribute sets are defined, and the system is trained on unpaired data only. Evaluation metrics include (a) privacy effectiveness measured by the drop in classifier accuracy, (b) semantic preservation measured by BLEU, ROUGE, and BERTScore, and (c) human judgments of meaning retention and fluency.
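The privacy‑effectiveness metric (a) can be computed straightforwardly: run the attribute classifier on the raw and obfuscated texts and report the accuracy drop. A minimal sketch, with made‑up labels and predictions:

```python
def attribute_accuracy(preds, labels):
    """Fraction of sentences whose true attribute the classifier recovers."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def privacy_gain(preds_before, preds_after, labels):
    """Drop in attribute-classifier accuracy after obfuscation;
    a larger drop means more effective anonymization."""
    return (attribute_accuracy(preds_before, labels)
            - attribute_accuracy(preds_after, labels))

# Toy example: four sentences with binary gender labels.
labels = ["f", "f", "m", "m"]
before = ["f", "f", "m", "m"]   # classifier is right on the raw text
after  = ["m", "f", "f", "f"]  # mostly fooled on the obfuscated text
gain = privacy_gain(before, after, labels)  # 1.0 - 0.25 = 0.75
```

Semantic metrics such as BLEU or BERTScore are then computed between the input and output texts to check that the meaning survived the transformation.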
Results show that A⁴NT consistently outperforms baselines such as round‑trip machine translation and rule‑based transformations. In gender obfuscation, classifier accuracy drops by over 90 %; in age obfuscation, by about 70 %; and in identity obfuscation, by roughly 60 %. At the same time, BERTScore remains in the 0.78–0.85 range, indicating strong semantic similarity, and human evaluators report that the generated texts preserve the original intent while sounding natural. An additional analysis reveals that longer or more lexically diverse inputs incur slightly higher semantic loss, but the overall privacy‑preservation remains robust.
The paper acknowledges limitations: (i) performance degrades when only a few samples are available for the target attribute, (ii) rare proper nouns or domain‑specific terms may be altered or dropped, and (iii) the current setup handles one attribute at a time. Future work is suggested in three directions: leveraging large pretrained language models to improve meaning preservation, extending the framework to multi‑attribute simultaneous anonymization via multi‑objective optimization, and integrating user‑in‑the‑loop feedback to create semi‑automatic tools that combine the strengths of automatic style transfer with human oversight.
In summary, A⁴NT demonstrates that adversarially trained Seq2Seq models, equipped with semantic‑preserving constraints, can automatically and effectively hide author attributes without paired data, offering a practical defense against modern stylometric attacks.