A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
💡 Research Summary
This paper investigates whether lightweight language models can match the performance of frontier large language models (LLMs) on the task of personally identifiable information (PII) masking, while offering advantages in data privacy, computational cost, and real‑time deployment. The authors fine‑tune two distinct architectures: the encoder‑decoder T5‑small (≈60 M parameters) and the decoder‑only Mistral‑Instruct‑v0.3 (≈7 B parameters). Both models are trained on English subsets of the AI4Privacy “pii‑masking‑200k” corpus, which the authors meticulously audit and clean. The raw dataset contains 225 different annotation tags, many of which are redundant or erroneous. To address this, a canonical mapping function reduces the label space to 24 standardized PII categories (e.g., PERSON, EMAIL, PHONE). A regex‑based correction step further ensures that any PII missed in the original annotations is properly masked in the training targets.
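The label-cleaning pipeline described above can be sketched as two small steps: a lookup that collapses raw tags into the standardized categories, and a regex pass that masks PII the original annotations missed. The mapping entries and the `[EMAIL]` mask-token format below are illustrative assumptions, not the paper's exact tables.

```python
import re

# Hypothetical excerpt of the canonical mapping; the paper's full map
# collapses all 225 raw AI4Privacy tags into 24 standardized categories.
CANONICAL_MAP = {
    "FIRSTNAME": "PERSON",
    "LASTNAME": "PERSON",
    "MIDDLENAME": "PERSON",
    "EMAIL": "EMAIL",
    "PHONEIMEI": "PHONE",
    "PHONENUMBER": "PHONE",
}

def normalize_tag(raw_tag: str) -> str:
    """Map a raw annotation tag to its standardized PII category."""
    return CANONICAL_MAP.get(raw_tag, "OTHER")

# Regex fallback (assumed pattern): mask email addresses that the
# original annotations left unmasked in the training target.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def regex_correct(target_text: str) -> str:
    """Replace any surviving email addresses with the [EMAIL] mask token."""
    return EMAIL_RE.sub("[EMAIL]", target_text)

print(normalize_tag("FIRSTNAME"))                        # PERSON
print(regex_correct("Contact me at jane.doe@mail.com"))  # Contact me at [EMAIL]
```

In practice the correction step would cover each of the 24 categories with its own pattern (phone numbers, IPs, etc.); email is shown here because it is the most regex-friendly case.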
Three dataset variants are created: (1) the full set of normalized tags, (2) a filtered set containing only the top‑24 most frequent categories, and (3) a baseline with no normalization. The models are trained with a standard text‑to‑text objective for T5 (input prefixed with “mask:”) and a prompt‑based instruction for Mistral (“Mask all PII in the following text:”). Training uses AdamW, a learning rate of 3e‑4, batch size 32, and early stopping after five epochs.
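The two input formats described above differ only in how the task is framed for each architecture. A minimal sketch, assuming a standard `[INST] … [/INST]` chat wrapping for Mistral-Instruct (the exact template is applied by the tokenizer's chat format and may differ):

```python
def format_t5_input(text: str) -> str:
    # T5 receives a task prefix, per the paper's text-to-text setup.
    return f"mask: {text}"

MISTRAL_INSTRUCTION = "Mask all PII in the following text:"

def format_mistral_prompt(text: str) -> str:
    # Assumed instruction wrapping; in practice one would use
    # tokenizer.apply_chat_template for the exact Mistral format.
    return f"[INST] {MISTRAL_INSTRUCTION}\n{text} [/INST]"

print(format_t5_input("My name is Alice."))
print(format_mistral_prompt("My name is Alice."))
```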
Evaluation employs a comprehensive suite of metrics: strict and relaxed entity‑level precision/recall/F1, character‑level scores, type accuracy, exact match rate, ROUGE‑1/2/L, and the SPriV semantic‑privacy metric introduced in the PRvL benchmark. Results show that label normalization consistently boosts performance for both architectures (a ≈3–5 percentage‑point increase in F1). Mistral achieves the highest strict entity‑level F1 of 92.4 % and recall of 94.1 %, indicating strong robustness across PII types. T5‑small trails slightly with an F1 of 88.7 % but offers dramatically lower inference latency (≈78 ms per request on an RTX 4090) compared to Mistral’s ≈420 ms. Memory consumption mirrors this gap: T5‑small fits within 1.2 GB of GPU memory, whereas Mistral requires >7 GB.
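The strict entity-level scoring used above can be illustrated with a small sketch: a predicted entity counts as a true positive only if its span boundaries and type both exactly match a gold entity (the `(start, end, type)` representation here is an assumption about the evaluation's internal format).

```python
def entity_f1(gold: set, pred: set) -> tuple:
    """Strict entity-level P/R/F1: a prediction is correct only if its
    (start, end, type) triple exactly matches a gold entity."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 5, "PERSON"), (10, 28, "EMAIL")}
pred = {(0, 5, "PERSON"), (10, 28, "PHONE")}  # right span, wrong type
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

The relaxed variant would additionally credit partial span overlaps; character-level scores instead compare the masked/unmasked status of each character position.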
To assess real‑world applicability, both models are deployed in a Discord bot that reads user messages and redacts detected PII. On informal conversational inputs (including emojis, slang, and misspellings), performance drops for both models, but Mistral remains superior (overall F1 ≈ 88.5 % vs. T5‑small ≈ 84.2 %). T5‑small’s structured output format, however, simplifies post‑processing and enables easy detection of missing mask tokens, making it attractive for low‑resource environments. The bot experiments also reveal that Mistral’s higher latency and GPU demand could limit scalability on modest servers.
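The post-processing advantage mentioned above — detecting when the model's structured output failed to mask something — can be approximated with cheap checks on the bot side. A minimal sketch, assuming bracketed mask tokens such as `[EMAIL]` and reusing an email regex as the leak detector (both assumptions, not the paper's exact implementation):

```python
import re

# Assumed subset of the 24 mask tokens the fine-tuned model emits.
MASK_TOKEN_RE = re.compile(r"\[(PERSON|EMAIL|PHONE)\]")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def count_masks(masked_output: str) -> int:
    """Count mask tokens in the model's structured output."""
    return len(MASK_TOKEN_RE.findall(masked_output))

def redaction_incomplete(masked_output: str) -> bool:
    """Flag outputs where an email pattern survived masking -- the kind
    of structural sanity check T5's controllable output format enables."""
    return bool(EMAIL_RE.search(masked_output))

print(count_masks("Hi [PERSON], mail [EMAIL] please"))      # 2
print(redaction_incomplete("Reach me at [EMAIL] tomorrow")) # False
print(redaction_incomplete("Reach me at bob@example.org"))  # True
```

In a deployed bot, a `True` from the leak check could trigger a conservative fallback (e.g., suppressing the message) rather than posting partially redacted text.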
The authors discuss trade‑offs: Mistral provides higher recall and better handling of diverse PII patterns at the cost of latency and hardware requirements, while T5‑small offers controllable, low‑cost inference suitable for real‑time services but with modestly lower robustness. Limitations include the exclusive focus on English data, the reduction to 24 PII categories (which may omit niche but critical types), and the absence of multilingual evaluation. Future work is suggested in expanding to multilingual corpora, applying knowledge distillation or quantization to compress decoder‑only models, and exploring advanced data‑balancing techniques to further close the performance gap.
In conclusion, the study demonstrates that well‑fine‑tuned lightweight models can achieve PII masking performance comparable to much larger LLMs, especially when combined with careful label normalization and dataset cleaning. The findings provide concrete guidance for practitioners seeking privacy‑preserving, cost‑effective solutions for real‑time conversational applications.