Clustering based Privacy Preserving of Big Data using Fuzzification and Anonymization Operation
Big data is mined for analytical purposes but may contain sensitive information, and the mining process raises privacy challenges for researchers. Existing privacy-preserving methods apply algorithms that limit data reconstruction while securing the sensitive data. This paper presents a clustering-based probabilistic privacy-preservation model for big data that secures sensitive information while attaining minimum perturbation and maximum privacy. In our model, sensitive data are first identified within data clusters and then modified or generalized. The resulting dataset is analysed to measure the accuracy of our model in terms of hidden data and data lost during reconstruction. Extensive experiments are carried out to demonstrate the results of the proposed model. Clustering-based privacy preservation of individual data in big data with minimum perturbation and successful reconstruction, evaluated with standard performance measures, highlights the significance of our model.
💡 Research Summary
The paper tackles the privacy‑risk problem that arises when massive data sets containing personal or sensitive information are mined for analytical purposes. Traditional privacy‑preserving techniques such as k‑anonymity, Laplace noise injection, or differential privacy often either degrade data utility excessively or fail to protect against reconstruction attacks. To address these shortcomings, the authors propose a probabilistic model that integrates clustering, fuzzification, and anonymization operations in a sequential pipeline.
First, the entire data set is partitioned into clusters based on semantic similarity. The authors adopt a density‑based clustering algorithm derived from DBSCAN, which automatically adjusts density parameters to avoid over‑fragmentation even in high‑dimensional spaces. Within each cluster, sensitive attributes are identified through a pre‑defined rule set (e.g., medical diagnosis codes, financial account numbers). This identification step is crucial because the subsequent privacy mechanisms are applied only to those attributes.
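The first stage can be sketched with a minimal density-based clustering routine plus a rule-based sensitivity check. This is an illustrative implementation, not the authors' adaptive variant: `eps`, `min_pts`, and the `SENSITIVE_RULES` keywords are hypothetical placeholders for the paper's auto-tuned density parameters and domain rule set.

```python
from math import dist

def dbscan(points, eps=1.5, min_pts=3):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may later join a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:                      # expand the cluster from its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reached from a core point -> border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:    # j is itself a core point: keep expanding
                seeds.extend(j_nbrs)
    return labels

# Rule-based flagging of sensitive attributes (illustrative rule set,
# standing in for the paper's domain-specific rules).
SENSITIVE_RULES = ("diagnosis", "account", "ssn")

def sensitive_columns(columns):
    return [c for c in columns if any(r in c.lower() for r in SENSITIVE_RULES)]
```

Only the columns returned by `sensitive_columns` are passed to the later fuzzification and anonymization stages; the rest of each record is left untouched.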
The second stage, fuzzification, converts each sensitive value into a fuzzy membership function. Rather than replacing a value outright, the method computes a degree of belonging to several overlapping intervals, thereby preserving a continuous relationship to the original data while obscuring the exact value. This step reduces the perturbation introduced by later anonymization.
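A common way to realise such overlapping memberships is with triangular membership functions; the sketch below assumes that choice (the paper does not fix a specific function shape), and the `INTERVALS` for a hypothetical cholesterol attribute are invented for illustration.

```python
def triangular(x, a, b, c):
    """Degree of membership of x in the triangle (a, b, c), peaking at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge

def fuzzify(x, intervals):
    """Map a sensitive value to membership degrees over overlapping intervals."""
    return {name: round(triangular(x, a, b, c), 3)
            for name, (a, b, c) in intervals.items()}

# Hypothetical overlapping intervals for a cholesterol-style reading.
INTERVALS = {"low": (100, 150, 200), "mid": (150, 200, 250), "high": (200, 250, 300)}
```

Because neighbouring triangles overlap, a value such as 180 belongs partly to "low" and partly to "mid", so downstream anonymization operates on degrees rather than on the exact reading.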
In the third stage, anonymization, the authors extend the classic k‑anonymity model. They define a generalized interval around the cluster centroid and replace each fuzzy value with a representative statistic (mean or median) within that interval. Simultaneously, l‑diversity and t‑closeness constraints are enforced to guarantee diversity of sensitive values and similarity of the overall distribution to the original data. All three operations are driven by an objective function that balances “minimum perturbation” against “maximum privacy,” allowing the user to weight utility versus risk.
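The generalization step and the utility-versus-risk objective can be sketched as follows. The interval width, the choice of median as the representative statistic, and the exact scoring formula are assumptions for illustration; the paper's objective function is not specified at this level of detail, and the l-diversity/t-closeness checks are omitted.

```python
from statistics import mean, median

def generalize(cluster_values, width=0.5, stat=median):
    """Replace every sensitive value in a cluster with one representative
    statistic drawn from a generalized interval around the cluster centroid."""
    centroid = mean(cluster_values)
    spread = width * (max(cluster_values) - min(cluster_values))
    in_interval = [v for v in cluster_values
                   if centroid - spread <= v <= centroid + spread]
    rep = stat(in_interval or cluster_values)  # fall back if the interval is empty
    return [rep] * len(cluster_values)

def objective(original, released, alpha=0.5):
    """Toy trade-off score: alpha weights utility against disclosure risk.
    Here perturbation is mean absolute change and risk is the share of
    values released unchanged -- assumed proxies, not the paper's formula."""
    n = len(original)
    perturbation = sum(abs(o - r) for o, r in zip(original, released)) / n
    risk = sum(o == r for o, r in zip(original, released)) / n
    utility = 1 / (1 + perturbation)
    return alpha * utility - (1 - alpha) * risk
```

Sweeping `alpha` from 0 to 1 traces the perturbation/privacy trade-off the authors describe: larger values favour utility, smaller values favour masking.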
The experimental evaluation uses two publicly available data sets: (1) the UCI Heart Disease dataset, where the diagnosis label is treated as sensitive, and (2) a Twitter streaming log, where user IDs and location fields are sensitive. The proposed method is compared against baseline techniques (k‑anonymity, Laplace noise, differential privacy). Three metrics are reported: reconstruction accuracy, hidden‑data ratio, and lost‑data ratio. Results show that, for the same privacy parameters (k = 10, ε = 1), the new model achieves an average reconstruction accuracy of 85 %, which is 12‑18 % higher than the baselines. Hidden data is kept below 3 % and lost data below 5 %, indicating that the approach preserves most of the original information while effectively masking sensitive fields. Computationally, the clustering‑plus‑fuzzification‑plus‑anonymization pipeline runs in O(n log n) time, making it scalable for large data volumes.
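The three reported metrics can be computed as below. The exact definitions are assumptions inferred from the summary (exact-match accuracy over non-hidden records; hidden and lost ratios as fractions of the whole data set); the paper may define them differently.

```python
def evaluate(original, reconstructed, hidden_mask):
    """Assumed metric definitions for the three reported figures:
    - reconstruction accuracy: exact matches among non-hidden records
    - hidden-data ratio: fraction of records masked outright
    - lost-data ratio: fraction of all records that were visible but not recovered
    """
    n = len(original)
    hidden = sum(hidden_mask)
    visible = [(o, r) for o, r, h in zip(original, reconstructed, hidden_mask) if not h]
    matches = sum(o == r for o, r in visible)
    return {
        "reconstruction_accuracy": matches / len(visible) if visible else 0.0,
        "hidden_ratio": hidden / n,
        "lost_ratio": (len(visible) - matches) / n,
    }
```

Under these definitions, the reported figures (85 % accuracy, <3 % hidden, <5 % lost) would mean that the vast majority of records survive the pipeline intact while only a small fraction is masked or corrupted.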
The discussion acknowledges several limitations. The need for domain‑specific rules to flag sensitive attributes requires expert input, and the clustering parameters must be tuned to the data distribution, which may not be trivial in practice. The current implementation processes data in batches; extending it to real‑time streaming scenarios remains an open challenge. Moreover, the method’s performance degrades on highly sparse or outlier‑rich data, where fuzzy intervals become less informative.
In conclusion, the authors demonstrate that a clustering‑driven framework that couples fuzzification with anonymization can achieve a favorable trade‑off between data utility and privacy protection for big data applications. Future work is suggested in the areas of adaptive parameter optimization, streaming‑data integration, and simultaneous protection of multiple sensitive attributes.