Randomization based privacy preserving categorical data analysis

Guo, Ling

Guo, L. (2010). Randomization based privacy preserving categorical data analysis. UNC Charlotte Electronic Theses and Dissertations.

Abstract

This dissertation investigates the data utility and privacy of randomization-based models in privacy-preserving data mining for categorical data. To analyze data utility in the randomization model, we first investigate the accuracy of association rule mining in market basket data. We then propose a general framework for theoretical analysis of how the randomization process affects the accuracy of various measures adopted in categorical data analysis. We also examine data utility when, to achieve better privacy, randomization mechanisms are not provided to data miners. We investigate how various objective association measures between two variables may be affected by randomization, and then extend the analysis to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference for data miners about what they can and cannot do with certainty on randomized data directly, without knowledge of the original data distribution or the distortion information. Data privacy and data utility are commonly considered a pair of conflicting requirements in privacy-preserving data mining applications. In this dissertation, we also investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attacks in data publishing. We propose efficient solutions for determining optimal distortion parameters that maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects: reconstructed distributions, accuracy of answering queries, and preservation of correlations. Our empirical results show that randomization incurs significantly smaller utility loss.

Details

Author: Guo, Ling
Title: Randomization based privacy preserving categorical data analysis
Physical Description: 1 online resource (127 pages) : PDF
Date: 2010
Degree Granting Institution: University of North Carolina at Charlotte
Genre: doctoral dissertations
Subjects--Topics: Information technology
Degree: Ph.D.
Subject Area: Information Technology
Advisor(s): Wu, Xintao
Degree Note: Thesis (Ph.D.)--University of North Carolina at Charlotte, 2010.
Rights Statement: This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Rights Holder Information: Copyright is held by the author unless otherwise indicated.
Identifier: Guo_uncc_0694D_10150
Permalink: http://hdl.handle.net/20.500.13093/etd:1771

J. Murrey Atkins Library
