Estimation of Identification Disclosure Risk in Microdata

Guang Chen and Sallie Keller-McNulty

Abstract: A necessary condition for an identification disclosure is that a target entity, i.e., a population unit, can be uniquely identified in the population by some set of characteristics. The percentage of records in a released data set that are unique in the population is therefore an important measure of identification disclosure risk in microdata. This research develops a technique for estimating the number of unique entities in a population from sample information. A few estimation methods have been developed for this problem, but none of them works well for small sampling fractions. To improve on the existing methods, a model for population cell frequency distributions is developed, and the corresponding sample cell frequency distribution is derived assuming binomial sampling from each cell. A method for estimating the model parameters from the sample cell frequencies is given. The estimation technique is tested extensively on real and simulated data sets. The results show a remarkable improvement over the existing methods, especially for sampling fractions less than 0.1.

Keywords: Data security; disclosure risk; population uniqueness.
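
The quantities named in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' estimation method: it only counts uniques under binomial sampling from each cell, with the true population cell sizes assumed known. The function names, the synthetic population, and the sampling fraction of 0.05 are hypothetical choices made for illustration.

```python
import random


def binomial_sample(cell_sizes, fraction, rng):
    """Sample each cell independently: every unit is kept with probability `fraction`."""
    return [
        sum(1 for _ in range(size) if rng.random() < fraction)
        for size in cell_sizes
    ]


def uniqueness_summary(cell_sizes, sample_sizes):
    """Count population uniques, sample uniques, and records unique in both."""
    population_uniques = sum(1 for F in cell_sizes if F == 1)
    sample_uniques = sum(1 for f in sample_sizes if f == 1)
    # A sampled record is at risk only if its cell has exactly one population unit.
    unique_in_both = sum(
        1 for F, f in zip(cell_sizes, sample_sizes) if F == 1 and f == 1
    )
    n = sum(sample_sizes)
    return {
        "population_uniques": population_uniques,
        "sample_uniques": sample_uniques,
        "unique_in_both": unique_in_both,
        # Share of released records that are unique in the population.
        "risk": unique_in_both / n if n else 0.0,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical skewed population: many small cells, a few large ones.
    cell_sizes = [1] * 300 + [2] * 150 + [5] * 60 + [50] * 10
    sample_sizes = binomial_sample(cell_sizes, fraction=0.05, rng=rng)
    print(uniqueness_summary(cell_sizes, sample_sizes))
```

In a simulation of this kind the population cell sizes are known, so the risk can be computed directly. In practice only the sample cell frequencies are observed, and a sample unique need not be a population unique, which is why an estimator of the number of population uniques is required.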
