Journal of Official Statistics, Vol. 18, No. 4, 2002, pp. 531–543
Satisfying Disclosure Restrictions With Synthetic Data Sets
Jerome P. Reiter
Abstract: To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
Keywords: Confidentiality; disclosure; multiple imputation; simulation; synthetic data.
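Illustrative sketch (not taken from the article): the abstract does not spell out the mechanics, so the Python below is only a minimal illustration of the general idea under stated assumptions. It assumes a single normally distributed variable, simple random sampling, and a standard noninformative prior. It uses the combining rule commonly given for fully synthetic data, T_s = (1 + 1/m) b_m - v_bar_m, which the abstract itself does not state. The function names `synthesize` and `combine` and the parameters `m` (number of synthetic data sets) and `k` (size of each) are hypothetical.

```python
import random
import statistics


def synthesize(observed, m, k, rng):
    """Draw m synthetic samples of size k from a normal model fit to `observed`.

    Assumption: one normal variable with a noninformative prior, so no
    released value is a copy of an actual unit's value.
    """
    n = len(observed)
    ybar = statistics.fmean(observed)
    s2 = statistics.variance(observed)
    datasets = []
    for _ in range(m):
        # Posterior draw of sigma^2: (n - 1) * s^2 / chi-square(n - 1).
        # A chi-square with df degrees of freedom is Gamma(df / 2, scale 2).
        sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2.0)
        # Posterior draw of mu given sigma^2.
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # Synthetic values come from the model, not from real units.
        datasets.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(k)])
    return datasets


def combine(datasets):
    """Combine per-data-set estimates of the mean across synthetic data sets."""
    m = len(datasets)
    q = [statistics.fmean(d) for d in datasets]             # point estimates
    v = [statistics.variance(d) / len(d) for d in datasets]  # their variances
    q_bar = statistics.fmean(q)
    b_m = statistics.variance(q)   # between-data-set variance
    v_bar = statistics.fmean(v)    # average within-data-set variance
    # Variance estimate for fully synthetic data; it can come out negative.
    t_s = (1 + 1 / m) * b_m - v_bar
    return q_bar, t_s


if __name__ == "__main__":
    rng = random.Random(0)
    observed = [rng.gauss(50.0, 10.0) for _ in range(200)]  # made-up data
    q_bar, t_s = combine(synthesize(observed, m=10, k=200, rng=rng))
    print(q_bar, t_s)
```

Because T_s subtracts the average within-data-set variance, it can be negative when m is small, which is one reason the choice of the number and size of synthetic data sets matters.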
Copyright © Statistics Sweden 1996-2018. Open Access. ISSN 0282-423X.