Journal of Official Statistics, Vol. 18, No. 4, 2002, pp. 531-543

Satisfying Disclosure Restrictions With Synthetic Data Sets

To avoid disclosures, Rubin proposed creating multiple, synthetic data sets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this article, I show through simulation studies that valid inferences can be obtained from synthetic data in a variety of settings, including simple random sampling, probability proportional to size sampling, two-stage cluster sampling, and stratified sampling. I also provide guidance on specifying the number and size of synthetic data sets and demonstrate the benefit of including design variables in the released data sets.
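The idea of releasing multiple synthetic data sets and analysing them together can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the normal model for a single variable under simple random sampling, the function names `synthesize` and `combine`, and the choice of m and synthetic sample size. The variance formula is the combining rule commonly used for fully synthetic data, T = (1 + 1/m) b - u_bar, which can turn out negative when m or the synthetic sample size is small.

```python
import random
import statistics


def synthesize(observed, m, n_syn, rng):
    """Draw m synthetic samples of size n_syn from a normal posterior
    predictive distribution (an illustrative model, not the article's)."""
    n = len(observed)
    ybar = statistics.fmean(observed)
    s2 = statistics.variance(observed)
    datasets = []
    for _ in range(m):
        # sigma^2 | data ~ (n - 1) s^2 / chi2_{n-1}; chi2_k is Gamma(k/2, scale=2)
        sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2.0)
        # mu | sigma^2, data ~ N(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # Every released value is simulated; no observed value is copied.
        datasets.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_syn)])
    return datasets


def combine(datasets):
    """Combine per-data-set estimates of a mean across m synthetic data sets."""
    m = len(datasets)
    q = [statistics.fmean(d) for d in datasets]              # point estimates
    u = [statistics.variance(d) / len(d) for d in datasets]  # their variances
    q_bar = statistics.fmean(q)
    b = statistics.variance(q)       # between-data-set variance
    u_bar = statistics.fmean(u)      # average within-data-set variance
    t = (1 + 1 / m) * b - u_bar      # may be negative for small m or n_syn
    return q_bar, t
```

For example, `combine(synthesize(observed, m=20, n_syn=200, rng=random.Random(0)))` returns the combined point estimate and its variance estimate for a list `observed` of confidential values.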

Confidentiality; disclosure; multiple imputation; simulation; synthetic data.

Copyright Statistics Sweden 1996-2018.  Open Access
ISSN 0282-423X