Journal of Official Statistics, Vol. 21, No. 3, 2005, pp. 441–462
Using CART to Generate Partially Synthetic Public Use Microdata
Abstract: To limit disclosure risks, one approach is to release partially synthetic public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. This article presents and evaluates the use of classification and regression trees (CART) to generate partially synthetic data. Two potential applications of CART are studied via simulation: (i) generating synthetic data for sensitive variables; and (ii) generating synthetic data for variables that are key identifiers.
Keywords: CART, confidentiality, disclosure, multiple imputation, synthetic data, trees
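Illustrative sketch: The abstract does not spell out the synthesis procedure, so the following is only a minimal sketch of one way a regression tree could be used to replace a sensitive variable with multiple imputations. It assumes that a tree is grown on the collected data and that each unit's replacement value is drawn at random from the observed values in that unit's leaf. The function names (`grow_tree`, `leaf_values`, `synthesize`), the parameters (`m`, `min_leaf`), the simple random draw within a leaf, and the toy survey are all hypothetical and are not taken from the article.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def grow_tree(rows, target, predictors, min_leaf=5):
    """Grow a small regression tree; each leaf keeps its observed target values."""
    ys = [r[target] for r in rows]
    best = None
    for p in predictors:
        for cut in sorted({r[p] for r in rows}):
            left = [r for r in rows if r[p] <= cut]
            right = [r for r in rows if r[p] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([r[target] for r in left]) + sse([r[target] for r in right])
            if best is None or score < best[0]:
                best = (score, p, cut, left, right)
    # Stop when no admissible split reduces the impurity.
    if best is None or best[0] >= sse(ys):
        return {"leaf": ys}
    _, p, cut, left, right = best
    return {"var": p, "cut": cut,
            "left": grow_tree(left, target, predictors, min_leaf),
            "right": grow_tree(right, target, predictors, min_leaf)}

def leaf_values(tree, row):
    """Follow the splits for one unit and return the donor values in its leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

def synthesize(rows, target, predictors, m=3, min_leaf=5, seed=0):
    """Return m partially synthetic copies: only `target` is replaced."""
    rng = random.Random(seed)
    tree = grow_tree(rows, target, predictors, min_leaf)
    datasets = []
    for _ in range(m):
        synthetic = []
        for r in rows:
            new = dict(r)  # all other collected values are kept as surveyed
            new[target] = rng.choice(leaf_values(tree, r))
            synthetic.append(new)
        datasets.append(synthetic)
    return datasets

# Toy data, invented purely for illustration.
toy_rng = random.Random(1)
survey = [{"age": a, "income": 1000 * a + toy_rng.gauss(0, 500)}
          for a in range(20, 60)]
released = synthesize(survey, "income", ["age"], m=3)
```

In this sketch the released data sets contain the same units and the same `age` values as the toy survey; only `income` is replaced, and every replacement is a value observed for some unit in the same leaf. A production synthesizer would need additional safeguards that are outside the scope of this illustration.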
Copyright © Statistics Sweden 1996-2018. Open Access. ISSN 0282-423X.