Tests of Multivariate Hypotheses when using Multiple Imputation for Missing Data and Disclosure Limitation
Satkartar K. Kinney, Jerome P. Reiter
Several statistical agencies use, or are considering the use of, multiple imputation to limit the risk of disclosing respondents' identities or sensitive attributes in public use data files. For example, agencies can release partially synthetic datasets, which comprise the units originally surveyed with some values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This can be coupled with multiple imputation for missing data in a two-stage imputation approach. First, the agency fills in the missing data to generate m completed datasets; it then replaces sensitive or identifying values in each completed dataset with n imputed values. Methods for obtaining inferences with the resulting mn datasets have been developed for scalar quantities, but not for multivariate quantities. We present methods for testing multivariate null hypotheses with such datasets. We illustrate the tests using public use files for the Survey of Income and Program Participation that were created with the two-stage imputation approach.
Confidentiality, disclosure, multiple imputation, significance tests, synthetic data
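To make the nested structure of the two-stage approach concrete, the following is a minimal sketch of how the mn datasets arise. It is an illustration only, not the agencies' procedure or the tests presented in this paper: the function name `two_stage_impute`, the toy records, and the use of simple random draws from observed values in place of a real imputation model are all assumptions made for the example.

```python
import random


def two_stage_impute(records, sensitive_key, m, n, seed=0):
    """Return a nested list datasets[i][j] holding m * n datasets.

    Hypothetical helper: random draws from observed values stand in for
    a proper imputation model, purely to show the nesting.
    """
    rng = random.Random(seed)
    nested = []
    for _ in range(m):
        # Stage 1: fill in missing values (None) to get one completed dataset.
        completed = []
        for rec in records:
            filled = dict(rec)
            for key, value in rec.items():
                if value is None:
                    donors = [r[key] for r in records if r[key] is not None]
                    filled[key] = rng.choice(donors)
            completed.append(filled)

        # Stage 2: replace the sensitive variable n times within this
        # completed dataset; all other values are left unchanged.
        pool = [r[sensitive_key] for r in completed]
        synthetic_sets = []
        for _ in range(n):
            synthetic = [dict(r, **{sensitive_key: rng.choice(pool)})
                         for r in completed]
            synthetic_sets.append(synthetic)
        nested.append(synthetic_sets)
    return nested


# Toy data (made up for illustration); None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 38000},
]
datasets = two_stage_impute(records, sensitive_key="income", m=3, n=2)
print(sum(len(group) for group in datasets))  # 6, i.e. m * n
```

The n datasets within each nest share the same stage-one completed values for the variables that are not replaced, so the mn datasets are not independent draws. This nesting is why inferences from such files require combining rules that differ from those for a single stage of multiple imputation.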