Multiply-Imputed Synthetic Data: Advice to the Imputer

Bronwyn Loong; Donald B. Rubin

doi:10.1515/jos-2017-0047

Abstract

Several statistical agencies have started to use multiply-imputed synthetic microdata to create public-use data in major surveys. The purpose of doing this is to protect the confidentiality of respondents’ identities and sensitive attributes, while allowing standard complete-data analyses of microdata. A key challenge, faced by advocates of synthetic data, is demonstrating that valid statistical inferences can be obtained from such synthetic data for non-confidential questions. Large discrepancies between observed-data and synthetic-data analytic results for such questions may arise because of uncongeniality; that is, differences in the types of inputs available to the imputer, who has access to the actual data, and to the analyst, who has access only to the synthetic data. Here, we discuss a simple, but possibly canonical, example of uncongeniality when using multiple imputation to create synthetic data, which specifically addresses the choices made by the imputer. An initial, unanticipated but not surprising, conclusion is that non-confidential design information used to impute synthetic data should be released with the confidential synthetic data to allow users of synthetic data to avoid possible grossly conservative inferences.
