On Proxy Variables and Categorical Data Fusion

Li-Chun Zhang

doi:10.1515/jos-2015-0045

Abstract

The problem of inference about the joint distribution of two categorical variables based on knowledge or observations of their marginal distributions, to be referred to as categorical data fusion in this paper, is relevant in statistical matching, ecological inference, market research, and several other related fields. This article organizes the use of proxy variables, to be distinguished from other auxiliary variables, both in terms of their effects on the uncertainty of fusion and the techniques of fusion. A measure of the gains of efficiency is provided, which incorporates both the identification uncertainty associated with data fusion and the sampling uncertainty that arises when the theoretical bounds of the uncertainty space are unknown and need to be estimated. Several existing techniques for generating fusion distributions (or datasets) are described and some new ones proposed. Analysis of real-life data demonstrates empirically that proxy variables can make data fusion more precise and the constructed fusion distribution more plausible.
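To make the notion of an uncertainty space concrete, the following is a minimal sketch, not taken from the article, of the classical Fréchet bounds on the cells of a joint distribution when only the two marginal distributions are known. The function name `frechet_bounds` and the example marginals are illustrative assumptions; the article's own measure and its use of proxy variables are not reproduced here.

```python
def frechet_bounds(p_row, p_col):
    """Cell-wise Frechet bounds for a joint distribution with given marginals.

    p_row, p_col: lists of marginal probabilities (each summing to 1).
    Returns (lower, upper), two nested lists indexed as [i][j].
    Illustrative helper only; not the article's method.
    """
    # Lower bound: max(0, p_i + q_j - 1); upper bound: min(p_i, q_j).
    lower = [[max(0.0, pi + qj - 1.0) for qj in p_col] for pi in p_row]
    upper = [[min(pi, qj) for qj in p_col] for pi in p_row]
    return lower, upper


# Hypothetical marginals for two binary variables observed in separate sources.
p_y = [0.3, 0.7]
p_z = [0.6, 0.4]
lower, upper = frechet_bounds(p_y, p_z)

for i, pi in enumerate(p_y):
    for j, qj in enumerate(p_z):
        # The independence table p_i * q_j is one admissible fusion distribution.
        print(f"cell ({i},{j}): [{lower[i][j]:.2f}, {upper[i][j]:.2f}]"
              f"  independence value {pi * qj:.2f}")
```

Without further information, every joint table inside these bounds that reproduces both marginals is compatible with the data. Auxiliary information that narrows the intervals is what reduces the identification uncertainty described in the abstract.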
