Subpopulation Discovery in Epidemiological Data with Subspace Clustering

Uli Niemann; Myra Spiliopoulou; Henry Völzke; Jens-Peter Kühn

doi:10.2478/fcds-2014-0015



Foundations of Computing and Decision Sciences

Volume 39 (2014): Issue 4 (December 2014)


Open Access


Abstract

A prerequisite of personalized medicine is the identification of groups of people who share specific risk factors towards an outcome. We investigate the potential of subspace clustering for finding such groups in epidemiological data. We propose a workflow that encompasses clusterability assessment before cluster discovery and quality assessment after learning the clusters. Epidemiological data usually do not have a ground truth for the verification of clusters found in subspaces. Hence, we introduce quality assessment through juxtaposition of the learned models to “models-of-randomness”, i.e. models that do not reflect a true cluster structure. On the basis of this workflow, we select subspace clustering methods and compare and discuss their performance. We use a dataset with hepatic steatosis as outcome, but our findings apply to arbitrary epidemiological cohort data that have tens of variables and exhibit class skew.
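
The juxtaposition of a learned model with models-of-randomness can be illustrated with a small sketch. It is not the authors' procedure: the abstract does not state how the random models are built or which quality measure is used. The sketch assumes that a random model is obtained by permuting each variable independently, which preserves the marginal distributions but destroys joint structure, and it uses the mean nearest-neighbour distance as a stand-in for a cluster quality measure. All function names and the toy data are hypothetical.

```python
import random


def mean_nn_distance(rows):
    """Mean Euclidean distance from each row to its nearest other row.

    Stand-in quality measure (an assumption): smaller means more compact.
    """
    total = 0.0
    for i, a in enumerate(rows):
        total += min(
            sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            for j, b in enumerate(rows)
            if j != i
        )
    return total / len(rows)


def shuffle_columns(rows, rng):
    """Permute every column independently of the others.

    Marginal distributions are kept; relations between variables are broken.
    """
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def randomness_baseline(rows, n_models=20, seed=0):
    """Quality values obtained on n_models column-shuffled copies of the data."""
    rng = random.Random(seed)
    return [mean_nn_distance(shuffle_columns(rows, rng)) for _ in range(n_models)]


# Toy data (hypothetical): two perfectly related variables.
rows = [[float(i), float(i)] for i in range(30)]

observed = mean_nn_distance(rows)      # quality on the real data
baseline = randomness_baseline(rows)   # quality on the random copies
```

If the observed value lies clearly below the values from the shuffled copies, the structure is unlikely to stem from the marginal distributions alone. A real study would replace the stand-in statistic with the quality measure of the chosen subspace clustering method.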



DOI: https://doi.org/10.2478/fcds-2014-0015 | Journal eISSN: 2300-3405 | Journal ISSN: 0867-6356


Language: English

Page range: 271 - 300

Submitted on: Aug 1, 2014

Published on: Dec 20, 2014

Published by: Poznan University of Technology

In partnership with: Paradigm Publishing Services

Related subjects: Computer sciences, Software development

© 2014 Uli Niemann, Myra Spiliopoulou, Henry Völzke, Jens-Peter Kühn, published by Poznan University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
