Clustimpute: k-means Clustering with Built-in Missing Data Imputation

Oliver Pfaffel

doi:10.5334/jors.345

Abstract

This article introduces a novel k-means clustering methodology and implementation designed to handle missing values. The method supports multivariate missingness and is computationally efficient because it uses the current cluster assignments to define a plausible distribution for each sample's missing values. In our experiments, runtime scales well with increasing dataset size, comparably to simple random imputation. In terms of clustering performance, measured by the Rand Index against ground-truth labels, the method is competitive with state-of-the-art approaches such as MICE and Amelia, especially when the proportion of missing values is moderate or imputation runtime is a constraint.
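
To make the two ideas in the abstract concrete, the following minimal Python sketch shows (a) one way to fill missing values using the current cluster assignments and (b) how the Rand Index compares a clustering against ground-truth labels. It is an illustration under stated assumptions, not the package's implementation: the function names `impute_within_clusters` and `rand_index` are hypothetical, missing values are represented as `None`, and the "plausible distribution" is assumed here to be the empirical distribution of observed values of the same feature within the same cluster.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand Index: fraction of sample pairs on which two labelings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    # A pair agrees if both labelings put it together, or both keep it apart.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def impute_within_clusters(rows, assignments, rng):
    """Hypothetical helper: replace each None with a random draw from the
    observed values of the same column in the same cluster (assumption:
    this stands in for the cluster-specific distribution in the abstract)."""
    imputed = [list(row) for row in rows]
    for col in range(len(rows[0])):
        # Collect the observed values of this column, grouped by cluster.
        observed = {}
        for row, cluster in zip(rows, assignments):
            if row[col] is not None:
                observed.setdefault(cluster, []).append(row[col])
        # Fallback pool for clusters with no observed value in this column.
        all_observed = [v for values in observed.values() for v in values]
        if not all_observed:
            raise ValueError(f"column {col} has no observed values")
        for i, cluster in enumerate(assignments):
            if rows[i][col] is None:
                pool = observed.get(cluster) or all_observed
                imputed[i][col] = rng.choice(pool)
    return imputed


# Toy data: two well-separated clusters with one missing value each.
rows = [[1.0, 2.0], [1.2, None], [9.0, 8.0], [None, 8.5]]
assignments = [0, 0, 1, 1]
completed = impute_within_clusters(rows, assignments, random.Random(0))

# Label names do not matter for the Rand Index, only the grouping does.
score_same_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

In a full clustering loop, a step like this would alternate with the k-means assignment and centroid updates, so that imputations are refreshed as the clusters change. The sketch covers only the imputation step and the evaluation metric.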
