Abstract
This article introduces a novel k-means clustering methodology and implementation designed to handle missing values efficiently. The method supports multivariate missingness and is computationally efficient because it uses the current cluster assignments to define plausible distributions for the missing values in each sample. In our experiments, its runtime scales well with increasing dataset size, comparably to simple random imputation. In terms of clustering performance, assessed via the Rand Index against ground-truth labels, the method is competitive with state-of-the-art approaches such as MICE and Amelia, especially when the proportion of missing values is moderate or imputation runtime is a constraint.
