
ClustImpute: k-means Clustering with Built-in Missing Data Imputation

By: Oliver Pfaffel  
Open Access | Aug 2025


(1) Overview

1 Introduction

When missing values are present in data, partitional clustering is not possible out of the box. A typical strategy is to impute missing values before clustering. For an in-depth account of the imputation of missing values, we refer to [1]. At a high level, there exist broadly two types of imputation methods, applied to each column/variable of a data set that has at least one missing entry:

  • Unconditional approaches such as mean, median or random imputation. These approaches have the benefit of being computationally inexpensive. However, mean and median imputation reduce variance in the data and are deterministic, which may lead to false confidence in the resulting clusters. Random imputation is exact if values are missing completely at random, but it ignores multicollinearity between variables under a missing-at-random assumption.

  • Conditional approaches iteratively train regression models on the subset of rows without missing values, using all other variables—including those with current imputations—as predictors. These models are then used to estimate missing values in the remaining rows. While such methods can be accurate under a missing-at-random assumption, they are often computationally intensive—typically requiring significantly more resources than the clustering step itself.

This article introduces a novel k-means clustering methodology and implementation that includes an efficient missing data imputation. A version of this algorithm is implemented in the ClustImpute R package. The imputation method is essentially a column-wise nearest-neighborhood imputation nested in the clustering algorithm, where the neighborhoods are given by the current cluster assignments. Therefore, there is no need to fit a k-nearest-neighbors model for each imputation step, which reduces computational time considerably. Typically, k-means initializes with random partitions, which will not yield good imputation neighborhoods if the missing values are not missing completely at random. Thus, the imputed values drawn from each neighborhood are weighted down during a “burn-in” phase (whose length is set by the user).
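The core idea can be sketched in a few lines of base R. This is an illustration of the principle only, not the package's actual internals; the function and variable names are made up. Each missing entry is drawn from the observed values of its column within the observation's current cluster, and the draw is down-weighted during burn-in (which is why centered data is required, as discussed below):

```r
# Illustrative sketch: one pass of cluster-conditional imputation.
# X: numeric matrix with NAs; clusters: integer cluster assignment per row;
# weight: down-weighting factor applied during the burn-in phase.
impute_by_cluster <- function(X, clusters, weight = 1) {
  for (j in seq_len(ncol(X))) {
    obs <- !is.na(X[, j])                       # originally observed entries
    for (i in which(!obs)) {
      donors <- X[obs & clusters == clusters[i], j]
      if (length(donors) == 0) donors <- X[obs, j]   # fallback: whole column
      # draw a plausible value from the cluster neighborhood, down-weighted
      X[i, j] <- weight * donors[sample.int(length(donors), 1)]
    }
  }
  X
}
```

The real algorithm repeats such a pass before every clustering step, with the weight increasing towards 1 as the burn-in phase ends.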

In statistical clustering it is usually recommended to standardize the data prior to clustering if the variables are not measured on the same scale, so that each column has a mean of zero and a standard deviation of one. Even if all variables are measured on the same scale, it is still important that they are centered before ClustImpute is applied with weights, i.e., with parameter n_end > 1; otherwise, the weighting procedure will introduce bias.
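A minimal sketch of this pre-processing step, assuming a data frame dat_with_miss with missing values (the ClustImpute call is abbreviated; see the package documentation for the full argument list):

```r
# Center and scale each column to mean 0 / sd 1 before clustering.
# Base R's scale() ignores NAs when computing column means and sds,
# and the NAs themselves are preserved for ClustImpute to handle.
dat_scaled <- scale(dat_with_miss)
res <- ClustImpute::ClustImpute(as.data.frame(dat_scaled), nr_cluster = 3)
```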

The benefit of this approach can be visualized by an idealised example: Often a clustering based on median or random imputation will not provide good results even if we know the number of clusters. As shown in Figure 1, both approaches badly distort the data and lead to unreasonable clusters. The result with ClustImpute for the very same data set can be observed in Figure 2.

Figure 1

Imputation with the median vs. random imputation on a simulated data set further described in section 3. Median imputation on the left produces an artificial point mass near the x and y axes. Random imputation on the right produces many points far away from the actual clusters.

Figure 2

Clustering based on the simulated data with missing values as shown in Figure 1. On the left, k-means is applied to the data set after random imputation. On the right, the proposed package ClustImpute was used without any imputation or other pre-processing step. Note that the underlying data has four additional “noise” variables not shown in the figures; thus the clusters overlap somewhat when plotted w.r.t. x and y only.

ClustImpute draws the missing values iteratively based on the current cluster assignment, so that correlations are considered at this level. We assume that a more granular dependence structure between the missing values is not relevant because the primary goal is to obtain k meaningful partitions. Subsequently, penalizing weights are imposed on imputed values and successively decreased (to zero) as both the partitioning and the associated missing data imputation become more accurate. The heuristic is that, after a burn-in phase, each observation lies close to a cluster that provides a suitable neighborhood from which to draw plausible values for its missing entries. The algorithm is computationally efficient because the imputation does not aim to recover the full conditional distribution of the missing values but only needs to support accurate clustering. As a result, it is significantly faster than approaches that model the full conditional distribution, such as the MICE package [2].

2 Implementation and architecture

The implementation of the main function of the package ClustImpute is explained below and consists of two steps: imputation and clustering. The imputation is a conditional sampling and is implemented directly in the algorithm. The clustering is performed by the function ClusterR::KMeans_arma for c_steps steps, where the centroids are initialized randomly in the first step and with the current cluster centroids thereafter. In the second step, we use ClusterR::KMeans_arma(…, CENTROIDS = cl_old, …), where cl_old denotes the current cluster centroids. The ClusterR package [3] was chosen for its fast implementation using RcppArmadillo [4] and its ability to start from user-defined centroids. The missing values are sampled conditionally on the current cluster assignment provided by ClusterR::predict_KMeans. To facilitate this, packages like dplyr [5] and rlang [6] are imported by ClustImpute. A detailed description of the algorithm can be found in the vignette of the package on CRAN: https://cran.r-project.org/web/packages/ClustImpute/vignettes/description_of_algorithm.pdf. The package also uses testthat [7] for unit testing and has full test coverage. Further dependencies arise from helper functions for the vignette and will be explained in the next section.
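The warm-start mechanism can be sketched as follows (a simplified illustration assuming ClusterR is installed; the surrounding imputation loop is omitted):

```r
library(ClusterR)

set.seed(1)
X <- matrix(rnorm(300), ncol = 3)  # toy data, 100 rows, 3 columns

# First iteration: random centroid initialization, c_steps k-means steps
cl <- KMeans_arma(X, clusters = 3, n_iter = 50, seed_mode = "random_subset")
assignment <- predict_KMeans(X, cl)   # current cluster assignment

# Subsequent iterations: warm-start from the current centroids
cl <- KMeans_arma(X, clusters = 3, n_iter = 50,
                  seed_mode = "keep_existing", CENTROIDS = cl)
```

Re-using the previous centroids is what makes only a few k-means steps per iteration necessary.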

3 Quality control

We are not aware of any comparable implementations. However, there are several packages for pure missing data imputation (without clustering) that can be used before an application of any k-means implementation in R, for example:

  • MICE provides multiple imputation using a Fully Conditional Specification as described in [2]. Each variable has its own imputation model, and built-in imputation models are provided for continuous, binary, unordered categorical and ordered categorical data. Thus, its applicability is much broader than ClustImpute’s. In our study, we apply the same imputation model to all variables and show results for predictive mean matching, denoted by MICE (PMM), and classification and regression trees (CART), denoted by MICE (CART) [8].

  • Amelia implements bootstrap-based multiple imputation using the Expectation-Maximization (EM) algorithm to estimate parameters. For quantitative data, it imputes under a multivariate Gaussian assumption, as described in [9] and implemented in [10].

  • missRanger provides a fast implementation of the MissForest algorithm, which imputes mixed-type data sets by chaining Random Forests, originally introduced by [11]. Internally, it relies on the efficient ranger package for training the forests [12, 13].

Here we use these packages in combination with ClusterR for a fair comparison with ClustImpute. Our goal is to compare against popular packages that represent a diverse set of methodologies. However, we are aware that there exists a large number of other imputation packages. For a good overview, we refer to a CRAN task view: https://cran.r-project.org/web/views/MissingData.html. To reproduce the experiments, all scripts are available at https://github.com/o1iv3r/ClustImpute_paper_material. All result tables presented in this article can also be downloaded from the repository in CSV format.

3.1 Scalability on simulated data

This experiment evaluates the scalability of ClustImpute for a data set with a growing number of rows. To reproduce this experiment, the corresponding script Scalability.R is available in the public GitHub repository at https://github.com/o1iv3r/ClustImpute_paper_material.

The data is generated by a simple function that produces an n × (nr_other + 2) matrix with three clusters, where n denotes the number of rows and nr_other + 2 the number of columns. The variables in the nr_other columns are noise variables that distract the algorithms from the true clusters, which lie in a 2-dimensional hyperplane. A visualization of a sample can be seen in Figures 1 and 2 with nr_other = 4. For this experiment, the noise variables are multivariate normal with a mean of zero and a covariance matrix whose diagonal entries are 1 and whose off-diagonal entries are equal to ρ = 0.3:

$$A = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho \\ \rho & \cdots & \rho & 1 \end{pmatrix}$$

Missing values are created with the function ClustImpute::Missing_simulation. In this function copula::normalCopula() is used to generate a Gaussian Copula based on a random covariance matrix, which is diagonal in the ‘Missing completely at random (MCAR)’ setting, otherwise non-diagonal (‘Missing at random (MAR)’ case). A sample from this Gaussian Copula via copula::rMvdc() provides indicators used to replace existing values with NA values. For a mathematical definition of MCAR and MAR we refer to [14].
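The mechanism can be illustrated with a short sketch (an illustration of the idea, not the exact internals of ClustImpute::Missing_simulation; assumes the copula package is installed):

```r
library(copula)

set.seed(1)
p <- 3              # number of columns
miss_rate <- 0.2    # target share of missing values per column

# Gaussian copula with exchangeable (constant) off-diagonal correlation;
# param = 0 would yield independent indicators, i.e., the MCAR case.
cop <- normalCopula(param = 0.5, dim = p, dispstr = "ex")
U <- rCopula(100, cop)               # 100 x p matrix of correlated uniforms

X <- matrix(rnorm(100 * p), ncol = p)
X[U < miss_rate] <- NA               # ~20% missing, correlated across columns
```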

We compare ClustImpute with R packages for missing data imputation that are applied to the data in a pre-processing step before k-means clustering is performed. For the latter, we use the same clustering algorithm as in ClustImpute, namely the ClusterR package, to ensure that only the computation time of the imputation is compared. The running time is measured as an average of 10 clustering runs for each procedure and data set.

The parametrization is as follows:

# iterations of procedure
nr_iter <- 14
# steps until convergence of weight function
n_end <- 10
# number of clusters
nr_cluster <- 3
# number of cluster steps per iteration
c_steps <- 50
# steps after convergence of weight function
nr_iter_other <- (nr_iter-n_end) * c_steps

The parameter nr_iter_other for clustering following external imputation strategies is chosen to ensure a comparable number of effective clustering steps. In ClustImpute, iterations with a weight below 1 are treated as a burn-in phase, during which the imputed values are not yet considered fully reliable. Aligning the total number of credible clustering steps avoids giving ClustImpute an unfair advantage over methods that perform imputation separately.

Due to its very slow runtime, we excluded missForest from the benchmark results and focused instead on the faster missRanger method. However, the missForest variant remains available in the codebase, allowing interested readers to include it in their own comparisons. The results shown below assume a missingness rate of 20%, which can be easily modified by changing a single line in the script. Each combination of imputation and clustering was repeated five times with different random seeds. The typical usage of ClustImpute in these experiments is as follows:


res <- ClustImpute(dat_with_miss,
                   nr_cluster = nr_cluster, nr_iter = nr_iter,
                   c_steps = c_steps, n_end = n_end,
                   seed_nr = random_seed)
ClusterR::external_validation(random_data$true_clusters, res$clusters)

In Figure 3 we observe that Amelia and ClustImpute scale much better than MICE and missRanger. ClustImpute scales similarly to a simple random imputation and to Amelia. We only show the results from a MAR simulation, but the results for MCAR are comparable.

Figure 3

This figure shows the median running time in seconds for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation, on a regular (left) and on a log scale (right).

We observe that ClustImpute exhibits scalability comparable to simple random imputation and is therefore significantly faster than pre-processing with MICE or missRanger. In this experiment, the number of clusters was assumed to be known a priori. The package vignette illustrates how the ClustImpute::var_reduction() function can be used to tune this hyperparameter based on variance reduction criteria.
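A hedged sketch of such a tuning loop follows; the exact return value of var_reduction should be checked against the package documentation, and the parameters mirror those defined above:

```r
# Sketch: compare the within-cluster variance reduction for several
# candidate numbers of clusters (see the ClustImpute vignette for details).
for (k in 2:6) {
  res <- ClustImpute::ClustImpute(dat_with_miss, nr_cluster = k,
                                  nr_iter = nr_iter, c_steps = c_steps,
                                  n_end = n_end)
  print(ClustImpute::var_reduction(res))  # variance explained by k clusters
}
```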

A short note on convergence: ClustImpute tracks the mean and variance of imputed variables. These values are part of the output list and can help to assess “convergence” and to tune the nr_iter parameter appropriately, similar in spirit to the trace plots from the MICE package. Therefore, the number of iterations was not kept fixed but rather calibrated within a range of 10 to 20 iterations to mimic a practical situation in which there is a trade-off between run time and convergence. The latter cannot be clearly determined in practice, since, due to the nature of clustering, gold labels do not exist.

Of course, running time is only of relevance if the resulting clusters are adequate. For this reason, we benchmark clustering performance on a simulated data set consisting of three well-separated clusters lying in a two-dimensional subspace, with additional noise variables added to increase dimensionality. A ground truth cluster assignment is available by design. In Table 1, we report the Rand indices, cf. [15], which quantify the agreement between the predicted and true cluster assignments. The Rand Index takes values between 0 and 1, where 1 indicates perfect agreement. The results show that ClustImpute can perform on par with alternative methods. Because these results were obtained with a fixed set of hyperparameters, the running time of ClustImpute corresponds to a fixed number of random imputations plus a fixed number of clustering steps. Note that the other imputation methods (except random imputation) could potentially yield better results if hyperparameter tuning were performed, which is beyond the scope of this paper.
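As a concrete illustration, the Rand Index can be computed directly from pair agreements (a base-R sketch; in the experiments, ClusterR::external_validation is used instead):

```r
# The Rand Index is the share of observation pairs on which two partitions
# agree: either both partitions put the pair together, or both keep it apart.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")       # pair i,j in the same cluster under a?
  same_b <- outer(b, b, "==")       # ... and under b?
  agree <- same_a == same_b
  mean(agree[upper.tri(agree)])     # average over distinct pairs
}

rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # identical up to relabeling -> 1
```

Note that cluster labels themselves are arbitrary; only the induced pairings matter.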

Table 1

Rand index on simulated data. The maximum values are shown in bold.

| Nr. of obs. | ClustImpute | RandomImp + ClusterR | missRanger + ClusterR | MICE (PMM) + ClusterR | MICE (CART) + ClusterR | Amelia + ClusterR |
|---|---|---|---|---|---|---|
| 100 | 0.1787 | **0.1846** | 0.1508 | 0.1193 | 0.1401 | 0.1417 |
| 200 | **0.2607** | 0.2534 | 0.2291 | 0.2254 | 0.2015 | 0.1684 |
| 400 | **0.3395** | 0.2676 | 0.2619 | 0.2284 | 0.3142 | 0.2041 |
| 800 | **0.2682** | 0.2669 | 0.2154 | 0.2524 | 0.2405 | 0.2098 |
| 1600 | 0.2473 | 0.2187 | 0.2525 | 0.1984 | **0.2610** | 0.1973 |

3.2 Benchmarking Clustering Performance on Iris

For additional benchmarking, we use the popular Iris data set [16], which contains 150 observations of iris flowers with four numeric features and species labels. To reproduce this experiment, the corresponding script Benchmarking_Iris.R is available in the GitHub repository at https://github.com/o1iv3r/ClustImpute_paper_material. In contrast to the previous example, the size of the data set stays fixed, but the share of missing values for each of the four variables is varied. Missing values are generated similarly to the approach used in the scalability experiment.

The parametrization is as follows:


nr_iter <- 15
n_end <- 8
nr_cluster <- 3
c_steps <- 1
nr_iter_other <- nr_iter * c_steps

In contrast to the previous section, we use the same number of clustering steps for all approaches (see the last line of the code above) so that the benchmark approaches are not disadvantaged. Clustering performance is measured as an average of the Rand Index over 30 clustering runs for each procedure and share of missing values.

For an MCAR missing scheme, we observe in Figure 4 that all approaches yield good clusters when the share of missing values is low. ClustImpute is among the top performers for a missingness rate of 30% or less. For even larger missingness rates, pre-imputation with MICE provides better results. Nevertheless, ClustImpute performs better than missRanger or a simple random imputation. Interestingly, ClustImpute without a weight function (i.e., with a weight of 1 from the first iteration onwards), or using the weight function only in the cluster computation but not in the assignment, performs worse than with a weight function. Note that Amelia provides very good results for a low missingness rate but fails to converge for a larger share, even though empirical priors were used as suggested by the package’s documentation.

Figure 4

This figure shows the Rand Index for a censored Iris data set for a comparison between ClustImpute and a k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. Different shares of missing values were simulated using MCAR. The (dodged) point ranges cover +/– 2 times the standard error.

In the second experiment, everything was kept the same except for the missing simulation, which is now MAR. The results are shown in Figure 6. To generate MAR, the missing values were correlated using the previously mentioned function ClustImpute::Missing_simulation. Figure 5, made with the corrplot package, shows that the missing values are indeed strongly correlated. The results in the MAR setting are somewhat similar; however, the performance gain from using MICE compared to the other packages is larger than it was for MCAR.

Figure 5

This figure shows the correlations between missing values in the censored Iris data set with 40% of values missing at random (MAR).

Overall, these results (see Figure 6) highlight the setting in which ClustImpute is most beneficial: when missing values are strongly correlated and the data dimensionality is high. In such cases, pre-imputation with MICE becomes computationally expensive, Amelia may fail to converge, and simple random imputation yields poor clustering results.

Figure 6

This figure shows the Rand Index for a censored Iris data set for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. Different missingness rates were simulated using MAR. The (dodged) point ranges cover +/– 2 times the standard error.

Setting the number of iterations and clustering steps: The benchmarking experiments on the Iris data were conducted for a fixed parametrization of ClustImpute. The following experiment shows that, at least for this data set, there is rather low sensitivity with respect to changes in the parametrization. We considered all combinations of the following parameters:


nr_iter <- c(5,10,20,30,50)
c_steps <- c(1,10,25,50)

The parameter n_end was set either to 1 (i.e., no weight function is used) or to 30%, 60% or 90% of nr_iter, rounded to the closest integer. The missing simulation was performed as above using MAR and a missing share of 30%. For each of the resulting 80 parameter combinations, 30 ClustImpute runs were performed and the resulting Rand Index averaged. Results are shown in Figure 7. One observes that the Rand Index is, on average, higher if a weight function is used. Moreover, higher scores can be obtained if the weight function converges well before the final iteration, i.e., by setting n_end to 30% or 60% of nr_iter. The size of the points corresponds to the total number of clustering steps, i.e., the product of nr_iter and c_steps. The best scores are obtained with a total number of clustering steps that is not too high. For example, a good result can be obtained by setting nr_iter and c_steps to 10 and n_end to 6 (this is the 5th best result).
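The parameter grid can be reproduced with a short base-R sketch (the column name frac is introduced here for illustration):

```r
# All combinations of nr_iter, c_steps and the n_end fraction (frac = 0
# encodes "no weight function", i.e., n_end = 1): 5 * 4 * 4 = 80 settings.
grid <- expand.grid(nr_iter = c(5, 10, 20, 30, 50),
                    c_steps = c(1, 10, 25, 50),
                    frac = c(0, 0.3, 0.6, 0.9))
grid$n_end <- ifelse(grid$frac == 0, 1, round(grid$frac * grid$nr_iter))
nrow(grid)  # 80
```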

jors-13-345-g7.png
Figure 7

Points refer to the averaged Rand Index over 30 runs for each parameter combination. In the first plot from the left, the end point of convergence of the weight function was set either to 1, or to 30%, 60% or 90% of the number of iterations. The size of the points corresponds to the total number of clustering steps.

While the experiments in this section offer valuable insights into the clustering performance of ClustImpute, the results should be interpreted with caution. Performance may vary substantially across different data sets and missingness mechanisms. Additionally, many alternative imputation methods exist, and even the ones considered here have numerous tuning parameters that can significantly influence clustering outcomes. Finally, in real-world applications, true cluster labels are typically unavailable, so the quality of the resulting partitions must be evaluated using domain-specific criteria in addition to purely statistical metrics.

(2) Availability

Operating system

The ClustImpute package is available for all operating systems that support the R statistical software.

Programming language

R (version 4.0.5 or higher)

Additional system requirements

None

Dependencies

Required: R, ClusterR, copula, dplyr, magrittr, rlang

Suggested: psych, ggplot2, tidyr, Hmisc, tictoc, corrplot

List of contributors

Oliver Pfaffel

Software location

Archive

Code repository

Language

English

(3) Reuse potential

The software can be used in any setting where a large and complex data set needs to be segmented into a smaller number of clusters. Its efficient handling of missing values ensures broad applicability in domains where data is often incomplete due to collection or measurement limitations. The interface mirrors familiar paradigms and should be intuitive for R users with prior experience in clustering. As a non-commercial open-source project, formal support is limited. However, bug reports and questions can be submitted via the GitHub Issues page, and the hope is that the community will actively contribute to ongoing development and extensions of the package.

The package has already been applied in a range of real-world research contexts, demonstrating its versatility across domains. For instance, it has been used to analyze metabolomic and microbiome data in gestational diabetes research [17], to identify ecological groupings in migratory bird populations [18], to explore ecological divergence in marine bacterial communities [19], and to improve data quality in crowdsourced business directory platforms using deep learning-based workflows [20].

Acknowledgements

The author thanks Dr. Sebastian Kaiser for fruitful conversations around this article and clustering topics in general. Moreover, special thanks go to Levent Alkaya for carefully reading a pre-print of this article and his excellent suggestions to improve upon it.

Competing Interests

The author has no competing interests to declare.

DOI: https://doi.org/10.5334/jors.345 | Journal eISSN: 2049-9647
Language: English
Submitted on: Aug 22, 2020 | Accepted on: Aug 7, 2025 | Published on: Aug 18, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Oliver Pfaffel, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.