
Figure 1
Imputation with the median vs. random imputation on a simulated data set further described in section 3. Median imputation on the left produces an artificial point mass near the x and y axes. Random imputation on the right produces many points far away from the actual clusters.
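
The contrast between the two imputation strategies can be illustrated with a minimal one-dimensional sketch. The data, the 20% missingness rate, and the helper names `impute_median` and `impute_random` are assumptions made for illustration and are not taken from the simulation in Section 3.

```python
import random
import statistics


def impute_median(column):
    """Replace None entries by the median of the observed entries."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]


def impute_random(column, rng):
    """Replace None entries by values drawn at random from the observed entries."""
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]


rng = random.Random(0)

# Hypothetical data: two well-separated clusters around -3 and +3.
values = [rng.gauss(-3, 0.5) for _ in range(100)]
values += [rng.gauss(3, 0.5) for _ in range(100)]

# Delete roughly 20% of the entries completely at random (assumed rate).
column = [None if rng.random() < 0.2 else v for v in values]

median_imputed = impute_median(column)
random_imputed = impute_random(column, rng)

# All median-imputed entries share one value (a point mass), whereas
# randomly imputed entries are spread over both clusters.
filled_by_median = {m for m, c in zip(median_imputed, column) if c is None}
print(len(filled_by_median))
```

In two or more dimensions, random imputation fills each coordinate independently, so an imputed point can combine coordinates from different clusters and land far from all of them.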

Figure 2
Clustering based on the simulated data with missing values as shown in Figure 1. On the left, k-means is applied to the data set after random imputation. On the right, the proposed package ClustImpute was used without any imputation or other pre-processing step. Note that the underlying data has four additional "noise" variables not shown in the figures; thus the clusters overlap somewhat when plotted with respect to x and y only.

Figure 3
This figure shows the median running time in seconds for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation, on a linear scale (left) and on a log scale (right).

Table 1
Rand index on simulated data. The maximum values are shown in bold.
| Nr. of obs. | ClustImpute | RandomImp +ClusterR | missRanger +ClusterR | MICE(PMM) +ClusterR | MICE(CART) +ClusterR | Amelia +ClusterR |
|---|---|---|---|---|---|---|
| 100 | 0.1787 | **0.1846** | 0.1508 | 0.1193 | 0.1401 | 0.1417 |
| 200 | **0.2607** | 0.2534 | 0.2291 | 0.2254 | 0.2015 | 0.1684 |
| 400 | **0.3395** | 0.2676 | 0.2619 | 0.2284 | 0.3142 | 0.2041 |
| 800 | **0.2682** | 0.2669 | 0.2154 | 0.2524 | 0.2405 | 0.2098 |
| 1600 | 0.2473 | 0.2187 | 0.2525 | 0.1984 | **0.2610** | 0.1973 |
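
As a reminder of what the metric measures, the following is a minimal sketch of the Rand index and of its chance-corrected (adjusted) variant. The caption does not state which variant was used, so both are shown; the function names are hypothetical and this is not the implementation used to produce the table.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Share of point pairs on which the two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance agreement (0 expected for random labels)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    # Pair counts within contingency cells, rows, and columns.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
print(rand_index(truth, [1, 1, 0, 0]))           # identical up to relabelling
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # worse than chance
```

Both measures are invariant to relabelling of the clusters, which is why they are suitable for comparing a clustering result against the true cluster membership.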

Figure 4
This figure shows the Rand Index for a censored IRIS data set, comparing ClustImpute with a k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. Different shares of missing values were simulated under MCAR (missing completely at random). The (dodged) point ranges cover ± 2 times the standard error.
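
The two ingredients of this figure, MCAR censoring and point ranges of ± 2 standard errors, can be sketched as follows. The helper names and the example scores are hypothetical, and the sketch assumes that the standard error is computed over repeated runs, which the caption does not state explicitly.

```python
import math
import random
import statistics


def mcar_mask(n_rows, n_cols, share, rng):
    """Mark each cell as missing independently with probability `share`."""
    return [[rng.random() < share for _ in range(n_cols)] for _ in range(n_rows)]


def point_range(scores):
    """Return (lower, mean, upper) with bounds at mean -/+ 2 standard errors."""
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - 2 * std_err, mean, mean + 2 * std_err


rng = random.Random(1)

# A 150 x 4 mask (the dimensions of the iris measurements) with 40% missing.
mask = mcar_mask(150, 4, 0.4, rng)
observed_share = sum(map(sum, mask)) / (150 * 4)
print(round(observed_share, 2))

# Hypothetical Rand index values from repeated runs, for illustration only.
lower, mean, upper = point_range([0.2, 0.4])
print(lower, mean, upper)
```

Because each cell is masked independently of the data, the missingness pattern carries no information about the values themselves, which is what distinguishes MCAR from the MAR setting.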

Figure 5
This figure shows the correlations between missing values in the censored IRIS data set with 40% of values missing at random (MAR).

Figure 6
This figure shows the Rand Index for a censored IRIS data set for an application of ClustImpute vs. a comparable k-means clustering performed on a data set imputed by Amelia, MICE, missRanger or simple random imputation. Different missingness rates were simulated using MAR. The (dodged) point ranges cover ± 2 times the standard error.

Figure 7
Points show the Rand Index averaged over 30 runs for each parameter combination. In the leftmost plot, the end point of convergence of the weight function was set either to 1 or to 30%, 60% or 90% of the number of iterations. The size of the points corresponds to the total number of clustering steps.
