A Comprehensive R Package for Clusterability Testing

Zachariah Neville; Margareta Ackerman; Andreas Adolfsson; Naomi C. Brownstein

doi:10.5334/jors.389

(1) Overview

Introduction

Clustering involves the separation of data into natural groupings, aiming for similar observations within a group and dissimilar observations in separate clusters [1]. Unlike classification, clustering is unsupervised, and the groups or labels are not available in advance. Furthermore, clustering is inherently ambiguous, whereby the desired partitioning may depend on the needs of the application at hand [2]. Lack of a clear ground truth makes validation of clustering results particularly challenging.

While not all data possesses inherent cluster structure, clustering algorithms produce a partitioning regardless of whether such structure is present, potentially yielding misleading results. As such, measuring clusterability is essential to cluster analysis. Clusterability methods measure the underlying cluster structure in a dataset, and can inform whether clustering is appropriate for the data.

The inherent ambiguity of cluster analysis extends to clusterability analysis. That is, whether a given data set should be described as clusterable depends on the needs of the given application. For instance, whether small clusters are appropriate for a given application can alter whether the same data set should be viewed as clusterable.¹ Consider, for instance, the data depicted in Figure 1. If small clusters are appropriate, the two radically different-sized clusters shown in Figure 1 would form a clusterable data set. On the other hand, if more balanced clusters are desired and small groups of data are best viewed as outliers, then the same data would be considered unclusterable. Consequently, the inherently ambiguous nature of cluster analysis precludes a single clusterability method from successfully applying to all clustering applications, giving raise to the need for diverse clusterability techniques. Further discussions of the importance of clusterability and the need for diverse clusterability methods from a practical standpoint are found elsewhere [4].

Illustration of the inherent ambiguity of clustering. One large cluster along with a set of outliers that could be considered either as a separate cluster or as noise.

Importantly, cluster validation is a separate problem that is dependent on the choice of clustering algorithm. Clusterability evaluation is intended to take place prior to cluster analysis as a check to potentially alert the user of situations when the data is relatively homogeneous and thus cluster analysis may not be merited. By contrast, cluster validation is intended to evaluate performance of a specific clustering after cluster analysis is complete on a clusterable dataset. Thus, cluster validation algorithms, which may be found elsewhere [5, 6, 7], are not part of the clusterability package presented in this paper. Some methods are dependent on the choice of a specific clustering algorithm [8], such as k-means clustering [9] or spectral clustering [10]. Although numerous R packages exist for cluster analysis tasks [11, 12], we have excluded these from our package which specifically focuses on clusterability. Further discussion of how clusterability, clustering, and cluster validation fit in the clustering pipeline can be found elsewhere [4, 13, 14].

There are no prior R packages available for accessible and practical clusterability testing. To bring the benefits of clusterability analysis into wider practice, in this paper, we implement several methods in a clusterability package for the R language. Our clusterability package is the first implementation that allows clusterability testing to be readily accessible and performed through a single function. Critically, this package embraces the inherent ambiguity of cluster analysis, giving the user a variety of clusterability methods which lets them choose a technique that fits their data and application.

Moreover, we simplify the process of clusterability testing so that the user can run their test in one step from our thoroughly tested R package, rather than coding each step individually from scratch. By contrast, individual programming would require the user to choose and program their data reduction method, choose their multimodality test, choose desired parameters, etc. Having a single package saves time, improves reproducibility, and reduces errors potentially induced when having to recode each step.

This extensive clusterability package implements and integrates the clusterability approaches developed and analyzed previously [4, 15]. In brief, our previous work provides a detailed comparison of some of the underlying algorithms made available through this package and offers practical guidelines for selecting a clusterability technique suitable for a given application, by considering criteria such as the dimensionality of the data and role of small clusters in the selection process [4]. See the clusterability methods section of the present paper for a brief summary and update of the current methodology landscape.

Clusterability methods

Table 1 lists a large number of methods related to clusterability or cluster tendency. However, only a subset of clusterability methods are appropriate for widespread use. For example, historical methods are often proposed within theoretical frameworks or with primarily simple structures [16] and are often impractical for use on real data. In fact, many are NP-hard to compute [13].

Table 1

Comparisons of Clusterability Methods. Method/Ref refers to the method and its citation.

METHOD/REF	DATA TYPE	TEST	PRE CLUST	TYPE I	NOTES	LANG	PACKAGE
PCA dip [4]	Num	Y	Y	Y	Good perf: robust	R	Y
PCA Silv [4]	Num	Y	Y	Y	Good perf: small clust	R	Y
SPCA dip [15]	Num	Y	Y	Y	Good perf: robust	R	Y
SPCA Silv [15]	Num	Y	Y	Y	Good perf: small clust	R	Y
Dist. dip [4]	Num	Y	Y	Y	Robust; power varies	R	Y
Dist. Silv: [4]	Num	Y	Y	Y	Good perf, small clust	R	Y
Hopkins [24]	Num	Y	Y	N	poor perf	R, py	N
PC dip [4]	Num	Y	Y	N	poor perf	R, py, mat	N
PC Silv. [4]	Num	Y	Y	N	poor perf	R, py, mat	N
GMM [21]	Num	Y	M¹	NT²	Assume Gauss	R	N
Epter [16]	Num	N	Y	N/A	Historical	None	N
Klopotek [9]	Num	N	N	N/A	Validation	None	N
TestCat [28]	Cat	Y	Y	Y³	cat. data	R, Mat	N
PHI/PSI [14]	Num/Mix	N	Y	N/A	score only	Py	N
VAT [17]	Num/Cat⁴/Mix⁴	N	Y	N/A	visual only	Mat, Py, R	N
iVAT [18]	Num/Cat⁴/Mix⁴	N	Y	N/A	visual only	Mat, Py, R	N
aVAT [18]	Num/Cat⁴/Mix⁴	N	Y	N/A	visual only	Manual⁵	N
Ultramet [20]	Num	N	Y	N/A	score only	None	N
Miasnikof [32]	Graph	Y	Y	Y	Graph test, good perf	None	N
Gao/Zhang [30]	Graph	N⁶	N	N/A	Y/N descn; no p-value	None	N
FOCS 2018 [33]	Graph	N⁶	Y	N/A	Y/N descn; no p-value	None	N
Li. 2025 [31]	Graph	N⁶	Y	N/A	Y/N descn; no p-value	None	N
FCN [34]	Graph	Y	Y	Unclear⁷	Graph test, No T1E	None	N
PHIClust [22]	RNA-seq	TH⁶	Y	N/A	App spec	R	N
Build Clust [23]	Spatial	N	Y	N/A	App spec	None	N

[i] Data type includes numeric (Num), categorical (Cat), or mixed (Mix).

Test refers to whether or not the method conducts a formal statistically backed test of whether the data is clusterable or whether it is not clusterable.

Pre Clust refers to whether or not the test is conducted prior to clustering, without explictly requiring a clustering algorithm to be chosen first.

Type I refers to whether or not the method has type I error tested and close to the nominal value.

Lang describes the languages available for immediate implementation of the methodology. Options are R, py (python), mat (Matlab), or None.

Package Y/N denotes whether or not the method is included in our clusterability package on CRAN. Methods with Package=Y are in bold.

Notes summarizes the method framework, with performance notes or major limitations.

¹ M: Gaussian Mixture Models (GMM), while done prior to clustering, do assume that a GMM is a reasonable fit to the data. We consider this model-based and therefore exclude it from our package.

² NT: We could not find simulations testing type I error in their paper.

³ Based on the chi-square distribution, the test is theoretically controlled when assumptions are met. Type I error evaluations were mostly favorable, with a few minor exceptions.

⁴ Existing software is readily available for numeric data. Categorical and mixed type data is theoretically possible but may require some preprocessing before implementation.

⁵ Similarly, aVAT may require preprocessing to run through VAT.

⁶ These tests provide binary decisions (clusterable or not clusterable), but do not provide p-values and have not been type I error tested.

⁷ For FCN, simulations reported average p-values. The type I error rate, i.e. the proportion of the time the p-value was below the nominal level, was not reported.

Our package focuses on statistical tests for clusterability, excluding measures that are merely visual [17, 18], heuristic [19] or score-based [14, 20], rather than inferential. Although one could conduct a clusterability test based on Gaussian mixture models – testing whether one component is needed vs. the alternative that more components are required – such tests assume that a Gaussian mixture model is appropriate for the data. Similar to excluding methods that require a clustering algorithm before conducting clusterability tests, we exclude clusterability tests based on Gaussian-mixture models [21]. Some clusterability methods are designed for specific applications, such as single cell transcriptomics [22] or spatial data [23] rather than for general use. Some methods, such as the Hopkins statistic [24] and tests that operate on principal curves, a non-linear dimension reduction method [25], have been shown to perform poorly on simulated and real examples [4, 26, 27] and are therefore excluded.

Other methods exist in the literature for different data types. For example, a recently proposed method applies clusterability tests to categorical data [28]. Although it may be computationally possible (i.e. algorithms may exist) to apply tests to a variety of data types, such as categorical data or mixtures of categorical and numeric data, data of mixed types have not yet been tested formally in the context of clusterability. Furthermore, the reliability of clustering and clusterability tests on categorical and mixed types is non-trivial and by no means guaranteed [28, 29]. We also consider graphs to be out of scope [30, 31, 32, 33, 34]. Finally, clusterability methods that are only available in other languages such as Python, but are not available in R, are excluded [14].

In this package, we have implemented the selection² of general clusterability techniques [4, 15] that are inferential rather than solely index, score based or visual, efficient and effective in practice, able to be conducted prior to cluster analyses without explicitly depending on a selected clustering or model, a specific clustering algorithm, and available in the R language. Thus, this R package consists of six clusterability methods resulting from two multimodality tests – Dip Test of Unimodality [35] and Silverman’s Critical Bandwidth Test [36] – combined with three techniques for dimensionality reduction – Principal Component Analysis (PCA) [37], Sparse Principal Component Analysis (Sparse PCA) [38] [39] and pairwise distances. To perform a clusterability test, the user selects both a multimodality test and a dimension reduction technique. A flowchart showing the clusterability tests available in our R package is provided in Figure 2. Summaries and comparisons of these approaches, along with recommendations on how to select from amongst these techniques for different applications are available elsewhere [4, 15]. We summarize and refine the main ideas of the methods comparison here and in Table 1. The package is available in the Comprehensive R Archive Network (CRAN), and its main function, clusterabilitytest(), is used to perform tests of clusterability.

Flowchart for options of data reduction methods and multimodality tests.

The clusterability methods included in our package involve the selection of a multimodality test and a dimension reduction technique. Multimodality tests assume the null hypothesis that the data was generated from a distribution with a single mode. Essentially, this assumes that the data has no inherent clusters. If the data is sufficiently different from that initial assumption, then the null hypothesis is rejected in favor of the alternative hypothesis that the data was generated from a distribution with more than one mode, and that the data likely contains clusters. One of the main differences between the Dip and Silverman tests applied to clusterability analysis is that Silverman’s Test tends to treat outliers as clusters (additional small modes), whereas Dip exhibits a preference for more balanced clusters (considering small outliers as noise rather than additional modes) [4]. Although others [40] have calibrated the Dip Test to alleviate known statistical conservatism, the calibration is not available in either the diptest R package or our clusterability package, which utilizes the capabilities of the diptest R package. Thus, the conservatism of the diptest emanates into clusterability tests. In short, when small clusters are permissible, Silverman’s critical bandwidth test should be considered, while the dip test may be preferable when small groups of distant data are better treated as outliers.

The second choice required is the dimension reduction technique. Multimodality tests require the data to be a single dimension, as described elsewhere [4, 41]. In some clustering applications, the user may only have the pairwise distances rather than the initial data; in this case, distances would be the inherent choice of dimension reduction. However, as described subsequently, pairwise distances reduce the number of features (to one) but increase the sample size in a quadratic fashion. Often the user may prefer a simple projection that preserves the sample size. Principal component analysis (PCA) [37] is a well-known method to reduce the data to components that explain the maximum variation in the data. Sparse PCA [38] is a tailored version of PCA designed for data with high dimensions, especially for data where it would be suitable for most of the loadings to be zero. Although cluster structure may be present in multivariate data but masked in single dimensional reductions [15], reductions provided in the clusterability package yield reasonable performance on a variety of simulated clusters and simple empirical datasets [4, 42, 15].

Importantly, the use of a dimension reduction technique depends on the user having the original data available. In some clustering applications, the user may not have the original data and may have only the set of pairwise distances between the data points. In this case, the user can feed the set of pairwise distances to the clusterability package, call the argument reduction=“None”, and choose a multimodality test to run on the distances. Similarly, if the user has data in only one dimension, then the user can use the reduction=“None” argument to directly conduct a multimodality test on their data without having to reduce the dimensionality.

Implementation and architecture

The clusterabilitytest() function is the main function of the clusterability package. To perform a clusterability test, the user must select a dimension reduction technique using the reduction parameter and a multimodality test using the test parameter. The remaining parameters of the clusterabilitytest() function are used to fine-tune the dimension reduction, standardization, or multimodality test. Details of the required and optional arguments of the clusterabilitytest() function are found in the appendix.

The clusterabilitytest() function returns a clusterability object which contains elements such as the p value and test statistic, allowing the results to be used as part of a larger workflow. To view the results of a test in a user-friendly manner, the print() function is available. Sample usage of both functions and the output of the print() function are shown later in the paper.

Differences in parameters

There are some minor differences in implementation from prior work. The code used to perform Silverman’s Test was originally obtained from an earlier implementation by others [43] and modified slightly. Most changes do not affect any calculations or results, but there is a difference in the distribution from which sample observations are generated while assessing the significance of the bandwidth. Although Silverman’s Critical Bandwidth test allows for a k parameter to test the number of modes, for clusterability testing we are typically interested in the k = 1 case. For this reason, the k parameter is not available for the user to set in the clusterabilitytest() function, and Silverman’s Test is always performed with k = 1, and the calibration [44] is available in our clusterability package and enabled by default. We direct the reader elsewhere [41] for further details on how our implementation of the Silverman test in the clusterabilitytest() function differs from that by [43] and from an implementation in the multimode package.

In addition to the changes detailed above, there is another change made to rounding behavior in Silverman’s Test in this package. Prior implementations of Silverman’s test in [41, 43] round p values below 0.005 down to 0 if adjustment [44] is specified, and otherwise spline interpolation on a table of values is used to adjust the p value. To maximize information for small p-values, in the clusterability package, the spline interpolation is always used, and the p value is set to 0 only if interpolation yields a negative p value.

Quality control

Example function calls and corresponding output are provided in this paper and in the README.md file in the GitHub repository. Users can follow the same steps to install the package and makes calls to the clusterabilitytest() function to verify the software is working. Several supplemental files are included in both the R package and the GitHub repository to reproduce the examples, performance benchmarks, and plots provided in this manuscript. The package has been installed and tested on multiple Windows machines and achieved identical results on each machine when using the same random number generator seed and random number generation method.³ The package was also tested on a machine using R version 4.5.2, which was the latest R version available at the time of manuscript preparation. Unit testing to verify the behavior for many internal functions is implemented using the testthat testing package [46].

If large datasets or simulations are planned, then run-time may be a consideration, with dip running faster than Silverman. Run time analysis is discussed in the Computational Performance section.

Examples

We provide several examples to illustrate the use of the clusterability package – in particular, the clusterabilitytest() function – to determine the clusterability of a dataset. The examples were chosen for their simplicity to keep the focus of the paper on the details of the code implementation. The print() function, which displays results in a user-friendly manner, is also demonstrated using one of the examples below.

First, we begin with five datasets sampled from one or more Gaussian distributions. Next, we examine the famous iris and cars datasets. Simulated datasets are provided as normals1 through normals5 with the clusterability package. The iris and cars datasets are included with the base R software. The three-dimensional plots for the normals4 and normals5 datasets were created using the plotly package [47]. We conclude with a discussion of the relative computational performance of each test and dimension reduction combination.

The datasets used in this manuscript are the same as those used in a related cluserability macro in preparation for the SAS language [48]. The p values associated with each dataset and test are the same in both papers, which illustrates reproducibility of results of clusterability tests on both platforms. When testing the package on multiple machines, results varied, however. Results for p values associated with the dip test were identical on all machines. Results associated with Silverman’s test were qualitatively similar but not numerically identical across machines. Fortunately, within each machine tested, results of clusterability tests using Silverman’s test of multimodality were identical for all tests at multiple time points. We hypothesize that the difference between machines is due to a portability problem in R itself rather than in our package.

Data sampled from mixtures of normal distributions

We begin with five simulated datasets with each containing 150 observations selected from a mixture of one or more normal distributions. Plots of each dataset are provided in Figures 3 and 4. Each of these datasets contains a cluster column which indicates the cluster membership for each observation. The cluster column is given for informational purposes for users. However, we subsquently remove the cluster column when performing clusterability tests in the examples below.

Three plots showing the *normals1, normals2*, and *normals3* datasets, which are sampled from mixtures of one or more normal distributions.

Plots displaying the *normals4* and *normals5* datasets.

The first dataset is normals1, which contains 150 observations sampled from $N ((\begin{matrix} 0 \\ 4 \end{matrix}), I_{2})$ . This is a two-dimensional dataset designed to have just one cluster because all observations are sampled from a single bivariate normal distribution.

The second dataset is normals2, which contains 75 observations sampled from $N ((\begin{matrix} - 3 \\ - 2 \end{matrix}), I_{2})$ and 75 observations sampled from $N ((\begin{matrix} 1 \\ 1 \end{matrix}), I_{2})$ . This two-dimensional dataset is designed to have two clusters because the data are sampled from two normal distributions whose mean vectors are far apart. These clusters are visible in the Figure 3 plot.

The normals3 dataset is constructed to have three clusters by sampling from different three normal distributions. These clusters are visible in Figure 3. It contains 50 observations sampled from $N ((\begin{matrix} 3 \\ 0 \end{matrix}), I_{2})$ , 50 observations sampled from $N ((\begin{matrix} 0 \\ 3 \end{matrix}), I_{2})$ , and 50 observations sampled from $N ((\begin{matrix} 3 \\ 6 \end{matrix}), I_{2})$ .

The normals4 dataset contains three variables and is designed to have two clusters by sampling from two trivariate normal distributions. The two clusters are visible in Figure 4. The dataset contains 75 observations sampled from $N ((\begin{matrix} 1 \\ 3 \\ 2 \end{matrix}), I_{3})$ and 75 observations sampled from $N ((\begin{matrix} 4 \\ 6 \\ 0 \end{matrix}), I_{3})$ .

The fifth dataset is normals5, which contains three variables and is designed to have three clusters. The data is sampled from three normal distributions: 50 observations are sampled from $N ((\begin{matrix} 1 \\ 3 \\ 3 \end{matrix}), I_{3})$ , 50 observations are from $N ((\begin{matrix} 4 \\ 6 \\ 0 \end{matrix}), I_{3})$ , and 50 observations are from $N ((\begin{matrix} 2 \\ 8 \\ - 3 \end{matrix}), I_{3})$ .

For each of these five simulated datasets, we used the clusterabilitytest() function to determine if the population from which the data were sampled contained cluster structure. Dimension reduction was necessary because each dataset contained two or more variables. The two tests (Dip and Silverman) and three dimension reduction techniques create a total of six test-reduction combinations, and all six combinations were used for each dataset. When evaluating sparse PCA, the elastic net implementation [49] was used because it is the default option for the spca_method parameter. The results of these clusterability tests are provided in Table 2, and sample code for conducting Silverman’s Test on the normals2 dataset, using pairwise distances for dimension reduction, is provided below. The results are printed in a user-friendly manner with the print() function.

Table 2

Results from the clusterabilitytest() function used on five normal mixtures and the iris and cars datasets. Values are p values reported by the test. Results are rounded to 6 digits as specified by function parameter values (s_adjust=TRUE and s_digits=6).

DATA SET	DIP PCA	DIP SPARSE PCA (EN)	DIP DISTANCE	SILVERMAN PCA	SILVERMAN SPARSE PCA (EN)	SILVERMAN DISTANCE
normals1	0.98522	0.97978	0.99361	0.80364	0.78757	0.10218
normals2	0	0	0	0	0	0
normals3	0.04596	0.01817	3.291×10^–5	0.00101	0.00401	9.0×10^–6
normals4	6.468×10^–6	0	0.01969	0	0	0
normals5	1.380×10^–5	0.00181	4.726×10^–5	1.571×10^–5	0.00057	0
iris	0	0	0	9.0×10^–6	0	0
cars	0.85818	0.83202	0.66042	0.52577	0.41324	0.99425

R> library(“clusterability”)
R> data(“normals2”)
# Remove cluster column R> normals2 <- normals2[,-3]
R> norm2_silvdist <- clusterabilitytest(normals2, "silverman",reduction = "distance", distance_standardize = "NONE",s_setseed = 123)
R> print(norm2_silvdist)
 
----------------------
Clusterability Test
----------------------
 
Data set name: normals2
Your data set has 150 observation(s) and 2 variable(s).
There were no missing values. Your data set is~complete.
 
Data Reduced Using: Pairwise Distances
Distance Metric: euclidean
 
 
--------------------------------------------------
Results: Silverman’s Critical Bandwidth Test
--------------------------------------------------
 
Null Hypothesis: number of modes <= 1
Alternative Hypothesis: number of modes > 1
p-value: 0
Critical bandwidth: 1.060891
 
-----------------
Test Options Used
-----------------
 
Seed set in R to: 123
p value based on 999 bootstrap replicates
Adjusted p-values based on the adjustment in Hall and York (2001)
p value rounded to: 6 digits

Using an a priori established significance level of 0.05, the normals1 dataset is deemed unclusterable by all six tests. This conclusion agrees with the known truth, because normals1 consists of observations sampled from a single normal distribution. The normals2 dataset is considered clusterable based on the low p values for each of the six tests, which again matches the known truth, because the data are sampled from two separate normal distributions that spread somewhat far apart. Similar results are found with the normals3, normals4, and normals5 datasets. Each of these datasets was constructed by sampling from multiple, well-separated normal distributions, and the cluster structure is visible in the sample data in Figures 3 and 4. As expected, the p values associated with the tests for these datasets are well below the 0.05 level. This suggests that the original distributions from which these datasets were sampled contain cluster structure.

The iris dataset

Next, we demonstrate the use of the clusterabilitytest() function on the popular iris dataset [50]. This dataset is available in the datasets package as part of base R software. The iris dataset contains 50 observations each of three species of iris flowers – versicolor, virginica, and setosa – for 150 total observations. In addition to the species, each observation has four numeric measurements – petal length, petal width, sepal length, and sepal width. Descriptive statistics on these variables are provided elsewhere [42]. The data contains three known clusters, corresponding to the three iris species [51]. Visually, the scores from the first principal component and first sparse principal component, as well as the set of pairwise distances, appear to be multimodal, as shown in Figure 5.

To empirically test for clusterability of the measurements of the iris flowers, we used the clusterabilitytest() function on the four numeric variables. Each of the six test-dimension reduction combinations was used, and the p values obtained from each procedure are provided in Table 2. For each test and dimension reduction combination – Dip PCA, Dip Sparse PCA, Dip Distance, Silverman PCA, Silverman Sparse PCA, and Silverman Distance – the p value from the test was well below our pre-chosen value of 0.05, suggesting the data is clusterable. This agrees with our prior knowledge that the data can be clustered by species.

The cars dataset

Our final example comes from the cars dataset [52], which is available in the datasets package and part of the base R distribution. The dataset contains two numeric variables – speed (in miles per hour) and stopping distance (in feet) – of 50 cars. For basic descriptive statistics on the cars dataset, please see [42]. Multiple clusters, if present in the data, could suggest that different types of cars or drivers consistently perform differently. However, as shown in Figure 6, the raw data and each form of dimension reduction do not illuminate subgroups within the data.

A plot showing the original *cars* dataset and histograms with pairwise distances and scores from the first principal component and first sparse principal component.

We used the clusterabilitytest() function to test the null hypothesis that the cars data were sampled from a population containing a single cluster. The results of each test and dimension-reduction method are provided in Table 2. For each of these six clusterability tests, the p value from the Dip or Silverman Test is greater than our pre-chosen value of 0.05. This suggests the principal component scores (traditional or sparse) and pairwise distances are unimodal, which suggests that the data are not clusterable. The cars dataset was not previously known to contain clusters, and our conclusion that the dataset is unclusterable matches the result from a similar analysis [4].

Computational performance

Computational performance is an important consideration, even when working with smaller datasets. We provide a comparison of the relative computational performance of each test and dimension reduction combination, using the five simulated and two real datasets throughout the examples section.

In Table 3, the median execution time of the clusterabilitytest() function is provided for each dataset and test combination. When evaluating the performance of Sparse PCA, the elastic net implementation [49] was used since it is the default setting for the spca_method parameter. The times were calculated using the mark() function of the bench package [53], with at least 10 iterations each. Benchmarking with a specialized package like bench package confers a number of benefits over running the function repeatedly and calculating a simple average, such as more precise timing and isolating the effects of garbage collection on running time [53].

Table 3

Median execution time of the clusterabilitytest() function for each dataset and test/dimension reduction combination. Time is measured in milliseconds.

DATA SET	DIP PCA	DIP SPARSE PCA (EN)	DIP DISTANCE	SILVERMAN PCA	SILVERMAN SPARSE PCA (EN)	SILVERMAN DISTANCE
normals1	2.35	3.35	4.45	948	698	3,130
normals2	2.62	4.04	5.17	751	688	2,920
normals3	2.23	5.54	3.79	732	827	2,920
normals4	2.43	4.51	4.40	724	697	2,880
normals5	2.42	4.64	5.53	663	712	2,840
iris	2.51	4.50	4.40	827	702	2,830
cars	2.49	3.95	2.16	756	702	847

With median execution times around 2 to 6 milliseconds, the Dip Test significantly outperformed Silverman’s Test, which took around 700 milliseconds for PCA and Sparse PCA, and nearly 3,000 milliseconds for pairwise distances. These resuls agree with previous testing showing that the dip test is more efficient than Silverman’s test [4, 15].

The relative efficiency of the dimension reduction techniques is less clear. For clusterability tests with our package using the Dip Test, execution times were typically shorter for PCA than for Sparse PCA and pairwise distances. When using Silverman’s Test, running on distances took the longest, with PCA and Sparse PCA taking much less time. Using pairwise distances on a dataset with n observations will produce a univariate dataset with 𝒪(n²) observations, assuming n dominates the number of features p, while the output size of PCA and sparse PCA is still n observations. This difference in dataset size may explain the slower execution time when using distances. Previous results in modest dimensions similarly show when the number of observations exceeds the number of features, reducing the dimension with PCA is more efficient than with distances, which form a quadratic data set compared to the initial sample size [4]. These results compound, leaving the method with Silverman and distances as quartic in the number of observations and potentially computationally infeasible for very large datasets. In our testing, also on modest dimensions, distances were much slower than either PCA or Sparse PCA. By contrast, Sparse PCA had the highest runtime in previous studies on data with much higher dimensions [15]. Users with a large number of observations may choose to reduce their dimension with PCA. Alternatively, despite potentially long computational times in high dimensions, sparse PCA may be desired especially if the data is of ultra high dimensions, or if it has more features than observations, or if it is known to be sparse.

(2) Availability

Operating system

The clusterability package was developed and tested on a machine running Windows 11. However, R software is broadly available to users with Windows, Mac, Linux, and Unix operating systems [45].

Programming language

R version 3.4.0 or greater is required.

Additional system requirements

System requirements for R and R packages, such as the clusterability package, are maintained by others [54, 55].

Dependencies

The diptest [56], splines [45], elasticnet [49], sparsepca [57], and cluster [58] R packages are dependencies, i.e. they are required to run the clusterability package. Upon installing the clusterability package through CRAN, the R software should automatically install these dependencies if they are not already present on the user’s system.

While not required to use the clusterability package, the plotly [47] package is required to reproduce the plots in Figure 4 and the bench [53] package is required to reproduce the performance benchmarks in Table 3.

List of contributors

The clusterability package was created by Zachariah Neville, Margareta Ackerman, Andreas Adolfsson, and Naomi C. Brownstein.

Software location

Code repository GitHub

Name: Clusterability
Persistent identifier: https://github.com/ZachNeville/Clusterability
Licence: GPLv2
Date published: 15/03/26

Language

English

(3) Reuse potential

We hope that by providing an R package to perform a chosen clusterability test in a single, reproducible line of code, clusterability tests will gain greater adoption and become an integral step of the clustering process. With the widespread use of clustering algorithms, such clusterability tests are important tools for users to assess if clustering methods are appropriate for their data. Due to the established and extensive use of cluster analysis in a variety of applications, along with increasing analysis of clusterability and cluster tendency, there is strong potential for frequent utilization of the software in this paper among users in numerous fields.

For a given application, the choice of the most suitable clusterability method depends on factors such as the desired interpretation of tiny clusters (as noisy outliers or as distinct clusters) and the sample size. Additional guidance on choosing a clusterability test is provided in the clusterability methods section.

Like all methods, clusterability tests in general and our package in particular have limitations. To date, the tests we provide have been evaluated with independent, numeric, complete observations. Although for convenience our package provide implementation of clusterability tests using broad types of data, such as feeding custom distance matrices calculated externally from non-Euclidean, categorical or mixed types of data, their theoretical, statistical and empirical properties (e.g. type I error, power) have not yet been examined. Formal theoretical and empirical evaluation of clusterability tests for other data types, such as network data [34], categorical data [28] or data of mixed numerical and categorical types are left for future work. While the completecase parameter performs the test with only complete cases (each observation has a value for each variable) for user convenience, the statistical implications of dropping incomplete cases are left for future work. In general, research is needed on further theoretical validation of clusterability tests.

Contributions or suggestions may be proposed by raising an “issue” on GitHub.

Appendices

Appendix: Details on functional parameters for the clusterabilitytest() function

We describe the parameters available to fine tune the behavior of the main function, clusterabilitytest(), which allows the user to perform tests of clusterability.

Several parameters have a prefix followed by an underscore, which indicates how the parameter is used. For example, parameters prefixed with s_ are used to control the behavior of Silverman’s test, while parameters prefixed with d_ are used for the Dip test. These parameters are only used when relevant to the method being used. For example, when performing a clusterability test using Silverman’s test, then Dip test parameters (prefixed with d_) will be ignored.

data (Required) This is the data set used in the test. It must only contain numeric data, unless using pairwise distances with Gower’s metric for dimension reduction.

test (Required) This is the name of the test to be used. Either “dip” or “silverman” are accepted, and will perform the Dip Test of Unimodality and Silverman’s Critical Bandwidth Test, respectively.

reduction (Default: “pca”) Specifies the type of dimension reduction to be performed, if any. Valid values include any of the following:

“none”: No dimension reduction is performed.
“pca”: The scores from the first principal component are used in the test. When using this method, the additional parameters pca_center and pca_scale may optionally be used to control centering and scaling behavior during PCA.
“spca”: The scores from the first (sparse) principal component are used in the test. The additional parameter spca_method determines which implementation of sparse PCA to use. Further parameters, described later, are available to refine the behavior of sparse PCA.
“distance”: The set of pairwise distances between observations will be used in the test. When using this method, the distance_metric parameter will control the metric used when calculating distances. Data standardization is also performed according to the distance_standardize parameter.

For multivariate data, dimension reduction is required, with options provided as illustrated in Figure 2 and refinements described below. For univariate data, you may wish to select “none”. The choice (reduction = "none") may also be useful for users who wish to reduce their multivariate data to a single dimension in a different, customized manner. In this case, the user would first reduce their data separately as desired, and then pass the unidimensional reduced data into the package, selecting “none” as the reduction argument.

distance_metric (Default: “euclidean”)

This is the metric used when computing pairwise distances, and is only applicable if the reduction parameter is set to “distance”. Some of the metrics measure similarity and not all metrics have been thoroughly tested. They are provided in the hopes they will be beneficial for some users who may need them. The following metrics are available for use:

The “euclidean”, “maximum”, “manhattan”, “canberra”, and “binary” metrics work the same as in the R function dist() and the user is referred there for further details. The Minkowski metric, which internally uses the same dist() function, is also available by providing “minkowski(p)”, with p being a positive number.
“sqeuc”: Computes the squared Euclidean distance.
“cov”: Computes the covariance similarity coefficient.
“corr”: Computes the correlation similarity coefficient.
“sqcorr”: Computes the squared correlation similarity coefficient.
“gower”: Computes distance using Gower’s metric using the cluster package [58].

distance_standardize (Default: “std”)

This indicates how to standardize the observations before computing pairwise distances, if at all. This is only applicable if reduction is set to “distance”. The available choices are:

“none”: No standardization is performed.
“std”: This is the default. Each variable is standardized to have mean zero and standard deviation one.
“mean”: Each variable is standardized to have mean zero. The standard deviation is unchanged.
“median”: Each variable is standardized to have median zero. The standard deviation is unchanged.

pca_center (Default: TRUE)

This argument is a logical (TRUE or FALSE) value indicating whether to shift the variables to be zero-centered when performing principal component analysis. This is only applicable when reduction is set to “pca”.

pca_scale (Default: TRUE)

This argument is a logical value indicating whether to scale the variables to have unit variance when performing principal component analysis. This argument is only used when reduction is set to “pca”.

spca_method (Default: “EN”)

This argument is the sparse PCA implementation to use – either “EN” to use the elasticnet implementation [49] or “VP” to use the sparsepca variable projection implementation [57]. The argument is only applicable when reduction is set to “spca”.

spca_EN_para (Default: 0.01)

This argument, applicable when using reduction = “spca” and spca_method = “EN”, is a positive number used as a penalty parameter during sparse PCA.

Below are several parameters to refine the behavior of sparse PCA, which correspond to the parameters of the same name in the elasticnet or sparsepca packages.

spca_EN_lambda (Default: 1e-6)

This argument, applicable when using reduction = “spca” and spca_method = “EN”, is a positive number used as the quadratic penalty parameter during sparse PCA.

spca_VP_center (Default: TRUE)

This argument, applicable when using reduction = “spca” and spca_method = “VP”, is a logical value indicating whether the variables should be shifted to be zero centered.

spca_VP_scale (Default: TRUE)

This argument, applicable when using reduction = “spca” and spca_method = “VP”, is a logical value indicating whether the variables should be scaled to have unit variance.

spca_VP_alpha (Default: 1e-3)

This argument, applicable when using reduction = “spca” and spca_method = “VP”, is the sparsity controlling parameter.

spca_VP_beta (Default: 1e-3)

This argument, applicable when using reduction = “spca” and spca_method = “VP”, is “the amount of ridge shrinkage to apply in order to improve conditioning” [57].

is_dist_matrix (Default: FALSE)

This is a logical argument indicating whether the data set in the data argument is a distance matrix. If TRUE, then the lower triangular values in the matrix are extracted into a univariate dataset and used in the Dip or Silverman test. When this argument is TRUE, then the reduction argument must be set to NONE. Otherwise, an error will be generated and the clusterability test will not be performed.

completecase (Default: FALSE)

This argument is a logical value indicating whether a complete case analysis should be performed. When set to TRUE, all observations that are missing values for one or more variables will be removed before performing the test. Because the Dip and Silverman tests cannot be used with missing data, the completecase argument must be set to TRUE when using incomplete or missing data.

Dip test parameters

d_simulatepvalue (Default: FALSE)

This is a logical value which determines whether Dip Test [35] p values are obtained via Monte Carlo simulation. The default is FALSE, meaning p values are interpolated using a table of empirical values computed from a large simulation. This is the default from the original diptest package [56].

d_reps (Default: 2000)

The d_reps argument is a positive integer determining the number of repetitions used in Monte Carlo simulation, if applicable. This argument is only used when d_simulatepvalue is TRUE.

Silverman test parameters

s_m (Default: 999)

This argument is the number of bootstrap replicates used in the Silverman test, and must be a positive integer.

s_adjust (Default: TRUE)

This is a logical value indicating whether p values should be adjusted according to work by [44]. The default is TRUE due to the improved level accuracy [44].

s_digits (Default: 6)

This is a positive integer which determines the number of decimal places to which the p value is rounded. This is only used when the s_adjust argument is TRUE.

s_setseed (Default: NULL)

This is an integer used to set the random number generator (RNG) seed. If the default of NULL is used, then the seed will not be set. When set, the s_setseed argument ensures reproducibility of results at different times or by different users.

s_outseed (Default: FALSE)

The s_outseed argument is a logical value that determines whether to return the current state of the random number generator as part of the output. It is recommended to use the default value of FALSE because this is mainly used for troubleshooting.

Notes

[2] For instance, Phylogenetic analysis allows for small clusters, while small groups of data are often better treated as outliers when applied to market segmentation [3].

[3] Specifically, this selection corresponds to the rows in Table 1 with the following entries: Data type includes “Num”, Test=“Y”, Pre Clust=“Y”, Type I E=“Y”, and Lang includes R. We also require that the method has not previously been shown to have uniformly poor performance such as inflated type I error, as reflected in the Notes column.

[4] The RNGkind() function in base R software can be executed on each machine to verify the random number generation settings are the same. The s_setseed parameter in the clusterabilitytest() function should also be set the same on each machine to allow reproducible results. Documentation in the base R software discusses random number generation in further detail [45].

Acknowledgements

The authors acknowledge support from the biomedical library at H. Lee Moffitt Cancer Center in considering journals to which to submit our paper. We also acknowledge the contributions of programmers (particularly Jose Laborde, Paul Stewart, Ann Chen, Zhihua Chen) in the 2020 Moffitt BBSR Hackathon who assisted with some sparse clusterability functionality.