A Comprehensive R Package for Clusterability Testing

Open Access | May 2026

Figures & Tables

Figure 1

Illustration of the inherent ambiguity of clustering. One large cluster along with a set of outliers that could be considered either as a separate cluster or as noise.

Table 1

Comparisons of Clusterability Methods. Method/Ref refers to the method and its citation.

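| Method/Ref | Data Type | Test | Pre Clust | Type I | Notes | Lang | Package |
|---|---|---|---|---|---|---|---|
| PCA dip [4] | Num | Y | Y | Y | Good perf: robust | R | Y |
| PCA Silv [4] | Num | Y | Y | Y | Good perf: small clust | R | Y |
| SPCA dip [15] | Num | Y | Y | Y | Good perf: robust | R | Y |
| SPCA Silv [15] | Num | Y | Y | Y | Good perf: small clust | R | Y |
| Dist. dip [4] | Num | Y | Y | Y | Robust; power varies | R | Y |
| Dist. Silv [4] | Num | Y | Y | Y | Good perf, small clust | R | Y |
| Hopkins [24] | Num | Y | Y | N | poor perf | R, py | N |
| PC dip [4] | Num | Y | Y | N | poor perf | R, py, mat | N |
| PC Silv. [4] | Num | Y | Y | N | poor perf | R, py, mat | N |
| GMM [21] | Num | Y | M^1 | NT^2 | Assume Gauss | R | N |
| Epter [16] | Num | N | Y | N/A | Historical | None | N |
| Klopotek [9] | Num | N | N | N/A | Validation | None | N |
| TestCat [28] | Cat | Y | Y | Y^3 | cat. data | R, Mat | N |
| PHI/PSI [14] | Num/Mix | N | Y | N/A | score only | Py | N |
| VAT [17] | Num/Cat^4/Mix^4 | N | Y | N/A | visual only | Mat, Py, R | N |
| iVAT [18] | Num/Cat^4/Mix^4 | N | Y | N/A | visual only | Mat, Py, R | N |
| aVAT [18] | Num/Cat^4/Mix^4 | N | Y | N/A | visual only | Manual^5 | N |
| Ultramet [20] | Num | N | Y | N/A | score only | None | N |
| Miasnikof [32] | Graph | Y | Y | Y | Graph test, good perf | None | N |
| Gao/Zhang [30] | Graph | N^6 | N | N/A | Y/N descn; no p-value | None | N |
| FOCS 2018 [33] | Graph | N^6 | Y | N/A | Y/N descn; no p-value | None | N |
| Li. 2025 [31] | Graph | N^6 | Y | N/A | Y/N descn; no p-value | None | N |
| FCN [34] | Graph | Y | Y | Unclear^7 | Graph test, No T1E | None | N |
| PHIClust [22] | RNA-seq | TH^6 | Y | N/A | App spec | R | N |
| Build Clust [23] | Spatial | N | Y | N/A | App spec | None | N |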

Data type includes numeric (Num), categorical (Cat), or mixed (Mix).

Test refers to whether or not the method conducts a formal, statistically backed test of whether the data are clusterable.

Pre Clust refers to whether or not the test is conducted prior to clustering, without explicitly requiring a clustering algorithm to be chosen first.

Type I refers to whether or not the method's type I error rate has been tested and found to be close to the nominal value.

Lang describes the languages in which an implementation of the methodology is immediately available. Options are R, py (Python), mat (MATLAB), or None.

Package Y/N denotes whether or not the method is included in our clusterability package on CRAN. Methods with Package=Y are in bold.

Notes summarizes the method framework, with performance notes or major limitations.

1 M: Gaussian Mixture Models (GMM), while done prior to clustering, do assume that a GMM is a reasonable fit to the data. We consider this model-based and therefore exclude it from our package.

2 NT: We could not find simulations testing type I error in their paper.

3 Based on the chi-square distribution, the test is theoretically controlled when assumptions are met. Type I error evaluations were mostly favorable, with a few minor exceptions.

4 Existing software is readily available for numeric data. Categorical and mixed type data is theoretically possible but may require some preprocessing before implementation.

5 Similarly, aVAT may require preprocessing to run through VAT.

6 These tests provide binary decisions (clusterable or not clusterable), but do not provide p-values and have not been type I error tested.

7 For FCN, simulations reported average p-values. The type I error rate, i.e. the proportion of the time the p-value was below the nominal level, was not reported.
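
The distinction drawn in note 7 can be illustrated with a minimal Python sketch. The p-values below are hypothetical, as are the function name and the nominal level of 0.05; none of them come from the FCN paper. The sketch only shows that an average p-value and a rejection proportion are different summaries of the same simulation.

```python
from statistics import mean


def type_i_error_rate(p_values, alpha=0.05):
    """Proportion of p-values from null (unclusterable) data that fall below alpha."""
    if not p_values:
        raise ValueError("p_values must be non-empty")
    rejections = sum(1 for p in p_values if p < alpha)
    return rejections / len(p_values)


# Hypothetical p-values from ten repeated tests on data with no cluster structure.
null_p_values = [0.01, 0.02, 0.03, 0.04, 0.90, 0.95, 0.97, 0.98, 0.99, 0.99]

average_p = mean(null_p_values)                              # looks unremarkable
rejection_rate = type_i_error_rate(null_p_values, alpha=0.05)  # far above 0.05

print(f"average p-value: {average_p:.3f}")
print(f"type I error rate at alpha=0.05: {rejection_rate:.2f}")
```

In this made-up example the average p-value is well above the nominal level even though four of the ten tests reject, which is why an average p-value alone does not establish type I error control.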

Figure 2

Flowchart for options of data reduction methods and multimodality tests.
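
As a rough illustration of one data reduction option, the sketch below reduces a multivariate dataset to its set of pairwise distances, the one-dimensional sample to which a multimodality test would then be applied. It is written in Python with the standard library only; the function name, the toy points, and the choice of Euclidean distance are assumptions made for illustration and are not taken from the package. The PCA and sparse PCA options are not shown.

```python
from itertools import combinations
from math import dist


def pairwise_distances(points):
    """Return the Euclidean distance for every unordered pair of points.

    n points yield n * (n - 1) / 2 distances, so the reduced sample grows
    quadratically with the number of observations.
    """
    return [dist(a, b) for a, b in combinations(points, 2)]


# Toy two-dimensional data (hypothetical).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = pairwise_distances(points)
print(distances)
```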

Figure 3

Three plots showing the normals1, normals2, and normals3 datasets, which are sampled from mixtures of one or more normal distributions.

Figure 4

Plots displaying the normals4 and normals5 datasets.

Table 2

Results from the clusterabilitytest() function applied to five normal mixtures and the iris and cars datasets. Values are the p-values reported by each test. Results are rounded to 6 digits, as specified by the function parameter values (s_adjust=TRUE and s_digits=6).

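| Data Set | Dip PCA | Dip Sparse PCA (EN) | Dip Distance | Silverman PCA | Silverman Sparse PCA (EN) | Silverman Distance |
|---|---|---|---|---|---|---|
| normals1 | 0.98522 | 0.97978 | 0.99361 | 0.80364 | 0.78757 | 0.10218 |
| normals2 | 0 | 0 | 0 | 0 | 0 | 0 |
| normals3 | 0.04596 | 0.01817 | 3.291×10^-5 | 0.00101 | 0.00401 | 9.0×10^-6 |
| normals4 | 6.468×10^-6 | 0 | 0.01969 | 0 | 0 | 0 |
| normals5 | 1.380×10^-5 | 0.00181 | 4.726×10^-5 | 1.571×10^-5 | 0.00057 | 0 |
| iris | 0 | 0 | 0 | 9.0×10^-6 | 0 | 0 |
| cars | 0.85818 | 0.83202 | 0.66042 | 0.52577 | 0.41324 | 0.99425 |
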
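
One way to read Table 2 is to compare each p-value against a significance level. The Python sketch below does this for the tabulated values, assuming a level of 0.05, which is an assumption and is not stated in the table. Both tests take unimodality as the null hypothesis, so a rejection is read as evidence of cluster structure. The column keys and helper name are hypothetical, and entries shown as 0 are rounded p-values rather than exact zeros.

```python
ALPHA = 0.05  # assumed significance level; not stated in Table 2

# Column order follows Table 2 (labels are illustrative).
COLUMNS = ["dip_pca", "dip_spca", "dip_dist", "silv_pca", "silv_spca", "silv_dist"]

# p-values copied from Table 2.
P_VALUES = {
    "normals1": [0.98522, 0.97978, 0.99361, 0.80364, 0.78757, 0.10218],
    "normals2": [0, 0, 0, 0, 0, 0],
    "normals3": [0.04596, 0.01817, 3.291e-5, 0.00101, 0.00401, 9.0e-6],
    "normals4": [6.468e-6, 0, 0.01969, 0, 0, 0],
    "normals5": [1.380e-5, 0.00181, 4.726e-5, 1.571e-5, 0.00057, 0],
    "iris": [0, 0, 0, 9.0e-6, 0, 0],
    "cars": [0.85818, 0.83202, 0.66042, 0.52577, 0.41324, 0.99425],
}


def rejections(p_values, alpha=ALPHA):
    """Map each dataset to a list of booleans: True where unimodality is rejected."""
    return {name: [p < alpha for p in row] for name, row in p_values.items()}


decisions = rejections(P_VALUES)
all_reject = sorted(name for name, row in decisions.items() if all(row))
none_reject = sorted(name for name, row in decisions.items() if not any(row))

print("rejected by every combination:", all_reject)
print("rejected by no combination:", none_reject)
```

Under this assumed level, the six test and reduction combinations agree on every dataset in the table.
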
Figure 5

Histograms with scores from the first principal component, pairwise distances, and first sparse principal component for the iris dataset.

Figure 6

A plot showing the original cars dataset and histograms with pairwise distances and scores from the first principal component and first sparse principal component.

Table 3

Median execution time of the clusterabilitytest() function for each dataset and test/dimension reduction combination. Time is measured in milliseconds.

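| Data Set | Dip PCA | Dip Sparse PCA (EN) | Dip Distance | Silverman PCA | Silverman Sparse PCA (EN) | Silverman Distance |
|---|---|---|---|---|---|---|
| normals1 | 2.35 | 3.35 | 4.45 | 948 | 698 | 3,130 |
| normals2 | 2.62 | 4.04 | 5.17 | 751 | 688 | 2,920 |
| normals3 | 2.23 | 5.54 | 3.79 | 732 | 827 | 2,920 |
| normals4 | 2.43 | 4.51 | 4.40 | 724 | 697 | 2,880 |
| normals5 | 2.42 | 4.64 | 5.53 | 663 | 712 | 2,840 |
| iris | 2.51 | 4.50 | 4.40 | 827 | 702 | 2,830 |
| cars | 2.49 | 3.95 | 2.16 | 756 | 702 | 847 |
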
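
For readers who want to collect comparable timings, the following Python sketch shows one generic way to obtain a median execution time in milliseconds. It is not the benchmarking procedure used for Table 3, which is not described here; the helper name, the repeat count, and the stand-in workload are all hypothetical.

```python
import time
from statistics import median


def median_runtime_ms(func, repeats=11):
    """Call func `repeats` times and return the median wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def workload():
    """Stand-in for a single test call (hypothetical)."""
    return sum(i * i for i in range(10_000))


elapsed_ms = median_runtime_ms(workload)
print(f"median runtime: {elapsed_ms:.3f} ms")
```

The median is less sensitive than the mean to occasional slow runs, which makes it a common choice when summarizing repeated timings.
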
DOI: https://doi.org/10.5334/jors.389 | Journal eISSN: 2049-9647
Language: English
Page range: 37 - 37
Submitted on: Aug 24, 2021
Accepted on: Apr 1, 2026
Published on: May 20, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Zachariah Neville, Margareta Ackerman, Andreas Adolfsson, Naomi C. Brownstein, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.