DBSCAN in Domains with Periodic Boundary Conditions

Xander M. de Wit; Alessandro Gabbana

doi:10.5334/jors.555


(1) Overview

Introduction

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a widely used unsupervised machine learning algorithm designed to identify clusters in spatial data by leveraging density-based criteria [1, 2]. Unlike traditional clustering methods such as k-means, DBSCAN does not require prior knowledge of the number of clusters and is particularly effective at detecting clusters of arbitrary shapes and distinguishing noise points. The algorithm operates by grouping points that are closely packed together, based on two parameters: a neighborhood radius ϵ and a minimum number of points min_points. It is highly effective in applications ranging from geographical data analysis and image segmentation to many areas of physics [3, 4, 5, 6].

Conventional implementations of the DBSCAN algorithm tacitly assume that all points reside in a space with open boundaries. There are, however, many applications where the embedding space of the data instead has periodic boundaries in some or all dimensions, as if the data points reside on the surface of a (possibly higher-dimensional) torus. Periodic boundary conditions, where particles exiting on one side of the domain re-enter on the opposite side, are a commonly used tool, particularly in computational physics simulations, to mimic systems that are spatially unbounded, as if extending infinitely in space. They are omnipresent, for example, in fluid dynamics and molecular dynamics simulations [7]. However, periodicity also arises in many other areas when studying data that is naturally defined modulo some period, such as angular data that ranges periodically from 0 to 360 degrees or the time of day in a 24-hour cycle.

Applying clustering algorithms in domains with periodic boundary conditions requires special care. While a well-tailored approach exists for clustering in periodic domains based on k-means clustering [8], to the best of the authors’ knowledge, no such optimized implementation is publicly available for the DBSCAN clustering algorithm. In this work, we discuss how to efficiently apply DBSCAN in domains with periodic boundary conditions.

Conceptually, one could achieve a DBSCAN with periodic boundary conditions simply by swapping out the conventional distance metric (e.g. Euclidean distance or Manhattan distance) for its periodic counterpart that takes into account the periodic boundary conditions when computing the distance between two points, as proposed for example in [9]. In its most naive implementation, however, this would require $O(N^2)$ operations to compute all pairwise distances. Instead, optimized implementations of nearest neighbor search algorithms achieve a complexity of $O(N \log N)$ or better by using some form of spatial indexing, such as the K-D tree or Ball tree algorithms [10, 11, 12, 13]. The approach we propose for clustering in domains with periodic boundaries remains fully compatible with existing optimized search algorithms designed for domains with open boundaries, ensuring efficient computation even for large datasets.
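To make the metric-swapping idea concrete, the following minimal Python sketch computes a periodic Euclidean distance using the minimum-image convention and builds neighbor lists by brute force. The function names `periodic_distance` and `brute_force_neighbors` are illustrative only and are not taken from [9] or from any library; the sketch assumes that every dimension is periodic. The double loop makes the quadratic cost of the naive approach explicit.

```python
import math


def periodic_distance(a, b, lower, upper):
    """Euclidean distance between a and b in a box periodic in every dimension.

    Per dimension, the shorter of the direct separation and the separation
    across the periodic boundary is used (minimum-image convention).
    """
    total = 0.0
    for ai, bi, lo, hi in zip(a, b, lower, upper):
        period = hi - lo
        d = abs(ai - bi) % period
        d = min(d, period - d)  # go "around" the boundary if that is shorter
        total += d * d
    return math.sqrt(total)


def brute_force_neighbors(points, eps, lower, upper):
    """Neighbor lists within radius eps, using all N*(N-1)/2 pairwise distances."""
    n = len(points)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if periodic_distance(points[i], points[j], lower, upper) <= eps:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors
```

In a unit interval, for instance, the points 0.02 and 0.97 are separated by 0.05 across the boundary rather than 0.95, so they are neighbors for ϵ = 0.06.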

Implementation and architecture

Algorithm

The approach we propose leverages the property of DBSCAN that proximity is defined by a single well-defined radius ϵ. Consequently, the algorithm only needs to search for neighbors across the periodic boundary up to this distance. The method works by periodically extending the domain by a limited distance of ϵ in all periodic directions. This allows the clustering problem to be solved by applying the conventional DBSCAN algorithm – designed for open boundaries – to the extended domain. In the final step, the algorithm identifies and merges cluster labels assigned to different periodic copies of the same data point, ensuring that points in different periodic images are recognized as belonging to the same cluster in the periodic domain.
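The periodic extension described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the packaged implementation: the function name `periodic_extension`, its argument names, and the optional `periodic` flags (one boolean per dimension) are assumptions made for this example. It also assumes that ϵ is smaller than the period in every periodic dimension.

```python
from itertools import product


def periodic_extension(points, lower, upper, eps, periodic=None):
    """Create periodic copies of all points lying within eps of a periodic boundary.

    Returns (padded, source_index), where padded[k] is a periodic copy located
    outside the original box and source_index[k] is the index of the original
    point it was copied from.
    """
    if periodic is None:
        periodic = [True] * len(lower)  # assumption: all dimensions periodic
    padded, source_index = [], []
    for idx, point in enumerate(points):
        shift_options = []
        for x, lo, hi, is_periodic in zip(point, lower, upper, periodic):
            shifts = [0.0]
            if is_periodic:
                period = hi - lo
                if x - lo <= eps:   # near lower edge: copy appears beyond upper edge
                    shifts.append(period)
                if hi - x <= eps:   # near upper edge: copy appears beyond lower edge
                    shifts.append(-period)
            shift_options.append(shifts)
        # Combine the per-dimension shifts so that points near edges and corners
        # receive all required copies; the all-zero shift is the original point.
        for combo in product(*shift_options):
            if any(s != 0.0 for s in combo):
                padded.append(tuple(x + s for x, s in zip(point, combo)))
                source_index.append(idx)
    return padded, source_index
```

A point near a corner of a doubly periodic 2D domain receives three copies (two across the edges and one across the corner), while a point farther than ϵ from every periodic boundary receives none.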

The algorithm takes as input the data points $S$ (embedded in a space of dimension $D$) that need to be labeled, the lower and upper periodic boundaries $x_{min} = (x_{min}^{(1)}, x_{min}^{(2)}, \dots, x_{min}^{(D)})$ and $x_{max} = (x_{max}^{(1)}, x_{max}^{(2)}, \dots, x_{max}^{(D)})$, respectively, and the DBSCAN parameters, namely the neighborhood radius ϵ and min_points. The procedure consists of four steps, which are illustrated with an example in Figure 1:

Figure 1: Example of the different steps of the algorithm for DBSCAN with periodic boundary conditions: **(a)** original input dataset, **(b)** periodic extension by ϵ (step 1), **(c)** DBSCAN of the extended dataset (step 2), **(d)** final clustering after linking and resolving equivalent clusters (steps 3 & 4). This is a 2D example with periodicity L and neighborhood ϵ = 0.06 L.

1. Periodic extension. Extend the dataset $S$ from $[x_{min}, x_{max}]$ to $[x_{min} - \epsilon, x_{max} + \epsilon]$ through periodic extension, saving the padded data points (the periodic copies) into $S_{pad}$. For every padded point $s_{pad} \in S_{pad}$, save the index of the corresponding point in the original dataset $S$.
2. DBSCAN. Apply the original DBSCAN with neighborhood ϵ and min_points to all data points $S_{all} = S \cup S_{pad}$, yielding labels $L_{all}$.
3. Linking equivalent clusters. For each padded point $s_{pad} \in S_{pad}$, compare its label $l_{pad}$ to the label $l_{orig}$ of the corresponding point in the original dataset. If $l_{pad} \neq l_{orig}$, save the two labels as a linked cluster if that link does not already exist. If one of the labels already exists in another link, extend that link by including the other label.
4. Resolving linked clusters. For all the saved linked clusters, replace the linked labels by a single unique label (e.g. the minimum of the linked labels). This yields the final labels $L$ corresponding to the clustering of the original data points $S$ obeying the periodic boundary conditions.
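The linking and resolving steps (steps 3 and 4) amount to finding connected groups of labels, which can be sketched with a small union-find structure. The sketch below is illustrative and not the packaged implementation: the function name `merge_periodic_labels` and its arguments are hypothetical, the labels are assumed to list the original points first and the padded points after them, and noise is assumed to carry the label -1 (the scikit-learn convention). The handling of noise labels is an assumption of this sketch, not something specified in the text: a noise label is never linked to a cluster label, and an original point labeled as noise adopts the cluster label of its periodic copy if that copy was assigned to a cluster.

```python
def merge_periodic_labels(labels_all, n_original, source_index, noise=-1):
    """Merge cluster labels that belong to the same cluster across periodic boundaries.

    labels_all   : labels of the original points followed by the padded points
    n_original   : number of original points
    source_index : source_index[k] is the original index of padded point k
    Returns the final labels of the original points as a list.
    """
    parent = {}  # union-find forest over cluster labels

    def find(label):
        parent.setdefault(label, label)
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller label as representative of the linked group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    adopted = {}  # original noise points whose periodic copy lies in a cluster
    # Step 3: link labels of padded points to labels of their originals.
    for k, src in enumerate(source_index):
        l_pad = labels_all[n_original + k]
        l_orig = labels_all[src]
        if l_pad == l_orig or l_pad == noise:
            continue
        if l_orig == noise:
            adopted.setdefault(src, l_pad)  # assumption of this sketch
        else:
            union(l_pad, l_orig)

    # Step 4: replace every label by the representative of its linked group.
    final = []
    for i in range(n_original):
        label = labels_all[i]
        if label == noise:
            label = adopted.get(i, noise)
        final.append(label if label == noise else find(label))
    return final
```

The input labels can come from any open-boundary DBSCAN applied to the concatenated original and padded points, for example scikit-learn's `DBSCAN(eps=..., min_samples=...).fit_predict(...)`. Because the smaller root is always kept when two groups are joined, each linked group ends up with the minimum of its labels, matching the choice suggested in step 4.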

Because this approach employs the conventional DBSCAN algorithm with open boundaries, it is automatically compatible with all optimized implementations of DBSCAN and their underlying neighbor search algorithms. The neighborhood distance ϵ is typically small with respect to the domain size, so the number of padded points is typically a small fraction of the total number of points N. The overhead of solving the clustering problem in the periodic domain is thus small compared with the conventional clustering problem with open boundaries. Crucially, owing to this compatibility, the method runs at the same $O(N \log N)$ complexity that the optimized neighbor search algorithms for open boundaries are able to achieve.
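As a rough guide to the size of this overhead, the expected number of padded points can be estimated under a simplifying assumption that is not made in the text: points distributed uniformly in a cubic domain of side L. The extended domain then has side L + 2ϵ in each periodic dimension, so the padded points amount to a fraction (1 + 2ϵ/L)^D_p − 1 of N, where D_p is the number of periodic dimensions. The helper name `padded_fraction` is hypothetical.

```python
def padded_fraction(eps_over_L, n_periodic_dims):
    """Expected ratio of padded points to original points.

    Assumes uniformly distributed points in a cubic domain of side L, so the
    ratio equals the relative volume increase of the extended domain.
    """
    return (1.0 + 2.0 * eps_over_L) ** n_periodic_dims - 1.0


# Ratios for a few combinations of eps/L and number of periodic dimensions.
estimates = {
    (eps_over_L, dims): padded_fraction(eps_over_L, dims)
    for eps_over_L, dims in [(0.009, 3), (0.05, 1), (0.08, 3)]
}
```

Under this assumption, ϵ = 0.009 L in a triply periodic domain gives roughly 5.5% additional points, whereas ϵ = 0.08 L in three dimensions gives roughly 56%, so the overhead grows quickly with both ϵ/L and the number of periodic dimensions. Strongly clustered data can deviate from these estimates in either direction.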

Implementation

We have implemented the proposed approach for DBSCAN in domains with periodic boundaries in a Python package that is publicly available in the repository at github.com/XanderDW/PBC-DBSCAN. It uses the widely employed and highly optimized scikit-learn implementation of DBSCAN [14] to ensure broad compatibility. The repository also provides ready-to-use code examples for running the example cases presented in this work.

Quality control

Examples with synthetic data

Here we provide examples of the proposed approach for the DBSCAN clustering problem with periodic boundaries using data that is synthetically generated from (multivariate) Gaussian distributions.

Figure 2 depicts the simplest example of periodic clustering in one dimension. It shows that the algorithm successfully connects the purple cluster that traverses the periodic boundary.

Figure 2: 1D example of DBSCAN clustering with periodic boundary conditions with periodicity L and neighborhood ϵ = 0.05 L. The example shows the raw data **(a)** and the clustering **(b)**, where different colors represent different clusters, while black points indicate noise points that do not belong to a cluster.

In Figure 3 we show an example in two dimensions, distinguishing between doubly periodic boundary conditions (Figure 3(a,b)) and singly periodic boundary conditions (Figure 3(c,d)).

Figure 3: 2D example of DBSCAN clustering with doubly periodic boundary conditions **(a,b)** and with singly periodic boundary conditions **(c,d)**, where in the latter the left and right boundaries are periodic while the top and bottom boundaries are open. The periodicity is L and the neighborhood is ϵ = 0.08 L. Panels and colors are as in Figure 2.

Finally, Figure 4 shows an example of periodic clustering in three dimensions, where all three dimensions have periodic boundaries.

Figure 4: 3D example of DBSCAN clustering with triply periodic boundary conditions with periodicity L and neighborhood ϵ = 0.08 L. Panels and colors are as in Figure 2.

Our implementation supports data with an arbitrary number of dimensions and can arbitrarily mix open boundaries and periodic boundaries for every dimension separately.

Example with real data

Real-world data often involve clusters with highly non-Gaussian shapes, and DBSCAN is very effective in identifying clusters of such complex shapes. One example is encountered in turbulent flows, when studying the clustering of light bubbles submerged in a heavier turbulent fluid flow. There, bubbles are found to strongly concentrate in regions of high vorticity, forming filamentary clusters inside the cores of these elongated vortex structures [15, 16]. Such clustering behavior is typically studied computationally in domains with periodic boundary conditions to ensure full homogeneity and to eliminate any effect of confinement, such as boundary layer formation. An example is provided in Figure 5, obtained from a direct numerical simulation of homogeneous isotropic turbulence with Lagrangian bubbles [17]. It shows that the clustering algorithm proposed in this work successfully captures the bubble clusters in accordance with the periodic boundary conditions. Notice how, for instance, the turquoise cluster traverses the top/bottom boundary and the purple cluster crosses four different corners of the domain.

Figure 5: Example of DBSCAN clustering on a real dataset of light particles in turbulence in a 3D triply periodic domain with periodicity L and neighborhood ϵ = 0.009 L. Light particles tend to cluster in high-vorticity regions of the flow in filamentary structures. Colors show the six largest clusters of particles as identified by the algorithm. Other clusters are colored in gray for readability.

Conclusions

In this work, we have presented a clustering algorithm based on DBSCAN for data embedded in a domain with periodic boundaries. The approach leverages the conventional DBSCAN algorithm designed for open boundaries, ensuring compatibility with existing optimized neighborhood search methods. As a result, it maintains the same runtime complexity of $O(N \log N)$ as conventional optimized DBSCAN algorithms. Our Python implementation of this method is publicly available as a ready-to-use package in the repository at https://github.com/XanderDW/PBC-DBSCAN.

(2) Availability

Operating system

PBC-DBSCAN works on any operating system that supports a standard Python installation, which includes Linux, Windows, and macOS.

Programming language

Python ≥3.7

Additional system requirements

No special requirements

Dependencies

The following Python library is a required dependency:

– scikit-learn ≥1.0

The following Python libraries are required to run the demo and reproduce the plots shown in the previous sections:

– Jupyter
– NumPy
– Matplotlib

List of contributors

Xander M. de Wit, Alessandro Gabbana

Software location

Code repository

Name: GitHub

Persistent identifier: https://github.com/XanderDW/PBC-DBSCAN

Licence: MIT

Date published: 24/01/2025

Language

English

(3) Reuse potential

PBC-DBSCAN can be used for any clustering problem in data science that involves data with periodicity. Examples include angular data, time in a 24-hour cycle, date in a yearly cycle, and distance traveled on a loop. In particular, there are also many examples in computational physics that treat systems in domains with periodic boundary conditions. The package is written in Python, a widely used programming language. It is well documented, including examples, and publicly available on GitHub. Users can submit questions or comments via GitHub issues or by email, and the authors of the package will do their best to support them.

Competing interests

The authors have no competing interests to declare.