Assessment of an optimal parameter space for spatial cluster detection of SMEAR Estonia flux footprint data using unsupervised learning algorithms

Steffen M. Noe; Anuj Thapa Magar; Emílio Graciliano Ferreira Mercuri

doi:10.2478/fsmu-2025-0002

Introduction

Determining the exchange of matter and energy between terrestrial ecosystems and the atmosphere is a pivotal task in understanding the processes of carbon exchange under the changing climate. Large-scale measurement stations like SMEAR (Station for Measuring Ecosystem-Atmosphere Relations) in Estonia and Finland (Hari & Kulmala, 2005; Kulmala, 2018; Noe et al., 2015), European research infrastructures like ICOS (Franz et al., 2018) and globally operating networks like Fluxnet (Keenan et al., 2021; Keenan et al., 2019) provide vast numbers of eddy flux data that allow to verify carbon uptake and release from terrestrial ecosystems. Addressing the carbon sink capacity of forests is important to develop strategies for carbon sequestration, to mitigate climate change and develop climate-smart forestry solutions.

The concept of a flux footprint (Kljun et al., 2015), the area that is relevant for assessing the ecosystems’ exchange processes, is used in determining the source or sink capacity of a terrestrial ecosystem when applying the eddy covariance technique (Baldocchi, 2014; Rannik et al., 2004). The footprint is subject to constant change by natural processes like forest growth, disturbances like windbreak and by human activities like land use change by, e.g. forest management (Kollo et al., 2023). A footprint is not a static area but a dynamic region that is influenced by the wind direction and speed, atmospheric stability, and the surface roughness of the vegetation (Yu et al., 2018). The footprint area can be estimated by using different models like the Lagrangian particle dispersion model (Hsieh et al., 2000), the backward Lagrangian stochastic model (Kormann & Meixner 2001) or the parameterized footprint model by Kljun et al. (2015). Because it is applicable to a wider range of atmospheric stability and the relative low computational cost the latter is used in this work to determine the flux footprint characteristics of SMEAR Estonia.

The footprint’s shape and the location of the strongest signal in the x-y plane that are detecable by a flux tower depend strongly on the wind patterns, weather phenomena, turbulent transport, tower properties, geography and land cover of the site (Rey-Sanchez et al., 2022). Therefore, not every potential point within the flux footprint contributes at the same scale to a timely integrated carbon flux signal (Rey-Sanchez et al., 2022). The source regions tend to form clusters and the contribution to the ecosystematmosphere exchange of the trees located within these cluster regions will be higher than that of the areas of less coverage by wind patterns.

Density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996) and the hierarchical variation (HDBSCAN) (Campello et al., 2013) are cluster recognition algorithms optimised for large spatial data with noise. They also allow the recognition of clusters via unsupervised machine learning and are applicable to the data without presetting, e.g. the number of clusters necessary for the k-means algorithm (Wu, 2012). Density-based cluster detection also allows outliers, i.e., points in the given spatial data that do not fulfil the connectivity criteria. Eddy covariance relies on the turbulent mixing of the atmosphere (Kowalski & Serrano-Ortiz, 2007) and we will therefore have stable regions and prevailing wind directions as well as rapidly changing wind speeds and directions, very likely causing spatially distributed outliers. We note here that in the context of our data an outlier is just a data point that does not belong to a specific cluster and it does not imply that this data point should be removed from the dataset.

Both clustering algorithms use two hyperparameters: the maximum distance between two samples for one to be considered as in the neighborhood of the other ϵ and the minimum number of points in a region minPts to determine cluster points. In this study, we aimed a) to assess the best hyper-parameter pairs according to the given flux data, and b) to assess the feasibility of the method to find spatially distributed clusters of the maximal flux signal in the tower’s footprint area.

Materials and Methods

Study area

The study area is located in Järvselja in the southeastern part of Estonia and data was obtained from SMEAR Estonia (Noe et al., 2015). The terrain in the study area is flat and the flux tower is at a height of 36 m asl. We used eddy covariance data measured at 70 m above ground, covering an area of up to 4 km distance from the tower reaching up to 3,332 ha whereof about 88% or 2,900 ha are forest land. The other major land cover types are agricultural land (3.6%), clear cuts (2.5%), roads and access ways (2%), water bodies (1.3%), raised bog (1.1%), power lines (0.7%) and the rest are buildings and yards (Noe et al., 2021; Kollo et al., 2023). Flux footprint data represents the exchange of gases from the Järvselja experimental and training forest, which is a drained peatland forest and contains the dominant tree species found in the hemiboreal zone: Scots pine (Pinus sylvestris L.), Norway spruce (Picea abies (L.) H. Karst.), silver birch (Betula pendula Roth), downy birch (Betula pubescens Ehrh.), European aspen (Populus tremula L.), common alder (Alnus glutinosa (L.) Gaertn.) and grey alder (Alnus incana (L.) Moench) (Mercuri et al., 2023).

Locating the point of maximal flux in x-y coordinates

The footprint function defines a 3D space where the location of the flux tower is at the origin of the projected 2D footprint area in the x-y plane. The flux footprint is defined as the mathematical transfer function between the sources and sinks, characterized as passive scalars in a surface area, and the turbulent flux that is measured at the tower (Schmid, 2002; Kljun et al., 2015). Applying Kljun’s footprint model (Kljun et al., 2015) allows to either obtain an area depending on the calculated footprint function over a time interval or for each footprint function calculated at a discrete time of the vector of the footprint function’s maximal point in x-y coordinates. To assess the clustering, we chose the latter option and created a dataset of x-y coordinates for each half hourly flux in our input data. Because the algorithm gives the distance with respect to the wind direction, we construct the cartesian coordinates by: 1 $(x, y) = (\sin (v_{H} \frac{π}{180}), \cos (v_{H} \frac{π}{180})) f_{\max},$ (x,y) = \left( {\sin \left( {{v_H}{\pi \over {180}}} \right),\cos \left( {{v_H}{\pi \over {180}}} \right)} \right){f_{\max }}, where f_max is the distance of the footprint’s maximum in the horizontal wind direction v_H as given in the output of the footprint model.

Cluster algorithms

The choice of cluster algorithms to assess groups of data that follow “human intention” has several difficulties. The intuitive concept of a cluster is poorly defined and depends on the application (McInnes & Healy, 2017). Our choice for DBSCAN and HDBSCAN follows several implications our flux data entail. Knowing that the horizontal wind direction is the major factor to determine the location of the flux footprint’s maximum and the previous research on the SMEAR Estonia flux footprint and horizontal wind direction densities (Figures 4 and 6 in Kollo et al. (2023)) we can assume that our data will have:

- clusters of arbitrary shape
- clusters with varying densities
- clusters of different sizes
- noise and outliers

All these features in our data will rule out algorithms like k-Means where we need to fix the number of clusters beforehand and the assumtion that clusters are spherical and equal in size and density.

The chosen algorithms, DBSCAN and HDBSCAN, are thoroughly described (McInnes & Healy, 2017; Campello et al., 2013; Ester et al., 1996) and have proven their feasibility in the detection of clusters, especially for spatial data. To this point, we would like to note that an intuitive criterion for a cluster is the density of data points. In dense regions, the probability of a cluster is high and in sparce regions the probability for a cluster is low. This can be used to motivate a statistical argument where the number of points within a certain range e is linked to the cluster probability (McInnes & Healy, 2017). We may note further that the core distance is smaller or equal to ϵ and there is a probability density function PDF(x) ≥ λ and we have λ = 1/ϵ. The λ criteron can be used to construct the hierarchy of the HDBSCAN algorithm.

Another metric that impacts on the quality of estimation of the PDF from data are the minimal number of points that determine what is graded a valid cluster. Values that are too low yield to a very “bumpy” PDF where the noise can be graded as mini-clusters. To avoid that, a threshold is set that is called minPts which determines the maximal size of a bump to be graded a peak in the PDF and by that a contributing cluster.

ϵ and minPts are the two hyper-parameters for the cluster determination we need to optimize for our data.

Data processing and cluster hyper-parameter assessment

To assess the best hyper-parameter set for our measurement station we used data from April 2015 until December 2020. Half-hourly x-y coordinates of the 70 m high eddy flux system’s location of maximal contribution, accounting for a total of 68,266 coordinate points, were used to determine ϵ and minPts for monthly clusters. The x-y coordinates are given in metres from the measurement tower in cartesian coordinates using the Estonian plane coordinate system L-EST97 (Maaja Ruumiamet, 2019).

We created for each algorithm a Jupyter notebook (Kluyver et al., 2016) to read the monthly data, scale the x-y coordinates according to Scikit-learn’s (Pedregosa et al., 2011) RobustScaler or StandardScaler function and apply either the DBSCAN or HDBSCAN functions with ϵ = [0, 1] in steps of 0.05 and minPts = [1, 50]. The cutoff at 50 minPts was chosen because the number of clusters was either one large cluster or no cluster was detected.

The scaling functions have the purpose of avoiding large numbers as given by the L-EST97 grid that may lead to numerical difficulties in the cluster estimation procedure. The scalers are defined by: 2 $X_{scaled} = \frac{X - μ}{σ},$ {X_{{\rm{scaled }}}} = {{X - \mu } \over \sigma }, where μ and σ denote the mean and the standard deviation of the coordinate ranges in the x-y plane for the standard scaler and by: 3 $X_{scaled} = \frac{X - Q_{1}}{Q_{3} - Q_{1}},$ {X_{{\rm{scaled }}}} = {{X - {Q_1}} \over {{Q_3} - {Q_1}}}, with the first quartile Q₁ (25th percentile) and third quartile Q₃ (75th percentile) for the robust scaler. The robust scaler reduces the impact of possible outliers in the data.

We calculated the silhouette (Rousseeuw, 1987) and Davies-Bouldin (Davies & Bouldin, 1979) scores, both possible on unlabeled data. The silhouette score is defined as: 4 $S = \frac{b - a}{\max (a, b)},$ S = {{b - a} \over {\max (a,b)}}, where a is the intra-cluster distance and b the nearest cluster distance and S is a metric on how different the data point is from points in other clusters. The Davies-Bouldin score is defined as: 5 $D B = \frac{1}{k} \sum_{i = 1}^{k} \max (\frac{Δ (X_{i}) + Δ (X_{j})}{δ (X_{i}, X_{j})}),$ DB = {1 \over k}\sum\limits_{i = 1}^k {\max } \left( {{{\Delta \left( {{X_i}} \right) + \Delta \left( {{X_j}} \right)} \over {\delta \left( {{X_i},{X_j}} \right)}}} \right), with a dataset X = X₁, X₂, X₃,… and where Δ (X_k) is the intra-cluster distance and δ(X_i, X_j) the intercluster distance calculating the average cluster similarity of each cluster to the cluster most similar with it. The score shows the ratio between intra-cluster and inter-cluster distances.

We used these scores in selection of clustering parameters where lower values denote better estimates of the clusters. Further, we determined the number of clusters and outlier points that do not belong to clusters. The number of core points is also used as a selection criterion while the number of outliers may be used in quality controlling and assessing the cluster connectivity.

The software used for data processing and cluster determination was Python 3.10.9 (Van Rossum & Drake 2009), Pandas 1.5.3 (McKinney, 2010), Numpy 1.24.2 (Harris et al., 2020), Matplotlib 3.6.3 (Hunter, 2007), and Scikit-learn 1.4.1.post1. (Pedregosa et al., 2011).

Results and Discussion

Determining clusters per month on all data by using the two algorithms gave us an estimation for the possible ranges of the hyperparameters related to the SMEAR Estonia stations footprint data. These findings are summarized in Table 1. In short, for DBSCAN ϵ is on average between 0.25 and 0.71, minPts between 11 and 39 regardless of which scaling was chosen. For HDBSCAN e ranged on average between 0.12 and 0.4 for the standard scaling and between 0.09 and 0.37 for the robust scaling, and minPts varied between 19 and 43.

Table 1.

Summary of the parameters obtained from applying clustering algorithms to the strongest signal location within the SMEAR Estonia footprint. The averages and standard deviations of discrete variables have been rounded to the nearest integer. The execution time is measured in seconds per iteration.

Scaler	DBSCAN		HDBSCAN
Scaler	Standard	Robust	Standard	Robust
ϵ	0.48 ± 0.23	0.48 ± 0.23	0.26 ±0.14	0.23 ± 0.14
minPts	25 ± 14	25 ± 14	31 ± 12	31 ± 12
Clusters (min. max)	3 ± 10 (0. 285)	2 ± 6 (0. 293)	3 ± 1 (0. 25)	3 ± 1 (0. 20)
Outliers	169 ± 256	107 ± 202	224 ± 126	209 ± 132
(min. max)	(0.1274)	(0. 1195)	(3. 938)	(3. 963)
Silhouette score	0.31 ± 0.20	0.36 ±0.19	0.27 ± 0.14	0.30 ±0.15
Davies-Bouldin score	3.38 ± 6.99	3.35 ± 8.22	3.03 ±5.11	2.8 ± 1.6
Execution time in s/it	8.3	7.3	6.5	5.6

The average number of estimated clusters ranged from 2 to 4 for HDBSCAN regardless of which scaling was chosen. It is possible that no cluster is detected and the maximal numbers of detected clusters are 25 for the standard and 20 for the robust scaling. In DBSCAN, the variation of cluster estimation was much larger, maximal numbers reached up to 293 in the robust and 285 in the standard scaling cases. This result reflects the improved performance in cluster detection by the hierarchical feature of the cluster detection algorithm. Including the hierarchical component as given by HDBSCAN reduces the influence of the noise in cluster detection and the results get more robust. The about 10 times larger maximal number of detected clusters by DBSCAN shows that the noise in the data is likely to lead to too fuzzy PDFs of the density and therefore to a very high number of mini-clusters that fail to match visible comparison with the data. Therefore, HDBSCAN is the algorithm of choice in cases of large noisy datasets.

The silhouette scores are low for both algorithms and scalings indicate that the clusters detected by the two algorithms are located rather near each other over all the monthly data provided. HDBSCAN has obtained the lower silhouette score for both scaling scenarios and would be the algorithm of choice due to these criteria as well. Also, the Davies-Bouldin score was found lower for HDBSCAN and would suggest using this algorithm preferably on the SMEAR Estonia data. The algorithms’ execution speed (numbers depend on hardware) was found consistently lower for HDBSCAN, also suggesting it as the best choice for the data given in the context of computational efficiency.

Figures 1a and 1b visualize the hyperparameter space for ϵ and minPts for the robust scaling. Figure 1a suggests DBSCAN according to the marginal distributions for which the fitting hyper-parameters are ϵ = 0.3 and minPts ∈ [17, 20]. We would obtain about three clusters. In the case of HDBSCAN (Figure 1b) the cluster detection is finer-granulated even though we used exactly the same steps during calculations. In line with the results in Table 1 we have the highest probability with small ϵ ≈ 0.2 and also minPts ∈ [17, 20] and the number of clusters detected are between 3 and 4 mostly in that parameter range.

Interestingly we found a certain pattern in both algorithms’ parameter spaces. For ϵ ≈ 0.2 and 0.5 the number of clusters is higher. If minPts is in the smaller range the largest number of clusters that can be found here is up to 4. In the range of ϵ ≈ 0.3 to 0.5 and minPts ≤ 15 an area with a lower number of clusters detected was found. This seemed to be more prominent for the DBSCAN algorithm. Examples for cluster detections using similar hyper-parameters for both algorithms are shown in Figures 2 and 3. In Figure 2, when comparing the input data, we found that both algorithms – HDBSCAN and DBSCAN – capture the most dense clusters in an almost similar way but DBSCAN misses the less connected cluster centered at about [2.5, 0]. Instead, DBSCAN locates a very small cluster between the two most dense regions. HDBSCAN classifies this region with a lower probability as part of the largest cluster it detected. In the case of Figure 3, both algorithms were able to detect the clusters visible in the scaled data. In this case, HDBSCAN omitted one potential small cluster that was detected by DBSCAN.

Conclusion

Both algorithms resulted in preferable intervals for the hyper-parameters. On average, we found about three clusters for the maximal signal of ecosystem flux exchange on a monthly basis when we used all input data. Table 1 and Figures 1a and 1b confirm this finding. In that sense, both algorithms are capable of detecting clusters in the spatial data provided from the SMEAR Estonia station. The HDBSCAN algorithm led to a finer-grained detection (Figure 1), had a better retrieval of clusters which is seen by the lower Davies-Bouldin score, got the more consolidated number in maximal detected clusters and generally gave a smaller range of outliers.

We can conclude that the choice of scaling does not introduce very large differences in cluster detection for the most cases in our data. Usually, scaling prevents data from exhibiting large outliers in the features and eliminates distortions between the features (Chanal et al., 2021). The major effect in our case is the re-scaling of the Estonian plane coordinate system’s large numbers into a smaller domain ranging between [-3, 3] in both axes avoiding very large numbers in the calculation. Since the x and y coordinates show minimal scale distortion, the StandardScaler transformation – mean removal and scaling to unit variance – yields results that are very similar to those obtained with RobustScaler, which centers data by the median and scales by the interquartile range (Pedregosa et al., 2011). The latter scaling algorithm is less affected by outliers and the choice of the robust scaling algorithm is safe for all types of data used in our work.

Calculating over a range of possible parameter values allows to decide which parameter pairs will be promising to detect clusters by unsupervised learning from the data. The choice of similar parameter pairs for both algorithms yielded very similar results (Figures 2 and 3) which means choosing either algorithm would give reasonable results. Depending on our results the HDBSCAN algorithm would be preferable in terms of finer granularity and lower computing times.

In summary, we would encourage the use of both techniques tested in this study to assess clusters in eddy covariance flux data. The safer choice is according to our results the use of HDBSCAN with the (RobustScaler). Since CO2 and H₂O fluxes have strong dependence on soil and vegetation types, the determination of clusters of maximum flux contribution can help to understand the carbon and water dynamics of the forest ecosystem-atmosphere. The effects of different species and forest management practices within the flux tower footprint can be linked to spatially explicit areas, enabling comparisons and experiments to determine which measures improve carbon uptake and support climate mitigation strategies.

Assessment of an optimal parameter space for spatial cluster detection of SMEAR Estonia flux footprint data using unsupervised learning algorithms

Full Article

Paradigm

My account