Towards Explainable Graph Spectral Clustering for BERT Embeddings

Mieczysław A. Kłopotek; Sławomir T. Wierzchoń; Bartłomiej Starosta; Piotr Borkowski; Dariusz Czerski

doi:10.14313/jamris-2026-005

Full Article

1.

Introduction

As noted in [1], accessing textual information has never been easy in human history. There was a time when text documents were scarce; today, we face an overwhelming stream of diverse materials. However, in each case, creating overviews of document collections is vital. For centuries, subject indexing and classification systems have served this purpose. In the past, it was possible to prepare a list of categories and then manually add documents to that list. However, today there is a growing urgency to automate both tasks.

Although there exist elaborate human-made category dictionaries for general purposes, specialized collections of documents require specialized local methods for splitting the documents into reasonable groups and labeling them appropriately. The latter task seems to require the involvement of automated methodologies.

Clustering methodologies and accompanying technologies for explaining group membership can prove to be helpful supporting tools here.

Methods of clustering text documents have already found numerous applications. They can help uncover connections between different texts, group them into “natural” categories, or pinpoint the most relevant topics in their content and express them in distinct terms. They are used for rapid information retrieval and filtering, topic extraction, and automatic document organization.

Clustering requires the development of a similarity measure between documents. The simplest way would be to embed the documents in an Euclidean space and (1) to use the distance between the embedding points and apply common algorithms, such as k-means [2] for clustering in such a space, or (2) to use, for example, the cosine similarity between embedding vectors and then use similarity-based algorithms like those of the Graph Spectral Clustering class [3]. Although a multitude of similarity measures have been proposed [4], cosine similarity seems to reflect semantic similarity quite well [5].¹ The latter is embraced in this paper.

Many types of embedding spaces have been developed since the very first proposal of term vector space (TVS) [6], [7]. Despite its straightforwardness, TVS has two notable drawbacks: (1) the document is regarded as a collection (or “bag”) of words, which leads to the loss of context and the relationships among terms, and (2) the dimensionality can be as high as dozens of thousands, even for a moderate-size collection of several thousand documents. Therefore, new embedding approaches, such as Word2vec [8], Doc2Vec [9], GloVe [10], or BERT [11], and many others (see [12-14], or https://huggingface.co/spaces/mteb/leaderboard), were developed to accommodate relationships between terms, and to reduce dimensionality, although dimensionality can still be high, reaching hundreds of dimensions.

Fortunately, an embedding space with a small number of dimensions can be sufficient if GSC is used for clustering.

Usually, we do not need only the clusters of textual documents but also a characterization of their content in terms of keywords, for example. The usual approach to this task was to get the clustering first, and then to seek differences/similarities in the word sets related to each of the clusters. This approach can be termed the “characterization of clusters.” It identifies the features of the obtained clusters but does not explain why a document belongs to one cluster rather than another. Yet often, we want to understand why a document is part of a group and why the algorithm put it in that group. We shall term this “cluster membership explanation”. It seems like a simple process if we cluster the documents directly in the TVS via, e.g., k-means algorithm. This is because the center coordinates of the clusters show how important the words/terms are for the cluster membership.

However, both when clustering in the modern embedding spaces, like word2vec, doc2vec, GloVe, BERT and other transformer methods, and when using GSC based clustering, we get results that are hard to explain as the relationship between the embedding space coordinates and document words/terms gets lost which is counterproductive with respect to the librarian’s goal to assign some automated subject index. For example, in GSC, the clustering process is completely detached from the words, as only similarities are used. BERT embedding, on the other hand, makes the relationship between words and document embedding completely unrecoverable.

In this paper, we make an attempt to overcome the mentioned problems with explainability. We outline a methodology for the explanation of GSC results performed in a BERT embedding, whereby cosine similarity is used as similarity method. Before doing so in section 4, we first recall some related work. In section 2 we provide a brief overview of the methodologies behind BERT embeddings of documents. In section 3, we present an introduction to GSC methodology. In Section 5 we present experiments supporting the validity of our approach. Section 6 concludes the paper.

2.

BERT Embedding Extractions

BERT (Bidirectional Encoder Representations from Transformers), available since 2018, is a method of pretraining language representations of natural language text modeling [11]. One can either use these models to extract high-quality language features from one’s text data, or one can fine-tune these models on a specific machine learning (ML) task, such as classification, entity recognition, or question answering, based on one’s dataset. There exist multiple BERT models nowadays, starting with the so-called bert-base-uncased, a 110 M parameter model containing 12 layers (blocks) and working with lowercase text.²

To extract the contextual embedding of each word in the sentence, one needs to tokenize the document and feed the tokens, together with positional and token type information, to the pre-trained BERT, which will return the embeddings for each of the tokens. Apart from obtaining the token-level (word-level) representation, one can also obtain the document level representation. The result is obtained by passing the input through multiple (12 in bert-base-uncased) neural layers. Each layer (called a transformer block) consists of two sublayers. The first sub-layer implements a multi-head self-attention mechanism, and the second one is a position-wise fully connected feed-forward network. The output of each sublayer is added to the input that bypasses the sublayer through a residual connection, and the resulting signal ends in layer normalization that is passed to the next layer. At the final layer, one gets a representation of each token at each position. Note that generally, BERT models return token embeddings relative to a given sentence/document. This means that the same word can have different embeddings in different sentences, and even in the same sentence when it is used several times.

As explained by [15], there are multiple ways of extracting embeddings of words, sentences, or documents from pre-trained BERT models.

As words are concerned, there exist multiple possibilities to obtain a static word embedding. For example, the “Averaged BERT” method consists of averaging the representations of the same word in its different contexts to acquire a static embedding. The “Averaged plus Regular BERT” approach means that the representation of a word in a given context is the sum of the original BERT representation plus the mentioned averaged value. One may not restrict oneself to the last layer only, but use, e.g., last 4 layers. Other variants are also discussed in the literature.

The simplest representation of the entire text is to use the average of the embeddings of individual words. More advanced is the weighting of words with the logarithm of inverted document frequency.

The obtained vectors may be normalized to the unit length, or via dividing by their standard deviations, and in many other ways.

BERT was used in ML applications requiring explanation. For instance, [16] utilized the BERT model for the purpose of medical classification, using a two-stage model. The first stage was used for purposes of document embedding, while the second stage was trained to predict the classification. Explanations were word-based, where the word importance for classification decision was produced based on a gradient-based method called integrated gradients [17, 18].

[19] discusses the explainability of BERT model results in the task of sentiment analysis.

3.

A Short Introduction to GSC

Let us now explain how to exploit the properties of the Graph Spectral Clustering methodology when clustering documents in BERT embedding. Usage of GSC for clustering instead of direct clustering in the BERT embedding has the advantage of reducing the dimensionality of the clustering problem by an order of magnitude. As we will show subsequently, the result of GSC approximates the result of direct clustering in BERT embedding. This allows clustering in a low-dimensional space. In addition, it is possible to build an explanatory bridge to the GSC/BERT combination.

Let us characterize the GSC approach to clustering briefly. A more detailed description can be found, for example, in [3].

Let S be a (symmetric) similarity matrix between pairs of items (documents, like tweets in our case). It induces a graph whose nodes correspond to the entities (documents), hence the “Graph” part of the GSC name. Let n denote the number of items for which S has been computed. Let D be the diagonal matrix with $d_{j j} = \sum_{k = 1}^{n} s_{j k}$ {d_{jj}} = \sum\nolimits_{k = 1}^n {{s_{jk}}} for each j ∈ [n].

In the domain of text mining, the mentioned similarity matrix is usually based on either a graph representation of relationships (links, hyperlinks) between items, or such a graph is induced by (cosine) similarity measures between these items. However, mixed object representations (text and links) have also been studied, for example by [20]. In this paper, we use the cosine similarity between BERT embedding vectors. Hence, the elements of S are of the form S_i𝓁 = 𝑔(δ_i)^𝑇 𝑔(δ_𝓁), where 𝑔(δ_i) stands for the BERT embedding for document δ_i. We assume that all these embeddings are vectors of unit length. Additionally, by GSC convention, all diagonal elements of S are zero. (Unnormalized or) combinatorial Laplacian L corresponding to this matrix is defined as 1 $L = D - S .$ L = D - S{\rm{ }}.

A normalized Laplacian ℒ of the graph represented by S is defined as 2 $ℒ = D^{- 1 / 2} L D^{- 1 / 2} = I - D^{- 1 / 2} S D^{- 1 / 2} .$ {\cal L} = {D^{ - 1/2}}L{D^{ - 1/2}} = I - {D^{ - 1/2}}S{D^{ - 1/2}}.

The rationormalized Laplacian³ takes the form 3 $ℒ_{ℛ} = {D^{'}}^{- 1 / 2} L {D^{'}}^{- 1 / 2} = I - {D^{'}}^{- 1 / 2} S^{'} {D^{'}}^{- 1 / 2}$ {{\cal L}_{\cal R}} = {D'^{ - 1/2}}L{D'^{ - 1/2}} = I - {D'^{ - 1/2}}S'{D'^{ - 1/2}} where S′ = S + I and D′ = D + I.

The split into k disjoint clusters is achieved as follows. One computes the eigen-decomposition of the respective Laplacian (e.g., L, ℒ, or ℒ_ℛ), getting n eigen-values λ₁ ≤ … ≤ λ_n (always λ₁ = 0, due to mathematical properties of the mentioned Laplacian’s) and corresponding eigenvectors v₁, …, v_n. Then one embeds the documents into the k-dimensional space spanned by the k eigenvectors corresponding to k lowest positive eigenvalues. That is, i-th document is represented by the vector 𝔵_i = [𝑣_i,2,…,𝑣_i,𝑘+1]^T. This shall be called L-embedding, resp. N-embedding, or R-embedding if the eigenvectors are determined from the combinatorial (resp. normalized or rationormalized) Laplacian L, ℒ, or ℒ_ℛ. Mathematical properties imply that the eigenvector v₁ of the combinatorial Laplacian is a constant vector; hence, to get an informative embedding, we use the eigenvectors v₂, …, v_k+1. We also apply this convention to the other embeddings, that is N and R for consistency. Once a proper embedding has been determined, it is possible to cluster the documents in a chosen embedding space using, for example, the k-means algorithm. See [3] or [21] for details. k-means clustering in the L-embedding shall be called L-based clustering. N-based clustering and R-based clustering be defined analogously. Note also, that the L-based clustering allows us to approximate the clustering optimizing so-called RCut criterion. In contrast, the N-based clustering approximates the clustering optimizing NCut criterion, while R-based clustering allows the approximation of the NRCut clustering criterion, a mixture of both.

Note, that it is possible to formalize the three clustering criteria of the form 4 $Q^{[G S A 𝔵]} (Γ) = \sum_{j = 1}^{k} \sum_{i \in C_{j}} {‖ 𝔵_{i} - μ (C_{j}) ‖}^{2}$ {Q^{[GSA{\rm{x}}]}}(\Gamma ) = \sum\limits_{j = 1}^k {\sum\limits_{i \in {C_j}} {{{\left\| {{{\rm{x}}_i} - \mu \left( {{C_j}} \right)} \right\|}^2}} } 5 $= \frac{1}{2} \sum_{j = 1}^{k} \frac{1}{| C_{j} |} \sum_{i \in C_{j}} \sum_{ℓ \in C_{j}} {‖ 𝔵_{i} - 𝔵_{ℓ} ‖}^{2}$ = {1 \over 2}\sum\limits_{j = 1}^k {{1 \over {\left| {{C_j}} \right|}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\ell \in {C_j}} {{{\left\| {{{\rm{x}}_i} - {{\rm{x}}_\ell }} \right\|}^2}} } where Γ = {𝐶₁,…,𝐶_𝑘} is a partition of the set of documents, and 𝔵_i is a vector with components x_ij(j ∈ 1 : k) being the i-th entries of the first k eigenvectors of the appropriate matrix L, ℒ, or 𝑅. That is if v_j is an eigenvector, then x_ij = v_j,i.

The j-th cluster center is defined 6 $μ (C_{j}) = \frac{1}{| C_{j} |} \sum_{i \in C_{j}} 𝔵_{i}$ \mu \left( {{C_j}} \right) = {1 \over {\left| {{C_j}} \right|}}\sum\limits_{i \in {C_j}} {{{\rm{x}}_i}} where 𝔵_i represents embedding of i-th document as described above.

The formula (4), is precisely the loss function of the k-means algorithm in the respective (Euclidean) embedding space. So, it is quite natural that their minima are sought by applying the traditional k-means algorithm.

Graph clustering was primarily associated with the quantity 7 $c u t (C_{j}, {\bar{C}}_{j}) = \sum_{i \in C_{j}} \sum_{ℓ \notin C_{j}} s_{i ℓ}$ cut\left( {{C_j},{{\bar C}_j}} \right) = \sum\limits_{i \in {C_j}} {\sum\limits_{\ell \notin {C_j}} {{s_{i\ell }}} } which represents the aggregate similarity of nodes that are neighbors in a given graph but belong to different clusters. Properly normalizing this quantity we obtain three different criteria, i.e., 8a $R C u t (Γ) = \sum_{j = 1}^{k} \frac{c u t (C_{j}, {\bar{C}}_{j})}{| C_{j} |} = \sum_{j = 1}^{k} \frac{1}{| C_{j} |} \sum_{i \in C_{j}} \sum_{ℓ \notin C_{j}} s_{i ℓ}$ RCut(\Gamma ) = \sum\limits_{j = 1}^k {{{cut\left( {{C_j},{{\bar C}_j}} \right)} \over {\left| {{C_j}} \right|}}} = \sum\limits_{j = 1}^k {{1 \over {\left| {{C_j}} \right|}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\ell \notin {C_j}} {{s_{i\ell }}} } 8b $N C u t (Γ) = \sum_{j = 1}^{k} \frac{c u t (C_{j}, {\bar{C}}_{j})}{V_{j}} = \sum_{j = 1}^{k} \frac{1}{V_{j}} \sum_{i \in C_{j}} \sum_{ℓ \notin C_{j}} s_{i ℓ}$ NCut(\Gamma ) = \sum\limits_{j = 1}^k {{{cut\left( {{C_j},{{\bar C}_j}} \right)} \over {{{\cal V}_j}}}} = \sum\limits_{j = 1}^k {{1 \over {{{\cal V}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\ell \notin {C_j}} {{s_{i\ell }}} } 8c $N R C u t (Γ) = \sum_{j = 1}^{k} \frac{c u t (C_{j}, {\bar{C}}_{j})}{V_{j}^{'}} = \sum_{j = 1}^{k} \frac{1}{{V^{'}}_{j}} \sum_{i \in C_{j}} \sum_{ℓ \notin C_{j}} s_{i ℓ}$ NRCut(\Gamma ) = \sum\limits_{j = 1}^k {{{cut\left( {{C_j},{{\bar C}_j}} \right)} \over {{\cal V}_j^\prime }}} = \sum\limits_{j = 1}^k {{1 \over {{{{\cal V}'}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\ell \notin {C_j}} {{s_{i\ell }}} } where 9 ${V^{'}}_{j} = \sum_{i \in C_{j}} (d_{i i} + 1) = V_{j} + | C_{j} | .$ {{\cal V}'_j} = \sum\limits_{i \in {C_j}} {\left( {{d_{ii}} + 1} \right)} = {{\cal V}_j} + \left| {{C_j}} \right|.

The last formula allows us to consider the NRCut clustering criterion as a mixture of NCut and RCut.

Note the symmetry between the k-means clustering criterion and the cutting criterion. In (4), the averaged difference (measured by the Euclidean distance) between the members of the j-th cluster is minimized, while in (8), the averaged similarity between neighboring nodes assigned to different clusters is minimized, respectively.

4.

Our Approach

Our approach to explanation differs from that presented, e.g., by [18] as their methodology is (1) targetting explanation for classification while we aim at explanation of clustering results, (2) the classification mechanism they use is a black-box itself, while we aim at a methodology linking the clustering result cleanly to the textual content of documents, without referring to mysterious coefficients.

4.1.

Preparing BERT Embedding for Explanation

As recalled in section 2, one can get an embedding of an entire sentence as well as embeddings of each word occurrence in the sentence in multiple ways. Let us try to get a kind of uniform representation for them. Assume we have a collection 𝒟 of documents δ₁, …, δ_n and a dictionary 𝒯 consisting of words 𝔴₁,…,𝔴_m. A word 𝔴_t occurs in document δ_i 𝑜(𝑡,i) times. Let us define a function ℰ(δ_i) returning the BERT embedding in the space ℝ^d of the document δ_i and a function ℰ(𝔴_𝑡,𝑝,δ_i) returning the BERT embedding of the p^th occurrence in the space ℝ^w of the word 𝔴_t in the document δ_i [11]. The dimensions d and w are assumed to be identical.

Furthermore, as we want to compute cosine similarities, assume that ℰ*(δ_i) = ℰ(δ_i)/‖ℰ(δ_i)‖, and ℰ*(𝔴_𝑡,𝑝,δ_i) = ℰ(𝔴_𝑡,𝑝,δ_i)/‖ℰ(𝔴_𝑡,𝑝,δ_i)‖.

Frequently, the embedding extraction technologies induce that ℰ(δ_i) is a linear combination of all occurrences 𝔴_j,p but it does not need to be so [15].

However, the dimensionality d is higher than the maximum number of word occurrences in a document, we can assume that there exists a set of nonnegative coefficients f for each document δ_i such that $‖ \sum_{w_{t} \in δ_{i}} \sum_{p = 1}^{o (t, i)} f_{i, j, p} ℰ^{*} (w_{t}, p, δ_{i}) - ℰ^{*} (δ_{i}) ‖$ \left\| {\sum\nolimits_{{_t} \in {\delta _i}} {\sum\nolimits_{p = 1}^{o(t,i)} {{f_{i,j,p}}{{\cal E}^*}\left( {{_t},p,{\delta _i}} \right) - {{\cal E}^*}\left( {{\delta _i}} \right)} } } \right\| is minimized (constrained minimization).

In other words 10 $ℰ^{*} (δ_{i}) \approx \sum_{w_{t} \in δ_{i}} \sum_{p = 1}^{o (t, i)} f_{i, j, p} ℰ^{*} (w_{t}, p, δ_{i}) .$ {{\cal E}^*}\left( {{\delta _i}} \right) \approx \sum\limits_{{_t} \in {\delta _i}} {\sum\limits_{p = 1}^{o(t,i)} {{f_{i,j,p}}} } {{\cal E}^*}\left( {{_t},p,{\delta _i}} \right).

Let us define the importance (or contribution) of a word for the document in such a way that the sum of the embeddings of all occurring words is equal to 1 (approximately), while closeness of a word embedding vector to the document embedding vector is reflected by higher values.

11

i m p o r t a n c e (w_{t}, δ_{i}) = \sum_{p = 1}^{o (t, i)} f_{i, j, p} ℰ^{*} {(w_{t}, p, δ_{i})}^{T} ℰ^{*} (δ_{i}) .

importance\left( {{_t},{\delta _i}} \right) = \sum\limits_{p = 1}^{o(t,i)} {{f_{i,j,p}}} {{\cal E}^*}{\left( {{_t},p,{\delta _i}} \right)^T}{{\cal E}^*}\left( {{\delta _i}} \right).

Clearly, if the document embedding is a linear combination of token embeddings, then $\sum_{w_{t} \in δ_{i}} i m p o r t a n c e (w_{t}, δ_{i}) = 1$ \sum\nolimits_{{_t} \in {\delta _i}} {importance} \left( {{_t},{\delta _i}} \right) = 1

Let us cluster the documents in this embedding using k-means. That is, we minimize the loss function 12 $Q^{[B E R T]} (Γ) = \sum_{j = 1}^{k} \sum_{i \in C_{j}} {‖ ℰ^{*} (δ_{i}) - μ (C_{j}) ‖}^{2}$ {Q^{[BERT]}}(\Gamma ) = \sum\limits_{j = 1}^k {\sum\limits_{i \in {C_j}} {{{\left\| {{{\cal E}^*}\left( {{\delta _i}} \right) - \mu \left( {{C_j}} \right)} \right\|}^2}} } where Γ = {𝐶₁,…,𝐶_𝑘} is a partition of the set of documents.

Each cluster 𝐶_𝑗 will have a cluster center (or prototype) 𝜇(𝐶_𝑗) such that 13 $μ (C_{j}) = \frac{1}{| C_{j} |} \sum_{δ_{i} \in C_{j}} ℰ^{*} (δ_{i}) .$ \mu \left( {{C_j}} \right) = {1 \over {\left| {{C_j}} \right|}}\sum\limits_{{\delta _i} \in {C_j}} {{{\cal E}^*}} \left( {{\delta _i}} \right).

By analogy, the importance of a word for the cluster may be deined as 14 $\begin{array}{l} i m p o r t a n c e (w_{t}, C) = \\ = μ {(C)}^{T} (\frac{1}{| C |} \sum_{δ_{i} \in C} \sum_{p = 1}^{o (t, i)} f_{i, j, p} ℰ^{*} (w_{t}, p, δ_{i})) . \end{array}$ \matrix{ {importance{\rm{ }}\left( {{_t},C} \right) = } \hfill \cr {\, = \mu {{(C)}^T}\left( {{1 \over {|C|}}\sum\limits_{{\delta _i} \in C} {\sum\limits_{p = 1}^{o(t,i)} {{f_{i,j,p}}} } {{\cal E}^*}\left( {{_t},p,{\delta _i}} \right)} \right).} \hfill \cr }

By sorting the words by their decreasing importance, we get the knowledge which word explains best cluster membership.

Note, that 15 $i m p o r t a n c e (w_{t}, C) = \frac{1}{| C |} \sum_{δ_{i} \in C} i m p o r t a n c e (w_{t}, δ_{i}) .$ importance\left( {{_t},C} \right) = {1 \over {|C|}}\sum\limits_{{\delta _i} \in C} {importance} \left( {{_t},{\delta _i}} \right).

4.2.

Weighted Clustering in BERT Embedding

Consider now a modified vector in the BERT embedding. Let ω_i = d_ii. 16 $u^{*} (δ_{i}) = (\frac{ℰ^{*} (δ_{i})}{ω_{i}}, g_{i})$ {u^*}\left( {{\delta _i}} \right) = \left( {{{{{\cal E}^*}\left( {{\delta _i}} \right)} \over {{\omega _i}}},{{\bf{g}}_{\bf{i}}}} \right) where g_i is a vector of dimension n, equal to zero everywhere except the ith element, $g_{i i} = \frac{\sqrt{ω_{i} - 1}}{ω_{i}}$ {g_{ii}} = {{\sqrt {{\omega _i} - 1} } \over {{\omega _i}}}. With this notation, let us compute the squared distance between two documents: (for different i, 𝓁) 17 $\begin{array}{l} {‖ U^{*} (δ_{i}) - U^{*} (δ_{ℓ}) ‖}^{2} = \\ = {‖ U^{*} (δ_{i}) ‖}^{2} + {‖ U^{*} (δ_{ℓ}) ‖}^{2} - 2 U^{*} {(δ_{ℓ})}^{T} U^{*} (δ_{i}) \end{array}$ \matrix{ {{{\left\| {{{\cal U}^*}\left( {{\delta _i}} \right) - {{\cal U}^*}\left( {{\delta _\ell }} \right)} \right\|}^2} = } \hfill \cr {\, = {{\left\| {{{\cal U}^*}\left( {{\delta _i}} \right)} \right\|}^2} + {{\left\| {{{\cal U}^*}\left( {{\delta _\ell }} \right)} \right\|}^2} - 2{{\cal U}^*}{{\left( {{\delta _\ell }} \right)}^T}{{\cal U}^*}\left( {{\delta _i}} \right)} \hfill \cr } 18 $= \frac{ω_{i} - 1 + 1}{ω_{i}^{2}} + \frac{ω_{ℓ} - 1 + 1}{ω_{ℓ}^{2}} - 2 \frac{s_{i ℓ}}{ω_{i} ω_{ℓ}}$ = {{{\omega _i} - 1 + 1} \over {\omega _i^2}} + {{{\omega _\ell } - 1 + 1} \over {\omega _\ell ^2}} - 2{{{s_{i\ell }}} \over {{\omega _i}{\omega _\ell }}} 19 $= \frac{1}{ω_{i}} + \frac{1}{ω_{ℓ}} - 2 \frac{s_{i ℓ}}{ω_{i} ω_{ℓ}} .$ = {1 \over {{\omega _i}}} + {1 \over {{\omega _\ell }}} - 2{{{s_{i\ell }}} \over {{\omega _i}{\omega _\ell }}}.

The dissimilarity is greater when the vectors are longer, but smaller if the dot product is bigger. Under this assumption, let us perform weighted k-means clustering (for weighted kernel-k-means see, e.g., [22]). That is, we minimize the loss function 20 $Q^{[ω B E R T]} (Γ; ω) = \sum_{j = 1}^{k} \sum_{i \in C_{j}} ω_{i} {‖ U^{*} (δ_{i}) - μ_{ω} (C_{j}) ‖}^{2}$ {Q^{[\omega BERT]}}(\Gamma ;\omega ) = \sum\limits_{j = 1}^k {\sum\limits_{i \in {C_j}} {{\omega _i}} } {\left\| {{{\cal U}^*}\left( {{\delta _i}} \right) - {\mu _\omega }\left( {{C_j}} \right)} \right\|^2} where Γ = {𝐶₁,…,𝐶_𝑘} is a partition of the set of documents. Each cluster C_j will have a cluster center (or prototype) 𝜇_𝜔(C_j)s such that 21 $μ_{ω} (C_{j}) = \frac{1}{\sum_{δ_{i} \in C_{j}} ω_{i}} \sum_{δ_{i} \in C_{j}} ω_{i} U^{*} (δ_{i})$ {\mu _\omega }\left( {{C_j}} \right) = {1 \over {\sum\nolimits_{{\delta _i} \in {C_j}} {{\omega _i}} }}\sum\limits_{{\delta _i} \in {C_j}} {{\omega _i}} {{\cal U}^*}\left( {{\delta _i}} \right) 22 $\begin{array}{l} Q^{[ω B E R T]} (Γ; ω) = \\ = \sum_{j = 1}^{k} \frac{1}{2 V_{j}} \sum_{i \in C_{j}} \sum_{\begin{matrix} ℓ \in C_{j} \\ ℓ \neq i \end{matrix}} ω_{i} ω_{ℓ} {‖ U^{*} (δ_{i}) - U^{*} (δ_{ℓ}) ‖}^{2} \end{array}$ \matrix{ {{Q^{[\omega BERT]}}(\Gamma ;\omega ) = } \hfill \cr {\, = \sum\limits_{j = 1}^k {{1 \over {2{{\cal V}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\scriptstyle \ell \in {C_j} \atop \scriptstyle \ell \ne i} {{\omega _i}} } {\omega _\ell }{{\left\| {{{\cal U}^*}\left( {{\delta _i}} \right) - {{\cal U}^*}\left( {{\delta _\ell }} \right)} \right\|}^2}} \hfill \cr } 23 $= \sum_{j = 1}^{k} \frac{1}{2 V_{j}} \sum_{i \in C_{j}} \sum_{\begin{matrix} ℓ \in C_{j} \\ ℓ \neq i \end{matrix}} ω_{i} ω_{ℓ} (\frac{1}{ω_{i}} + \frac{1}{ω_{ℓ}} - 2 \frac{s_{i ℓ}}{ω_{i} ω_{ℓ}})$ = \sum\limits_{j = 1}^k {{1 \over {2{{\cal V}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\scriptstyle \ell \in {C_j} \atop \scriptstyle \ell \ne i} {{\omega _i}} } {\omega _\ell }\left( {{1 \over {{\omega _i}}} + {1 \over {{\omega _\ell }}} - 2{{{s_{i\ell }}} \over {{\omega _i}{\omega _\ell }}}} \right) 24 $= \sum_{j = 1}^{k} \frac{1}{2 V_{j}} \sum_{i \in C_{j}} \sum_{\begin{matrix} ℓ \in C_{j} \\ ℓ \neq i \end{matrix}} (ω_{i} + ω_{ℓ} - 2 s_{i ℓ}) .$ = \sum\limits_{j = 1}^k {{1 \over {2{{\cal V}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\scriptstyle \ell \in {C_j} \atop \scriptstyle \ell \ne i} {\left( {{\omega _i} + {\omega _\ell } - 2{s_{i\ell }}} \right)} } .

Note, that the 𝒰*(δ_i) vectors are of higher dimension. There is no need to use them in practice during the clustering process. One uses them only “mentally” for the sake of explanation. As proven in [23], clusters resulting from clustering the 𝒰*(δ_i) vectors are the same as clusters resulting from clustering using GSC based on normalized Laplacians.

As for explanation purposes note, that 25 $μ_{ω} (C_{j}) = \frac{1}{\sum_{δ_{i} \in C_{j}} ω_{i}} \sum_{δ_{i} \in C_{j}} (ℰ^{*} (δ_{i}), ω_{i} g_{i})$ {\mu _\omega }\left( {{C_j}} \right) = {1 \over {\sum\nolimits_{{\delta _i} \in {C_j}} {{\omega _i}} }}\sum\limits_{{\delta _i} \in {C_j}} {\left( {{{\cal E}^*}\left( {{\delta _i}} \right),{\omega _i}{{\bf{g}}_{\bf{i}}}} \right)} 26 $μ_{ω} (C_{j}) = \frac{1}{\sum_{δ_{i} \in C_{j}} ω_{i}} (\sum_{δ_{i} \in C_{j}} ℰ^{*} (δ_{i}), \sum_{δ_{i} \in C_{j}} ω_{i} g_{i})$ {\mu _\omega }\left( {{C_j}} \right) = {1 \over {\sum\nolimits_{{\delta _i} \in {C_j}} {{\omega _i}} }}\left( {\sum\limits_{{\delta _i} \in {C_j}} {{{\cal E}^*}} \left( {{\delta _i}} \right),\sum\limits_{{\delta _i} \in {C_j}} {{\omega _i}} {{\bf{g}}_{\bf{i}}}} \right) 27 $μ_{ω} (C_{j}) = \frac{1}{\sum_{δ_{i} \in C_{j}} ω_{i}} (| C_{j} | μ (C_{j}), \sum_{δ_{i} \in C_{j}} ω_{i} g_{i}) .$ {\mu _\omega }\left( {{C_j}} \right) = {1 \over {\sum\nolimits_{{\delta _i} \in {C_j}} {{\omega _i}} }}\left( {\left| {{C_j}} \right|\mu \left( {{C_j}} \right),\sum\limits_{{\delta _i} \in {C_j}} {{\omega _i}} {{\bf{g}}_{\bf{i}}}} \right). which means that the word importance can be computed in analogy to the unweighted case, but with the distinction that potentially the clusters are different due to weighted clustering.

The weighted importance can be defined as 28 $\begin{array}{l} i m p o r t a n c e_{ω} (w_{t}, C) = \\ = μ_{ω} {(C_{j})}^{T} (\frac{1}{| C |} \sum_{δ_{i} \in C} \sum_{p = 1}^{o (t, i)} f_{i, j, p} (ℰ^{*} (δ_{i}), ω_{i} g_{i})) \\ = μ_{ω} {(C_{j})}^{T} (\frac{1}{| C |} \sum_{δ_{i} \in C} (ℰ^{*} (δ_{i}), ω_{i} g_{i}) \sum_{p = 1}^{o (t, i)} f_{i, j, p}) . \end{array}$ \matrix{ {importance{{\rm{ }}_\omega }\left( {{_t},C} \right) = } \hfill \cr { = {\mu _\omega }{{\left( {{C_j}} \right)}^T}\left( {{1 \over {|C|}}\sum\limits_{{\delta _i} \in C} {\sum\limits_{p = 1}^{o(t,i)} {{f_{i,j,p}}} } \left( {{{\cal E}^*}\left( {{\delta _i}} \right),{\omega _i}{{\bf{g}}_{\bf{i}}}} \right)} \right)} \hfill \cr { = {\mu _\omega }{{\left( {{C_j}} \right)}^T}\left( {{1 \over {|C|}}\sum\limits_{{\delta _i} \in C} {\left( {{{\cal E}^*}\left( {{\delta _i}} \right),{\omega _i}{{\bf{g}}_{\bf{i}}}} \right)} \sum\limits_{p = 1}^{o(t,i)} {{f_{i,j,p}}} } \right).} \hfill \cr }

4.3.

A Proposal of Double-centered “Normalized” Document Similarity Matrix Based Embedding (For Use With Weighted k-Means)

In [23], it has been suggested to use the 𝒜 matrix of the following form. 𝔈 be a matrix of the following form 29 $E = 1 1^{T} - I .$ = {\bf{1}}{{\bf{1}}^T} - I.

Then define 30 $A = D^{- 1} (E D + D E - 2 S) D^{- 1} .$ {\cal A} = {D^{ - 1}}(D + D - 2S){D^{ - 1}}. with D,S being defined as previously Let ℳ be the matrix of the form: 31 $ℳ = - \frac{1}{2} (I - \frac{1}{n} 1 1^{T}) A (I - \frac{1}{n} 1 1^{T}) .$ {\cal M} = - {1 \over 2}\left( {I - {1 \over n}{\bf{1}}{{\bf{1}}^T}} \right){\cal A}\left( {I - {1 \over n}{\bf{1}}{{\bf{1}}^T}} \right).

Note, that 1 is an eigenvector of ℳ, with the corresponding eigenvalue equal to 0. All the other eigenvectors must be orthogonal to it as ℳ is real and symmetric, so for any other eigenvector v of ℳ we have: 1^T v = 0.

Let Λ be the diagonal matrix of eigenvalues of ℳ, and V the matrix where columns are corresponding (unit length) eigenvectors of ℳ. Then ℳ = VΛV^T. Let $ζ_{i} = Λ^{1 / 2} V_{i}^{T}$ {\zeta _i} = {\Lambda ^{1/2}}V_i^T, where V_i stands for the i-th row of V. Let ζ_i, ζ_ℓ be the embeddings of the documents i,ℓ, resp. This embedding shall be called ℳ-embedding. Then 32 ${‖ ζ_{i} - ζ_{ℓ} ‖}^{2} = A_{i ℓ} = (d_{i i} + d_{ℓ ℓ} - 2 s_{i ℓ}) / (d_{i i} d_{ℓ ℓ})$ {\left\| {{\zeta _i} - {\zeta _\ell }} \right\|^2} = {{\cal A}_{i\ell }} = \left( {{d_{ii}} + {d_{\ell \ell }} - 2{s_{i\ell }}} \right)/\left( {{d_{ii}}{d_{\ell \ell }}} \right) for i ≠ ℓ, and zero otherwise. Let us now discuss performing weighted k-means clustering on the vectors ζ_i with weights amounting to d_ii respectively.

Let us use the following weighting of documents: ω = d_ii. Clustering via weighted k-means with weights ω_i in the ℳ embedding will optimize the following criterion 33 $Q^{[ℳ b a s e d]} (Γ; ω) = \sum_{j = 1}^{k} \sum_{i \in C_{j}} ω_{i} {‖ ζ_{i} - μ_{ω} (C_{j}) ‖}^{2}$ {Q^{[{\cal M}{\rm{ based }}]}}(\Gamma ;\omega ) = \sum\limits_{j = 1}^k {\sum\limits_{i \in {C_j}} {{\omega _i}} } {\left\| {{\zeta _i} - {\mu _\omega }\left( {{C_j}} \right)} \right\|^2} whereby 34 $μ_{ω} (C_{j}) = \frac{\sum_{\in C_{j}} ω_{i} ζ_{i}}{\sum_{\in C_{j}} ω_{i}} = \frac{1}{V_{j}} \sum_{\in C_{j}} ω_{i} ζ_{i} .$ {\mu _\omega }\left( {{C_j}} \right) = {{\sum\nolimits_{ \in {C_j}} {{\omega _i}} {\zeta _i}} \over {\sum\nolimits_{ \in {C_j}} {{\omega _i}} }} = {1 \over {{{\cal V}_j}}}\sum\limits_{ \in {C_j}} {{\omega _i}} {\zeta _i}.

In [23] it has been proven that 35 $Q^{[ℳ b a s e d]} (Γ; ω) = n - 2 k + N C u t (Γ)$ {Q^{[{\cal M}{\rm{ }}based]}}(\Gamma ;\omega ) = n - 2k + NCut(\Gamma )

Recall, that NCut(Γ) was defined in equation (8c). Since n – 2k is a constant, minimizing one criterion minimizes the other. As N-based Clustering (clustering using the normalized Laplacian 𝔏) has the same target as NCut clustering, they are equivalent (see sec. 3).

In [23], proof can be found that 36 $Q^{[ℳ b a s e d]} (Γ; ω) = \sum_{j = 1}^{k} \frac{1}{2 V_{j}} \sum_{i \in C_{j}} \sum_{\begin{matrix} ℓ \in C_{j} \\ ℓ \neq i \end{matrix}} (d_{i i} + d_{ℓ ℓ} - 2 s_{i ℓ}) .$ {Q^{[{\cal M}{\rm{ }}based]}}(\Gamma ;\omega ) = \sum\limits_{j = 1}^k {{1 \over {2{{\cal V}_j}}}} \sum\limits_{i \in {C_j}} {\sum\limits_{\scriptstyle \ell \in {C_j} \atop \scriptstyle \ell \ne i} {\left( {{d_{ii}} + {d_{\ell \ell }} - 2{s_{i\ell }}} \right)} } .

It is easily seen that it is identical with the equation (24) showing equivalence of Q^[ℳbased](Γ;ω) with Q^[ωBERT](Γ;ω).

This completes the explanation bridge: you can cluster with normalized Laplacian based GSC and explain with weighted BERT explanation method. The advantage is that the clustering is performed in a much lower dimensional space (k dimensions when seeking k clusters), but without the disadvantage of basic GSC of non-explainability. Explanation is reached in the BERT space.

4.4.

Algorithmic Description of Clustering with Explanation Under BERT Embedding

The outlined methodology can be summarized as follows:

1)
Embed the document collection into BERT.
2)
For each document ℰ(δ_i) compute the normalized embedding vector ℰ*(δ_i).
3)
Compute cosine similarity between documents i, 𝓁 as cosine between the embedding vectors of documents $s_{i ℓ} ℰ^{*} {(δ_{i})}^{T} ℰ^{*} (δ_{ℓ})$ {s_{i\ell }}{{\cal E}^*}{\left( {{\delta _i}} \right)^T}{{\cal E}^*}\left( {{\delta _\ell }} \right) forming the similarity matrix S.
4)
For the similarity matrix S, apply normalized Laplacian based clustering, eq. (4).
5)
For each token ℰ(𝔴_𝑡,𝑝,δ_i) of the document compute the normalized embedding vector ℰ*(𝔴_𝑡,𝑝,δ_i).
6)
For each document, compute the linear approximation of the normalized document embedding vector based on normalized token embedding vectors eq. (10).
7)
For each cluster, with center eq. (21), calculate the importance of words associated with the cluster eq. (28).
8)
Take n most important words as the cluster explanation for each cluster.

5.

Experiments

In [23], it has been shown that explainability can be achieved when performing GSC for similarities obtained in the Term Vector Space. Therefore, the question can be raised: What is gained when considering the BERT embedding? Experiments have been performed showing the improved clustering performance of the latter.

For this purpose, we studied the effectiveness of BERT-based clustering versus Glove and traditional TVS embeddings. The sBERT and BERT embedding models were downloaded automatically (in the background) by libraries used for embedding of textual documents, whereby model names needed to be provided. The models are available, for example, in the huggingface repository (https://huggingface.co/models). Alternatively, manual download is also possible. Python library used for BERT embeddings was transformers. Models used in experiments were: bert-base-uncased, vinai/bertweet-base, distilbert-base-uncased, cardiffnlp/twitter-roberta-base. Python library used for sBERT embeddings is called sentence_transformers. The names of sBERT and BERT embeddings visible in the tables 1 - 5, consist of two and three parts, respectively, separated with and #. The first part (sBERT, BERT) indicates the type of embedding. The second part is the common name of the respective embedding model. The last part, for BERT embeddings, following #, indicates the version of embedding extraction. Two different document embedding extractors were considered: those marked with #[CLS] mean embeddings taken from the [CLS] token, while #T_AVG indicate that the document embedding was taken as the plain average of (the other) token embeddings.

Table 1.

N-based clustering of Dataset 0 using various embeddings (vectorizers)

Vectorizer	F1-avg	F1-stdev
CountVectorizer	0.255152	0.000049
TfVectorizer	0.255146	0.000102
TfidfVectorizer	0.396431	0.000360
GloVe@wiki	0.363963	0.000401
GloVe@twitter	0.355927	0.001543
sBERT@all-MiniLM-L6-v2	0.974979	0.000092
sBERT@all-distilroberta-v1	0.951857	0.000382
sBERT@multi-qa-mpnet-base-dot-v1	0.942263	0.000090
BERT@bert-base-uncased#[CLS]	0.472723	0.000262
BERT@vinai/bertweet-base#[CLS]	0.675733	0.000439
BERT@distilbert-base-uncased#[CLS]	0.592854	0.000481
BERT@cardiffnlp/twitter-roberta-base#[CLS]	0.831508	0.000091
BERT@bert-base-uncased#T_AVG	0.602534	0.000423
BERT@vinai/bertweet-base#T_AVG	0.618835	0.000151
BERT@distilbert-base-uncased#T_AVG	0.558688	0.000899
BERT@cardiffnlp/twitter-roberta-base#T_AVG	0.476953	0.002215

Table 2.

N-based clustering of Dataset 1 using various embeddings (vectorizers)

Vectorizer	F1-avg	F1-stdev
CountVectorizer	0.367469	0.000429
TfVectorizer	0.367626	0.000624
TfidfVectorizer	0.497295	0.001748
GloVe@wiki	0.463999	0.000942
GloVe@twitter	0.594311	0.001918
sBERT@multi-qa-mpnet-base-dot-v1	0.906746	0.000166
BERT@bert-base-uncased#[CLS]	0.475842	0.000682
BERT@vinai/bertweet-base#[CLS]	0.573888	0.000470
BERT@distilbert-base-uncased#[CLS]	0.654812	0.000458
BERT@cardiffnlp/twitter-roberta-base#[CLS]	0.799531	0.000085
BERT@bert-base-uncased#T_AVG	0.609975	0.001175
BERT@vinai/bertweet-base#T_AVG	0.683997	0.000223
BERT@distilbert-base-uncased#T_AVG	0.694325	0.000253
BERT@cardiffnlp/twitter-roberta-base#T_AVG	0.727976	0.000198

Table 3.

N-based clustering of Dataset 2 using various embeddings (vectorizers)

Vectorizer	F1-avg	F1-stdev
CountVectorizer	0.277486	0.000189
TfVectorizer	0.277534	0.000169
TfidfVectorizer	0.575343	0.000483
GloVe@twitter	0.385300	0.000959
sBERT@all-MiniLM-L12-v2	0.834372	0.000242
sBERT@multi-qa-distilbert-cos-v1	0.820115	0.000252
sBERT@multi-qa-mpnet-base-dot-v1	0.854759	0.000197
BERT@bert-base-uncased#[CLS]	0.488475	0.000369
BERT@vinai/bertweet-base#[CLS]	0.780850	0.000328
BERT@distilbert-base-uncased#[CLS]	0.464572	0.000172
BERT@cardiffnlp/twitter-roberta-base#[CLS]	0.632636	0.000423
BERT@bert-base-uncased#T_AVG	0.478280	0.000458
BERT@vinai/bertweet-base#T_AVG	0.540016	0.003658
BERT@distilbert-base-uncased#T_AVG	0.444183	0.000424
BERT@cardiffnlp/twitter-roberta-base#T_AVG	0.551978	0.000632

Table 4.

N-based clustering of Dataset 3 using various embeddings (vectorizers)

Vectorizer	F1-avg	F1-stdev
CountVectorizer	0.231842	0.001354
TfVectorizer	0.231546	0.001468
TfidfVectorizer	0.320499	0.000485
GloVe@twitter	0.444521	0.000620
sBERT@all-MiniLM-L6-v2	0.985108	0.000009
sBERT@all-MiniLM-L12-v2	0.986921	0.000000
sBERT@multi-qa-MiniLM-L6-cos-v1	0.960467	0.000130
sBERT@multi-qa-mpnet-base-dot-v1	0.978597	0.000071
BERT@bert-base-uncased#[CLS]	0.537486	0.000581
BERT@vinai/bertweet-base#[CLS]	0.641715	0.000124
BERT@distilbert-base-uncased#[CLS]	0.633663	0.000141
BERT@cardiffnlp/twitter-roberta-base#[CLS]	0.935586	0.000230
BERT@bert-base-uncased#T_AVG	0.652909	0.000174
BERT@vinai/bertweet-base#T_AVG	0.713428	0.000090
BERT@distilbert-base-uncased#T_AVG	0.611432	0.000196
BERT@cardiffnlp/twitter-roberta-base#T_AVG	0.677452	0.000217

Table 5.

N-based clustering of Dataset 4 using various embeddings (vectorizers)

Vectorizer	Fl-avg	F1-stdev
CountVectorizer	0.238978	0.000000
TfVectorizer	0.238978	0.000000
TfidfVectorizer	0.239120	0.000000
GloVe@wiki	0.292245	0.000981
GloVe@twitter	0.327367	0.059270
sBERT@multi-qa-mpnet-base-dot-v1	0.670961	0.000124
BERT@bert-base-uncased#[CLS]	0.532433	0.000220
BERT@vinai/bertweet-base#[CLS]	0.450124	0.000068
BERT@distilbert-base-uncased#[CLS]	0.644404	0.000593
BERT@cardiffnlp/twitter-roberta-base#[CLS]	0.769169	0.000094
BERT@bert-base-uncased#T_AVG	0.614746	0.000177
BERT@vinai/bertweet-base#T_AVG	0.592477	0.000221
BERT@distilbert-base-uncased#T_AVG	0.663272	0.000055
BERT@cardiffnlp/twitter-roberta-base#T_AVG	0.672432	0.000529

GloVe-based embeddings are the ones trained on Twitter data (GloVe@twitter) and trained on Wikipedia data (Glove@wiki), downloaded from the pages indicated in [10]. The embeddings with traditional TVS (tf, tfidf) are based on the Python scikit-learn library.

The clustering method over all the embeddings was N-based clustering. sBERT embeddings differ from BERT embeddings in that for BERT embeddings, we get both document embedding and embeddings of individual tokens. This enables explanations to be derived. However, sBERT provides document embedding only. In this study of clustering efficiency, we compare sBERT and BERT embeddings to see whether or not sBERT-based clusterings perform significantly better than BERT embeddings. If they do not perform significantly better, then we can use BERT embeddings for clustering and can be satisfied with the explanation methodology provided in this paper. If sBERT were significantly better, separate explanation methods need to be sought for sBERT.

The clustering experiments were performed with popular Python libraries: numpy [24], scipy [25], scikit-learn [26] and soyclustering [27] which is an implementation of spherical k-means [28]. In particular, we used SpectralClustering class from scikit-learn with the affinity parameter set to: precomputed (affinity from similarity matrix) as a representative of the N-embedding based clustering.

The Hungarian method [29] is applied to match hashtags and clusters, aiming to achieve the best agreement (lowest error rate).

We made the comparison based on 5 datasets drawn from Twitter. Each dataset consisted of documents related to one of five hashtags only. Below, a short characterization of each dataset is given.

- DataSet 0: 8050 documents with 23855 distinct terms. It consists of 5 hashtags: mentalhealth (1003 docs), brexit (1607), ukraine (1687), covid_19 (1696), writingcommunity (2057).
- DataSet 1: 8159 documents with 21188 distinct terms. It consists of 5 hashtags: bbcqt (1025 docs), sidnaaz (1153), lufc (1470), 100daysofcode (1718), tejran (2793).
- Dataset 2: 8256 documents with 20217 distinct terms, and 5 hashtags: r4today (1036 docs), rhobh (1298), bb22 (1622), blm (1849), cdnpoli (2451).
- Dataset 3: 8218 documents with 21061 distinct terms, and 5 hashtags: smackdown (1063 docs), rhop (1154), btc (1487), maga (2079), nufc (2435).
- Dataset 4: 8440 docs with 26939 distinct terms, and 5 hashtags: dnd (1031 docs), robostopia (1323), browns (1493), aewdynamite (1821), 2 (2772).

To measure the quality of clustering, we compared the clustering results with the human-made clustering in terms of the hashtags assigned to documents. Out of the multitude of possibilities (e.g., [30] or [31]), as a quality measure for clustering, we took the popular external quality F1 measure as it reflects both precision and recall. Based on the Hungarian method, we associate each cluster with a single hashtag, calculate F1 for each hashtag against the rest. Then compute the average for all hashtags.

We had to refrain from evaluation of the explanations limiting ourselves to visual inspection (see an example below) as no appropriate reference data sets for the tweets we considered are available. We are currently developing an evaluation method that does not require human intervention and is therefore objective. This paper establishes only the theoretical basis for the explanation process itself. We refrain from adding unnecessary complexity to this paper. There exist multiple works [32-36] and surveys on assessing the quality of explanations like [37-40]. Most quality evaluation methods require either direct interaction with human user or some predefined sets with “ground truth” explanations. We chose in this paper a different pathway and provided an analytical justification why the explanations are trustworthy.

Tables 1 - 5 present clustering results for datasets 0 - 4, respectively. The F1-score presented there in is an average over 30 runs. Note, that in some cases, GSC clustering (as based on normalized Laplacian) failed due to negative similarities. Most frequently, sBERT was affected, but also in some cases GloVe and BERT embedding.

In all cases except for dataset 4, sBERT embedding methods yield the best clustering results. They are, however, not the best option for explanation purposes as an approximation equation. (10) constitutes a poor approximation of document embeddings via token embeddings.

BERT embeddings are not as good as sBERT embeddings. However, if we compare #[CLS] embeddings with #T_AVG in the case of BERT methods, they do not differ very much. Hence, the usage of the averaging method for document embedding is well justified and paves the way for the application of the eq. (10) approximation, and hence the proposed explanation method is justifiable. And, as already mentioned, sBERTs are more problematic than BERTs due to failures caused by frequent negative similarities.

BERT methods are generally better than the TVS and GloVe-based embeddings, but not in all cases (as stated also in, e.g., [41]).

Explanations of clusters under BERT embeddings can be essentially used in a meaningful way only when the embedding of the document is the average over token embeddings (#T_AVG variants). One faces a similar problem as in other settings, that is, the stopwords play a non-negligible role. See an example of a cluster description with top 50 tokens: the to of # mental-health and a is in covid writingcommunity for you are this that we @ _ i on 19 your be all - with it have not & people health our an can from ukraine mental will need those at about my their so who by as.

Same cluster described after removing the stop-words looks like this: mentalhealth covid writing-community people health ukraine mental support time pandemic love care social day distancing safe life crisis issues feel lives remember complacency current depression medical zelenskyyua future llinzigray wip feeling decisionproject government based school banffacade-mybxa virus children talk psychosocial public equipment retweeting uncivilized power flyfofaaviation food-safetygov money brexit writing, which seems to be more topical.

Although removal of stopwords improves the quality of the description, one issue remains. BERT like models create embeddings based on the entire document and there is no obvious justification for discarding stopwords from the text when knowing the specific way how the models are trained (transformer method). The effects of removing stopwords under these settings would require an in-depth investigation. Whereas embeddings like GloVe, TVS that are performed on a word-by-word basis are free from this ambiguity - we can simply ignore the stopwords under embeddings.

6.

Conclusions

In this paper, we studied the problems behind explainability of the results of Graph Spectral Clustering (GSC) methods applied under diverse types of document embeddings, belonging to the TVS group, GloVe group and BERT group.

The essential problem with explainability under GSC is that the GSC embeddings of the results of clustering have nothing to do with the document content. Therefore, [23] suggested a multistage explanation process which is applicable to TVS embedding types. In [42], an extension was discussed to GloVe type embeddings which is more complex than that for TVSembeddings.

The mentioned papers constitute a kind of break-through in the domain of GSA. GSA was previously considered as a black-box method, while the mentioned articles make it explainable. Hereby, there is a grading of complexity on the side of the NL embeddings. TVS represents a relatively simple case where the weight of explaining word is directly the TVS coordinate. GloVe embeddings are more complicated because of inter-section of word vectors in this space. Though this is still a simpler case than BERT, as each word has a fixed embedding. BERT family has a different embedding of each word not only in different documents, but also within the same document. The embedding of the entire document may not correspond to a linear combination of word tokens.

As demonstrated in this paper, a still more complex extension of the explanation concept is also possible for BERT embeddings.

We faced two challenges: (1) deriving theoretical formulas of extracting word importance under the assumption of linear combination of token embeddings (2) experimental checking if usage of linear word token combination deteriorates performance compared to the document embedding (CLS token). Managing these two points makes the significant difference to the previous research and allows to say: clustering with GSA under BERT is also explainable.

The question was also raised as to whether or not it is worth the extra work to use the BERT-based embeddings. The accuracy of clustering for embeddings of these different classes of doc embeddings were studied. BERT embeddings appear to yield best results.

However, an issue was observed that needs a future investigations. Similarly to GloVe embeddings, it may happen that BERT embeddings produce negative cosine similarity between documents. An investigation similar to that described in [42] is needed to ensure completeness of applicability of explnations under BERT embeddings.

Also, a deeper investigation is needed with respect to the role of stopwords in BERT, both on their impact of clustering quality and explanation quality. The research question is: shall we remove stopwords (1) before starting the process of BERT embedding, (2) after it but before computing the embedding of the document based on tokens for clustering or (3) after the clustering but before explaining.

Note, that a number of researchers insist that not the explainability but rather the interpretability of ML results is more important, see [43, 44], but we see it differently. Interpretability is one side of ML usage pointing at the possibility to “make money” out of the ML results. However, the other side of the results is their trustfulness – investment risk. One has to understand how the results came about. The results must be hence explainable.

We restricted ourselves to the BERT-type document embedding models, although new ones are constantly being developed, such as those listed on the leaderboard at https://huggingface.co/spaces/mteb/leaderboard. We chose, however, BERT due to its availability and broad usage, and due to its open source nature and relatively low computational demands. We look at the path TVS-GloVe-BERT as a pathway towards development of more general explanation methods for the steadily evolving LLM models.

While there are multiple ways of evaluating clusters, we concentrate on comparison with “intrin-sic” clustering that is hashtags. Based on Hungarian method, we associate each cluster with a single hashtag, calculate F1 for each hashtag against the rest. Then compute the average for all hashtags.

Towards Explainable Graph Spectral Clustering for BERT Embeddings

Full Article

Paradigm

My account