Is more always better? Measuring the quality of ranking data through information entropy

Liu, Yishan; Xiao, Yu; Long, Xin; Wu, Jun

doi:10.2478/jdis-2025-0055

Full Article

1

Introduction

Multi-source data fusion (MSDF) refers to the use of specific methods to process and integrate different types of information sources or relational data, thereby obtaining results that more comprehensively and objectively reflect the characteristics of the research object (Xu et al., 2017). As information sources become increasingly diverse and data types continue to expand in the era of big data, the limitations of relying on a single data source in scientific research, including limited coverage, insufficient reliability, and an inability to fully capture the complexity of research objects, have become increasingly apparent (Hong et al., 2018; Rong et al., 2019). Therefore, leveraging its ability to integrate multi-dimensional information and enhance data value, MSDF has been widely applied across various research domains (Adade et al., 2024; Liu et al., 2022; Wang et al., 2024). Among these applications, rank aggregation represents a typical and critical scenario, aiming to use appropriate methods to process ranking data from multiple sources and produce an aggregated ranking that integrates information from all aspects, thereby ensuring the reliability and objectivity of decision-making (Frank Hsu & Taksa, 2005; Lillis, 2020).

Ranking is the process of arranging a set of objects based on specific criteria to reflect their relative importance. It has been widely applied across various domains, including webpage ranking (Kayed et al., 2010), academic institution ranking (Faghri & Bergman, 2024), gene ranking (Soneson & Robinson, 2018), and sports ranking (Maanijou & Mirroshandel, 2019). With the ongoing advancement of information collection and processing technologies in modern society, the number of rankings generated based on different criteria has been steadily increasing. These rankings reflect the relative superiority and inferiority of objects from multiple perspectives. Therefore, to more comprehensively utilize information from different sources and ensure the reliability and objectivity of decision-making (Chen et al., 2018; Ossadnik et al., 2016; Ursu,2015), appropriate methods are typically employed to process and integrate these ranking data. This process is referred to as rank aggregation (Dwork et al., 2001; Schalekamp & Zuylen, 2009; Wang et al., 2020; Zhao et al., 2019), which plays a crucial role in multiple fields, such as social choice (Caplin & Nalebuff, 1991), bioinformatics (Li et al., 2019), and recommendation systems (Pujahari & Sisodia, 2020).

Early efforts in rank aggregation sought to develop effective methods to solve the problem of inconsistent voting for political candidates in the 18 century (de Borda, 1781; de Caritat Mis, 1785). Since then, researchers in various fields have developed a wide range of rank aggregation methods for various applications to ensure more reliable results (Lin, 2010). These studies focus on the late-stage fusion phase of MSDF, with the proposed methods aiming to integrate output rankings from different data sources without involving the early processing or fusion of raw data. Among these, some studies leverage social choice theory, optimization theory, or graph theory to design algorithms specifically for handling ranking data, such as the Borda’s method (de Borda, 1781), the minimum violations ranking method (Pedings et al., 2012), and the competition graph method (Xiao et al., 2021).A few studies have also explored the application of MSDF methods to rank aggregation, such as using information fusion operators to generate the final ranking (Keyhanipour, 2025). With the substantial increase in both the quantity and types of available data, some researchers have begun exploring rank generation and aggregation methods based on heterogeneous data to further enhance the reliability of aggregated rankings. Unlike studies that focus solely on the final ranking results, these methods adopt a more comprehensive workflow, not only integrating different types of data but also considering both the relationships between the data and the internal relationships within it (Dourado et al., 2019). For example, some scholars have combined image semantic features with traditional text features for search ranking, significantly improving ranking accuracy, making it more effective compared to models that use only text (Chen et al., 2017; Lynch et al., 2016).

Although numerous rank aggregation methods have been proposed in existing studies, the effectiveness of rank aggregation depends not only on the methods employed but also heavily on the quality of the input ranking data (Deng et al., 2014; Dwork et al., 2001). In previous studies and practical applications of rank aggregation, the number and length of input rankings, as the most intuitive and computationally simple characteristics, are often considered key quantitative indicators by experts to evaluate the quality of ranking data (Tadbier & Shoufan, 2021). However, the lengths of the rankings from different sources often exhibit significant inconsistency (Li et al., 2019), thereby making it difficult to accurately assess the quality of input ranking data. Moreover, simply increasing the number of rankings does not necessarily improve data quality, as ranking data that is overly concentrated on a few objects may still lead to aggregated results lacking reliability. Therefore, when evaluating the quality of input ranking data, it is essential to consider not only the number and length of rankings but also their distribution across objects, as this distribution significantly influences the overall quality of the data.

However, existing studies often neglect to analyze the distribution characteristics of the input ranking data, which can lead to biased and misleading aggregation results. The objective of this study is to propose a simple yet effective method for measuring the quality of input ranking data based on its distribution characteristics across objects. In general, the method proposed in this study will help researchers and information managers to more accurately evaluate the quality of input data before rank aggregation, thus reducing biases and errors in the results due to poor quality input.

The remainder of this paper is organized as follows. We first introduce the concepts of information entropy, typical rank aggregation methods, and ranking correlation measurement methods in Section 2. Then, we develop two entropy-based methods to measure the quality of input ranking data in Section 3, and in Section 4, we introduce the datasets used in our experiments and how they were obtained. In Section 5, we perform an experimental analysis using several typical rank aggregation methods to demonstrate the effectiveness of the proposed measures. Finally, we conclude this study and provide an outlook for future research in Section 6.

2

Preliminaries

2.1

Information entropy

Information entropy, also known as Shannon entropy, is derived from the foundational quantitative theory of information transmission and communication (Hartley, 1928; Shannon, 1948). It measures the average amount of information contained in each symbol of an information source. Higher entropy means that the symbol probability distribution is more spread out, indicating greater randomness in the source; conversely, lower entropy implies a more concentrated probability distribution, reflecting stronger certainty in the source. Specifically, given an information source X with finite set of symbols and probability distribution P, the information entropy is defined as follows: 1 $H (X) = - \sum_{x \in X} P (x) l o g P (x)$ H(X) = - \sum\limits_{x \in X} P (x)logP(x)

Currently, information entropy has been widely applied in various research fields, such as machine learning (Wimmer et al., 2023), coding (Li et al., 2024), and data analysis (Guo et al., 2022), due to its efficiency in measuring the complexity of various objects. In particular, its application in network science, such as in quantifying the complexity of network topology, has received significant attention (Omar & Plapper, 2020). The classical network structural entropy is often defined using global graph invariants, such as node count, edge count, or degree distribution (Dehmer & Mowshowitz, 2011). These different definitions reflect the complexity and orderliness of network structure from various perspectives. Now, network structural entropy has been widely applied in numerous research fields, including financial markets (Almog & Shmueli, 2019), privacy protection (Tian et al., 2022), and identifying breakthrough topics (Xu et al., 2022).

2.2

Typical rank aggregation methods

Generally, researchers classify rank aggregation methods into two categories from the perspective of computation: optimization methods and heuristic methods (Argentini & Blanzieri, 2012). To comprehensively examine how varying distributions of input ranking information impact aggregation results of different methods, we select five typical rank aggregation methods for experimental analysis. As representatives of heuristic methods, we chose four typical methods: Borda’s method, Dowdall method, a variant of Borda’s method, and a newly proposed method called the competition graph method. As a representative of optimization methods, we select the minimum violations ranking method.

2.2.1

Borda’s method (BM)

As a classic position-based approach, Borda’s method is a conceptually simple and highly influential rank aggregation method (de Borda, 1781; Neveling & Rothe, 2021). Given m rankings R₁, R₂, …, R_m, for each object o_j∈ R_j, this method first assigns a score B_i(o_j) to object o_j based on the number of objects ranked below it. Then, the Borda count for object o_j is the total score of m rankings, denoted as $\sum_{i = 1}^{m} B_{i} (o_{j})$ \sum\limits_{i = 1}^m {{B_i}} \left( {{o_j}} \right) Finally, the objects are ranked in descending order of their Borda counts to produce an aggregated ranking.

2.2.2

Dowdall method (DM)

Dowdall method is another typical position-based rank aggregation method and can be regarded as a variant of the BM (Reilly, 2002). Given m rankings R₁, R₂, …, R_m, for each object o_j ∈R_j this method first assigns a score D_i(o_j) to object o_j, which is defined as the reciprocal of the rank assigned to it in R_i. Then, the total score for object o_j is given by the sum across all m rankings, denoted as $\sum_{i = 1}^{m} D_{i} (o_{j})$ \sum\nolimits_{i = 1}^m {{D_i}\left( {{o_j}} \right)} . As a result, an aggregated ranking of objects can be generated by sorting the objects in descending order of their total scores.

2.2.3

A variant of Borda’s method (VB)

A variant of Borda’s method uses extension sets to manage unobserved information, specifically by accounting for the possible ranking positions of objects absent from the given rankings to address the uncertainty associated with these objects (Aledo et al., 2016). Given a partial ranking R_i, the method defines a permutation σ, which is consistent with R_i if one of the two following conditions holds for all objects o_s, o_t ranked in R_i:

o_s and o_t share the same rank in R_i,
If o_s is ranked above o_t in R_i, then o_s must be ranked above o_t in σ, and vice versa.

Moreover, σ will be said to be restricted consistent with R_i if the following two conditions hold for all objects o_s, o_t in R_i:

σ is consistent with R_i,
∀o_k non-ranked in R_i, o_k cannot be ranked between o_s and o_t which share the same ranking position in R_i.

Therefore, the extension set of a partial ranking R_i is denoted by E^r(R_i)={σ|σ is restricted consistent with R_i}.

Next, the restricted precedence extension value $V_{s t}^{r} (R_{i})$ V_{st}^r\left( {{R_i}} \right) of R_i between o_s and o_t is defined as follows: 2 $V_{s t}^{r} (R_{i}) = \frac{1}{| E^{r} (R_{i}) |} \sum_{σ \in E^{r} (R_{i})} 1 (o_{s} i s r a n k e d a b o v e o_{t})$ V_{st}^r\left( {{R_i}} \right) = {1 \over {\left| {{E^r}\left( {{R_i}} \right)} \right|}}\sum\limits_{\sigma \in {E^r}\left( {{R_i}} \right)} 1 \left( {{o_s}{\rm{ }}is ranked above{\rm{ }}{o_t}} \right)

Then, given m rankings R₁, R₂, …, R_m, the research define the restrict precedence extension matrix by M^r = [M_st]_s,t=1:n by 3 $M_{s t}^{r} = \frac{1}{m} \sum_{i = 1}^{m} V_{s t}^{r} (R_{i})$ M_{st}^r = {1 \over m}\sum\limits_{i = 1}^m {V_{st}^r} \left( {{R_i}} \right)

Finally, Borda’s method is used to obtain a consensus ranking by sorting the column sums of M^r in ascendingorder.

2.2.4

Competition graph method (CG)

The competition graph method is an efficient approach for aggregating high-dimensional and partial rankings (Xiao et al., 2021). Given m rankings R₁, R₂, …, R_m with respect to n objects, this method first represents objects as nodes and employs directed weighted edges to represent pairwise comparison relationships between objects from the given rankings. Specifically, for a pair of nodes, the weight of an outgoing edge represents the number of rankings in which one node ranks above the other, while the weight of an incoming edge reflects the reverse. Based on this, the total weights of outgoing and incoming edges for each node are defined as its out-degree $d_{i}^{+}$ d_i^ + and indegree $d_{i}^{-}$ d_i^ - , respectively. Then, the “ratio of out- and in-degrees” (ROID) of each object is calculated as follows: 4 $s_{i} = \frac{d_{i}^{+} + 1}{d_{i}^{-} + 1}$ {s_i} = {{d_i^ + + 1} \over {d_i^ - + 1}}

Finally, the objects are ranked in descending order of ROIDs to obtain an aggregated ranking.

2.2.5

Minimum violations ranking method (MVR)

As a typical optimization method, the minimum violations ranking method aims to find a consensus ranking that minimizes violations of input rankings (Ali et al., 1986; Chartier et al., 2010; Pedings et al., 2012). To solve this problem, researchers often formulate it as a binary integer linear program (BILP). Specifically, Given m rankings R₁, R₂, …, R_m of n objects, this method first defines the decision variable x_ij and the constant c_ij. Let 5 $x_{i j} = {\begin{matrix} 1, & i f o b j e c t o_{i} i s r a n k e d a b o v e o_{j} \\ 0, & o t h e r w i s e, \end{matrix}$ {x_{ij}} = \left\{ {\matrix{ {1,} & {{\rm{ }}if object{\rm{ }}{o_i}{\rm{ }}is ranked above{\rm{ }}{o_j}} \cr {0,} & {{\rm{ }}otherwise,} \cr } } \right. and 6 $c_{i j} = | {R_{k} : 0 < r_{k i} < r_{k j}} | - | {R_{k} : 0 < r_{k j} < r_{k i}} |$ {c_{ij}} = \left| {\left\{ {{R_k}:0 < {r_{ki}} < {r_{kj}}} \right\}} \right| - \left| {\left\{ {{R_k}:0 < {r_{kj}} < {r_{ki}}} \right\}} \right| where R_k ∈ R. Note that the optimal value of the decision variable x_ij corresponds to the aggregated ranking, while the constants c_ij are related to the input rankings. Then, the MVR problem can be presented as the following BILP formulation. $m a x \sum_{i = 1}^{n} \sum_{j = 1}^{n} c_{i j} x_{i j}$ max\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{c_{ij}}} } {x_{ij}}

7

s . t . {\begin{matrix} x_{i j} + x_{j i} = 1 f o r a l l i \neq j \\ x_{i j} + x_{j k} + x_{k i} \leq 2 f o r a l l i \neq j \neq k \\ x_{i j} \in {0, 1} \end{matrix}

s.t.{\rm{ }}\left\{ {\matrix{ {{x_{ij}} + {x_{ji}} = 1\quad {\rm{ }}for all{\rm{ }}i \ne j} \cr {{x_{ij}} + {x_{jk}} + {x_{ki}} \le 2\quad {\rm{ }}for all{\rm{ }}i \ne j \ne k} \cr {{x_{ij}} \in \{ 0,1\} } \cr } } \right.

Finally, a “branch and bound” algorithm based on linear programming (LP) is commonly used to solve BILP problems to obtain a minimum violation consensus ranking.

2.3

Measures of ranking correlation

The purpose of rank aggregation is to find an optimal consensus ranking that maximizes its consistency with the input rankings. There are various methods for measuring the consistency between two rankings, among which Kendall’s rank correlation coefficient (Kendall’s tau) (Kendall, 1938) and Spearman’s rank correlation coefficient (Spearman’s rho) (Spearman, 1987) are two classical and widely used metrics.

2.3.1

Kendall’s tau

Kendall’s tau measures the correlation between two rankings by evaluating the consistency of pairwise ordering. Mathematically, given two complete rankings R₁ and R₂ of length n, Kendall’s tau is defined as: 8 $τ = \frac{2 (N_{c} - N_{d})}{n (n - 1)}$ \tau = {{2\left( {{N_c} - {N_d}} \right)} \over {n(n - 1)}} in which N_c and N_d respectively denote the number of object pairs whose orders are concordant and discordant in R₁ and R₂. It is obvious that the more consistent the two rankings are, the larger the value of Kendall’s tau. There are two extreme cases: τ = -1 indicates that the two lists are completely opposite, while τ = 1 signifies complete concordance.

The above definition has a flaw when dealing with rankings that include ties. Therefore, the researcher extended the Kendall’s tau to a distance metric applicable to rankings with ties in subsequent studies (Kendall, 1945). In cases where one or both rankings contain ties, Kendall’s tau-b is defined as: 9 $τ_{b} = \frac{N_{c} - N_{d}}{\sqrt{(N_{c} + N_{d} + T_{x}) (N_{c} + N_{d} + T_{y})}}$ {\tau _b} = {{{N_c} - {N_d}} \over {\sqrt {\left( {{N_c} + {N_d} + {T_x}} \right)\left( {{N_c} + {N_d} + {T_y}} \right)} }} where T_x and T_y represent the numbers of tied pairs in R₁ and R₂, respectively.

2.3.2

Spearman’s rho

Spearman’s rho quantifies the correlation between two rankings by assessing how consistently the ranks of each object change across the two lists. Mathematically, given two complete rankings R₁ and R₂ of length n, Spearman’s rho is defined as: 10 $ρ = 1 - \frac{6 \sum_{i = 1}^{n} {(r_{1 i} - r_{2 i})}^{2}}{n (n^{2} - 1)}$ \rho = 1 - {{6\sum\nolimits_{i = 1}^n {{{\left( {{r_{1i}} - {r_{2i}}} \right)}^2}} } \over {n\left( {{n^2} - 1} \right)}} where r_1i and r_2i are the ranks of object o_i in R₁ and R₂, respectively.

Spearman’s rho is a simple and intuitive measure: the larger its value, the higher the correlation between the two rankings and the stronger the ranking consistency; when the two rankings are identical, ρ reaches its maximum value of 1, and when they are completely opposite, ρ reaches its minimum value of -1.

3

Measure the quality of input ranking information

3.1

Problem statement

The rank aggregation problem is typically defined as follows: Given m rankings R₁, R₂,…, R_m over n objects, where each ranking represents the preferences of an individual voter, the task is to aggregate these rankings into a consensus ranking $\hat{R}$ {\hat R}. Let R_i=(r_1i, r_2i,…, R_in) denote the ranking provided by the voter v_i, where r_ij is the rank of the object o_j given by voter v_i. In particular, r_ij = 0 indicates that voter v_i does not rank object o_j. The length of ranking R_i is therefore denoted by Lⁱ= ⌈{r_ij | r_ij > 0, 1 ≤ j ≤ n}⌉. Note that if objects o_k and o_s are tied in R_i, we have r_ik= r_is.

The purpose of rank aggregation is to combine rankings from multiple sources to maximize the consistency between the aggregated ranking and each input ranking, thereby obtaining a ranking that best reflects the collective preferences. Consequently, the quality of input rankings is critical in determining the quality of the aggregated ranking. Previous studies and applications often used the length and number of input rankings as quantitative indicators to evaluate their quality. However, the distribution of input ranking information across objects also directly affects the effectiveness of the aggregated ranking. To clarify this impact, two simple examples with four partial rankings over five objects are applied, which are shown in Figure 1. And we suppose that the truth ranking is a>b>c>d>e.

We employed the BM and DM methods to aggregate these input rankings. In Figure 1(a), the distribution of input ranking information is relatively concentrated, as it not only includes two identical rankings but also three rankings that involve the comparison between objects a and c. In Figure 1(b), four distinct rankings are presented, where the recorded pairwise comparison information is evenly distributed across objects. It can be observed that when the input ranking information is evenly distributed, both methods can produce aggregated rankings that accurately reflect the true ranking. In contrast, when the distribution is uneven, neither method yields accurate aggregation results.

The above examples indicate that the even distribution of input ranking information across objects is a key factorin ensuring the accuracy of aggregation results. Therefore, developing a measure of ranking information quality based on distribution characteristics is of significant value and constitutes the core objective of this study.

3.2

Network representation of input ranking information

Each input ranking records pairwise comparison information between its objects, whereas a network represents the nodes and their interrelationships. Thus, if each object is mapped to a node in the network and the ranking information is mapped to the edges between these nodes, the network provides an effective representation of the ranking information for rank aggregation.

Specifically, we can use an undirected weighted graph G = (V, E) to represent the ranking information between objects in all input rankings. Here, V denotes the set of nodes representing the objects, and E denotes the set of edges indicating the presence of ranking information between objects. The weight of an edge between nodes v_i and v_j, denoted by w_ij, corresponds to the number of rankings that include both objects v_i and v_j, with 0 ≤ w_ij ≤ n. In particular, w_ij=0 indicates that no ranking contains information on v_i and v_j, whereas w_ij = n indicates that all input rankings include this information.

For example, we can transform the input ranking information from the previous example into a corresponding undirected weighted network, as shown in Figure 2. It can be observed that when the input information is concentrated on a few objects, the resulting network is relatively sparse, whereas when the input information is more evenly distributed, the network becomes denser.

3.3

Entropy-based quality measurements of input ranking information

After transforming the ranking information into a network, we employed entropy to characterize the structural features of the network, thereby enabling a quantitative assessment of the ranking information’s quality. To explore the applicability of different types of entropy in measuring quality, we designed entropy measures across both the node and edge dimensions, capturing how information distribution influences the quality of ranking information from multiple perspectives.

3.3.1

Ranking information quality measurement based on degree entropy

For ranking data, we aimed for each object to be compared with as many other objects as possible. In the corresponding ranking network, the degree of a node reflects the number of other objects with which it has comparisons. When the degree distribution is uniform, most objects have comparisons with a similar number of other objects, indicating that the ranking information is widely covered and evenly distributed. Conversely, when the degree distribution is highly concentrated, most comparisons involve only a few objects, suggesting that the ranking information is unevenly distributed. Degree entropy can be used to quantify the uniformity of this distribution: higher degree entropy indicates a more balanced distribution of ranking information, while lower degree entropy indicates that the information is concentrated on a small subset of objects.

For a network containing N nodes, let k_i denote the degree of the i-th node. Then the relative proportion of each node’s degree in the network is defined as: 11 $p_{i} = \frac{k_{i}}{\sum_{j = 1}^{N} k_{j}}$ {p_i} = {{{k_i}} \over {\sum\nolimits_{j = 1}^N {{k_j}} }}

Building on this, the degree entropy H_d can be computed using the Shannon entropy formula: 12 $H_{d} = - \sum_{i = 1}^{N} p_{i} l o g p_{i}$ {H_d} = - \sum\limits_{i = 1}^N {{p_i}} log{p_i}

We used degree entropy to quantitatively assess the quality of the two sets of ranking data mentioned in the above section, with the results shown in Figure 3. It is clearly observed that when the overall distribution of ranking information is relatively concentrated, the coverage of comparisons among nodes is limited, resulting in lower degree entropy values. Conversely, when the ranking information is more evenly distributed, the degree entropy values are higher. This trend indicates that lower degree entropy corresponds to poorer data quality, as the information is concentrated among a few objects and cannot comprehensively reflect the relative ordering of all objects. In contrast, higher degree entropy indicates broader coverage and a more balanced distribution of information, reflecting higher data quality.

3.3.2

Ranking information quality measurement based on edge-weighted entropy

The degree entropy method primarily examines the degree of network nodes to capture the global distribution characteristics of information. However, as this method fails to reflect the differences in edge weights, another approach is introduced, shifting the focus to the weights of network edges. Specifically, for an undirected weighted network derived from rankings, if the input ranking information is concentrated on a few objects, the edge weights between the corresponding nodes will be relatively large, resulting in an uneven distribution of edge weights across the network. Therefore, we can use edge weight entropy to quantify this unevenness, thereby reflecting the quality of the input ranking data.

For m input rankings, since the comparison information between any two objects o_i and o_j can appear at most once in each ranking, the corresponding edge weight w_ij takes values ranging from 0 to m. Therefore, the probability p_ij that a ranking contains the comparison information between o_i and o_j is given by: 13 $p_{i j} = \frac{w_{i j}}{m}$ {p_{ij}} = {{{w_{ij}}} \over m}

Then, we can calculate the edge-weighted entropy based on the probability corresponding to the edge weight between each pair of nodes in the network, thereby obtaining a quantitative measure of the information distribution recorded in the m rankings. The edge-weighted entropy H_w is defined as: 14 $H_{w} = - \sum_{(i, j) \in E} p_{i j} l o g p_{i j}$ {H_w} = - \sum\limits_{(i,j) \in E} {{p_{ij}}} log{p_{ij}}

In Figure 4, we calculated the H_w of input ranking information with different distribution characteristics based on the aforementioned example. The results similarly show that as the ranking information becomes more concentrated, the accuracy of the aggregated results decreases, and the H_w correspondingly declines, thereby reflecting lower data quality.

4

Datasets

To investigate how the distribution characteristics of input ranking information affect the effectiveness of aggregation results and to validate the efficacy of our proposed quality measurement method, this study used both real and synthetic datasets in the following experiments. In this section, we first introduce two real datasets, and then present extensions to three commonly used ranking data generation models to enable the generation of ranking data with controllable distribution properties. All experiments were conducted on a computer equipped with an Intel Core i7-14700HX, 16 GB RAM, and an NVIDIA GeForce RTX 4070 GPU.

4.1

Real world datasets

4.1.1

Election dataset

We downloaded a dataset from the website http://www.dublincountyreturningofficer.com, containing the complete voting records of two independent elections held in Dublin, covering the northern and western areas of the Irish capital, as well as Meath County. In our experiments, we selected the voting dataset ED-00001-00000002. soi from western Dublin for empirical analysis. This dataset contains 29,988 incomplete rankings of varying lengths for nine candidates. Upon detailed examination of the original data, we found that many rankings included only a single candidate. Such rankings are essentially meaningless for ranking aggregation, as they provide no information about comparisons or outcomes between candidates. Therefore, we removed these single-item rankings. After this preprocessing, the dataset contained 28,245 candidate rankings.

4.1.2

Course evaluation dataset

We selected a course evaluation dataset from AGH University of Science and Technology in Krakow, obtained from PrefLib: A Library for Preferences, which records students’ course preferences. In this study, we used the dataset ED-00009-00000001.soc for empirical analysis. The dataset contains 146 course rankings of length 9, all of which are complete rankings. Since our experiments required incomplete rankings, we randomly removed 0 to 7 items from each ranking to create partial rankings.

4.2

Synthetic datasets

4.2.1

The Mallows model (MM)

The Mallows model is a distance-based conditional model over rankings, applicable to a wide range of tasks due to its compatibility with various ranking distance metrics (Mallows, 1957; Qin et al., 2010). Given a distance d over rankings, the model can generate a ranking R_i based on: 15 $p_{R_{0}, θ} (R_{i}) = \frac{1}{ϕ (θ)} e^{- θ d (R_{i}, R_{0})}$ {p_{{R_0},\theta }}\left( {{R_i}} \right) = {1 \over {\phi (\theta )}}{e^{ - \theta d\left( {{R_i},{R_0}} \right)}} where R₀ and θ are the parameters of the model, representing the location and the dispersion parameters, and $ϕ (θ) = \sum_{R_{i}} e^{- θ d (R_{i}, R_{0})}$ \phi (\theta ) = \sum\limits_{{R_i}} {{e^{ - \theta d\left( {{R_i},{R_0}} \right)}}} is a normalization constant. Specifically, we chose the Kendall tau distance as the metric for measuring the distance between rankings.

In the Mallows model, the parameter θ (θ ≥ 0) controls the concentration of the distribution of rankings around the central ranking R₀. When θ = 0, the distribution is uniform; as θ increases, the distribution becomes increasingly peaked.

To generate input ranking information with varying distribution characteristics, we introduce a distribution parameter α, which takes values in [0, ∞]. Each object is assigned an index from 1 to n, and the probability that the j-th object is selected by a voter is proportional to j^−α. After generating m complete rankings, we randomly select L₀ objects from each complete ranking according to this probability distribution, and reorder them according to their positions in the original ranking to construct m partial rankings. By adjusting the value of α, we can control the frequency with which each object appears in the partial rankings. Specifically, when α = 0, all objects have an equal probability of being selected, resulting in ranking information that is evenly distributed across all objects. As α increases, the probability of selecting certain objects rises significantly, leading to a more skewed distribution of the ranking data, where a few objects are ranked frequently while others appear only rarely.

4.2.2

The Plackett-Luce model (P-L)

The Plackett-Luce model is another classic probabilistic model for rankings (Luce, 1959; Plackett, 1975). Given n objects, this model first assigns a parameter s_{o_i} to item o_j, which can be interpreted as the importance of item o_j. Then, a ranking R_k=[r_k₁, r_k₂,…, r_kn] is given as 16 $p_{s} (R_{k}) = \prod_{i = 1}^{n} \frac{S_{o_{r_{k i}}}}{\sum_{j = 1}^{n} S_{o_{r_{k j}}}}$ {p_s}\left( {{R_k}} \right) = \prod\limits_{i = 1}^n {{{{S_{{o_{{r_{ki}}}}}}} \over {\sum\nolimits_{j = 1}^n {{S_{{o_{{r_{kj}}}}}}} }}} where s_{o_{r_ki}} <0, items with larger values of s_{o_{r_ki}} are more likely to appear earlier in the ranking.

The consensus in this model becomes stronger as the s_{o_i} values decrease more rapidly (Ali & Meilă, 2012). In our experiments we assign s_{o_i} to each object using an exponential decay form defined as: 17 $s_{o_{i}} = s_{m a x} \times r^{i - 1}$ {s_{{o_i}}} = {s_{max}} \times {r^{i - 1}} where s_max is the maximum importance value, r ∈ (0,1) is the decay ratio.

We adopted the same approach as in the Mallows model to obtain ranking information with varying distribution characteristics. By introducing the distribution parameter α, we define the probability distribution for selecting objects by each voter. After generating m complete rankings, we randomly select L₀ objects from each ranking and reorder them based on their positions in the original ranking to construct partial rankings.

4.2.3

The object inherent ability–based model (IA)

Xiao et al. (2017) proposed a ranking data generation method based on the inherent ability values of objects. The model first assume that each object o_j has an inherent ability ω_j, which follows a uniform distribution over [0,1]. Let R₀=[r₀₁, r₀₂,…, r_0n] be the ground truth ranking of these objects, and let r_i be the true ranking position of object o_j based on ω_j. Intuitively, a greater ω_j corresponds to a higher ranking position for o_j.

Then, this model introduces the displayed inherent ability ${\tilde{ω}}_{i j}$ {{\tilde \omega }_{ij}} to represent the estimate of ω_j by voter v_i. Since voter v_i may not accurately estimate ω_j of object o_j for various reasons, the model assumes that ${\tilde{ω}}_{i j}$ {{\tilde \omega }_{ij}} is a random variable uniformly distributed in the interval [ω_j – ω_j(1 – β_ij), ω_j + (1 – ω_j)(1 – β_ij)], where β_ij∈[0,1] denotes the strength of random error. A larger β_ij indicates voter v_i ranks object o_j more accurately based on ω_j. Thus, a ranking R_i = [r_i1, r_i2,…, r_in] is generated by voter v_i based on the estimated values ${\tilde{ω}}_{i j}$ {{\tilde \omega }_{ij}} of each object o_j. In this study, we assume that all voters have the same level of estimation accuracy β_ij meaning that β_ij=β for all i ∈ [1, m] and j∈ [1, n]. Moreover, we assume that all rankings have the same length, that is, L_i = L₀ ∈ (0, n], where i = 1,2,…, m.

Finally, we also introduce the distribution parameter α to control the distribution characteristics of the generated ranking information. Unlike the previous two models, this model allows us to generate partial rankings directly based on the specified probability distribution for selecting objects, without the need to first generate complete rankings.

5

Experimental analysis

5.1

Validation of the effectiveness of the extended data generation models

To validate the performance of the three extended models in generating ranking data with varying distribution characteristics, we generate ranking data for varying α and compute the corresponding degree entropy H_d and edge-weighted entropy H_w. Since all three models employ the same method to control the distribution characteristics of the ranking data, we use the MM model as an example for the experiments. We set n = 20, m = 1,000, and the length of rankings L₀ = 2,3,4. The experimental results are shown in Figure 5. It can be observed that as the value of α increases, both entropies gradually decrease, indicating that the distribution of input ranking information among objects becomes increasingly concentrated.

5.2

Effect of distribution on ranking data quality

To examine whether an increase in the number of input rankings necessarily leads to an improvement in data quality, we generate two sets of data based on three models: baseline data and comparison data, with the numbers of input rankings denoted as m_b and m_c, respectively. For the baseline data, the distribution parameter α is set to 0, while for the comparison data, α is adjusted to generate ranking data with varying distribution characteristics. Then, we apply the five aggregation methods proposed in Section 2.2 to aggregate the two types of data, reflecting their quality differences through changes in the effectiveness of the aggregation results.

The effectiveness of the aggregation results is evaluated using Kendall’s tau-b between the aggregated rankings and the ground truth. The value obtained for the baseline data serves as a reference, and the relative change Δτ_b quantifies the difference in aggregation effectiveness between the baseline and comparison data.

The experimental results are shown in Table 1 and Figure 6, where n = 50, m_b = 1,000, m_c=2,000, θ = 0.1, s_max = 100, r = 0.9, β = 0.8, and L₀ = 10. It can be observed that as the value of α gradually increases, the value of ∆τ_b shows a gradual decline under all five methods. Since the comparison data contains a greater number of input rankings compared to the base data, and when the distribution of input ranking information is more uniform, the quality of the comparison data is higher, thereby improving the effectiveness of the aggregation results. However, as the distribution of ranking information becomes more concentrated, the quality of the comparison data rapidly declines, soon falling below that of the base data, thereby reducing the effectiveness of the aggregation results.

Table 1.

Kendall’s tau-b between the aggregated rankings obtained by the five methods from three sets of baseline data and the ground truth rankings, where m_b = 1,000, n = 50 and L₀ = 10. The results are averaged over 100 independent trials.

Model	BM	DM	VB	MVR	CG
MM	0.89	0.89	0.94	0.93	0.96
P-L	0.92	0.91	0.96	0.94	0.97
IA	0.93	0.92	0.95	0.96	0.98

5.3

Validation of the effectiveness of entropy-based methods in ranking data quality measurement

We conducted a series of experiments to evaluate the effectiveness of the two proposed entropybased methods in measuring the quality of input ranking data from a distributional perspective. First, they were compared with classical ranking consistency measures to highlight the unique role of entropy-based methods in assessing ranking data quality. Then, their performance was tested on datasets of different sizes to examine their adaptability under varying conditions.

5.3.1

Comparison of entropy-based methods and classical consistency measures

To comprehensively analyze the advantages and limitations of different methods, we employed four approaches, Kendall’s tau, Spearman’s rho, degree entropy, and edge-weighted entropy, to assess the quality of data generated by three ranking data generation models with different distribution characteristics. Since the generated rankings are all incomplete, when calculating Kendall’s tau or Spearman’s rho for any pair of rankings, we consider only their common elements and finally take the average of all computed values to obtain the correlation coefficient for the dataset. The experimental results are presented in Tables 2, 3, 4, and 5, with the parameter settings as follows: n = 50, m = 1,000, θ = 0.1, S_max = 100, r =0.9, β =0.8, and L₀ = 10.

Table 2.

Kendall’s tau of the data generated by the three ranking data generation models under different values of α. The results are averaged over 100 independent trials.

Model	α
Model	0.00	1.00	2.00	3.00	4.00	5.00	6.00	7.00	8.00
MM	0.393	0.394	0.399	0.439	0.417	0.383	0.396	0.347	0.373
P-L	0.517	0.540	0.497	0.564	0.505	0.519	0.570	0.551	0.592
IA	0.782	0.773	0.781	0.785	0.782	0.775	0.784	0.803	0.795

Table 3.

Spearman’s rho of the data generated by the three ranking data generation models under different values of α. The results are averaged over 100 independent trials.

Model	α
Model	0.00	1.00	2.00	3.00	4.00	5.00	6.00	7.00	8.00
MM	0.519	0.510	0.522	0.532	0.534	0.462	0.507	0.535	0.496
P-L	0.659	0.661	0.624	0.685	0.669	0.638	0.677	0.660	0.692
IA	0.807	0.817	0.852	0.901	0.895	0.919	0.890	0.908	0.872

Table 4.

Degree entropy of the data generated by the three ranking data generation models under different values of α. The results are averaged over 100 independent trials.

Model	α
Model	0.00	1.00	2.00	3.00	4.00	5.00	6.00	7.00	8.00
MM	3.912	3.912	3.895	3.812	3.688	3.511	3.351	3.211	3.102
P-L	3.912	3.912	3.895	3.814	3.688	3.515	3.354	3.213	3.113
IA	3.912	3.912	3.895	3.812	3.685	3.510	3.354	3.211	3.106

Table 5.

Edge-weighted entropy of the data generated by the three ranking data generation models under different values of α. The results are averaged over 100 independent trials.

Model	α
Model	0.00	1.00	2.00	3.00	4.00	5.00	6.00	7.00	8.00
MM	7.098	6.400	5.513	4.964	4.643	4.444	4.319	4.232	4.168
P-L	7.097	6.399	5.511	4.965	4.642	4.446	4.318	4.232	4.168
IA	7.098	6.399	5.512	4.965	4.640	4.445	4.318	4.230	4.168

Subsequently, we aggregated the generated ranking data using five rank aggregation methods and computed the Kendall’s tau-b between the aggregated rankings and the ground-truth rankings, with the results shown in Figure 7. It can be seen that as α increases, the consistency between the aggregated results and the ground-truth rankings for the data generated by the three ranking data generation models under the five aggregation methods gradually decreases, indicating that data quality declines as the input ranking information becomes more unevenly distributed among the objects. Therefore, the effectiveness of the four data quality measures can be evaluated by examining their measurement results on the data under different values of α.

For the two classical ranking consistency measures, Tables 2 and 3 show that their results fluctuate non-monotonically as α increases, indicating that these methods cannot effectively capture the changes in data quality caused by variations in the distribution characteristics of the ranking data. For degree entropy, as shown in Table 4, when α increases slightly, its value changes little; as α increases further and the unevenness of the data distribution becomes more pronounced, the degree entropy decreases steadily, indicating that it can reflect the quality differences caused by variations in the distribution characteristics of the ranking data, particularly when the data distribution is highly uneven. Finally, as shown in Table 5, edge-weighted entropy decreases gradually as α increases; compared to degree entropy, it is more sensitive to changes in α and can more effectively reflect the quality differences caused by variations in the distribution characteristics of the ranking data.

5.3.2

Performance of entropy-based ranking data quality measures across different data scales

We further conducted experiments to evaluate the performance of the two entropy-based methods on datasets of different scales. Specifically, using three data generation models, we generated varying numbers of rankings under four conditions m = 50, m = 500, m = 5,000 and m = 50,000. The quality of each dataset was then assessed using degree entropy and edge-weighted entropy. The experimental results are shown in Figure 8 and Figure 9, where n=50, θ=0.1, S_max = 100, r = 0.9, β = 0.8, and L₀ = 10.

As shown in Figure 8, for degree entropy, when the dataset is relatively small (i.e. m=50), degree entropy gradually decreases as α increases. However, as the dataset size grows, the change of degree entropy with increasing α initially remains stable and then gradually declines, and the initial stable phase becomes more pronounced with larger dataset sizes. This indicates that when the dataset is large and the unevenness of the data distribution is relatively low, degree entropy has limited capability to effectively measure data quality.

For edge-weighted entropy, as shown in Figure 9, it exhibits a decreasing trend with increasing α across datasets of all sizes. This indicates that edge-weighted entropy can effectively capture variations in data distribution, and its applicability is broader than that of degree entropy regardless of dataset size.

5.4

Analyzing the computational efficiency of entropy-based methods

To analyze the computational efficiency of degree entropy and edge-weighted entropy under large-scale ranking data, we use ranking data generation models to generate datasets with varying values of n and m. We then measure the time required by these two entropy-based quality metrics to process the generated datasets. Since this experiment primarily depends on the size of the dataset rather than the specific data generation model, we generate the required data using only the IA. The experimental results are shown in Figure 10 and Figure 11, where L₀ = 0.1×n, θ = 0.1, S_max = 100, r = 0.9, α =0.2 and β = 0.8.

As shown in Figure 10 and Figure 11, when the number of ranking objects reaches 1,000 and the number of rankings reaches 9×10⁴, both entropy-based methods can complete the computations within a short time, indicating that they maintain high computational efficiency in large-scale ranking scenarios. Among them, degree entropy performs better than edge-weighted entropy. In addition, when n is fixed, the computational time of both methods increases approximately linearly with m, further demonstrating that they can remain efficient even when processing a larger number of rankings.

5.5

Collaborative analysis of the length and distribution of input rankings

Furthermore, we also examined the combined impact of the length and distribution of input rankings generated by the three models on aggregation effectiveness. The experimental results are presented in Figure 12, with n = 50, m = 1,000, θ = 0.1, S_max = 100, r = 0.9, and β = 0.8. Each heatmap cell shows the Kendall’s tau-b for a specific combination of parameters L₀ and α. In the heatmaps, the color gradient of the cells from blue to red indicates a gradual improvement in the effectiveness of the aggregation results.

For the data generated by the three models, it can be observed that as the value of α increases, an increase in the input ranking length L₀ causes the red cells to gradually deepen. This indicates that increasing the ranking length L₀ can mitigate the decline in aggregation effectiveness caused by the unequal probability of each object being selected for ranking. In particular, the MVR and CG methods outperform the other three methods in significantly improving aggregation effectiveness as the ranking length L₀ increases. This is because both methods leverage pairwise comparison information between objects. As the ranking length L₀ increases, more node pairs become connected in the network. Even if these edges have low weights, MVR and CG can still effectively utilize the available comparison information to enhance aggregation performance.

5.6

Empirical analysis of entropy-based ranking data quality measurement methods

We validated the effectiveness of degree entropy (H_d) and edge-weighted entropy (H_w) in assessing ranking data quality from a distributional perspective using two empirical datasets. First, we computed the initial H_d and H_W for both datasets: for the election dataset, H_d = 2.192 and H_w = 3.570; for the course evaluation dataset, H_d = 2.196 and H_W = 3.581. Then, to construct datasets with varying distribution characteristics, we randomly selected a ranking with a certain number of occurrences from each dataset, increased its occurrences while reducing those of other rankings to keep the total number of rankings constant, and recalculated the corresponding H_d and H_W at each step. The experimental results, shown in Tables 6 and 7, illustrate the trends of H_d and H_W as the occurrences of the selected ranking change in the two datasets. Here, m_n represents the increased number of occurrences of the selected ranking in the dataset.

Table 6.

Degree entropy (H_d) and edge-weighted (H_w) of the election dataset under different distribution characteristics, with n=9 and m=28,245. The results are averaged over 100 independent trials.

Method	m_n
Method	1,000	2,000	3,000	4,000	5,000	6,000	7,000	8,000
H_d	2.191	2.188	2.183	2.178	2.171	2.156	2.118	2.082
H_w	3.568	3.561	3.548	3.536	3.516	3.478	3.383	3.292

Table 7.

Degree entropy (H_d) and edge-weighted (H_W) of the course evaluation dataset under different distribution characteristics, with n=9 and m=146. The results are averaged over 100 independent trials.

Method	m_n
Method	15	30	45	60	75	90	105	120
H_d	2.195	2.191	2.189	2.181	2.177	2.173	2.166	2.158
H_w	3.577	3.566	3.558	3.539	3.527	3.517	3.496	3.473

Based on Tables 6 and 7, as the value of m_n increases—indicating that the information distribution in both datasets becomes increasingly uneven—both H_d and H_W exhibit a decreasing trend. This indicates that, for these two real-world datasets from different domains, both entropy-based methods can effectively capture changes in data quality as the distribution characteristics vary.

6

Conclusion and discussion

Rank aggregation plays a crucial role in various academic studies and practical applications, and the quality of input rankings significantly affects the effectiveness of the aggregation results. In this study, we propose two general and intuitive methods for measuring the quality of ranking data from a distributional perspective. Specifically, we first convert the input rankings into an undirected weighted network and then use degree entropy and edge-weighted entropy, two types of network structural entropy, to quantify the quality of the ranking data. We then validate the effectiveness and applicability of the proposed methods through experiments on both real-world and synthetic datasets. The results of numerical experiments indicate that an increase in the number of input rankings does not necessarily lead to improved data quality, as the distribution characteristics of the ranking information also play a critical role in determining quality. We further confirm that, compared with traditional consistency measures, the two entropy-based methods can more effectively assess the quality of input ranking information from the perspective of information distribution and demonstrate good computational efficiency on large-scale datasets. Moreover, different rank aggregation methods show varying performance when handling unevenly distributed rankings. Increasing the length of input rankings can enhance aggregation effectiveness, particularly for methods that fully utilize pairwise comparison information between objects, where this effect is more pronounced.

It should be noted that the quality of ranking data is influenced by multiple factors. This study primarily focuses on the impact of the distribution characteristics of ranking data on its quality. Future research should further investigate other potential influencing factors, such as the effect of conflicts present in the data on ranking quality.

Is more always better? Measuring the quality of ranking data through information entropy

Full Article

Paradigm

My account