Introduction
In recent years, a growing body of research has used Evidence Accumulation Models (EAMs) to reveal the cognitive mechanisms that account for experimental effects (Forstmann et al., 2016), and these models have recently also been used extensively to study individual and group differences (Pleskac et al., 2019; Schubert & Frischkorn, 2020; Sripada & Weigard, 2021). EAMs are a family of models sharing the assumption that the representation of stimuli in the central nervous system is noisy, and that a decision is made by accumulating successive samples from this noisy representation until a sufficient amount of evidence has been obtained and a decision criterion has been reached (Ratcliff & Smith, 2004). Core parameters in EAMs include the across-trial mean rate of evidence accumulation, the drift rate (v), and the across-trial standard deviation of the drift rate (sv).
However, using EAMs turns out to be challenging (van Maanen & Miletić, 2021). Specifically, when fitting EAMs, one needs to constrain at least one parameter that sets the scale for the remaining parameters. This requirement arises because the parameters of EAMs represent latent variables, whose units must be defined.
More specifically, the mean rate of evidence accumulation, the drift rate, is expressed in units of evidence per second. While time (measured in seconds) is not latent, the latent “unit of evidence” must somehow be defined. A common practice in fitting the Linear Ballistic Accumulator model (LBA; Brown & Heathcote, 2008), a type of EAM, is to set the unit such that one of the parameters is fixed at unity and sets the scale for the remaining parameters. Analogous practices are used when fitting other EAMs such as the Drift-Diffusion Model (DDM; Ratcliff & McKoon, 2008). This practice is not unique to EAMs. For example, in Signal-Detection Theory (Macmillan & Creelman, 2004), the unit used to define sensitivity is often set to be the standard deviation of the noise. In other words, when comparing individuals or conditions, an implicit assumption is made that the parameter fixed at the same value for all participants (e.g., the standard deviation of the noise) is truly equal, to unity, across participants.
The problem with this practice can be illustrated with the concept of distance. If one arbitrarily defines the unit for distance as the unit used in a given country, one may wrongly conclude, for example, that two US cities that are 200 miles apart are as distant as two French cities that are 200 km apart. This is in fact analogous to how the unit is defined when fitting EAMs: the unit is defined per participant (the participant being analogous to the country in this example). The proper comparison, of course, requires adopting a constant unit that applies across countries/participants.
Setting the scale in this way (i.e., fixing a parameter at some value, such as unity) amounts to assuming a lack of meaningful individual and group differences in the parameter used for scale setting. This is a powerful assumption that might be wrong. For example, one might decide to use sv to set the scale, but sv may be said to represent the degree of noise, and brain recordings suggest that there are important individual and group differences in neural noise (Dinstein et al., 2015). Violating the assumption of no meaningful differences in the fixed parameter results in erroneous individual/experimental differences (or lack thereof) in the estimated model parameters. For instance, imagine that two participants truly differ from one another in sv, but truly do not differ in the free (estimated) parameter (v). Since parameter estimation is commonly performed by fixing a parameter such as sv at some value, it is possible to find individual differences or an experimental effect in a parameter (e.g., v) where such differences are absent, while failing to find them in the parameter (sv) where such differences exist. A reviewer suggested an alternative by which, instead of fixing sv across participants, one may fix another parameter such as the boundary and employ an experimental manipulation (e.g., accuracy emphasis) that likely eliminates any individual or group differences. We argue, however, that this approach still runs some risk, since it cannot be fully assured that the individual/group differences were truly eliminated by the manipulation. The method suggested in this work overcomes the problem much more effectively.
Two solutions to the aforementioned problem were suggested by van Maanen and Miletić (2021). The first was also empirically demonstrated by these authors. It consists of showing that interpreting the results in terms of how parameters relate to one another yields a consistent interpretation across different parameter constraints. van Maanen and Miletić’s second solution was merely suggested and was not empirically tested. The proposal was to fit the data with a Bayesian hierarchical model. In this solution, since participants are nested within a group, it is possible to define the unit not by constraining the individual-level parameters (e.g., all participants’ sv values are 1) but by constraining the group-level parameter (e.g., the population mean sv = 1). This constraint is accomplished through the setting of the Bayesian priors. This method sets the scale (unit = population-level mean sv) while still allowing for individual and group differences in (e.g.) sv. It seems to us that the second (not yet tested) solution is more effective because it eliminates the source of the problem. Therefore, in the current study we adopt this second solution and suggest a practical approach to implement it. We validate the new method using a simple EAM, the LBA, in its Bayesian hierarchical estimation version (Lin & Strickland, 2020). The next section provides a short description of the LBA.
LBA
In the LBA, each choice alternative is associated with an evidence accumulator. The initial amount of evidence in each accumulator prior to evidence accumulation is determined by the starting-point parameter, which defines the range (0 to starting point) from which the amount is sampled in each trial. In each trial, the drift rate is also sampled, from a normal distribution with mean drift rate (v) and standard deviation of the drift rate (sv). The LBA is acknowledged to be simplified because it assumes that the evidence accumulation rate remains constant within a given trial (hence, “Linear Ballistic”). A decision is reached once the amount of evidence in an accumulator crosses the boundary, b, which is sometimes estimated indirectly through B, where b = starting point + B. The last parameter is non-decision time (t0), expressed in seconds and describing the duration of non-decisional processes such as early feature extraction and motor preparation.
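To make the model’s mechanics concrete, the following is a minimal single-trial simulation sketch in base R (the language used for our analyses). It is illustrative only, not the fitting package’s code; the function name `rlba_trial` and its argument names are ours.

```r
# Minimal, illustrative single-trial LBA simulation (base R).
# A  = upper end of the starting-point range (evidence starts in [0, A])
# b  = boundary; v = mean drift rates (one per accumulator); sv = drift-rate SD
# t0 = non-decision time in seconds
rlba_trial <- function(A, b, v, sv, t0) {
  k <- runif(length(v), 0, A)       # starting points, one per accumulator
  d <- rnorm(length(v), v, sv)      # drift rates, constant within the trial
  d[d <= 0] <- NA                   # non-positive drifts never reach the boundary;
                                    # for brevity we assume at least one positive drift
  finish <- (b - k) / d             # linear ("ballistic") rise to the boundary
  winner <- which.min(finish)       # first accumulator to cross wins
  c(response = winner, rt = finish[winner] + t0)
}

# Example trial: a correct (v = 2.5) vs. an incorrect (v = 0.8) accumulator
rlba_trial(A = 2, b = 2.7, v = c(2.5, 0.8), sv = 1, t0 = 0.3)
```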
When applying the model using the Bayesian hierarchical method, one estimates the parameters for each individual separately, but constrains this estimation with a population-level distribution. The process of model fitting starts with the definition of the model, which includes determining the free (to-be-estimated) parameters and the fixed parameter that sets the scale. Next, one needs to specify two types of priors: base-level priors, which are the priors for each parameter at the individual participant’s level (typically, all participants’ parameters have the same priors), and hyper-level priors, which are the priors for the parameters that characterize the population. Estimation proceeds by posterior sampling using the Markov chain Monte Carlo (MCMC) method. Posterior sampling adequacy is assessed using the Potential Scale Reduction Factor (PSRF), which should fall below 1.1 (Brooks & Gelman, 1998).
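As a reference for the convergence criterion, here is a minimal base R sketch of the PSRF for a single parameter (our illustration of the Brooks & Gelman, 1998, statistic; the fitting software computes it automatically):

```r
# PSRF (potential scale reduction factor) for one parameter.
# `chains` is an iterations x chains matrix of posterior samples.
psrf <- function(chains) {
  n <- nrow(chains)                  # iterations per chain
  B <- n * var(colMeans(chains))     # between-chain variance
  W <- mean(apply(chains, 2, var))   # average within-chain variance
  v_hat <- (n - 1) / n * W + B / n   # pooled posterior variance estimate
  sqrt(v_hat / W)                    # values below 1.1 indicate convergence
}
```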
The new method
In the new method, we de facto fixed sv at the population level. This was done indirectly using the hyper-level Bayesian priors. More specifically, we defined the Mu (population-level mean) prior for sv using a Beta distribution with α = β = 1 (creating a uniform distribution) and a tiny range from 0.999 to 1.001. This method de facto fixes the mean of the population sv at about 1, and thus sets the scale (one unit of evidence equals the population-level sv), but still allows for individual and group differences in sv. In the present work, we report two parameter-recovery studies, providing the necessary proof of concept for the new method. In each parameter-recovery study, we simulated data using pre-determined parameter values, then estimated the parameter values with the model, and checked whether the estimates were close enough to the pre-determined values.
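The following base R sketch illustrates this prior (our construction, not the fitting package’s internal code): a Beta(1, 1) density rescaled to the narrow interval [0.999, 1.001], which is effectively a point mass at 1 for the population mean of sv.

```r
# Hyper-level prior density for the population mean of sv: Beta(1, 1),
# i.e., uniform, rescaled to a tiny support around 1.
dsv_mu_prior <- function(x, lower = 0.999, upper = 1.001) {
  inside <- x >= lower & x <= upper
  # rescale x to [0, 1], apply the Beta density, renormalize by the range width
  ifelse(inside, dbeta((x - lower) / (upper - lower), 1, 1) / (upper - lower), 0)
}

integrate(dsv_mu_prior, 0.999, 1.001)  # integrates to ~1
dsv_mu_prior(c(0.9995, 1.0005))        # flat (equal density) inside the support
```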
To summarize, the goal of this study was to find a credible implementation of the idea of fixing one parameter at the population level, which allows EAMs (here, the LBA) to investigate individual differences or experimental effects without being exposed to the aforementioned risks.
Study 1
The goal of Study 1 was to check whether the parameter recovery, using the new method, is successful.
Method
The steps involved in Study 1 are summarized in Figure 1. We first simulated 21 data sets that differed from one another in the combinations of mean (across-participants) parameter values (Table 1). Based on previous empirical work (Berkovich & Meiran, 2023; Berkovich & Meiran, 2024; Givon et al., 2023), the range of parameter values is realistic for the type of tasks that we study in our lab. Given that each model fitting takes a very long time (to give an idea, a single fitting run for the 21 models took about one month to converge, despite massive parallel processing), we limited the number of mean combinations by not changing the starting-point mean value, which was fixed across simulations at 2. Additionally, since research questions concerning t0 are not the focus of our studies, the value of t0 for all ”participants” in all models was set to 0.3, and its recovery was accordingly not tested. Not varying t0, and the fact that we used a particular range of parameter values, constitute limitations of the current study.

Figure 1
Summary of the steps involved in Study 1.
Table 1
Population mean values in the 21 models of Study 1. v.true and v.false represent the mean drift rates for the correct and incorrect response, respectively.
| STARTING POINT | BOUNDARY | V. TRUE | V. FALSE | sv |
|---|---|---|---|---|
| 2 | 2.3 | 2.3 | 0.6 | 1 |
| 2 | 2.5 | 2.3 | 0.6 | 1 |
| 2 | 2.7 | 2.3 | 0.6 | 1 |
| 2 | 2.3 | 2.5 | 0.6 | 1 |
| 2 | 2.5 | 2.5 | 0.6 | 1 |
| 2 | 2.7 | 2.5 | 0.6 | 1 |
| 2 | 2.3 | 2.7 | 0.6 | 1 |
| 2 | 2.5 | 2.7 | 0.6 | 1 |
| 2 | 2.7 | 2.7 | 0.6 | 1 |
| 2 | 2.3 | 2.3 | 0.8 | 1 |
| 2 | 2.5 | 2.3 | 0.8 | 1 |
| 2 | 2.7 | 2.3 | 0.8 | 1 |
| 2 | 2.3 | 2.5 | 0.8 | 1 |
| 2 | 2.5 | 2.5 | 0.8 | 1 |
| 2 | 2.7 | 2.5 | 0.8 | 1 |
| 2 | 2.3 | 2.7 | 0.8 | 1 |
| 2 | 2.5 | 2.7 | 0.8 | 1 |
| 2 | 2.7 | 2.7 | 0.8 | 1 |
| 2 | 2.3 | 2.3 | 1 | 1 |
| 2 | 2.5 | 2.3 | 1 | 1 |
| 2 | 2.7 | 2.3 | 1 | 1 |
To simulate participants’ data, we first determined the “true” individual parameter values by drawing from normal distributions with means as defined in Table 1 and with standard deviations of 0.6 for starting point and boundary, 0.8 for v.true and v.false, and 0.4 for sv. Randomly drawn values were corrected if they fell outside an allowable range. The lower bound for starting point, boundary, and v.false was 0.05; for v.true the lower bound was 1.3; and for sv the lower bound was 0.01. Additionally, if the difference between v.true and v.false in a specific model was smaller than 0.2 (favoring v.true), the value of v.false was replaced by the value of v.true minus 0.2. Using these “true” individual parameter values, we simulated 21 datasets, each comprising 100 “participants”, each with 1,000 “trials”.
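A minimal base R sketch of this sampling-and-correction procedure follows (illustrative names; the actual simulation code is available in the repository linked below):

```r
# Draw "true" individual parameter values for one dataset (one row of Table 1).
draw_true_params <- function(n = 100,
                             mu = c(start = 2, bound = 2.5,
                                    v.true = 2.5, v.false = 0.8, sv = 1)) {
  sds   <- c(start = 0.6,  bound = 0.6,  v.true = 0.8, v.false = 0.8,  sv = 0.4)
  lower <- c(start = 0.05, bound = 0.05, v.true = 1.3, v.false = 0.05, sv = 0.01)
  p <- as.data.frame(sapply(names(mu), function(nm)
    pmax(rnorm(n, mu[nm], sds[nm]), lower[nm])))  # truncate at the lower bound
  # enforce a minimum v.true - v.false advantage of 0.2
  fix <- p$v.true - p$v.false < 0.2
  p$v.false[fix] <- p$v.true[fix] - 0.2
  p
}

true_params <- draw_true_params()  # 100 "participants" for one dataset
```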
Each simulated dataset was then fitted using the Bayesian hierarchical LBA. The priors for the individual estimates were centered on the mean values used for simulation, with noise added: to each mean we added a value drawn from a normal distribution with mean = 0 and SD = 0.1. Adding noise makes the priors more realistic in the sense of being “in the neighborhood” of the true value without being exactly at it.
The population-level prior means were the same as the individual-level prior means, except for sv, which was defined using a (uniform) Beta distribution with α = β = 1 and a range from 0.999 to 1.001. The Sigma prior (population standard deviation) was uninformative (meaning that it was de facto determined by the data alone and not by the priors) and was defined with a (uniform) Beta distribution, with α = β = 1 and a range of 0–3.
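Putting the two preceding paragraphs together, a sketch of the prior construction (illustrative structure; the exact specification syntax depends on the fitting package):

```r
# Individual-level prior means: simulation means plus N(0, 0.1) jitter.
true_mu  <- c(start = 2, bound = 2.5, v.true = 2.5, v.false = 0.8)
prior_mu <- true_mu + rnorm(length(true_mu), mean = 0, sd = 0.1)

# Hyper-level priors as (shape1, shape2, lower, upper) of a rescaled Beta
# distribution: mu_sv is the narrow scale-setting prior; sigma is the flat,
# uninformative population-SD prior.
hyper_priors <- list(
  mu_sv = c(shape1 = 1, shape2 = 1, lower = 0.999, upper = 1.001),
  sigma = c(shape1 = 1, shape2 = 1, lower = 0,     upper = 3)
)
```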
Posterior sampling was accomplished with a burn-in period of 1,000 samples per chain, 12,000 sampling iterations, and thinning equal to 12, which means keeping every 12th sample. We used the default number of chains: the number of free parameters times three.
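For concreteness, the sampler settings described above can be summarized as follows (illustrative names, not the fitting package’s actual argument names):

```r
n_free_pars <- 5  # starting point, boundary, v.true, v.false, sv
sampler_settings <- list(
  burn_in  = 1000,             # warm-up samples discarded per chain
  n_sample = 12000,            # sampling iterations per chain
  thin     = 12,               # keep every 12th sample
  n_chains = 3 * n_free_pars   # default: three chains per free parameter
)
```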
The results for each of the 21 datasets thus comprised a set of “true” parameter values (those used to simulate the data) for each of the 100 “participants” and the estimated parameters for these “participants”. Of interest was the degree of correspondence between the “true” and estimated parameter values across participants in each of the datasets.
Method for assessing parameter recovery
After the model had converged, we calculated the PSRF for each participant (to assess model convergence). We then computed the Pearson correlation and the ICC2,1 between the “true” parameter values and the (estimated) recovered parameter values. The logic is that if the Pearson correlation and the ICC2,1 are high, we can conclude that the new method is credible and can be used to compare individuals/groups. This holds in the sense that if the LBA provides a reasonable approximation of the data-generation process, then the estimated parameter values correctly represent the operation of this process. We used both the Pearson correlation and the ICC2,1 to inspect both the stability of the relative scores (Pearson correlation) and the absolute agreement (ICC2,1). Since this procedure produced 21 correlations and 21 ICCs, we averaged these values through Fisher’s Z-transformation. It is important to note that while we employed a Bayesian method for parameter estimation, hypothesis testing was conducted using frequentist statistics. This approach was taken because we are not aware of a Bayesian method for testing the significance of the ICC. Consequently, we decided to use frequentist statistics for all our tests to maintain consistency.
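These recovery metrics can be computed in base R as follows. The ICC2,1 below is the classical two-way random-effects, absolute-agreement, single-measures coefficient, written out from its ANOVA decomposition rather than taken from a package; object names are ours.

```r
# ICC2,1 from the n x 2 matrix of (true, estimated) values per parameter.
icc21 <- function(true, est) {
  x <- cbind(true, est)
  n <- nrow(x); k <- ncol(x); grand <- mean(x)
  SSR <- k * sum((rowMeans(x) - grand)^2)   # between "participants"
  SSC <- n * sum((colMeans(x) - grand)^2)   # between methods (true vs. estimated)
  SSE <- sum((x - grand)^2) - SSR - SSC     # residual
  MSR <- SSR / (n - 1)
  MSC <- SSC / (k - 1)
  MSE <- SSE / ((n - 1) * (k - 1))
  (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
}

# Fisher's Z-transformed average of the 21 per-dataset coefficients
fisher_mean <- function(r) tanh(mean(atanh(r)))

# Per-dataset usage, with `true` and `est` holding one parameter's values:
# r_pearson <- cor(true, est)    # relative stability
# r_icc     <- icc21(true, est)  # absolute agreement
```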
Results
Most of the models converged successfully, with the PSRF falling below 1.1 for 95%–100% of the ”participants” in each model. Only in one model did the PSRF fall below 1.1 for only 19% of the “participants”. Since the PSRF was unsatisfactory (above 1.1) for some simulated participants, we report the correlations and ICC2,1 values after removing these “participants”. A parallel analysis that includes these “participants” did not change the picture much and is reported in the Supplementary Materials.
We found that, with the exception of v.false, all correlations and all ICC2,1s were significantly different from zero (p < 0.05). For v.false, the ICC2,1 was not significant in 8 of the 21 models. The range of values is shown in Table 2. Additionally, the Pearson correlation results are shown in Figure 2a, and the ICC2,1 results are shown in Figure 2b.
Table 2
Pearson correlation and ICC2,1 range and mean for all parameters using the new method in Study 1.
| PARAMETER | METHOD | LOWER BOUND | HIGHER BOUND | MEAN VALUE |
|---|---|---|---|---|
| Starting point | Pearson cor. | 0.770 | 0.900 | 0.840 |
| Boundary | Pearson cor. | 0.614 | 0.909 | 0.827 |
| v.true | Pearson cor. | 0.867 | 0.959 | 0.911 |
| v.false | Pearson cor. | 0.274 | 0.566 | 0.439 |
| sv | Pearson cor. | 0.875 | 0.958 | 0.924 |
| Starting point | ICC2,1 | 0.660 | 0.897 | 0.818 |
| Boundary | ICC2,1 | 0.578 | 0.905 | 0.806 |
| v.true | ICC2,1 | 0.842 | 0.941 | 0.898 |
| v.false | ICC2,1 | 0.038 | 0.540 | 0.209 |
| sv | ICC2,1 | 0.861 | 0.950 | 0.917 |

Figure 2
Mean (a) Pearson correlation, and (b) ICC2,1 Across the 21 Simulated Datasets (Means were computed through Fisher’s Z transformation), using the new method in Study 1. Error bars represent the lower and upper range of correlations from the 21 models.
It seems that all parameters, except for v.false, were recovered successfully at the individual (simulated) participant level, as reflected in both a reasonably high Pearson correlation and a reasonably high ICC2,1.
We suspected that v.false failed to recover successfully not because of the new method, but because poor recovery of this parameter might be a weakness of the classic method as well. To test whether this is indeed the case, we ran another parameter-recovery study using the exact same method as before, but this time we fixed sv = 1 for all participants, as is customarily done (i.e., the “classic” method). The results of this parameter recovery are reported below.
This time we found that the PSRF fell below 1.1 for all “participants” in all models, except for one “participant” in one model. This “participant” was removed from the analysis, and the results of the analyses that include this “participant” are reported in the Supplementary Materials.
Once again, all Pearson correlations for all parameters were significantly different from zero (p < 0.05). This time, all ICC2,1s but that of v.false met this criterion. The results of this analysis are reported in Table 3 and Figure 3.
Table 3
Pearson correlation and ICC2,1 range and mean for all parameters using the classic method in Study 1.
| PARAMETER | METHOD | LOWER BOUND | HIGHER BOUND | MEAN VALUE |
|---|---|---|---|---|
| Starting point | Pearson cor. | 0.780 | 0.920 | 0.881 |
| Boundary | Pearson cor. | 0.966 | 0.992 | 0.987 |
| v.true | Pearson cor. | 0.988 | 0.993 | 0.991 |
| v.false | Pearson cor. | 0.424 | 0.957 | 0.615 |
| Starting point | ICC2,1 | 0.756 | 0.917 | 0.869 |
| Boundary | ICC2,1 | 0.960 | 0.992 | 0.984 |
| v.true | ICC2,1 | 0.988 | 0.993 | 0.991 |
| v.false | ICC2,1 | 0.0431 | 0.956 | 0.434 |

Figure 3
Mean (a) Pearson correlation, and (b) ICC2,1 Across the 21 Simulated Datasets (Means were computed through Fisher’s Z transformation), using the classic method in Study 1. Error bars represent the lower and upper range of correlations from the 21 models.
Summary
The results of Study 1 demonstrate the applicability of the newly suggested method through quite successful parameter recovery, including that of sv. However, in retrospect, we identified three shortcomings in our methodology: (a) it is plausible that the successful parameter recovery was attained because the priors were unrealistically similar to the true population mean values; (b) the recovery by the new method, despite being successful, was less successful than that of the “classic” method; and (c) most real-world studies model at least one parameter as a function of groups or conditions, and this division is lacking in Study 1 (nonetheless, individual differences in the parameter values were successfully captured). To address these shortcomings, we conducted Study 2, using a slightly different approach.
Study 2
Method
To address the first drawback, concerning the too-close-to-true priors, we adopted a different method of creating the priors, in which we sampled the prior means from a uniform distribution ranging from two standard errors below to two standard errors above the true parameter mean. We believe that this change more accurately mirrors real-world applications, characterized by well-informed guessing, where researchers have extensive experience with the task at hand.
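A minimal sketch of this prior construction (assumed names; here we take the standard error to be the population SD divided by the square root of the per-group sample size):

```r
set.seed(2)
n       <- 50                                        # "participants" per group
true_mu <- c(start = 2,   bound = 2.5, v.true = 2.5, v.false = 0.8)
sds     <- c(start = 0.6, bound = 0.6, v.true = 0.8, v.false = 0.8)
se      <- sds / sqrt(n)                             # standard errors of the means
# prior means drawn uniformly within +/- 2 SE of the true means
prior_mu <- true_mu + runif(length(true_mu), -2 * se, 2 * se)
```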
To address the second and third shortcomings, we wanted to demonstrate the most important feature of the new method: it allows one to find true parameter differences without the risk of wrongly attributing differences to parameters that do not truly differ. Therefore, we simulated two groups of 50 “participants” each. In the first group, the sv mean value was 1.5, and in the second group, the sv mean value was 0.5. The population mean values of the remaining parameters were identical across the two groups: starting point = 2, boundary = 2.5, v.true = 2.5, v.false = 0.8, t0 = 0.3. We then created the parameters for each participant, and the data, using the same method as in Study 1. To reveal the discrepancy between the new method and the classic method, we fitted the data twice: once using the new method, and a second time using the classic method. The idea was to demonstrate that the new method would reveal the true difference between the groups (in sv), whereas the classic method might wrongly indicate non-existing group differences in other parameters. This time, we also decided to assess model fit using the Deviance Information Criterion (DIC; Spiegelhalter et al., 2014), in addition to assessing parameter-recovery success. Using the DIC allowed us to compare the new and classic methods in terms of model fit, with a lower DIC indicating better fit. Following a reviewer’s comment, we add that the DIC may be inaccurate in complex models such as the LBA (Pooley & Marion, 2018). Nonetheless, the DIC helped us compare the two methods, albeit tentatively.
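As a reminder of how the DIC is defined, a minimal sketch (assuming `dev` holds the deviance, −2 × log-likelihood, at each posterior draw, and `dev_at_mean` the deviance at the posterior means of the parameters):

```r
# DIC = deviance at the posterior mean + 2 * pD (Spiegelhalter et al., 2014),
# where pD, the effective number of parameters, is the mean posterior
# deviance minus the deviance at the posterior mean.
dic <- function(dev, dev_at_mean) {
  pD <- mean(dev) - dev_at_mean
  dev_at_mean + 2 * pD            # equivalently: mean(dev) + pD
}
```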
Results
Modeling the data using both the classic and the new method was largely successful in terms of convergence, with the PSRF falling above 1.1 for only one “participant” in the classic method (2.45) and for three “participants” in the new method (2.01, 2.08, 2.23). As before, these “participants” were removed from further analyses; the results of the analysis including them did not change the picture much and are reported in the Supplementary Materials.
In both methods, all correlations and all ICCs were significantly different from zero (p < 0.05). See Tables 4 and 5 and Figures 4 and 5.
Table 4
Pearson correlations and ICC2,1 range and value for all parameters using the new method in Study 2.
| PARAMETER | METHOD | LOWER BOUND | HIGHER BOUND | VALUE |
|---|---|---|---|---|
| Starting point | Pearson cor. | 0.719 | 0.864 | 0.803 |
| Boundary | Pearson cor. | 0.658 | 0.831 | 0.758 |
| v.true | Pearson cor. | 0.739 | 0.874 | 0.818 |
| v.false | Pearson cor. | 0.857 | 0.933 | 0.902 |
| sv | Pearson cor. | 0.923 | 0.965 | 0.948 |
| Starting point | ICC2,1 | –0.092 | 0.800 | 0.490 |
| Boundary | ICC2,1 | –0.094 | 0.750 | 0.420 |
| v.true | ICC2,1 | –0.084 | 0.790 | 0.470 |
| v.false | ICC2,1 | 0.510 | 0.920 | 0.830 |
| sv | ICC2,1 | 0.048 | 0.920 | 0.770 |
Table 5
Pearson correlation and ICC2,1 range and value for all parameters using the classic method in Study 2.
| PARAMETER | METHOD | LOWER BOUND | HIGHER BOUND | VALUE |
|---|---|---|---|---|
| Starting point | Pearson cor. | 0.255 | 0.579 | 0.431 |
| Boundary | Pearson cor. | 0.297 | 0.609 | 0.467 |
| v.true | Pearson cor. | 0.186 | 0.529 | 0.370 |
| v.false | Pearson cor. | 0.680 | 0.842 | 0.774 |
| Starting point | ICC2,1 | 0.081 | 0.44 | 0.270 |
| Boundary | ICC2,1 | 0.14 | 0.49 | 0.32 |
| v.true | ICC2,1 | 0.033 | 0.39 | 0.22 |
| v.false | ICC2,1 | 0.60 | 0.80 | 0.71 |

Figure 4
(a) Pearson correlation, and (b) ICC2,1, using the new method in Study 2. Error bars represent 95 percent confidence interval.

Figure 5
(a) Pearson correlation, and (b) ICC2,1, using the classical method in Study 2. Error bars represent 95 percent confidence interval.
When comparing the two models using the DIC, we found that the new method generated a better fit than the “classic” method for 58 of the 96 ”participants” (60.41%). Summing the DIC for each method separately yields the same conclusion, with a markedly lower (better) value for the new method (194,665.3) than for the “classic” method (195,714.6), indicating decisive superiority of the new method. Considering that DIC differences of even 6 are considered meaningful, the current difference of 1,049.3 points probably reflects a meaningful discrepancy, even when taking the DIC’s limitations into account.
Finally, and most importantly, we conducted a series of t-tests to compare the mean posterior values between the two groups. When using the new method, we found that the groups were significantly different from each other only in sv (t(95) = 14.302, p < 0.001), in the correct direction (mean of the first group = 2.120, mean of the second group = 0.772), while the rest of the t-tests were non-significant (p > 0.05). In other words, the new method correctly detected the true group difference and did not wrongly detect other differences. In sharp contrast, the classic method (wrongly) found significant group differences in all the parameters (see the Supplementary Materials for a full description). In other words, the classic method failed to detect the true group difference (because it existed in the fixed parameter) and, as a result, wrongly found group differences in the remaining parameters.
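The group comparison itself is a standard two-sample t-test on the per-participant posterior means; a sketch under assumed object names (`post_mean_sv`, `group`):

```r
# post_mean_sv: per-"participant" posterior mean of sv (non-converged fits removed)
# group: factor coding the simulated group (sv mean 1.5 vs. 0.5)
t.test(post_mean_sv ~ group, var.equal = TRUE)  # pooled-variance t-test
```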
Summary
In summary, the results of Study 2 suggest that the new method overcomes the limitations of the classical method. Specifically, when the groups differed in sv, the classic method wrongly found differences in other parameters, whereas the new method accurately revealed the true difference between the groups. Additionally, the new method recovered the parameters of the individual participants successfully, as assessed by the Pearson correlation and the ICC2,1. However, this time we found that the differences between the Pearson correlation and the ICC2,1 were substantial, suggesting that both methods were better at recovering differences between individuals and groups than at recovering absolute values. The reason for this discrepancy, found in Study 2 but not in Study 1, remains unclear. Nevertheless, we noticed that the recovered parameter values in Study 2 were systematically higher than the true values, as shown in the Supplementary Materials.
It is important to emphasize that the exact value of a parameter often lacks practical significance, since in most cases researchers are interested in differences. This implies that the discrepancy between the Pearson correlation and the ICC2,1 results, while not fully understood, does not seem to have much practical relevance.
Another noteworthy aspect of the Study 2 results is the wide confidence interval of the ICC2,1, occasionally encompassing zero despite the significant p-value. Such results can occur because the p-values and the confidence intervals are calculated using different methods.
Discussion
In this study, we suggested a new method for fitting EAMs. In the standard usage, one must fix an individual-level parameter that sets the scale for the remaining parameters, and this setting means assuming that participants or groups do not differ from one another in this particular parameter. The problem is that this assumption might be wrong, and if it is violated, comparing individuals or groups on the values of their freely estimated parameters becomes impossible. The new method solves this limitation by de facto fixing the hyper-parameter describing the population level instead of fixing the parameter at the individual level. We wish to emphasize that our arguments pertain to individual differences and to existing group differences (such as comparisons of elderly and young participants), but also to randomly assigned groups that received different treatments. This is because, unless checked, one cannot know whether the treatment influenced the fixed parameter such that it eventually varied between groups or individuals.
More specifically, the new method is suitable for the Bayesian hierarchical version of EAMs. Using this version, we employed the Linear Ballistic Accumulator model as a proof of concept and defined the Mu prior (the population-mean prior) for the sv parameter with a Beta distribution with α = β = 1 (i.e., a uniform distribution) over a narrow range of 0.999 to 1.001. This prior definition allows for individual differences in this parameter by de facto constraining only the population mean of this parameter, which sets the scale for all the free parameters (aside from t0, which already has a seconds scale).
To check the applicability of the new method, we ran two parameter recovery studies. In Study 1, we found quite reasonable parameter recovery, as indicated by Pearson correlation and ICC2,1 between the “true” parameter values and the estimated ones. Additionally, we found that the differences between Pearson correlation and ICC2,1 were quite low, suggesting that the recovery was successful both in terms of stability of relative scores and in absolute agreement. The only parameter that did not recover well was v.false. To show that this is not a byproduct of using the new method, we ran another parameter recovery using the classical method and once again found that all parameters recovered successfully aside from v.false. These results confirm our suspicion that v.false is not so well recovered regardless of the method being used.
Study 2 addressed three shortcomings of Study 1: (a) the use of priors that were unrealistically close to the truth, (b) the failure to demonstrate a better model fit for the new method, and (c) the lack of consideration of group or condition differences. In Study 2, we used priors that were less close to the truth than in Study 1, compared the methods in terms of model fit, and additionally considered group differences. The results indicate that the model of the new method was decisively superior to the classic model and also yielded better recovery in terms of the Pearson correlation and the ICC2,1. Critically, while only the new method successfully detected the true group difference, the classic method wrongly identified non-existing group differences.
The current parameter-recovery work employed the LBA, but a more widely used EAM is the DDM (Ratcliff & McKoon, 2008). When fitting the DDM, the fixed parameter is not the across-trial standard deviation of the (individual) drift rates (which is usually fixed when fitting the LBA) but instead the within-trial standard deviation of the drift rate. Such within-trial variation in the drift rate is absent in the LBA, which is intentionally simplified. Nonetheless, all the current arguments remain relevant for the DDM as well, because even when using the DDM, it remains to be shown that individual and group differences in within-trial drift-rate variability are negligible. Such an assumption is far from warranted, given that such fluctuations may also represent neural noise and given the known, meaningful individual differences in neural noise (Dinstein et al., 2015).
Although we demonstrated successful parameter recovery using the new method, this paper provides only a proof of concept. More specifically, it supports the new method using (1) the Linear Ballistic Accumulator model, (2) 21 different combinations of mean parameter values that are realistic for some tasks but may not be realistic for others, (3) a fixed standard deviation for each parameter in each model, and (4) priors that do not differ too much from the true values, that is, a situation in which there is already quite substantial knowledge regarding the task being studied. Applying the new method to situations that differ considerably from those studied here necessitates running additional parameter-recovery studies.
In summary, the benefits of the new method suggest that it should be seriously considered, especially when comparing individuals and groups, because the literature implicates true individual differences in sv.
Data Accessibility Statement
Parameter recovery and analysis of the parameter-recovery results were conducted using R, versions 4.1.2 and 4.2.2 (R Core Team, 2020). All code, data, results, and a working example of the new method are available here.
Additional File
The additional file for this article can be found as follows:
Supplementary Materials
Supplementary Materials that include additional analyses (https://osf.io/3jvsp/files/osfstorage). DOI: https://doi.org/10.5334/joc.394.s1
Ethics and Consent
We did not test human (or any other animal) subjects; this is a parameter recovery study that uses simulated data.
Acknowledgements
We thank Lin Yi-Shin for stimulating consultation.
Funding Information
This work was supported by Israel Science Foundation Grant #1547/20 to Nachshon Meiran.
Competing interests
The authors have no competing interests to declare.
