
Figure 1
Outline of the two-step task (TST). Transition probabilities from the first stage to the second stage are the same in both versions of the task. The second stage with a green frame depicts the modified task version employed in data set data1 (Mathar et al., 2022): after making an S2 choice, subjects receive feedback in the form of continuous reward magnitudes (rounded to the next integer). The lower S2 stage (orange frame) depicts the classic version (used in data set data2; Gillan et al., 2016), in which the S2 feedback is binary (rewarded vs. unrewarded, based on fluctuating reward probabilities).
Table 1
Free and fixed parameters for all models.
| MODEL | FREE PARAMETERS |
|---|---|
| Q | α, α2, α3, ßMB, ßMF, ßpersev, ß2 |
| Q + BANDIT | α, α2, α3, ßMB, ßMF, ßpersev, ß2, φ |
| Q + TRIAL | α, α2, α3, ßMB, ßMF, ßpersev, ß2, φ |
| Q + HOP | α, α2, α3, αHOP, ßMB, ßMF, ßpersev, ß2 |
| Q + BANDIT + HOP | α, α2, α3, αHOP, ßMB, ßMF, ßpersev, ß2, φ |
| Q + TRIAL + HOP | α, α2, α3, αHOP, ßMB, ßMF, ßpersev, ß2, φ |
[i] Note. Q refers to the basic hybrid model with a FOP term. BANDIT/TRIAL = added first-stage exploration bonus based on the respective counter heuristic (cf. Computational Models section); φ: parameter that scales the exploration bonus. This parameter is the same for both exploration bonus variants, regardless of the specific formalisation of uncertainty estimates in a given model.
Table 2
Results from regression analyses of S1 choice repetition probability.
| DATA SET | PREDICTOR | ESTIMATE | 95% CI | z-VALUE | p-VALUE |
|---|---|---|---|---|---|
| data1 | Intercept | 1.35 | [1.12; 1.58] | 11.48 | <.01 |
| | Reward | 0.11 | [0.05; 0.18] | 3.47 | <.01 |
| | Transition | –0.07 | [–0.14; –0.01] | –2.14 | .03 |
| | Reward*Transition | 0.47 | [0.36; 0.59] | 8.10 | <.01 |
| data2 | Intercept | 1.73 | [1.53; 1.94] | 16.68 | <.01 |
| | Reward | 0.64 | [0.51; 0.77] | 9.89 | <.01 |
| | Transition | 0.02 | [–0.03; 0.08] | 0.81 | .42 |
| | Reward*Transition | 0.16 | [0.07; 0.24] | 3.76 | <.01 |
[i] Note. Reward: main effect of reward type (unrewarded vs. rewarded), commonly interpreted as an indicator for MF control; Transition: main effect of transition type (rare vs. common); Reward*Transition: interaction of Reward and Transition type, commonly interpreted as an indicator for MB control.

Figure 2
Stay-Probabilities of S1 choices and difference scores. Upper panel: MB and MF difference scores as defined by Eppinger et al. (2013), bar heights depict mean scores over all participants, error bars show the standard error. Lower panel: Probabilities for S1 choice repetition as a function of reward (rew+: rewarded; rew-: unrewarded) and transition type (common/rare) of the preceding trial. The left plots (green, A) show results from data1; the right plots (orange) show results from data2.
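
The quantities plotted here can be illustrated with a minimal sketch. It assumes each trial is coded with a first-stage choice, a binary reward flag and a binary transition flag; for continuous rewards as in data1, some binarisation rule would be needed, and the caption does not specify one. The function names are hypothetical, and the difference-score formulas are the commonly used ones, stated here as an assumption rather than quoted from Eppinger et al. (2013).

```python
from collections import defaultdict


def stay_probabilities(choices, rewarded, common):
    """Return P(stay) for each (rewarded, common) cell of the previous trial.

    choices:  first-stage choice per trial (any comparable labels)
    rewarded: True if the trial was rewarded
    common:   True if the trial had a common transition
    """
    stays = defaultdict(int)
    counts = defaultdict(int)
    for t in range(1, len(choices)):
        # Condition on the outcome of the preceding trial.
        cell = (bool(rewarded[t - 1]), bool(common[t - 1]))
        counts[cell] += 1
        stays[cell] += int(choices[t] == choices[t - 1])
    return {cell: stays[cell] / counts[cell] for cell in counts}


def difference_scores(p):
    """MB and MF difference scores from the four stay probabilities.

    Assumed definitions (not stated in the caption):
      MF = (rew+/common + rew+/rare) - (rew-/common + rew-/rare)
      MB = (rew+/common + rew-/rare) - (rew+/rare + rew-/common)
    """
    mf = (p[(True, True)] + p[(True, False)]) - (p[(False, True)] + p[(False, False)])
    mb = (p[(True, True)] + p[(False, False)]) - (p[(True, False)] + p[(False, True)])
    return mb, mf


# Toy example with five trials (illustrative values only).
probs = stay_probabilities(
    choices=[0, 0, 1, 1, 0],
    rewarded=[True, False, True, False, True],
    common=[True, True, False, False, True],
)
mb_diff, mf_diff = difference_scores(probs)
```

Under these assumed definitions, a purely reward-driven (MF-like) pattern yields a large MF score and an MB score near zero, whereas a Reward × Transition crossover yields a large MB score.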

Figure 3
Model Comparison Results via the Widely Applied Information Criterion (WAIC) for all Q Models (cf. Table 1). The upper and lower panels (green and orange bar plots) refer to data1 and data2, respectively. Bandit/Trial refer to the model variants with an added heuristic-based exploration bonus using stimulus identity/recency, respectively. HOP: model variants with a higher order perseveration term; all other versions use a classic FOP term instead (Q, Q+BANDIT, Q+TRIAL).
Table 3
Results from model comparison of QL-models with a HOP extension using leave-one-out cross-validation (LOO).
| DATA SET | MODEL | –elpddiff | sediff | WAIC |
|---|---|---|---|---|
| data1 | Q + HOP | –28.9 | 9.6 | 17750.21 |
| | Q + BANDIT + HOP | –4.0 | 6.2 | 17715.03 |
| | Q + TRIAL + HOP | 0.0 | 0.0 | 17714.46 |
| data2 | Q + HOP | –13.8 | 8.3 | 35905.27 |
| | Q + BANDIT + HOP | –11.2 | 6.2 | 35887.17 |
| | Q + TRIAL + HOP | 0.0 | 0.0 | 35871.03 |
[i] Note. elpddiff: difference in the expected log pointwise predictive density; sediff: standard error of that difference. These values show the results of a model comparison using LOO estimates. Each model is compared to the preferred model (Q + TRIAL + HOP). As there is no difference between the best-fitting model and itself, the entries for that model are always zero.
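
One way to read the elpddiff and sediff columns together is to express each difference in units of its standard error. The sketch below does this for the values reported in Table 3. The helper name is hypothetical, and the ratio is offered only as a reading aid, not as the decision criterion used for the model comparison.

```python
# (data set, model) -> (elpd_diff, se_diff), copied from Table 3.
# The reference model (Q + TRIAL + HOP) is omitted because its se_diff is zero.
table3 = {
    ("data1", "Q + HOP"): (-28.9, 9.6),
    ("data1", "Q + BANDIT + HOP"): (-4.0, 6.2),
    ("data2", "Q + HOP"): (-13.8, 8.3),
    ("data2", "Q + BANDIT + HOP"): (-11.2, 6.2),
}


def diff_in_se_units(elpd_diff, se_diff):
    """Absolute elpd difference divided by its standard error."""
    return abs(elpd_diff) / se_diff


ratios = {key: round(diff_in_se_units(*vals), 2) for key, vals in table3.items()}
for (data_set, model), ratio in ratios.items():
    print(f"{data_set:5s} {model:18s} |elpd_diff| / se_diff = {ratio}")
```

Differences that amount to only a small multiple of their standard error are usually treated with caution, since the LOO estimates then do not clearly separate the models.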
Table 4
Proportion of correct S1 choice predictions by the winning model Q + HOP.
| DATA SET | MIN | 25th PERCENTILE | MEDIAN | MEAN | 75th PERCENTILE | MAX |
|---|---|---|---|---|---|---|
| data1 | .519 | .638 | .764 | .748 | .841 | .916 |
| data2 | .505 | .687 | .767 | .754 | .829 | .977 |
[i] Note. Summary statistics are based on the comparison of individuals’ choices with model predictions, which were pooled and averaged for each data set.
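
A minimal sketch of how such summary statistics could be computed is given below. It assumes that each participant has one observed and one model-predicted first-stage choice per trial; how the predicted choice is derived from the fitted model is not specified in the note, so that step is left out. Function names are hypothetical.

```python
from statistics import mean, median


def prediction_accuracy(observed, predicted):
    """Proportion of trials on which the predicted choice matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)


def summarise(accuracies):
    """Summary statistics over per-participant accuracies."""
    return {
        "min": min(accuracies),
        "median": median(accuracies),
        "mean": mean(accuracies),
        "max": max(accuracies),
    }


# Toy example with three participants (illustrative values only).
per_subject = [
    prediction_accuracy([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    prediction_accuracy([1, 1, 0, 0], [1, 0, 1, 0]),  # 2 of 4 correct
    prediction_accuracy([0, 0, 1, 1], [0, 0, 1, 1]),  # 4 of 4 correct
]
summary = summarise(per_subject)
```

With two choice options, an accuracy of .5 corresponds to chance-level prediction, which gives a reference point for the minimum values in the table.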

Figure 4
Posterior Distributions of Group-Level Mean Parameters From Model Q + HOP. Solid gray lines show the 95% highest density interval (HDI) and dots depict the point-estimate of the mean. Panels A and B (green and orange plots) show results on the basis of data sets data1 and data2, respectively.

Figure 5
Probabilities of S1 choice repetition as a function of reward and transition type. Y-axis: stay probabilities for first-stage choices; data: empirical stay probabilities from data sets data1 (panel A; green) and data2 (panel B; orange); simulation: stay probabilities from N = 8000 simulated choice sequences per subject, derived from the winning model (Q + HOP); rew+/–: previous trial was rewarded (+) or unrewarded (–); common/rare: previous trial followed a common/rare transition, respectively. Error bars in the simulation plots depict the 95% HDI over 8000 simulated data sets.
Table 5
Posterior Estimates of Group-Level Parameters from Model Q + HOP.
| PARAMETER | data1: MEDIANx | data1: 95% HDI | data2: MEDIANx | data2: 95% HDI |
|---|---|---|---|---|
| α1 | 0.38 | [0.10, 0.83] | 0.59 | [0.51, 0.67] |
| α2 | 0.82 | [0.64, 0.96] | 0.45 | [0.38, 0.53] |
| α3 | 0.99 | [0.97, 1.00] | 0.76 | [0.68, 0.83] |
| αHOP | 0.57 | [0.35, 0.82] | 0.98 | [0.95, 1.00] |
| ßmb | 10.59 | [7.65, 13.33] | 2.80 | [1.44, 4.11] |
| ßmf | 1.39 | [0.87, 1.91] | 3.00 | [2.46, 3.52] |
| ßHOP | 1.44 | [1.20, 1.65] | 1.82 | [1.58, 2.10] |
| ß2 | 9.76 | [8.26, 11.49] | 6.84 | [5.97, 7.72] |
[i] Note. Posterior point-estimates of hyperparameter medians and corresponding 95% highest density intervals (95% HDI) for data1 and data2 from the winning model (Q + HOP) for all subject-level parameters x listed in the first column.

Figure 6
Associations of model-agnostic and model-derived indices of MB and MF control for data1 (A) and data2 (B). Empty tiles (left panels) indicate non-significant associations. ßrew, ßtrans, ßrew:trans: regression weights for the main effects of reward and transition type and for their interaction; MBdiff, MFdiff: difference scores of MB and MF influences on S1 stay probabilities, respectively; ßMB, ßMF: MB and MF S1 choice parameters from the winning model; ßHOP: S1 higher order perseveration parameter; mean reward: mean reward gained throughout the TST (data1: 300 trials, data2: 200 trials). Right panels: association of the model-derived MB parameter (ßMB) and the habit step-size parameter αHOP with mean reward. Circles depict individual participants. Plots in panel A (top row, green) are based on data1; plots in panel B (bottom row, orange) are based on data2.

Figure 7
Posterior Density Estimates Based on the Full Sample of data2. Group-level parameter estimates from model variant Q + HOP derived from fitting the full sample of the original publication (Gillan et al., 2016; N = 548; Experiment 1). The lower panel of Figure 4 shows corresponding results based on data2 (N = 100). Grey dots indicate the mean point-estimate, bars depict the 95% HDI.
