
Figure 1
High-level view of the proposed model in an example with two choices. For each choice, the distribution of rewards Rt (gray histograms) is learned by competing critics through the updates δt+ and δt–. One system is optimistic, upweighting large rewards, and the other is pessimistic, downweighting large rewards (blue histograms). As a result, each choice is associated with two values, Q– and Q+. To determine which choice is selected, a random variable Ut is drawn for each choice uniformly from (Q–, Q+) (teal histograms). The largest Ut determines which choice is selected and when the decision is made.
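To make the selection step concrete, here is a minimal Python sketch of the mechanism described above; the function name and the two-choice numbers are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_choice(q_minus, q_plus):
    """Draw U(a) ~ Unif(Q-(a), Q+(a)) for each choice; the largest draw wins."""
    u = rng.uniform(np.minimum(q_minus, q_plus), np.maximum(q_minus, q_plus))
    return int(np.argmax(u)), u

# Two-choice example: overlapping intervals mean either choice can win a draw,
# and wider intervals leave more room for exploration.
choice, u = select_choice(np.array([0.10, 0.30]), np.array([0.60, 0.45]))
```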
Algorithm 1
Competing-Critics.
| Input: Learning rate α, parameters k+, k–, discount factor γ, and exploration parameter ɛ. |
| Initialize Q±(s, a) for all (s, a) ∈ 𝒮 × 𝒜 |
| Initialize S |
| While not terminated do |
| Sample U(a) ∼ Unif[Q–(S, a), Q+(S, a)] for each action a in state S |
| Choose A using ɛ-greedy from the values U(a) |
| Take action A, observe R, S′ |
| % Compute prediction errors |
| δ± ← R + γ maxₐ Q±(S′, a) – Q±(S, A) |
| % Update state-action value functions |
| % Move to new state |
| S ← S′ |
| end while |
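A minimal Python sketch of one learning step is given below. The prediction errors follow Algorithm 1; the asymmetric weighting of positive versus negative errors by k+ and k– is an assumed, expectile-style form chosen so that Q+ upweights and Q– downweights large rewards, as described in Figure 1, and should not be read as the paper's exact update rule.

```python
import numpy as np

def update_critics(q_plus, q_minus, s, a, r, s_next, alpha, gamma, k_plus, k_minus):
    """One learning step for the two critics.

    q_plus, q_minus: arrays of shape (n_states, n_actions), updated in place.
    The asymmetric weighting is an assumed, expectile-style form, not taken
    verbatim from the paper.
    """
    # Prediction errors, as in Algorithm 1.
    d_plus = r + gamma * q_plus[s_next].max() - q_plus[s, a]
    d_minus = r + gamma * q_minus[s_next].max() - q_minus[s, a]
    # Assumed update: the optimistic critic upweights positive errors (weight k+),
    # the pessimistic critic upweights negative errors (weight k-).
    w_plus = k_plus if d_plus > 0 else 1.0 - k_plus
    w_minus = (1.0 - k_minus) if d_minus > 0 else k_minus
    q_plus[s, a] += alpha * w_plus * d_plus
    q_minus[s, a] += alpha * w_minus * d_minus
    return d_plus, d_minus

# Example usage with 1 state and 2 actions (illustrative numbers only):
Qp, Qm = np.zeros((1, 2)), np.zeros((1, 2))
update_critics(Qp, Qm, s=0, a=1, r=1.0, s_next=0, alpha=0.1, gamma=0.9,
               k_plus=0.7, k_minus=0.7)
```

Under this assumed convention, lowering k– makes the pessimistic critic less pessimistic (Q– rises), which matches the direction of the Iowa Gambling Task result in Figure 6A.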

Figure 2
Comparison of the mean and interquartile range of the state-action value functions over 30,000 simulations. The state-action values Q+ and Q– reflect changes in the mean μ, standard deviation σ, and skew of the reward distribution. Notably, the asymptotes of these values shift by 0.25 when μ decreases by 0.25, and the gap between them is halved when σ is halved.

Figure 3
A single simulation run of the state-action value functions Q± and Q. The state-action values preserve the ordering Q– < Q < Q+ throughout the entire run.

Figure 4
Impact of the parameters k+ and k– on A) the midpoint and gap between Q+ and Q–, averaged over 30,000 simulations, and B) how an individual makes decisions. In particular, the model decomposes decision-making behavior along two axes, risk-sensitivity and uncertainty-sensitivity, which are rotated 45° from the k± axes. In the simulation, μ = 0.5, σ = 0.2, and skew = 0.
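One way to read the 45° rotation is as a change of coordinates on the (k+, k–) plane. The sketch below is a hypothetical convention (the signs and scaling are assumptions, not taken from the paper), with the risk-sensitivity coordinate varying along the k+ = 1 – k– line and the uncertainty-sensitivity coordinate varying along the k+ = k– line.

```python
import numpy as np

def behavioural_axes(k_plus, k_minus):
    """Hypothetical 45-degree change of coordinates on the (k+, k-) plane.

    risk varies along the k+ = 1 - k- line; uncertainty varies along the
    k+ = k- line. Signs and scaling are illustrative assumptions.
    """
    risk = (k_plus - k_minus) / np.sqrt(2.0)
    uncertainty = (k_plus + k_minus - 1.0) / np.sqrt(2.0)
    return risk, uncertainty

# A decision maker with k+ = 0.9, k- = 0.1 sits far along the risk axis,
# while k+ = k- = 0.9 sits far along the uncertainty (deliberation) axis.
print(behavioural_axes(0.9, 0.1), behavioural_axes(0.9, 0.9))
```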

Figure 5
Four decision makers with different k+ and k– parameter values interpret the same reward distributions differently. Risk-seeking parameter values lead to a preference for rewards drawn from the black distribution, while risk-averse parameter values lead to a preference for the red distribution. Meanwhile, deliberative parameter values are more likely to explore the two best competing choices, since those choices have overlapping Q intervals, while decisive parameter values pick only their preferred distribution. Note that none of the four learners would select the blue distribution.

Figure 6
Risk-sensitivity of the Competing-Critics model during the Iowa Gambling Task, aggregated over 100 trials and 30,000 simulations. A) The “risky” Deck B, rather than Deck C, becomes the most popular choice when the parameter k– is decreased from 0.9 to 0.1. B) Deck selection is determined by the highest value of a random variable drawn uniformly from the interval between Q– and Q+. Here, the interval from the median Q– to the median Q+ is plotted to help illustrate which decks are viable options. Deck B becomes more favorable because of a dramatic increase in the pessimistic value function Q–. C) The bad decks A and B are chosen at higher rates moving along the risk-sensitivity axis (i.e., the k+ = 1 – k– line).

Figure 7
Stay probabilities after a first-stage choice over a horizon of 80 decisions (40 first-stage decisions) and 30,000 simulations. The gap between stay probabilities for common vs. rare transitions increases along the uncertainty-sensitivity axis (i.e., the k+ = k– axis) as the learner deliberates more over multiple choices.
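For reference, stay probabilities of the kind plotted here can be computed from simulated first-stage choices as in the sketch below; the variable names and boolean encodings are illustrative, not from the paper.

```python
import numpy as np

def stay_probabilities(first_choice, transition_common, rewarded):
    """Stay probabilities for a two-stage task, split by transition and outcome.

    first_choice:      array of first-stage choices (0/1), one entry per trial
    transition_common: boolean array, True if the common transition occurred
    rewarded:          boolean array, True if the trial ended in reward
    """
    stay = first_choice[1:] == first_choice[:-1]      # repeated the previous choice?
    common, reward = transition_common[:-1], rewarded[:-1]
    out = {}
    for c_label, c_mask in (("common", common), ("rare", ~common)):
        for r_label, r_mask in (("rewarded", reward), ("unrewarded", ~reward)):
            mask = c_mask & r_mask
            out[(c_label, r_label)] = stay[mask].mean() if mask.any() else np.nan
    return out
```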

Figure 8
Mean and standard deviation (SD) of maxₐ Ut(a) in (A) the learning example with μ = 0.5, σ = 0.2, and skew = 0; (B) the Iowa Gambling Task; and (C) the two-stage Markov task. Larger values of maxₐ Ut(a) are hypothesized to correspond to faster reaction times.

Figure 9
Mean updates as a function of bet level and reward prediction error (RPE) over 30,000 simulations. (A) Mirroring the dopamine transients in Kishida et al. (2016), a large mean ΔQ+ reinforces either a large bet following a positive RPE or a small bet following a negative RPE. Mirroring the serotonin transients in Moran et al. (2018), a large mean –ΔQ– reinforces either a large bet following a negative RPE or a small bet following a positive RPE. (B–C) In addition, the mean updates can predict the upcoming bet and are asymmetric, respecting a potential asymmetry in the degree to which dopamine and serotonin transients can increase vs. decrease.
