
Figure 1
First 8 bars of the tune “250 to Vigo (sessiontune9)” from “The Session” dataset. The figure shows a key that is similar to the query; both are followed by the same pitch, E. PPM would fail to match the key to the query, since the two do not match at any order n.
Table 1
NLL and precision of the DCSTM and the CCSTM (our models) and of the baselines, grouped into short-term (STM) and long-term (LTM) models. The length of the respective temporal context is denoted by n. An asterisk (*) denotes that the maximal context is used. EVT means that the performance was measured only at time steps where the pitch changes.
| TYPE | NAME | N | NLL | PRECISION |
|---|---|---|---|---|
| STM | CCSTM-512 | 512 | 0.574 | 0.848 |
| STM | CCSTM-32 | 32 | 0.733 | 0.783 |
| STM | DCSTM-512 | 512 | 0.792 | 0.781 |
| STM | MC-3 | 3 | 1.922 | 0.606 |
| STM | PPM | * | 1.387 | 0.798 |
| STM | Repetition | 1 | 2.724 | 0.606 |
| LTM | WaveNet-512 | 512 | 0.502 | 0.849 |
| LTM | Transformer-512 | 512 | 0.370 | 0.887 |
| LTM | Transformer-32 | 32 | 0.852 | 0.718 |
| EVT | CCSTM-512 | 512 | 1.237 | 0.682 |
| EVT | IDyOM | * | 1.870 | 0.426 |
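
As an illustration of the two metrics reported in Table 1, the following minimal sketch computes NLL and precision from per-step predicted pitch distributions, optionally restricted to time steps where the pitch changes (the EVT setting). The function name `nll_and_precision`, the use of the natural logarithm, and the reading of precision as the fraction of steps where the most probable pitch equals the true pitch are assumptions made for this example; they are not taken from the evaluation code behind the table.

```python
import numpy as np

def nll_and_precision(probs, targets, event_only=False):
    """Mean NLL and precision over time steps (hypothetical helper).

    probs:   array of shape (T, K), one predicted pitch distribution per step.
    targets: array of shape (T,), index of the true pitch at each step.
    event_only: if True, keep only steps whose pitch differs from the
                previous step (assumed reading of the EVT setting).
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    steps = np.arange(len(targets))

    if event_only:
        # The first step has no predecessor, so it is excluded here.
        changed = np.zeros(len(targets), dtype=bool)
        changed[1:] = targets[1:] != targets[:-1]
        steps = steps[changed]

    if len(steps) == 0:
        raise ValueError("no time steps left to evaluate")

    # Probability assigned to the true pitch at each evaluated step.
    p_true = probs[steps, targets[steps]]
    # Clip to avoid log(0); natural log is an assumption of this sketch.
    nll = -np.mean(np.log(np.clip(p_true, 1e-12, None)))
    # Fraction of steps where the most probable pitch is the true pitch.
    precision = np.mean(np.argmax(probs[steps], axis=1) == targets[steps])
    return float(nll), float(precision)
```

Calling the function with `event_only=True` typically yields a higher NLL and a lower precision than the default, because steps where the pitch is simply held are excluded. This is consistent with the gap between the STM and EVT rows for CCSTM-512.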

Figure 2
Negative log-likelihood (NLL) and precision as functions of time-steps in intra-opus prediction, averaged over pieces of the test set.

Figure 3
Predictions of the DSTMs on “A Scone For Breakfast” (sessiontune157) from the test dataset. Green indicates the actual pitch, and red indicates a prediction error.

Figure 4
Confusion matrices for DSTMs.

Figure 5
Aggregate saliency maps for DSTMs. Pixel intensities indicate how important variables are for prediction on average.

Figure 6
Similarity of codes grouped by their pitch value.

Figure 7
Precision computed for each pitch in the test set, together with the histogram of pitches in the training dataset.

Figure 8
Precision computed for each time signature in the test set, together with the histogram of time signatures in the training dataset.
