Differentiable Short-Term Models for Efficient Online Learning and Prediction in Monophonic Music

Mathias Rose Bjare; Stefan Lattner; Gerhard Widmer

doi:10.5334/tismir.123

Figures & Tables

First 8 bars of the tune “*250 to Vigo (sessiontune9)*” from “*The Session*” dataset. The figure shows a key that is similar to the query; both are followed by the same pitch, E. PPM would fail to match the key to the query, since the two are not identical at any order n.
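
The exact-match limitation described in the caption can be illustrated with a minimal sketch. It is not the PPM implementation evaluated in the paper: the function `continuations` and the pitch sequences are hypothetical, chosen only to show that a context differing from an earlier passage in a single note yields no exact match at any order n ≥ 1.

```python
def continuations(history, context, order):
    """Return the symbols that follow exact occurrences of the last
    `order` symbols of `context` inside `history`."""
    suffix = list(context[-order:])
    return [
        history[i + order]
        for i in range(len(history) - order)
        if history[i:i + order] == suffix
    ]


# Hypothetical pitch sequences (not taken from the tune in the figure).
history = ["D", "F#", "A", "G", "E"]   # earlier passage ("key"), ending on E
query = ["D", "F#", "A", "B"]          # similar context, last note differs

# No order from 1 to 4 produces an exact match, so nothing is retrieved.
no_match = [continuations(history, query, n) for n in range(1, 5)]

# An identical context, by contrast, retrieves the continuation E.
exact = continuations(history, ["F#", "A", "G"], 3)
```

In this sketch the query shares three of its four notes with the earlier passage, yet exact suffix matching retrieves nothing, which is the situation the caption describes.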

Table 1

Negative log-likelihood (NLL) and precision of the DCSTM and the CCSTM (our models) and of the baselines, categorized into short-term (STM) and long-term (LTM) models. The length of the respective temporal context is denoted by n; * denotes that the maximal context is used. EVT means that the performance was measured only at time steps where the pitch changes.

TYPE	NAME	N	NLL	PRECISION
STM	CCSTM-512	512	0.574	0.848
	CCSTM-32	32	0.733	0.783
	DCSTM-512	512	0.792	0.781
	MC-3	3	1.922	0.606
	PPM	*	1.387	0.798
	Repetition	1	2.724	0.606
LTM	WaveNet-512	512	0.502	0.849
	Transformer-512	512	0.370	0.887
	Transformer-32	32	0.852	0.718
EVT	CCSTM-512	512	1.237	0.682
EVT	IDyOM	*	1.870	0.426
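
As a rough sketch of how the two metrics in Table 1 could be computed, the following assumes that NLL is the mean negative natural-log probability assigned to the observed pitch, and that precision is the fraction of time steps at which the most probable pitch equals the observed one. Both definitions, the function name `nll_and_precision`, and the `event_only` flag (mimicking the EVT rows) are assumptions for illustration, not the authors' evaluation code.

```python
import math


def nll_and_precision(probs, targets, event_only=False):
    """probs[t] is a list of pitch probabilities at step t;
    targets[t] is the index of the observed pitch at step t."""
    steps = list(range(len(targets)))
    if event_only:
        # Assumed EVT setting: keep only steps where the pitch changes
        # with respect to the previous step.
        steps = [t for t in steps if t > 0 and targets[t] != targets[t - 1]]
    if not steps:
        raise ValueError("no time steps to evaluate")

    # Mean negative log-probability of the observed pitch.
    nll = -sum(math.log(probs[t][targets[t]]) for t in steps) / len(steps)

    # Fraction of steps where the argmax prediction is correct.
    hits = sum(
        1
        for t in steps
        if max(range(len(probs[t])), key=probs[t].__getitem__) == targets[t]
    )
    return nll, hits / len(steps)
```

Calling the function with `event_only=True` restricts the evaluation to pitch-change steps, where repeating the previous pitch is by construction never correct.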