Abstract
Hierarchical structures describing a syntax of harmony have long been studied and proposed by music theorists, but algorithms that model these structures either require costly expert annotations for training or are based on music theorists' predispositions about harmonic syntax. We build upon a line of work that models harmonic sequences with probabilistic context-free grammars (PCFGs), inspired by the well-known formalism for syntax in human language. By using neural networks for parameter sharing when estimating PCFG rule probabilities, we learn the grammar in an entirely unsupervised manner. Our model induces a harmonic syntax purely from data, with minimal bias, and with parse trees as latent variables, while simply maximizing the likelihood of training sequences. This frees us from the need, for the first time, both for expert-annotated harmonic syntax trees and for human-defined grammar rules. We propose improvements inspired by music theory, including chord symbol representations and a training objective that facilitates the inclusion of short and frequent chord progressions that are based on musical relations. Experiments show that our methods can model harmony in datasets of jazz pieces, often resulting in realistic parse trees that overlap with expert annotations, without access to these annotations during training at all. Code, models, and predictions are publicly available.1
