1 Introduction
Composer recognition is the task of predicting the composer of an unseen piece or fragment of music. Similar to other music classification tasks such as genre recognition, emotion/mood detection, and music tagging, a composer recognition system allows one to describe stylistic elements in a piece of music and is thus useful in music recommendation and organization of digital music collections. Unlike these other tasks, however, composer recognition characterizes the style in a self-referential manner, using similarities to known composers rather than subjective, human-generated labels. In an age where data is often the limiting factor in training powerful and expressive models, composer recognition offers several desirable characteristics as a music classification task: it has objective (rather than subjective) ground truth labels, it does not require expensive human labeling, and the task difficulty can be adjusted by including more or fewer composers. The task also offers a testbed for developing deeper musicological insights into the stylistic differences among composers, eras, and historical movements. Our goal in this work is to design a dataset that facilitates large-scale research on composer recognition, that is suitable for modern architectures and practices, and that presents a large database of sheet music images and metadata in a form that is accessible and easy to use.
While composer recognition has received a lot of interest in the last 20 years, previous work has been hindered by major data constraints. Most previous works have focused on symbolic representations of music like MIDI or **kern, where note and duration information are explicitly encoded. One significant drawback of using these representations is that they are less common than audio or sheet music and are thus limited to smaller-scale datasets of varying quality collected from various websites. The Lakh MIDI dataset (Raffel, 2016) has assembled a large amount of MIDI data from various websites but does not include metadata. The lack of large-scale datasets has become a major bottleneck in recent years, as models have become dependent on large amounts of data for pretraining and finetuning. Audio data is available in abundance but has been limited by copyright restrictions, which have hindered the systematic collection and organization of large, open datasets for composer recognition. Sheet music with high-quality metadata and permissive licenses is available in abundance on IMSLP1, but it struggles with a different problem: an inconvenient format. Raw sheet music images do not directly encode note information, and current optical music recognition (OMR) capabilities are unreliable on scanned sheet music of varying quality.
The GiantMIDI-Piano dataset (Kong et al., 2022) proposes one way to address these constraints: it downloads YouTube recordings of solo piano works, processes the recordings with an automatic music transcription (AMT) system, and releases only the estimated MIDI transcriptions. This has several benefits: it utilizes the fact that audio recordings are plentiful, avoids copyright issues by not releasing the audio, and provides the music in symbolic form. But it also has several drawbacks when used to study composer recognition: the note transcriptions are noisy estimates, the metadata is not entirely reliable (since it relies on a YouTube search based on composer name and piece title), and the audio conflates performance and compositional aspects.
This article proposes an alternative way to address these data constraints: it uses solo piano sheet music images from IMSLP, extracts a previously proposed feature representation called a bootleg score, and presents the features in a compact and convenient format (binary 2-dimensional images). The bootleg score (Yang et al., 2019) uses classical computer vision techniques to detect noteheads and encodes their locations relative to the staff lines. It can be thought of as a redacted, onset-only piano roll in which duration and accidental information have been discarded. This approach has several benefits: it utilizes the fact that classical sheet music is plentiful, it has rich and reliable metadata from IMSLP, and it provides the data in a symbolic format. Its drawbacks are that notehead detection and localization are noisy estimates, the bootleg score encodes only a selected set of musically relevant information, and the format is less commonly used. This approach can be thought of as a companion and a complement to the GiantMIDI-Piano dataset, where the focus is on creating a useful research dataset from sheet music images rather than from YouTube recordings.
The proposed dataset can be useful in a number of ways. First and foremost, it is designed to facilitate progress on composer recognition by providing the largest-scale benchmark to date. Even though it contains only selected information about notes, it allows us to study the problem at a larger scale with modern architectures and to gain a deeper understanding of musical aspects (e.g., texture, melodic contours, etc) that are distinctive about individual composers. Second, the simplicity of the dataset (which has 2-dimensional images patterned after MNIST) can enable rapid experimentation on questions of broader interest to the MIR community, such as designing effective music representations or data-augmentation strategies. Third, the rich (and reliable) metadata associated with the bootleg scores can serve as a foundation for multimodal research, linking together various data sources such as sheet music, audio recordings, MIDI files, composer and piece metadata, relevant Wikipedia pages, and All Music Guide descriptions (all of which are linked on IMSLP). For example, Yang and Tsai (2021a) demonstrate the feasibility of cross-modal transfer learning, in which a model is trained only on (sheet music) bootleg scores and uses them to perform composer classification of audio recordings. Fourth, the dataset can be used to link and de-anonymize data from other datasets. For example, Yang and Tsai (2021b) used bootleg scores to identify matches between IMSLP sheet music and the Lakh MIDI dataset (Raffel, 2016). Similar techniques could be used to verify and clean datasets, such as GiantMIDI-Piano (Kong et al., 2022), that have unreliable metadata. Fifth, the dataset can be used to study large-scale retrieval problems like piece or passage identification of audio, MIDI, or sheet music images (Yang et al., 2022).
The main contributions of this article are to describe, introduce, and motivate the Piano Bootleg Score Composer Recognition (PBSCR) dataset.2 This dataset was designed to facilitate research on composer recognition, with a focus on size, diversity, and ease of use. Our guiding principle was to design a composer recognition dataset that is “as accessible as MNIST and as challenging as ImageNet.” Thus, our goal was to design a dataset that presents a challenging task but is as compact, lightweight, and easy to work with and visualize as MNIST images. To maximize data quantity while ensuring data simplicity and ease of use, we consider piano sheet music images from IMSLP and use a previously proposed sheet music feature representation called a bootleg score (Yang et al., 2019) to encode the locations of noteheads relative to staff lines. The dataset consists of three parts: a labeled set of 40,000 piano bootleg score images for a 9-class composer recognition task, a labeled set of 100,000 piano bootleg score images for a 100-class composer recognition task, and a large unlabeled set of variable-length piano bootleg scores in IMSLP for self-supervised learning. For each labeled piano bootleg score, the dataset includes information to allow researchers to access the raw sheet music images from which the bootleg score fragment was taken. For both labeled and unlabeled data, we also scrape, organize, and compile metadata from IMSLP about each work. We release a set of baseline systems and results for researchers to compare against in future works. In addition, we discuss several research tasks and open research questions that the PBSCR dataset is especially well suited to study. This discussion lays out interesting directions and potential roadmaps for future work. The dataset and code for this project can be found at https://github.com/HMC-MIR/PBSCR.
2 Background
In this section we provide background about previous work and previous datasets used to study the composer recognition task.
2.1 Previous work
In this subsection we provide a brief overview of previous methods in composer recognition, and we describe how recent methods motivate a need for larger datasets. Previous work can be divided into two time periods: classical machine learning and deep learning.
Classical machine learning methods were the dominant approach between roughly 2000 and 2015. Most methods for composer classification from this era fall into one of two categories. The first category involves defining a set of manually designed features and then feeding the features into a standard classifier. Some examples of features include absolute and relative values of pitch and duration (Pape et al., 2008), global statistics on pitch intervals and durations (Goienetxea et al., 2018), chroma-based features (Anan et al., 2012), n-gram statistics on intervals and durations (Hajj et al., 2018), high-level musicological features like detecting 9–8 suspensions (Brinkman et al., 2016) or detecting sonata form (Kempfert and Wong, 2020), and using standardized feature sets (Herremans et al., 2016) such as the jSymbolic toolbox (McKay and Fujinaga, 2006). The second category of classical machine learning approaches involves training a sequence-based model. The two most common sequence-based models from this era are N-gram models (e.g., Hontanilla et al., 2013) and Markov chains (e.g., Hedges et al., 2014). In this approach, a sequence-based model is trained on each composer of interest, and test sequences are classified by selecting the model that has the highest likelihood.
Deep learning–based approaches have increasingly become the dominant paradigm since around 2015. Most methods in this era fall into one of two categories.
The first category represents the underlying music information as a continuous signal. Common representations include mel spectrogram and MFCC features for audio (e.g., Micchi, 2018; Kher, 2022); a piano roll–like matrix or a tensor specifying note events for symbolic music (e.g., Verma and Thickstun, 2019; Velarde et al., 2018); and 2-dimensional images for sheet music (Walwadkar et al., 2022). Given an input represented as a continuous signal, various neural network architectures have been explored, including Convolutional Neural Network (CNN) architectures (e.g., Deepaisarn et al., 2022; Kim et al., 2020; Walwadkar et al., 2022), Convolutional Recurrent Neural Networks (CRNNs) (Kong et al., 2020), and Long Short-Term Memory (LSTM) models (Kher, 2022; Micchi, 2018).
The second category represents the underlying music information as a sequence of discrete tokens. Some methods of forming discrete tokens from music data include: converting piano roll–like data into binary text (Takamoto et al., 2018) or a sequence of characters (Yang and Tsai, 2021a), considering (note, duration) tuples (Deepaisarn et al., 2023), or using a REMI (Huang and Yang, 2020) or compound word representation (Hsiao et al., 2021). Once the data is represented as a sequence of discrete tokens, a variety of Transformer architectures have been used to model the data (Chou et al., 2021; Li et al., 2023; Tsai and Ji, 2020; Yang et al., 2021). One benefit of this approach is the ability to pretrain models on unlabeled data in a self-supervised manner. These methods are data hungry and provide a strong incentive to construct benchmarks that contain large amounts of data.
2.2 Previous datasets
In this section we describe the landscape of datasets used to study the composer classification task. This landscape provides historical context for understanding the contributions of the PBSCR dataset.
Table 1 provides an overview of recent works on composer classification and the datasets used in these studies. From left to right, the columns indicate the paper (author and year published), number of composers in the classification task, original source data type (symbolic, audio, sheet music), preprocessed data format (i.e., after any data preprocessing), and dataset size. For dataset size, numbers in parentheses indicate unlabeled data for pretraining. Entries in the table have been arranged according to publication year, and the last entry in the table corresponds to the proposed PBSCR dataset.
Table 1
Overview of recent works on composer classification and the datasets used in these studies. The third column indicates whether the original source data is symbolic, audio, or sheet music images. The fourth column indicates the format of the data after any data format conversions or preprocessing. The fifth column indicates the size of the dataset, where numbers in parentheses indicate unlabeled files for pretraining. For papers that use multiple datasets, we have indicated only the largest.
| Paper | Composers | Original Source Data Type | Preprocessed Data Format | Data Size |
|---|---|---|---|---|
| Wołkowicz and Kešelj (2013) | 5 | symbolic | MIDI | 251 pieces |
| Hontanilla et al. (2013) | 5 | symbolic | MIDI | 274 movements |
| Herlands et al. (2014) | 2 | symbolic | MIDI | 74 movements |
| Hedges et al. (2014) | 9 | symbolic | chords | 5700 lead sheets |
| Herremans et al. (2015) | 3 | symbolic | MIDI | 1045 pieces |
| Saboo et al. (2015) | 2 | symbolic | museData, kern | 366 pieces |
| Brinkman et al. (2016) | 6 | symbolic | no info | no info |
| Velarde et al. (2016) | 2 | symbolic | kern | 107 movements |
| Herremans et al. (2016) | 3 | symbolic | MIDI | 1045 movements |
| Shuvaev et al. (2017) | 31 | audio | audio | 62 hrs |
| Sadeghian et al. (2017) | 3 | symbolic | MIDI | 417 sonatas |
| Takamoto et al. (2018) | 5 | symbolic | MIDI | 75 pieces |
| Hajj et al. (2018) | 9 | symbolic | MIDI | 1197 pieces |
| Velarde et al. (2018) | 5 | symbolic | MIDI, audio (synthesized) | 207 movements |
| Micchi (2018) | 6 | audio | audio | 320 recordings |
| Goienetxea et al. (2018) | 5 | symbolic | kern | 1586 pieces |
| Verma and Thickstun (2019) | 19 | symbolic | kern | 2500 pieces |
| Costa and Salazar (2019) | 3 | symbolic | no info | 10 pieces |
| Kim et al. (2020) | 13 | symbolic | MIDI | 505 pieces |
| Kong et al. (2020) | 100 | audio | MIDI (transcribed) | 10854 pieces |
| Revathi et al. (2020) | 4 | audio | audio | 40 pieces |
| Kempfert and Wong (2020) | 2 | symbolic | kern | 285 movements |
| Chou et al. (2021) | 8 | symbolic | MIDI | 411 pieces |
| Yang and Tsai (2021a) | 9 | sheet music | bootleg score | 787 works (29310 works) |
| Walwadkar et al. (2022) | 9 | sheet music | image, bootleg score | 32k images |
| Deepaisarn et al. (2022) | 5 | symbolic | MIDI | 809 pieces |
| Kher (2022) | 11 | symbolic | MIDI, audio (synthesized) | 110 pieces |
| Foscarin et al. (2022) | 13 | symbolic | MIDI | 667 pieces |
| Li et al. (2023) | 8 | audio | MIDI (transcribed) | 411 pieces |
| Deepaisarn et al. (2023) | 5 | symbolic | MIDI | 809 pieces |
| Simonetta et al. (2023) | 7 | symbolic | MIDI | 211 pieces |
| Zhang et al. (2023) | 9 | symbolic | MIDI, MusicXML | 415 scores |
| PBSCR | 100 | sheet music | bootleg score | 4997 works (29310 works) |
There are three things to notice about the landscape of previous datasets described in Table 1. First, most previous works consider a small number of composers. For example, only 6 out of the 32 previous works shown in Table 1 consider more than 10 composers. Second, most previous works consider a relatively small amount of data, by modern standards. It is difficult to compare dataset sizes directly since previous works report sizes in different ways, including number of movements/pieces, total audio duration, and number of sheet music images. Nonetheless, by comparing entries by the number of pieces in the labeled dataset (the most common metric), we can see that most works consider on the order of hundreds of pieces. Third, the vast majority of previous works focuses on symbolic music formats. As a practical matter, the choice to use symbolic music as source data limits the size and the diversity of datasets, given that symbolic music data is less plentiful than sheet music or audio.
It is useful to point out how the PBSCR dataset fits into this data landscape. It is distinctive in three ways. First, it has the highest number of composers (100, tied with Kong et al., 2020). As mentioned above, this is much higher than most previous works, so it poses a more challenging classification task. Second, the PBSCR dataset is among the largest in size. In particular, it is one of the only datasets in Table 1 that comes with a large unlabeled dataset for pretraining. Given the shift in recent years toward pretraining models in a self-supervised manner, this unlabeled data provides an essential resource for supporting the development of competitive models. Based on the number of works in both the labeled (4997) and the unlabeled (29,310) datasets, the PBSCR dataset is almost certainly the largest in terms of total dataset size. Third, it is one of only a few works (Tsai and Ji, 2020; Walwadkar et al., 2022; Yang and Tsai, 2021a) that uses sheet music images as source data. By using a bootleg score representation, the PBSCR dataset maintains the advantage of plentiful sheet music data (on IMSLP) while presenting the data in an extremely compact and simple form (binary 2-dimensional images).
Given this context, we can reasonably make the following claim: the PBSCR dataset presents the most challenging classification task (based on the number of composer classes), has the largest and most diverse set of data available (based on the number of pieces and composers), and has the simplest and most accessible data format (2-dimensional binary images).
It is useful to note that the PBSCR dataset has a very different philosophy than that of most previous works in composer recognition. Whereas previous approaches require full symbolic music information and accept the consequence of limited data size and diversity, the PBSCR dataset requires that the dataset be large, open, diverse, and easy to work with, and accepts the consequence of a noisy, selective feature representation. By using the bootleg score representation, we construct a dataset that is as easy to work with as MNIST data and can facilitate rapid exploration and iteration.
3 Dataset Preparation: Unlabeled Data
The PBSCR Dataset consists of three parts: a large set of unlabeled piano bootleg scores for pretraining (Section 3.1), a set of labeled data for a 100-class composer recognition task (Section 4.1), and a set of labeled data for a 9-class composer recognition task (Section 4.2). In this section, we describe the preparation of the unlabeled dataset, which we refer to as the IMSLP Piano Bootleg Scores Data (v1.1). The labeled datasets will be described in Section 4. The novel contributions of the v1.1 dataset are (a) identifying and removing non-music filler pages from the v1.0 dataset (Section 3.2) and (b) scraping, organizing, and including metadata from IMSLP on all works in order to facilitate multimodal research and allow for convenient linking to other datasets (Section 3.3).
3.1 IMSLP piano bootleg scores v1.1
The IMSLP piano bootleg scores repository (v1.0) was introduced in Yang and Tsai (2020) for a sheet music identification task and was first used for composer classification in Tsai and Ji (2020). This repository contains the bootleg scores for all solo piano works in IMSLP. Below, we describe the history of its construction as well as the steps that were taken to clean up the data (v1.1).
The first step in creating this repository was to download IMSLP sheet music. The IMSLP website provides a list of composers, a list of works for each composer, and a webpage for each work that contains audio recordings, sheet music, and metadata. For each composer, we iterated through all of their works and downloaded all PDF sheet music scores and associated metadata. The scraping and downloading took more than a month to complete and resulted in a set of 420,271 PDF files from 164,248 works that was 1.2 terabytes in size.
The second step was to filter the full dataset by instrumentation tag label in order to compile a list of solo piano works. After filtering, the dataset contained 29,310 works, 31,384 PDFs, and 374,758 individual pages. Note that a work may contain several PDF versions on the IMSLP website. All of the remaining steps were applied only to this filtered dataset.
The third step was to convert each PDF into a sequence of PNG images. We performed the decoding at 300 dpi and then resized the image to have a fixed width of 2550 pixels while preserving the aspect ratio. This resizing step was necessary to appropriately handle the extremely large range of image sizes in IMSLP.
The fourth step was to compute the bootleg score representation from each PNG image (i.e., page of sheet music). The bootleg score (Tsai et al., 2020; Yang et al., 2019) is a mid-level feature representation that encodes the position of filled noteheads relative to staff lines in the sheet music while ignoring many other aspects of the sheet music such as note duration, accidentals, rests, time signatures, clef and octave markings, and non-filled noteheads. The feature extraction process uses classical computer vision techniques to detect notehead and staff line locations, so it is a noisy estimation that contains errors. More details can be found in Yang et al. (2019). Figure 1 shows two examples of a piano sheet music excerpt and its corresponding bootleg score. Note that staff lines are not encoded in the bootleg score representation itself but have been overlaid in Figure 1 as a visual aid. The bootleg score for each page of sheet music is a 62 × N binary matrix, where 62 indicates the total number of different staff line positions in both the left- and the right-hand staves and where N indicates the number of detected simultaneous notehead events on the page. We will refer to each column of the bootleg score as a bootleg score event. So, for example, in the bottom example in Figure 1, the first bootleg score event is a 62-length array containing 60 zeros and two "1" entries corresponding to the noteheads at D4 and A2.

Figure 1
Two examples of a piano sheet music excerpt (left) and corresponding bootleg score representation (right). Staff lines are not encoded in the bootleg score representation itself, but they are overlaid in the examples above as a visual reference.
It is worth mentioning a few practical details at this point. First, each 62-bit bootleg score column is encoded as a single 64-bit integer so that the bootleg scores are compactly represented as a list of integers. Each page of sheet music is thus reduced to a list of 64-bit integers that compactly encode the bootleg score events on the page. Since a PDF consists of multiple pages, we store the features for each page in a separate list to keep track of which page each feature comes from. This representation makes it possible to store a dataset that is inconveniently large in PNG format very compactly in memory (0.5 GB). Second, the dataset is structured as a file hierarchy that is separated first by composer and then by work, which ensures a clean separation among different partitions. Third, the resulting repository after the fourth step above is the IMSLP piano bootleg scores data v1.0, which was originally presented in Yang and Tsai (2020). The fifth and sixth steps (below) describe the improvements in the newly released v1.1 dataset.
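To make the packing scheme concrete, the following sketch shows one way a 62-element binary column could be packed into and recovered from a single integer. This is our own illustration (the function names and bit ordering are assumptions), not code taken from the PBSCR repository.

```python
import numpy as np

def pack_column(column):
    """Pack a 62-element binary column (staff-line positions) into one integer.

    Bit i of the returned integer is set when position i of the column is 1.
    """
    assert len(column) == 62
    value = 0
    for i, bit in enumerate(column):
        if bit:
            value |= 1 << i
    return value

def unpack_column(value):
    """Recover the 62-element binary column from its packed integer form."""
    return np.array([(value >> i) & 1 for i in range(62)], dtype=np.uint8)

# Example: a column with noteheads at two staff-line positions.
col = np.zeros(62, dtype=np.uint8)
col[[10, 35]] = 1                       # two arbitrary positions
packed = pack_column(col)               # a single integer (fits in 64 bits)
assert np.array_equal(unpack_column(packed), col)
```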
The fifth step is to filter out non-music pages from the bootleg score repository. One of the problems with the original repository is that many PNG images are not sheet music—they may be title pages, blank pages, forewords, a table of contents, etc. In the original v1.0 repository, a bootleg score was computed for every single PNG image without any consideration of the contents in the image. This results in a non-trivial amount of gibberish bootleg score data that has been extracted from non-music images. Figure 2 shows some examples of non-music filler pages (left) and their corresponding gibberish bootleg scores (right). In order to identify non-music pages, we trained a Transformer-based model to identify gibberish bootleg scores. The process for training this model is described in detail in Section 3.2. In the revised v1.1 IMSLP piano bootleg scores repository, the bootleg scores for (predicted) non-music pages have been identified and removed.

Figure 2
Examples of non-music filler pages and their extracted (gibberish) bootleg scores.
The sixth step is to scrape, organize, and include metadata that is available on IMSLP for each work in the unlabeled dataset. This process is described in Section 3.3, and this metadata is included in the v1.1 IMSLP piano bootleg scores repository to facilitate multimodal research and to allow for convenient linking to other datasets.
3.2 Identifying non-music pages
In this subsection, we describe the process of identifying non-music pages by training a Transformer-based model on bootleg score fragments.
The first step is to label a set of music and non-music pages. This was done in the following manner. First, we took the original 9-class dataset proposed in Tsai and Ji (2020) in which each page had been manually labeled as music or non-music. We manually relabeled these pages as one of three categories: music, non-music, or mixture. The “mixture” category contains pages that have both sheet music and text, as is often seen in a table of contents (e.g., showing excerpts of pieces) or foreword. We ultimately decided to exclude the mixture pages and include only pure music and pure non-music pages for training our classifier. In total, there were 5938 music pages and 259 non-music pages. We divided these pages into training and validation partitions using a 60–40 split.
The second step is to sample bootleg score fragments. Figure 3 shows a histogram of the number of bootleg score events in music pages (top) and nonmusic pages (bottom). We can see that many filler (non-music) pages have a very small number of bootleg score events, so our classifier will need to handle short bootleg score fragments. Accordingly, we decided to train our model on bootleg score fragments of length 16. We densely sampled bootleg score fragments from the non-music pages by sampling 16-length fragments with overlap. This resulted in a total of 2799 non-music bootleg score fragments (1689 train, 1110 validation). To maintain a balanced dataset, we randomly sampled the same number of fragments from the music pages. This sampling was done by randomly sampling a work (PDF) from the train/validation partition, randomly sampling a music page from the PDF, and then randomly sampling a length-16 fragment from the page’s bootleg score. At the end of this step, we have a labeled dataset of 3378 training bootleg score fragments (1689 filler, 1689 non-filler) and 2220 validation bootleg score fragments (1110 filler, 1110 non-filler).

Figure 3
Histogram of the number of bootleg score events in a set of manually labeled music pages (top) and non-music pages (bottom).
The third step is to train a music vs. non-music fragment classifier. We adopted a similar approach as in Tsai and Ji (2020), which we describe here for completeness. We encode each 62-bit bootleg score column as a sequence of eight 8-bit characters and learn a subword vocabulary using Byte Pair Encoding (Gage, 1994). Using this BPE tokenizer, we pretrain a GPT-2 language model on the entirety of the IMSLP piano bootleg scores repository (v1.0). It is worth noting that this pretrained language model was originally used for composer classification and here we simply fine-tuned it on a different downstream task. Next, we add a classification head with two output classes (music vs. non-music) and fine-tune it on the labeled dataset of music and non-music fragments. In this way, our classifier is trained to classify 16-length bootleg score fragments as music or non-music.
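The character encoding that precedes BPE can be sketched as follows. This is a rough illustration under our own assumptions (the byte ordering and character mapping used in the actual system may differ).

```python
def column_to_chars(packed_column):
    """Split a packed 62-bit bootleg score column into eight 8-bit characters.

    The integer is broken into eight bytes, and each byte is mapped to a single
    character so that a page becomes a plain string suitable for BPE training.
    """
    chars = []
    for i in range(8):
        byte = (packed_column >> (8 * i)) & 0xFF
        chars.append(chr(byte))          # one character per byte
    return "".join(chars)

def page_to_string(packed_columns):
    """Convert a page (list of packed columns) into one string for tokenization."""
    return "".join(column_to_chars(c) for c in packed_columns)
```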
We apply our classifier model to full pages in the following manner: we first extract a bootleg score representation from the page. If the resulting bootleg score has a length smaller than 64, it is automatically classified as non-music. (Note from Figure 3 that very few music pages have bootleg score lengths smaller than 64.) Otherwise, fragments of length 16 are densely sampled from the bootleg score with overlap, and each fragment is passed through our classifier model. We average the outputs of each fragment prediction to get an ensemble prediction for the entire page.
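A minimal sketch of this page-level decision rule is shown below; the helper names, hop size, and fragment classifier are hypothetical.

```python
def classify_page(bootleg_score, fragment_classifier, frag_len=16, hop=8):
    """Return the page-level probability of being a non-music (filler) page.

    Pages with fewer than 64 bootleg score events are treated as non-music.
    Otherwise, overlapping fragments are scored and their predictions averaged.
    """
    if len(bootleg_score) < 64:
        return 1.0                                    # too short: assume filler
    probs = []
    for start in range(0, len(bootleg_score) - frag_len + 1, hop):
        fragment = bootleg_score[start:start + frag_len]
        probs.append(fragment_classifier(fragment))   # P(non-music | fragment)
    return sum(probs) / len(probs)
```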
Figure 4 shows a histogram of predicted probabilities on the validation pages, where a higher probability corresponds to a non-music page. We can see that there is a fairly clean separation between the music and the non-music data. We set a very conservative threshold of 0.5, which ensures that non-music pages will be excluded from the data with high confidence (and sometimes music data will be excluded as well, which we are okay with). With this threshold value, we achieve a precision of 0.85 and a recall of 0.98 on the validation pages. Because we care more about ensuring that non-music pages are excluded, the recall of 0.98 is the more important metric. We use this ensemble classifier to identify and remove non-music pages from the IMSLP piano bootleg score repository.

Figure 4
Predicted probability of an ensemble classifier that classifies validation pages as filler (non-music) vs. non-filler. We use a hard threshold of 0.5 to ensure that filler pages are excluded from our dataset with high confidence.
3.3 Adding metadata from IMSLP
In addition to cleaning up the IMSLP piano bootleg scores repository, we also collected and added metadata for these works. This process is described next.
The metadata is scraped from the IMSLP website. For each work, IMSLP has a webpage that contains links to audio performances and various sheet music editions. The webpage also contains a set of metadata for each composition that may include attributes such as the work’s title, composer, opus/catalogue number, key, year/date of composition, composer time period, instrumentation, movements/sections, alternative title, dedication, first publication, etc. We scraped the composition webpages to extract these metadata attributes and stored the metadata in a file on our github repository.
This metadata is valuable for two reasons. First, it provides a rich set of information that could be used to study many tasks besides composer classification. Second, it allows for convenient linking to other datasets. As one concrete example, we used simple string matching based on composer name and work title attributes to link the bootleg scores in the PBSCR dataset to corresponding files in the GiantMIDI-Piano dataset (Kong et al., 2020). A file containing 7413 matches is included in our github repository. The provided metadata can similarly be used to link the PBSCR dataset to other datasets.
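As an illustration of this kind of linking, the sketch below matches entries across two datasets by normalized composer and title strings; the field names are hypothetical, and the actual matching code may handle more edge cases.

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for robust string comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def link_datasets(imslp_entries, other_entries):
    """Match entries across datasets by normalized (composer, title) pairs.

    Each entry is assumed to be a dict with 'composer' and 'title' fields.
    """
    index = {(normalize(e["composer"]), normalize(e["title"])): e for e in other_entries}
    matches = []
    for entry in imslp_entries:
        key = (normalize(entry["composer"]), normalize(entry["title"]))
        if key in index:
            matches.append((entry, index[key]))
    return matches
```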
4 Dataset Preparation: Labeled Data
In this section, we describe the preparation of the labeled (100-class, 9-class) PBSCR data. The 100-class (Section 4.1) and 9-class (Section 4.2) data provide labeled bootleg score fragments to train and evaluate models for the composer recognition task. These datasets have been designed to make the data as accessible and easy to use as MNIST in order to enable rapid iteration and experimentation. Compared to the labeled dataset used in Tsai and Ji (2020), the novel contributions are expanding the number of composer classes from 9 to 100 (Section 4.1) and offering a discussion of known data leakage issues (Section 4.3). The 9-class labeled dataset is included to enable experimentation on tasks of varying difficulty and to allow for historical comparisons to previous work.
4.1 100-Class labeled data
The preparation of the 100-class labeled dataset consists of four steps, which are described below. At a high level, it consists of 100,000 bootleg score fragments (70k train, 15k validation, 15k test) that are balanced across 100 different classical composers.
The first step is to identify a list of 100 composers to include. These 100 composers are selected as a subset from the IMSLP Piano Bootleg Scores Data described in Section 3.1. We first ranked all composers by the total amount of bootleg score data they have available on IMSLP. We then manually reviewed the ordered list of composers and selected the top 100, being sure to remove those who are not primarily composers (e.g., some people on the list were primarily arrangers and editors). The full list of 100 composers can be found at https://github.com/HMC-MIR/PBSCR/blob/main/100_class_list.txt.
The second step is to select a set of sheet music PDFs for each composer. Each work in IMSLP may have multiple PDFs associated with it that correspond to different publishers or editions. Because popular works tend to have a large number of sheet music versions, we select one representative PDF per work in order to avoid over-representing a small number of works. In order to maximize the amount of data available to us, we simply selected the PDF that had the highest number of total bootleg score events. Figure 5 shows the total number of works available for these top 100 composers (top), along with the total number of bootleg score events extracted from each composer’s piano sheet music (bottom). In total, there are 4997 works and 70,440 sheet music pages.

Figure 5
(Top) The total number of pieces/works available on IMSLP for the composers in the 100-class dataset. (Bottom) The total number of bootleg score events for each composer in the 100-class dataset. The list of composers sorted by number of works can be found at https://github.com/HMC-MIR/PBSCR/blob/main/forPaper/composers_sorted_numpieces.txt.
The third step is to identify non-music pages in the selected set of PDFs. We used a Transformer-based model to identify filler pages, as described in Section 3.2. In the 100-class labeled data, there are a predicted 64,129 (out of 70,440) pages with music content containing 12.1 million bootleg score events.
The fourth step is to sample bootleg score fragments from each composer. This sampling serves two purposes: it allows us to achieve class balance among the fragments even though the number of works per composer is different, and it standardizes the size of each labeled sample (a 62 × 64 binary image) in order to achieve the data simplicity of MNIST images. This sampling is done in the following manner. First, we divide the works into training, validation, and test sets, using a split of 70%, 15%, and 15%, respectively. Next, we decided on the total number of bootleg score fragments to sample from each partition. For the 100-class data, we have 70,000 train fragments, 15,000 validation fragments, and 15,000 test fragments, resulting in a total of 100,000 examples. Based on these numbers, we calculated how many fragments per composer need to be sampled in order to achieve class balance. Each fragment is drawn by randomly selecting a work by a given composer and then by randomly selecting a 64-length fragment from the bootleg score. Our sampling process guarantees that our classes are perfectly balanced, and it gives equal weight to all piano works (in IMSLP) that a composer has composed.
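The sampling procedure can be summarized with the following simplified sketch; the variable names are hypothetical, and this is not the exact code used to build the dataset.

```python
import random

def sample_fragment(works_by_composer, composer, frag_len=64):
    """Sample one fixed-length bootleg score fragment for the given composer.

    works_by_composer maps a composer name to a list of bootleg scores, where
    each bootleg score is the concatenated event sequence of one work (PDF).
    """
    work = random.choice(works_by_composer[composer])       # equal weight per work
    start = random.randrange(0, len(work) - frag_len + 1)    # random 64-length window
    return work[start:start + frag_len]

def build_partition(works_by_composer, n_total):
    """Draw a class-balanced set of (fragment, composer) pairs across all composers."""
    composers = sorted(works_by_composer)
    per_composer = n_total // len(composers)
    return [(sample_fragment(works_by_composer, c), c)
            for c in composers for _ in range(per_composer)]
```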
Figure 6 shows a set of example bootleg score images for nine selected composers (those in the 9-class dataset). The staff lines for the left hand and the right hand are not present in the bootleg score representation itself, but they have been overlaid for ease of reference. Even without any information about note durations, key or time signature, or accidentals, one can immediately see some recognizable features: the Bach example has fugue-like texture and movement. The Beethoven example has an alternating octave in the right hand, which is not common in Bach’s music. The Mozart example has scale-like runs in the right hand, with an Alberti bass-like left-hand accompaniment. The Classical and Baroque composers (Bach, Mozart, Haydn) have thinner textures compared to those of the Romantic-era composers. These examples show that even with the minimal bootleg score representation, many aspects of compositional style are preserved.

Figure 6
Example bootleg score images from the labeled 9-class PBSCR data. Staff lines have been overlaid for ease of interpretation.
The labeled 100-class dataset is formatted in a way that resembles the MNIST dataset. Each dataset consists of the following six arrays:
a binary tensor specifying the 70,000 training bootleg score fragments
a 70,000-length array specifying the training composer class indices
a binary tensor specifying the 15,000 validation bootleg score fragments
a 15,000-length array specifying the validation composer class indices
a binary tensor specifying the 15,000 test bootleg score fragments
a 15,000-length array specifying the test composer class indices
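Because each fragment is a small binary image, it can be inspected visually much like an MNIST digit. The sketch below assumes fragments are 62 × 64 NumPy arrays with the pitch axis first; the actual array layout in the released files may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_fragment(fragment):
    """Display a single bootleg score fragment (assumed 62 x 64 binary array)."""
    plt.imshow(fragment, origin="lower", cmap="gray_r", aspect="auto")
    plt.xlabel("bootleg score event")
    plt.ylabel("staff line position")
    plt.show()

# Example with a random fragment standing in for real data.
show_fragment(np.random.randint(0, 2, size=(62, 64)))
```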
In addition, we also provide relevant metadata on all train, validation, and test fragments. This metadata includes a unique identifier that specifies the PDF from IMSLP from which the fragment was taken as well as the page and offset in the bootleg score from which the fragment was sampled. This information allows researchers to access the complete, unabridged bootleg scores to study the effect of longer-term structure or to access the original, raw sheet music image data for visualization and deeper understanding.
The 100-class dataset poses a much more challenging classification task than does the 9-class recognition task in Tsai and Ji (2020). One may notice that the shape and the size of the data are similar to MNIST, in keeping with our principle of constructing a dataset that is “as accessible as MNIST and as challenging as ImageNet.” Each composer has only 700 training examples, so the task is difficult both because of the large number of composers and because of the relative scarcity of labeled data. For these reasons, we believe this dataset will push the boundaries of composer recognition to a higher level.
4.2 9-Class labeled data
The 9-class labeled dataset contains bootleg score fragments for nine classical composers: Bach, Beethoven, Chopin, Haydn, Liszt, Mozart, Schubert, Schumann, and Scriabin. The dataset is constructed in the same way as the 100-class labeled dataset but with one difference: the list of composers was adopted from Tsai and Ji (2020) (rather than selected based on data availability) to enable historical comparisons. In total, the 9-class dataset consists of 28,000 train fragments, 6000 validation fragments, and 6000 test fragments, resulting in a total of 40,000 examples that are balanced across composers. Table 2 shows the number of works, the total number of pages, the number of (predicted) music pages, and the number of bootleg score events for each composer in the 9-class labeled dataset. There are 896 PDFs, 10,305 pages with music content, and 2.2 million bootleg score events. The purpose of providing both 100-class and 9-class datasets is to enable experimentation at varying levels of task difficulty.
Table 2
Overview of the raw sheet music data from which the 9-class PBSCR data was constructed. Cumulative counts for the 100-class data are also shown at the bottom.
| Composer | # Works | # Pages (all/music) | # Bootleg Features |
|---|---|---|---|
| Bach | 226 | | 424,948 |
| Beethoven | 86 | | 272,374 |
| Chopin | 89 | | 205,513 |
| Haydn | 51 | 50/50 | 12,408 |
| Liszt | 179 | 3405/3170 | 575,367 |
| Mozart | 61 | 702/673 | 174,355 |
| Schubert | 88 | 836/836 | 206,103 |
| Schumann | 40 | 981/919 | 206,379 |
| Scriabin | 76 | 879/825 | 135,851 |
| 9-class | 896 | 10,945/10,305 | 2,213,298 |
| 100-class | 4997 | 70,440/64,129 | 12,108,749 |
4.3 Data leakage
In this section we discuss known data leakage issues with the 9-class and the 100-class labeled datasets.
It is impossible to get a perfectly clean train/test split with IMSLP data for many reasons: composers often re-write pieces or re-use themes in later works (and each may be listed with a separate opus number); composers sometimes create alternate versions of pieces; some composers do transcriptions or arrangements of other composers’ works (which makes the ground truth label ambiguous); some works are partially composed by the composer and completed by others after the composer’s death; and some works have uncertain or incorrect authorship (e.g., the “Valse Melancolique” in F-sharp minor was incorrectly attributed to Chopin and listed among his works).
We discovered an additional source of data leakage late in the review process: idiosyncrasies in IMSLP’s organization. Specifically, there are instances in which one piece appears on two different IMSLP webpages: once as an individual composition and once as part of a collection (e.g., individual preludes and fugues in Bach’s The Well-Tempered Clavier). To quantify how often this happens, we manually checked all works in the 9-class dataset and found that 11 of the 896 works were collections exhibiting this issue (9 Bach, 1 Scriabin, 1 Liszt). Of these 11 collections, we noticed that the IMSLP webpages for 8 of them had a warning dialog box at the top of the webpage that indicated the possibility of duplicate entries. The 100-class dataset was too large to check manually, so we used an automated approach to iterate through all of the 4997 works and detect the warning dialog box mentioned above. We found only two more collections with this issue (1 Mendelssohn, 1 Handel). Thus, we found 13 works/collections across five composers in the 100-class dataset that exhibit this issue. This automated approach most likely failed to catch some additional instances, but it provides a ballpark estimate of how common this phenomenon is. A list of these collections can be found on our github repository.
The above sources of data leakage mean that our reported accuracy numbers are likely inflated. However, as long as other researchers use the same dataset for comparison, the benchmark can serve its purpose of tracking progress in the composer classification task. Also, given how low the accuracies are for the 100-way classification (e.g., the best top-1 accuracy in Table 4 is 13.9%), the accuracy inflation due to train/test leakage still leaves an enormous amount of room for improvement and is likely only a minor factor in overall performance.
5 Research Tasks
In this section, we describe several research tasks that could be studied with the PBSCR dataset. We also provide baseline results using standard techniques for future researchers to compare against.
5.1 Supervised composer recognition
The most obvious task is a supervised composer recognition task. Here, the goal is to classify a bootleg score fragment according to its composer class. In addition to a labeled set of training pairs, unlabeled data from the IMSLP Piano Bootleg Scores v1.1 dataset (Section 3.1) is available for pretraining.
There are several metrics of performance that might be appropriate in this scenario. Following the convention in ImageNet, one useful metric of performance is top-k accuracy, which indicates the percentage of queries that have the correct composer among the k highest-ranked composers. Another useful metric is mean reciprocal rank (MRR), which is calculated as MRR = (1/N) * sum_{i=1}^{N} 1/r_i, where r_i indicates the rank of the true composer for the i-th of N queries. MRR ranges between 0 and 1, and a higher MRR is better. MRR offers more nuanced information about the rank of the true composer than does the strict binary threshold of a top-k accuracy metric. We recommend reporting results with several of these metrics since the most appropriate metric may depend on the difficulty of the task.
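For concreteness, both metrics can be computed as in the following sketch, assuming each query produces a vector of composer scores.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of queries whose true composer is among the k highest-scoring classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])

def mean_reciprocal_rank(scores, labels):
    """Average of 1/rank of the true composer, where rank 1 is the top prediction."""
    order = np.argsort(-scores, axis=1)
    ranks = np.array([np.where(order[i] == labels[i])[0][0] + 1
                      for i in range(len(labels))])
    return np.mean(1.0 / ranks)
```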
Tables 3 and 4 show the performance of three different baseline systems on the 9-class and 100-class recognition tasks, respectively. The first baseline system is a CNN model with two convolutional layers, followed by global average pooling across time, and then a final output linear classification layer. This model is based on the architecture proposed by Verma and Thickstun (2019) but is adapted to a bootleg score representation (instead of MIDI). The second baseline system is a GPT-2 model (Radford et al., 2019) that is trained in the same manner as in Section 3.2: each bootleg score column is represented as a sequence of 8-bit characters, a subword vocabulary is learned using Byte Pair Encoding, a small 6-layer GPT-2 language model is trained on unlabeled bootleg scores in IMSLP, and the model is fine-tuned on the labeled data. To tease apart the effects of pretraining and fine-tuning, we report results of the GPT-2 model using three different training conditions: (1) training the model from scratch on the labeled data without pretraining (“GPT-2 (no pretrain)”), (2) pretraining the language model and learning a linear probe (“GPT-2 (LP)”), and (3) pretraining the language model, learning a linear probe, and then unfreezing and fine-tuning the whole model (“GPT-2 (LP-FT)”). For (3), we followed the recommended practices in Kumar et al. (2022), which were shown to have good generalization to out-of-distribution data. The third baseline system is a RoBERTa model (Liu et al., 2019) with six Transformer encoder layers. This model is pretrained using a masked language modeling task but is otherwise trained in a similar manner as GPT-2. We report results of the RoBERTa model using the same three training conditions described above. These three model architectures were previously explored in Tsai and Ji (2020) and in Yang and Tsai (2021a) on a 9-class composer recognition task, and here we present results on the (new) 9-class and 100-class PBSCR benchmarks. These results are intended to serve as baselines that future approaches can be compared to.
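To give a sense of the CNN baseline's structure, the sketch below shows a model of this general shape in PyTorch; the layer sizes and kernel shapes are assumptions made for illustration rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class BootlegCNN(nn.Module):
    """Small CNN over a 62 x T binary bootleg score image (assumed layer sizes)."""

    def __init__(self, n_classes=100, n_filters=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_filters, kernel_size=(62, 3), padding=(0, 1)),  # spans full pitch range
            nn.ReLU(),
            nn.Conv2d(n_filters, n_filters, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(n_filters, n_classes)

    def forward(self, x):               # x: (batch, 1, 62, T)
        h = self.conv(x)                # (batch, n_filters, 1, T)
        h = h.mean(dim=(2, 3))          # global average pooling across time
        return self.classifier(h)       # composer logits
```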
Table 3
Baseline results for the 9-class PBSCR task. Results are shown for top-1 accuracy (%) and mean reciprocal rank.
| System | Top 1 | MRR |
|---|---|---|
| CNN | 40.0 | 0.593 |
| GPT-2 (LP-FT) | 49.6 | 0.670 |
| GPT-2 (LP) | 42.5 | 0.613 |
| GPT-2 (no pretrain) | 25.0 | 0.466 |
| RoBERTa (LP-FT) | 44.4 | 0.631 |
| RoBERTa (LP) | 38.0 | 0.581 |
| RoBERTa (no pretrain) | 19.2 | 0.407 |
Table 4
Baseline results for the 100-class PBSCR task. Results are shown for top-1, top-5, and top-10 accuracy (%).
| System | Top 1 | Top 5 | Top 10 |
|---|---|---|---|
| CNN | 7.4 | 21.3 | 32.4 |
| GPT-2 (LP-FT) | 13.9 | 34.8 | 49.0 |
| GPT-2 (LP) | 10.4 | 28.5 | 42.8 |
| GPT-2 (no pretrain) | 3.2 | 11.6 | 20.4 |
| RoBERTa (LP-FT) | 10.6 | 29.0 | 42.0 |
| RoBERTa (LP) | 7.5 | 22.9 | 35.0 |
| RoBERTa (no pretrain) | 2.1 | 8.1 | 15.0 |
There are two things to notice about the baseline results in Tables 3 and 4. First, the GPT-2 model has the best performance among the three models on both the 9-class and the 100-class recognition tasks. We can see that pretraining on the unlabeled IMSLP data makes a big difference, improving top-1 accuracy on the 9-class recognition task from 25.0% to 49.6% and improving top-5 accuracy on the 100-class recognition task from 11.6% to 34.8%. This underscores the importance of having a large, diverse set of data for pretraining. We also see that full model fine-tuning makes a big difference, improving top-1 accuracy on the 9-class recognition task from 42.5% to 49.6% and improving top-5 accuracy on the 100-class recognition task from 28.5% to 34.8%. Second, there is a lot of room for improvement. The best GPT-2 model achieves a top-5 accuracy of only 34.8% on the 100-class recognition task, showing that there is a massive amount of room for improvement.
Our hope is that this dataset can spur progress on this challenging task.
5.2 1-Shot and low-shot composer recognition
An interesting modification to the above problems is to study few-shot composer recognition. The problem setup is the same as before, but the number of training examples per composer is artificially limited to N. We recommend the following tasks: (a) a 9-class recognition task with N ∈ {1, 10, 100} and (b) a 100-class recognition task with N ∈ {1, 10, 100}. This set of tasks encourages the development of approaches that are data-efficient (with labeled data), and it allows one to study the effect of the number of training examples as well as the generalizability of model representations.
We consider three baseline models for the few-shot tasks. The first system is a GPT-2 model that is trained by: (a) pretraining a 6-layer GPT-2 language model on the unlabeled IMSLP bootleg score data, as described in Section 5.1; (b) using the penultimate activations of the language model as a feature representation for the training samples; (c) identifying, for each composer, the k nearest neighbors that are closest in Euclidean distance to a given test query; and (d) rank-ordering the composers by average Euclidean nearest neighbor distance. The second system is a 6-layer RoBERTa model that is trained and used in a similar manner, except with a masked language modeling task during pretraining. These models adopt the pretraining strategies of the classification models described in Section 5.1 but use k-nearest neighbors for classification instead of training a classification layer. We also evaluate the performance of a random-guessing baseline for reference.
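The nearest-neighbor ranking step can be sketched as follows, assuming embeddings have already been extracted from the penultimate layer; k is a free parameter and the function names are hypothetical.

```python
import numpy as np

def rank_composers(query_emb, train_embs_by_composer, k=1):
    """Rank composers by the average distance of their k nearest training embeddings.

    train_embs_by_composer maps a composer to an (N, d) array of embeddings taken
    from the penultimate layer of the pretrained language model.
    """
    scores = {}
    for composer, embs in train_embs_by_composer.items():
        dists = np.linalg.norm(embs - query_emb, axis=1)    # Euclidean distances
        scores[composer] = np.sort(dists)[:k].mean()         # average of k nearest
    return sorted(scores, key=scores.get)                     # closest composer first
```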
Tables 5 and 6 show the performance of the three baseline systems on the few-shot 9-class and 100-class recognition tasks, respectively. The upper, middle, and bottom sections of each table show performance for an N-shot task with N = 1, 10, and 100, respectively. For N = 1 we use the single nearest neighbor for each composer (by necessity), and for N = 10 and N = 100 we use the k nearest neighbors. In each trial, we randomly sample N training examples from each composer to simulate a few-shot scenario, and then we calculate the performance on the entire test set. We report the mean and standard deviation of performance across 30 trials.
Table 5
Baseline results for the N-shot 9-class recognition task. Results are expressed as mean and standard deviation across 30 trials. Top-1 accuracies are indicated in percentages (%).
| System | N | Top-1 mean | Top-1 std | MRR mean | MRR std |
|---|---|---|---|---|---|
| GPT-2 | 1 | 15.4 | 2.3 | 0.36 | 0.020 |
| RoBERTa | 1 | 14.5 | 1.8 | 0.35 | 0.017 |
| Random | 1 | 11.2 | 0.3 | 0.32 | 0.003 |
| GPT-2 | 10 | 19.7 | 1.8 | 0.41 | 0.013 |
| RoBERTa | 10 | 19.8 | 1.6 | 0.41 | 0.013 |
| Random | 10 | 11.0 | 0.4 | 0.31 | 0.003 |
| GPT-2 | 100 | 23.8 | 0.8 | 0.45 | 0.006 |
| RoBERTa | 100 | 23.7 | 0.9 | 0.45 | 0.006 |
| Random | 100 | 11.1 | 0.4 | 0.31 | 0.004 |
Table 6
Baseline results for the N-shot 100-class recognition task. Results are expressed as a mean and a standard deviation across 30 trials.
| System | N | Top-1 mean | Top-1 std | Top-5 mean | Top-5 std | Top-10 mean | Top-10 std |
|---|---|---|---|---|---|---|---|
| GPT-2 | 1 | 1.9 | 0.21 | 7.7 | 0.44 | 14.1 | 0.56 |
| RoBERTa | 1 | 1.8 | 0.20 | 7.7 | 0.45 | 14.1 | 0.57 |
| Random | 1 | 1.0 | 0.06 | 5.0 | 0.13 | 10.0 | 0.21 |
| GPT-2 | 10 | 3.0 | 0.25 | 11.2 | 0.38 | 19.1 | 0.50 |
| RoBERTa | 10 | 3.1 | 0.19 | 11.3 | 0.39 | 19.3 | 0.54 |
| Random | 10 | 1.0 | 0.10 | 5.0 | 0.15 | 10.0 | 0.23 |
| GPT-2 | 100 | 3.9 | 917 | 14.2 | 0.30 | 23.5 | 0.41 |
| RoBERTa | 100 | 4.0 | 0.14 | 14.3 | 0.27 | 23.7 | 0.34 |
| Random | 100 | 1.0 | 0.07 | 5.0 | 0.16 | 10.0 | 0.21 |
We can see that the pretrained models perform significantly better than random, and that both models perform comparably across all settings. While these results show that the pretrained models are indeed extracting style information, the performance of these models is quite poor overall, indicating how much room there is for improvement. We provide these results as a baseline against which future works can compare.
5.3 Zero-shot composer recognition
Another interesting problem to consider is zero-shot composer recognition. In this task, the goal is to predict the composer of a bootleg score fragment when no previous training examples of that composer have been seen. Within the composer recognition literature, we are not aware of previous work studying this topic until recently, when researchers explored zero-shot composer classification with music-text data (Wu et al., 2023). Here, we simply define the task and suggest some possible avenues of exploration.
This task is possible due to the rich metadata available on IMSLP. For example, given the Wikipedia articles for some unknown composers, one could infer aspects of compositional style based on their date of birth, country of origin, connections to other composers, or other knowledge about the composer that is embedded in a large language model. Wu et al. (2023) trained a model to embed both symbolic music and text descriptions into a common embedding space using a CLIP-like approach (Radford et al., 2021). However, this work does not release its music-text training pairs, and it evaluates on a small dataset (411 pieces, 8 classes). The PBSCR data would be sufficiently large to train such models, is open to the research community, and offers a much more challenging classification task. Setting up a benchmark for multimodal tasks such as this is an area for future work.
The zero-shot task opens the door to multimodal approaches to composer recognition. Given the size, richness of metadata, and open nature of IMSLP, we believe that the PBSCR data is well poised to facilitate interesting and novel directions in multimodal research.
6 Research Questions
In this section, we describe several research questions that we believe the PBSCR data is especially well suited to facilitate research on.
6.1 Encoding schemes
One open research question is “How should we encode music data when feeding it into a model?” We may want to select the encoding scheme to maximize performance on a particular task of interest, to have certain desirable properties such as key or tempo invariance, or some combination of factors. Because of our design decision to use a bootleg score representation, the PBSCR data has discarded a significant amount of musical information. Nonetheless, due to its simple format, it is well poised to facilitate rapid, iterative exploration of many interesting questions, some of which we describe next.
Image vs. Tokens. The fact that the bootleg score is a binary matrix raises the question: “Is it better to treat the data as a 2-dimensional binary image or as a sequence of discrete tokens (e.g., each bootleg score column is interpreted as a discrete “word”)?” These two options lead to different kinds of models: 2-dimensional images lend themselves to CNN or ViT-based architectures, while token sequences lend themselves to Transformer-based language models. Previous work (Yang and Tsai, 2021a) has compared simple CNN architectures with GPT-2 and RoBERTa, but a lot of other work has developed effective strategies for applying Transformers to images (e.g., ViT (Dosovitskiy et al., 2021)) and utilizing pretraining strategies like masked autoencoding (e.g., ViT-MAE (He et al., 2022)). Recent work has explored this topic (Zhang et al., 2023), and it remains an open research question.
Harmonic vs. Temporal. What are effective ways to capture both harmonic and temporal information? The approach described in Section 5.1 encodes bootleg score columns (or parts thereof) as discrete tokens and then models temporal information with a Transformer. But one could alternately encode rectangular blocks of the bootleg score image as discrete tokens, similar to the patches in a ViT model. In this case, each token would capture both harmonic and temporal information, rather than only harmonic information.
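A simple way to realize such tokens is to cut the bootleg score image into full-height patches of fixed width, as in the following sketch (the patch width is an arbitrary choice for illustration).

```python
import numpy as np

def patch_tokens(fragment, patch_w=8):
    """Split a 62 x T bootleg score image into full-height patches of width patch_w.

    Each patch spans the full pitch range plus patch_w consecutive events, so a
    single token carries both harmonic and temporal information.
    """
    n = fragment.shape[1] // patch_w
    return [fragment[:, i * patch_w:(i + 1) * patch_w] for i in range(n)]
```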
Raw vs. Processed. At what level of semantic representation is it best to represent discrete tokens? On one end of the spectrum, we could represent a bootleg score fragment simply as a sequence of zeros and ones and then let a Byte Pair Encoder combined with a Transformer model learn the most suitable representation. On the other end of the spectrum, we could design musically informed encodings, taking into account domain knowledge such as the split between the left- and the right-hand staves, the fact that staff line positions are cyclical and octave-based, etc. For example, one could encode a bootleg score column as the text “C3–G3–E4–C5”, which explicitly decomposes the staff line position into octave and class information. Although recent work has explored this topic (Fradet et al., 2023), this remains an open question.
Absolute vs. Relative. How much should the encoding of discrete tokens capture absolute vs. relative position information? One intuitive shortcoming of the Transformer models in Section 5.1 is that they encode the absolute staff line positions of noteheads, rather than relative position or movement. In the example given above, the bootleg score column encoded as “C3–G3–E4–C5” could instead be encoded as “C3–4–5–5” to capture the relative staff line intervals between notes in the chord. Furthermore, the root of the chord (C3) could itself be expressed relative to a notehead in a previous bootleg score column.
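The two encodings can be illustrated with the following sketch, in which staff-line index 0 is arbitrarily mapped to the pitch name C0; the correspondence between bootleg score rows and note names is an assumption made for illustration.

```python
LETTERS = "CDEFGAB"   # staff-line letters cycle every 7 positions

def absolute_tokens(staff_positions):
    """Encode a chord (sorted staff-line indices) as absolute note names, e.g. C3-G3-E4-C5."""
    names = [f"{LETTERS[p % 7]}{p // 7}" for p in staff_positions]
    return "-".join(names)

def relative_tokens(staff_positions):
    """Encode the same chord as its lowest note plus staff-line intervals, e.g. C3-4-5-5."""
    root = staff_positions[0]
    intervals = [str(b - a) for a, b in zip(staff_positions, staff_positions[1:])]
    return "-".join([f"{LETTERS[root % 7]}{root // 7}"] + intervals)

# Example chord spanning C3, G3, E4, and C5 (staff-line indices 21, 25, 30, 35).
chord = [21, 25, 30, 35]
print(absolute_tokens(chord))   # C3-G3-E4-C5
print(relative_tokens(chord))   # C3-4-5-5
```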
The topics above could all be studied conveniently with the PBSCR data. Many other similar questions could be rapidly explored, given the simplicity and format of the data. Thus, even though the dataset does not have complete symbolic score information, it can facilitate rapid exploration of ideas and progress on research questions of broad interest to the MIR community.
6.2 Data augmentation
Another open research question is “What are effective ways to perform data augmentation of symbolic music data?” The PBSCR data is ideal for exploring data augmentation techniques for two reasons, which we describe below.
First, the PBSCR data has a very simple format. The fact that the bootleg score is a simple 2-dimensional image makes it much easier to explore data-augmentation techniques. For example, many data-augmentation strategies developed for computer vision, such as cropping, shifting, or MixUp (Zhang et al., 2018), can be applied out of the box with no additional effort. In contrast, rapidly exploring the same types of data augmentation with symbolic music formats like MusicXML would be much more cumbersome; both existing and novel techniques are far easier to implement with bootleg score images than with MusicXML data.
Second, the PBSCR data format has several key properties that make such augmentations musically meaningful. One such property of the bootleg score is the nature of key changes: because the staff line positions (A through G) are cyclical, key changes correspond to simple vertical shifts in the bootleg score representation (assuming that the key signature is properly adjusted). Similarly, shifts in time correspond to simple horizontal shifts in the bootleg score representation. The bootleg score also has the property of additivity: adding a bootleg score event (i.e., a binary vector of length 62) describing a right-hand chord to an event describing a left-hand chord yields the event describing both chords played together. (Note that discrete token-based representations do not have this property.) It is also worth pointing out that the bootleg score representation is inherently invariant to tempo, since it describes only the sequence of noteheads rather than the absolute time between note events. Because of these properties, operations like cropping, shifting, and MixUp are very easy to implement and have clear musical interpretations, as illustrated in the sketch below.
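The following minimal sketch (illustrative only, not the authors' augmentation code) shows how these properties translate into a few lines of NumPy; the fragments and shift amounts are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
x1 = (rng.random((62, 64)) > 0.95).astype(np.float32)   # two hypothetical fragments
x2 = (rng.random((62, 64)) > 0.95).astype(np.float32)

transposed   = np.roll(x1, shift=2, axis=0)   # vertical shift ~ transposition
                                              # (wrap-around at the edges ignored here)
time_shifted = np.roll(x1, shift=4, axis=1)   # horizontal shift ~ shift in time
cropped      = x1[:, 8:56]                    # crop a shorter temporal window

# MixUp-style interpolation of two fragments (labels would be mixed with the same lambda).
lam = 0.7
mixed = lam * x1 + (1.0 - lam) * x2

# Additivity: overlaying two events gives the event of both played simultaneously.
overlay = np.clip(x1 + x2, 0.0, 1.0)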
The above properties make the PBSCR dataset ideal for exploring data-augmentation strategies. This includes exploring the effectiveness of existing data-augmentation techniques from computer vision, as well as quickly implementing and trying domain-specific data-augmentation techniques.
6.3 Integrating multimodal information
Yet another interesting research question is “How can we train models with multiple modalities of data?” The PBSCR dataset is ideal for exploring this question in MIR for three reasons, which we describe next.
First, the PBSCR data is linked to rich metadata on IMSLP. In particular, each work in IMSLP is accompanied by rich multimodal information: audio recordings, MIDI files, sheet music scores, arrangements and transcriptions, relevant metadata (e.g., composer, publisher information, composition date, composer time period, instrumentation), and links to relevant Wikipedia article pages and descriptions (e.g., All Music Guide). Importantly, in keeping with IMSLP’s philosophy, these resources generally have very research-friendly licenses: most audio recordings and sheet music scores have a Creative Commons license or are in the public domain.
Second, the PBSCR dataset is large enough to study multimodal problems at a nontrivial scale. Given the scale of modern models, a correspondingly large quantity of data is needed for training. The PBSCR dataset—and certainly IMSLP—fulfills this requirement. One additional benefit of utilizing IMSLP data is that the website is actively maintained, so the quantity of data will presumably continue to grow.
Third, the bootleg score representation is well suited for cross-modal and multimodal tasks. Previous works have demonstrated the effectiveness of the bootleg score representation for cross-modal tasks. For example, it has been used in cross-modal retrieval to find matches between the Lakh MIDI Dataset and sheet music in IMSLP (Yang and Tsai, 2021b), and it has been used in cross-modal transfer learning to perform composer classification of audio recordings using sheet music as training data (Yang and Tsai, 2021a). As such, it is well suited to connect multiple representations of music, including sheet music images, symbolic files, and audio.
For these reasons, we believe the PBSCR dataset is particularly well situated to facilitate multimodal research in MIR. Setting up the infrastructure for specific multimodal tasks is an area for future work.
7 Conclusion
This article motivates, describes, and presents the PBSCR dataset for studying composer recognition of piano sheet music. Our overarching goal was to create a dataset for studying composer recognition that is “as accessible as MNIST and as challenging as ImageNet.” To achieve this goal, we use a fixed-length bootleg score representation extracted from piano sheet music images on IMSLP. This choice allowed us to access a large, open, diverse set of data while presenting it in an extremely simple format that mimics MNIST images. The dataset itself contains labeled fixed-length bootleg score images for 9-class and 100-class recognition tasks as well as a large set of variable-length bootleg scores for pretraining. We include relevant information to connect each bootleg score fragment to the specific work, PDF score, and page from which it was taken, and we scrape, collect, and organize metadata from IMSLP on all works to facilitate multimodal research in the future. We describe several research tasks that could be studied with the dataset and present baseline results for future works to compare against. We also discuss open research questions for which the PBSCR data is especially well suited.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. 2144050.
Competing Interests
The authors have no competing interests to declare.
Notes
[2] This article is a journal extension to Tsai and Ji (2020), with an exclusive focus on the dataset (and not the techniques). This article focuses on expanding and improving this previous dataset and making it as easy to use as possible. The novel contributions in this journal extension include (a) going to significant lengths to clean up the large, unlabeled dataset for pretraining by removing non-music filler pages (Section 3), (b) expanding the labeled dataset from 9 composers to 100 (Section 4), (c) adding metadata information from IMSLP to facilitate multimodal research and allow for convenient linking to other datasets (Section 3.3), (d) presenting a new set of composer classification results using the updated dataset (Section 5), and (e) offering a comprehensive discussion of the context (Section 2), data leakage issues (Section 4.3), and research questions relevant to this dataset (Section 6).
