
Towards an ‘Everything Corpus’: A Framework and Guidelines for the Curation of More Comprehensive Multimodal Music Data

Open Access | May 2025

1 Introduction and Motivation

Life is inherently multimodal. We experience the world through many, richly intertwined modalities. It is hardly surprising, then, that multimodal data have proven effective across a range of computational tasks that model human experience and expertise (Ephrat et al., 2018; Liem et al., 2011; Simonetta et al., 2021; Soleymani et al., 2011; Yu et al., 2023). As the scale, range, and interconnectedness of multimodal datasets continue to grow, it has become clear that researchers and practitioners in music information retrieval (MIR) are working towards a kind of ‘Everything Corpus’—a collection of not only source material (e.g., audio) but ancillary and supportive data, of which some might be obvious (e.g., album cover images) but others may be less apparent (e.g., court precedents or seismic activity), all spanning a complex range of modalities, formats, and tasks that are of increasing interest to diverse fields.

With the extreme growth in the complexity of musical data comes an equally pressing need to ensure that corresponding multimodal datasets are as comprehensive as possible while remaining precise and well targeted in their intended aims, amenable to open and reproducible research, paradigmatic of a high standard of quality, committed to best ethical standards and practices, and generally able to withstand reasonable future scrutiny. Although an ever‑increasing number of multimodal datasets for MIR research are being developed (Bertin‑Mahieux et al., 2011; Vatolkin and McKay, 2022; Zalkow et al., 2020; Zhang et al., 2023)—each satisfying these needs to varying degrees—there is not yet a clear path forward on how best to navigate these needs.

As a step towards meeting these needs as fully as possible, this article details a conceptual framework for how practitioners might explore the complexity of multimodal music data and proposes criteria and guidelines for the eventual construction of an ‘Everything Corpus’ in MIR research. The intended aims and advantages of such a corpus include a deeper and more encompassing understanding of music data relevant to a wider variety of foundational MIR‑related tasks; an open door for researchers and practitioners from diverse backgrounds (e.g., engineers, musicologists, and psychologists) to lend their respective expertise in working with music data; and an elevated call for improved practices in the construction of multimodal datasets in line with the needs identified above. In contrast to recent work by Christodoulou et al. (2024), we focus less on definitions of multimodality in MIR research and analyses of existing datasets, and more on all relevant music data and effective criteria for curating new multimodal datasets.

In Section (§) 2, we provide clarifying definitions for terminology relevant to multimodal music datasets followed in §3 by a brief overview of the important milestones within the multimodal MIR research community. In §4, we identify 12 different themes of musical data divided into three sequential phases and five narrow focus areas—(i) ‘before’ the music (leading to), (ii) the ‘actual’ music (itself and around it), and (iii) ‘after’ the music (uses of and responses to)—based on historical observations of music and musical practice. Importantly, this framework is not limited to data that have been used in multimodal MIR research but also those which could be included in the future. In §5, we present 17 quantitative, qualitative, and ethical criteria, informed by this conceptual framework and practices observed in existing multimodal datasets, which serve as guidelines for future researchers to consider as part of curating an ‘Everything Corpus’ for multimodal MIR research.

2 Terminology

There is relatively little agreement about the precise definitions of terminology connected to ‘multimodal’ research and data. Broad descriptions of modality have included “the way in which something happens or is experienced” (Baltrušaitis et al., 2019, p. 423), and while music psychology has typically related the concept to sensation (e.g., the “human body can sense in auditory, visual, tactile, and olfactory modalities” (Leman, 2008, p. 252)), usage in computer science associates modality with the entities used in information processing methods, such as “audio, images, and texts” (Oramas et al., 2018, p. 4). Here, we clarify our use of key terms without prescribing how others ought to use them or attempting to provide an ontology.1

As our primary focus is on multimodal dataset design and evaluation in MIR, we adopt a perspective more grounded in computer science, which takes into account differences in data that give rise to distinct information processing methods. Thus, ‘media’ describes whether information is stored and transmitted as audio, video, or another signal (e.g., electroencephalograms (EEG)),2 symbolic data (e.g., text or score), or image. ‘Modality’ refers to the semantic meaning of a particular media entity, such as a recorded or annotated musical piece, lyrics, album covers, or concert videos. For instance, a single modality in the form of a musical score can be represented by an image produced by optical music recognition (OMR) software or a symbolic score annotated manually. Similarly, distinct modalities, such as studio and live performance recordings of a musical piece, may be represented by the same media (audio signal). We speak of ‘multimodality’ when at least two modalities are combined. It is worth stating that the same modality may lead to distinct human sensory experiences, e.g., lyrics can be heard in audio or seen as text.

‘Source’ relates to the concrete environment where the data are collected and/or to the artist or algorithm responsible for their creation. For instance, musical audio can be recorded in a studio or live, at different venues, and at different time periods. Figure 1 illustrates the complexity of how different sources and media may relate to one another. If multiple actions (arrows) are applied, we may distinguish between original and secondary sources, e.g., a concert may provide the original source, with audio recorded there becoming a secondary source for use in a transcription process. In such cases, for both sources and media, this additional semantic meaning, ‘recorded live music piece’ or ‘transcribed score’, leads to multiple modalities.

Figure 1: Extended representation of an illustration provided in Müller (2021, p. 30) showing different original sources (circles) and media (boxes).

‘Data representation’ extends modality with respect to (1) the nature of the data in mathematical terms, being numerical (a number or tensor), categorical (ordinal or nominal), or symbolic (musical score); (2) a structural entity within the same media type, such as ‘ordered text list’ or ‘graph‑based taxonomy’; and (3) further descriptions, such as ‘code’ (text constructed from pre‑defined elements following specific building rules). ‘Format’ encapsulates or encodes the data, which is stored on computers or other digital devices, and is specified by a recognised standard and/or file extension. Formats can be either ‘generic’ (i.e., a related class of file extensions sharing a similar structure or purpose, such as XML) or ‘specific’ (i.e., designed or intended for a specific modality, such as MP3 for audio). ‘Capturing method’ describes how data are captured or obtained (e.g., microphone or lens) and, in the best case, specifies the actual device used. For further reading on abstraction techniques for data representation, we refer to Wiggins (2009).
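To make these distinctions concrete, the following minimal sketch (our illustration; the field names and example values are assumptions, not a schema proposed in this article) records source, media, modality, data representation, format, and capturing method for individual items in a multimodal dataset.

```python
# Minimal sketch of per-item descriptors for a multimodal dataset entry.
# All field names and example values are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataItem:
    source: str                   # e.g., "studio session" or "live concert, 2023-05-12"
    media: str                    # "audio", "video", "image", "symbolic", "other signal"
    modality: str                 # semantic meaning, e.g., "live performance recording"
    representation: str           # e.g., "numerical (sampled waveform)", "ordered text list"
    file_format: str              # generic or specific, e.g., "XML", "MP3", "FLAC"
    capturing_method: Optional[str] = None  # e.g., "condenser microphone"


# The same media type (audio) carrying two distinct modalities:
studio = DataItem("studio session", "audio", "studio recording",
                  "numerical (sampled waveform)", "WAV", "condenser microphone")
live = DataItem("concert hall", "audio", "live performance recording",
                "numerical (sampled waveform)", "FLAC", "microphone array")
print(studio.media == live.media and studio.modality != live.modality)  # True
```

A fuller treatment would, of course, also link items to one another, for instance to distinguish original from secondary sources as in Figure 1.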

3 A Brief History of Multimodal MIR Research

Extensive literature reviews and discussions of multimodal tasks and data in MIR exist elsewhere (Christodoulou et al., 2024; Müller et al., 2011; Simonetta et al., 2019). For this reason, we aim below to provide a high‑level, chronological overview encompassing only some of the more important modern milestones in MIR multimodal research to date rather than a systematic review. A Scopus search was carried out on 2 September 2024 within the title, abstract, and keyword fields of all publication types using the following keywords: (‘multi‑modal’ OR ‘multimodal’) AND ‘music’ AND ‘retrieval’. This revealed 293 documents, beginning with three documents in 1997 and rising more or less steadily to 49 documents in 2023 (16 documents had been indexed for 2024 at the time of the search).

3.1 Early years: 1997–2003

While historical precedents exist (see, e.g., Schneider (2018) for a discussion), the earliest research arguably recognisable as multimodal in a modern sense focused on video and sound with efforts towards audiovisual scene segmentation, indexing, and retrieval driven largely by the signal processing and computer vision communities (Faudemay et al., 1997; Qi et al., 1997). Soon after, Haus and Pollastri (2000) stressed the importance of multimodality to computers and music by proposing a framework for music interfaces aimed at facilitating human–computer interactions through singing, playing, and notating music. Schuller et al. (2003) later re‑emphasised this importance for general music retrieval tasks and presented one of the earliest multimodal query systems to use humming, speaking, writing, and typing through belief networks and contextual knowledge.

3.2 Early‑middle years: 2004–2009

As MIR became more established as a field, two significant peaks in 2004 and 2008 brought an expansion of multimodality to other computational tasks and widened the discourse on multimodality to include more diverse and unique perspectives. In 2004, Jang et al. (2004) expanded karaoke interfaces by allowing users to sing, hum, tap, speak, or write along with the system; Wang et al. (2004) addressed audio–lyric alignment on a 20‑song dataset; and one of the earliest examples of large‑scale database retrieval for music emerged (Schuller et al., 2004). Music visualisation also emerged at this time as a means for supporting multimodal navigation and recommendation (Kurth et al., 2005; Lübbers, 2005).

Celma et al. (2006) sought to address the disconnect between semantic descriptions and the information that can be extracted from music, while one of the earliest examples of multimodal audiovisual drum performances created for MIR tasks also emerged (Gillet and Richard, 2006). Now long‑standing tasks of audio segmentation and polyphonic note tracking also began to incorporate additional modalities, such as lyrics (Cheng et al., 2009) and performance video recordings (Quested et al., 2008), while among the relatively new areas of focus at this time were tasks aimed at incorporating mood and emotion. For example, work by Yang et al. (2008) aimed to utilise multimodality to improve categorical music emotion classification, while Laurier et al. (2008) addressed mood classification based on audio and lyrics. Further interesting work to aid in the management and navigation of multimodal music collections also emerged (Kurth et al., 2008; Lübbers and Jarke, 2009; Thomas et al., 2009).

3.3 Later‑middle years: 2010–2020

Significant rises in multimodal MIR research took place around 2010 and 2020 with a wider variety of work in both new and existing areas appearing during this time. Areas of continued focus included mood and emotion, with unsupervised learning approaches to mood recognition based on audio and lyrics (McVicar et al., 2011); continuous analysis of mood using the Now That’s What I Call Music! (NTWICM) dataset of popular UK chart music (Schuller et al., 2011); support vector machine approaches to emotion classification based on Musical Instrument Digital Interface (MIDI), audio, and lyrics (Lu et al., 2010); mood and emotion classification/detection based on audio and lyrics (Delbouys et al., 2018; Hu and Downie, 2010; Quilingking Tomas et al., 2020); and emotion modelling (Schmidt and Kim, 2011). Newer areas included genre classification based on a variety of different modalities, including audio and lyrics (Mayer and Rauber, 2011); audio, symbolic, and cultural sources (McKay, 2010); various acoustic features and tags (Zhen and Xu, 2010); audio and colour information extracted from music videos (Schindler and Rauber, 2015); as well as customer reviews, metadata, and audio descriptors (Oramas et al., 2016).

Other diverse tasks concerned popularity prediction using audio, lyrics, and metadata (using the SpotGenTrack Popularity dataset) (Martín‑Gutiérrez et al., 2020), cover song detection using melody and accompaniment separation (Foucard et al., 2010), playlist generation using audio and textual descriptors (Chiarandini et al., 2011), audio‑to‑image score alignment (Dorfer et al., 2017), score following using image score and spectrograms (Dorfer et al., 2016), as well as reinforcement learning (Dorfer et al., 2018), audiovisual source association of string ensemble playing (Li et al., 2019), song retrieval from image using lyrics (Li et al., 2017), fusion of music with videos (Xu et al., 2014), music instrument recognition using audio and video (Slizovskaia et al., 2017), singing voice separation using audio and lyrics (Meseguer‑Brocal and Peeters, 2020), and an audiovisual tutor for guitar (Barthet et al., 2011). Graph‑based approaches also emerged, such as auto‑tagging (Hsu and Huang, 2015) and mood classification using lyrics and audio (Su and Xue, 2017). Work by Barthet and Dixon (2011) further broadened the discourse of multimodality to include ethnography, musicologists, and their implications for MIR research. During this time there was also an increase in efforts towards improved personalisation and user modelling, with modalities including audio features, listening activities, and geospatial location (Schedl and Schnitzer, 2014); recommendations based on physiological measures and personality traits (Liu and Hu, 2020); personality traits and bodily movements (Agrawal et al., 2020); and embodied cooperation (social queries) (Varni et al., 2011).

More multimodal datasets also began to emerge, including, notably, the Million Song Dataset (MSD) (Bertin‑Mahieux et al., 2011), the Musical Theme Dataset (MTD) of musical themes (Zalkow et al., 2020), the Acoustic and Lyrics Features (ALF)‑200k dataset of audio and lyrics features for playlist‑related tasks (Zangerle et al., 2018), the Dataset for Emotion Analysis using Physiological Signals (DEAP) database of physiological signals and video for emotion analysis (Koelstra et al., 2011), the PROBADO multimodal online library of audio, score, and lyrics (Thomas et al., 2012), and datasets of emotional and colour responses to music (Pesek et al., 2017), among others. Related work sought to further facilitate the creation of such datasets through audio and gesture capture software (Hochenbaum and Kapur, 2012).

3.4 Recent years: 2021–present

In recent years, much MIR research has concentrated on generative artificial intelligence (AI) and other advanced deep learning approaches (e.g., large‑language models (LLMs), diffusion‑based models, and, specifically, multimodal LLMs (Fu et al., 2023)), applied to an ever‑widening range of tasks. Graph‑based neural networks have been applied to music recommendation (Cui et al., 2023), piano transcription (Li et al., 2023), and predicting artist similarity (da Silva et al., 2024), while LLMs have augmented song metadata for improving emotion recognition, genre classification, and music tagging in audio (Rossetto et al., 2023).

Various similar tasks of image–audio transcription (Alfaro‑Contreras et al., 2023), audio‑to‑score alignment (Simonetta et al., 2021), audio‑score retrieval (Carvalho and Widmer, 2023; Carvalho et al., 2023), caption generation (Manco et al., 2021), and others that incorporate emotion (Doh et al., 2023; Stewart et al., 2024) have relied on a variety of deep learning approaches from encoder–decoder architectures to attention‑based ones. Representation and metric learning approaches have also focused on tag‑based music retrieval (Won et al., 2021; da Silva et al., 2022) and cover song detection (Gu et al., 2023) while also attempting to integrate several different modalities, such as text, images, graphs, and audio (Tabaza et al., 2024). Other interesting multimodal tasks, such as singing language classification using the Music4All dataset (Choi and Wang, 2021) and singing melody extraction using a neural harmonic‑aware network with gated attentive fusion (Yu et al., 2023), have become more prominent. Still, long‑standing tasks, such as emotion recognition (Sung and Wei, 2021) and instrument identification based on audio, image scores, and MIDI (Yang et al., 2022), continue to be areas in which deep learning proves effective.

Perhaps unsurprisingly, with the ever‑growing emphasis on deep learning and generative AI across different modalities, many multimodal datasets have recently been created. Recent examples of such datasets include the Schubert Winterreise Dataset (SWD) (Weiß et al., 2021) and Wagner Ring Dataset (WRD) (Weiß et al., 2023) for computational music analysis tasks; the EMOPIA dataset of pop piano music for emotion recognition and emotion‑based generation (Hung et al., 2021); the University of Hong Kong 956 (HKU956) dataset for emotion detection consisting of audio, physiological signals, and self‑reported emotion (Hu et al., 2022); the Musical Ternary Modalities (MusicTM) dataset for joint representation learning of audio, sheet music, and lyrics (Zeng et al., 2021); the Music4All‑Onion content‑centric dataset (Moscati et al., 2022); the Next‑MV dataset of stylistic and cultural correlations between music and video (Chen et al., 2022); the N20EM dataset for lyric transcription consisting of audio, head movements of a singer, and videos of their lip movements (Gu et al., 2022); the Vocal92 dataset of a cappella solo singing and speech (Deng and Zhou, 2023); and the WikiMuTe dataset of web‑sourced semantic descriptions of audio (Weck et al., 2024), among others.

3.5 Summary

While this overview of multimodal MIR research is not intended to be exhaustive, it does highlight several important points. First, the significance of music's multimodal nature was evident early; however, the emergence of various multimodal datasets aimed explicitly at MIR tasks has largely tracked with the on‑going development of these tasks within the field and the concurrent rise in various deep learning advancements which have allowed for multiple modalities in recent years. Second, audio, symbolic music data, image/video, and text in various forms remain among the most common modalities represented in research. Finally, the steadily increasing trend of multimodal research in MIR over the years suggests that multimodality will continue to be important for the foreseeable future.

4 A Framework for Understanding All Data that Has or Could be Used in MIR

Many types of musical data feature somewhere in scholarship on music, and there have been considerable efforts to make multimodal connections among these types, both within and across various fields (§3). At the same time, many of these efforts remain un‑ or under‑explored owing largely to divisions between different research ecosystems having their own publication venues and cross‑referencing practices, despite their shared focus on music and closely aligned research goals.

Table 1 presents our conceptual framework of 12 different themes of music and music‑adjacent data that either have featured or plausibly could be featured in multimodal MIR research. Each theme belongs to one of three sequential phases (and five focus areas), concerning either the preparation of music, its creation, or beyond, intended to help researchers and practitioners in thinking about music data and directing their efforts in the construction of multimodal music datasets.

Table 1: Proposed conceptual framework of 12 themes of music data, broken down into three phases (phase −1, ‘before’ the music (leading to); phase 0, the ‘actual’ music (itself and around it); and phase +1, ‘after’ the music (uses of and responses to)) and five focus areas. Example sources and media/representations are provided for each.

Theme | Source examples | Media/representation examples

Phase −1: ‘Before’ the music (Focus area: ‘Leading to’ the music)
Context (§4.1) | Personal biography | Symbolic (text as linked open data)
  | Socio‑cultural history, network | Symbolic (text as linked open data)
  | Economic: e.g., commission | Symbolic (text as linked open data)
  | Technical: e.g., textbooks | Symbolic (text)
Preparation (§4.2) | Composition sketches | Image, symbolic (text as version control)
  | Performer practice | Symbolic (text), video, other signal (motion capture)

Phase 0: The ‘actual’ music (Focus area: The music ‘itself’)
Composition (§4.3) | Edition | Image
  | Score | Symbolic (score)
  | Live coding | Symbolic (text as code)
Performance (§4.4) | Live performance | Audio, image, video, other signal (thermal, seismic time series)
  | Studio recording | Audio, image, video
  | Instruments | Symbolic (text as taxonomy)
  | Playing action: explicit | Symbolic (text)
  | Playing action: observed | Symbolic (text), video, other signal (motion capture)

Phase 0: The ‘actual’ music (Focus area: ‘Around’ the music)
Associated media (§4.5) | Album artwork | Image
  | Lyrics | Symbolic (text)
  | Promotional photo shoot | Image
  | ‘Music video’ | Video

Phase +1: ‘After’ the music (Focus area: ‘Uses’ of music)
Other media (§4.6) | In video games | Symbolic (text as mapping of music cues to game event triggers)
  | In film, television, advertisements | Video
Meta‑composition (§4.7) | In album | Symbolic (text as ordered list)
  | In playlist | Symbolic (text as ordered list)
  | In recommender sequence | Symbolic (text as code)
  | In setlist | Symbolic (text as ordered list)
  | As sample | Symbolic (text as list of time events)
Popularity (§4.8) | Charts | Other signal (time series of rank, dates, sales)
  | Streams, likes, shares, skips | Other signal
Culture/occasion (§4.9) | Specific: e.g., a coronation | Symbolic (text, also as date or GPS location)
  | Generic: e.g., weddings | Symbolic (text)

Phase +1: ‘After’ the music (Focus area: ‘Responses’ to music)
Physiology (§4.10) | Heart rate | Other signal (time series)
  | Skin conductivity | Other signal (time series)
  | Brain signals | Other signal (EEG)
  | Seismic | Other signal (time series)
Analysis (§4.11) | Analysis | Symbolic (text as fixed syntax or free prose)
  | Genre labels | Symbolic (text based on taxonomy)
  | Journalistic writing | Symbolic (text)
  | Fan/open writing | Symbolic (text)
Legal (§4.12) | Court precedents | Symbolic (text)

As expansive as this framework aims to be, we necessarily limit its scope. First, we are looking for music‑centred multimodality for prospective use in MIR, so all data types must relate to music in some way. Second, we exclude data that can be derived directly from other data, for example, through processes of alignment or feature extraction. Audio is thus included, but audio features are not. Third, we omit metadata as an independent category because it is relevant, to varying degrees, to all data types. Notable examples of elements shared between many parts include metadata for the people involved in these processes and their stated roles. Finally, we emphasise that music data is complex and that the proposed framework is neither truly comprehensive nor the only way of organising this data. To this end, we find our efforts complementary to other related categorisations, such as the ‘musicking quadrant’, which distinguishes ‘in time’ from ‘out of time’ musical acts and musical ‘experience’ from ‘creation’ (Jensenius, 2022). In the following subsections, we expand on each of the themes presented in Table 1, noting examples of the different kinds of data and complexities that emerge.
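As a compact, machine‑readable view of Table 1, the following sketch nests phases, focus areas, and themes; the use of plain Python dictionaries is an illustrative assumption rather than a proposed encoding.

```python
# Phases -> focus areas -> themes, mirroring Table 1.
FRAMEWORK = {
    "Phase -1: 'Before' the music": {
        "'Leading to' the music": ["Context", "Preparation"],
    },
    "Phase 0: The 'actual' music": {
        "The music 'itself'": ["Composition", "Performance"],
        "'Around' the music": ["Associated media"],
    },
    "Phase +1: 'After' the music": {
        "'Uses' of music": ["Other media", "Meta-composition",
                            "Popularity", "Culture/occasion"],
        "'Responses' to music": ["Physiology", "Analysis", "Legal"],
    },
}

# Sanity check: three phases, five focus areas, twelve themes.
assert len(FRAMEWORK) == 3
assert sum(len(areas) for areas in FRAMEWORK.values()) == 5
assert sum(len(themes) for areas in FRAMEWORK.values()
           for themes in areas.values()) == 12
```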

4.1 ‘Leading to’ the music: Context—the mysterious zeitgeist

Music and musical creation emerge from a complex societal context (Zeitgeist) that is challenging to capture meaningfully in encoded datasets. For that reason, the topic has often been limited to study within musicology rather than MIR. One of the defining challenges for an ‘Everything Corpus’ is to find ways to unite these two fields. There are many contextual matters that clearly make a material difference to the resulting music, although these can be hard to make clear observations of and to encode and leverage computationally. These include the economic context of commission (e.g., Taylor Swift re‑recording her back‑catalogue in ‘Taylor's Version’)3 and stylistic fashions of the time, musical and otherwise, having bi‑directional influence (Strähle and Rödel, 2018). Additionally, there is the socio‑cultural network involved, including details of who taught or influenced whom, either verbally or in written form through the technical guides that artists engaged with while learning their craft.

4.2 ‘Leading to’ the music: Preparation—putting it all together

The timescale for musical creation depends largely on the practices involved and varies widely, from painstakingly slow in some cases to instantaneous in the case of improvisation. It can likewise be the work of an individual (typical of a classical composition) or a collective effort (as in most popular music). In non‑instantaneous cases, there may be documentary evidence recording the compositional process, for example, in the form of sketches. While the study of sketches/manuscripts is an established practice in musicology, it requires multimodal expertise hardly touched by MIR.4 The age, status, and reliability of a source can be gauged from the physical paper (e.g., watermark), the probable scribes (e.g., their proficiency and relationship to the composer), the apparent stage of source and correction (e.g., a working draft or clean copy), and the details of publication and transmission (e.g., named publisher and date).

On the digital side, there is the infrastructure for version control, but this is not (yet) widely used for tracking and comparing musical sources. Musicology's sketch and manuscript studies currently intersect with MIR in limited cases, such as Beethovens Werkstatt.5 There is little MIR research on musical source version control, even though composition increasingly takes place in digital environments, such as notation software packages and digital audio workstations. Promising examples include diff tools for MusicXML and MEI (Foscarin et al., 2019; Herold et al., 2020) as well as projects for recording the keystroke data that would be applicable to relevant cases, such as live‑coding (Lee and Essl, 2014). In this effort, we may stand to learn from the visual and textual arts. For example, the files behind the visual artist David Hockney's recent ‘iPad paintings’ store every pen stroke, and some galleries even display this process as a time‑lapse video. In text, we note the long‑standing field of keystroke logging and ‘genetic narratology’ (Wengelin and Johansson, 2023).
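As a minimal illustration of what version control over notated sources could look like, the sketch below computes a plain‑text diff between two hypothetical revisions of a MusicXML file (the file names 'sketch_v1.musicxml' and 'sketch_v2.musicxml' are assumptions); dedicated tools such as the MusicXML and MEI diffs cited above operate on musical symbols rather than text lines.

```python
# Generic, line-based diff of two score revisions stored as MusicXML.
# File names are hypothetical; a symbol-aware diff would be preferable in practice.
import difflib
from pathlib import Path


def score_text_diff(path_a: str, path_b: str) -> str:
    """Return a unified diff between two MusicXML files, treated as plain text."""
    a = Path(path_a).read_text(encoding="utf-8").splitlines(keepends=True)
    b = Path(path_b).read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile=path_a, tofile=path_b))


if __name__ == "__main__":
    print(score_text_diff("sketch_v1.musicxml", "sketch_v2.musicxml"))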

Composition also often involves pre‑existing material, e.g., quoting older musical items or setting a text to music. The ‘contextual’ use of existing material clearly has a counterpart: The pre‑existing piece is used in the new one. This is one of many cases highlighting the rich inter‑connectedness of musical data. We return to the latter perspective of used in with §4.6.

It is not only the composer who works in advance of a performance: The performer, too, prepares. Understanding this process was one motivation for the recent Rach3 dataset, which records the preparation of Sergei Rachmaninoff's notoriously difficult Piano Concerto No. 3 (Cancino‑Chacón and Pilkov, 2024) and includes synchronised MIDI, audio, and video taken from above to primarily capture the hands and fingers and, more incidentally, the arms, head, and torso. This is a commendable project, creating rich data. Yet, even here, it is easy to imagine more. For instance, music psychology might raise the case for including video of eye movement, given the long history of studying such data, particularly in the context of sight‑reading (Perra et al., 2021).

4.3 The music ‘itself’: Composition—aide‑mémoire

In most contexts, music is transmitted aurally; however, in some cultures there is written notation that serves not to capture every detail (notation underdetermines performance) but to aid memory and transmission (Goehr, 1994). This gives rise to the ‘symbolic’ side of MIR data and research, and is by no means limited to Western culture. Resources such as Jon Silpayamanant's Timeline of Music Notation give a sense of how misleading a Western‑centric notion can be, and recent years have seen growing attention from MIR to the world's music, including notated music from beyond the West (e.g., Han et al. (2024)). As with the compositional process, even ‘final’ compositions often exist in multiple versions rather than a single, definitive one. For instance, there may be a complex cluster of conflicting editions. For an MIR‑friendly example, see the Online Chopin Variorum Edition (OCVE) project on navigating the complexities of Chopin sources.6

4.4 The music ‘itself’: Performance—what humans do when playing music

Performance (live and otherwise) implies a great deal of detail, including information about the individuals involved, their roles and relationships (both musical and social‑societal), the venue (e.g., location and capacity), and any connections to other events, such as a festival or tour.

Performance may also fuse with ‘composition’ in the case of improvisation and other spontaneous acts, and separating the two is complicated not only in the case of improvisation but even when we consider notation. Notational practices vary with respect to playing techniques and other instrumental/vocal actions. More prescriptive systems, such as tablature, emphasise an instrumentalist’s actions (see, e.g., Koozin's (2011) fret‑interval types for guitar playing). By contrast, more descriptive systems (Western classical) relate primarily to the sounding result and rely on peripheral symbols and/or text for anything gestural.

Apart from notational attempts to describe the physical act of playing music, there are technical methods for retrieving that data. For example, Chander et al. (2022) used motion capture (with ‘reflective markers’ on the players’ bodies) to analyse the movements of violinists. And these technologies (such as motion capture) can also have wider purposes: The motion involved in making music is only a small part of what happens, physically, at a musical performance. In the majority of the world's musical practices, there is at least some meaningful physical action taking place that is beyond what is strictly required for making sound. Such examples include dance, procession, and ritual—much of which is highly structured and regular from one performance to another, whether it is documented or not. The combination of music and dance is well studied, and while Western languages typically distinguish between ‘music’ and ‘dance’, for instance, this is not true of all musical lexicons.7

Physical gestures can also carry semiotic meaning, as Nadkarni et al. (2024), for example, have discussed in relation to raga in Indian classical music. More generally, ‘dancing’ to music can carry musical meaning relating directly to MIR features. For example, dancing can express a view about the ‘main’ tempo by changing rates of motion from synchrony with the fast pulse to a slow one at ‘half tempo’ moments. Finally, some born‑digital music uses movement itself as a kind of instrument to trigger sound events. On the topic of movement to generate music using digital instruments, see the proceedings of the New Interfaces for Musical Expression (NIME) conferences.

4.5 ‘Around’ the music: Associated media—does my music look good in this?

While the music‑adjacent motion discussed above exists in time, as music does, it also lends itself to being captured as a timeless moment. Both video and still images can include multiple cameras/views and all the geophysical details that go with them (e.g., global positioning system (GPS) position, direction, and zoom). The same is true of microphone positioning for audio in studio and live recordings, with many orchestras, for example, recording concerts with dozens of microphones.8 While much of the visual data tends to be of recorded live performances (by fans and journalists alike), the same could be had for studio work, a currently limited amount of which is available in documentary film.

Visual imagery naturally extends beyond stills of concerts or any active making of music. Notably, album covers and other official artwork associated with music can be highly indicative of the genre. Compare, for instance, the visual aesthetic of folk versus metal music (Vatolkin and McKay, 2022). Musical artists also typically cultivate a visual aesthetic beyond the music, which can include set designs, as well as an individual's ‘look’ (e.g., clothing). Staged photo shoots provide clearly delimited data, and further sources can include press and fans' photos of artists' appearances in public.

Beyond visual aspects, we also situate lyrics here, ‘around’ the music, while noting that this is another complex case. Consider that lyrics are often pre‑existing (‘set to music’) and track the musical content closely—at least as closely as the in‑performance motion discussed above. Moreover, lyrics are also often connected to occasion‑specific uses of music, as will be discussed in the following subsections.

4.6 ‘Uses’ of music: Other media—hearing with our eyes

In addition to video intended to accompany music (e.g., music videos), multimodal MIR must also consider uses of music in video and other media (e.g., feature films, TV, and adverts). Purposes can include coordinating important musical events with visual action and movement (so‑called mickey mousing). Moreover, usages of this kind can be centred on other musical objects, such as instruments (e.g., the distinctive, oscillating sound of the ‘Swarmatron’ in the film The Social Network (2010)). Another purpose is matching the mood: data about the context of the quotation, such as the emotional tone of the dialogue and imagery at the point of sampling, can provide insight and intersect with computational methods for language applied to the script. Usage in film also often creates a spike of wider engagement, e.g., in the number of streams (cf. §4.8).

Video games share all of the above considerations and add a more complex scenario: For instance, while most films have fixed order and length, games offer multiple ‘routes through’. The field of research into ludomusicology is quickly growing (Kamp et al., 2016), although not as fast as the staggering rates of growth in the industry and the technologies around it, which include augmented‑reality (AR), virtual‑reality (VR), and extended‑reality (XR) headsets that produce spatial and acoustical data that interact with game music in even more complex ways.

4.7 ‘Uses’ of music: Meta‑composition—you know it by the company it keeps

‘Composition’ takes its etymology from the Latin componere (‘to put together’), and musical compositions have included pre‑existing material for as long as records attest, e.g., through the quotation of a well‑known melody. Similarly, whole pieces are combined in various ways. Movements can be combined into a multi‑movement work, such as in a symphony, and popular tracks into an album, whether they were intended for that purpose or not (e.g., compilation). In all of these cases, the ordering can be highly significant.

Works are also combined in a specific order for live performance. Both band setlists and orchestral programmes demonstrate the thought that artists and management give to the choice and sequence of items. These data are theoretically easy to record and indeed becoming more readily available. Recent years have seen orchestras, for example, release MIR‑ready datasets of their entire performance histories, while sites such as setlist.fm provide crowdsourced data for the (broadly) equivalent history of popular music performance.9

Beyond what was played and where, richer data can include fees paid (as retrievable from charities and other publicly funded groups disclosing their financial records), audience numbers, and more. Such data are currently understudied in MIR, relative to playlists and recommender systems, which have attracted considerably more attention (as summarised in Gabbolini and Bridge (2024)), partly owing to the commercial interests of major digital (e.g., streaming) companies. Playlists can also have a specific purpose, such as relaxation or exercise, with some companies, such as the exercise bike company Peloton, even having rights arrangements with artists. Various musical features, such as tempo, are important to such playlists, and their uses are further relevant to music therapeutic contexts (Bemman et al., 2023).

Uses can be more granular than whole works. Today, the most prevalent form of musical quotation is likely sampling. The WhoSampled website and application programming interface (API) provides insight into these secondary uses of pre‑recorded music.10 Memes are an increasingly important (and understudied) use case that, along with sampling, may indicate a growing desire among non‑professional musicians to not simply play music they appreciate, relatively passively, but to play with it more actively.

4.8 ‘Uses’ of music: Popularity—music as contest

Usage also relates to popularity, another source of relevant data which can include the nature of that usage, e.g., when certain tracks are played or skipped (cf. ‘attention studies’). While modern MIR data have addressed track skipping (Montecchio et al., 2020), this phenomenon is older than it may appear. When Spotify introduced the 30‑second rule for counting royalties, the inevitable gaming of this algorithm saw many tracks of 31 seconds' length. Vulfpeck's 2014 ‘Sleepify’ album is surely the most notable example, featuring 10 tracks with approximately 30 seconds of silence, matched by a campaign inviting fans to ‘listen’ to it on repeat during sleep.

Moreover, the precedents predate even streaming. Before she was the German chancellor, Angela Merkel was a young person in the former German Democratic Republic. In an interview, Merkel recounted the ‘60:40 rule’, which required discos to play more tracks from the East than the West.11 The rule did not constrain track skipping, and apparently it was common to stop disfavoured tracks early.

Even the number of streams has historical precedent. Before streaming, physical disc sales were the primary measure; earlier still, there was the number of jukebox plays; and sheet music sales antedate even that (while continuing to the modern day). How popularity changes over time and in different places can say a great deal about the music, particularly when combined with other data. Some datasets, such as the Billboard corpus, are explicitly organised in relation to popularity (Burgoyne et al., 2011).

4.9 ‘Uses’ of music: Culture/occasion—gebrauchsmusik

Music may be used to celebrate an occasion. And where the music is commissioned for that occasion, these considerations relate closely back to the context discussed in §4.1. For example, many historical records attest to the commission and first performance of works, especially where the occasion is large and public, such as a coronation.

Apart from a particular occasion, there are also generic (kinds of) occasions to consider. For example, classical music has a much higher uptake at weddings than most other occasions. Moreover, a very small number of pieces have dominated that market: notably, Pachelbel's Canon, Widor's Toccata, and Wagner's ‘Bridal Chorus’ from Lohengrin (Range, 2024).

‘Generic’ occasions may be the best explanation for the many uses of music in, for example, sports and political rallies. Sometimes the motivation for a choice of song is clear. For instance, Queen's ‘We are the Champions’ is ubiquitous at celebratory occasions. In other cases, the choice is less obvious. The White Stripes' ‘Seven Nation Army’ has become a calling card for supporters of multiple sports teams and even politicians (“oh Je‑re‑my Cor‑byn”). The explanation for this is not as straightforward as in the Queen case. Is it due to a kind of ‘tribal’ aesthetic? Could such a feature be quantified and used reliably for prediction?

Works of music can, of course, straddle both the specific and the generic. For a work such as Bach's St John Passion, we know both the specific occasion of a first performance (7 April 1724) and the more generic time of year in the liturgical calendar (Good Friday). Subsequent performances tend to take place at the same time of year, and especially at notable anniversaries (April 2024 saw many such performances).

Apart from occasions, music also may be intended for a particular audience. For example, broadside ballads were single‑sheet songs that sold for a penny a piece. The recent 100ballads research project documents surviving examples (e.g., score and image), alongside new audio recordings, text transcriptions, and thematic tagging of genre and topic labels.12

4.10 ‘Responses’ to music: Physiology—how human bodies react to music

Some uses of music lead naturally to responses. For example, film‑makers occasionally claim to make certain decisions in response to music, beyond merely using it. More direct examples of responses include neuro‑physiological data (e.g., heart rate, skin conductance, EEGs, and magnetic resonance imaging (MRI) records of brain activity) gathered from people when listening to music. We may or may not distinguish here between performers and audience, and such physiological responses are further relevant to other physical actions, as discussed in §4.4.

This topic can also intersect with voluntary, conscious responses (§4.11). For example, musical emotion research may involve both physiological cues and conscious responses, where participants state their emotional state ‘declaratively’, thus intersecting with analysis, comments, and labels. MIR datasets of that kind have included work on Bach chorales (Sears et al., 2023) and popular music (Arthur and Condit‑Schultz, 2023). Among many wider examples, see Putkinen et al. (2024) for insight into cross‑cultural associations between music and the body and Turchet et al. (2024) for emotion recognition on the part of playing musicians. Music psychology research can also extend to wider modalities, as in Bruno Mesz's attempts to explore interactions of music with visual/tactile features and even taste (Mesz et al., 2011). Smell and taste pose acute challenges for digital encoding.

Finally, physiological responses can also come from non‑human animals, and other physical responses can be measured from the earth itself. Caplan‑Auerbach et al. (2023) provide a geological report on the seismic impact of a Taylor Swift concert. Naturally, peak seismic activity corresponds to the moments of highest audience intensity, showing that even this unusual example brings new data that are clearly relevant to the musical experience.

4.11 ‘Responses’ to music: Analysis—what humans say about music

We use the term ‘analysis’ here broadly to refer to a range of textual commentaries, from more structured statements within a specified syntax to free‑text comments on sites such as YouTube and SoundCloud. Examples of structured analyses include syntaxes and datasets for harmonic analysis (Harte et al., 2005; Neuwirth et al., 2018; Tymoczko et al., 2019). There is considerable opportunity for expanding this to other musical parameters (which have seen less attention) and modalities. For example, identifying the main melody on a score (Hauptstimme) is closely related to the task of tracking where the camera's focus lies in video of an orchestral performance. Figure 2 presents one display tool called TiLiA that serves to bring these annotations together.13

Figure 2: Excerpt of a timeline analysis (TiLiA) consisting of two hierarchy timelines (‘video cuts’ and ‘Hauptstimme + form’), a beat timeline (‘Beats’), and multiple PDF timelines.
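To give a flavour of the structured end of this spectrum, the sketch below parses simple chord labels in the style of Harte et al. (2005); it is a deliberately reduced approximation covering only root, shorthand quality, and bass, not the full published grammar.

```python
# Simplified parser for chord labels of the form "root:quality(/bass)",
# e.g., "C:maj7" or "A:min/b3". This covers only a small subset of the
# Harte et al. (2005) syntax and ignores extensions and the no-chord symbol.
import re

CHORD_RE = re.compile(
    r"^(?P<root>[A-G][#b]*)"                  # root note with optional accidentals
    r"(?::(?P<quality>[^/]+))?"               # optional shorthand quality after ':'
    r"(?:/(?P<bass>[#b]?\d+|[#b]?[A-G]))?$"   # optional bass degree or note
)


def parse_chord(label: str) -> dict:
    match = CHORD_RE.match(label)
    if match is None:
        raise ValueError(f"Unrecognised chord label: {label!r}")
    return {"root": match.group("root"),
            "quality": match.group("quality") or "maj",
            "bass": match.group("bass")}


print(parse_chord("C:maj7"))    # {'root': 'C', 'quality': 'maj7', 'bass': None}
print(parse_chord("A:min/b3"))  # {'root': 'A', 'quality': 'min', 'bass': 'b3'}
```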

Analysis and annotation of this sort also includes genre statements. More thoroughly explored by the MIR community, such statements can range from the generic (‘jazz’) to something more specific. For example, Bach's 48 ‘well‑tempered’ preludes and fugues can be labelled generically as ‘classical’ as opposed to ‘jazz’ (large‑scale categories), or more narrowly as ‘Baroque’ as opposed to ‘Classical’ (specific time periods). They may also be seen as ‘solo’ or ‘keyboard solo’ works (instrumentation), and the preludes and fugues can be further separated in terms of form and process. Alternatively, they might even be reorganised in terms of sub‑styles, such as ‘toccata’, which could apply equally to a prelude or a fugue.

With free‑text comments, divisions include the presence or absence of timestamps and, separately, whether a given timestamp is meaningful. While YouTube comments are timestamped, the content rarely refers specifically to the moment in question. For an example of an MIR‑ready dataset in this regard, see Melechovsky et al.'s (2024) MidiCap project.

Writing about music in the popular press is rather under‑explored by MIR; one notable exception is Hentschel and Kreutz (2021). Writing, particularly free prose, is one of many cases where data must record the source (venue) and author (name and role). There is, for example, a clear difference between a named journalist and an anonymous online commentator. Currently, MIR tends to offer only a brief account of contributors to datasets (e.g., ‘expert’ annotator). Music psychology typically records more (anonymised) information about participants, including demographic details and level of musical experience, for instance. A recent, newsworthy example of the digital archiving of music journalism is the Internet Archive's searchable index of half a million web pages from MTV News that were otherwise thought to be lost.14 This is a reminder of the fragility of data, even in born‑digital form, and of sources of data with potential value to MIR.

Naturally, this kind of free‑text commentary can refer in complex ways to other musical data, very much including multimodal connections. Among intriguing and promising results for musical multimodality, Antović et al. (2023) have shown that listeners' “verbal reports of musical connotation” are full of creative references to moving bodies and that these share underlying embodied, image‑schematic structures.

4.12 ‘Responses’ to music: Legal—whose song is it anyway?

Much analysis (e.g., of genre and motives) focuses on questions of similarity. These same topics form the basis of many famous music‑copyright trials. The argument that one song is or is not ‘too similar’ to an existing one is familiar but has been refashioned to convince a judge or jury, rather than an academic readership. Legal data could take note of the songs and artists that feature in those trials, the outcomes (‘legal precedent’), and the substance of the argument, perhaps including not only the court transcript (text) but even the ‘performance’ of those arguments. See, for example, Bosher (2021) for a legal text on copyright in the music industry,15 and see the introduction in White (2022) for the theorist's take on some of the musical complexities. The fair use of music in AI training is a highly topical issue at the time of writing, and creative solutions have included the Nightshade tool for ‘poisoning samples’ with hidden content that acts as a kind of digital watermark.16

5 Guidelines for Best Practice in Multimodal Datasets

There is relatively little academic work addressing the topic of how to create well‑designed multimodal MIR research datasets. There exist generic guidelines (not specific to music or multimodality), including the four principles of data management stating that data should be findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al., 2016).17 Serra (2014) introduced the first set of criteria to consider when creating general MIR corpora: purpose, coverage, completeness, quality, and reusability. More recently, Christodoulou et al. (2024) expanded on these efforts to include criteria specifically aimed at the construction and evaluation of multimodal datasets: diversity and representation, data quality and consistency, annotation and ground truth, modality interplay, generalisation and robustness, usability and accessibility, real‑world impact, and legal constraints and limitations.

In this section, we propose a more structured set of criteria to evaluate multimodal music datasets. Figure 3 provides an overview. For each criterion, we provide (1) a short description; (2) an overview of the current state, highlighting particularly promising or impressive work; and (3) a view on the potential for future research. Currently, there exist no datasets which satisfy all of these proposed criteria fully. An existing dataset might contain a high number of music pieces, for example, but only a few modalities or less precise annotations, while an otherwise high‑quality dataset may be limited to license‑free music, or have a poor balance of musical genres and composers for the relevant task. Thus, the proposed criteria should be understood as guidelines for the design of future datasets on our way towards an ‘Everything Corpus’.

Figure 3: Overview of the 17 proposed criteria, organised into four broad categories, for the design and evaluation of multimodal music datasets.

5.1 Quantitative evaluation

One advantage of quantitative criteria is that they can, in general, be precisely calculated, making objective comparisons between datasets possible. This holds notwithstanding many caveats concerning possible imbalances. For example, datasets are sometimes described in terms of the number of tracks, although ‘a track’ is clearly not of fixed length. Even recognised units that are fixed, such as seconds, can vary widely in terms of information density.

5.1.1 Number of entities

A typical item in many datasets is a musical piece, or its segments of a certain length. In general, the larger the number of these entities, the better. More data allow for the training of high‑quality deep learning models and help to produce more robust statistical evaluation.

Although the MSD was presented more than a decade ago (Bertin‑Mahieux et al., 2011), it still contains the largest number of entities across multimodal datasets (1,000,000 pieces). For comparison, CocoChorales (Wu et al., 2022) contains 240,000 generated chorales, all in a single style.

Even if the MSD has an impressively large number of tracks, we remain far removed from an ‘Everything Corpus’ that might cover all musical pieces ever composed. Consider, for instance, that Spotify listed an impressive “more than 100 million tracks” on its web page in 2024 and still falls short of ‘all music’. Yet far less music data is available in other formats. The International Music Score Library Project (IMSLP), for instance, currently contains ‘only’ 792,552 scores of 241,017 pieces.18 Moreover, there are indications that the large numbers of tracks available on various streaming services increasingly include new, AI‑generated songs.

5.1.2 Numbers of modalities and media types

In general, MIR functionality is enhanced by access to a greater number of independent modalities and media, rather than ones automatically generated from another, such as synthetic audio produced from digital scores. Having more modalities can generally allow for deeper investigations into how musical elements interconnect and provide richer insights into the impact of each on various applications and tasks.

Many existing multimodal datasets contain only two or three modalities. Lakh MIDI dataset (LMD)‑aligned (Vatolkin and McKay, 2022) is notable for using six different modalities: audio tracks, MIDI scores, lyrics, playlist statistics, semantic tags, and album covers. MTD (Zalkow et al., 2020) contains audio tracks, sheet images, MIDI scores, and metadata. Table 1 and §4 clearly indicate that there are many other distinct modalities that could be included in future datasets.

5.1.3 Richness of annotations

For categorisation tasks, such as genre recognition, a high number of annotated labels is preferable. Additionally, diversifying the types of annotations and annotators is helpful (as discussed in §4.11). Hierarchical relations among sub‑genres, moods, and tags often help in this regard and can be expressed with varying degrees of richness, from a simple tree structure to a more nuanced ontology of relations.
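A minimal sketch of such a hierarchy is given below; the genre labels and parent relations are illustrative assumptions, not taken from any of the datasets discussed in this section.

```python
# Toy genre hierarchy: each genre maps to its parent (None for a root genre).
GENRE_PARENT = {
    "jazz": None,
    "bebop": "jazz",
    "cool jazz": "jazz",
    "classical": None,
    "baroque": "classical",
    "romantic": "classical",
}


def genre_ancestry(genre):
    """Return the genre and all of its ancestors, most specific first."""
    chain = []
    while genre is not None:
        chain.append(genre)
        genre = GENRE_PARENT.get(genre)
    return chain


# A track tagged 'bebop' also satisfies a query for the parent genre 'jazz'.
print(genre_ancestry("bebop"))  # ['bebop', 'jazz']
```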

Various genre annotations exist for pieces in the MSD: 13 ‘All Music Guide’ genres feature in the MSD Allmusic Genre Dataset (Schindler et al., 2012), there are 250 genres selected from the Amazon genre hierarchy in MuMu (Oramas et al., 2018), and there are 63 genres in the Dataset of Aligned Lyric Information (DALI) (Meseguer‑Brocal et al., 2020). As an example of another, more specific application, multimodal recognition of instruments and playing techniques, the Central Conservatory of Music (CCOM)‑HuQin dataset (Zhang et al., 2023) contains eight different instruments and 12 playing techniques, the largest number of annotations among ten other studied datasets with a similar focus on bowed string instruments.

Going forward, the variety of taxonomies, the number of annotations per taxonomic category, and the detail of annotation and confidence values could all be significantly increased in comparison to what is offered in existing datasets.

5.2 Qualitative evaluation

The criteria discussed below can be (partly) described by numerical values. However, their definitions are less precise, and robust assessment is generally less straightforward. Despite this difficulty, we should aim to develop more precise definitions for qualitative criteria.

5.2.1 Quality of representations

In general, it is best to minimise loss when representing audio and images. Ideally, this would entail storing lossless or otherwise high‑resolution representations of the complete entity as well as its constituent elements (such as individual instruments and/or stems), which should also be accessible directly, obviating the need for processes such as source separation.

Unfortunately, the quality of representations remains beyond the scope of nearly all available multimodal datasets. For instance, audio is not directly available in the MSD; only song previews can be downloaded. Slakh (Manilow et al., 2019) is one of only a few multimodal datasets to contain audio multitracks, but these were synthesised from MIDI files and so can be considered another representation of the same modality. The MTD dataset includes WAV files of high quality but is limited to musical themes and does not contain complete pieces. MSD‑I (Oramas et al., 2018) extends the MSD to include album covers; however, the images are generally of low resolution (200 × 200 pixels).

Considerable work remains to create a dataset with many different modalities represented in the highest possible qualities. Arguably, the most problematic issue here is that such data often cannot be shared owing to legal issues (§4.12).

5.2.2 Completeness of entities

Generally speaking, the primary entities of a dataset (often music tracks) should be as complete as possible, e.g., musical pieces in full length instead of representative segments of the chorus of songs.

As mentioned above for MSD, often only audio previews are available. This restriction also holds for MSD‑related datasets, such as LMD‑aligned or MSD‑I. A better solution is provided in DALI, where not only are full audio tracks made available but the lyrics feature “four levels of granularity: notes, words, lines and paragraphs” (Meseguer‑Brocal et al., 2020, p. 57). However, only vocal melodies are available as the third modality, without the complete scores. There is just as much potential for future work on these considerations as for the other recommendations represented here.

5.2.3 Precision of annotations

Some annotations for musical data are inherently subjective and so cannot be ‘perfectly precise’ (e.g., differing opinions regarding genre). Both despite and because of this, annotations from multiple annotators are generally desirable. Experts are typically preferred, except in cases such as psychological studies which focus on non‑specialist responses.

In most cases, annotated descriptions should be as precise as possible. For instance, information about ‘predominant’ instruments or the ‘occurrence’ of an instrument within a given time window is less precise than the exact beginning and end times of a particular instrument actively playing. Automatic annotations can sometimes be an option (e.g., if individual instrumental tracks are available, the activation of instruments in each single track can be measured automatically); however, they are not possible in all scenarios.
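
Where individual instrument stems are available, such automatic annotation can be as simple as thresholding the short‑time energy of each stem. The following sketch uses the librosa library; the stem file name, threshold, and hop size are illustrative assumptions rather than recommendations:

```python
# A minimal sketch of automatic activity annotation from a single stem: frames
# whose RMS energy exceeds a (dataset-dependent) threshold are treated as
# 'instrument active' and merged into time intervals. File name is hypothetical.
import librosa
import numpy as np

y, sr = librosa.load("stems/guitar.wav", sr=None, mono=True)
hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
active = rms > 0.01  # threshold would need tuning per dataset and instrument

times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)
intervals, start = [], None
for t, is_active in zip(times, active):
    if is_active and start is None:
        start = t
    elif not is_active and start is not None:
        intervals.append((start, t))
        start = None
if start is not None:
    intervals.append((start, float(times[-1])))

print(intervals)  # e.g., [(0.58, 12.33), (15.07, 48.91)]
```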

The Artificial Audio Multitrack (AAM) dataset (Ostermann et al., 2023) provides an example of near‑exact and fully automatic annotation of onsets, pitches, instruments, keys, tempos, chords, beats, and segment boundaries. However, as in Slakh, the audio pieces were generated directly from MIDI and so are not entirely independent. MuMu (Oramas et al., 2018) likewise features a high number of annotations per entity (447,583 Amazon customer reviews across 31,471 albums), leading to more precise ‘average’ statistics over reviews.

As elsewhere, the current state stands to be significantly improved. However, this may require an especially high degree of manual effort from what is a relatively limited pool of musical experts.

5.2.4 Interpretability of material

Making primary data and related material (e.g., extracted features) interpretable brings significant benefits, facilitating both future technical use of the data and wider understanding by non‑specialists. We include here semantic features (e.g., shares of musical instruments, clarity of vocals, and level of activation; see the supplementary material of Vatolkin and McKay, 2022), which can be used for a more comprehensive analysis and categorisation. This contrasts with the so‑called black‑box models of many deep neural networks.

Including symbolic representations of musical pieces as part of a dataset greatly facilitates interpretability, with characteristics such as ‘which instruments are playing’ being directly extractable. For audio, extracted ‘semantic’ features can help. For instance, the MSD includes some Echo Nest descriptors (e.g., ‘hotness’ and ‘danceability’). Including a clear account of how these features are extracted is obviously crucial to interpretability.
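
As an illustration of how directly such characteristics can be read from symbolic data, the following sketch uses the music21 library to list the instruments declared in a score; the file name is an illustrative assumption:

```python
# A minimal sketch: reading 'which instruments are playing' directly from a
# symbolic (MusicXML) score with music21. The score file name is hypothetical.
from music21 import converter

score = converter.parse("example_score.musicxml")
for part in score.parts:
    instrument = part.getInstrument(returnDefault=True)
    n_notes = len(part.recurse().notes)
    print(f"{part.partName or 'Unnamed part'}: "
          f"{instrument.instrumentName or 'unknown instrument'}, {n_notes} notes")
```

Comparable one‑line answers are rarely available from audio alone, which is precisely why documenting the extraction pipeline for audio features matters so much.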

The requirements for designing interpretable multimodal datasets strongly depend on the modalities and media under investigation. While modalities based on symbolic data, such as digital scores and text, can be efficiently used to extract many interpretable characteristics, this is generally more difficult for audio and video sources. For video, promising initiatives include the Video Gestures Toolbox (Laczkó and Jensenius, 2021).

5.3 Ethical I: Diversity

With an eye towards an MIR field that aims to represent all music, datasets should be as diverse as possible. Ethical issues are important here, as are practical concerns, as ever more data from different composers, interpreters, listeners, and experts contribute to the growing size and scope of available datasets.

5.3.1 Cultural and temporal balance

Musical data should be collected from diverse cultures, continents, genres, and time periods. Moreover, the distributions of these should be as balanced as possible.

The diversity and balance of existing MIR datasets fall far short of this ‘gold standard’. MSD, the largest dataset, acknowledges its limitation in this regard: “Diversity is another issue: there is little or no world, ethnic, and classical music” (Bertin‑Mahieux et al., 2011, p. 594). DALI likewise notes that: “Most of the songs are from popular genres like Pop or Rock, from the 2000s and English is the most predominant language” (Meseguer‑Brocal et al., 2020, p. 58). By contrast, several other datasets are focused on classical music, sometimes because the recordings or scores can be freely distributed, but other limitations concerning balance exist. For instance, 559 of the 2,067 themes in MTD are composed by Beethoven, and 254 of the 880 scores in the Multimodal Music Collection for Automatic Transcription (MUSCAT) (Galán‑Cuenca et al., 2024) are by Bach. The OpenScore corpora have sought to rebalance this but remain limited in scope relative to datasets such as MSD and are not multimodal (Gotham and Jonas, 2022; Gotham et al., 2023).

In the best case, an ‘Everything Corpus’ should be not only large but ‘balanced’ and ‘representative’ as well. The creation of such a dataset is extremely challenging.[19] For example, prioritising the balance of ‘popular genres’ often leads to an imbalance of instruments (e.g., popular genres often use drum kits). Likewise, balancing across the representation of instruments will likely lead to genre imbalance.
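
One simple way to make such imbalances measurable rather than anecdotal is to report the normalised entropy of the label distribution along each axis of interest (composer, genre, period, culture, instrument): a value of 1 indicates a perfectly even distribution, while values near 0 indicate heavy concentration on a few categories. A minimal sketch, with invented counts, follows:

```python
# A minimal sketch of a balance measure: normalised Shannon entropy of a
# categorical distribution (1.0 = perfectly balanced; values near 0 = concentrated).
# The counts below are invented for illustration only.
import math
from collections import Counter

def normalised_entropy(counts: Counter) -> float:
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

composer_counts = Counter({"Composer A": 600, "Composer B": 220,
                           "Composer C": 90, "Others": 1150})
print(f"Composer balance: {normalised_entropy(composer_counts):.2f}")
```

Reporting one such figure per axis in a dataset's documentation would make trade‑offs of the kind described above (instruments versus genres) explicit and comparable across datasets.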

5.3.2 Popularity

It is understandable that datasets tend to focus on popular music. Popular music is, by definition, the best known, and it is often the only music for which data of the highest quality exist. Unfortunately, this focus naturally narrows the possible scope of research.

As discussed above, the largest datasets are generally focused on popular music, complemented to some extent by a series of smaller, specific datasets focused on classical music. There have been significant efforts to diversify the provision of MIR datasets, but this has been a relatively marginal part of MIR to date.[20] Another problem lies in how popularity should be defined, as it is usually tied to a particular epoch or culture, such as Western contemporary music.

While an explicit focus on ‘less popular’ music may not be the best way forward, some community‑wide effort to balance the overall provision of datasets with respect to different genres, time periods, cultures, and more is clearly needed (cf. §5.3.1).

5.3.3 Diversity of versions

An individual entity of the same modality can be included through several representations, e.g., live recordings of the same song from different concerts, different interpretations of the same classical piece, or concert photos taken from different angles. This practice is generally to be recommended, where possible, in the interests of creating more stable machine learning models.

CCOM‑HuQin provides an interesting example here, as it uses three different video cameras to record distinct representations of the same modality. DALI provides different versions of songs (studio, radio, edited, live, or remixed), while some datasets such as MTD issue not only different versions of symbolic representations (i.e., original files from the Electronic Dictionary of Musical Themes, manually corrected versions, and scores) but different formats (e.g., MusicXML and PDF) as well. Other examples include the 9 different recorded performances featured in SWD (Weiß et al., 2021), the 16 in WRD (Weiß et al., 2023), and the 11 in the Beethoven Piano Sonata Dataset (BPSD) (Zeitler et al., 2024).

Datasets with multiple sources for the same musical entity (cf. §2) are certainly promising. Prospective benefits include significantly increasing the robustness of architectures that work with many parameters, such as deep neural networks. However, creating new recordings, particularly to address the gap beyond Western classical music, requires extremely high manual effort and resources.

5.3.4 Focus on specificity

Some studies have a specific focus (e.g., music of a particular age, country, or composer). In such studies, there is a case for limiting the diversity of data used from outside of this immediate focus.

We return here once again to highlight the case of CCOM‑HuQin. This dataset provides a narrow and specific multimodal view of Chinese bowed string instruments and their playing techniques.

Even if acquiring ‘more data’ and ‘more diversity’ remain key goals for the field overall, it is important to continue conducting studies which are limited in scope, partly as steps towards this high‑level goal.

5.4 Ethical II: Reproducibility and legal and environmental statements

To enhance reproducibility in future research, all data, code, and clearly documented details of conducted experiments should be shared publicly and transparently.

5.4.1 Availability of source data

In the best case, all original source data should be publicly shared. Complementing this, it should also be possible to download smaller parts of the dataset, especially in the case of extremely large datasets. Moreover, scripts to retrieve a small sample for preliminary experiments are helpful.
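
A sketch of what such a sampling script might look like: given a list of track identifiers (here a hypothetical text file), a stable hash decides membership in the sample, so every researcher who runs the script obtains the same small subset without any shared random seed:

```python
# A minimal sketch of reproducible sub-sampling: a stable hash of each track
# identifier selects roughly 5% of the corpus identically on every machine.
# The identifier file 'track_ids.txt' is hypothetical.
import hashlib

def in_sample(track_id: str, fraction: float = 0.05) -> bool:
    digest = hashlib.sha256(track_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < fraction

with open("track_ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]
sample = [i for i in ids if in_sample(i)]
print(f"Selected {len(sample)} of {len(ids)} tracks for preliminary experiments.")
```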

A positive example is MTD (Zalkow et al., 2020), where all raw data are available on the corresponding website. More problematic, however, are datasets based on MSD or DALI, where audio recordings must be downloaded as previews or from YouTube. There is no guarantee that these platforms will continue to make these data available at all, or that the URLs will remain unchanged in the future.

For commercial music data, availability will remain a problem. One possible, but expensive, solution is to offer services which retain the original modalities on the server and return derived data on demand (e.g., spectrograms for audio signals or word statistics for lyrics). Another partial solution is the sharing of extracted features as discussed below.
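
Returning to the first of these solutions, the following is a rough sketch of such a service; the endpoint, storage layout, and parameters are illustrative assumptions, not a description of any existing system:

```python
# A minimal sketch of a derived-data-on-demand service: the (possibly
# copyrighted) audio stays on the server, and only a mel spectrogram is
# returned to the client. Paths, endpoint, and parameters are hypothetical.
import io
import os

import librosa
import numpy as np
from flask import Flask, abort, send_file

app = Flask(__name__)
AUDIO_DIR = "protected_audio"  # server-side store, never served directly

@app.route("/spectrogram/<track_id>")
def spectrogram(track_id: str):
    if not track_id.replace("_", "").isalnum():  # avoid path traversal
        abort(400)
    path = os.path.join(AUDIO_DIR, f"{track_id}.wav")
    if not os.path.isfile(path):
        abort(404)
    y, sr = librosa.load(path, sr=22050, mono=True)
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128), ref=np.max
    )
    buf = io.BytesIO()
    np.save(buf, mel_db)
    buf.seek(0)
    return send_file(buf, mimetype="application/octet-stream",
                     download_name=f"{track_id}_mel.npy")

if __name__ == "__main__":
    app.run()
```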

5.4.2 Availability of processed data

Extracted and (pre)processed features can be highly useful, saving computational energy and extraction times while allowing for computation on hardware with limited capabilities. Further results of experiments (e.g., trained classification models and runtime logs) can be shared as well.

Features are often publicly shared; however, their number is usually limited. Furthermore, in almost all cases either original modalities without extracted features are provided (e.g., DALI and MTD), or only a set of selected features is made available (e.g., 55 Echo Nest tags for MSD, 2,803 features for LMD‑aligned, and audio spectrograms for the Multimodal Sheet Music Dataset (MSMD)). Augmented entities can also be shared to support more robust training of machine learning models. For instance, data augmentation for MSMD is described in Dorfer et al. (2018), and the corresponding scripts are available on the project's GitHub page.

We recommend sharing not only the original modality sources but derived data as well. These data should be shared either directly or via scripts which are well documented, easy to use, and reliably return the same data reported in published studies.
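
One lightweight way to ensure that such scripts ‘reliably return the same data’ is to publish a manifest of checksums alongside them, so that users can verify locally regenerated features against the versions used in the published study. A sketch, with hypothetical file and manifest names:

```python
# A minimal sketch of verifying regenerated derived data against a published
# manifest of SHA-256 checksums. 'features_manifest.json' maps file names to
# expected hashes; both names are hypothetical.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = json.loads(Path("features_manifest.json").read_text())
for name, expected in manifest.items():
    actual = sha256_of(Path("features") / name)
    print(f"{'OK' if actual == expected else 'MISMATCH'}: {name}")
```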

5.4.3 Citability and availability

Reproducibility is aided by sharing datasets on long‑standing and professionally organised public repositories (such as Zenodo), complete with static digital object identifiers (DOIs) assigned to each version. Additionally, published studies in peer‑reviewed and open‑access venues which use a given dataset can serve as helpful baselines for comparing and improving upon algorithms for various tasks.

Fortunately, almost all relevant multimodal datasets discussed here are shared on Zenodo, often with supplementary code on GitHub. Not surprisingly, given its size and longevity, MSD has been much cited.[21]

Dataset creators should take responsibility for sharing their data on websites or services that have a high chance of remaining in long‑term service and that provide a strategy for unique, persistent references.

5.4.4 Statistical evaluation

Although the general availability of published studies is a prerequisite for reproducibility and future possible improvements, the quality of these studies is also important. The demands on such studies include the careful design of experiments using statistical hypotheses (e.g., will the combination of several modalities increase the quality of classification models?), reporting of several evaluation criteria (e.g., classification quality, robustness, interpretability of models, and runtimes), and the sharing of exact lists with training, validation, and test instances for machine learning applications.

Many multimodal datasets (including MSD, DALI, and MSMD) propose exact splits of training, validation, and test instances. An example of how rigorous statistical analyses for multimodal music genre classification can be conducted is provided in Wilkes et al. (2021); however, this study is based on an in‑house dataset, so the direct reproduction of results is not possible. Statistical testing is still seldom applied to multimodal music datasets. It was, for instance, conducted for the Symbolic Lyrical Audio Cultural (SLAC) dataset (McKay et al., 2010) and LMD‑aligned (Vatolkin and McKay, 2022), as well as previously for SLAC alone (McKay, 2010). However, both datasets publicly share only the extracted features and not all original data.
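
As a sketch of how such a statistical test might look in practice, assume that per‑fold accuracies for a unimodal and a multimodal classifier are already available (the numbers below are invented); a paired Wilcoxon signed‑rank test then assesses whether the multimodal combination helps:

```python
# A minimal sketch of statistical testing for the hypothesis 'combining
# modalities improves classification quality': per-fold accuracies (invented
# here) are compared with a one-sided, paired Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

audio_only        = [0.71, 0.68, 0.73, 0.70, 0.69, 0.72, 0.70, 0.74, 0.68, 0.71]
audio_plus_lyrics = [0.74, 0.70, 0.75, 0.73, 0.71, 0.74, 0.72, 0.76, 0.70, 0.73]

stat, p_value = wilcoxon(audio_plus_lyrics, audio_only, alternative="greater")
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
# Publishing the exact fold assignments (train/validation/test IDs) alongside
# such results is what makes the comparison reproducible by others.
```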

The combination of freely available source modalities with exact reports of all study and parameter settings, together with statistical evaluation, still remains a large research gap in multimodal MIR research.

5.4.5 Potential for future extensions

It should be made possible for researchers to expand, correct, and otherwise improve existing datasets with minimal effort. This can be managed using artificially generated music and/or the automatic creation of annotations. However, these approaches bring other drawbacks, such as interdependent modalities or music which represents ‘real‑world’ scenarios to a lesser degree. Clearly identified maintainers and publicly available guidelines for recommended usage and contribution are highly recommended.

While extensions are possible for all datasets discussed so far, only a few appear to be presently under regular development. For instance, the emergence of the second DALI version (Meseguer‑Brocal et al., 2020) was accompanied by documentation explicitly describing how it was extended in the section “Improving DALI version”. Multimodal extensions to MSD, such as MSD‑I and MuMu (Oramas et al., 2018), provide other notable examples.

The design of future multimodal datasets should integrate possibilities for future extensions at very early stages. The accepted method for extension needs to be clearly documented and, if possible, reflected in programming interfaces that specify the data types or tasks which may be supported at a later date.
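
A sketch of what such a programming interface might look like (all names and fields below are illustrative assumptions): a small, explicitly versioned record type for contributed items makes clear what any future extension must provide.

```python
# A minimal sketch of an extension interface: every contributed item must
# declare the same minimal metadata, keeping future additions machine-checkable.
# All field names and values are illustrative assumptions.
from dataclasses import asdict, dataclass
import json

@dataclass(frozen=True)
class ModalityEntry:
    entity_id: str            # the musical entity (e.g., a track) being described
    modality: str             # e.g., "audio", "score", "lyrics", "video"
    media_format: str         # e.g., "flac", "musicxml", "mp4"
    source: str               # provenance: URL, DOI, or contributor identifier
    license: str              # licence under which the item may be redistributed
    schema_version: str = "1.0"

entry = ModalityEntry(
    entity_id="track_000123",
    modality="score",
    media_format="musicxml",
    source="https://example.org/contributions/track_000123",
    license="CC-BY-4.0",
)
print(json.dumps(asdict(entry), indent=2))
```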

5.4.6 Legal and environmental statements

Legal and environmental statements may include notes related to copyright issues, reports of energy and hardware resources, and human efforts required for the creation and processing of the dataset. Moreover, a wide range of diverse listeners, from experts to laypersons, or people with different cultural backgrounds and ages, can and should be involved in the annotation and data revision process.

Participant studies in music psychology (and elsewhere) routinely have participants complete consent forms and report anonymised data about these individuals (e.g., age and demographics). With respect to multimodal music datasets in particular, information of this kind is seldom made available. One recent exception is the Gunnlod dataset (Johansson and Lindgren, 2023), based on Lakh and developed in a student project, which further identified seven key requirements for trustworthy AI in multimodal music datasets: human agency and oversight; technical robustness and safety; privacy and data governance; transparency; diversity, non‑discrimination, and fairness; societal and environmental well‑being; and accountability. However, the modalities in this dataset are not independent, as audio pieces were generated from MIDI files. In another study with listeners on genre recognition using the MSD‑I dataset, it was reported that “all procedures . . . were conducted in accordance with the 1964 Helsinki declaration and its later amendments or comparable ethical standards” (Oramas et al., 2018, p. 18). In particular, for datasets where data need to be downloaded from external sources, such as YouTube for DALI, it may be unclear whether the possibly copyrighted music can be legally used.

Copyright restrictions may remain challenging in the future and lead to legally unclear situations regarding the use of data. Some possible solutions are to focus on copyright‑free data, produce new data with manual efforts, or apply computer‑based generative methods, either rule‑based (Cope, 2001) or otherwise (Miranda, 2021). The ethical and environmental impact of the dataset's creation and its processing, as well as of related studies based on those data, should be described.

6 Conclusions and Vision

This article laid out a conceptual framework for how researchers interested in MIR can better understand the multitude of modalities that are or might be relevant to musical data and provided concrete criteria and guidelines, based on observed practices in previous research, for how we believe multimodal datasets in MIR should be curated in the future. In doing so, we have aimed to provide a solid basis for the eventual curation of an ‘Everything Corpus’. While our efforts here are far from the last word on this subject, as we do not purport to have fully described such a corpus, we offer this framework and these guidelines instead as motivation and inspiration for future corpus‑building efforts that might contribute to this goal.

Acknowledgements and Author Contribution

We would like to thank the editors and anonymous reviewers of TISMIR for the thorough and insightful feedback that has greatly improved this work. We would also like to thank a number of anonymous colleagues who have likewise helped in the development of this work by reading drafts, providing comments at presentations, and discussing our thoughts along the way. The contribution of Igor Vatolkin was supported by an Alexander von Humboldt professorship in AI held by Holger Hoos. Mark Gotham developed the central idea of an ‘Everything Corpus’ and the framework presented in §4, Brian Bemman carried out the literature review in §3, and Igor Vatolkin provided the terminology in §2 and developed the guidelines in §5. All authors contributed equally to the writing of the article.

Competing Interests

Vatolkin and Gotham are guest editors for this special issue. Neither author had any editorial oversight or decision‑making role in connection with this article.

Notes

[1] Among the many proposed ontologies, perhaps the most relevant to this effort is Christodoulou et al. (2024).

[2] A signal is “a function that conveys information about the state or behavior of a physical system” (Müller, 2021, p. 57). While audio and EEG signals map one variable (timepoint) to the function value, a red, green, and blue (RGB) video signal is composed of three colour functions, which take two spatial coordinates and timepoint as input.

[3] For more on the corporate background of modern music tech, see Water & Music's community‑curated data and analysis: https://bit.ly/3WkGNN7; Accessed: 30‑01‑2025.

[4] For a counter example, see the nascent ‘AI Kobayashi’ on Bach sources.

[5] Beethovens Werkstatt; Accessed: 30‑01‑2025. Beethoven‑Haus Bonn, Detmold/Paderborn.

[6] https://chopinonline.ac.uk/ocve/; Accessed: 30‑01‑2025.

[7] Recent MIR literature on music and gesture includes Essid et al. (2012) and Kritsis et al. (2021).

[8] Around 20–30 is typical, depending on the repertoire. Thanks to various producers and engineers for informal correspondence on this topic, including Myles Eastwood.

[9] https://setlist.fm/; Accessed: 30‑01‑2025. See https://github.com/programming-concert-programming/setlists for computational methods of engaging with that resource (publication forthcoming).

[10] https://www.whosampled.com/; Accessed: 30‑01‑2025.

[11] Merkel, 2022, dir. Eva Weber.

[12] https://www.100ballads.org/; Accessed: 30‑01‑2025.

[13] An initial presentation of TiLiA is provided in Martins and Gotham (2023), while the specific example shown in Figure 2 can be found at https://tilia-app.com/viewer/65, and the Hauptstimme dataset is publicly visible at https://github.com/MarkGotham/Hauptstimme; Accessed: 30‑01‑2025.

[14] https://web.archive.org/mtv.com/search/mtv; Accessed: 30‑01‑2025.

[15] ‘Whose Song Is It Anyway?' is the name of (and a nod to) Bosher's podcast.

[16] https://nightshade.cs.uchicago.edu/whatis.html; Accessed: 30‑01‑2025.

[17] Gotham (2021) discusses how these FAIR principles can apply in practice to different musical user groups. How ‘findable’ are our datasets to musicians for instance? How interoperable are our converters?

[18] https://imslp.org; Accessed: 30‑01‑2025.

[19] A first possible step could be to build upon already available datasets; see, e.g., https://ismir.net/resources/datasets/; Accessed 30‑01‑2025.

[20] Notable here is the CompMusic project (Xavier Serra P.I., ERC funded, 2011–2017).

[21] In all, 1,892 times according to https://scholar.google.com; Accessed: 30‑01‑2025.

DOI: https://doi.org/10.5334/tismir.228 | Journal eISSN: 2514-3298
Submitted on: Sep 30, 2024
Accepted on: Feb 25, 2025
Published on: May 5, 2025
Published by: Ubiquity Press

© 2025 Mark Gotham, Brian Bemman, Igor Vatolkin, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.