1 Introduction
Music performance, particularly Western classical music performance, often involves ‘interpreting’ a written musical score. This score can be seen as a set of instructions for the performer specifying, for example, what notes should be played when. Were these instructions rendered literally by a computer, the result would typically sound unmusical; rather, the performer is expected to modify certain musical parameters as available on their instrument in order to highlight aspects of the musical structure (Doğantan‑Dack, 2021; Rink, 2015; Spiro et al., 2010), convey emotions or character (Juslin, 2013; Meissner et al., 2021), or to demonstrate particular playing styles (Fabian and Schubert, 2009; Liebman et al., 2012).
Key musical parameters that a player may manipulate in expressive performance include timing, timbre, pitch, and dynamics. Each parameter can be expressively adjusted in complex ways both individually and in combination; a performer might enhance emotional intensity by subtly manipulating note duration to speed up or slow the tempo (‘rubato’), adjusting how loud or soft notes are (‘dynamics’), increasing the width of periodic pitch fluctuation (‘vibrato’), or some combination of these (Dromey et al., 2015; Palmer, 1997).
The expressive parameters an expert player may manipulate are contingent upon the instrument’s capacities. For example, almost all instruments can introduce timing expressivity through rubato. One common expressive rubato pattern includes slowing towards the ends of phrases in what is called ‘phrase‑final lengthening’ (Benadon and Zanette, 2015; Demos et al., 2016; Drake, 1993; Goodchild et al., 2016; Rector, 2021). This is sometimes coupled with a progressive acceleration at the beginning of the phrase to produce an overall ‘phrase‑arching’ effect (Lerdahl and Jackendoff, 1977; 1983; Todd, 1985; Widmer, 1996). Other techniques are limited to particular instrument families. For example, unfretted string and wind instrument players introduce pitch expressivity through vibrato, whereas this is unavailable within standard techniques for most fretted string and keyboard instruments (McDermott et al., 2013).
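To make these timing phenomena concrete, the following minimal Python sketch (with invented beat onset times, not drawn from any dataset discussed here) derives a local tempo curve from inter‑beat intervals and flags phrase‑final lengthening using an illustrative threshold.

```python
import numpy as np

# Hypothetical beat onset times (seconds) for one four-bar phrase in 4/4.
# The widening gaps towards the end mimic phrase-final lengthening.
beat_onsets = np.array([0.00, 0.50, 1.00, 1.50,
                        2.00, 2.50, 3.01, 3.53,
                        4.06, 4.61, 5.18, 5.78,
                        6.42, 7.12, 7.90, 8.78])

# Inter-beat intervals (seconds per beat) and the corresponding local tempo.
ibis = np.diff(beat_onsets)
local_bpm = 60.0 / ibis

# Compare the mean tempo of the final bar with the rest of the phrase.
final_bar = local_bpm[-4:]
earlier = local_bpm[:-4]
slowdown = 1.0 - final_bar.mean() / earlier.mean()

print(f"Mean tempo (earlier beats): {earlier.mean():.1f} BPM")
print(f"Mean tempo (final bar):     {final_bar.mean():.1f} BPM")
if slowdown > 0.05:  # illustrative 5% threshold
    print(f"Phrase-final lengthening detected ({slowdown:.0%} slower).")
```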
Expressive parameters are often interrelated and manipulated simultaneously. For one example, increases or decreases in tempo tend to positively correlate with respective increases and decreases of dynamics within musical phrases (Cook, 2014; Repp, 1996) in a manner that has been interpreted as reflecting intuitive physics, or ‘energetics’ (see Larson, 2012; Rothfarb, 2002; Thoresen, 2022). For another, while timbre can inherently vary as a function of pitch or dynamics on certain instruments (McAdams, 2013), players can also control it, jointly adjusting this parameter alongside timing and dynamics to create rich musical metaphors and evoke emotions such as happiness or sadness (see Davies, 2011).
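The reported coupling of tempo and dynamics can be examined directly once beat‑level tempo and loudness series are available. The sketch below, assuming SciPy is installed and using invented values, simply computes their Pearson correlation; real analyses would use measured loudness curves and control for phrase structure.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical beat-level series for one phrase: local tempo (BPM) and
# loudness (dB relative to the phrase mean). Values are illustrative only.
tempo_bpm = np.array([108, 112, 116, 120, 122, 121, 117, 111, 104, 98])
loudness_db = np.array([-3.0, -1.8, -0.6, 0.4, 1.1, 1.3, 0.5, -0.9, -2.2, -3.4])

r, p = pearsonr(tempo_bpm, loudness_db)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A positive r is consistent with the coupling of tempo and dynamics
# within phrases described above; negative or near-zero values are not.
```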
The datasets used to analyse players’ adjustments of these parameters in their musical interpretations have most often been unimodal, consisting solely of performance audio or audio‑derived symbolic transcriptions (e.g. Musical Instrument Digital Interface [MIDI]). This is in part because of the conventional assumption that music is an exclusively aural communicative medium. Most research using these unimodal datasets has focused on the score and the structure it implies as a primary factor in determining an interpretation (Cook, 2014; Crispin and Östersjö, 2017; Doğantan‑Dack, 2021; Rink, 2015; Spiro et al., 2010). MIDI, the other data type often featured in unimodal datasets, has also been used in research including the identification of heuristics for automatically distinguishing expressive from non‑expressive MIDI transcriptions of performances (Lee et al., 2025). These data have additionally proven effective for tasks including automatic performer identification and classification of individual expressive approaches (e.g. Rafee et al., 2021; Spiro et al., 2016; Stamatatos and Widmer, 2005; Zhang et al., 2022).
While audio recordings comprehensively capture the aural expressive parameters available to performers, and audio‑derived symbolic transcriptions capture many of them, these data types ultimately provide limited perspectives on expressive performances (Maestre et al., 2017). In particular, studying audio or audio‑derived data in isolation makes it difficult both to unpick the myriad factors underlying performers’ interpretative choices and to understand how the performer’s interpretation and the performance context combine to shape listener responses.
In contrast, multimodal datasets, which contain ‘diverse data types that offer complementary insights for a specific music‑processing task’ (Christodoulou et al., 2024, p. 37), allow more nuanced understanding of musicians’ manipulation of these parameters. In addition to furthering empirical performance studies, such datasets can also support machine‑learning applications – for example, integrating audio and video data to enhance instrument identification in ensemble contexts (Duan et al., 2019). For a history of multimodal MIR research, see Gotham and colleagues (2025).
These research directions illustrate the growing interest in and need for multimodal datasets of expressive musical interpretations that support more holistic study of performance and performer. Various such datasets have been produced in recent years, but there is still substantial scope for improvement both in terms of the accessibility of these datasets and the types of material they provide.
The present article reviews this domain of multimodal datasets of musical interpretations, focusing on those that allow comparative analysis of experts’ interpretations of a score and that are accessible to the research community. We first propose a taxonomy of key modalities relevant to expressive musical performances that can be studied using multimodal datasets (Section 2). We next review existing datasets and the different modalities they support (Section 3). We then highlight underrepresented modalities that could be covered by future datasets and make practical suggestions for their inclusion (Section 4). We conclude by discussing key challenges facing those compiling and releasing multimodal datasets for interdisciplinary study and with broad recommendations for applications of future multimodal expressive music performance datasets (Sections 5–6).
2 Taxonomy of Modalities Relevant to Expressive Musical Interpretations
Research into musicians’ expressive interpretations of scores involves consideration of complex interacting factors. This section proposes a taxonomy of modalities inspired by Gotham and colleagues (2025), albeit refocused on the specific use case of expressive interpretations of musical scores in performance. The base unit for our taxonomy is the ‘modality’, defined as a facet of the performance or its context through which the performance can be studied (e.g. ‘venue acoustics’, ‘venue visuals’, ‘performer movements’). We organise these modalities into general topics (e.g. ‘venue’, ‘performer’), which we then connect to three temporal phases during which data supporting them may be collected: ‘before’, ‘during’, and ‘after’ the performance (Table 1). Note that our taxonomy is not a strict hierarchy; in particular, modalities and topics can belong to multiple phases (e.g. one can measure a baseline of venue acoustics before the performance, but one can also measure how venue acoustics change during the performance).
Table 1
Proposed taxonomy of modalities relevant to expressive interpretations of musical scores.
| Data‑collection phase | Topic | Modality | Example(s) |
|---|---|---|---|
| Before the performance | Score | Engraved score | Western staff notation, printed |
| | | Symbolic score | MusicXML, MIDI |
| | | Score‑derived data | Music‑theoretic annotations/analyses |
| | Instrument | Instrument characteristics | Type, model, mechanical properties, temperament |
| | Venue | Venue visuals | Room lighting, capacity, aesthetics |
| | | Venue configuration | Layout, performer location |
| | | Venue acoustics | Size, shape, reverberance, acoustic anomalies |
| | Performer | Performer biographical data | Age, gender, studies, pedagogical background, expertise |
| | | Performer physical attributes | Anthropometric measurements, state of physical fitness |
| | | Performer psychology | Personality |
| | | Interpretative set | Expressive intention or ideal |
| | Listener | Listener biographical data | Personal, educational, professional/expertise |
| During the performance | Instrument | Instrument state | Tuning, responsiveness, mechanical wear |
| | Venue | Venue visuals | Room lighting, stage effects or decorations |
| | | Venue configuration | Configuration of player(s) and instrument(s), presence and placement of intrusive recording equipment |
| | | Venue acoustics | Humidity or temperature shifts |
| | Listener | Listener configuration | Presence, location |
| | | Listener physiological state | Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking |
| | Performer | Performer movements | Performer gestures, performer–performer interaction, performer–audience interaction (e.g. video, motion capture [MoCap]) |
| | | Performer physiological state | Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking |
| | Recording process | Recording setup | Type, settings, and positioning of recording equipment |
| | | Recording post‑production | Cuts, splices, equalisation, reverberation adjustment, file compression |
| | Performance sound | Audio | Mixed recording or master tracks |
| | | Ambient noise | External noise bleed, HVAC system sounds, audience noise |
| | | Audio‑derived data | Audio‑derived representations (e.g. spectrograms, loudness curves, audio–score alignment) |
| After the performance | Performer | Performer assessment of the interpretation | Reflection on the extent to which interpretational intent was carried out, post‑performance/ad hoc justification |
| | | Performer evaluation of the experience | (Dis)comfort during performance, attention/distraction |
| | Listener | Listener evaluation of the performance | Physiological reactions, aesthetic preference, emotional response, attention, physical (dis)comfort during performance |
| | | Listener evaluation of the performer | Stage presence, movements, facial expressions |
This taxonomy forms the basis for the subsequent systematic review of existing multimodal expressive performance datasets and the modalities that they support (see Section 3), along with discussions of practical uses of this taxonomy for future datasets and analyses (see Sections 4–6).
2.1 Before the performance
We identify five primary topics for which data can be collected ‘before the performance’: the written piece of music (score), the instrument, the venue, and the backgrounds of the performer and of any listener. In many cases, data supporting study of the modalities corresponding to these topics could alternatively be captured following the performance (for the primary exception, see the ‘interpretative set’ in Section 2.1.4).
2.1.1 Score
The score is a representation of the musical work being interpreted. It typically includes such indications as pitch, rhythm, meter, articulation, dynamics, phrasing, and other performance directions. Scores may exist in various editions with different historical sources or editorial markings, which reflect decisions made by editors or composers. These can shape players’ expressive choices (Broude, 2012). In studies of musical interpretations, scores serve as a reference framework or ground truth for analysing deviations in expressive musical parameters and exploring how performers may interpret notation under different stylistic, historical, or situational contexts. Analyses derived from the score may also support examining the relationship between performers’ interpretations and aspects of the written representation.
It is useful to differentiate three main score modalities, i.e. ‘engraved scores’, ‘symbolic scores’, and ‘score‑derived data’, where:
Engraved scores are traditionally written in staff notation and can be read by analysts and performers or realised in performance. While not usually machine‑readable, they are useful for traditional manual analyses.
Symbolic scores are typically derived from engraved scores. As they are machine‑readable, they support computational analysis, machine learning, and alignment with other data types. Formats such as MusicXML, Music Encoding Initiative (MEI), and Humdrum preserve a range of symbolic information, including pitch, rhythm, meter, articulations, and score structure (e.g. Riley et al., 2024; Weiß et al., 2023). These formats can also retain detailed notational semantics, making them suitable for tasks such as audio–score alignment, automated analysis, feature extraction, and score‑informed audio processing. MIDI is another common machine‑readable encoding that can be derived from either engraved or symbolic scores and provides a lower‑level symbolic representation focused on note events (e.g. pitch, onset, duration, velocity) (e.g. Cancino‑Chacón et al., 2020; Foscarin et al., 2020). While useful for timing and expressive control data, MIDI captures few of the notational features present in scores and preserves neither visual formatting nor editorial information (the short code sketch following these three descriptions illustrates the contrast).
Score‑derived data can be produced by either expert human annotators or automated analyses. These typically capture music‑theoretic concepts – for example, harmonic analysis, phrase boundaries and hierarchies, or formal structure. Such annotations are often used to support comparative analyses that link performers’ tendencies or interpretative decisions with aspects of the score (e.g. Benadon and Zanette, 2015; Cook, 2014; Dodson, 2009; Rector, 2021).
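As a brief illustration of these three kinds of score data in practice, the sketch below (assuming the third‑party Python libraries music21 and pretty_midi, with placeholder file paths) parses a MusicXML score, reads a MIDI file’s note events, and generates simple score‑derived annotations (a key estimate and a bar‑by‑bar chordal reduction). It sketches common tooling choices rather than a prescribed workflow, and automatically generated annotations remain a complement to, not a substitute for, expert analysis.

```python
from music21 import converter   # notation-aware parsing (MusicXML and related formats)
import pretty_midi              # low-level note-event representation

# --- Symbolic score: notation-aware view (placeholder path) ---
score = converter.parse("score.musicxml")
for n in list(score.flatten().getElementsByClass("Note"))[:5]:
    # Pitch spelling and notated duration (in quarter lengths) survive here.
    print(float(n.offset), n.pitch.nameWithOctave, n.quarterLength)

# --- Performance MIDI: note events only (placeholder path) ---
perf = pretty_midi.PrettyMIDI("performance.mid")
for note in perf.instruments[0].notes[:5]:
    # Only onset/offset times (seconds), MIDI pitch number, and velocity remain;
    # enharmonic spelling, layout, and editorial markings are lost.
    print(round(note.start, 3), round(note.end, 3), note.pitch, note.velocity)

# --- Score-derived data: automatically generated starting points ---
print("Estimated key:", score.analyze("key"))   # global key estimate
chordified = score.chordify()                   # bar-by-bar harmonic skeleton
for measure in list(chordified.getElementsByClass("Measure"))[:4]:
    labels = [c.pitchedCommonName for c in measure.getElementsByClass("Chord")]
    print(f"Measure {measure.number}: {labels}")
```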
2.1.2 Instrument
Instruments play an essential mediating role between the performer and the realised sound, as their characteristics can influence players’ technique and performance.
Instrument characteristics include physical and mechanical properties of any instrument used in the performance. The instrument’s type, model, size, layout, mechanical properties, and its general affordances and constraints give visual, aural, and tactile feedback to the musician. This provides a baseline for the player’s habitual interaction with the instrument and can support their development of expressive ideas (Campbell, 2014; Doğantan‑Dack, 2015; Nijs et al., 2013).
2.1.3 Venue
The venue provides the physical context in which a player realises their interpretation. Performance venues can differ substantially from the musician’s usual rehearsal spaces, prompting expressive adjustments during pre‑performance rehearsals.
Venue visuals refer to the look and atmospheric qualities of a performance space, be it architecture, lighting, size, or decor. These elements can influence players’ manipulation of expressive parameters and shape listeners’ emotional engagement. For example, room capacity, visual appeal, or even socio‑cultural significance may affect how restricted performers feel when expressing themselves or may prompt interpretation adjustments to reflect perceived ambiance or expectations (Armstrong, 2020; Picaud, 2022).
Venue configuration refers to the spatial arrangement of performers and instruments. The normal stage layout and seating orientation of the performance venue can affect sound projection, performer interaction, and comfort, thereby influencing expressive choices. For instance, instrument positioning relative to walls or other performers may alter the level of engagement players feel with the interpretation they create (Armstrong, 2020).
Venue acoustics refer to the typical sound‑related properties of the performance space, such as reverberation, absorption, and diffusion. These are determined by factors such as room size, materials, and architectural design, and they affect how sound is perceived and produced (Eley et al., 2024). For example, a highly reverberant hall encourages slower tempi, greater space between notes, and longer rests (Amengual Garí et al., 2015; Bolzinger and Risset, 1994).
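Reverberance is commonly summarised with decay‑time measures such as RT60. The sketch below, a simplified illustration assuming a measured room impulse response held in a NumPy array, estimates RT60 via Schroeder backward integration and a T20‑style linear fit; standards‑compliant measurement involves additional steps (octave‑band filtering, noise compensation, and so on).

```python
import numpy as np

def estimate_rt60(impulse_response: np.ndarray, sample_rate: int) -> float:
    """Rough RT60 estimate from a room impulse response (Schroeder method)."""
    # Energy decay curve: backward-integrated squared impulse response, in dB.
    energy = impulse_response.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc.max())

    # T20-style fit: regress the decay between -5 dB and -25 dB ...
    t = np.arange(len(edc_db)) / sample_rate
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)

    # ... and extrapolate to a 60 dB decay.
    return -60.0 / slope

# Illustrative synthetic impulse response: exponentially decaying noise.
sr = 48_000
t = np.arange(0, 2.0, 1.0 / sr)
rng = np.random.default_rng(0)
ir = rng.standard_normal(len(t)) * np.exp(-t / 0.35)   # roughly 2.4 s nominal RT60

print(f"Estimated RT60: {estimate_rt60(ir, sr):.2f} s")
```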
2.1.4 Performer
Although the performance itself may be relatively brief, a musician’s performed interpretation is rooted in the much longer‑term background of that performer, such as their biographical data, their physical attributes, their psychological background, and their conception of the piece to be performed, which together may influence how the player thinks and moves during the performance.
Performer biographical data include the performer’s demographic, educational, and professional information. These data can support the study of how a player’s individual history may influence the ways they perform and perceive the music they make. For example, some differences between musicians’ performances may relate to their age (Liebman et al., 2012; Luck and Ansani, 2024), the time periods during which they live (Hansen et al., 2016; Liebman et al., 2012), and the language(s) and accents with which they speak (Goodchild et al., 2016; McGowan and Levitt, 2011; Nakai et al., 2009; Rector, 2021; Siefart et al., 2021). Players’ patterns of expressiveness, such as asynchronous playing or rubato, may also be connected to their educational background or expertise (Cook, 2009; Liebman et al., 2012; Palmer, 1989).
Performer physical attributes, such as anthropometric measurements (e.g. height) or physical fitness (e.g. muscle strength, cardiorespiratory endurance), may also shape and constrain performers’ interpretations. Factors such as height, hand size, body proportions, flexibility, strength, and endurance affect how players approach and interact with their instrument (İzci et al., 2023). In reinforcing muscle patterns through practice, players quite literally shape their bodies, and their developed kinaesthetic awareness may in turn influence their interpretative decision‑making (Auslander, 2009; Barczyk‑Pawelec et al., 2012; Blanco‑Piñeiro et al., 2017; Burrows, 1987; Doğantan‑Dack, 2015; 2021; de Souza, 2017; Schick, 1994).
Performer psychology refers to the mental characteristics and attitudes of the player (Kokotsaki and Davidson, 2010; Moreno‑Gutiérrez et al., 2023; Osborne and Kenny, 2005; Yoshie et al., 2009). Particularly relevant here are personality traits. For example, conscientiousness and extraversion are linked to stage presence, gesture, and verbal communication, while openness to experience is potentially connected to creativity and spontaneity (Gjermunds et al., 2020; Kemp, 1996). These traits may guide players’ interpretative preferences, emotional expressiveness, and tendency towards impromptu decision‑making in performance.
Interpretative set describes the collection of strategies that a performer uses to interpret the score. These strategies may be conscious or unconscious and, in some cases, may have developed over decades. The interpretative set will typically cover concerns such as phrasing, emotional expression, or underlying narrative – in short, everything that constitutes the performer’s individual expressive vision for the piece (Héroux, 2017). The interpretative set can be probed through qualitative data (e.g. performer interviews), though such data may be influenced by biases of which musicians may or may not be aware, and they may not always be straightforward for the researcher to interpret. These data can then be linked to actual musical performances to yield interesting insights into the relationship between intentions and outcomes (Lisboa et al., 2011). This modality is unique within Section 2.1 because a player’s pre‑performance descriptions of interpretative intentions may differ from their post‑performance evaluations of the same (Hamilton and Duke, 2020; Holmes and Holmes, 2013; Rey et al., 2025). Data on the interpretative set captured before the performance may therefore fundamentally differ from data collected afterwards, unlike data corresponding to the other topics described in this section.
2.1.5 Listener
If a listener is present, information about their background can help in studying their reactions to the musical performance.
Listener biographical data comprise basic individual information such as demographics, education, and expertise. Such data can support nuanced study of musical interpretations. For example, how listeners speak about performances may relate to gender and musical experience (Istók et al., 2009), and expertise information may help with predicting the musical aspects listeners notice when identifying a performer’s emotional intention (Yang et al., 2021).
2.2 During the performance
We consider six topics that together cover the performance itself and its immediate context, including the recording methods used to capture the performance data: instrument, venue, listener, performer, performance sound, and recording process.
2.2.1 Instrument
Instrument state refers to instrument characteristics that may fluctuate in the vicinity of or during a performance, such as mechanical wear, responsiveness, or tuning (in contrast with its fixed characteristics described in Section 2.1.2). These factors may affect a player’s use of the instrument, prompting adjustments in technical or interpretative strategies mid‑performance (Doğantan‑Dack, 2013).
2.2.2 Venue
Venue conditions can vary even during a performance, potentially inspiring players to make real‑time adjustments to technical and interpretative decisions in response to different venue‑related feedback.
Venue visuals that can vary during a performance or between rehearsal and performance include lighting, decorations, camera flashes, stage effects like smoke or pyrotechnics, the presence of a screen and projector, or even listeners’ movements. For example, audience members’ physical reactions, when visible to a player, may interfere with their flow mid‑performance (Okan and Usta, 2021), or novel visual stimuli can impose extra attentional demands on performers, with this additional mental load potentially impacting their interpretations in meaningful ways (Çorlu et al., 2014).
Venue configuration can change between rehearsal and performance and/or vary during the performance. Instrument placement, stage layout, audience proximity, or the presence or placement of recording equipment can affect how performers perceive and project sound, eliciting adjustments in technique or interpretation. For example, a performer’s position in the room can subtly shape their expressive choices by altering their sense of enjoyment as they move through the space (Armstrong, 2020).
Venue acoustics can fluctuate during the performance due to environmental factors like humidity or temperature shifts or as a result of sound absorption due to audience members’ presence (Hidaka et al., 2000; Nowoświat, 2022). These differences may affect players’ manipulation of such parameters as tempo or articulation (Kalkandjiev and Weinzierl, 2015; Ueno et al., 2010).
2.2.3 Listener
The listener is integral to the performance context. Their presence and characteristics can impact a performer’s expressive decisions, while their perception responds to the performance and setting.
Listener configuration refers to the spatial and social positioning of audience members during a performance, including their presence, location, visibility, and perceived identities. These factors can influence a performer’s sense of pressure or intention. For example, knowledge of listeners’ identities may impact a player’s interpretative strategies (Héroux, 2017; Persson et al., 1992).
Listener physiological state during performance can offer real‑time insights into how listeners respond to a performance and performance context. Measures such as heart rate, skin conductance, respiration, pupil dilation, brain activity, and eye tracking can reveal attentiveness or emotional engagement (Bishop and Jensenius, 2020; Latulipe et al., 2011).
2.2.4 Performer
Performer movements can offer insights into players’ interpretative intentions. For example, performers’ facial expressions and body movements can help us understand the emotions they may be trying to convey in a given passage (Dahl and Friberg, 2007; Davidson, 2012), and their physical movements can help communicate the phrasing structure they seek to articulate (Buck et al., 2013; MacRitchie et al., 2013; Thompson and Luck, 2011), even if these aspects are harder to discern in isolated sound. In collaborative performances, gestures and facial expressions contribute to dynamic, real‑time interactions, potentially affecting timing, dynamics, and other interpretative choices (Chang et al., 2017; Davidson, 2012).
Performer physiological state can be studied by collecting data on heart rate, skin conductance, muscle tension, brain activity, breathing patterns, and other physiological feedback during performance. Such data can offer insights into how the body reacts to the production or processing of musical stimuli, as well as how interpretations may in turn be shaped as a result of these responses (Higuchi et al., 2011; Turchet et al., 2024; Yoshie et al., 2009).
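As one illustration of how such signals can be reduced to usable measures, the sketch below (a rough, non‑clinical example on a synthetic pulse trace) estimates heart rate by peak detection with SciPy; dedicated physiological‑signal toolkits would normally be preferred for real data.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(pulse_signal: np.ndarray, sample_rate: float) -> float:
    """Rough heart-rate estimate from a raw pulse (PPG/ECG-like) trace."""
    # Illustrative constraints: peaks at least 0.4 s apart (i.e. below 150 BPM)
    # and above an amplitude threshold of 0.5; not a clinical-grade detector.
    peaks, _ = find_peaks(pulse_signal, height=0.5, distance=int(0.4 * sample_rate))
    intervals = np.diff(peaks) / sample_rate   # seconds between beats
    return 60.0 / intervals.mean()

# Synthetic example: a 72 BPM pulse with a little noise.
sr = 100.0
t = np.arange(0, 30, 1 / sr)
signal = np.sin(2 * np.pi * (72 / 60) * t) ** 21   # sharp periodic peaks
signal += 0.05 * np.random.default_rng(1).standard_normal(len(t))

print(f"Estimated heart rate: {heart_rate_bpm(signal, sr):.0f} BPM")
```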
2.2.5 Recording process
To enable analysis, the ephemeral performance moment must be captured for subsequent study. The recording setup can impact players’ expressive interpretations, and any post‑production adjustments to the recording may affect the data that researchers later use to study the performance.
Recording setup may affect both the data captured and the freedom a performer feels while playing. Equipment type, settings, and positioning can emphasise or omit certain sounds. Intrusive setups, whether visibly reminding the player that they are being recorded or physically placed on and restricting the player, may impact musicians’ physical comfort, freedom of movement, or focus, which could in turn affect their expressive choices (D’Amato et al., 2021; Perez‑Carrillo et al., 2016). In datasets assembled from extant recordings, providing the album titles allows researchers to begin to source information about how the performance was captured, if included in album metadata.
Recording post‑production refers to any ways in which the recording is adjusted after its creation. Particularly relevant are cuts or splices in the recordings, which can significantly alter the type and shape of musicians’ nuanced manipulation of expressive parameters, resulting in a different, often idealised, performance. Recording cuts or splices are sometimes not acknowledged, particularly in commercially released recordings. Therefore, spliced recordings could later be studied by researchers with no knowledge of these adjustments, depending on recording and transmission mediums (Leech‑Wilkinson, 2008; Rumsey, 2013). This can potentially affect analysis validity and applicability. More general changes, such as adjustment of reverberation, equalisation, or file compression, can impact the data that remain in the recordings.
2.2.6 Performance sound
Audio recordings of the musical performance should capture essentially all musical parameters a player can adjust, such as variation in dynamics, timbre, articulation, timing, and vibrato. They have supported many studies into musical interpretations (e.g. Dodson, 2009; Fabian and Schubert, 2009; Jerkert, 2004; Repp, 1992).
Ambient noise refers to non‑musical sounds present during a performance, such as audience noise, mechanical sounds from instruments, or climate control systems. These sounds may interrupt players’ concentration, alter their experience in performance, or influence their interpretation execution (Armstrong, 2020). Such sounds will almost inevitably be captured in acoustic recordings, potentially complicating the extraction of performance parameters from them.
Audio‑derived data include representations such as MIDI, spectrograms, waveforms, note onsets and offsets, or tempo curves. Such representations can facilitate certain kinds of useful analyses into expressive musical parameters. They can also be used to produce audio‑aligned scores, which are helpful for making systematic comparisons across performers. Audio‑score alignment can be automatic, manual, or a combination of both (e.g. Cancino‑Chacón et al., 2020; Dong et al., 2022; Riley et al., 2024).
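As a concrete illustration of these audio‑derived representations, the following sketch (assuming the librosa library and placeholder audio files) computes a log‑mel spectrogram, estimates note onsets and a global tempo, and aligns the performance to a synthesised score rendition via chroma‑based dynamic time warping, one common automatic route to audio–score alignment.

```python
import numpy as np
import librosa

# Placeholder paths: a performance recording and a synthesised score rendition.
perf_audio, sr = librosa.load("performance.wav", sr=22050)
score_audio, _ = librosa.load("score_synthesis.wav", sr=22050)

# Log-mel spectrogram of the performance.
mel = librosa.feature.melspectrogram(y=perf_audio, sr=sr)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Note onsets (seconds) and a global tempo estimate.
onsets = librosa.onset.onset_detect(y=perf_audio, sr=sr, units="time")
tempo, beats = librosa.beat.beat_track(y=perf_audio, sr=sr)
tempo_value = float(np.atleast_1d(tempo)[0])
print(f"{len(onsets)} onsets detected; global tempo estimate: {tempo_value:.1f} BPM")

# Chroma-based dynamic time warping: a common automatic approach to
# audio-score alignment when a score rendition is available.
chroma_perf = librosa.feature.chroma_cqt(y=perf_audio, sr=sr)
chroma_score = librosa.feature.chroma_cqt(y=score_audio, sr=sr)
cost, warp_path = librosa.sequence.dtw(X=chroma_score, Y=chroma_perf)
print("Alignment path (score frame, performance frame):", warp_path[::-1][:5])
```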
2.3 After the performance
Retrospective qualitative reflections from performers and listeners collected after the performance can enrich analyses of data collected during other phases. While this type of collected data may be subject to bias, it can also offer profound insights into experiential aspects of creating the ephemeral aural result. Primary topics that support studying a performance retrospectively are performers’ and listeners’ varied evaluations of the performance.
2.3.1 Performer
Performer assessment of the interpretation might include the player’s reflections on their success in achieving their interpretative goals, as well as reflections on whether, how, and why their interpretation strategies were spontaneously adjusted in the performance (Clarke, 1995; Ginsborg et al., 2012).
Performer evaluation of the experience incorporates broader information on the experiential aspects of performance. Descriptions of the nature and qualities of challenges encountered, perceived energy levels, emotional reflections, or explanations regarding their own reactions to the unexpected while playing can help shed further light on the mechanics and experiential aspects that impact the performance (Clark et al., 2014; van Zijl and Sloboda, 2010; van Zijl et al., 2014).
2.3.2 Listener
Listener evaluation of the performance may capture listeners’ emotional and aesthetic responses, attention shifts, and physical discomfort while listening. It may also include their annotations of perceived expressive parameter deviations (Park et al., 2024), which can be contextualised within listeners’ biographical and physiological data before being compared with performers’ evaluations and data from the performance (e.g. Gabrielsson and Juslin, 1996; Park et al., 2024; Yang et al., 2021).
Listener evaluation of the performer focuses on audience perceptions of the performer, including interaction, physical gestures, and proficiency. For example, listeners seem strongly influenced by visual cues, such as players’ facial expressions and physical movements (Broughton and Stevens, 2009; Kawase, 2014; Waddell and Williamon, 2017), as well as whether or not they use sheet music (Kopiez et al., 2017; Williamon, 1999). Contextual knowledge is also important – for example, the understanding a listener has of the player’s expertise (Springer and Sorenson, 2024). Social context may additionally impact listeners’ responses to music, such as the extent of the emotional arousal they experience (Thompson and Larson, 1995).
3 Existing Multimodal Datasets
The taxonomy in Section 2 summarises the complex factors that can influence experts’ expressive interpretations of musical scores in a given performance. To establish the current state of multimodal datasets of expressive musical performances and the extent to which they support study of these modalities and topics, the current section uses this proposed taxonomy to review existing multimodal datasets, focusing on datasets that are currently available to the research community.
As there is not yet a dedicated database that lists open‑access datasets for expressive music performance, we conducted a systematic search in the Web of Science to find relevant datasets and journal articles using the following groups of keywords: ‘expressive music performance dataset’ (136 results), ‘multimodal music performance dataset’ (188 results) and ‘music score interpretation dataset’ (85 results) (search date: 9 October 2025). These keywords were developed manually by reviewing a known set of datasets and identifying recurring keywords. We additionally considered ‘music performance datasets’, but the vast majority of the 3,121 results did not meet our inclusion criteria (see below).
Results were supplemented through an ad hoc Google Scholar search and through reviewing datasets listed on the website of the International Society for Music Information Retrieval [ISMIR]1. Further datasets were also found through cross‑referencing article literature reviews.
Experts’ expressive musical interpretations of scores interest researchers in a wide range of disciplines, which often use different reporting standards. As a result, each dataset needed to be manually reviewed for relevance according to the following inclusion criteria:
Expert human performers. Performances (on any instrument or voice) by one or more expert human musicians (i.e. at least university‑ or conservatoire‑level)
Score‑based. Performances of written compositions (full or excerpted)
Dataset accessibility. Open access or available via researcher contact (see Section 3.1)
Performance audio or audio‑derived data. Containing at least one of the following: performance audio, references that enable sourcing the performance audio (i.e. album information), or a symbolic encoding of the performance audio (i.e. MIDI)
Multimodal performance. Additionally including at least one other performance‑oriented non‑trivial data type. A data type is classified as ‘performance‑oriented’ if it supports the ‘Performer’, ‘Venue’ or ‘Instrument’ topics in Table 1 (i.e. it corresponds to the performer or the immediate performance context); it is classified as ‘trivial’ if it corresponds solely to the performer’s name, composition title, performance venue name (e.g. ‘Sydney Opera House’) or instrument type (e.g. ‘violin’)
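To make the screening procedure concrete, the sketch below encodes these inclusion criteria as a simple check over hand‑entered dataset metadata; the field names and the example record are illustrative assumptions rather than a released screening tool.

```python
PERFORMANCE_ORIENTED_TOPICS = {"Performer", "Venue", "Instrument"}
TRIVIAL_VALUES = {"performer name", "composition title", "venue name", "instrument type"}

def meets_inclusion_criteria(record: dict) -> bool:
    """Screen one dataset description against the review's inclusion criteria."""
    has_audio_or_derived = record["has_audio"] or record["has_album_refs"] or record["has_midi"]
    # At least one additional, non-trivial, performance-oriented data type.
    extra_modalities = [
        m for m in record["other_modalities"]
        if m["topic"] in PERFORMANCE_ORIENTED_TOPICS and m["data"] not in TRIVIAL_VALUES
    ]
    return (
        record["expert_performers"]   # university/conservatoire level or above
        and record["score_based"]     # performances of written compositions
        and record["accessible"]      # open access or available on request
        and has_audio_or_derived
        and len(extra_modalities) >= 1
    )

# Illustrative example record (values invented for demonstration).
example = {
    "expert_performers": True, "score_based": True, "accessible": True,
    "has_audio": True, "has_album_refs": False, "has_midi": True,
    "other_modalities": [{"topic": "Performer", "data": "motion capture"}],
}
print(meets_inclusion_criteria(example))   # True
```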
Where information about the dataset differed between the release article and the dataset source website, we prioritised the latter.
This process resulted in 41 datasets. We are confident that this collection includes the vast majority of relevant datasets; however, the topic’s interdisciplinary nature and terminological heterogeneity make it difficult to guarantee completeness.
3.1 Unreleased datasets
Many studies have analysed multimodal datasets but not made them publicly available. This was particularly true historically, when technology made sharing large datasets complicated and releasing them was unusual.
Sometimes, a dataset appears publicly available but the hyperlinks, at the time of writing, no longer appear to be maintained (e.g. Bazzica et al., 2017; D’Amato et al., 2021; Winters et al., 2016). Some publications offer to share the dataset, or the data used for the article’s conclusions, upon request (e.g. Ghodousi et al., 2022; Nusseck et al., 2022; Turchet and Pauwels, 2021). Other publications state either that data will be available at an unspecified time in the future or present the dataset and findings without sharing details of how to access it, making it necessary for interested researchers to follow up to confirm availability and access (e.g. Huang et al., 2020; Rafee et al., 2021). Our attempts to contact original researchers have had limited success, possibly due to changing contact details.
Lack of access to datasets prevents other researchers from verifying, replicating, or extending original studies’ analyses (Maestre et al., 2017). This can be an inefficient use of resources, especially in the context of performance datasets, which often require substantial time, finances, and expertise to compile. Fortunately, it is increasingly common for researchers to make datasets open access, and the open science movement has identified some helpful best practices for Open Music Research that can allow the same resources to be reused by multiple different studies (Huang et al., 2024; see Moss and Neuwirth, 2021).
3.2 Released datasets extracted from existing recordings
One efficient way to create a large multimodal dataset is to take pre‑existing audio recordings and add new data (Table 2). These datasets primarily connect audio recordings with audio‑derived data, typically MIDI, sometimes adding score alignment (e.g. The Con Espressione Game Dataset [Cancino‑Chacón et al., 2020], Guitar‑Aligned Performance Scores [GAPS; Riley et al., 2024], Aligned Scores and Performances [ASAP; Foscarin et al., 2020], and Automatically Transcribed Expressive Piano Performance [ATEPP; Zhang et al., 2022]). Other datasets include audio and MIDI with the addition of scores or score‑based data (e.g. MusicNet [Thickstun et al., 2017], the Wagner Ring Dataset [Weiß et al., 2023], and the Beethoven Piano Sonata Dataset [BPSD; Zeitler et al., 2024]).
Table 2
Accessible multimodal music performance datasets assembled from extant recordings.
| Dataset | Citation | Modalities | # players | # pieces* | # recordings | Instrument(s) or ensemble | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| The Con Espressione Game Dataset2 † | Cancino‑Chacón et al. (2020) | Engraved score (PDF); symbolic score (MusicXML); performer biographical data (name); listener biographical data (expertise); recording setup (inferable from album metadata); audio‑derived data (loudness curves, spectrograms, MIDI, audio–score alignment annotations); listener evaluation of performance | 26 | 9 | 45 | Piano | ‘Approximate’ from alignment/ loudness curves |
| ASAP3 | Foscarin et al. (2020) | Symbolic score (MusicXML, MIDI); score‑derived data (music‑theoretic annotations); audio (some); ambient noise; audio‑derived data (MIDI, audio–score alignment annotations) | Not listed | 236 | 1,067 | Piano | Automatic + manual |
| PianoMotion10M4 | Gan et al. (2024) | Performer movements (video, video annotations); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | Not listed | 1,966 | Piano | Automatic + manual |
| CrestMusePEDB5†‡ | Hashida et al. (2008) | Symbolic score (MusicXML, MIDI); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (rough audio–score deviation data) | Not listed | ~100 | Not listed | Piano | Manual |
| Guqin dataset6† | Huang et al. (2020) | Performer biographical data (name); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note, tempo, and dynamic annotations; technique annotations) | 5 | 10 | 39 | Guqin | None |
| GiantMIDI‑Piano7§ | Kong et al. (2022) | Performer biographical data (name, dates, nationality); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | 10,855 | 10,855 | Piano | Automatic |
| MazurkaBL8†¶ | Kosta et al. (2018) | Symbolic score (MusicXML); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (loudness, expression, score‑aligned beat annotations) | Not listed | 44 | 2,000 | Piano | None |
| GAPS9# | Riley et al. (2024) | Symbolic score (MusicXML); instrument characteristics (instrument‑specific features, tunings); venue visuals and configuration (inferable from performance video); performer biographical data (name, dates, nationality, gender); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 205 | Not listed | 300 | Guitar | None |
| CHARM Mazurka Project10† | Sapp (2007) | Recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (some MIDI, some tempo and dynamics data) | 135 | 49 | 2,926 | Piano | Manual |
| MusicNet11 | Thickstun et al. (2017) | Performer biographical data (some names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note labels, MIDI) | Not listed | Not listed | 330 | Varied | None |
| SUPRA12 | Shi et al. (2019) | Performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (‘raw’ MIDI hole file, ‘expressive’ MIDI dynamic hole file, piano roll image) | 151 | ~430 | 457 | Piano | Automatic + manual |
| Wagner Ring Dataset13† | Weiß et al. (2023) | Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment) | Not listed | Not listed | 16 | Orchestra and vocalists | None |
| Schubert Winterreise Dataset14† †† | Weiß et al. (2021) | Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); instrument state (tuning); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 16 | 24 | 216 | Piano, voice | None |
| BPSD15 | Zeitler et al. (2024) | Engraved score (PDF); symbolic score (Sibelius, MusicXML, MIDI); score‑derived data (music‑theoretic annotations); instrument characteristics (type); performer biographical data (names); recording setup (inferable from album metadata); audio (some); ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment) | 11 | 32 | 352 | Piano | Automatic + manual |
| ATEPP16† | Zhang et al. (2022) | Symbolic score (43% MusicXML); performer biographical data (names); recording setup (inferable from album metadata); audio‑derived data (MIDI) | 49 | 1,595 | 11,674 | Piano | Automatic |
[i] * Individual movements of a larger work are counted as separate pieces.
[ii] † Dataset contains album names to allow any audio recordings not present in the dataset to be purchased commercially.
[iii] ‡ The figures shown here refer to the first edition of CrestMusePEDB, which comprises commercially released recordings. The second edition adds performances recorded by the researchers and appears in Table 3.
[iv] § A subset of 7,236 recordings includes composers’ names in recording titles.
[v] ¶ MazurkaBL is an extension of the CHARM Mazurka Project.
[vi] # GAPS contains videos of 205 performers, but it is unclear if any audio‑only data may include additional players.
[vii] †† Number of performers includes different piano accompanists.
Some datasets drawn from extant recordings offer particularly substantial expansions upon their source data. The Stanford University Piano Roll Archive (SUPRA; Shi et al., 2019) includes two MIDI files alongside each audio recording: one ‘raw’, with pitch, timing, and articulation derived from the roll’s holes, and the other ‘expressive’ (with dynamics), all compiled from digitised piano rolls in Stanford’s Piano Roll Archive. It also includes scans of the piano rolls. The Con Espressione Game Dataset combines listeners’ descriptions of expressive character and favourite performances with scores, audio–score alignment, and performance annotations. PianoMotion10M (Gan et al., 2024) expands upon video and audio recordings compiled from the Internet with performance MIDI and annotations of players’ hand positions. GAPS aligns internet audio and video recordings with scores and instrument‑specific information, expanding this with performer, composition, and composer metadata. Finally, GiantMIDI‑Piano (Kong et al., 2022) links player demographics (names, nationalities, and birth and death years) to audio, MIDI, and YouTube videos of the performances.
The potentially large size of datasets drawn from extant recordings makes them particularly well‑suited to machine‑learning applications. Research using them has included early development of performer identification methodologies based on interpretative expressiveness (Zhang et al., 2022), automating beat and note annotation (Foscarin et al., 2020), performance transcription (Kong et al., 2022; Zhang et al., 2022), and predictions of performance evaluations (Cancino‑Chacón et al., 2020; Gan et al., 2024).
A drawback of compiling datasets of performances from extant recordings is the paucity of detail about recording conditions that could affect players’ interpretations. It is usually impossible to ask performers about their recording experience, even retrospectively. Similarly, the presence or extent of editing in a recording can be undetectable. Live concert recordings appear alongside those made in studio conditions without indication of which recordings came from which source. Analyses must consequently be performed on the audio recordings and associated transcriptions without consideration of whether they were achieved through an unedited live recording, in a single perfect take in a recording studio, or through splicing multiple recordings together, and without discussion of the effect that these conditions may have on the data and subsequent analyses. As a result, while these datasets can be and are used for studying broad interpretative trends, they cannot help with examining the different expressive interpretations that can be produced in particular contexts.
A persistent challenge with releasing the audio in these types of datasets is copyright, which continues to make it hard to include audio from commercial recordings in an open‑access dataset. Listing album titles and player names within the dataset allows researchers to both compile performer biographical data and purchase the recordings, which may provide some details of the performance venue and recording (e.g. The Con Espressione Game Dataset, CrestMusePEDB [Hashida et al., 2008], MazurkaBL [Kosta et al., 2018], CHARM Mazurka Project [Sapp, 2007], Schubert Winterreise Dataset [Weiß et al., 2021]).
One solution is for researchers to newly record performances for their datasets (see Section 3.3). This allows them to potentially secure musicians’ permissions for sharing the recordings and data associated with them, avoiding the copyright restrictions of commercial recordings.
3.3 Released datasets from purpose‑recorded musical interpretations
The most controlled way to gather diverse multimodal data for research is through recording sessions that involve the researcher. Open‑access datasets of purpose‑recorded musical interpretations include both solo (Table 3) and ensemble recordings (Table 4).
Table 3
Purpose‑recorded, publicly available multimodal datasets: Solo.
| Dataset | Citation | Modalities | # players | # pieces* | # (N) or hours (H) of recording(s) | Instrument(s) | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| ChoraleBricks17 | Balke et al. (2025) | Symbolic score (MEI, MusicXML, MIDI, CSV); instrument characteristics (type); performer biographical data (birth year); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 11 | 10 | 2.7 H | Flute, oboe, clarinet, trumpet, saxophone, baritone, trombone, tuba | None |
| Rach3 Dataset18 | Cancino‑Chacón and Pilov (2024) | Symbolic score (MusicXML, MEI); instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type, settings); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI); performer experience (log, description, Mood State Questionnaire) | 1 | 1 | ~350 H | Piano | Automatic |
| Bach Violin Dataset19 | Dong et al. (2022) | Symbolic score (MusicXML); performer biographical data (name); recording post‑production (mixing, file conversion); audio; ambient noise (inferable from audio recording); audio‑derived data (estimated audio–score alignment annotations) | 17 | 32 | 6.5 H | Violin | None |
| Bach10 Dataset20 | Duan and Pardo (2012) | Symbolic score (MIDI); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | 10 | 40 N | Violin, clarinet, saxophone, bassoon | None |
| The Vienna 4×22 Piano Corpus21 | Goebl (1999) | Symbolic score; instrument characteristics (type); venue visuals and acoustics (described size, building name); performer biographical data (education, expertise); recording setup (equipment type, positioning); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 22 | 4 | 88 N | Piano | Automatic |
| RWC Music Database (Classical Music)22† | Goto et al. (2002) | Performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | 19 | 42 | 42 N | Piano, violin, cello, flute, and others | Manual |
| CrestMusePEDB (2nd edition)5 | Hashida et al. (2018) | Symbolic score (MusicXML/MIDI); instrument characteristics (type); venue visuals (described locations); performer biographical data (names); interpretative set; recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 12 | 24 | 443 N | Piano | Automatic |
| MAESTRO23‡ | Hawthorne et al. (2019) | Instrument characteristics (type); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–MIDI alignment) | 205 | ~864 | 1,276 N | Piano | Automatic |
| PercePiano24§ | Park et al. (2024) | Symbolic score (MusicXML, MIDI); instrument characteristics (type); listener biographical data (expertise); audio‑derived data (MIDI); listener evaluation of performance (annotations [Likert scale]) | 25 | 1,202 excerpts | 1,202 N | Piano | Automatic |
| The Batik‑plays‑Mozart corpus25¶ | Hu and Widmer (2023) | Engraved score (PDF); score‑derived data (music‑theoretic annotations by Hentschel et al., 2021); symbolic score (MusicXML); instrument characteristics (type); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (MIDI, MIDI–score alignment) | 1 | 36 | 36 N (3.75 H) | Piano | Automatic |
| SPD26 | Jin et al. (2024) | Instrument characteristics (video); venue visuals and configuration (inferable from performance video); performer movements (video, 3D motion annotations); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording) | 9 | 120 | > 3 H | Cello, violin | None |
| SMD MIDI‑Audio Piano Music Collection27 | Müller et al. (2011) | Instrument characteristics (type); venue configuration (described location); performer biographical data (expertise); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | 38 | 50 N | Piano | Automatic |
| Piano Syllabus Dataset28 | Ramoneda et al. (2025) | Instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, CQT, piano rolls) | Not listed | 7,901 | 7,901 N | Piano | Unclear |
| Piano gestures dataset29 | Sarasúa et al. (2017) | Symbolic score (MIDI); instrument characteristics (type, video); venue visuals and configuration (inferable from performance video); interpretative set (researchers asked them to play in different ways); performer movements (video, MoCap); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 2 | 1 | 105 N | Piano | Automatic |
| Violin gestures dataset29 | Sarasúa et al. (2017) | Performer biographical data (expertise); interpretative set (researchers asked them to play different ways); performer movements (EMG, accelerometer, gyroscope); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 8 | 1 | 880 N | Violin | None |
| Telemann’s 12 Fantasias for Solo Flute30 | Thibaud et al. (2025) | Engraved score (PDF); symbolic score (MEI, MSCZ); instrument characteristics (type); performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 6 | 12 | 72 N | Flute | None |
| ARME Virtuoso Strings Dataset31† | Tomczak et al. (2023) | Engraved score (PNG); interpretative set (researchers asked them to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations) | 4 | 5 | 746 N | Viola, violin, cello | None |
| CBFdataset32 | Wang et al. (2022) | Venue visuals and acoustics (described room type); performer biographical data (expertise); performer movements (playing technique annotations); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 10 | 4 | 2.6 H | Chinese bamboo flute (2 types) | None |
| CCOM‑HuQin33 | Zhang et al. (2023) | Engraved score (PDF); symbolic score (MusicXML); instrument characteristics (type); venue visuals (described type); performer biographical data (expertise, names); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations); performer assessment of interpretation (notes of applied techniques) | 8 | 57 | 1.28 H | Erhu, Banhu, Gaohu, Zhuihu, Zhonghu | None |
[i] *Individual movements of a larger work are counted as separate pieces. Regarding the Telemann Fantasias, each complete Fantasia is counted as one recording here since the dataset presents each Fantasia as a whole, without dividing it into separate sections.
[ii] †RWC Music Database and ARME Virtuoso Strings Dataset also contain ensemble performances and so are shown in Tables 3 and 4.
[iii] ‡Zhang and colleagues (2022) determined this number of performers by connecting names with the dataset’s recordings.
[iv] §PercePiano expanded upon data organised by MAESTRO with symbolic data (score and annotations).
[v] ¶Audio recordings may be purchased commercially.
Table 4
Purpose‑recorded, publicly available multimodal datasets: Ensemble.
| Dataset | Citation | Modalities | # players | # pieces* | # (N) or hours (H) of recording(s) | Instrument(s) and/or ensemble type(s) | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| Choral Singing Dataset34 | Cuesta et al. (2018) | Performer biographical data (ensemble name); recording setup (equipment type and placement); audio; ambient noise; audio‑derived data (MIDI) | 16 | 3 | 48 N | Voice | Automatic + manual |
| Quartet Body Motion and Pupillometry Dataset35 | Bishop and Jensenius (2020) | Instrument characteristics (video); venue visuals (described location); performer biographical data (expertise); listener biographical data (expertise); venue visuals and configuration (inferable from video recording); performer movements (MoCap, video); performer physiological state (eye tracking); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); performer experience (difficulty ratings) | 5 | 4 | 9 N | String quartet | None |
| RWC Music Database (Classical Music)22† | Goto et al. (2002) | Performer biographical data (ensemble name); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | ~96 | 20 | 20 N | Orchestra, chamber ensembles | Manual |
| URMP36 | Li et al. (2018) | Engraved score (PDF); symbolic score (MIDI); instrument characteristics (video); venue visuals (described); performer biographical data (expertise); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio annotations) | 23 | 44 | 44 N | String duo, trio, quartet, quintet | None |
| EEP37 | Marchini et al. (2014) | Engraved score (PDF); performer biographical data (expertise); performer movements (MoCap, bowing annotations); recording setup (equipment type); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | 5 | 23 N | String quartet | None |
| QUARTET38 | Papiotis (2015) | Instrument characteristics (video); venue visuals (described); venue visuals and configuration (inferable from performance video); performer movements (MoCap, video); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | Not listed | 96 N | String quartet | None |
| Erkomaishvili Dataset39 | Rosenzweig et al. (2020) | Symbolic score (MusicXML); performer biographical data (name); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (performed note‑onset annotations, fundamental frequency) | 1 | 101 | 101 N (7 H) | Voice | None |
| PHENICX‑conduct dataset40 | Sarasúa (2017) | Symbolic score (MusicXML/MIDI); instrument characteristics (types); venue visuals (described location); performer biographical data (ensemble name); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup; audio; ambient noise (inferable from audio recording) | Not listed | 3 excerpts | 75 N | Orchestra | None |
| ARME Virtuoso Strings Dataset31† | Tomczak et al. (2023) | Engraved score (PNG); interpretative set (performers were asked to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations) | 4 | 5 | 746 N | String duo, trio, quartet | None |
Nearly all these datasets include audio recordings and symbolic data, such as MIDI, usually captured through instruments’ own MIDI devices or MIDI tools retrofitted to acoustic instruments (e.g. Rach3 Dataset [Cancino‑Chacón and Pilov, 2024], Real World Computing Music Database [RWC; Goto, 2004; Goto et al., 2002], MIDI and Audio Edited for Synchronous TRacks and Organization [MAESTRO; Hawthorne et al., 2019], Saarland Music Data MIDI‑Audio Piano Music Collection [SMD; Müller et al., 2011], CrestMusePEDB 2nd edition [Hashida et al., 2018]). Datasets may also include engraved or symbolic scores (Rach3 Dataset, The Vienna 4x22 Piano Corpus [Goebl, 1999], Bach Violin Dataset [Dong et al., 2022], CCOM‑HuQin [Zhang et al., 2023], The Batik‑plays‑Mozart corpus [Hu and Widmer, 2023], University of Rochester Multi‑Modal Music Performance [URMP; Li et al., 2018], CrestMusePEDB 2nd edition, Telemann’s 12 Fantasias for Solo Flute [Thibaud et al., 2025]). Researchers may also add extensive annotations, whether their own or those of invited expert annotators, such as 3D motion annotations (String Performance Dataset [SPD; Jin et al., 2024]), technique annotations (CBFdataset [Wang et al., 2022]), or dynamic annotations (MAESTRO). The most extensive datasets also include video recordings (Rach3 Dataset, CCOM‑HuQin, SPD, Piano and violin gestures datasets [Sarasúa et al., 2017], Quartet Body Motion and Pupillometry Dataset [Bishop and Jensenius, 2020], URMP, QUARTET [Papiotis, 2015]), MoCap (Ensemble Expressive Performance Dataset [EEP; Marchini et al., 2014], QUARTET, Piano and violin gestures datasets, Quartet Body Motion and Pupillometry Dataset, Piano Syllabus Dataset [Ramoneda et al., 2025], PHENICX‑conduct dataset [Sarasúa, 2017]), eye tracking or other physiological measurements (Violin gestures dataset, Quartet Body Motion and Pupillometry Dataset), or listener data (PercePiano [Park et al., 2024], Quartet Body Motion and Pupillometry Dataset).
Datasets of musical interpretations recorded for the purpose of dataset compilation have supported research in MIR, performance studies, and other areas in empirical musicology. The Vienna 4x22 Piano Corpus was used to test methods for automatic alignment of different interpretations of the same piece (Wang et al., 2016). Using the Quartet Body Motion and Pupillometry Dataset, Bishop and colleagues (2021) found a relationship between the technical difficulty and physical effort of a performance and the pupil diameters of both players and listeners. These datasets have also supported study of musicians’ development of interpretations through the learning of a new piece (Cancino‑Chacón and Pilov, 2024 [Rach3 Dataset]), as well as of how interpretative intent may influence the amount performers move (Sarasúa et al., 2017 [Piano gestures dataset]). Jin and colleagues (2024) used SPD to reinforce the importance of combining multiple modalities in the study of performances, in their case refining the analyses of annotated 3D motion data with nuances drawn from the concomitant audio.
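As a minimal illustration of the kind of alignment method tested on such corpora, the following Python sketch aligns two interpretations of the same piece via chroma features and dynamic time warping using the librosa library; the file names are placeholders, and this is a generic approach rather than the specific method of Wang et al. (2016).

```python
# Sketch: align two interpretations of the same piece with chroma features + DTW.
# Assumes librosa is installed; the audio file names are placeholders.
import librosa

y_a, sr_a = librosa.load("interpretation_A.wav")
y_b, sr_b = librosa.load("interpretation_B.wav")

# Chroma features abstract away timbre and dynamics, leaving harmonic content to align on.
chroma_a = librosa.feature.chroma_cqt(y=y_a, sr=sr_a)
chroma_b = librosa.feature.chroma_cqt(y=y_b, sr=sr_b)

# Dynamic time warping returns an accumulated cost matrix and a warping path of
# corresponding frame pairs (column 0 indexes A, column 1 indexes B).
D, wp = librosa.sequence.dtw(X=chroma_a, Y=chroma_b, metric="cosine")
times_a = librosa.frames_to_time(wp[:, 0], sr=sr_a)
times_b = librosa.frames_to_time(wp[:, 1], sr=sr_b)
```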
Alongside multimodal quantitative data, researchers sometimes also provide small amounts of information about the musicians whose performances make up these datasets. Most commonly reported are general demographic information (e.g. national background), professional status (e.g. students versus professionals), or institutional affiliation. For example, Goebl (1999) describes the Vienna 4x22 Piano Corpus as containing performances by Viennese pianists, the SMD MIDI‑Audio Piano Music Collection features performances by students at Hochschule für Musik Saar (Müller et al., 2011), and URMP includes some performances by students at the Eastman School of Music (Li et al., 2018), although none of these examples connect piece, excerpt, or performance with performer. The Piano and violin gestures datasets mention performers’ genders (Sarasúa et al., 2017), the EEP dataset states average years of experience and ages of participants (Marchini et al., 2014), and ChoraleBricks includes performers’ years of birth (Balke et al., 2025). Although they did not collect demographic data, Cancino‑Chacón and Pilov (2024) included rehearsal logs and answers to a Mood State Questionnaire in their Rach3 Dataset. The Batik‑plays‑Mozart corpus, ARME Virtuoso Strings Dataset (Tomczak et al., 2023), and Erkomaishvili Dataset (Rosenzweig et al., 2020) are unique among these purpose‑recorded datasets as the performances they contain are by the same named players, allowing researchers to compile performer data if desired. Players’ names can sometimes be found within other datasets, as in CrestMusePEDB 2nd edition, or in the file names (e.g. Bach Violin Dataset, CCOM‑HuQin, Telemann’s 12 Fantasias for Solo Flute).
Researchers compiling other datasets found through this review may not have requested or received consent to release demographic or other background data, as most analyses seek statistical summaries or broad trends in interpretative strategies, or aim to train machine‑learning models to align or predict music. Attributing trends in the data to players’ backgrounds or individual perspectives was likely not the intended use of these datasets.
A related ongoing endeavour involving purpose‑recorded, publicly available multimodal datasets, though strictly outside the scope of the present review, is to produce datasets of improvised musical performances. This is an interesting domain, where the performer’s contribution is not limited solely to the musical parameters they manipulate in a performance based largely on an inscribed score but additionally includes the choice of notes, chords, and rhythms. Example datasets that incorporate both score‑based and extemporaneous playing include Filosax41 (Foster and Dixon, 2021), FiloBass42 (Riley and Dixon, 2023), Guitarset43 (Xi et al., 2018), and GrooveMIDI44 (Gillick et al., 2019).
4 Underrepresented Modalities
A comparison of our taxonomy of modalities (Table 1) with the modalities supported by published datasets (Tables 2–4) reveals that, while some modalities are represented well (e.g. engraved and symbolic scores, recording setup, audio, audio‑derived data), others are represented poorly, if at all (e.g. venue acoustics, performer physical attributes, listener configuration, performer assessment of interpretation and experience). This is partly due to practical factors, such as the ease of collecting audio or audio‑derived data as compared to (for example) venue acoustics or performer experience data. However, none of these practical obstacles is fundamentally insuperable: most can be addressed with recent technologies (e.g. software for pose estimation from video) or with a short survey (e.g. questions about performance experience). We believe that future research stands to gain significantly from using such modalities to study these performances, thereby engaging new research questions about expressive music interpretation.
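To illustrate how low the practical barrier for some of these modalities has become, the sketch below extracts per‑frame body landmarks from an ordinary performance video using the freely available MediaPipe Pose model; the file name is a placeholder, and this is only one of several possible pose‑estimation pipelines.

```python
# Sketch: markerless extraction of performer movement from a performance video.
# Assumes opencv-python and mediapipe are installed; the file name is a placeholder.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
landmarks_per_frame = []

cap = cv2.VideoCapture("performance_video.mp4")
with mp_pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV reads frames as BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            # 33 body landmarks with normalised x, y, z and a visibility score.
            landmarks_per_frame.append(
                [(lm.x, lm.y, lm.z, lm.visibility) for lm in result.pose_landmarks.landmark]
            )
        else:
            landmarks_per_frame.append(None)
cap.release()
```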
We now review key underrepresented modalities and provide practical suggestions about how they might be incorporated into future datasets.
4.1 Before the performance
Concerning modalities for which data can be collected before the performance, as outlined previously (Table 1; Section 2.1), most datasets only provide the score. Some datasets do provide information about performance instruments, though this information is typically basic (e.g. stating only that the instrument is a piano). Almost no datasets provide significant information about venue acoustics, performer background, or listener background. We focus on these latter lacunae in what follows.
4.1.1 Venue acoustics
Only two datasets include acoustic information, stating that the performances were recorded in recording studios (The Vienna 4x22 Piano corpus, CBFdataset). Acoustic questions become much more complicated in other real‑world performance venues, however. Recording an impulse response function would be an efficient way to capture key acoustic properties of the space, such as reverberation time and clarity. Researchers could also offer qualitative descriptions of the acoustics.
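As a sketch of how such measurements could be derived and reported, the following functions estimate reverberation time (via Schroeder backward integration and a T20 fit) and clarity (C80) from a recorded room impulse response; they assume NumPy and an impulse response array trimmed to begin at the direct sound, and are illustrative rather than a standards‑compliant implementation.

```python
# Sketch: RT60 (T20-based) and clarity (C80) from a room impulse response.
# Assumes `ir` is a mono NumPy array starting at the direct sound, sampling rate `sr`.
import numpy as np

def schroeder_decay_db(ir):
    # Backward-integrated energy decay curve, normalised to 0 dB at t = 0.
    energy = ir.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    return 10.0 * np.log10(edc / edc[0])

def rt60_t20(ir, sr):
    # Fit a line to the decay between -5 and -25 dB and extrapolate to -60 dB.
    edc_db = schroeder_decay_db(ir)
    t = np.arange(len(ir)) / sr
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope

def clarity_c80(ir, sr):
    # Ratio of early (first 80 ms) to late energy, in dB.
    n80 = int(round(0.080 * sr))
    early = np.sum(ir[:n80] ** 2)
    late = np.sum(ir[n80:] ** 2)
    return 10.0 * np.log10(early / late)
```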
4.1.2 Performer
Performer biographical data such as age, nationality, language, education, and other relevant details are rarely included in these datasets, while several mention performer expertise (e.g. Vienna 4x22 Piano Corpus, SMD MIDI‑Audio Piano Music Collection, Violin gestures dataset). Basic biographical data may be attached to a large dataset, as in the Piano Jazz with Automatic MIDI Annotations dataset45 (PiJAMA), which attaches players’ genders, names, and years of birth to 2,777 performances taken from existing recordings (Edwards et al., 2023). However, this dataset is of improvisations, not score‑based interpretations, and no comparably substantial datasets of interpretations yet include this information. Of course, researchers using extant datasets can compile this biographical information for themselves if they have access to performers’ names; however, names are only available in about half of the purpose‑recorded, publicly available datasets.
Performer physical attributes do not appear in any datasets found in this review. This information could support study of performed gestures or fingerings, as well as the capabilities and restrictions of players of different statures, ages, or those working through various performance‑related injuries or technique adjustments (Carmeli et al., 2003; Sakai et al., 2006). Such data could provide an interesting new perspective for predicting performers’ interpretations and yield useful applications in pedagogical research and the study of injury prevention and recovery.
Performer psychology also is not covered by any of these datasets. Including psychological profiles, and personality traits in particular, could enrich analyses of expressive variation, performance anxiety, and interpretative decision‑making. Widely used instruments such as the Big Five Inventory (John et al., 1991), its shorter version, the 10‑item Personality Inventory (Gosling et al., 2003), or the HEXACO Personality Inventory – Revised offer scalable options for inclusion. These tools have been successfully applied in some music contexts, such as studies linking Big Five traits to music performance anxiety and listeners’ musical preferences (Chattin, 2019; Flannery and Woolhouse, 2021) or those differentiating musicians from non‑musicians based on traits like openness or honesty and humility (Bandi et al., 2022). Including such data could support research in music psychology, pedagogy, and performance science, enabling nuanced investigations into how personality influences musical expression and experience.
Interpretative set is likewise largely absent from the datasets reviewed. A small number of datasets provide performers with pre‑performance instructions (e.g. ‘play more dramatically’, ‘play more sadly’; see Piano gestures dataset, Violin gestures dataset, ARME Virtuoso Strings Dataset), and one other dataset asks performers to describe their interpretative intention (CrestMusePEDB 2nd edition). Further work in this vein could substantially benefit studies in music psychology, pedagogy, and expressive performance modelling, helping researchers to study, for example, how intention shapes expressive output, how players creatively adjust their interpretative intention during performance, and how different listeners perceive these intentions (Héroux, 2017; Héroux and Fortier, 2014; Turchet and Pauwels, 2021).
4.1.3 Listener
Listener biographical data appear in only three datasets: The Con Espressione Game Dataset, PercePiano, and the Quartet Body Motion and Pupillometry Dataset, and these furnish only listeners’ expertise or musical exposure. Further biographical data might enable comparison of different listeners’ perceptions of the performances, as well as of listeners’ perceptions and those of the performer (Turchet and Pauwels, 2021). Of course, such data are only relevant to datasets where the performance takes place with a live audience.
4.2 During the performance
Concerning modalities for which data may be captured ‘during the performance’ (Table 1; Section 2.2), most datasets only consistently provide audio (and, when recordings are acoustic, some kind of ambient noise) and audio‑derived data. Several include performer movements through video and, therefore, also inadvertently capture some venue visuals and configuration (e.g. GiantMIDI‑Piano, GAPS; SPD, Piano gestures dataset, Quartet Body Motion and Pupillometry Dataset). However, as discussed below, listener data and performer physiological data are particularly underrepresented.
4.2.1 Listener
Neither the configuration nor the physiological state of any listener is included in any of these datasets, with the exception of one dataset that includes listeners’ eye tracking (Quartet Body Motion and Pupillometry Dataset). This is primarily because many data collections took place without an audience, or because the studies collected qualitative data on listeners’ reflections rather than quantitative data. Future studies with audiences could record their precise positions in the venue, offering opportunities to study how listeners might perceive performances differently based on their physical distance from the sound source, as well as how performers might respond to the proximity and actions of listeners during performance (O’Neill and Sloboda, 2017). Wearable sensors might offer basic heart rate and respiration data to enable study of attention, engagement, and inter‑listener synchrony during performances (Tschacher et al., 2023).
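If such wearable heart‑rate data were collected, inter‑listener synchrony could be summarised with an analysis as simple as the following sketch, which assumes each listener’s signal comes with its own timestamps, resamples all signals onto a common grid, and averages pairwise correlations; the input format and one‑second grid are assumptions.

```python
# Sketch: summarise inter-listener heart-rate synchrony during a performance.
# Assumes each listener's recording is a (timestamps_seconds, bpm_values) pair.
import numpy as np

def to_common_grid(recordings, step=1.0):
    # Linearly interpolate every listener's signal onto a shared 1-second grid.
    start = max(t[0] for t, _ in recordings)
    end = min(t[-1] for t, _ in recordings)
    grid = np.arange(start, end, step)
    return np.stack([np.interp(grid, t, v) for t, v in recordings])

def mean_pairwise_correlation(signals):
    # Average Pearson correlation over all listener pairs.
    n = signals.shape[0]
    corrs = [np.corrcoef(signals[i], signals[j])[0, 1]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))
```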
4.2.2 Performer
Aspects of performer physiological state during performances are included in only one of the datasets reviewed. While physiological data‑collection tools can be intrusive (see Section 5 for further discussion), the data they capture can offer insights into expressive behaviour, performance stress, and performer–audience interaction. As with listeners, the use of wearable heart rate monitors, respiration belts, and galvanic skin response devices can capture physiological signals during live performance, enabling analysis of arousal and emotional engagement (Knapp et al., 2009). These data can therefore be used to study the relationships between performers’ internal states and expressive parameters, as well as to study potential synchronisation between the physiological states of listeners and performers (Tschacher et al., 2023).
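One hedged sketch of the kind of analysis this would enable is a sliding‑window correlation between a performer’s heart rate and a beat‑level tempo curve; the beat times, heart‑rate recording, and window sizes below are assumed inputs, not part of any existing dataset.

```python
# Sketch: windowed correlation between a performer's heart rate and local tempo.
# Assumes `beat_times` (seconds) from an audio-derived beat annotation and a
# heart-rate recording given as (hr_times, hr_bpm) arrays.
import numpy as np

def local_tempo(beat_times):
    # Instantaneous tempo (BPM) at the midpoint of each inter-beat interval.
    ibi = np.diff(beat_times)
    return beat_times[:-1] + ibi / 2.0, 60.0 / ibi

def windowed_corr(tempo_t, tempo_bpm, hr_times, hr_bpm, win=16, hop=4):
    # Sample heart rate at the tempo time points, then slide a window over both series.
    hr_at_beats = np.interp(tempo_t, hr_times, hr_bpm)
    return np.array([
        np.corrcoef(tempo_bpm[i:i + win], hr_at_beats[i:i + win])[0, 1]
        for i in range(0, len(tempo_bpm) - win + 1, hop)
    ])
```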
4.3 After the performance
Data capturing listeners’ and performers’ evaluations of and reflections on the performance are almost entirely missing from these datasets, as discussed below.
4.3.1 Performer
Performer assessments of their interpretations and evaluations of their experience during the performance can support more profound study of how we play and listen to music (Armstrong, 2020; Bresin and Friberg, 1999; Vieillard et al., 2008). In our survey, only two datasets include details about performers’ interpretations during the data collection (CrestMusePEDB 2nd edition, CCOM‑HuQin), and only two others offer brief information about performers’ experience in the data collection (Rach3 Dataset, Quartet Body Motion and Pupillometry Dataset). Such data could be helpful for clarifying intentionality of performance decisions and for enhancing understanding of the performer’s internal state during the performance (especially in the absence of physiological data). Musicians’ post hoc explanations of performance outcomes as well as performance context (e.g. recalling being distracted by a passing thought, remembering thinking about the acoustic rather than the performance) can also be useful for understanding their individual performances.
Taking this a step further, performers could be invited to collaborate in study design, data annotation, and analysis. Including musicians as co‑researchers can help to create richer datasets that include their perspectives on the performances and data‑collection processes and that are shaped by their input (Persson and Robson, 1995; see Burke and Onsman, 2017). Research strategies incorporating participant involvement also give participants a continuing, active interest in the data and its use, which in other fields has been shown to make people more likely to take part in research (Castonguay et al., 2025; Klompstra et al., 2025).
4.3.2 Listener
Listener evaluations of performances appear in only two datasets (The Con Espressione Game Dataset, PercePiano), and no datasets include listeners’ evaluations of the performer. More multimodal datasets with listener evaluations are needed to further our understanding of factors beyond the aural that determine listeners’ judgments of musical performances (Bugai et al., 2019; Tsay, 2013; Urbaniak and Mitchell, 2021; Vuoskoski et al., 2016). For datasets drawn from public music performances or publicly available recordings, an interesting possibility would be to include data on critical reception (e.g. competition rankings, concert reviews, comments on online platforms) and/or long‑term success metrics (e.g. radio plays, streaming statistics, album purchases), as these can offer insights into how interpretations are received and sustained over time.
5 Challenges
Producing multimodal datasets involves several key challenges. Here we focus on three: standardisation of data reporting, the organisation and storage of multimodal datasets, and the potential influences that invasive data‑collection tools may have on performers’ interpretations.
Standardisation of data reporting is a key challenge for the burgeoning field of multimodal performance studies. Even basic dataset characteristics such as the number of distinct performers, recordings, or compositions are not consistently reported in previous research (e.g. ASAP, GiantMIDI‑Piano, SMD MIDI‑Audio Piano Music Collection, QUARTET, MAESTRO). Moreover, any study that wishes to apply an analysis or machine‑learning pipeline over multiple datasets must grapple with different conventions for encoding multimodal data streams, such as MoCap and music‑theoretic annotations, or locate solutions for alignment and synchronisation of different data types (Müller et al., 2010; Weiß and Müller, 2024). This diversity in conventions comes in part from the interdisciplinary nature of the field, which includes researchers from music cognition, music psychology, MIR, performance studies, and other branches of empirical musicology.
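Pending shared standards, a common pragmatic workaround is to resample every data stream onto the audio analysis‑frame timeline; the NumPy‑only sketch below aligns an irregularly timestamped MoCap stream to audio frames this way, with all variable names and the fixed offset treated as assumptions about a given dataset.

```python
# Sketch: bring an irregularly timestamped MoCap stream onto the audio frame timeline.
# Assumes `mocap_t` (seconds), `mocap_xyz` of shape (n_samples, n_channels), audio
# sampling rate `sr`, analysis hop size `hop_length`, and `n_frames` audio frames.
import numpy as np

def align_mocap_to_audio(mocap_t, mocap_xyz, sr, hop_length, n_frames, offset=0.0):
    # `offset` compensates for any known start-time difference between the recordings.
    frame_t = np.arange(n_frames) * hop_length / sr + offset
    aligned = np.column_stack([
        np.interp(frame_t, mocap_t, mocap_xyz[:, ch])
        for ch in range(mocap_xyz.shape[1])
    ])
    return frame_t, aligned  # one MoCap row per audio analysis frame
```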
Some useful standards have already been presented through projects that include the Music Encoding Initiative46 and the Initiative Fusing Audio and Semantic Technology for Intelligent Music Production and Consumption (FAST) project47. These and other such programs have often taken advantage of technologies such as the Semantic Web to organise rich metadata about recordings, such as performer, recording context, and composition (Sandler et al., 2019), or to encode aspects of musical structure (Harley, 2019; Harley et al., 2015). However, these offer only partial solutions; for example, listener data, MoCap, and performer retrospections are not yet covered.
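For instance, recording‑level metadata of the kind Sandler et al. (2019) describe can be expressed with off‑the‑shelf tools; the sketch below uses rdflib and terms from the Music Ontology, with example.org URIs as placeholders and property choices that would need checking against the ontology for a real dataset.

```python
# Sketch: minimal Semantic Web metadata for one recorded performance using rdflib.
# The example.org URIs are placeholders; property choices are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS, FOAF

MO = Namespace("http://purl.org/ontology/mo/")  # Music Ontology

g = Graph()
g.bind("mo", MO)

performance = URIRef("http://example.org/dataset/performances/001")
performer = URIRef("http://example.org/dataset/performers/p01")

g.add((performance, RDF.type, MO.Performance))
g.add((performance, RDFS.label, Literal("Take 1, Studio A, close-miked")))
g.add((performer, RDF.type, MO.MusicArtist))
g.add((performer, FOAF.name, Literal("Performer P01 (pseudonym)")))
g.add((performance, MO.performer, performer))

print(g.serialize(format="turtle"))
```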
The organisation and storage of multimodal datasets is a second, related challenge (Jensenius, 2021). The repository must be long‑lived so that researchers can use datasets collected more than a few years ago. Generic repositories do exist for this (e.g. Zenodo), but they do not support easy exploration of the data. The online platform RepoVizz48 (Maestre et al., 2017) was an exciting solution tailored to multimodal performance studies; researchers could contribute diverse streams of synchronised data (e.g. audio, video, MoCap), and visitors could then explore these synchronised streams through a graphical user interface. However, this database was shut down in September 2025, and no comparable alternative has yet been made available.
The potential influences of data‑collection tools on players’ interpretations is a third challenge. Desire for additional data streams must be tempered by acknowledgement of the potentially adverse influence that such measurements can have on the ways musicians perform. For example, marker‑based MoCap involves performers wearing equipment that can interfere with their physical movements. Fortunately, less‑invasive data‑collection methods continue to be developed that can provide equally useful data for certain research projects (e.g. D’Amato et al., 2021; Jin et al., 2024). When intrusive data‑collection methods are necessary in recordings, it would be good practice to use methods to both validate the data for particular research applications and better understand how interpretations might have been affected by the data‑collection tools. One way to accomplish this would be through interviewing the musicians to gain their perspectives. Another would be to compare their performances recorded using more intrusive data‑collection methods with those recorded using less‑ or non‑intrusive data‑collection methods, such as those capturing audio, video, or MIDI.
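One possible form such a comparison could take, sketched under the assumption that note‑onset annotations exist for the same excerpt recorded under both conditions, is a paired test and correlation over inter‑onset intervals; the function names and statistical choices are illustrative only.

```python
# Sketch: compare expressive timing of the same excerpt recorded with and without
# intrusive equipment. Assumes note-by-note onset annotations (seconds) for both takes.
import numpy as np
from scipy import stats

def ioi(onsets):
    # Inter-onset intervals capture local expressive timing.
    return np.diff(np.asarray(onsets, dtype=float))

def compare_timing(onsets_intrusive, onsets_baseline):
    a, b = ioi(onsets_intrusive), ioi(onsets_baseline)
    n = min(len(a), len(b))  # guard against small annotation mismatches
    stat, p = stats.wilcoxon(a[:n], b[:n])  # paired, non-parametric test
    r = np.corrcoef(a[:n], b[:n])[0, 1]     # similarity of the two timing profiles
    return {"wilcoxon_stat": stat, "p_value": p, "profile_correlation": r}
```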
6 Practical Applications
In addition to contributing theoretical advances to empirical musicology and music psychology (as discussed above), expanded multimodal datasets have important potential practical applications.
In artificial intelligence, multimodal datasets can drive advances in several types of algorithms. Performer identification models have traditionally been restricted to audio and audio‑derived information (Edwards et al., 2023; Rafee et al., 2021; Zhang et al., 2022), but extending these models to multimodal inputs can help us to characterise performer differences in gestures, microexpressions, and performer intention. Multimodal datasets can also be used to extend the scope of music‑generation models, for example generating not just audio outputs but also performer movements and gestures, and modelling the relationship between stated intention and expressive outcomes. This could enable users to state an interpretative goal for a musical piece and to have the algorithm realise this goal musically. Lastly, multimodal datasets with listener evaluations could be a useful resource for music‑recommendation platforms, allowing them to take into account both the perceived quality of musical performances and their emotional impact.
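A minimal sketch of the early‑fusion approach implied here follows, assuming precomputed per‑excerpt audio and movement feature matrices saved as NumPy files; the file names, features, and classifier choice are hypothetical rather than drawn from any cited system.

```python
# Sketch: multimodal performer identification by simple feature concatenation.
# Assumes precomputed per-excerpt feature matrices and labels saved as .npy files
# (hypothetical file names); features might be tempo/dynamics statistics (audio)
# and keypoint-velocity statistics (video-derived movement).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_audio = np.load("audio_features.npy")    # shape: (n_excerpts, d_audio)
X_motion = np.load("motion_features.npy")  # shape: (n_excerpts, d_motion)
y = np.load("performer_labels.npy")        # shape: (n_excerpts,)

X = np.hstack([X_audio, X_motion])  # early fusion of the two modalities
clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
```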
Music pedagogy can also be linked to MIR for further insights. More profound understanding of expert expressive options, players’ rationales behind enacting them, the gestures they use to produce them, and how listeners and players alike perceive and respond to them could assist with demonstration and teaching. This could lead to the development of games and interactive learning environments to teach expressive strategies and interpretation ideas, extending research into technique and technical exercises underway in such projects as the Technology Enhanced Learning of Music Instruments49 (TELMI; Volpe et al., 2017) to the study of interpretation and interpretation formulation. Building on research into the influence of musical structure on listeners’ movements by Toiviainen and colleagues (2009), studies into how audience members may move differently in response to varied players’ interpretations could be used to teach different expressive techniques or to strategise performance movements. In turn, more profound understanding of the development, execution, and perception of different interpretative strategies could inform the further development of deep‑learning models of performance.
There are many different modalities that could be included in a multimodal musical dataset (Table 1), and it will rarely be realistic to cover all of these in individual studies. However, we hope that our taxonomy might help future researchers make an informed decision about the best combination to include in their own work, taking into account both what has been done before (Tables 2–4) and what is underrepresented (Section 4). Careful consideration of challenges such as the standardisation of data reporting and the organisation and storage of datasets (Section 5) will help researchers to maximise the long‑term impact of their research, as encouraged by such prizes as ISMIR’s ‘Test of Time Award’.
Funding Information
This work was supported by the UK Economic and Social Science Research Council [training grant reference number ES/P000738/1].
Competing Interests
The authors have no competing interests to declare.
Authors’ Contributions
KE: conceptualisation, funding acquisition, investigation, methodology, writing – original draft, writing – reviewing & editing.
PMCH: conceptualisation, funding acquisition, methodology, writing – reviewing & editing, supervision.
Notes
[15] All hyperlinks were last accessed on 14/11/25.
