Table 1
Proposed taxonomy of modalities relevant to expressive interpretations of musical scores.
| Data‑collection phase | Topic | Modality | Example(s) |
|---|---|---|---|
| Before the performance | Score | Engraved score | Western staff notation, printed |
| Before the performance | Score | Symbolic score | MusicXML, MIDI |
| Before the performance | Score | Score‑derived data | Music‑theoretic annotations/analyses |
| Before the performance | Instrument | Instrument characteristics | Type, model, mechanical properties, temperament |
| Before the performance | Venue | Venue visuals | Room lighting, capacity, aesthetics |
| Before the performance | Venue | Venue configuration | Layout, performer location |
| Before the performance | Venue | Venue acoustics | Size, shape, reverberance, acoustic anomalies |
| Before the performance | Performer | Performer biographical data | Age, gender, studies, pedagogical background, expertise |
| Before the performance | Performer | Performer physical attributes | Anthropometric measurements, state of physical fitness |
| Before the performance | Performer | Performer psychology | Personality |
| Before the performance | Performer | Interpretative set | Expressive intention or ideal |
| Before the performance | Listener | Listener biographical data | Personal, educational, professional/expertise |
| During the performance | Instrument | Instrument state | Tuning, responsiveness, mechanical wear |
| During the performance | Venue | Venue visuals | Room lighting, stage effects or decorations |
| During the performance | Venue | Venue configuration | Configuration of player(s) and instrument(s), presence and placement of intrusive recording equipment |
| During the performance | Venue | Venue acoustics | Humidity or temperature shifts |
| During the performance | Listener | Listener configuration | Presence, location |
| During the performance | Listener | Listener physiological state | Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking |
| During the performance | Performer | Performer movements | Performer gestures, performer–performer interaction, performer–audience interaction (e.g. video, motion capture [MoCap]) |
| During the performance | Performer | Performer physiological state | Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking |
| During the performance | Recording process | Recording setup | Type, settings, and positioning of recording equipment |
| During the performance | Recording process | Recording post‑production | Cuts, splices, equalisation, reverberation adjustment, file compression |
| During the performance | Performance sound | Audio | Mixed recording or master tracks |
| During the performance | Performance sound | Ambient noise | External noise bleed, HVAC system sounds, audience noise |
| During the performance | Performance sound | Audio‑derived data | Audio‑derived representations (e.g. spectrograms, loudness curves, audio–score alignment) |
| After the performance | Performer | Performer assessment of the interpretation | Reflection on the extent to which interpretational intent was carried out, post‑performance/ad hoc justification |
| After the performance | Performer | Performer evaluation of the experience | (Dis)comfort during performance, attention/distraction |
| After the performance | Listener | Listener evaluation of the performance | Physiological reactions, aesthetic preference, emotional response, attention, physical (dis)comfort during performance |
| After the performance | Listener | Listener evaluation of the performer | Stage presence, movements, facial expressions |
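
The taxonomy in Table 1 is a three-level hierarchy (data-collection phase, topic, modality). As a minimal sketch, assuming a Python encoding that is not part of the original work, each row can be stored as a triple and queried by modality name. The names `TAXONOMY`, `locate`, and `by_phase` are hypothetical and used only for illustration, and the examples column is left out.

```python
from collections import defaultdict

# Short labels for the three data-collection phases (an assumption of this sketch).
BEFORE, DURING, AFTER = "before", "during", "after"

# (phase, topic, modality) triples transcribed from Table 1.
# Plain ASCII hyphens are used in the modality names here.
TAXONOMY = [
    (BEFORE, "Score", "Engraved score"),
    (BEFORE, "Score", "Symbolic score"),
    (BEFORE, "Score", "Score-derived data"),
    (BEFORE, "Instrument", "Instrument characteristics"),
    (BEFORE, "Venue", "Venue visuals"),
    (BEFORE, "Venue", "Venue configuration"),
    (BEFORE, "Venue", "Venue acoustics"),
    (BEFORE, "Performer", "Performer biographical data"),
    (BEFORE, "Performer", "Performer physical attributes"),
    (BEFORE, "Performer", "Performer psychology"),
    (BEFORE, "Performer", "Interpretative set"),
    (BEFORE, "Listener", "Listener biographical data"),
    (DURING, "Instrument", "Instrument state"),
    (DURING, "Venue", "Venue visuals"),
    (DURING, "Venue", "Venue configuration"),
    (DURING, "Venue", "Venue acoustics"),
    (DURING, "Listener", "Listener configuration"),
    (DURING, "Listener", "Listener physiological state"),
    (DURING, "Performer", "Performer movements"),
    (DURING, "Performer", "Performer physiological state"),
    (DURING, "Recording process", "Recording setup"),
    (DURING, "Recording process", "Recording post-production"),
    (DURING, "Performance sound", "Audio"),
    (DURING, "Performance sound", "Ambient noise"),
    (DURING, "Performance sound", "Audio-derived data"),
    (AFTER, "Performer", "Performer assessment of the interpretation"),
    (AFTER, "Performer", "Performer evaluation of the experience"),
    (AFTER, "Listener", "Listener evaluation of the performance"),
    (AFTER, "Listener", "Listener evaluation of the performer"),
]


def locate(modality):
    """Return every (phase, topic) pair under which a modality is listed."""
    wanted = modality.casefold()
    return [(p, t) for p, t, m in TAXONOMY if m.casefold() == wanted]


def by_phase():
    """Group modality names by data-collection phase, in table order."""
    groups = defaultdict(list)
    for phase, _topic, modality in TAXONOMY:
        groups[phase].append(modality)
    return dict(groups)


if __name__ == "__main__":
    # The venue modalities are listed in two phases, so two pairs come back.
    print(locate("venue acoustics"))
```

Returning a list rather than a single pair reflects the fact that the three venue modalities are listed both before and during the performance.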
Table 2
Accessible multimodal music performance datasets assembled from extant recordings.
| Dataset | Citation | Modalities | # players | # pieces* | # recordings | Instrument(s) or ensemble | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| The Con Espressione Game Dataset<sup>2†</sup> | Cancino‑Chacón et al. (2020) | Engraved score (PDF); symbolic score (MusicXML); performer biographical data (name); listener biographical data (expertise); recording setup (inferable from album metadata); audio‑derived data (loudness curves, spectrograms, MIDI, audio–score alignment annotations); listener evaluation of performance | 26 | 9 | 45 | Piano | ‘Approximate’ from alignment/loudness curves |
| ASAP<sup>3</sup> | Foscarin et al. (2020) | Symbolic score (MusicXML, MIDI); score‑derived data (music‑theoretic annotations); audio (some); ambient noise; audio‑derived data (MIDI, audio–score alignment annotations) | Not listed | 236 | 1,067 | Piano | Automatic + manual |
| PianoMotion10M<sup>4</sup> | Gan et al. (2024) | Performer movements (video, video annotations); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | Not listed | 1,966 | Piano | Automatic + manual |
| CrestMusePEDB<sup>5†‡</sup> | Hashida et al. (2008) | Symbolic score (MusicXML, MIDI); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (rough audio–score deviation data) | Not listed | ~100 | Not listed | Piano | Manual |
| Guqin dataset<sup>6†</sup> | Huang et al. (2020) | Performer biographical data (name); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note, tempo, and dynamic annotations; technique annotations) | 5 | 10 | 39 | Guqin | None |
| GiantMIDI‑Piano<sup>7§</sup> | Kong et al. (2022) | Performer biographical data (name, dates, nationality); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | 10,855 | 10,855 | Piano | Automatic |
| MazurkaBL<sup>8†¶</sup> | Kosta et al. (2018) | Symbolic score (MusicXML); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (loudness, expression, score‑aligned beat annotations) | Not listed | 44 | 2,000 | Piano | None |
| GAPS<sup>9#</sup> | Riley et al. (2024) | Symbolic score (MusicXML); instrument characteristics (instrument‑specific features, tunings); venue visuals and configuration (inferable from performance video); performer biographical data (name, dates, nationality, gender); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 205 | Not listed | 300 | Guitar | None |
| CHARM Mazurka Project<sup>10†</sup> | Sapp (2007) | Recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (some MIDI, some tempo and dynamics data) | 135 | 49 | 2,926 | Piano | Manual |
| MusicNet<sup>11</sup> | Thickstun et al. (2017) | Performer biographical data (some names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note labels, MIDI) | Not listed | Not listed | 330 | Varied | None |
| SUPRA<sup>12</sup> | Shi et al. (2019) | Performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (‘raw’ MIDI hole file, ‘expressive’ MIDI dynamic hole file, piano roll image) | 151 | ~430 | 457 | Piano | Automatic + manual |
| Wagner Ring Dataset<sup>13†</sup> | Weiß et al. (2023) | Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment) | Not listed | Not listed | 16 | Orchestra and vocalists | None |
| Schubert Winterreise Dataset<sup>14†, ††</sup> | Weiß et al. (2021) | Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); instrument state (tuning); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 16 | 24 | 216 | Piano, voice | None |
| BPSD<sup>15</sup> | Zeitler et al. (2024) | Engraved score (PDF); symbolic score (Sibelius, MusicXML, MIDI); score‑derived data (music‑theoretic annotations); instrument characteristics (type); performer biographical data (names); recording setup (inferable from album metadata); audio (some); ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment) | 11 | 32 | 352 | Piano | Automatic + manual |
| ATEPP<sup>16†</sup> | Zhang et al. (2022) | Symbolic score (43% MusicXML); performer biographical data (names); recording setup (inferable from album metadata); audio‑derived data (MIDI) | 49 | 1,595 | 11,674 | Piano | Automatic |
[i] * Individual movements of a larger work are counted as separate pieces.
[ii] † Dataset contains album names to allow any audio recordings not present in the dataset to be purchased commercially.
[iii] ‡ This first edition of CrestMusePEDB contains this number of commercially released recordings. The second edition adds performances recorded by the researchers and appears in Table 3.
[iv] § A subset of 7,236 recordings includes composers’ names in recording titles.
[v] ¶ MazurkaBL is an extension of the CHARM Mazurka Project.
[vi] # GAPS contains videos of 205 performers, but it is unclear if any audio‑only data may include additional players.
[vii] †† Number of performers includes different piano accompanists.
Table 3
Purpose‑recorded, publicly available multimodal datasets: Solo.
| Dataset | Citation | Modalities | # players | # pieces* | # (N) or hours (H) of recording(s) | Instrument(s) | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| ChoraleBricks<sup>17</sup> | Balke et al. (2025) | Symbolic score (MEI, MusicXML, MIDI, CSV); instrument characteristics (type); performer biographical data (birth year); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 11 | 10 | 2.7 H | Flute, oboe, clarinet, trumpet, saxophone, baritone, trombone, tuba | None |
| Rach3 Dataset<sup>18</sup> | Cancino‑Chacón and Pilov (2024) | Symbolic score (MusicXML, MEI); instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type, settings); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI); performer experience (log, description, Mood State Questionnaire) | 1 | 1 | ~350 H | Piano | Automatic |
| Bach Violin Dataset<sup>19</sup> | Dong et al. (2022) | Symbolic score (MusicXML); performer biographical data (name); recording post‑production (mixing, file conversion); audio; ambient noise (inferable from audio recording); audio‑derived data (estimated audio–score alignment annotations) | 17 | 32 | 6.5 H | Violin | None |
| Bach10 Dataset<sup>20</sup> | Duan and Pardo (2012) | Symbolic score (MIDI); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | 10 | 40 N | Violin, clarinet, saxophone, bassoon | None |
| The Vienna 4×22 Piano Corpus<sup>21</sup> | Goebl (1999) | Symbolic score; instrument characteristics (type); venue visuals and acoustics (described size, building name); performer biographical data (education, expertise); recording setup (equipment type, positioning); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 22 | 4 | 88 N | Piano | Automatic |
| RWC Music Database (Classical Music)<sup>22†</sup> | Goto et al. (2002) | Performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | 19 | 42 | 42 N | Piano, violin, cello, flute, and others | Manual |
| CrestMusePEDB (2nd edition)<sup>5</sup> | Hashida et al. (2018) | Symbolic score (MusicXML/MIDI); instrument characteristics (type); venue visuals (described locations); performer biographical data (names); interpretative set; recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations) | 12 | 24 | 443 N | Piano | Automatic |
| MAESTRO<sup>23‡</sup> | Hawthorne et al. (2019) | Instrument characteristics (type); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–MIDI alignment) | 205 | ~864 | 1,276 N | Piano | Automatic |
| PercePiano<sup>24§</sup> | Park et al. (2024) | Symbolic score (MusicXML, MIDI); instrument characteristics (type); listener biographical data (expertise); audio‑derived data (MIDI); listener evaluation of performance (annotations [Likert scale]) | 25 | 1,202 excerpts | 1,202 N | Piano | Automatic |
| The Batik‑plays‑Mozart corpus<sup>25¶</sup> | Hu and Widmer (2023) | Engraved score (PDF); score‑derived data (music‑theoretic annotations by Hentschel et al., 2021); symbolic score (MusicXML); instrument characteristics (type); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (MIDI, MIDI–score alignment) | 1 | 36 | 36 N (3.75 H) | Piano | Automatic |
| SPD<sup>26</sup> | Jin et al. (2024) | Instrument characteristics (video); venue visuals and configuration (inferable from performance video); performer movements (video, 3D motion annotations); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording) | 9 | 120 | > 3 H | Cello, violin | None |
| SMD MIDI‑Audio Piano Music Collection<sup>27</sup> | Müller et al. (2011) | Instrument characteristics (type); venue configuration (described location); performer biographical data (expertise); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | Not listed | 38 | 50 N | Piano | Automatic |
| Piano Syllabus Dataset<sup>28</sup> | Ramoneda et al. (2025) | Instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, CQT, piano rolls) | Not listed | 7,901 | 7,901 N | Piano | Unclear |
| Piano gestures dataset<sup>29</sup> | Sarasúa et al. (2017) | Symbolic score (MIDI); instrument characteristics (type, video); venue visuals and configuration (inferable from performance video); interpretative set (researchers asked them to play in different ways); performer movements (video, MoCap); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 2 | 1 | 105 N | Piano | Automatic |
| Violin gestures dataset<sup>29</sup> | Sarasúa et al. (2017) | Performer biographical data (expertise); interpretative set (researchers asked them to play in different ways); performer movements (EMG, accelerometer, gyroscope); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 8 | 1 | 880 N | Violin | None |
| Telemann’s 12 Fantasias for Solo Flute<sup>30</sup> | Thibaud et al. (2025) | Engraved score (PDF); symbolic score (MEI, MSCZ); instrument characteristics (type); performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 6 | 12 | 72 N | Flute | None |
| ARME Virtuoso Strings Dataset<sup>31†</sup> | Tomczak et al. (2023) | Engraved score (PNG); interpretative set (researchers asked them to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations) | 4 | 5 | 746 N | Viola, violin, cello | None |
| CBFdataset<sup>32</sup> | Wang et al. (2022) | Venue visuals and acoustics (described room type); performer biographical data (expertise); performer movements (playing technique annotations); recording setup (equipment type); audio; ambient noise (inferable from audio recording) | 10 | 4 | 2.6 H | Chinese bamboo flute (2 types) | None |
| CCOM‑HuQin<sup>33</sup> | Zhang et al. (2023) | Engraved score (PDF); symbolic score (MusicXML); instrument characteristics (type); venue visuals (described type); performer biographical data (expertise, names); venue visuals (inferable from performance video); venue configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations); performer assessment of interpretation (notes of applied techniques) | 8 | 57 | 1.28 H | Erhu, Banhu, Gaohu, Zhuihu, Zhonghu | None |
[i] * Individual movements of a larger work are counted as separate pieces. For the Telemann Fantasias, each complete Fantasia is counted as one recording here, since the dataset presents each Fantasia whole rather than dividing it into sections.
[ii] † RWC Music Database and ARME Virtuoso Strings Dataset also contain ensemble performances and so are shown in Tables 3 and 4.
[iii] ‡ Zhang and colleagues (2022) determined this number of performers by connecting names with the dataset’s recordings.
[iv] § PercePiano expanded upon data organised by MAESTRO with symbolic data (score and annotations).
[v] ¶ Audio recordings may be purchased commercially.
Table 4
Purpose‑recorded, publicly available multimodal datasets: Ensemble.
| Dataset | Citation | Modalities | # players | # pieces* | # (N) or hours (H) of recording(s) | Instrument(s) and/or ensemble type(s) | Performance MIDI transcription |
|---|---|---|---|---|---|---|---|
| Choral Singing Dataset<sup>34</sup> | Cuesta et al. (2018) | Performer biographical data (ensemble name); recording setup (equipment type and placement); audio; ambient noise; audio‑derived data (MIDI) | 16 | 3 | 48 N | Voice | Automatic + manual |
| Quartet Body Motion and Pupillometry Dataset<sup>35</sup> | Bishop and Jensenius (2020) | Instrument characteristics (video); venue visuals (described location); performer biographical data (expertise); listener biographical data (expertise); venue visuals and configuration (inferable from video recording); performer movements (MoCap, video); performer physiological state (eye tracking); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); performer experience (difficulty ratings) | 5 | 4 | 9 N | String quartet | None |
| RWC Music Database (Classical Music)<sup>22†</sup> | Goto et al. (2002) | Performer biographical data (ensemble name); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI) | ~96 | 20 | 20 N | Orchestra, chamber ensembles | Manual |
| URMP<sup>36</sup> | Li et al. (2018) | Engraved score (PDF); symbolic score (MIDI); instrument characteristics (video); venue visuals (described); performer biographical data (expertise); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio annotations) | 23 | 44 | 44 N | String duo, trio, quartet, quintet | None |
| EEP<sup>37</sup> | Marchini et al. (2014) | Engraved score (PDF); performer biographical data (expertise); performer movements (MoCap, bowing annotations); recording setup (equipment type); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | 5 | 23 N | String quartet | None |
| QUARTET<sup>38</sup> | Papiotis (2015) | Instrument characteristics (video); venue visuals (described); venue visuals and configuration (inferable from performance video); performer movements (MoCap, video); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations) | 4 | Not listed | 96 N | String quartet | None |
| Erkomaishvili Dataset<sup>39</sup> | Rosenzweig et al. (2020) | Symbolic score (MusicXML); performer biographical data (name); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (performed note‑onset annotations, fundamental frequency) | 1 | 101 | 101 N (7 H) | Voice | None |
| PHENICX‑conduct dataset<sup>40</sup> | Sarasúa (2017) | Symbolic score (MusicXML/MIDI); instrument characteristics (types); venue visuals (described location); performer biographical data (ensemble name); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup; audio; ambient noise (inferable from audio recording) | Not listed | 3 excerpts | 75 N | Orchestra | None |
| ARME Virtuoso Strings Dataset<sup>31†</sup> | Tomczak et al. (2023) | Engraved score (PNG); interpretative set (researchers asked them to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations) | 4 | 5 | 746 N | String duo, trio, quartet | None |
