Multimodal Datasets for Studying Expert Performances of Musical Scores

Katelyn Emerson; Peter M. C. Harrison

doi:10.5334/tismir.230

Figures & Tables

Table 1

Proposed taxonomy of modalities relevant to expressive interpretations of musical scores.

Data‑collection phase	Topic	Modality	Example(s)
Before the performance	Score	Engraved score	Western staff notation, printed
		Symbolic score	MusicXML, MIDI
		Score‑derived data	Music‑theoretic annotations/analyses
	Instrument	Instrument characteristics	Type, model, mechanical properties, temperament
	Venue	Venue visuals	Room lighting, capacity, aesthetics
		Venue configuration	Layout, performer location
		Venue acoustics	Size, shape, reverberance, acoustic anomalies
	Performer	Performer biographical data	Age, gender, studies, pedagogical background, expertise
		Performer physical attributes	Anthropometric measurements, state of physical fitness
		Performer psychology	Personality
		Interpretative set	Expressive intention or ideal
	Listener	Listener biographical data	Personal, educational, professional/expertise
During the performance	Instrument	Instrument state	Tuning, responsiveness, mechanical wear
	Venue	Venue visuals	Room lighting, stage effects or decorations
		Venue configuration	Configuration of player(s) and instrument(s), presence and placement of intrusive recording equipment
		Venue acoustics	Humidity or temperature shifts
	Listener	Listener configuration	Presence, location
		Listener physiological state	Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking
	Performer	Performer movements	Performer gestures, performer–performer interaction, performer–audience interaction (e.g. video, motion capture [MoCap])
		Performer physiological state	Heart rate, skin conductance, respiration, pupil dilation, brain activity, eye tracking
	Recording process	Recording setup	Type, settings, and positioning of recording equipment
		Recording post‑production	Cuts, splices, equalisation, reverberation adjustment, file compression
	Performance sound	Audio	Mixed recording or master tracks
		Ambient noise	External noise bleed, HVAC system sounds, audience noise
		Audio‑derived data	Audio‑derived representations (e.g. spectrograms, loudness curves, audio–score alignment)
After the performance	Performer	Performer assessment of the interpretation	Reflection on the extent to which interpretational intent was carried out, post‑performance/ad hoc justification
		Performer evaluation of the experience	(Dis)comfort during performance, attention/distraction
	Listener	Listener evaluation of the performance	Physiological reactions, aesthetic preference, emotional response, attention, physical (dis)comfort during performance
		Listener evaluation of the performer	Stage presence, movements, facial expressions

Table 2

Accessible multimodal music performance datasets assembled from extant recordings.

Dataset	Citation	Modalities	# players	# pieces*	# recordings	Instrument(s) or ensemble	Performance MIDI transcription
The Con Espressione Game Dataset^² †	Cancino‑Chacón et al. (2020)	Engraved score (PDF); symbolic score (MusicXML); performer biographical data (name); listener biographical data (expertise); recording setup (inferable from album metadata); audio‑derived data (loudness curves, spectrograms, MIDI, audio–score alignment annotations); listener evaluation of performance	26	9	45	Piano	‘Approximate’ from alignment/loudness curves
ASAP^³	Foscarin et al. (2020)	Symbolic score (MusicXML, MIDI); score‑derived data (music‑theoretic annotations); audio (some); ambient noise; audio‑derived data (MIDI, audio–score alignment annotations)	Not listed	236	1,067	Piano	Automatic + manual
PianoMotion10M^⁴	Gan et al. (2024)	Performer movements (video, video annotations); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI)	Not listed	Not listed	1,966	Piano	Automatic + manual
CrestMusePEDB^⁵†‡	Hashida et al. (2008)	Symbolic score (MusicXML, MIDI); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (rough audio–score deviation data)	Not listed	~100	Not listed	Piano	Manual
Guqin dataset^⁶†	Huang et al. (2020)	Performer biographical data (name); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note, tempo, and dynamic annotations; technique annotations)	5	10	39	Guqin	None
GiantMIDI‑Piano^⁷§	Kong et al. (2022)	Performer biographical data (name, dates, nationality); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI)	Not listed	10,855	10,855	Piano	Automatic
MazurkaBL^⁸†¶	Kosta et al. (2018)	Symbolic score (MusicXML); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (loudness, expression, score‑aligned beat annotations)	Not listed	44	2,000	Piano	None
GAPS^⁹#	Riley et al. (2024)	Symbolic score (MusicXML); instrument characteristics (instrument‑specific features, tunings); venue visuals and configuration (inferable from performance video); performer biographical data (name, dates, nationality, gender); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations)	205	Not listed	300	Guitar	None
CHARM Mazurka Project^¹⁰†	Sapp (2007)	Recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (some MIDI, some tempo and dynamics data)	135	49	2,926	Piano	Manual
MusicNet^¹¹	Thickstun et al. (2017)	Performer biographical data (some names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (note labels, MIDI)	Not listed	Not listed	330	Varied	None
SUPRA^¹²	Shi et al. (2019)	Performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (‘raw’ MIDI hole file, ‘expressive’ MIDI dynamic hole file, piano roll image)	151	~430	457	Piano	Automatic + manual
Wagner Ring Dataset^¹³†	Weiß et al. (2023)	Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment)	Not listed	Not listed	16	Orchestra and vocalists	None
Schubert Winterreise Dataset^¹⁴† ††	Weiß et al. (2021)	Engraved score (PDF); symbolic score (MusicXML, MIDI); score‑derived data (structural annotations, music‑theoretic annotations); performer biographical data (names); instrument state (tuning); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	16	24	216	Piano, voice	None
BPSD^¹⁵	Zeitler et al. (2024)	Engraved score (PDF); symbolic score (Sibelius, MusicXML, MIDI); score‑derived data (music‑theoretic annotations); instrument characteristics (type); performer biographical data (names); recording setup (inferable from album metadata); audio (some); ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment)	11	32	352	Piano	Automatic + manual
ATEPP^¹⁶†	Zhang et al. (2022)	Symbolic score (43% MusicXML); performer biographical data (names); recording setup (inferable from album metadata); audio‑derived data (MIDI)	49	1,595	11,674	Piano	Automatic

[i] * Individual movements of a larger work are counted as separate pieces.

[ii] † Dataset contains album names to allow any audio recordings not present in the dataset to be purchased commercially.

[iii] ‡ This first edition of CrestMusePEDB contains this number of commercially released recordings. The second edition adds performances recorded by the researchers and appears in Table 3.

[iv] § A subset of 7,236 recordings includes composers’ names in recording titles.

[v] ¶ MazurkaBL is an extension of the CHARM Mazurka Project.

[vi] # GAPS contains videos of 205 performers, but it is unclear whether the audio‑only data include additional players.

[vii] †† Number of performers includes different piano accompanists.

Table 3

Purpose‑recorded, publicly available multimodal datasets: Solo.

Dataset	Citation	Modalities	# players	# pieces*	# (N) or hours (H) of recording(s)	Instrument(s)	Performance MIDI transcription
ChoraleBricks^¹⁷	Balke et al. (2025)	Symbolic score (MEI, MusicXML, MIDI, CSV); instrument characteristics (type); performer biographical data (birth year); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	11	10	2.7 H	Flute, oboe, clarinet, trumpet, saxophone, baritone, trombone, tuba	None
Rach3 Dataset^¹⁸	Cancino‑Chacón and Pilov (2024)	Symbolic score (MusicXML, MEI); instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type, settings); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI); performer experience (log, description, Mood State Questionnaire)	1	1	~350 H	Piano	Automatic
Bach Violin Dataset^¹⁹	Dong et al. (2022)	Symbolic score (MusicXML); performer biographical data (name); recording post‑production (mixing, file conversion); audio; ambient noise (inferable from audio recording); audio‑derived data (estimated audio–score alignment annotations)	17	32	6.5 H	Violin	None
Bach10 Dataset^²⁰	Duan and Pardo (2012)	Symbolic score (MIDI); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	4	10	40 N	Violin, clarinet, saxophone, bassoon	None
The Vienna 4×22 Piano Corpus^²¹	Goebl (1999)	Symbolic score; instrument characteristics (type); venue visuals and acoustics (described size, building name); performer biographical data (education, expertise); recording setup (equipment type, positioning); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations)	22	4	88 N	Piano	Automatic
RWC Music Database (Classical Music)^²²†	Goto et al. (2002)	Performer biographical data (names); recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI)	19	42	42 N	Piano, violin, cello, flute, and others	Manual
CrestMusePEDB (2nd edition)^⁵	Hashida et al. (2018)	Symbolic score (MusicXML/MIDI); instrument characteristics (type); venue visuals (described locations); performer biographical data (names); interpretative set; recording setup (inferable from album metadata); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–score alignment annotations)	12	24	443 N	Piano	Automatic
MAESTRO^²³‡	Hawthorne et al. (2019)	Instrument characteristics (type); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, audio–MIDI alignment)	205	~864	1,276 N	Piano	Automatic
PercePiano^²⁴§	Park et al. (2024)	Symbolic score (MusicXML, MIDI); instrument characteristics (type); listener biographical data (expertise); audio‑derived data (MIDI); listener evaluation of performance (annotations [Likert scale])	25	1,202 excerpts	1,202 N	Piano	Automatic
The Batik‑plays‑Mozart corpus^²⁵¶	Hu and Widmer (2023)	Engraved score (PDF); score‑derived data (music‑theoretic annotations by Hentschel et al., 2021); symbolic score (MusicXML); instrument characteristics (type); performer biographical data (name); recording setup (inferable from album metadata); audio‑derived data (MIDI, MIDI–score alignment)	1	36	36 N (3.75 H)	Piano	Automatic
SPD^²⁶	Jin et al. (2024)	Instrument characteristics (video); venue visuals and configuration (inferable from performance video); performer movements (video, 3D motion annotations); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording)	9	120	> 3 H	Cello, violin	None
SMD MIDI‑Audio Piano Music Collection^²⁷	Müller et al. (2011)	Instrument characteristics (type); venue configuration (described location); performer biographical data (expertise); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI)	Not listed	38	50 N	Piano	Automatic
Piano Syllabus Dataset^²⁸	Ramoneda et al. (2025)	Instrument characteristics (type); venue visuals and configuration (inferable from performance video); performer movements (video); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI, CQT, piano rolls)	Not listed	7,901	7,901 N	Piano	Unclear
Piano gestures dataset^²⁹	Sarasúa et al. (2017)	Symbolic score (MIDI); instrument characteristics (type, video); venue visuals and configuration (inferable from performance video); interpretative set (researchers asked them to play different ways); performer movements (video, MoCap); recording setup (equipment type); audio; ambient noise (inferable from audio recording)	2	1	105 N	Piano	Automatic
Violin gestures dataset^²⁹	Sarasúa et al. (2017)	Performer biographical data (expertise); interpretative set (researchers asked them to play different ways); performer movements (EMG, accelerometer, gyroscope); recording setup (equipment type); audio; ambient noise (inferable from audio recording)	8	1	880 N	Violin	None
Telemann’s 12 Fantasias for Solo Flute^³⁰	Thibaud et al. (2025)	Engraved score (PDF); symbolic score (MEI, MSCZ); instrument characteristics (type); performer biographical data (name); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	6	12	72 N	Flute	None
ARME Virtuoso Strings Dataset^³¹†	Tomczak et al. (2023)	Engraved score (PNG); interpretative set (researchers asked them to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations)	4	5	746 N	Viola, violin, cello	None
CBFdataset^³²	Wang et al. (2022)	Venue visuals and acoustics (described room type); performer biographical data (expertise); performer movements (playing technique annotations); recording setup (equipment type); audio; ambient noise (inferable from audio recording)	10	4	2.6 H	Chinese bamboo flute (2 types)	None
CCOM‑HuQin^³³	Zhang et al. (2023)	Engraved score (PDF); symbolic score (MusicXML); instrument characteristics (type); venue visuals (described type); performer biographical data (expertise, names); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations); performer assessment of interpretation (notes of applied techniques)	8	57	1.28 H	Erhu, Banhu, Gaohu, Zhuihu, Zhonghu	None

[i] *Individual movements of a larger work are counted as separate pieces. For the Telemann Fantasias, each complete Fantasia is counted as one recording here, since the dataset presents each Fantasia whole rather than dividing it into sections.

[ii] †RWC Music Database and ARME Virtuoso Strings Dataset also contain ensemble performances and so are shown in Tables 3 and 4.

[iii] ‡Zhang and colleagues (2022) determined this number of performers by connecting names with the dataset’s recordings.

[iv] §PercePiano expanded upon data organised by MAESTRO with symbolic data (score and annotations).

[v] ¶Audio recordings may be purchased commercially.

Table 4

Purpose‑recorded, publicly available multimodal datasets: Ensemble.

Dataset	Citation	Modalities	# players	# pieces*	# (N) or hours (H) of recording(s)	Instrument(s) and/or ensemble type(s)	Performance MIDI transcription
Choral Singing Dataset^³⁴	Cuesta et al. (2018)	Performer biographical data (ensemble name); recording setup (equipment type and placement); audio; ambient noise; audio‑derived data (MIDI)	16	3	48 N	Voice	Automatic + manual
Quartet Body Motion and Pupillometry Dataset^³⁵	Bishop and Jensenius (2020)	Instrument characteristics (video); venue visuals (described location); performer biographical data (expertise); listener biographical data (expertise); venue visuals and configuration (inferable from video recording); performer movements (MoCap, video); performer physiological state (eye tracking); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); performer experience (difficulty ratings)	5	4	9 N	String quartet	None
RWC Music Database (Classical Music)^²²†	Goto et al. (2002)	Performer biographical data (ensemble name); audio; ambient noise (inferable from audio recording); audio‑derived data (MIDI)	~96	20	20 N	Orchestra, chamber ensembles	Manual
URMP^³⁶	Li et al. (2018)	Engraved score (PDF); symbolic score (MIDI); instrument characteristics (video); venue visuals (described); performer biographical data (expertise); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup (equipment type and placement); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio annotations)	23	44	44 N	String duo, trio, quartet, quintet	None
EEP^³⁷	Marchini et al. (2014)	Engraved score (PDF); performer biographical data (expertise); performer movements (MoCap, bowing annotations); recording setup (equipment type); recording post‑production; audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	4	5	23 N	String quartet	None
QUARTET^³⁸	Papiotis (2015)	Instrument characteristics (video); venue visuals (described); venue visuals and configuration (inferable from performance video); performer movements (MoCap, video); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (audio–score alignment annotations)	4	Not listed	96 N	String quartet	None
Erkomaishvili Dataset^³⁹	Rosenzweig et al. (2020)	Symbolic score (MusicXML); performer biographical data (name); recording setup (equipment type); audio; ambient noise (inferable from audio recording); audio‑derived data (performed note‑onset annotations, fundamental frequency)	1	101	101 N (7 H)	Voice	None
PHENICX‑conduct dataset^⁴⁰	Sarasúa (2017)	Symbolic score (MusicXML/MIDI); instrument characteristics (types); venue visuals (described location); performer biographical data (ensemble name); venue visuals and configuration (inferable from performance video); performer movements (video); recording setup; audio; ambient noise (inferable from audio recording)	Not listed	3 excerpts	75 N	Orchestra	None
ARME Virtuoso Strings Dataset^³¹†	Tomczak et al. (2023)	Engraved score (PNG); interpretative set (researchers asked them to play in different ways); recording setup (equipment type and placement); audio; ambient noise (inferable from audio recording); audio‑derived data (note‑onset annotations)	4	5	746 N	String duo, trio, quartet	None

[i] *Individual movements of a larger work are counted as separate pieces.

[ii] †RWC Music Database and ARME Virtuoso Strings Dataset also contain ensemble performances and so are shown in Tables 3 and 4.