References
- Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 451–466. DOI: 10.1007/978-3-030-01246-5_27
- Arzt, A., & Widmer, G. (2010). Simple tempo models for real-time music tracking. In Proceedings of the 7th Sound and Music Computing Conference (SMC).
- Arzt, A., Widmer, G., & Dixon, S. (2008). Automatic page turning for musicians via real-time machine listening. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 241–245.
- Askenfelt, A. (1989). Measurement of the bowing parameters in violin playing. II: Bow–bridge distance, dynamic range, and limits of bow force. The Journal of the Acoustical Society of America, 86(2): 503–516. DOI: 10.1121/1.398230
- Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1): 1–31. DOI: 10.1007/s11263-010-0390-2
- Barzelay, Z., & Schechner, Y. Y. (2007). Harmony in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. DOI: 10.1109/CVPR.2007.383344
- Barzelay, Z., & Schechner, Y. Y. (2010). Onsets coincidence for cross-modal analysis. IEEE Transactions on Multimedia, 12(2): 108–120. DOI: 10.1109/TMM.2009.2037387
- Bazzica, A., Liem, C. C., & Hanjalic, A. (2014). Exploiting instrument-wise playing/non-playing labels for score synchronization of symphonic music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 201–206.
- Bello, J. P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., & Sandler, M. (2005). A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5): 1035–1047. DOI: 10.1109/TSA.2005.851998
- Burkholder, J. P., & Grout, D. J. (2014). A History of Western Music: Ninth International Student Edition. W. W. Norton & Company, Inc.
- Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 7291–7299. DOI: 10.1109/CVPR.2017.143
- Casanovas, A. L., Monaci, G., Vandergheynst, P., & Gribonval, R. (2010). Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia, 12(5): 358–371. DOI: 10.1109/TMM.2010.2050650
- Casanovas, A. L., & Vandergheynst, P. (2010). Nonlinear video diffusion based on audio-video synchrony. Unpublished. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.187.4688
- Cutler, R., & Davis, L. (2000). Look who’s talking: Speaker detection using video and audio correlation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 1589–1592. DOI: 10.1109/ICME.2000.871073
- Dahl, S. (2004). Playing the accent—Comparing striking velocity and timing in an ostinato rhythm performed by four drummers. Acta Acustica united with Acustica, 90(4): 762–776.
- Dinesh, K., Li, B., Liu, X., Duan, Z., & Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
- Dixon, S. (2005). Live tracking of musical performances using on-line time warping. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages 92–97.
- Duan, Z., Essid, S., Liem, C., Richard, G., & Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
- Duan, Z., & Pardo, B. (2011a). Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6): 1205–1215. DOI: 10.1109/JSTSP.2011.2159701
- Duan, Z., & Pardo, B. (2011b). A state space model for online polyphonic audio-score alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 197–200. DOI: 10.1109/ICASSP.2011.5946374
- Duan, Z., Pardo, B., & Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8): 2121–2133. DOI: 10.1109/TASL.2010.2042119
- Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
- Ewert, S., Pardo, B., Müller, M., & Plumbley, M. D. (2014). Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3): 116–124. DOI: 10.1109/MSP.2013.2296076
- Fisher, J. W., & Darrell, T. (2004). Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia, 6(3): 406–413. DOI: 10.1109/TMM.2004.827503
- Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), volume 3, pages 36–54. DOI: 10.1007/978-3-030-01219-9_3
- Geringer, J. M., MacLeod, R. B., & Allen, M. L. (2010). Perceived pitch of violin and cello vibrato tones among music majors. Journal of Research in Music Education, 57(4): 351–363. DOI: 10.1177/0022429409350510
- Godøy, R. I., & Jensenius, A. R. (2009). Body movement in music information retrieval. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 45–50.
- Grubb, L., & Dannenberg, R. (1997). A stochastic method of tracking a vocal performer. In Proceedings of the International Computer Music Conference (ICMC), pages 301–308.
- Izadinia, H., Saleemi, I., & Shah, M. (2013). Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2): 378–390. DOI: 10.1109/TMM.2012.2228476
- Kidron, E., Schechner, Y. Y., & Elad, M. (2007). Cross-modal localization via sparsity. IEEE Transactions on Signal Processing, 55(4): 1390–1404. DOI: 10.1109/TSP.2006.888095
- Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1–2): 83–97. DOI: 10.1002/nav.3800020109
- Li, B., Dinesh, K., Duan, Z., & Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
- Li, B., Dinesh, K., Sharma, G., & Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 123–130.
- Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2018a). Data from: “Creating a multi-track classical music performance dataset for multi-modal music analysis: Challenges, insights, and applications.” Dryad Digital Repository. DOI: 10.5061/dryad.ng3r749
- Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2019). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
- Li, B., Maezawa, A., & Duan, Z. (2018b). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Li, B., Xu, C., & Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing Conference (SMC), pages 159–166.
- Li, K., Ye, J., & Hua, K. A. (2014). What’s making that sound? In Proceedings of the ACM International Conference on Multimedia, pages 147–156. DOI: 10.1145/2647868.2654936
- Liu, Y., & Sato, Y. (2008). Finding speaker face region by audiovisual correlation. In Proceedings of the Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2).
- Müller, M. (2007). Dynamic time warping. In Information Retrieval for Music and Motion, chapter 4, pages 69–84. Springer. DOI: 10.1007/978-3-540-74048-3_4
- Müller, M. (2015). Music synchronization. In Fundamentals of Music Processing, chapter 3, pages 115–166. Springer. DOI: 10.1007/978-3-319-21945-5_3
- Müller, M., Mattes, H., & Kurth, F. (2006). An efficient multiscale approach to audio synchronization. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Obata, S., Nakahara, H., Hirano, T., & Kinoshita, H. (2009). Fingering force in violin vibrato. In Proceedings of the International Symposium on Performance Science, volume 429.
- Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), volume 6, pages 639–658. DOI: 10.1007/978-3-030-01231-1_39
- Paleari, M., Huet, B., Schutz, A., & Slock, D. (2008). A multimodal approach to music transcription. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 93–96. DOI: 10.1109/ICIP.2008.4711699
- Palmer, C., Carter, C., Koopmans, E., & Loehr, J. D. (2007). Movement, planning, and music: Motion coordinates of skilled performance. In Proceedings of the International Conference on Music Communication Science, pages 119–122. University of New South Wales.
- Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017a). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
- Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017b). Guiding audio source separation by video object information. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 61–65. DOI: 10.1109/WASPAA.2017.8169995
- Parncutt, R., & McPherson, G. (2002). The Science and Psychology of Music Performance: Creative Strategies for Teaching and Learning. Oxford University Press.
- Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & Kweon, I. S. (2018). Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366. DOI: 10.1109/CVPR.2018.00458
- Sigg, C., Fischer, B., Ommer, B., Roth, V., & Buhmann, J. (2007). Nonnegative CCA for audiovisual source separation. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, pages 253–258. DOI: 10.1109/MLSP.2007.4414315
- Sörgjerd, M. (2000). Auditory and Visual Recognition of Emotional Expression in Performance of Music. PhD thesis, Uppsala Universitet, Institutionen för Psykologi.
- Sun, D., Roth, S., & Black, M. J. (2010). Secrets of optical flow estimation and their principles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. DOI: 10.1109/CVPR.2010.5539939
- Thomas, V., Fremerey, C., Damm, D., & Clausen, M. (2009). SLAVE: A score-lyrics-audio-video explorer. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), volume 2, pages 252–268. DOI: 10.1007/978-3-030-01216-8_16
- Tsay, C.-J. (2014). The vision heuristic: Judging music ensembles by sight alone. Organizational Behavior and Human Decision Processes, 124(1): 24–33. DOI: 10.1016/j.obhdp.2013.10.003
- Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
