References
- Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 451–466. DOI: 10.1007/978-3-030-01246-5_27
- Arzt, A., & Widmer, G. (2010). Simple tempo models for real-time music tracking. In Proceedings of the 7th Sound and Music Computing Conference (SMC).
- Arzt, A., Widmer, G., & Dixon, S. (2008). Automatic page turning for musicians via real-time machine listening. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 241–245.
- Askenfelt, A. (1989). Measurement of the bowing parameters in violin playing. II: Bow–bridge distance, dynamic range, and limits of bow force. The Journal of the Acoustical Society of America, 86(2): 503–516. DOI: 10.1121/1.398230
- Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1): 1–31. DOI: 10.1007/s11263-010-0390-2
- Barzelay, Z., & Schechner, Y. Y. (2007). Harmony in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. DOI: 10.1109/CVPR.2007.383344
- Barzelay, Z., & Schechner, Y. Y. (2010). Onsets coincidence for cross-modal analysis. IEEE Transactions on Multimedia, 12(2): 108–120. DOI: 10.1109/TMM.2009.2037387
- Bazzica, A., Liem, C. C., & Hanjalic, A. (2014). Exploiting instrument-wise playing/non-playing labels for score synchronization of symphonic music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 201–206.
- Bello, J. P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., & Sandler, M. (2005). A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5): 1035–1047. DOI: 10.1109/TSA.2005.851998
- Burkholder, J. P., & Grout, D. J. (2014). A History of Western Music: Ninth International Student Edition. W. W. Norton & Company, Inc.
- Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 7291–7299. DOI: 10.1109/CVPR.2017.143
- Casanovas, A. L., Monaci, G., Vandergheynst, P., & Gribonval, R. (2010). Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia, 12(5): 358–371. DOI: 10.1109/TMM.2010.2050650
- Casanovas, A. L., & Vandergheynst, P. (2010). Nonlinear video diffusion based on audio-video synchrony. Unpublished. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.187.4688
- Cutler, R., & Davis, L. (2000). Look who’s talking: Speaker detection using video and audio correlation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 1589–1592. DOI: 10.1109/ICME.2000.871073
- Dahl, S. (2004). Playing the accent—Comparing striking velocity and timing in an ostinato rhythm performed by four drummers. Acta Acustica united with Acustica, 90(4): 762–776.
- Dinesh, K., Li, B., Liu, X., Duan, Z., & Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
- Dixon, S. (2005). Live tracking of musical performances using on-line time warping. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages 92–97.
- Duan, Z., Essid, S., Liem, C., Richard, G., & Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
- Duan, Z., & Pardo, B. (2011a). Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6): 1205–1215. DOI: 10.1109/JSTSP.2011.2159701
- Duan, Z., & Pardo, B. (2011b). A state space model for online polyphonic audio-score alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 197–200. DOI: 10.1109/ICASSP.2011.5946374
- Duan, Z., Pardo, B., & Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8): 2121–2133. DOI: 10.1109/TASL.2010.2042119
- Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
- Ewert, S., Pardo, B., Müller, M., & Plumbley, M. D. (2014). Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3): 116–124. DOI: 10.1109/MSP.2013.2296076
- Fisher, J. W., & Darrell, T. (2004). Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia, 6(3): 406–413. DOI: 10.1109/TMM.2004.827503
- Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), volume 3, pages 36–54. DOI: 10.1007/978-3-030-01219-9_3
- Geringer, J. M., MacLeod, R. B., & Allen, M. L. (2010). Perceived pitch of violin and cello vibrato tones among music majors. Journal of Research in Music Education, 57(4): 351–363. DOI: 10.1177/0022429409350510
- Godøy, R. I., & Jensenius, A. R. (2009). Body movement in music information retrieval. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 45–50.
- Grubb, L., & Dannenberg, R. (1997). A stochastic method of tracking a vocal performer. In Proceedings of the International Computer Music Conference (ICMC), pages 301–308.
- Izadinia, H., Saleemi, I., & Shah, M. (2013). Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2): 378–390. DOI: 10.1109/TMM.2012.2228476
- Kidron, E., Schechner, Y. Y., & Elad, M. (2007). Cross-modal localization via sparsity. IEEE Transactions on Signal Processing, 55(4): 1390–1404. DOI: 10.1109/TSP.2006.888095
- Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1–2): 83–97. DOI: 10.1002/nav.3800020109
- Li, B., Dinesh, K., Duan, Z., & Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
- Li, B., Dinesh, K., Sharma, G., & Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 123–130.
- Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2018a). Data from: “Creating a multi-track classical music performance dataset for multi-modal music analysis: Challenges, insights, and applications.” Dryad Digital Repository. DOI: 10.5061/dryad.ng3r749
- Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2019). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
- Li, B., Maezawa, A., & Duan, Z. (2018b). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Li, B., Xu, C., & Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing Conference (SMC), pages 159–166.
- Li, K., Ye, J., & Hua, K. A. (2014). What’s making that sound? In Proceedings of the ACM International Conference on Multimedia, pages 147–156. DOI: 10.1145/2647868.2654936
- Liu, Y., & Sato, Y. (2008). Finding speaker face region by audiovisual correlation. In Proceedings of the Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2).
- Müller, M. (2007). Dynamic time warping. In Information Retrieval for Music and Motion, chapter 4, pages 69–84. Springer. DOI: 10.1007/978-3-540-74048-3_4
- Müller, M. (2015). Music synchronization. In Fundamentals of Music Processing, chapter 3, pages 115–166. Springer. DOI: 10.1007/978-3-319-21945-5_3
- Müller, M., Mattes, H., & Kurth, F. (2006). An efficient multiscale approach to audio synchronization. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Obata, S., Nakahara, H., Hirano, T., & Kinoshita, H. (2009). Fingering force in violin vibrato. In Proceedings of the International Symposium on Performance Science, volume 429.
- Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), volume 6, pages 639–658. DOI: 10.1007/978-3-030-01231-1_39
- Paleari, M., Huet, B., Schutz, A., & Slock, D. (2008). A multimodal approach to music transcription. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 93–96. DOI: 10.1109/ICIP.2008.4711699
- Palmer, C., Carter, C., Koopmans, E., & Loehr, J. D. (2007). Movement, planning, and music: Motion coordinates of skilled performance. In Proceedings of the International Conference on Music Communication Science, pages 119–122. University of New South Wales.
- Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017a). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
- Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017b). Guiding audio source separation by video object information. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 61–65. DOI: 10.1109/WASPAA.2017.8169995
- Parncutt, R., & McPherson, G. (2002). The Science and Psychology of Music Performance: Creative Strategies for Teaching and Learning. Oxford University Press.
- Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & Kweon, I. S. (2018). Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366. DOI: 10.1109/CVPR.2018.00458
- Sigg, C., Fischer, B., Ommer, B., Roth, V., & Buhmann, J. (2007). Nonnegative CCA for audiovisual source separation. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, pages 253–258. DOI: 10.1109/MLSP.2007.4414315
- Sörgjerd, M. (2000). Auditory and Visual Recognition of Emotional Expression in Performance of Music. PhD thesis, Uppsala Universitet, Institutionen för Psykologi.
- Sun, D., Roth, S., & Black, M. J. (2010). Secrets of optical flow estimation and their principles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. DOI: 10.1109/CVPR.2010.5539939
- Thomas, V., Fremerey, C., Damm, D., & Clausen, M. (2009). SLAVE: A score-lyrics-audio-video explorer. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
- Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), volume 2, pages 252–268. DOI: 10.1007/978-3-030-01216-8_16
- Tsay, C.-J. (2014). The vision heuristic: Judging music ensembles by sight alone. Organizational Behavior and Human Decision Processes, 124(1): 24–33. DOI: 10.1016/j.obhdp.2013.10.003
- Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
