
Online Audio-Visual Source Association for Chamber Music Performances

Open Access | Aug 2019

References

  1. Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 451–466. DOI: 10.1007/978-3-030-01246-5_27
  2. Arzt, A., & Widmer, G. (2010). Simple tempo models for real-time music tracking. In Proceedings of the 7th Sound and Music Computing Conference (SMC).
  3. Arzt, A., Widmer, G., & Dixon, S. (2008). Automatic page turning for musicians via real-time machine listening. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 241–245.
  4. Askenfelt, A. (1989). Measurement of the bowing parameters in violin playing. II: Bow–bridge distance, dynamic range, and limits of bow force. The Journal of the Acoustical Society of America, 86(2): 503–516. DOI: 10.1121/1.398230
  5. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1): 1–31. DOI: 10.1007/s11263-010-0390-2
  6. Barzelay, Z., & Schechner, Y. Y. (2007). Harmony in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. DOI: 10.1109/CVPR.2007.383344
  7. Barzelay, Z., & Schechner, Y. Y. (2010). Onsets coincidence for cross-modal analysis. IEEE Transactions on Multimedia, 12(2): 108–120. DOI: 10.1109/TMM.2009.2037387
  8. Bazzica, A., Liem, C. C., & Hanjalic, A. (2014). Exploiting instrument-wise playing/non-playing labels for score synchronization of symphonic music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 201–206.
  9. Bello, J. P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., & Sandler, M. (2005). A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5): 1035–1047. DOI: 10.1109/TSA.2005.851998
  10. Burkholder, J. P., & Grout, D. J. (2014). A History of Western Music: Ninth International Student Edition. W. W. Norton & Company, Inc.
  11. Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 7291–7299. DOI: 10.1109/CVPR.2017.143
  12. Casanovas, A. L., Monaci, G., Vandergheynst, P., & Gribonval, R. (2010). Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia, 12(5): 358–371. DOI: 10.1109/TMM.2010.2050650
  13. Casanovas, A. L., & Vandergheynst, P. (2010). Nonlinear video diffusion based on audio-video synchrony. Unpublished. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.187.4688
  14. Cutler, R., & Davis, L. (2000). Look who’s talking: Speaker detection using video and audio correlation. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 1589–1592. DOI: 10.1109/ICME.2000.871073
  15. Dahl, S. (2004). Playing the accent—Comparing striking velocity and timing in an ostinato rhythm performed by four drummers. Acta Acustica united with Acustica, 90(4): 762–776.
  16. Dinesh, K., Li, B., Liu, X., Duan, Z., & Sharma, G. (2017). Visually informed multi-pitch analysis of string ensembles. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3021–3025. DOI: 10.1109/ICASSP.2017.7952711
  17. Dixon, S. (2005). Live tracking of musical performances using on-line time warping. In Proceedings of the International Conference on Digital Audio Effects (DAFx), pages 92–97.
  18. Duan, Z., Essid, S., Liem, C., Richard, G., & Sharma, G. (2019). Audiovisual analysis of music performances: Overview of an emerging field. IEEE Signal Processing Magazine, 36(1): 63–73. DOI: 10.1109/MSP.2018.2875511
  19. Duan, Z., & Pardo, B. (2011a). Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6): 1205–1215. DOI: 10.1109/JSTSP.2011.2159701
  20. Duan, Z., & Pardo, B. (2011b). A state space model for online polyphonic audio-score alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 197–200. DOI: 10.1109/ICASSP.2011.5946374
  21. Duan, Z., Pardo, B., & Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8): 2121–2133. DOI: 10.1109/TASL.2010.2042119
  22. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4). DOI: 10.1145/3197517.3201357
  23. Ewert, S., Pardo, B., Müller, M., & Plumbley, M. D. (2014). Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3): 116–124. DOI: 10.1109/MSP.2013.2296076
  24. Fisher, J. W., & Darrell, T. (2004). Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia, 6(3): 406–413. DOI: 10.1109/TMM.2004.827503
  25. Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), volume 3, pages 36–54. DOI: 10.1007/978-3-030-01219-9_3
  26. Geringer, J. M., MacLeod, R. B., & Allen, M. L. (2010). Perceived pitch of violin and cello vibrato tones among music majors. Journal of Research in Music Education, 57(4): 351–363. DOI: 10.1177/0022429409350510
  27. Godøy, R. I., & Jensenius, A. R. (2009). Body movement in music information retrieval. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 45–50.
  28. Grubb, L., & Dannenberg, R. (1997). A stochastic method of tracking a vocal performer. In Proceedings of the International Computer Music Conference (ICMC), pages 301–308.
  29. Izadinia, H., Saleemi, I., & Shah, M. (2013). Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2): 378–390. DOI: 10.1109/TMM.2012.2228476
  30. Kidron, E., Schechner, Y. Y., & Elad, M. (2007). Cross-modal localization via sparsity. IEEE Transactions on Signal Processing, 55(4): 1390–1404. DOI: 10.1109/TSP.2006.888095
  31. Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1–2): 83–97. DOI: 10.1002/nav.3800020109
  32. Li, B., Dinesh, K., Duan, Z., & Sharma, G. (2017a). See and listen: Score-informed association of sound tracks to players in chamber music performance videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2906–2910. DOI: 10.1109/ICASSP.2017.7952688
  33. Li, B., Dinesh, K., Sharma, G., & Duan, Z. (2017b). Video-based vibrato detection and analysis for polyphonic string music. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 123–130.
  34. Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2018a). Data from: “Creating a multi-track classical music performance dataset for multi-modal music analysis: Challenges, insights, and applications.” Dryad Digital Repository. DOI: 10.5061/dryad.ng3r749
  35. Li, B., Liu, X., Dinesh, K., Duan, Z., & Sharma, G. (2019). Creating a music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2): 522–535. DOI: 10.1109/TMM.2018.2856090
  36. Li, B., Maezawa, A., & Duan, Z. (2018b). Skeleton plays piano: Online generation of pianist body movements from MIDI performance. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
  37. Li, B., Xu, C., & Duan, Z. (2017c). Audiovisual source association for string ensembles through multi-modal vibrato analysis. In Proceedings of the Sound and Music Computing Conference (SMC), pages 159–166.
  38. Li, K., Ye, J., & Hua, K. A. (2014). What’s making that sound? In Proceedings of the ACM International Conference on Multimedia, pages 147–156. DOI: 10.1145/2647868.2654936
  39. Liu, Y., & Sato, Y. (2008). Finding speaker face region by audiovisual correlation. In Proceedings of the Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2).
  40. Müller, M. (2007). Dynamic time warping. In Information Retrieval for Music and Motion, chapter 4, pages 69–84. Springer. DOI: 10.1007/978-3-540-74048-3_4
  41. Müller, M. (2015). Music synchronization. In Fundamentals of Music Processing, chapter 3, pages 115–166. Springer. DOI: 10.1007/978-3-319-21945-5_3
  42. Müller, M., Mattes, H., & Kurth, F. (2006). An efficient multiscale approach to audio synchronization. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
  43. Obata, S., Nakahara, H., Hirano, T., & Kinoshita, H. (2009). Fingering force in violin vibrato. In Proceedings of the International Symposium on Performance Science, volume 429.
  44. Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), volume 6, pages 639–658. DOI: 10.1007/978-3-030-01231-1_39
  45. Paleari, M., Huet, B., Schutz, A., & Slock, D. (2008). A multimodal approach to music transcription. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 93–96. DOI: 10.1109/ICIP.2008.4711699
  46. Palmer, C., Carter, C., Koopmans, E., & Loehr, J. D. (2007). Movement, planning, and music: Motion coordinates of skilled performance. In Proceedings of the International Conference on Music Communication Science, pages 119–122. University of New South Wales.
  47. Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017a). Motion informed audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10. DOI: 10.1109/ICASSP.2017.7951787
  48. Parekh, S., Essid, S., Ozerov, A., Duong, N. Q., Pérez, P., & Richard, G. (2017b). Guiding audio source separation by video object information. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 61–65. DOI: 10.1109/WASPAA.2017.8169995
  49. Parncutt, R., & McPherson, G. (2002). The Science and Psychology of Music Performance: Creative Strategies for Teaching and Learning. Oxford University Press. DOI: 10.1177/1321103X020190010803
  50. Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & Kweon, I. S. (2018). Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366. DOI: 10.1109/CVPR.2018.00458
  51. Sigg, C., Fischer, B., Ommer, B., Roth, V., & Buhmann, J. (2007). Nonnegative CCA for audiovisual source separation. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, pages 253–258. DOI: 10.1109/MLSP.2007.4414315
  52. Sörgjerd, M. (2000). Auditory and Visual Recognition of Emotional Expression in Performance of Music. PhD thesis, Uppsala Universitet, Institutionen för Psykologi.
  53. Sun, D., Roth, S., & Black, M. J. (2010). Secrets of optical flow estimation and their principles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. DOI: 10.1109/CVPR.2010.5539939
  54. Thomas, V., Fremerey, C., Damm, D., & Clausen, M. (2009). SLAVE: A score-lyrics-audio-video explorer. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.
  55. Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), volume 2, pages 252–268. DOI: 10.1007/978-3-030-01216-8_16
  56. Tsay, C.-J. (2014). The vision heuristic: Judging music ensembles by sight alone. Organizational Behavior and Human Decision Processes, 124(1): 24–33. DOI: 10.1016/j.obhdp.2013.10.003
  57. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), volume 1, pages 587–604. DOI: 10.1007/978-3-030-01246-5_35
DOI: https://doi.org/10.5334/tismir.25 | Journal eISSN: 2514-3298
Language: English
Submitted on: Dec 18, 2018
Accepted on: May 20, 2019
Published on: Aug 5, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Bochen Li, Karthik Dinesh, Chenliang Xu, Gaurav Sharma, Zhiyao Duan, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.