When Text and Speech are Not Enough: A Multimodal Dataset of Collaboration in a Situated Task

Abstract

Adequately modeling the information exchanged in real human-human interactions requires more than speech or text alone; considering these channels in isolation leaves out many critical modalities. The channels contributing to the "making of sense" in human-human interaction include, but are not limited to, gesture, speech, user-interaction modeling, gaze, joint attention, and involvement/engagement, all of which must be adequately modeled to automatically extract correct and meaningful information. In this paper, we present a multimodal dataset of a novel situated and shared collaborative task, with the above channels annotated to encode these different aspects of the participants' situated and embodied involvement in the joint activity.

DOI: https://doi.org/10.5334/johd.168 | Journal eISSN: 2059-481X
Language: English
Submitted on: Oct 14, 2023
Accepted on: Dec 5, 2023
Published on: Jan 17, 2024
Published by: Ubiquity Press

© 2024 Ibrahim Khebour, Richard Brutti, Indrani Dey, Rachel Dickler, Kelsey Sikes, Kenneth Lai, Mariah Bradford, Brittany Cates, Paige Hansen, Changsoo Jung, Brett Wisniewski, Corbyn Terpstra, Leanne Hirshfield, Sadhana Puntambekar, Nathaniel Blanchard, James Pustejovsky, Nikhil Krishnaswamy, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.