Multimodal Referring Expression Generation for Human-Computer Interaction

  • Conference paper
  • First Online:
HCI International 2024 – Late Breaking Papers (HCII 2024)

Abstract

Using both verbal and non-verbal modalities to generate definite descriptions of objects and locations is a critical human capability in collaborative interactions. Despite recent advancements in AI, embodied interactive virtual agents (IVAs) are not equipped to intelligently mix modalities to communicate their intents as humans do, which hamstrings naturalistic multimodal HCI. We introduce SCMRE, a corpus designed for training generative AI systems in multimodal HCI, focusing on multimodal referring expressions. Our contributions include: 1) developing an IVA platform that interprets human multimodal instructions and responds with language and gestures; 2) presenting 24 participants with 10 scenes, each involving ten equally-sized blocks randomly placed on a table; these interactions generated a dataset of 10,408 samples; 3) analyzing SCMRE, revealing that pointing significantly reduces the ambiguity of prompts and increases the efficiency with which the IVA executes human prompts; 4) augmenting and synthesizing SCMRE into 22,159 samples to provide more data for model training; 5) using LLaMA 2-13B for parameter-efficient fine-tuning to generate contextually correct and situationally fluent multimodal referring expressions; 6) integrating the fine-tuned model into the IVA to evaluate how successfully the generative-model-enabled IVA communicates with humans; 7) establishing an evaluation process that applies to both humans and IVAs and combines quantitative and qualitative metrics.
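
The abstract states that LLaMA 2-13B was fine-tuned in a parameter-efficient way to generate multimodal referring expressions, but the training setup is not detailed on this page. The sketch below is a minimal, hedged illustration of one common parameter-efficient approach, LoRA adapters via the Hugging Face transformers/peft/datasets stack; the checkpoint name, LoRA hyperparameters, the data file scmre_train.json, and its instruction/output fields are illustrative assumptions, not the authors' actual configuration.

# Minimal LoRA fine-tuning sketch. Assumptions: Hugging Face transformers, peft,
# and datasets are installed; the checkpoint, hyperparameters, data file, and
# field names are illustrative, not the configuration used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-13b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Attach low-rank adapters to the attention projections; rank/alpha are illustrative.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Hypothetical serialization of SCMRE samples as instruction/response text pairs.
data = load_dataset("json", data_files="scmre_train.json")["train"]

def tokenize(example):
    text = example["instruction"] + "\n" + example["output"]
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # a real setup would mask padding with -100
    return enc

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="scmre-lora", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, fp16=True, logging_steps=20),
)
trainer.train()
model.save_pretrained("scmre-lora")  # stores only the small LoRA adapter weights

Because only the low-rank adapter matrices are trained while the 13B base weights stay frozen, the number of trainable parameters and the optimizer memory footprint drop by orders of magnitude compared with full fine-tuning.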

Notes

  1. https://github.com/nadahass/SCMRE_Dataset.

  2. https://github.com/nadahass/Human-based-Evaluation-MREG.git.

Acknowledgments

We express our gratitude to our reviewers for their valuable comments. We also thank our participants for providing the SCMRE data.

Author information

Corresponding author

Correspondence to Nada Alalyani.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alalyani, N., Krishnaswamy, N. (2024). Multimodal Referring Expression Generation for Human-Computer Interaction. In: Degen, H., Ntoa, S. (eds) HCI International 2024 – Late Breaking Papers. HCII 2024. Lecture Notes in Computer Science, vol 15382. Springer, Cham. https://doi.org/10.1007/978-3-031-76827-9_1

  • DOI: https://doi.org/10.1007/978-3-031-76827-9_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-76826-2

  • Online ISBN: 978-3-031-76827-9

  • eBook Packages: Computer Science, Computer Science (R0)
