Multimodal Referring Expression Generation for Human-Computer Interaction

  • Conference paper
  • First Online:
HCI International 2024 – Late Breaking Papers (HCII 2024)

Abstract

Using both verbal and non-verbal modalities to generate definite descriptions of objects and locations is a critical human capability in collaborative interactions. Despite recent advancements in AI, embodied interactive virtual agents (IVAs) are not equipped to intelligently mix modalities to communicate their intents as humans do, which hamstrings naturalistic multimodal HCI. We introduce SCMRE, a corpus designed for training generative AI systems in multimodal HCI, focusing on multimodal referring expressions. Our contributions include: 1) developing an IVA platform that interprets human multimodal instructions and responds with language and gestures; 2) presenting 24 participants with 10 scenes, each involving ten equally-sized blocks randomly placed on a table; these interactions generated a dataset of 10,408 samples; 3) analyzing SCMRE, revealing that pointing significantly reduces the ambiguity of prompts and increases the efficiency with which the IVA executes human prompts; 4) augmenting and synthesizing SCMRE into 22,159 samples to provide more data for model training; 5) using LLaMA 2-13B for parameter-efficient fine-tuning to generate contextually correct and situationally fluent multimodal referring expressions; 6) integrating the fine-tuned model into the IVA to evaluate how successfully the generative-model-enabled IVA communicates with humans; 7) establishing an evaluation process that applies to both humans and IVAs and combines quantitative and qualitative metrics.
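
The abstract states that LLaMA 2-13B was fine-tuned in a parameter-efficient way to generate multimodal referring expressions, but the training setup is not detailed on this page. The sketch below is a minimal, hedged illustration of one common parameter-efficient approach, LoRA adapters via the Hugging Face transformers/peft/datasets stack; the checkpoint name, LoRA hyperparameters, the data file scmre_train.json, and its instruction/output fields are illustrative assumptions, not the authors' actual configuration.

# Minimal LoRA fine-tuning sketch. Assumptions: Hugging Face transformers, peft,
# and datasets are installed; the checkpoint, hyperparameters, data file, and
# field names are illustrative, not the configuration used in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-13b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Attach low-rank adapters to the attention projections; rank/alpha are illustrative.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Hypothetical serialization of SCMRE samples as instruction/response text pairs.
data = load_dataset("json", data_files="scmre_train.json")["train"]

def tokenize(example):
    text = example["instruction"] + "\n" + example["output"]
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # a real setup would mask padding with -100
    return enc

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="scmre-lora", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, fp16=True, logging_steps=20),
)
trainer.train()
model.save_pretrained("scmre-lora")  # stores only the small LoRA adapter weights

Because only the low-rank adapter matrices are trained while the 13B base weights stay frozen, the number of trainable parameters and the optimizer memory footprint drop by orders of magnitude compared with full fine-tuning.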

Notes

  1. https://github.com/nadahass/SCMRE_Dataset.

  2. https://github.com/nadahass/Human-based-Evaluation-MREG.git.

Acknowledgments

We express our gratitude to our reviewers for their valuable comments. We also thank our participants for providing the SCMRE data.

Author information

Corresponding author

Correspondence to Nada Alalyani.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Alalyani, N., Krishnaswamy, N. (2024). Multimodal Referring Expression Generation for Human-Computer Interaction. In: Degen, H., Ntoa, S. (eds) HCI International 2024 – Late Breaking Papers. HCII 2024. Lecture Notes in Computer Science, vol 15382. Springer, Cham. https://doi.org/10.1007/978-3-031-76827-9_1

  • DOI: https://doi.org/10.1007/978-3-031-76827-9_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-76826-2

  • Online ISBN: 978-3-031-76827-9

  • eBook Packages: Computer Science, Computer Science (R0)
