Multimodal LLMs for emotion-aware human–robot interaction: design and implementation

  • Amna Abdulla Alneyadi
  • Mariam Othman Alamoodi
  • Khawla Saeed Alneyadi
  • Bushra Naeem
  • Omar Mubin
  • Fady Alnajjar

Research output: Chapter in Book / Conference Paper › Conference Paper › peer-review

Abstract

Human–robot interaction (HRI) is constrained by an emotional bandwidth gap in which humans express emotion across multiple channels while robots typically monitor only one. In this study, we develop a tri-modal emotion pipeline that fuses facial, vocal, and linguistic cues on a Pepper robot, using classic per-modality models, namely mini-Xception for facial and Wav2Vec2-Large-XLSR for vocal, plus GPT-4o for linguistic appraisal over ASR transcripts and GPT-4o-mini to generate the response and behavior from the fused state. In a pilot within-subjects study (N=2), participants held two 5-minute conversations (health, education) with Pepper under (i) the embedded facial baseline and (ii) the proposed fusion. The baseline produced predominantly Neutral outputs (~70% of cycles), whereas the tri-modal system reduced Neutral to ~30% and revealed nuanced states aligned with context (e.g., concern during health talk, enthusiasm in education). The fused state drove the robot's adaptive responses (content complexity, pacing, and nonverbal behavior) in real time. These findings highlight the potential of MLLM-assisted fusion to enhance emotional understanding and responsiveness in HRI.
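The paper does not specify its fusion rule, but the tri-modal pipeline described above — three per-modality emotion predictions combined into one fused state — can be illustrated with a simple confidence-weighted late-fusion sketch. The function below is a hypothetical example, not the authors' implementation; the weights and label format are assumptions for illustration.

```python
from collections import defaultdict

def fuse_emotions(face, voice, text, weights=(1.0, 1.0, 1.0)):
    """Confidence-weighted late fusion of three modality predictions.

    Each argument is a (label, confidence) pair, e.g. face from a
    facial-expression model, voice from a speech-emotion model, and
    text from an LLM appraisal of the ASR transcript. The weights
    are illustrative defaults, not values from the paper.
    """
    scores = defaultdict(float)
    for (label, conf), w in zip((face, voice, text), weights):
        scores[label] += w * conf
    # The fused label is the one with the highest weighted score;
    # it would then drive the robot's response generation.
    return max(scores, key=scores.get)

# Example: two modalities agree on "concern" during a health topic,
# outvoting a neutral facial reading.
fused = fuse_emotions(("neutral", 0.5), ("concern", 0.7), ("concern", 0.8))
print(fused)
```

A rule like this shows why a single-channel baseline tends toward Neutral: one low-confidence facial reading dominates, whereas fusing vocal and linguistic cues can surface the contextual state.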

Original language: English
Title of host publication: Proceedings of the 19th IEEE International Conference on Application of Information and Communication Technologies (AICT 2025), 29-31 Oct 2025, Al-Ain, UAE
Place of Publication: U.S.
Publisher: IEEE
Number of pages: 5
ISBN (Electronic): 9798331593421
DOIs
Publication status: Published - 2025
Event: International Conference on Application of Information and Communication Technologies - Al Ain, United Arab Emirates
Duration: 29 Oct 2025 – 31 Oct 2025
Conference number: 19th

Conference

Conference: International Conference on Application of Information and Communication Technologies
Country/Territory: United Arab Emirates
City: Al Ain
Period: 29/10/25 – 31/10/25

Keywords

  • Affective Computing
  • Emotion-aware Human–Robot Interaction (HRI)
  • Multimodal Large Language Models (MLLMs)
  • Social Robots
  • Tri-modal Fusion (Face–Voice–Text)

