Abstract
Human–robot interaction (HRI) is constrained by an emotional bandwidth gap: humans express emotion across multiple channels, while robots typically monitor only one. In this study, we develop a tri-modal emotion pipeline that fuses facial, vocal, and linguistic cues on a Pepper robot, using established per-modality models (mini-Xception for facial cues and Wav2Vec2-Large-XLSR for vocal cues), GPT-4o for linguistic appraisal over ASR transcripts, and GPT-4o-mini to generate the response and behavior from the fused state. In a pilot within-subjects study (N=2), participants held two 5-minute conversations (health, education) with Pepper under (i) the embedded facial-only baseline and (ii) the proposed tri-modal fusion. The baseline produced predominantly Neutral outputs (~70% of cycles), whereas the tri-modal system reduced Neutral to ~30% and revealed nuanced states aligned with context (e.g., concern during the health conversation, enthusiasm during the education conversation). The fused state drove the robot's adaptive responses (content complexity, pacing, and nonverbal behavior) in real time. These findings highlight the potential of MLLM-assisted fusion to enhance emotional understanding and responsiveness in HRI.
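The abstract does not specify how the three per-modality outputs are combined or how the fused state is handed to the response-generation model. The following is a minimal, hypothetical sketch of one way such a late-fusion step could look; the label set, channel weights, fusion rule, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of a confidence-weighted late-fusion step for a
# tri-modal (face, voice, text) emotion pipeline. All specifics below
# (labels, weights, prompt wording) are assumptions for illustration.

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised", "fearful"]

@dataclass
class ModalityReading:
    """Emotion probabilities produced by one channel (face, voice, or text)."""
    source: str               # e.g. "mini-Xception", "Wav2Vec2-XLSR", "GPT-4o appraisal"
    probs: dict[str, float]   # emotion label -> probability
    weight: float             # assumed per-channel reliability weight

def fuse(readings: list[ModalityReading]) -> dict[str, float]:
    """Weighted average of the per-modality distributions (assumed fusion rule)."""
    fused = {e: 0.0 for e in EMOTIONS}
    total = sum(r.weight for r in readings) or 1.0
    for r in readings:
        for e in EMOTIONS:
            fused[e] += r.weight * r.probs.get(e, 0.0) / total
    return fused

def build_response_prompt(fused: dict[str, float], transcript: str) -> str:
    """Prompt handed to the response-generation LLM (format is an assumption)."""
    top = max(fused, key=fused.get)
    return (
        f"User said: {transcript!r}\n"
        f"Fused emotional state: {top} (score {fused[top]:.2f}).\n"
        "Reply empathetically and choose a matching Pepper gesture and pace."
    )

if __name__ == "__main__":
    readings = [
        ModalityReading("mini-Xception", {"neutral": 0.6, "sad": 0.3, "happy": 0.1}, 1.0),
        ModalityReading("Wav2Vec2-XLSR", {"sad": 0.7, "neutral": 0.2, "fearful": 0.1}, 1.0),
        ModalityReading("GPT-4o appraisal", {"sad": 0.8, "fearful": 0.2}, 1.5),
    ]
    print(build_response_prompt(fuse(readings), "I've been worried about my health lately."))
```

In this sketch the text channel is weighted slightly higher than the audiovisual ones, which is purely a placeholder choice; the fused label and the transcript together form the context from which the dialogue model generates the robot's verbal and nonverbal response.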
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 19th IEEE International Conference on Application of Information and Communication Technologies (AICT 2025), 29-31 Oct 2025, Al-Ain, UAE |
| Place of Publication | U.S. |
| Publisher | IEEE |
| Number of pages | 5 |
| ISBN (Electronic) | 9798331593421 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | International Conference on Application of Information and Communication Technologies (19th), Al Ain, United Arab Emirates, 29 Oct 2025 → 31 Oct 2025 |
Conference
| Conference | International Conference on Application of Information and Communication Technologies |
|---|---|
| Country/Territory | United Arab Emirates |
| City | Al Ain |
| Period | 29/10/25 → 31/10/25 |
Keywords
- Affective Computing
- Emotion-aware Human–Robot Interaction (HRI)
- Multimodal Large Language Models (MLLMs)
- Social Robots
- Tri-modal Fusion (Face–Voice–Text)