This thesis investigated the production and perception of prosodic cues for focus and phrasing contrasts from auditory and visual speech (i.e., visible face and head movements). This was done by examining the form, perceptibility, and potential functions of the visual correlates of spoken prosody using auditory and motion analysis and perception-based measures. The first part of the investigation (Chapters 2 to 3) consisted of a series of perception experiments conducted to determine the degree to which perceivers were sensitive to the visual realisation of prosody across face areas. Here, participants were presented with a visual cue (either from the upper or lower half of the face) to match (based on prosody) with another visual or auditory cue. Performance was much better than chance even when the task involved matching cues produced by different talkers. The results indicate that perceivers were sensitive to visual prosodic cues, that considerable variability in the form of these could be tolerated, and that different cues conveying information about the same prosodic type could be matched.
The second part of the thesis (Chapters 4 to 8) reported on the construction of a multi-talker speech prosody corpus and the analysis and perceptibility of this production data. The corpus consisted of auditory and visual speech recordings of six talkers producing 30 sentences across three prosodic conditions in two interactive settings (face-to-face and auditory-only), with face movements captured using a 3D motion tracking system and characterised using a guided principal components analysis. The analysis consisted of quantifying auditory and visual characteristics of prosodic contrasts separately as well as the relationship between these. Acoustically, the properties of the contrasts corresponded to those typically described in the literature, although some properties varied systematically as a function of the interactive setting, and the utterances were also perceived as conveying the intended contrasts in subsequent perceptual tasks (reported in Chapter 6). Overall, the types of movements used to contrast narrow from broad focused utterances, and echoic questions from statements, involved the use of both articulatory (e.g., jaw and lip movement) and non-articulatory (e.g., eyebrow and rigid head movement) cues. Both the visual and the acoustic properties varied across talkers and interactive settings. The spatial and temporal relationship between auditory and visual signal modalities was highly variable, differing substantially across utterances.
The final part of the thesis (Chapters 9 to 10) reported the results of a series of perception experiments using perceptual rating and cross-modal matching tasks on stimuli resynthesised from the motion capture data. These stimuli showed various combinations of visual cues, and when presented in isolation or combined with the auditory signal, these were perceived as conveying the intended prosodic contrast. However, no auditory-visual (AV) benefit was observed in the perceptual ratings, and the presentation of more cues did not result in better cross-modal matching performance (suggesting there may be limitations in perceivers' ability to process multiple cues).
In sum, the thesis showed that perceivers were sensitive to visual prosodic cues despite variability in production, and were able to match different types of cue. The construction of an AV prosody corpus permitted the characteristics of the auditory and visual prosodic correlates (and their relationship) to be quantified, and allowed for the synthesis of visual cues that perceivers subsequently used to successfully extract prosodic information.
In all, the experiments reported in this thesis provide a strong case for the development of well-controlled and measured manipulations of prosody and warrant further examination of the visual cues to prosody.
| Field | Value |
|---|---|
| Date of Award | 2011 |
| Original language | English |
- speech perception
- prosodic analysis (linguistics)
- visual perception
- auditory perception
It's not just what you say, but also how you say it : exploring the auditory and visual properties of speech prosody
Cvejic, E. (Author). 2011
Western Sydney University thesis: Doctoral thesis