Visemes and phonemes do not share a one-to-one correspondence. Often several phonemes correspond to a single viseme, because they look the same on the face when produced: /k, ɡ, ŋ/ (viseme: /k/), /t͡ʃ, ʃ, d͡ʒ, ʒ/ (viseme: /ch/), /t, d, n, l/ (viseme: /t/), and /p, b, m/ (viseme: /p/). Thus words such as pet, bell, and men are difficult for lip-readers to distinguish, as all look like /pet/. However, the visual 'signature' of a given gesture may differ in timing and duration during actual speech, and such differences cannot be captured in a single photograph. Conversely, some sounds which are hard to distinguish acoustically are clearly distinguished by the face (Chen 2001). For example, English /l/ and /r/ can be acoustically quite similar (especially in clusters, such as 'grass' vs. 'glass'), yet visual information can show a clear contrast. This is demonstrated by the fact that words are misheard more often on the telephone than in person. Some linguists have argued that speech is best understood as bimodal (aural and visual), and that comprehension can be compromised if one of these two modalities is absent (McGurk and MacDonald 1976).
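The many-to-one grouping described above can be illustrated with a small lookup table. The following is a minimal sketch, not a standard viseme inventory: it contains only the four classes named in the text, the names `VISEME_CLASSES`, `PHONEME_TO_VISEME`, and `to_visemes` are hypothetical, and the simplified transcriptions (/pet/, /bel/, /men/) and the choice to pass unlisted phonemes such as vowels through unchanged are assumptions made for illustration.

```python
# Viseme classes as listed in the text (label -> phonemes that share it).
# This covers only the four example classes, not a full inventory.
VISEME_CLASSES = {
    "k": ["k", "ɡ", "ŋ"],
    "ch": ["t͡ʃ", "ʃ", "d͡ʒ", "ʒ"],
    "t": ["t", "d", "n", "l"],
    "p": ["p", "b", "m"],
}

# Invert the table: each phoneme maps to exactly one viseme (many-to-one).
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_CLASSES.items()
    for phoneme in phonemes
}


def to_visemes(phonemes):
    """Map a phoneme sequence to a viseme sequence.

    Assumption: phonemes not in the table (e.g. vowels) are passed
    through unchanged, purely to keep the example short.
    """
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]


# Simplified transcriptions of the example words (assumed for illustration).
WORDS = {
    "pet": ["p", "e", "t"],
    "bell": ["b", "e", "l"],
    "men": ["m", "e", "n"],
}

for word, phonemes in WORDS.items():
    print(word, to_visemes(phonemes))
```

Under these assumptions all three words reduce to the same viseme sequence, which is why they are hard to tell apart by sight alone. A static table like this ignores the timing and duration cues mentioned above.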
- Chen, T. (1998, May). "Audio-visual integration in multi-modal communication". Proceedings of the IEEE 86, 837–852.
- Chen, T. (2001). "Audiovisual speech processing". IEEE Signal Processing Magazine, 9–31.
- Fisher, C.G. (1968). "Confusions among visually perceived consonants". Journal of Speech and Hearing Research, 11(4):796–804.
- McGurk, H. and J. MacDonald (1976, December). "Hearing lips and seeing voices". Nature, 746–748.
- Lucey, P., T. Martin and S. Sridharan (2004, December). "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments". Presented at the Tenth Australian International Conference on Speech Science & Technology, Macquarie University, Sydney, 8–10 December 2004.