Audio-Visual Speech and Speaker Recognition Gérard Chollet, Guido Aversano, Hervé Bredin, Fabian Brugger, Maurice Charbit, Jerôme Darbon, Walid Karam, Chafic Mokbel, Santa Rossi, Eduardo Sanchez, Marc Sigelle, Georges Yazbek, Leila Zouari




Talking Faces

• Recognition of facial features (lips, jaws, eyebrows, gaze, eye blinks, ...) in synchrony with speech
• Tracking of lip movements
• Recognition of visemes
• Lip reading: how well do hard-of-hearing people perform?


F. J. Huang and T. Chen, "Real-Time Lip-Synch Face Animation driven by human voice", IEEE Workshop on Multimedia Signal Processing, Los Angeles, California, Dec 1998

Audio-visual recognition of spectrally reduced speech, Frédéric Berthommier


Speechreading

A human listener can use visual cues, such as lip and tongue movements, to enhance the level of speech understanding, especially in a noisy environment. The process of combining the audio modality and the visual modality is referred to as speechreading, or lipreading.

Many applications require recognizing speech in extremely adverse acoustic environments. Examples include detecting a person's speech from a distance or through a glass window, understanding a person speaking in a very noisy crowd, and monitoring speech in a TV broadcast when the audio link is weak or corrupted.
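The combination of audio and visual cues described above can be illustrated with a minimal sketch of weighted late fusion, in which each modality scores every candidate class and the scores are mixed with a weight that reflects how much the audio can be trusted. This is an illustration under stated assumptions, not the method of any system cited in these slides: the function names, the candidate labels, and the likelihood values are all hypothetical.

```python
import math


def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Weighted late fusion of per-class log-likelihoods.

    score[c] = w * audio_ll[c] + (1 - w) * visual_ll[c], with w in [0, 1].
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_ll[label]
        + (1.0 - audio_weight) * visual_ll[label]
        for label in audio_ll
    }


def recognize(audio_ll, visual_ll, audio_weight):
    """Return the class with the highest fused score."""
    fused = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


# Hypothetical scores for one noisy utterance: the audio stream slightly
# prefers "ga", while the lip shape clearly indicates the bilabial "ba".
audio_ll = {"ba": math.log(0.40), "ga": math.log(0.45), "da": math.log(0.15)}
visual_ll = {"ba": math.log(0.80), "ga": math.log(0.10), "da": math.log(0.10)}

audio_only = recognize(audio_ll, visual_ll, audio_weight=1.0)
audio_visual = recognize(audio_ll, visual_ll, audio_weight=0.5)
print(audio_only, audio_visual)
```

With these made-up numbers the audio-only decision is "ga", while giving the visual stream equal weight changes it to "ba". In practice the weight would be lowered as the acoustic conditions degrade.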


2001: A Space Odyssey


Audio-Visual Speech Recognition (Ref?)



Coupled HMM


OpenCV

Open source code for audio-visual continuous speech recognition (AVCSR) can be downloaded from the OpenCV project page: http://sourceforge.net/projects/opencvlibrary/


Publications 1

- Ara V Nefian, Lu Hong Liang, Xiao Xing Liu, Xiaobo Pi and Kevin Murphy, "Dynamic Bayesian networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1274-1288, 2002.
- Xiao Xing Liu, Yibao Zhao, Xiaobo Pi, Lu Hong Liang and Ara V Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model", IEEE International Conference on Spoken Language Processing, pp. 213-216, September 2002.
- Lu Hong Liang, Xiao Xing Liu, Yibao Zhao, Xiaobo Pi and Ara V Nefian, "Speaker independent audio-visual continuous speech recognition", IEEE International Conference on Multimedia and Expo, vol. 2, pp. 25-28, August 2002.


Publications 2

- Ara V Nefian, Lu Hong Liang, Xiao Xing Liu, Xiaobo Pi, Crusoe Mao and Kevin Murphy, "A coupled HMM for audio-visual speech recognition", International Conference on Acoustics, Speech and Signal Processing, vol. II, pp. 2013-2016, Orlando, Florida, May 2002.
- Gerasimos Potamianos, Chalapathy Neti, Giridharan Iyengar, Andrew W. Senior and Ashish Verma, "A cascade visual front end for speaker independent automatic speechreading", International Journal of Speech Technology, Special Issue on Multimedia, 4, 193-208, 2001.


Biblio

Adjoudani, A. and Benoit, C. (1996). On the integration of auditory and visual parameters in an HMM-based ASR. In Stork, D.G. and Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Berlin, Germany: Springer, pp. 461-471.

Bregler, C. and Konig, Y. (1994). "Eigenlips" for robust speech recognition. Proceedings International Conference on Acoustics, Speech, and Signal Processing (ICASSP)'94, Adelaide, Australia, pp. 669-672.


Brooke, N.M. (1996). Talking heads and speech recognizers that can see: The computer processing of visual speech signals. In Stork, D.G. and Hennecke, M.E. (Eds.), Speechreading by Humans and Machines. Berlin, Germany: Springer, pp. 351-371.

Chen, T. (2001). Audiovisual speech processing. Lip reading and lip synchronization. IEEE Signal Processing Magazine, 18(1):9-21.


Dupont, S. and Luettin, J. (2000). Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141-151.

Gray, M.S., Movellan, J.R., and Sejnowski, T.J. (1997). Dynamic features for visual speech-reading: A systematic comparison. In Mozer, M.C., Jordan, M.I., and Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press, pp. 751-757.


Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., Mashari, A., and Zhou, J. (2000). Audio-Visual Speech Recognition. Summer Workshop 2000 Final Technical Report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD (http://www.clsp.jhu.edu/ws2000/final_reports/avsr/).

Petajan, E.D. (1984). Automatic lipreading to enhance speech recognition. Proceedings Global Telecommunications Conference (GLOBECOM)'84, Atlanta, GA, pp. 265-272.


Rogozan, A., Deleglise, P., and Alissali, M. (1997). Adaptive determination of audio and visual weights for automatic speech recognition. Proceedings European Tutorial Research Workshop on Audio-Visual Speech Processing (AVSP)'97, Rhodes, Greece, pp. 61-64.

Summerfield, A.Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In Dodd, B. and Campbell, R. (Eds.), Hearing by Eye: The Psychology of Lip-Reading. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 97-113.


Summerfield, Q., MacLeod, A., McGrath, M., and Brooke, M. (1989). Lips, teeth, and the benefits of lipreading. In Young, A.W. and Ellis, H.D. (Eds.), Handbook of Research on Face Processing. Amsterdam, The Netherlands: Elsevier Science Publishers, pp. 223-233.

Teissier, P., Robert-Ribes, J., Schwartz, J.-L., and Guerin-Dugue, A. (1999). Comparing models for audiovisual fusion in a noisy-vowel recognition task. IEEE Transactions on Speech and Audio Processing, 7(6):629-642.


Wark, T. and Sridharan, S. (1998). A syntactic approach to automatic lip feature extraction for speaker identification. Proceedings International Conference on Acoustics, Speech, and Signal Processing (ICASSP)'98, Seattle, WA, pp. 3693-3696.

A HYBRID ANN/HMM AUDIO-VISUAL SPEECH RECOGNITION SYSTEM
Martin Heckmann, Frédéric Berthommier, Kristian Kroschel



C. Bregler, S. Manke, H. Hild, and A. Waibel, “Bimodal sensor integration on the example of speech-reading,” in Proc. IEEE Int. Conf. on Neural Networks, 1993, pp. 667–671.

A. Rogozan and P. Deléglise, “Adaptive fusion of acoustic and visual sources for automatic speech recognition,” Speech Communication, vol. 26, pp. 149–161, 1998.
