
Automatic visual speech segmentation

Hamed Talea
Department of Electrical and Computer Engineering
Semnan University, Semnan, Iran
[email protected]

Khashayar Yaghmaie
Department of Electrical and Computer Engineering
Semnan University, Semnan, Iran
[email protected]

Abstract- Speech recognition techniques that rely on audio features of speech degrade in performance in noisy environments. Visual speech recognition addresses this by incorporating a visual signal into the recognition process. The performance of an automatic speech recognition (ASR) system can be significantly enhanced with additional information from visual speech elements such as the movement of the lips, tongue, and teeth. This paper introduces a combined method for lip region extraction and mouth area estimation, which is then used to develop a technique for automatic visual speech segmentation. The accuracy of this method is verified by applying it to syllable boundary separation and the subsequent vowel segmentation in multi-syllable words and phrases.

Keywords: speech segmentation, lip tracking, visual syllable separation, visual feature, lipreading.

I. INTRODUCTION

Speech is the primary means of communication between people. Speech recognition techniques have developed dramatically in recent years. A major weakness of current automatic speech recognition (ASR) systems is their sensitivity to environmental and channel noise. Many applications try to reduce these effects by using audio preprocessing techniques and noise adaptation algorithms [1]. One of the most effective approaches to enhancing speech recognition performance is the use of visual information obtained by processing the video signal of the speaker (audiovisual systems) [2][3]. In the visual part of such systems, detection of lip motion can help increase the performance of speech recognition. This requires, as a first step, extraction of the lip position from consecutive image frames.

In the visual speech recognition domain, the technique of retrieving speech content from visual clues such as the movement of the lips, tongue, and teeth is commonly known as automatic lip reading. It has been shown in [4]-[11] that the performance of a purely acoustic speech recognition system improves when additional information from the visual speech elements is used, especially when the speech signal has a low signal-to-noise ratio (SNR). Automatic lip reading, however, is difficult for both the visual feature extraction and the speech recognition processes. Visual feature extraction requires a robust method for tracking the speaker's lips through a sequence of images and a representation of the inner mouth appearance. Lip tracking is not a trivial task, since people vary in skin color, lip color, lip width, and the amount of lip movement during speech, and the environment varies as well, for example in lighting conditions. Moreover, any method used to track lips during speech should not only adapt to the movement of the lips from frame to frame, but also be stable enough not to be affected by the appearance of the teeth and tongue [12].

Regarding the recognition process, different methods have been developed to recognize speech from audio and visual features. For example, neural networks and hidden Markov models have been widely used in problems such as classification and speech recognition [13], [14], [15]. Small-vocabulary isolated word recognition methods have the advantages of simplicity of implementation and high recognition accuracy. However, the same approaches cannot easily be extended to large-vocabulary continuous speech recognition, since the number of possible words in continuous speech is prohibitively large. Therefore, an approach based on sub-words, in which different words can share common sub-word units in their representation, is a possible alternative. To build a dictionary of such units, the sub-word (or phoneme) boundaries must be identified by means of speech segmentation [16]. For vowel recognition in multi-syllable words or phrases, it is necessary that the syllables in every word or phrase be accurately separated in the first stage.

This paper presents a novel combined technique which employs the mouth area for syllable separation in multi-syllable words. Techniques for extracting the lip region are briefly described in Sections II and III. In Section IV an automatic visual syllable separation method for speech segmentation in Farsi words and phrases is introduced. This is followed by a description of the employed database in Section V. The proposed algorithm is evaluated in Section VI by applying it to the segmentation of more than 135 Farsi words and phrases.

II. FEATURE EXTRACTION

In the first stage, appropriate features of the lips must be extracted. There are two major approaches to extracting the visual features, called "up to down" and "down to up" methods. In the following, two down-to-up methods are introduced, and a method based on them is then presented for accurate extraction of the visual features.


A. Red Exclusion Algorithm

In this technique, the green and blue levels of the image pixels are used to separate the lips from the other parts of the face [18]. The method is based on the fact that many segments of face images, including the lips, are predominantly red. Therefore, after excluding the red component, any remaining contrast is due to the blue and green color components; hence the name red exclusion. The criterion states that pixels belonging to the lips have green and blue levels G and B such that [18]:

log(G / B) ≤ θ    (1)

Eq. (1) shows the basic criterion for mouth detection. The selection of the threshold θ, which distinguishes the lips from the other parts, is key to the accurate performance of the red exclusion technique. θ can be evaluated from Eq. (2) [17]:

θ = μ - 1.05 × σ    (2)

where θ is the threshold value between lip color and skin color, and μ and σ are the mean and standard deviation obtained from statistical data. Fig. 1b shows the result of applying the red exclusion technique to the face image in Fig. 1a.

Figure 1. (a) The original image; (b) the result of applying the red exclusion technique.
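As a rough illustration, the red exclusion criterion of Eqs. (1) and (2) might be sketched as follows. It assumes that μ and σ are statistics of log(G/B) gathered beforehand from training pixels, which the paper does not state explicitly; the variable names are ours:

```python
import numpy as np

def red_exclusion_mask(rgb, mu, sigma):
    """Binary lip mask via red exclusion (Eqs. 1-2).

    rgb   : H x W x 3 array with channels in R, G, B order.
    mu    : assumed mean of log(G/B) from training data.
    sigma : assumed standard deviation of log(G/B) from training data.
    """
    g = rgb[..., 1].astype(float) + 1e-6   # small offset avoids log(0) / division by zero
    b = rgb[..., 2].astype(float) + 1e-6
    theta = mu - 1.05 * sigma              # threshold of Eq. (2)
    return np.log(g / b) <= theta          # Eq. (1): True where a pixel is taken as lip
```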

B. Color transform using red and green information

Color transform using red and green information is an efficient preprocessing algorithm for lip segmentation. We define a new transformation based on the RGB color space, similar to the pseudo-hue transform. This transformation is defined as Eq. (3):

h(x,y) = R(x,y)^3 / (R(x,y)^3 + G(x,y)^3 + 1)    (3)

where R(x,y) and G(x,y) are the red and green components of the original image and h(x,y) is the transformed image. In the transformed image, as illustrated in Fig. 2, the lip segments appear brighter than the other face components.

Figure 2. Transformed region of lips.

In the transformed image the lip area appears considerably brighter than the other segments of the original image, which suggests extracting a binary (black and white) image from the gray-scale output of the previous stage. The threshold for this conversion is obtained by finding the position of the peak value in the illumination histogram of the transformed image (Fig. 2); the illumination value at this peak is defined as K, and in this research the threshold is set to 1.15 × K. This process can be formulated as Eq. (4):

l(x,y) = 1 if h(x,y) < 1.15 × K (non-lip region); l(x,y) = 0 if h(x,y) ≥ 1.15 × K (lip region)    (4)

where h(x,y) and l(x,y) are the pixel values (illuminations) of the original gray-scale and final binary images, respectively. Fig. 3 shows the binary image resulting from this conversion.

Figure 3. Segmented image.
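A minimal sketch of the transform of Eq. (3) and the histogram-based thresholding of Eq. (4) is given below. The number of histogram bins is our assumption, and the returned mask marks lip pixels as True to match the white-lip convention of Section III.B, whereas Eq. (4) labels the two regions the other way around:

```python
import numpy as np

def rg_transform(rgb):
    """Transform of Eq. (3); lip pixels come out brighter than the rest of the face."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    return r**3 / (r**3 + g**3 + 1.0)

def binarize_lip(h, bins=256):
    """Threshold the transformed image at 1.15*K (Eq. 4).

    K is taken as the illumination value at the peak of the histogram of h.
    Returns True for pixels treated here as lip region.
    """
    hist, edges = np.histogram(h, bins=bins)
    k = edges[np.argmax(hist)]          # position of the histogram peak
    return h >= 1.15 * k
```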


C. Vertical center of lip

In the next stage, the boundaries of the upper and lower lips are identified. One of the most common methods for extracting mouth features is the use of gray-scale values and edge detection [18]. The initial step, as in many similar techniques, is the identification of the vertical position of the center of the mouth. This can be achieved by taking the sum of each row in the gray-scale image and then identifying the row with the minimum value. Fig. 4 shows the result of applying this algorithm.

Figure 4. Locating the vertical position of the center of mouth.
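The row-sum search described above amounts to only a couple of lines; a sketch, assuming a cropped gray-scale face or mouth image as input:

```python
import numpy as np

def vertical_mouth_center(gray):
    """Vertical position of the mouth center: the row with the minimum intensity sum.

    The dark gap between the lips makes the sum of this row smaller than that of
    the surrounding rows (Section II.C).
    """
    return int(np.argmin(gray.sum(axis=1)))
```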

III. LIP TRACKING UNIT

In this section we describe a combined method for lip tracking, from which the mouth area is extracted as the feature used for visual speech segmentation.


A. A combined method for lip extraction

The methods described above fail to detect the lip area accurately in the presence of the shadow and the depression below the lower lip (or a beard), or the shadow of the nose and the groove above the upper lip. According to our tests, the red exclusion algorithm can properly separate the lower lip area, since the effect of the depression under the lip and its shadow, which are the usual sources of error, is removed. Likewise, the results obtained with the color transform introduced in Section II.B show that the disturbance caused by the groove above the upper lip and the shadow of the nose in detecting the upper lip area is considerably diminished. Using the algorithm proposed in Fig. 5, the mouth area can therefore be extracted precisely.

In the first stage, the input image is processed to find the vertical center of the lip and the mouth region. The upper lip is then extracted using the vertical center of the lip and the color transform based on red and green information, while the lower lip is extracted using the vertical center of the lip and the red exclusion method. The two images are then merged to form the mouth image. In the fourth stage, component labeling [19] is used to separate individual parts from each other and to eliminate excess and noisy parts; the largest region is taken as the mouth region. In the fifth stage, morphological image opening is applied, and the mouth region is then smoothed with a median filter. The output of the algorithm and the original image with the superimposed lip contour are shown in Fig. 6. Only the outer contour was used in this study.

The above algorithm is used to extract the lip region.

Figure 5. The proposed algorithm for lip feature extraction.


Figure 6. (a) The original image with highlighted lip contour; (b) the output binary image resulting from the proposed method.
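Assuming the helper functions sketched in Sections II.A-II.C, the pipeline of Fig. 5 might be put together roughly as below. The structuring-element size, the median-filter window, and the simple split at the vertical center for merging the upper- and lower-lip masks are our assumptions; the paper does not specify them:

```python
import numpy as np
from scipy import ndimage

def extract_mouth_mask(rgb, gray, mu, sigma):
    """Rough sketch of the combined lip-extraction pipeline of Fig. 5."""
    center = vertical_mouth_center(gray)                    # Section II.C

    # Upper lip from the red/green color transform, lower lip from red exclusion;
    # merging by splitting at the vertical center is our simplification.
    upper = binarize_lip(rg_transform(rgb))
    lower = red_exclusion_mask(rgb, mu, sigma)
    mask = np.zeros(gray.shape, dtype=bool)
    mask[:center] = upper[:center]
    mask[center:] = lower[center:]

    # Component labeling [19]: keep only the largest connected region (the mouth).
    labels, n = ndimage.label(mask)
    if n > 0:
        sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
        mask = labels == (int(np.argmax(sizes)) + 1)

    # Morphological opening, then median filtering to smooth the mouth region.
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    mask = ndimage.median_filter(mask.astype(np.uint8), size=5) > 0
    return mask
```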

B. Mouth area estimation

The described algorithm divides the image into lip and non-lip regions, where all lip-region pixels are set to '1' (white) and all non-lip pixels to '0' (black), as depicted in Fig. 6. To estimate the mouth area, all pixels in the mouth region are counted; this count is taken as the mouth area.
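Counting the lip pixels per frame then yields the area series used in Section IV; a minimal sketch, with the normalization by the maximum value (as used for Fig. 7) included for illustration:

```python
import numpy as np

def mouth_area_series(masks):
    """Mouth area per frame: the number of lip ('1') pixels in each binary mask."""
    areas = np.array([float(m.sum()) for m in masks])
    return areas / areas.max()      # normalized to the maximum, as in Fig. 7
```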

IV. AUTOMATIC VISUAL SPEECH SEGMENTATION

In this section an efficient method for speech segmentation is presented. Small-vocabulary isolated word recognition systems have the advantages of simplicity of implementation and high recognition accuracy. Automatic speech segmentation is more practical than manual speech segmentation, as it reduces the amount of work involved; moreover, automatic speech segmentation provides more consistent results [16]. Throughout this paper, speech segmentation is defined as the process which identifies the vowel boundaries of a speech signal. The intervals between successive vowel boundaries are speech segments. Although Fowler (1984) suggests that human listeners divide the speech signal into overlapping phonetic segments rather than discrete context-sensitive phonetic segments, the latter approach simplifies implementation substantially [20].

After the mouth region is separated by the proposed method, the mouth area in each frame is obtained by counting the pixels in the mouth region, as described in Section III.B. Depending on its duration, each phrase spans roughly 40-160 frames, and the mouth area is calculated for every frame. Fig. 7 shows examples of the mouth area over a sequence of frames, normalized by the maximum mouth area in the sequence.

In Fig. 7, the horizontal axis shows the frame number and the vertical axis shows the normalized mouth area. Each local maximum (peak) in Fig. 7 marks the start of a vowel; in other words, each peak represents the beginning of a syllable. However, automatic recognition of these points directly from the area curve is almost impossible.

Considering that each syllable begins with a vowel, which changes the mouth shape, and that the mouth shape remains approximately constant while the vowel is uttered, the difference between the mouth areas of consecutive frames is a good feature for syllable separation. We therefore form a series of mouth-area differences and apply an averaging filter to it, in order to remove sudden changes caused by small displacements of the detected mouth boundary from frame to frame.

Y(1) = X(1)
Y(2) = (X(1) + X(2) + X(3)) / 3
Y(3) = (X(1) + X(2) + X(3) + X(4) + X(5)) / 5
Y(4) = (X(2) + X(3) + X(4) + X(5) + X(6)) / 5
...
Y(i) = (X(i-2) + X(i-1) + X(i) + X(i+1) + X(i+2)) / 5    (5)

Eq. (5) describes the averaging filter, where Y(i) is the filter output and X(i) is the mouth area of frame i minus that of frame i-1.
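For illustration, the filter of Eq. (5) could be transcribed directly as follows; this is a sketch, and the behavior at the end of the series, which the paper does not specify, is simply clipped here:

```python
import numpy as np

def smooth_differences(x):
    """Averaging filter of Eq. (5) applied to the area-difference series x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = np.empty(n)
    for i in range(n):
        if i == 0:
            y[i] = x[0]                          # Y(1) = X(1)
        elif i == 1:
            y[i] = x[:3].mean()                  # Y(2) = (X(1)+X(2)+X(3)) / 3
        else:
            lo, hi = i - 2, min(n, i + 3)
            y[i] = x[lo:hi].mean()               # Y(i): 5-tap mean, clipped at the end
    return y
```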

To find the syllable start points, the zero-crossing points of the smoothed series are first located, and the maximum between each pair of consecutive zero crossings is then determined. The positions of these maxima are the syllable start points and are taken as the segmentation boundaries. Fig. 8 shows the result of applying this method to a three-syllable and an eight-syllable phrase; the position of each maximum marks the start of a syllable.

The stages of the whole method can be enumerated as:

1) Calculation of the mouth area in every frame.

2) Subtraction of the mouth areas of consecutive frames and forming a numerical series from these differences.

3) Applying a smoothing filter to the numerical series obtained in stage 2.

4) Finding the zero crossings of the smoothed series.

5) Taking the positions of the maxima between consecutive pairs of zero crossings as the starting points of syllables.
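Assuming the helpers sketched earlier (mouth_area_series and smooth_differences), stages 2-5 might be combined roughly as follows; treating only positive peaks as syllable starts (mouth opening) is our assumption:

```python
import numpy as np

def syllable_start_frames(areas):
    """Stages 2-5: differences -> smoothing -> zero crossings -> maxima."""
    diffs = np.diff(np.asarray(areas, dtype=float))    # stage 2: area(i) - area(i-1)
    y = smooth_differences(diffs)                      # stage 3: Eq. (5)

    # Stage 4: indices where the smoothed series changes sign.
    zc = np.where(np.diff(np.sign(y)) != 0)[0]

    # Stage 5: the maximum between each pair of consecutive zero crossings is
    # taken as a syllable start; we keep only positive peaks (our assumption).
    starts = []
    for a, b in zip(zc[:-1], zc[1:]):
        seg = y[a:b + 1]
        if seg.max() > 0:
            starts.append(int(a + np.argmax(seg)))
    return starts
```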

Figure 7. Mouth area over a sequence of frames: (a) the phrase "zibarooyane beheshti"; (b) the phrase "Taksavar".


Figure 8. Detected syllable boundaries and zero-crossing points: (a) the phrase "zibarooyane beheshti"; (b) the phrase "Taksavar".

V. THE DATABASE

In order to prepare a suitable database, four persons were asked to utter 76 Farsi words and phrases, each twice. The database elements were selected to include many common combinations; the words and phrases contain 2, 3, 4, 5, 6 or 8 vowels. The utterances were recorded at the common rate of 30 frames per second. Vowels and consonants are the basic elements of each language. Moreover, the difference in pronunciation of a word uttered by people with different mother tongues is mainly due to variations in the vowels and the way they are pronounced. In Farsi (Persian) there are six distinct vowels, which are fairly similar to the English vowels a, e, o, a, i and u, respectively [21].

In the conducted experiments, the speakers were not asked to fix their head position; rather, they were asked only to keep their heads within the limits of the picture frame. The speakers were also not required to speak very slowly. The image database contains about 47,400 color images with a resolution of 640x480 pixels.

VI. SEGMENTATION RESULTS

The results of automatic boundary detection were compared with manual detection by an observer; the correct detection rate was 90%. Table I shows the details of the boundary detection accuracy for various kinds of phrases. It should be noted that the Persian words and phrases investigated have 2, 3, 4, 5, 6 and 8 syllables.

The average detection accuracy over all numbers of syllables is 90%.


Experiments revealed that the resulting errors are partly due to mistakes in the lip detection stage, which in turn lead to errors in the calculation of the mouth area. Low mouth motion between some syllables is another contributor to the errors.

Since no similar work was found on Farsi, we compare the results of the presented algorithm with those in [16]. Mak and Allen (1994) explain how visual information from the lips and acoustic signals can be combined for speech segmentation [16]. They extract the velocity of the lips from image sequences, estimated by a combination of morphological image processing and block matching techniques, and use the resulting lip velocity to locate the syllable boundaries. They applied this method to 5 English phrases. Their method was audio-visual, and they did not report its accuracy in the visual-only mode.

The larger database and the simplicity of the presented method support its superiority over such methods. In fact, the low computational load of the proposed method is one of its main advantages compared to works with a similar target.

TABLE I
BOUNDARY DETECTION RATE FOR VARIOUS KINDS OF WORDS AND PHRASES

Phrase                      Number of utterances   Number of phrases   Detection rate
Three-syllable              2                      61                  89.7%
More than three syllables   1                      10                  89%
Disyllabic                  1                      5                   100%

VII. CONCLUSION

In this paper, a simple and efficient method for lip extraction and syllable separation was introduced. The technique was applied to the separation of syllables in a number of 2- to 8-syllable words and phrases. The promising results showed the merits of this new technique. It seems that the method can also be applied to other languages with little or no modification.

REFERENCES

[1] S. Boll, "Speech enhancement in the 1980s: noise suppression with pattern matching," in Advances in Speech Signal Processing, Dekker, 1992.
[2] C. Bregler and Y. Konig, "Eigenlips for robust speech recognition," in Proc. ICASSP, pp. 669-672, 1994.
[3] A. J. Goldschen, "Continuous Automatic Speech Recognition by Speechreading," Ph.D. thesis, George Washington University, Washington, DC, 1993.
[4] E. D. Petajan, "Automatic lipreading to enhance speech recognition," Ph.D. dissertation, Univ. Illinois, Urbana-Champaign, 1984.
[5] S. Morishima, S. Ogata, K. Murai, and S. Nakamura, "Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, May 2002, pp. 2117-2120.
[6] A. Adjoudani and C. Benoit, "On the integration of auditory and visual parameters in an HMM-based ASR," in Speechreading by Humans and Machines, NATO ASI Series, D. G. Stork and M. E. Hennecke, Eds., 1996, pp. 461-472.
[7] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 337-351, Sept. 1996.
[8] M. Tomlinson, M. Russell, and N. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, 1996, pp. 821-824.
[9] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication," Proc. IEEE, vol. 86, pp. 837-852, May 1998.
[10] K. E. Finn and A. A. Montgomery, "Automatic optically-based recognition of speech," Pattern Recognition Lett., vol. 8, no. 3, pp. 159-164, 1988.
[11] J. Luettin, N. A. Thacker, and S. W. Beet, "Speechreading using shape and intensity information," in Proc. Int. Conf. Spoken Language Processing, 1996, pp. 58-61.
[12] M. Barnard, E. J. Holden, and R. Owens, "Lip tracking using pattern matching snakes," in Proc. 5th Conf. on Computer Vision, pp. 1-6, 2002.
[13] A. Waibel, "Modular construction of time-delay neural networks for speech recognition," Neural Computation, vol. 1, pp. 39-46, 1989.
[14] S. Nakamura, "Statistical multimodal integration for audio-visual speech processing," IEEE Transactions on Neural Networks, vol. 13, no. 4, July 2002.
[15] T. Shinchi et al., "Vowel recognition according to lip shapes using neural networks," in Proc. IEEE, 1998.
[16] M. W. Mak and W. G. Allen, "Lip-motion analysis for speech segmentation in noise," Speech Communication, vol. 14, pp. 279-296, 1994.
[17] J.-M. Zhang, H. Tao, L.-M. Wang, Y.-Z. Zhan, and S.-L. Song, "A real-time approach to the lip-motion extraction in video sequence," in Proc. IEEE International Conference on Systems, Man and Cybernetics, 2004.
[18] T. W. Lewis and D. M. W. Powers, "Audio-visual speech recognition using red exclusion and neural networks," Journal of Research and Practice in Information Technology, vol. 35, no. 1, February 2003.
[19] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Volume 1, Addison-Wesley, pp. 28-48, 1992.
[20] C. A. Fowler, "Segmentation of coarticulated speech in perception," Perception and Psychophysics, vol. 36, no. 4, pp. 359-368, 1984.
[21] V. S. Sadeghi and K. Yaghmaie, "Vowel recognition using neural networks," IJCSNS International Journal of Computer Science and Network Security, vol. 6, no. 12, December 2006.