

Real Time Characterization of Lip Movements Using Webcam

Suresh D S, Sandesh C

Director & Head of ECE Dept, CIT, Gubbi, Tumkur, Karnataka.

PG Student, ECE Dept, CIT, Gubbi, Tumkur, Karnataka.

Abstract: Communication plays an important role in the day-to-day activities of human beings. The main objective of this paper is to help people who are unable to speak. Visual speech plays a major role in lip reading for hearing-impaired listeners. Here we use local spatiotemporal descriptors in order to identify or recognize the words or phrases uttered by speech-impaired people from their lip movements. Local binary patterns extracted from the lip movements are used to recognize the isolated phrases. Experiments were made on ten speakers with 5 phrases, and an accuracy of about 75% was obtained for the speaker dependent case and 65% for the speaker independent case. When compared with other approaches, such as those on the AVLetters database, our method outperforms the others with an accuracy of 65%. The advantages of our method are its robustness and that recognition is possible in real time.

Key words: LBP-TOP, webcam, visual speech, KNN classifier, silent speech interface.

I. Introduction

In recent days, the silent speech interface has become an alternative in the field of speech processing. Voice-based speech communication suffers from many challenges, which arise from the fact that speech must be clearly audible and cannot be masked; these include lack of robustness, privacy issues, and exclusion of speech-disabled persons. These challenges may be overcome by the use of a silent speech interface. A silent speech interface is a device enabling speech communication to take place without the necessity of a voice signal, or when an audible acoustic signal is unavailable. In our approach, lip movements are captured by the use of a webcam, which is placed in front of the lips. Much research work focuses only on visual information to enhance speech recognition [1]. Audio information still plays a greater role than the visual features or information. But, in most cases, it is very difficult to extract. In our method we concentrate on lip movement representations for speech recognition solely with visual information.

Extraction of a set of visual observation vectors is the key element of an AVSR (audio-visual speech recognition) system. Geometric features, combined features and appearance features are mainly used for representing visual information. The geometric feature method represents facial animation parameters such as lip movement, shape of the jaw and width of the mouth. These methods require more accurate and reliable facial feature detection, which is difficult in practice and impossible at low image resolution.

In this paper, we propose an approach for reading lip movements, with which human-computer interaction improves significantly and understanding in noisy environments also improves.

II. Local Spatiotemporal Descriptors for Visual Information

In this paper, we concentrate on LBP-TOP in order to extract the features from the extracted video frames. The Local Binary Pattern (LBP) operator is a gray-scale invariant texture measure, a simple statistic, which has shown the best performance in the classification of different kinds of texture. For each pixel in an image, a binary code is generated by thresholding the pixels of its neighborhood against the value of the center pixel, as shown in Figure 1(a) [1].
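As a concrete illustration, here is a minimal sketch of the basic 3x3 LBP operator just described; the function name and the bit ordering (clockwise from the top-left neighbor) are our own choices, since implementations differ in where the bit sequence starts:

```python
import numpy as np

def basic_lbp(patch):
    """Basic LBP on a 3x3 patch: threshold the 8 neighbors against the
    center pixel and read the resulting bits as one 8-bit code."""
    center = patch[1, 1]
    # Neighbors taken clockwise, starting from the top-left pixel.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for i, g in enumerate(neighbors):
        if g >= center:        # neighbor at least as bright as center -> bit 1
            code |= 1 << i
    return code

patch = np.array([[9, 9, 9],
                  [0, 5, 0],
                  [0, 0, 0]])
print(basic_lbp(patch))  # the three top neighbors set bits 0..2 -> 7
```

Sliding this operator over every pixel and histogramming the codes gives the texture description used below.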

Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 34

NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in


LBP(P,R) = Σ_(p=0..P−1) s(gp − gc) 2^p, with s(x) = 1 if x ≥ 0 and 0 otherwise,

where gc denotes the grey value of the center pixel {xc, yc} of the local neighborhood and gp is related to the grey values of P equally spaced pixels on a circle of radius R. A histogram is created in order to collect up the occurrences of the various binary patterns. The definition of neighbors can be extended to circular neighborhoods with any number of pixels, as shown in Figure 1(b). In this way, we can collect larger scale texture primitives or micro patterns, like spots, lines and corners.

Figure 1. (a) Basic LBP operator. (b) Circular (8,2) neighborhood.

Local texture descriptors have obtained tremendous attention in the analysis of facial images because of their robustness to challenges such as pose and illumination changes. In our approach, temporal texture recognition using local binary patterns extracted from three orthogonal planes (LBP-TOP) is used. The LBP-TOP method is more efficient than the ordinary LBP. In ordinary LBP we are extracting information or features in two dimensions, whereas in LBP-TOP we are extracting information in three dimensions, i.e., X, Y and T. For LBP-TOP, the radii in the spatial and temporal axes X, Y, and T, and the number of neighboring points in the XY, XT, and YT planes can also be different and can be marked as RX, RY, RT and PXY, PXT, PYT. The LBP-TOP feature is then denoted as LBP-TOP(PXY, PXT, PYT, RX, RY, RT). If the coordinates of the center pixel gtc,c are (xc, yc, tc), then the coordinates of the local neighborhood in the XY plane gXY,p are (xc − RX sin(2πp/PXY), yc + RY cos(2πp/PXY), tc), the coordinates of the local neighborhood in the XT plane gXT,p are (xc − RX sin(2πp/PXT), yc, tc − RT cos(2πp/PXT)), and the coordinates of the local neighborhood in the YT plane gYT,p are (xc, yc − RY cos(2πp/PYT), tc − RT sin(2πp/PYT)). Sometimes the radii in the three axes are the same, and so is the number of neighboring points in the XY, XT, and YT planes. In that case, we use LBP-TOP(P,R) for abbreviation, where P = PXY = PXT = PYT and R = RX = RY = RT [1].

Figure 2. (a) Volume of utterance sequence. (b) Image in XY plane. (c) Image in XT plane. (d) Image in TY plane.
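Assuming the standard plane-wise neighbor equations of LBP-TOP given in [1], the sampling coordinates around a center pixel can be sketched as follows (the function names are ours):

```python
import math

def xy_neighbors(xc, yc, tc, P, Rx, Ry):
    # Appearance plane: circle in space, within the single frame tc.
    return [(xc - Rx * math.sin(2 * math.pi * p / P),
             yc + Ry * math.cos(2 * math.pi * p / P),
             tc) for p in range(P)]

def xt_neighbors(xc, yc, tc, P, Rx, Rt):
    # Horizontal-motion plane: circle in x and t, on the fixed row yc.
    return [(xc - Rx * math.sin(2 * math.pi * p / P),
             yc,
             tc - Rt * math.cos(2 * math.pi * p / P)) for p in range(P)]

def yt_neighbors(xc, yc, tc, P, Ry, Rt):
    # Vertical-motion plane: circle in y and t, on the fixed column xc.
    return [(xc,
             yc - Ry * math.cos(2 * math.pi * p / P),
             tc - Rt * math.sin(2 * math.pi * p / P)) for p in range(P)]
```

Thresholding the (interpolated) grey values at these coordinates against the center pixel, separately per plane, yields the three codes whose histograms are combined below.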


Figure 2(a) demonstrates the volume of an utterance sequence. Figure 2(b) shows an image in the XY plane. Figure 2(c) is an image in the XT plane, providing a visual impression of one row changing in time, while Figure 2(d) describes the motion of one column in temporal space [1]. An LBP description computed over the whole utterance sequence encodes only the occurrences of the micro patterns, without any indication about their locations. In order to overcome this effect, a representation which consists of dividing the mouth image into several overlapping blocks is introduced. Figure 3 also gives some examples of the LBP images. The second, third, and fourth rows show the LBP images which are drawn using the LBP code of every pixel from the XY (second row), XT (third row), and YT (fourth row) planes, respectively, corresponding to the mouth images in the first row.

Figure 3. Mouth region images (first row), LBP-XY images (second row), LBP-XT images (third row), and LBP-YT images (last row) from one utterance [1].

Figure 4. Features in each block volume. (a) Block volumes. (b) LBP features from three orthogonal planes. (c) Concatenated features for one block volume with the appearance and motion [1].

Figure 5. Mouth movement representation [1].

When a person utters a command phrase, the words are pronounced in order, for instance "you see" or "see you". If we do not consider the time order, these two phrases would generate almost the same features.
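The block-volume representation of Figures 3–5 can be sketched as follows. This is a simplified outline of the concatenation step only (histogram extraction itself is assumed done), under the assumption that each per-plane histogram is L1-normalized before concatenation:

```python
import numpy as np

def concat_block_features(block_histograms):
    """block_histograms: one (h_xy, h_xt, h_yt) triple per space-time block
    volume, listed in spatial order and time order. Each plane histogram is
    normalized and all are concatenated, so the final vector keeps the
    appearance (XY) and motion (XT, YT) of every block, together with where
    and when each pattern occurred."""
    parts = []
    for h_xy, h_xt, h_yt in block_histograms:
        for h in (h_xy, h_xt, h_yt):
            h = np.asarray(h, dtype=float)
            total = h.sum()
            parts.append(h / total if total > 0 else h)
    return np.concatenate(parts)
```

With, say, a 2 x 5 spatial grid over 3 time segments and 256-bin histograms per plane, this yields a 2 * 5 * 3 * 3 * 256-dimensional descriptor; because the blocks are ordered in time, "you see" and "see you" no longer collapse to the same feature vector.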


To overcome this effect, the whole sequence is divided into block volumes not only according to spatial regions but also in time order, as shown in Figure 4(a). The LBP-TOP histograms in each block volume are computed and concatenated into a single histogram, as shown in Figure 4. All features extracted from each block volume are connected to represent the appearance and motion of the mouth region sequence, as shown in Figure 5.

A histogram of the mouth movements can then be defined as in [1].

III. Multiresolution Features and Feature Selection

Multiresolution features provide more accurate information, and the analysis of dynamic events also improves. By using these multiresolution features, the number of features increases greatly. If the features from the different resolutions were connected in series directly, the feature vector would be very long and, as a result, the computational complexity would be too high. It is obvious that not all the multiresolution features contribute equally, so it is necessary to find out which features (at which location, with what resolution and, more importantly, of which type, such as appearance, horizontal motion or vertical motion) are the most important. We need feature selection for this purpose. In changing the parameters, three different types of spatiotemporal resolution are presented: 1) use of a different number of neighboring points when computing the features in the XY (appearance), XT (horizontal motion), and YT (vertical motion) slices; 2) use of different radii that can capture the occurrences in different space and time scales; 3) use of blocks of different sizes to create global and local statistical features [4][5].

IV. Our System

Our system consists of three stages, as shown in Figure 6. The first stage is the detection of lip movements. The second stage extracts the visual features from the mouth movement sequence. The role of the final stage is to recognize the input utterance using a KNN classifier.

Figure 6. System diagram
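The three ways of varying the spatiotemporal resolution listed above span a parameter grid from which feature selection picks the useful slices. A minimal sketch of such a grid; the specific values below are illustrative, not the ones used in the paper:

```python
from itertools import product

# Illustrative multiresolution grid (example values, not the paper's):
neighbor_counts = [4, 8]              # P per XY/XT/YT slice (type 1)
radii = [1, 3]                        # R in space and time (type 2)
block_grids = [(1, 5, 1), (2, 5, 2)]  # (rows, cols, time segments) (type 3)

settings = list(product(neighbor_counts, radii, block_grids))
print(len(settings))  # 2 * 2 * 2 = 8 candidate resolutions to select from
```

Each setting contributes its own appearance and motion histograms, and the selection step keeps only the most discriminative ones rather than concatenating all of them.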


In our approach, a webcam is used to capture the lip movements. We then extract the visual features from the mouth region. These features are fed to the classifier. For speech recognition, a KNN classifier is selected since it is well founded in statistical learning theory and has been successfully applied to various object detection tasks in computer vision. Since the KNN is only used for separating two sets of points, the phrase classification problem is decomposed into two-class problems, and then a voting scheme is used to accomplish recognition. Sometimes more than one class gets the highest number of votes; in this case, 1-NN template matching is applied to these classes to reach the final result. This means that in training, the spatiotemporal LBP histograms of utterance sequences belonging to a given class are averaged to generate a histogram template for that class [1].

V. Experiments Protocol and Results

For a more accurate evaluation of our proposed method, we made a design with different experiments, including speaker independent, speaker dependent, and multiresolution.

1) Speaker Independent Experiments: For the speaker independent experiments, leave-one-speaker-out is utilized. We made a training of a particular phrase with 10 different speakers and, when testing was done, we were successful in getting the same phrase from 6 speakers. The overall results were obtained using M/N (M is the total number of correctly recognized sequences and N is the total number of testing sequences). When we are extracting the local patterns, we take into account not only the locations of micro patterns but also the time order of lip movements, so the whole sequence is divided into block volumes according to not only spatial regions but also time order. Figure 7 demonstrates the performance for every speaker. The results from the second speaker are the worst, mainly because the big moustache of that speaker really influences the appearance and motion in the mouth region.

Figure 7. Recognition performance for every speaker.

2) Speaker Dependent Experiments: For the speaker dependent experiments, leave-one-utterance-out is utilized for cross validation. In our approach, when a particular speaker is trained with some list of phrases and testing is afterwards performed with the same speaker, 70% of the same phrases were identified correctly.

3) One versus One and One versus Rest Recognition: In the previous experiments on our own dataset, the ten-phrase classification problem is decomposed into 45 two-class problems ("Hello" "See you", "I am sorry" "Thank you", "You are welcome" "Have a good time", etc.). But using this multiple two-class strategy, the number of classifiers grows quadratically with the number of classes to be recognized, like in the AVLetters database. When the class number is N, the number of the KNN classifiers would be N(N−1)/2.

Figure 8. Selected 15 slices for phrases "See you" and "Thank you". "|" in the blocks means the YT slice (vertical motion) is selected, "_" the XT slice (horizontal motion), and "/" the appearance XY slice [1].
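The one-versus-one decomposition with majority voting can be sketched as below. The pairwise classifiers are stubbed here (the paper uses two-class KNNs, with 1-NN template matching on averaged histograms to break ties):

```python
from itertools import combinations

def one_vs_one_vote(feature, pairwise_classifiers, classes):
    """Majority vote over the N*(N-1)/2 two-class decisions."""
    votes = {c: 0 for c in classes}
    for pair in combinations(classes, 2):
        winner = pairwise_classifiers[pair](feature)  # returns one of pair
        votes[winner] += 1
    return max(votes, key=votes.get)

classes = ["Hello", "See you", "Thank you"]
# Stub classifiers: each pair deterministically picks its first class.
clfs = {pair: (lambda f, p=pair: p[0]) for pair in combinations(classes, 2)}
print(len(clfs))                             # 3 classes -> 3 classifiers
print(one_vs_one_vote(None, clfs, classes))  # "Hello" wins 2 of 3 votes
```

For the ten phrases of the dataset this gives 10 * 9 / 2 = 45 pairwise classifiers, matching the count stated above.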


Figure 8 shows the selected slices for the similar phrases "see you" and "thank you". These phrases were the most difficult to recognize because they are quite similar in the latter part, containing the same word "you". The selected slices are mainly in the first and second parts of the phrase; just one vertical slice is from the last part. The selected features are consistent with human intuition.

Table I. Results from One to One and One to Rest Classifiers on Semi Speaker Dependent Experiments (Results in Parentheses are From One to Rest Strategy)

VI. Conclusion

A real time capture of lip movements using a webcam for the recognition of speech was proposed, in order to help the disabled person. Our approach uses local spatiotemporal descriptors for the recognition of the input utterance. LBP-TOP is used to extract the features from the captured images. Experiments were made on ten speakers, and the lip movements were converted into speech very efficiently. Compared to other approaches, our method outperforms the others with an accuracy of 70%.

References

1. G. Zhao, M. Barnard, and M. Pietikäinen, "Lipreading with local spatiotemporal descriptors," IEEE Transactions on Multimedia, vol. 11, no. 7, November 2009.
2. T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, 2006.
3. G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, 2007.
4. G. Zhao, M. Pietikäinen, and A. Hadid, "Local spatiotemporal descriptors for visual recognition of spoken phrases," in Proc. 2nd Int. Workshop on Human Centered Multimedia (HCM2007), 2007, pp. 57–65.
5. G. Zhao and M. Pietikäinen, "Boosted multi-resolution spatiotemporal descriptors for facial expression recognition," Pattern Recognit. Lett., Special Issue on Image/Video Based Pattern Analysis and HCI, 2009, accepted for publication.
