Automatic Creation of a Talking Head From a Video Sequence

Kyoung-Ho Choi and Jenq-Neng Hwang



Abstract

In this paper, a real-time system that creates a talking head from a video sequence without any user intervention is presented. In the proposed system, a probabilistic approach is presented to decide whether or not extracted facial features are appropriate for creating a three-dimensional (3-D) face model. Automatically extracted two-dimensional facial features from a video sequence are fed into the proposed probabilistic framework before a corresponding 3-D face model is built, to avoid generating an unnatural or nonrealistic 3-D face model. To extract face shape, we also present a face shape extractor based on an ellipse model controlled by three anchor points, which is accurate and computationally cheap. To create a 3-D face model, a least-square approach is presented to find the coefficient vector needed to adapt a generic 3-D model to the extracted facial features. Experimental results show that the proposed system can efficiently build a 3-D face model from a video sequence without any user intervention for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation.

Index Terms: MPEG-4 facial object, probabilistic approach, speech-driven talking heads, talking heads, virtual face.


    I. INTRODUCTION

ADVANCES in computing power and numerical algorithms in graphics and image processing make it possible to build a realistic three-dimensional (3-D) face from a video sequence captured by a regular PC camera. However, in most reported systems, user intervention is generally required to provide feature points at the initialization stage. In the initialization stage, feature points in two orthogonal frames or in multiple frames have to be provided carefully to generate a photo-realistic 3-D face model. These techniques can build high-quality face models, but they are computationally expensive and time consuming. For various multimedia applications such as video conferencing, e-commerce, and virtual anchors, integrating talking heads is highly desirable to enrich the human-computer interface. To provide talking head solutions for these multimedia applications, which do not require high-quality animation, fast and easy ways to build a 3-D face model have been investigated to generate many different face models in a short time period. However, user intervention is still required to provide several corresponding points in two frames from a video sequence, or feature points in a single frontal image. In this paper, we present a real-time system that extracts facial features automatically and builds a 3-D face model from a video sequence without any user intervention.

Approaches for creating a 3-D face model can be classified into two groups. Methods in the first group use a generic 3-D model, usually generated by a 3-D scanner, and deform the 3-D model by calculating the coordinates of all vertices in the 3-D model. Lee et al. considered the deformation of vertices in a 3-D model as an interpolation of the displacements of given control points; they used the Dirichlet Free-Form Deformation technique to calculate the new 3-D coordinates of a deformed 3-D model. Pighin et al. also considered model deformation as an interpolation problem and used radial basis functions to find new 3-D coordinates for the vertices of a generic 3-D model. Methods in the second group, in contrast, use multiple 3-D face models to find the 3-D coordinates of all vertices of a new 3-D model based on given feature points; they combine multiple 3-D models into a new 3-D model by calculating combination parameters. Blanz et al. [8] used a laser scanner (Cyberware) to generate a 3-D model database. They considered a


new face model as a linear combination of the shapes of the 3-D faces in the database. Liu et al. [4] simplified the idea of linearly combining 3-D models by designing key 3-D faces that can be combined linearly to build a new 3-D model, eliminating the need for a large 3-D face database. The merit of the approaches in the second group is that linearly created face objects cannot become wrong, unnatural faces, which is a very important property when creating a 3-D face model without user intervention. Emerging Internet applications equipped with a talking head system, such as merchandise narrators, virtual anchors, and e-commerce, do not require high-quality facial animation, e.g., the kind used in Shrek or Toy Story. Furthermore, the movement of a 3-D face model in those applications, i.e., rotation along the x and y directions, can be restricted to within 5 to 10 degrees. In other words, although the movement of a talking head is limited, users still do not feel uncomfortable in these applications. Recent approaches that create a 3-D face model from a single image are applicable to those Internet applications. del Valle et al. used manually extracted feature points and an interpolation technique based on radial basis functions to obtain the coordinates of the polygon mesh of a 3-D model. Kuo et al. used anthropometric and a priori information to estimate the depth of a 3-D face model. Lin et al. used a two-dimensional (2-D) mesh model to animate a talking head by mesh warping; they manually adjusted the control points of the mesh to fit the eyes, nose, and mouth to an input image.

All these approaches, based on a single image to obtain a 3-D face model, are computationally cheap and fast, which makes them suitable for generating multiple face models in a short time. Although the depth information of a 3-D model created by these approaches is not as accurate as that of more labor-intensive approaches, the textured 3-D face models should be good enough for various Internet applications that do not require high-quality facial animation. In this paper, we present a real-time system that extracts facial features automatically and builds a 3-D face model without any user intervention. The main contributions of this paper can be summarized as follows. Firstly, we propose a face shape extractor that is easy and accurate for various face shapes. We believe face shape is one of the most important facial features in


creating a 3-D face model. Our face shape extractor uses an ellipse model controlled by three anchor points and extracts various face shapes successfully. Secondly, we present a probabilistic network to maximally use facial feature evidence in deciding whether extracted facial features are suitable for creating a 3-D face model. To create a 3-D model from a video sequence without any user intervention, we need to keep extracting facial features and checking, in a systematic way, whether the extracted features are good enough to build a 3-D model. We propose a facial feature net, a face shape net, and a topology net to verify the correctness of extracted facial features, which also enables the algorithm to extract facial features more accurately. Thirdly, a least-square approach to create a 3-D face model based on the extracted facial features is presented. Our approach to 3-D model adaptation is similar to Liu's approach in the sense that a 3-D model is described as a linear combination of a neutral face and some deformation vectors. The differences are that we use a least-square approach to find the coefficients for the deformation vectors and that we build a 3-D face model from a video sequence with no user input. Lastly, a talking head system is presented by combining an audio-to-visual conversion technique based on constrained optimization [25] with the proposed automatic scheme of 3-D model creation.

The organization of this paper is as follows. In Section II, the proposed face shape extractor based on an ellipse model controlled by three anchor points is presented. A detailed explanation of the probabilistic network is given in Section III. In Section IV, the proposed least-square approach to create a 3-D face model is described. In Section V, experimental results as well as the implementation of the proposed real-time talking head system are described. Finally, conclusions and future work are given in Section VI.

II. FACE SHAPE EXTRACTOR

Face shape is one of the most important features in creating a 3-D face model. In this section, we propose a novel idea to extract face shape.


After the positions of facial components such as the mouth and eyes are known, as shown in Fig. 2(a), using various methods, the proposed face shape extractor is ready to start. We assume that a human face has a homogeneous color distribution, which means statistics, e.g., means and variances, can be used as criteria to decide whether a region is inside or outside the face (once the statistics for the inside of a face are known). The extractor proceeds as follows.

1) Find the three anchor points P1, P2, and P3 [Fig. 2(c)].

2) Model the face shape as an ellipse, where a is the distance between the x positions of P1 and P2 and b is the distance between the y positions of P1 and P3 (if the face shape is symmetric).

3) Add the intensities of the pixels on the ellipse that lie below the left and right anchor points and record the sum.

4) Move the left and right anchor points up and down to find the parameters of the ellipse that produce the maximum boundary energy for the face shape in an edge image [see Fig. 2(e)], using (1).

Fig. 2. Detecting three anchor points. (a) Extract facial features first. (b) Calculate the intensity average for the inside of a face: 1) draw lines from the corner of the left eye and from the nose center and find their intersection point C, and 2) find the average intensity of the pixels within a rectangular window (size 20 x 20) centered at C.


(c) Three anchor points P1, P2, and P3. (d) An ellipse-shaped search window and search direction. (e) An edge image.

As the search procedure for the three anchor points starts, it calculates statistics first: it computes an intensity average for the inside of the face by using a window, as shown in Fig. 2(b). The three anchor points can then be found by locating points whose intensity averages differ markedly from the previously calculated inside-face average. In our implementation, the threshold T_fs = 0.5 x (average intensity of the inside of the face) was selected experimentally to locate the anchor points. Because the search procedure for the three anchor points depends heavily on color distributions, it is sensitive to the color distributions of background objects. To overcome this weak point, the threshold T_fs is adjusted adaptively in our procedure (please refer to Section V for details). To find an optimal face shape, (1) is used to find the parameters a and b of the ellipse:

Energy(a, b) = \sum_{(x, y) \in \Omega} E(x, y)    (1)

where E(x, y) is the intensity of an edge image [Fig. 2(e)] and \Omega denotes the subset of pixels on the ellipse located below the left and right anchor points.
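To make the search concrete, the following is a minimal sketch of the boundary-energy computation in (1) and the anchor-driven search, assuming a grayscale edge image stored as a NumPy array and image coordinates with y increasing downward (so "below the anchors" means larger y); the choice of ellipse center, the search range, and the function names are our illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def boundary_energy(edge, center, a, b, y_cut, samples=720):
        # Eq. (1): sum edge intensities E(x, y) over the pixels of the
        # ellipse (semi-axes a, b) that lie below the anchor height y_cut.
        cx, cy = center
        t = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
        xs = np.rint(cx + a * np.cos(t)).astype(int)
        ys = np.rint(cy + b * np.sin(t)).astype(int)
        h, w = edge.shape
        ok = (ys > y_cut) & (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        return float(edge[ys[ok], xs[ok]].sum())

    def fit_face_ellipse(edge, p1, p2, p3, search=15):
        # Steps 2)-4): derive a and b from the anchors, then move the
        # side anchors up and down to maximize the boundary energy.
        a0 = abs(p1[0] - p2[0])      # a: x-distance between P1 and P2
        b0 = abs(p1[1] - p3[1])      # b: y-distance between P1 and P3
        center = (p1[0], p3[1])      # assumed ellipse center (hypothetical)
        best_e, best_ab = -1.0, (a0, b0)
        for db in range(-search, search + 1):   # vertical anchor motion
            e = boundary_energy(edge, center, a0, b0 + db, p2[1])
            if e > best_e:
                best_e, best_ab = e, (a0, b0 + db)
        return best_ab               # (a, b) maximizing the boundary energy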

    III. PROBABILITY NETWORKS

Probabilistic approaches have been used successfully to locate human faces in a scene and to track deformations of local features. Cipolla et al. proposed a probabilistic framework that combines different facial features and face groups, achieving a high confidence rate for face detection in a complicated scene. Huang et al. used a probabilistic


network for local feature tracking by modeling the locations and velocities of selected feature points. In our automated system, a probabilistic framework is adopted to maximally use facial feature evidence in deciding the correctness of extracted facial features before a 3-D face model is built. Fig. 3 shows the FDPs selected for the proposed probabilistic framework. The network hierarchy used in our approach is shown in Fig. 4; it consists of a facial feature net, a face shape net, and a topology net. The facial feature net has a mouth net and an eye net as its subnets. The details of each subnet are shown in Fig. 5. In the networks, each node represents a random variable and each arrow denotes a conditional dependency between two nodes. In a study of face anthropometry, data are collected by measuring distances and angles among selected key points of a human face, e.g., the corners of the eyes, mouth, and ears, to describe the variability of the human face. Based on that study, we characterize a frontal face by measuring distances and covariances between key points chosen from the study. All nodes in the proposed probability networks are classified into four groups:


Mouth = [D(8.1, 2.2), D(8.4, 8.3), D(2.3, 8.2)],
Eyes = [D(3.12, 3.7), D(3.12, 3.8), D(3.8, 3.11), D(3.13, 3.9)],
Topology = [D(2.1, 9.15), D(2.1, 3.8), D(9.15, 2.2)], and
Face Shape = [D(2.2, 2.1), D(10.7, 10.8)],

where D(P1, P2) is the distance between FDPs P1 and P2 as defined in the MPEG-4 standard. In our network, the distance between two feature points is defined as a random variable for each node. For instance, we model D(3.5, 3.6), the distance between the centers of the left and right eyes, and D(2.1, 9.15), the distance between the two selected points FDP 2.1 and FDP 9.15 shown in Fig. 5(b), as a 2-D Gaussian distribution, estimating means, standard deviations, and correlation coefficients. Fig. 5(c) shows graphical


illustrations of the relationships between two nodes in the proposed probability networks. For example, the distance between FDP 3.5 and FDP 3.6 and the length between FDP 8.4 and FDP 8.3 (the width of the mouth) are modeled as a 2-D Gaussian distribution

f(D_1, D_2) = \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1 - \rho_{12}^2}} \exp \left\{ -\frac{1}{2(1 - \rho_{12}^2)} \left[ \frac{(D_1 - \mu_1)^2}{\sigma_1^2} - \frac{2 \rho_{12} (D_1 - \mu_1)(D_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(D_2 - \mu_2)^2}{\sigma_2^2} \right] \right\}

where D_i, \mu_i, and \sigma_i denote the distance between two selected FDPs and the mean and standard deviation of D_i, respectively, and \rho_{12} denotes the correlation coefficient between the two nodes D_1 and D_2. To model the 2-D Gaussian distributions of D(3.5, 3.6) and the distances of the selected paired points, a database is used in our simulations. The reason we model the probability distributions based on FDP 3.5 and FDP 3.6 is that, according to our implementation, the left and right eye centers are the features that can be detected most reliably and accurately from a video sequence. The chain rule and conditional independence relationships are applied to calculate the joint probability of each network. For instance, the probability of the face shape net is defined as the joint probability of all three of its nodes, D(3.5, 3.6), D(2.2, 2.1), and D(10.7, 10.8):

P(Face Shape Net) = P(D(3.5, 3.6)) P(D(2.2, 2.1) | D(3.5, 3.6)) P(D(10.7, 10.8) | D(3.5, 3.6)).

In the same manner, the probabilities of the other networks are defined as products of their node distributions conditioned on D(3.5, 3.6).
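As an illustration, here is a minimal sketch of evaluating P(Face Shape Net) under the bivariate Gaussian model above; the parameter tuples (mean, standard deviation, correlation with the root node) and all function names are hypothetical, and the fitted values would come from the anthropometric data.

    import math

    def gauss1(d, mu, sigma):
        # Marginal density of the root node D(3.5, 3.6).
        return math.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def cond_gauss(d, d_root, mu, mu_root, sigma, sigma_root, rho):
        # Density of a node given the root, derived from the fitted 2-D
        # Gaussian: the mean shifts by rho * (sigma / sigma_root) * (d_root
        # - mu_root) and the variance shrinks to sigma^2 * (1 - rho^2).
        m = mu + rho * (sigma / sigma_root) * (d_root - mu_root)
        v = sigma ** 2 * (1.0 - rho ** 2)
        return math.exp(-(d - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    def face_shape_net(d_root, d_len, d_wid, root_par, len_par, wid_par):
        # P(Face Shape Net): root marginal times the two conditionals
        # given D(3.5, 3.6).  root_par is (mu, sigma); len_par and
        # wid_par are (mu, sigma, rho) for D(2.2, 2.1) and D(10.7, 10.8).
        mu_r, s_r = root_par
        p = gauss1(d_root, mu_r, s_r)
        for d, (mu, s, rho) in ((d_len, len_par), (d_wid, wid_par)):
            p *= cond_gauss(d, d_root, mu, mu_r, s, s_r, rho)
        return p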


In our implementation, P(Face Shape Net) is used to verify the face shape extracted by our face shape extractor, and P(Mouth Net) is used to check the extracted mouth features. P(Topology Net) is used to decide whether the facial components, i.e., the eyes, nose, and mouth, are located correctly along the vertical axis. P(Facial Features, Face Shape, Topology) in (8) is used as the decision criterion for the correctness of the extracted facial features before building a 3-D face model.


IV. A LEAST-SQUARE APPROACH TO ADAPT A 3-D FACE MODEL

Our system is devoted to creating a 3-D face model from a video sequence without any user intervention, which means we need an algorithm that is robust and stable enough to build a photo-realistic and natural 3-D face model. A recent approach proposed by Liu et al. shows that combining multiple 3-D models linearly is a promising way to generate a photo-realistic 3-D model. In this approach, a new face model is described as a linear combination of key 3-D face models, e.g., big mouth, small eyes, etc. The strong point of this approach is that the multiple face models constrain the shape of the new 3-D face, preventing the algorithm from producing an unrealistic 3-D face model. Our approach is similar to Liu's approach in the sense that a 3-D model is described as a linear combination of a neutral face and some deformation vectors. The main differences are that: 1) we use a least-square approach, rather than an iterative approach, to find the coefficient vector for creating a new 3-D face model and 2) we build a 3-D face model from a video sequence with no user input.

A. The 3-D Model

Our 3-D model is a modified version of the 3-D face model developed by Parke and Waters [28]. We have developed a 3-D model editor to build a complete head-and-shoulders model, including ears and teeth. Fig. 6(a) shows the modified 3-D model used in our system. It has 1294 polygons, which is good enough for realistic facial animation. Based on this 3-D model and the 3-D model editor, 16 face models have been designed for the proposed system (more face models can be added to make a better 3-D model), because eight position vectors and eight shape vectors (please see Section IV-B) are a minimal requirement to describe a 3-D face, in the sense that the shapes and locations of the mouth, nose, and eyes are the most important features to


describe a human frontal face. These face models are combined linearly based on automatically extracted facial features, such as the shape of the face and the locations and sizes of the eyes, nose, and mouth. If we denote the face geometry by a vector F = (v_1, ..., v_n)^T, where v_i = (X_i, Y_i, Z_i)^T are the vertices, and let D_i be a deformation vector that contains the amount of variation in the size and location of the vertices of the 3-D model, the face geometry can be described as

F = F_0 + \sum_{i=1}^{m} c_i D_i

where F_0 is the neutral face vector and c = (c_1, c_2, ..., c_m)^T is a coefficient vector that decides the amount of variation to be applied to the vertices of the neutral face model.

    B. The 3-D Model Adaptation

Finding the optimal 3-D model that best matches the input video sequence can be considered as the problem of finding a coefficient vector that minimizes the mean-square error between the 3-D feature points projected onto 2-D and the feature points of the input face. We assume that all feature points are equally important, because the locations as well as the shapes of facial components


such as the mouth, eyes, and nose are all critical for modeling a 3-D face from a frontal face. In our system, all coefficients are decided at once by solving the following least-square formulation:

\min_{c} \sum_{j=1}^{n} \left\| V_j - \left( F_{0j} + \sum_{i=1}^{m} c_i D_{ij} \right) \right\|^2    (10)

where n denotes the number of extracted features and m is the number of deformation vectors. V_j is an extracted feature from the input image, given by its (x, y) location; F_{0j} is the corresponding vertex of the neutral 3-D model projected onto 2-D; and D_{ij} is the corresponding vertex of the deformation vector D_i projected onto 2-D using the current camera parameters. Fig. 6(a) shows the neutral 3-D face model, and Fig. 6(b)-(f) show examples of the 3-D face models used to calculate the deformation vectors in our implementation. For instance, by subtracting a wide 3-D face model, as shown in Fig. 6(b), from the neutral 3-D face model, shown in Fig. 6(a), a deformation vector for a wide face is obtained. For the deformation vectors, eight shape vectors (wide face, thin face, big (and small) mouth, nose, and eyes) and eight position vectors (minimum (and maximum) horizontal and vertical translation for the eyes and minimum (and maximum) vertical translation for the mouth and nose) are designed in our implementation. To solve the least-square problem, the singular value decomposition (SVD) is used in our implementation.
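A small sketch of how (10) can be solved follows, assuming the projected 2-D points are stacked into flat vectors; this data layout is our own assumption, but NumPy's least-squares routine is SVD-based, in line with the paper's use of the SVD.

    import numpy as np

    def solve_coefficients(V, F0, D):
        # Solve min_c || V - (F0 + D c) ||^2  (Eq. 10).
        #   V  : (2n,)   extracted (x, y) features, stacked
        #   F0 : (2n,)   neutral-model vertices projected onto 2-D
        #   D  : (2n, m) deformation vectors projected onto 2-D
        # Returns the coefficient vector c of length m (m = 16 here).
        c, residuals, rank, sv = np.linalg.lstsq(D, V - F0, rcond=None)
        return c

    def adapted_geometry(F0_3d, D_3d, c):
        # F = F0 + sum_i c_i D_i applied to the full 3-D geometry,
        # with F0_3d of shape (3N,) and D_3d of shape (3N, m).
        return F0_3d + D_3d @ c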

    V. IMPLEMENTATION AND EXPERIMENTAL RESULTS

    A. Automatic Creation of a 3-D Face Model

In this section, the detailed implementation of the proposed real-time talking head system is presented. To create a photo-realistic 3-D model from a video sequence without any user intervention, the proposed algorithms have to be integrated carefully. We assume that the user has a neutral face, is looking at the camera, and rotates in the x and y directions. The proposed algorithms catch the


best facial orientation, i.e., simply a frontal face, by extracting and verifying facial features. By analyzing video sequences, two requirements for the real-time system have been established, because the input is not a single image but a video sequence. First, face localization should not be invoked every frame: once the face is located, its location in the following frames is likely to be the same or very close. Second, facial features obtained in previous frames should be exploited to provide a better result in the current frame. Fig. 7 shows the detailed block diagram of the proposed real-time system. The proposed system starts by finding the face location in a video sequence using a method based on a normalized RG color space and frame differencing. After detecting the face location, a valley detection filter is used to find rough positions of the facial components. After applying the valley detection filter, the rough locations of the facial components, i.e., the eyes, nose, and mouth, are found by examining the intensity distribution projected in the vertical and horizontal directions. Then, the exact location of the nose is obtained by recursive thresholding, because the nose holes always have the lowest intensity around the nose: a threshold value is increased recursively until the number of pixels corresponding to the nose holes is reached. To find the exact locations of the mouth and eyes, several approaches can be used; we use a pseudo moving difference method, which is simple and computationally cheap. Based on the extracted feature locations, a search area for extracting the face shape can be found. Within this search area, we use the face shape extractor to extract the face shape. After feature extraction is done, the extracted features are fed into the proposed probabilistic networks to verify their correctness and suitability before a corresponding 3-D face model is built. The proposed probabilistic network acts as a quality control agent in creating a 3-D face model in the proposed system. Based on the output of the probability networks, T_fs is adjusted adaptively to extract the face shape more accurately. If only the face shape is bad, which means the extracted features are correct except for the face shape, the algorithm adjusts the threshold T_fs and extracts the face shape again without moving to the next frame [see Fig. 11(c) and (d)]. If the extracted face shape is bad again, the algorithm moves to the next frame and starts from detecting rough locations, without detecting the face location. If all features are bad, the algorithm moves to the next frame, locates the face, and extracts all features again. This control flow is sketched below.
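The following sketch summarizes the per-frame quality-control loop just described; the stage interface is a hypothetical stand-in for the blocks of Fig. 7, not the authors' code, and the retry policy is simplified to a single shape retry per frame.

    def create_face_model(frames, stages, t_fs=0.5, t_max=1.0, step=0.1):
        # `stages` bundles the pipeline callables (hypothetical interface):
        #   stages.locate(frame)           -> face region or None
        #   stages.features(frame, face)   -> eyes/nose/mouth features
        #   stages.shape(frame, feats, t)  -> face shape at threshold t
        #   stages.verify(feats, shape)    -> (features_ok, shape_ok)
        #   stages.adapt(feats, shape)     -> adapted 3-D model (Sec. IV)
        face = None
        for frame in frames:
            if face is None:
                face = stages.locate(frame)       # RG color + frame difference
            feats = stages.features(frame, face)  # valley filter, recursive
                                                  # thresholding, moving difference
            shape = stages.shape(frame, feats, t_fs)
            feats_ok, shape_ok = stages.verify(feats, shape)
            if feats_ok and not shape_ok and t_fs < t_max:
                t_fs = min(t_fs + step, t_max)    # retry shape, same frame
                shape = stages.shape(frame, feats, t_fs)
                feats_ok, shape_ok = stages.verify(feats, shape)
            if feats_ok and shape_ok:
                return stages.adapt(feats, shape) # build the 3-D model
            if not feats_ok:
                face = None                       # relocate face next frame
        return None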

    B. Speech-Driven Talking Head System

After a virtual face is built, an audio-to-visual conversion technique based on constrained optimization is combined with the virtual face to make a complete talking head system. Several research results are available for audio-to-visual conversion. In our system, we have selected the constrained optimization technique because it is robust in noisy environments. Our talking head system aims at generating FDPs and FAPs for MPEG-4 talking head applications with no user input. The FDPs are obtained automatically from a video sequence, captured by a camera connected to a PC, based on the proposed automatic scheme of facial feature extraction and 3-D model adaptation. The FAPs are generated by audio-to-visual conversion based on the constrained optimization technique. Fig. 8 shows the block diagram of the encoder for the proposed talking head system. The FDPs and FAPs, created without any user intervention, are


coded as an MPEG-4 bit stream and sent to a decoder via the Internet. Because the coded bit stream contains FDPs and FAPs, no animation artifacts are expected at the decoder. For transmitting speech over the Internet, G.723.1, a dual-rate speech coder for multimedia communications, is used. G.723.1, the most widely used standard codec for Internet telephony, is selected because of its capability for low-bit-rate coding, working at 5.3 and 6.3 kb/s (please see [29] for a detailed explanation of G.723.1). In the initialization stage, the 3-D coordinates and texture information of the adapted 3-D model are sent to the decoder via the TCP protocol. Then, the coded speech and animation parameters are sent to the decoder via the UDP protocol in our implementation. Fig. 9(a) and (b) show screen shots of the encoder and decoder implemented in our talking head system. The performance of the proposed talking head system has been evaluated subjectively, and the results are shown in Section V-C.
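For concreteness, here is a minimal sketch of this transport split on the encoder side; the ports, the length-prefixed framing, and the payload handling are illustrative assumptions and not part of the MPEG-4 or G.723.1 specifications.

    import socket
    import struct

    def send_initialization(host, model_blob, port=5000):
        # One-time, reliable delivery of the adapted model's 3-D
        # coordinates and texture over TCP, length-prefixed.
        with socket.create_connection((host, port)) as s:
            s.sendall(struct.pack("!I", len(model_blob)))
            s.sendall(model_blob)

    def stream_parameters(host, packets, port=5001):
        # Coded speech and FDP/FAP animation parameters over UDP; a
        # sequence number lets the decoder drop stale packets.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            for seq, payload in enumerate(packets):
                s.sendto(struct.pack("!I", seq) + payload, (host, port))
        finally:
            s.close()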


    C. Experimental Results

The proposed automatic system, which creates a 3-D face model from a video sequence without any user intervention, produces facial features, including face shape, at about 9 frames per second (fps) on a Pentium III 600-MHz PC. Twenty feature points, as shown in Fig. 3, and 16 deformation vectors were used in our implementation [n = 20 and m = 16 in (10)]. Users are required to provide a frontal view with a rotation angle of less than 5 degrees. Twenty video sequences were recorded, making approximately 2000 frames in total. The proposed face shape extractor was tested on the captured video sequences, which contain different types of faces. Fig. 10 shows some examples of extracted face shapes for different face shapes and orientations. The proposed face shape extractor achieved a detection rate of 64% on 1180 frames selected from the testing video sequences. Most errors come from similar color distributions between the face and the background and from failures to detect facial components such as the eyes and mouth. Fifty frontal face images from the PICS database of the University of Stirling (http://pics.psych.stir.ac.uk/) were used to build the proposed probabilistic network, and the Expectation-Maximization (EM) algorithm was used to model the 2-D Gaussian distributions. The proposed probabilistic network was tested as a quality control agent in our real-time talking head system. Fig. 11(a) and (b) show examples of facial features rejected by the probabilistic network, preventing the creation of unrealistic faces. T_fs, the threshold value for face shape extraction, was adjusted automatically from 0.5 x (average intensity of the inside of the face) to 1.0 x (average intensity of the inside of the face) to improve accuracy, based on the results of the probabilistic network. If only P(Face Shape Net) is low, T_fs is increased to find a clearer boundary of the face [please see P2 in Fig. 2(c)]. Fig. 11(c) and (d) show examples of feature extraction improved by adjusting the threshold values. According to the simulation results, the proposed probabilistic


networks were successfully combined with our automatic system to create a 3-D face model. Fig. 12 shows examples of successfully created 3-D face models. By using the probabilistic network approach, the chance of creating unrealistic faces due to wrong facial features was reduced significantly. The performance of the proposed talking head system was evaluated subjectively. Twelve people participated in the subjective assessments, using a 5-point scale. Table I shows the results of the subjective test and gives an idea of how good the proposed talking head system is, even though it is created without any user intervention. People were asked how realistic the adapted 3-D model is and how natural its talking head is, to assess the performance of the proposed system. They were also asked to rate the audio quality, audio-visual synchronization, and overall performance. The overall results of the subjective evaluations show that the proposed automatic scheme produces a 3-D model that is quite realistic and good enough for various Internet applications that do not require high-quality facial animation.


TABLE I. SUBJECTIVE EVALUATIONS OF THE PROPOSED TALKING HEAD SYSTEM

    VI. CONCLUSIONS AND FUTURE WORK


We have presented an implementation of an automatic system to create a talking head from a video sequence without any user intervention. In the proposed system, we have presented: 1) a novel scheme to extract face shape based on an ellipse model controlled by three anchor points; 2) a probabilistic network to verify whether extracted features are good enough to build a 3-D face model; 3) a least-square approach to adapt a generic 3-D model to features extracted from the input video; and 4) a talking head system that generates FAPs and FDPs without any user intervention for MPEG-4 facial animation systems. Based on an ellipse model controlled by three anchor points, an accurate and computationally cheap method for face shape extraction was developed. A least-square approach was used to calculate the coefficient vector required to adapt a generic model to fit an input face. Probability networks were successfully combined with our automatic system to maximally use facial feature evidence in deciding whether extracted facial features are suitable for creating a 3-D face model.

Creating a 3-D face model with no user intervention is a very difficult task. In this paper, an automatic scheme to build a 3-D face model from a video sequence is presented. Although we assume that the user has a neutral face and is looking at the input camera, we believe this is a basic requirement for building a 3-D face model in an automatic fashion. The created 3-D model is allowed to rotate less than 10 degrees along the x and y directions, because the z coordinates of the vertices of the 3-D model are not calculated from the input features. The proposed speech-driven talking head system, generating FDPs and FAPs for MPEG-4 talking head applications, is suitable for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation. For future research, a more accurate mouth and eye


extraction scheme can be considered, to improve the quality of the created 3-D model and to handle nonneutral faces and faces with mustaches. The current approach, based on a simple parametric curve, has limitations on the shapes of the mouth and eyes. In addition, to build a complete 3-D face model, extracting the hair from the head and modeling its style should be considered in future research.

    ACKNOWLEDGMENT

The authors wish to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] W.-S. Lee, M. Escher, G. Sannier, and N. Magnenat-Thalmann, "MPEG-4 compatible faces from orthogonal photos," in Proc. Int. Conf. Computer Animation, 1999, pp. 186-194.

[2] P. Fua and C. Miccio, "Animated heads from ordinary images: A least-squares approach," Comput. Vis. Image Understand., vol. 75, no. 3, pp. 247-259, 1999.

[3] F. Pighin, R. Szeliski, and D. H. Salesin, "Resynthesizing facial animation through 3-D model-based tracking," in Proc. 7th IEEE Int. Conf. Computer Vision, vol. 1, 1999, pp. 143-150.

[4] Z. Liu, Z. Zhang, C. Jacobs, and M. Cohen, "Rapid modeling of animated faces from video," Tech. Rep. MSR-TR-2000-11.

[5] A. C. A. del Valle and J. Ostermann, "3-D talking head customization by adapting a generic model to one uncalibrated picture," in Proc. IEEE Int. Symp. Circuits and Systems, 2001, pp. 325-328.

[6] C. J. Kuo, R.-S. Huang, and T.-G. Lin, "3-D facial model estimation from single front-view facial image," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 3, pp. 183-192, Mar. 2002.

[7] L. Moccozet and N. Magnenat-Thalmann, "Dirichlet free-form deformations and their application to hand simulation," in Proc. Computer Animation '97, 1997, pp. 93-102.

[8] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3-D faces," in Computer Graphics, Annu. Conf. Series, SIGGRAPH 1999, pp. 187-194.

[9] E. Cosatto and H. P. Graf, "Photo-realistic talking-heads from image samples," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 152-163, Jun. 2000.

[10] I.-C. Lin, C.-S. Hung, T.-J. Yang, and M. Ouhyoung, "A speech driven talking head system based on a single face image," in Proc. 7th Pacific Conf. Computer Graphics and Applications, 1999, pp. 43-49.

[11] Ananova. [Online]. Available: http://www.ananova.com/

[12] R.-S. Wang and Y. Wang, "Facial feature extraction and tracking in video sequences," in Proc. IEEE Int. Workshop on Multimedia Signal Processing, 1997, pp. 233-238.

[13] D. Reisfeld and Y. Yeshurun, "Robust detection of facial features by generalized symmetry," in Proc. 11th IAPR Int. Conf. Pattern Recognition, 1992, pp. 117-120.

[14] M. Zobel, A. Gebhard, D. Paulus, J. Denzler, and H. Niemann, "Robust facial feature localization by coupled features," in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition, 2000, pp. 2-7.

[15] Y. Tian, T. Kanade, and J. Cohn, "Robust lip tracking by combining shape, color and motion," in Proc. 4th Asian Conf. Computer Vision, 2000.

[16] J. Luettin, N. A. Thacker, and S. W. Beet, "Active shape models for visual speech feature extraction," Electronic Systems Group, University of Sheffield, Sheffield, U.K., Rep. 95/44, 1995.

[17] C. Kim and J.-N. Hwang, "An integrated scheme for object-based video abstraction," in Proc. ACM Int. Multimedia Conf., 2000.

[18] L. G. Farkas, Anthropometry of the Head and Face. New York: Raven, 1994.

[19] K. C. Yow and R. Cipolla, "A probabilistic framework for perceptual grouping of features for human face detection," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition '96, 1996, pp. 16-21.

[20] H. Tao, R. Lopez, and T. Huang, "Tracking facial features using probabilistic network," in Proc. Automatic Face and Gesture Recognition, 1998, pp. 166-170.

[21] ISO/IEC FDIS 14496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2501, Nov. 1998.

[22] ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.

[23] Psychological Image Collection at Stirling (PICS). [Online]. Available: http://pics.psych.stir.ac.uk/

[24] J. Luettin, N. A. Thacker, and S. W. Beet, "Active shape models for visual speech feature extraction," Electronic Systems Group, University of Sheffield, Sheffield, U.K., Rep. 95/44, 1995.

[25] K. H. Choi and J.-N. Hwang, "Creating 3-D speech-driven talking heads: A probabilistic approach," in Proc. IEEE Int. Conf. Image Processing, 2002, pp. 984-987.

[26] F. Lavagetto, "Converting speech into lip movement: A multimedia telephone for hard of hearing people," IEEE Trans. Rehabil. Eng., vol. 3, no. 1, pp. 90-102, Jan. 1995.

[27] R. R. Rao, T. Chen, and R. M. Mersereau, "Audio-to-visual conversion for multimedia communication," IEEE Trans. Ind. Electron., vol. 45, no. 1, pp. 15-22, Feb. 1998.

[28] F. I. Parke and K. Waters, Computer Facial Animation. Wellesley, MA: A. K. Peters, 1996.

[29] Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, ITU-T Recommendation G.723.1, Mar. 1996.

[30] K. H. Choi and J.-N. Hwang, "A real-time system for automatic creation of 3-D face models from a video sequence," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2002, pp. 2121-2124.

Kyoung-Ho Choi (M'03) received the B.S. and M.S. degrees in electrical and electronics engineering from Inha University, Korea, in 1989 and 1991, respectively, and the Ph.D. degree in electrical engineering from the University of Washington, Seattle, in 2002. In January 1991, he joined the Electronics and Telecommunications Research Institute (ETRI), where he was a Leader of the Telematics Content Research Team. He was also a Visiting Scholar at Cornell University, Ithaca, NY, in 1995. In March 2005, he joined the Department of Information and Electronic Engineering, Mokpo National University, Chonnam, Korea. His research interests include telematics, multimedia signal processing and systems, mobile computing, MPEG-4/7/21, multimedia-GIS, audio-to-visual conversion, and audiovisual interaction.

Dr. Choi was selected as an Outstanding Researcher at ETRI in 1992.

Jenq-Neng Hwang (F'03) received the B.S. and M.S. degrees, both in electrical engineering, from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1981 and 1983, respectively, and the Ph.D. degree from the University of Southern California in December 1988. He spent 1983 to 1985 in obligatory military service. He was then a Research Assistant in the Signal and Image Processing Institute, Department of Electrical Engineering, University of Southern California. He was also a Visiting Student at Princeton University, Princeton, NJ, from 1987 to 1989. In the summer of 1989, he joined the Department of Electrical Engineering, University of Washington, Seattle, where he is currently a Professor. He has published more than 180 journal papers, conference papers, and book chapters in the areas of image/video signal processing, computational neural networks, multimedia system integration, and networking. He is the co-author of the Handbook of Neural Networks for Signal Processing (Boca Raton, FL: CRC Press, 2001).

Dr. Hwang served as the Secretary of the Neural Systems and Applications Committee of the IEEE Circuits and Systems Society from 1989 to 1991 and was a member of the Design and Implementation of the SP Systems Technical Committee of the IEEE SP Society. He is also a Founding Member of the Multimedia SP Technical Committee of the IEEE SP Society. He served as the Chairman of the Neural Networks SP Technical Committee of the IEEE SP Society from 1996 to 1998 and as the Society's representative to the IEEE Neural Network Council from 1997 to 2000. He served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING and the IEEE TRANSACTIONS ON NEURAL NETWORKS, and is currently an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. He is also on the editorial board of the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology. He was a Guest Editor for the IEEE TRANSACTIONS ON MULTIMEDIA Special Issue on Multimedia over IP in March/June 2001, the Conference Program Chair of the 1994 IEEE Workshop on Neural Networks for Signal Processing held in Ermioni, Greece, in September 1994, the General Co-Chair of the International Symposium on Artificial Neural Networks held in Hsinchu, Taiwan, R.O.C., in December 1995, the Chair of the Tutorial Committee for the IEEE International Conference on Neural Networks (ICNN'96) held in Washington, DC, in June 1996, and the Program Co-Chair of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) held in Seattle, WA, in 1998. He received the 1995 IEEE Signal Processing (SP) Society's Annual Best Paper Award (with S.-R. Lay and A. Lippman) in the area of Neural Networks for Signal Processing.
