
Multimodal Identification using Markov Logic Networks

Wallace Lawson, Eric Martinson
Naval Research Laboratory
Center for Applied Research in Artificial Intelligence
Washington, DC 20375
{ed.lawson, eric.martinson.ctr}@nrl.navy.mil

Abstract—Human-robot interaction presents a unique set of challenges for biometric person identification. During normal interactions between the robot and a user, a tremendous amount of information is available for identification. Our objective is to use this information to identify users quickly and accurately during interactions with a robot. We present our approach for multimodal person identification using Markov logic networks (MLN). We use appearance, clothing, speaker recognition, and face recognition to identify a person during an interaction where they are speaking to the robot. We demonstrate the effectiveness of our approach using sequences of individuals speaking freely on a topic of their choosing.

I. INTRODUCTION

An autonomous robot is expected to interact with humans in a natural manner, making identification extremely important. The robot must be able to recognize a set of individuals (known users) while rejecting everybody else as unknown. A known user will prompt one response, while an unknown user will prompt a different response, depending on the scenario. During a normal interaction between the robot and a user, a great deal of biometric information may be available for identification. Among this information, face and speech have a low false reject rate (FRR) and false accept rate (FAR) [11], [12]. However, face and speech represent only a fraction of the available identifying information. A number of other indicators, including complexion, clothing, location, and time of day, can provide invaluable clues for identification. Through repeated interactions with individuals, we can model this type of information and learn to associate it with identities.

The challenge in integrating multiple indicators is that some indicators may be absent. We must have a framework that is robust enough to process every piece of information when it is available, yet flexible enough to still operate when certain variables are missing. Likewise, we must be able to attach an uncertainty to each available piece of information. For example, face recognition is generally more indicative of identity than clothing recognition, unless the clothing is particularly distinctive, in which case the clothing might carry less uncertainty.

Identification must be performed quickly and accurately, which means that we must achieve a high level of recall, but this cannot come at the expense of decreased precision. Recall indicates the percentage of time during which a user was identified, compared with the time the user was present. Precision indicates the identification accuracy. By necessity, recall includes typical biometric problems such as failure to acquire or failure to enroll. These problems affect the rate at which an individual will be recognized during their interaction with the robot, and can be treated as missing information.

In this paper, we discuss the use of face, speaker, clothing, and complexion for person identification in human-robot interaction (HRI). We learn the uncertainties for each indicator using Markov Logic Networks (MLNs) and use the MLNs to perform MAP identification of an individual. We demonstrate our results using data collected in a laboratory environment. Our robotics platform is the Mobile-Dextrous-Social (MDS) robot called Octavia [9] (see Figure 1). Perceptual inputs include two color video cameras, an SR3000 camera that provides depth information, and a 4-element microphone array.

Fig. 1. An MDS robot with a 4-element microphone array mounted on the body, cameras in the eyes, and an SR3000 in the forehead.

We begin with a brief review of related work in Section II. We present each modality and our high-level logical integration in Section III. We discuss data collection and experimental results in Section IV. We provide concluding remarks in Section V.


II. RELATED WORK

Face recognition is a widely studied biometric that has attracted the attention of many researchers. Open-world face recognition has attracted relatively fewer researchers than the general problem of face recognition. Li and Wechsler [6] explored the use of transductive confidence machines (TCM) to evaluate a face in terms of its closeness to neighboring points. Under their approach, they evaluate the "strangeness" of a classification decision by looking at the k closest positive and negative data points. Recent work by Wright et al. [20] explored the use of sparse coding for face recognition: features are extracted from a set of training images using l1 minimization, and the quality of a match is evaluated from the reconstruction error.

Similar to face recognition, speaker recognition is the ability to identify a speaker from their voice. In robotics, speaker recognition has most commonly been tied to speaker position. By separating out different speech streams and localizing the speaker as they move, the speaker is identified for purposes of interaction [19]. This focus on position is most common because robots are noisy platforms. Motors, fans, and other sources of ego-noise interfere with recorded speech, lowering precision when using traditional approaches such as Gaussian mixture models [13]. One solution to this problem is to account for variable speaker positions and signal-to-noise ratios by intelligently selecting microphone channels from a disparate array [2]. Alternatively, we can incorporate other perceptual modalities into the identification algorithm to create a multi-modal approach.

The combined problem of face recognition and speaker identification is one possible technique for improving recognition rates. Palanivel and Yegnanarayana [10] use decision-level fusion to integrate speech, face recognition, and visual speech using a weighted confidence measure. Their work uses a database created from news anchor footage to demonstrate an increase in overall recognition by combining modalities. While the authors demonstrate excellent results, news anchor footage does not necessarily reflect the type of data seen when interacting with users on a robotics platform: the anchor wears a microphone, the face is well lit, and the anchor has been trained to show very little change in facial expression.

Salah et al. [15] integrated tracking, face recognition, and speaker recognition, this time for simultaneous identification of multiple people in a "smart" environment. They overcame poor visual angles and SNR problems by continuously tracking users from the moment they entered the room. Unlike a robot, which often has a limited field of view, their environment had microphones and cameras covering all regions, so users were rarely lost, and recognition scores could be combined over lengthy periods of time using a particle filter.

Nandakumar et al. [8] used a Bayesian approach to multimodal score-level fusion, which facilitates recognition of individuals in cases where information is missing. While their goals are similar to ours, we instead use an adaptive approach to decision-level fusion that allows each modality to be weighted differently. As we will demonstrate, this weighting changes with user, distance, and modality.

Contextual information has recently been explored as a method to improve the accuracy of face recognition. Stone et al. [18] used conditional random fields to improve face recognition accuracy using contextual information such as who an individual is "friends" with on a social network, who they are typically photographed with, and so on. Singla et al. [16] extended this concept by learning relationships between individuals (friend, child, parent, spouse, etc.) using a set of rules embedded into a Markov Logic Network.

III. METHODOLOGY

To identify a person interacting with the robot, we recognize face, speech, complexion, and appearance. We also measure distance in order to examine the effect of distance on the various modalities. In this section, we describe each modality in turn, then discuss our method of data fusion using MLNs.

A. Face Recognition

We use the face recognition approach developed by Kamgar-Parsi et al. [4]. In this section, we present a brief description of this approach while referring the interested reader to the referenced publication for further details.

Face recognition on a robotics platform must be capable of identifying a small set of individuals, i.e., those that the robot "knows," while rejecting all others as unfamiliar, much like humans do. Unlike closed-world approaches to face recognition, we cannot assume that each person we see will be somebody the robot knows, so rejection of unknown individuals becomes extremely important.

This approach recognizes individuals by identifying and enclosing the region R_T in the human face space that belongs to the target person T. If a face is projected inside R_T, it is identified as the target person; otherwise it is rejected as not being T. This is accomplished using a set of morphing operations that build a large, dense set of positive and negative borderline exemplars.

During the training phase, these morphing operators generate a large number of borderline exemplars, which are used to train a single dedicated classifier for each individual. During the test phase, these dedicated classifiers are used to identify individuals.

While some time is needed to acquire, register, and normalize faces, this approach can identify individuals very rapidly. It has also been shown to perform extremely well on the task of open-world identification.

B. Speaker Recognition

Speaker recognition on Octavia is based on Gaussian mixture models (GMMs). We use a speaker-verification-based approach [13]: a model created for an individual is compared to an impostor model created from all other speakers in the speaker set. To build a speaker model, recorded samples already associated with the correct speaker are processed to extract the first 10 mel-frequency cepstral coefficients (MFCCs) for each 10-ms frame. This produces a set of MFCC vectors. As the first MFCC correlates most strongly with the energy of the audio segment, it is applied as a threshold for separating speech from ambient noise. This speech detection method assumes that all loud, wide-spectrum sounds heard by Octavia are actually speech, but is otherwise effective for removing quiet frames. A GMM with 50 components is then created from all remaining speech vectors using k-means. The resulting model consists of a centroid μ_i, a covariance matrix σ_i, and a prior probability p_i for each component i. Except to identify the presence of speech, the first two MFCCs are not used in model creation, leaving only 8 dimensions (R = 8).

Given a speaker/impostor pair (S_k, I_k), which we will henceforth call a speaker verifier, the odds of an audio segment belonging to speaker k are calculated as follows. First, MFCC vectors are extracted for the entire segment and quiet frames are removed. Next, for each vector, the probability of belonging to either model M is determined:

p(\bar{x}, M) = \sum_{i=1}^{50} \frac{p_i}{\sqrt{(2\pi)^R |\sigma_i|}} \exp\left( -\tfrac{1}{2} (\bar{x} - \bar{\mu}_i)^T \sigma_i^{-1} (\bar{x} - \bar{\mu}_i) \right) \qquad (1)

To avoid zeros in subsequent calculations, it is assumed that each vector must belong to either the speaker or the impostor, and that each is equally likely. Therefore, given an audio stream segment α_t, which contains a set of speech frames m, the odds of a speaker being present at time t are:

O_k(t) = O_k(t-1) + \sum_{m \in \alpha_t} \log\left( \frac{p(\bar{x}_m \mid S_k)}{p(\bar{x}_m \mid I_k)} \right) \qquad (2)

To allow for multi-person interactions with the robot, where other speakers may be present in the environment, some additional modifications are made to the traditional speaker recognition approach. First, O_k is restricted to the range [-40, 40] to prevent extremely large values. Second, a decay function is added to bring all speaker likelihoods back to neutral (i.e., 50%, or O_k = 0) when no one is talking: whenever an audio segment contains no speech, O_k is reduced by 0.2 * O_k across all speakers. Finally, for use with the Markov Logic Networks, the final O_k value is thresholded at 20 to indicate the presence of speaker k.
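
To make the verifier concrete, the following is a minimal sketch, in Python with NumPy, of the density of Eq. (1) and the clipped odds update of Eq. (2), together with the decay and threshold described above. The class and function names are our assumptions for illustration, not the authors' implementation.

import numpy as np

class GMM:
    """Speaker or impostor model: means (50, R), covs (50, R, R), priors (50,)."""
    def __init__(self, means, covs, priors):
        self.means, self.covs, self.priors = means, covs, priors

    def density(self, x):
        # Eq. (1): weighted sum of Gaussian component densities at MFCC vector x.
        R = x.shape[0]
        total = 0.0
        for mu, cov, p in zip(self.means, self.covs, self.priors):
            d = x - mu
            norm = np.sqrt((2.0 * np.pi) ** R * np.linalg.det(cov))
            total += p * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm
        return total

def update_odds(o_prev, frames, speaker_gmm, impostor_gmm):
    """Eq. (2): accumulate log-odds over the speech frames of one segment,
    then clip the running value to [-40, 40] as described above."""
    o = o_prev
    for x in frames:
        o += np.log(speaker_gmm.density(x) / impostor_gmm.density(x))
    return float(np.clip(o, -40.0, 40.0))

def decay_odds(o):
    """Segment with no speech: O_k is reduced by 0.2 * O_k (decays toward 0)."""
    return o - 0.2 * o

SPEAKER_THRESHOLD = 20.0   # O_k above this indicates the presence of speaker k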

C. Appearance

We model the appearance of individuals whose face is detected using the Viola-Jones face detector provided in OpenCV. The appearance of the face is computed using the region from approximately the forehead to the upper lip, and from the left eye to the right eye (see Figure 2). This region provides some tolerance to different poses while including as little of the background as possible. For clothing, we use the region below the detected face, approximately covering the region of the individual's shirt.

Fig. 2. Windows used for color histograms of shirt and face.

Appearance can be modeled using either a Gaussian mixture model or a color histogram. For this work, we have selected color histograms, as they have been demonstrated to be an effective means of modeling both clothing [7] and skin color [3]. A GMM may be more appropriate in cases where there is a need to model appearance and segment it simultaneously [17]. The color histogram H is a measure of the frequency of each color within the given region. We quantize each channel (r, g, b) to its 4 most significant bits and concatenate the resulting quantized values, producing a histogram of 4096 bins.

We compare histograms using the χ² similarity measure:

\chi^2(H_a, H_b) = 2 \sum_i \frac{(H_a(i) - H_b(i))^2}{H_a(i) + H_b(i)} \qquad (3)
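
As an illustration, a minimal sketch of the 4-bit-per-channel histogram and the χ² comparison of Eq. (3) might look as follows; the function names are ours, and extraction of the face and shirt regions is assumed to have already happened.

import numpy as np

def color_histogram(region):
    """region: (H, W, 3) uint8 RGB patch -> normalized 4096-bin histogram,
    keeping the 4 most significant bits of each channel."""
    q = region.astype(np.uint16) >> 4          # quantize each channel to 0..15
    idx = (q[..., 0] << 8) | (q[..., 1] << 4) | q[..., 2]   # concatenate bits
    hist = np.bincount(idx.ravel(), minlength=4096).astype(float)
    return hist / hist.sum()

def chi_square(ha, hb, eps=1e-12):
    """Eq. (3); lower values indicate more similar histograms."""
    return 2.0 * np.sum((ha - hb) ** 2 / (ha + hb + eps))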

1) Training: When a person is recognized using their face, the system makes note of their appearance using a color histogram. Since appearance may vary with lighting and even pose, we maintain multiple clusters of color histograms for each participant. We maintain these clusters using streaming clustering, which is ideal for situations such as this, where we wish to cluster a large amount of data but it is infeasible to make multiple passes over the data.

When an individual has been recognized using face recognition, we compute the χ² measure against each existing cluster for that individual. If an existing cluster matches this appearance within a threshold of θ_h, we add support to that cluster. If no cluster matches at the specified tolerance, we create a new cluster. We periodically prune clusters that have not gained a sufficient amount of support within a period of time. This provides tolerance to occasional false positives, since they will likely not receive enough support over a period of time.
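
A minimal sketch of this streaming update, with assumed values for the match threshold θ_h and the pruning support (the paper does not report either):

import numpy as np

def _chi2(ha, hb, eps=1e-12):
    # Eq. (3), repeated here so the sketch is self-contained.
    return 2.0 * np.sum((ha - hb) ** 2 / (ha + hb + eps))

class AppearanceClusters:
    """Streaming clusters of color histograms for one enrolled person."""

    def __init__(self, theta_h=0.25, min_support=5):
        self.clusters = []            # list of [centroid, support]
        self.theta_h = theta_h        # match tolerance (assumed value)
        self.min_support = min_support

    def update(self, hist):
        """Fold one face-verified appearance histogram into the model."""
        for c in self.clusters:
            if _chi2(c[0], hist) < self.theta_h:
                c[0] = (c[0] * c[1] + hist) / (c[1] + 1)   # nudge centroid
                c[1] += 1                                   # add support
                return
        self.clusters.append([hist, 1])   # no match: start a new cluster

    def prune(self):
        """Periodically drop clusters that never gained enough support."""
        self.clusters = [c for c in self.clusters if c[1] >= self.min_support]

    def score(self, hist):
        """Lowest chi-square against this person's clusters (used at test time)."""
        return min((_chi2(c[0], hist) for c in self.clusters), default=float("inf"))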

2) Testing: Given an individual of unknown identity, we compare the color histogram against the existing clusters of all individuals. We return the identity of the individual whose cluster has the lowest χ² value.

D. Distance

The distance between the user and the robot can be measured using the SR3000 infrared time-of-flight camera. In this case, the face is detected in the visible spectrum; the visible-spectrum cameras and the SR3000 are aligned to center the images. Within the region of the face, we measure the distance, returning the median distance in order to provide some tolerance to noise. When this data was collected, position was measured by hand; in an automated system, we would likely use the SR3000 to acquire distance.
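
A minimal sketch of the median-depth measurement, assuming a depth image aligned to the color image and a face box from the Viola-Jones detector:

import numpy as np

def face_distance(depth_m, box):
    """depth_m: depth image in meters, aligned to the color image.
    box: (x, y, w, h) face rectangle from the Viola-Jones detector."""
    x, y, w, h = box
    patch = depth_m[y:y + h, x:x + w]
    valid = patch[np.isfinite(patch) & (patch > 0)]   # drop invalid readings
    return float(np.median(valid))                    # median tolerates noise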

E. Markov Logic Networks

Logical representations in AI are extremely adept at handling complexity and making inferences about the real world. The difficulty with logical representations is that in some cases information cannot be stated without uncertainty. The power of Markov Logic Networks [1] lies in their ability to fuse a logical representation with statistical uncertainty.

We define relations between objects using first-order logic, and use Markov networks to assign a weight to each relation in terms of its relative importance. In strict logical inference, if one formula is violated, the world has zero probability of occurring. Weighting eases this restriction: if a relationship is violated, the world is still possible, just less likely.

Formally, a Markov Logic Network (MLN) is a set L of formulas F_i and associated real-valued weights w_i, paired as L = (F_i, w_i). Combining L with a set of constants C gives an MLN M_{L,C}.

The probability of a set of ground predicates X is defined as

P(X = x) = \frac{1}{Z} \exp\left( \sum_{i=1}^{m} w_i \, n_i(x) \right) \qquad (4)

where n_i(x) is the number of true groundings of formula F_i under the assignment x, and Z is a normalizing constant that ensures the values sum to 1.
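
As a toy numeric illustration of Eq. (4) (ours, not from the paper), consider two formulas with fixed weights and three candidate worlds summarized by their grounding counts; MAP inference simply picks the highest-probability world:

import numpy as np

w = np.array([1.5, 0.5])                 # weights w_i for two formulas
worlds = {                               # candidate worlds -> counts n_i(x)
    "x1": np.array([2, 1]),
    "x2": np.array([1, 1]),
    "x3": np.array([0, 2]),
}
scores = {x: np.exp(w @ n) for x, n in worlds.items()}   # exp(sum_i w_i n_i(x))
Z = sum(scores.values())                                  # normalizing constant
probs = {x: s / Z for x, s in scores.items()}
map_world = max(probs, key=probs.get)    # MAP inference: most probable world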

Our logical representation includes predicates:

• FaceLike(person, time) - face identification
• LooksLike(person, time) - complexion comparison
• DressedLike(person, time) - clothing comparison
• SoundsLike(person, time) - speaker identification
• Distance(time, range) - distance to the speaker (quantized into 1 m bins)
• Id(identity, time) - true identity

Person indicates the identity of the participant; time is a unique identifier of each time instant, which may not correlate precisely with chronological time; range is quantized to 1 or 2 meters. All values, with the exception of the true identity, are known at the time of inference, although some data may be omitted because it is not present. The true identity is the only query predicate.

We define the following rules:

• SoundsLike(p, t) ∧ Distance(t, r) → Id(i, t)
• LooksLike(p, t) ∧ Distance(t, r) → Id(i, t)
• DressedLike(p, t) ∧ Distance(t, r) → Id(i, t)
• FaceLike(p, t) ∧ Distance(t, r) → Id(i, t)
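
For concreteness, here is a hedged sketch of how these predicates and rules might be written for the Alchemy system [5] used in this work; the declarations are approximate, and the weights shown are made-up placeholders standing in for values learned from training data.

// Predicate declarations (sketch; types as in the list above)
FaceLike(person, time)
LooksLike(person, time)
DressedLike(person, time)
SoundsLike(person, time)
Distance(time, range)
Id(person, time)

// One weighted clause per modality; Alchemy learns the weights from data.
1.9  FaceLike(p, t) ^ Distance(t, r) => Id(p, t)
0.9  LooksLike(p, t) ^ Distance(t, r) => Id(p, t)
0.7  DressedLike(p, t) ^ Distance(t, r) => Id(p, t)
1.2  SoundsLike(p, t) ^ Distance(t, r) => Id(p, t)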

Maximum a posteriori (MAP) inference over the MLN finds the most probable state given the evidence. In the case of person identification, the evidence is provided by the sensor modalities, and the MLN uses all of the available information to return the most likely state of the world, i.e., the person most likely to be interacting with the robot. All learning and inference procedures are packaged together in the Alchemy package [5], which we use in this paper.

IV. EXPERIMENTAL RESULTS

We demonstrate results on the problem of person identification using internally collected data sequences. In the training sequences, individuals speak freely for four different 20-30 second sessions on a topic of their choosing. In the validation and test sets, participants speak for two 20-30 second sessions. We collect approximately 3 frames per second, and we systematically vary the lighting, angle, and distance between the robot and the user.

During normal conversations, participants varied pose and facial expression. Participants were asked to make an effort to periodically look at the robot, but often moved their eyes and head. To aid the perception of interacting with the robot, a speaker was placed on the robot and the robot periodically "spoke" to the user.

In the first experiment, we use MAP identification to evaluate precision and recall, using fixed thresholds for face, speaker, and appearance. Recall is defined as the percentage of images in which the user was detected; precision represents the accuracy of the system. This dataset represents a set of known users that the robot will interact with on a regular basis.

We evaluated person identification using MAP inference with MLNs trained on a variety of modalities. Table I shows the results of this experiment. As anticipated from previously reported results, the precision rate is good for face recognition (94.1%), but the recall rate is 47.1%. In many cases individuals were not looking at the robot, or their eyes were closed, and face recognition was unable to register these images.

Speaker recognition has a recall of just under 20% with a precision of 79.6%. This reflects the difficulty of recognizing people with ambient noise in the environment and with microphones up to 2 meters away from the participant. Higher precision can be achieved in this modality by changing the threshold or by reducing the speaker set size.

Complexion and appearance by themselves produce remarkably good recall (68.2%) and precision (92.2%). The high recall is to be expected, since appearance is computed from any detected face. In some cases, however, individuals were too close to the camera to capture any meaningful clothing information.

By combining face with speech, we increase the recall to 54% with a precision of 92.4%.


Fig. 3. Examples of challenging, yet correctly classified instances.

Modalities                              Recall    Precision
Face                                    47.1%     94.1%
Speech                                  19.5%     79.6%
Face, Speech                            54%       92.35%
Complexion, Appearance                  68.2%     92.2%
Face, Complexion, Appearance            72%       95.3%
Face, Complexion, Appearance, Speech    74%       97%

TABLE I. FUSION RESULTS FOR MAP INFERENCE ON DIFFERENT MODALITIES

Face, complexion, and appearance produce a recall of 72% with a precision of 95.3%. Face, complexion, appearance, and speech produce 74% recall with 97% precision. What is somewhat remarkable about these results is that as we add modalities, not only does the recall increase, but the precision increases as well. This trend is predicted by Ross et al. [14].

Figure 3 shows a few correctly classified examples, illustrative of the variety in this data set. All three participants have their eyes closed, and none of them are looking at the camera. The participant on the left is looking upward, the middle participant is looking off to the right, and the participant on the right has tilted his head slightly. The combination of atypical poses and closed eyes makes these instances very challenging to recognize.

Next, we examine the impact of varying the threshold on precision and recall. As the threshold goes down, we expect recall to increase and precision to decrease: there will be a greater rate of identification, but at the cost of decreased precision. The results of this analysis are shown in Figure 4, which plots the trend for both face and speaker identification. In this experiment, we fixed the thresholds for all modalities except the one in question. Note that in the worst case the precision is 94%, and in the best case 97.4%; the worst case for recall is 71.3%, and the best case 77.2%. The Markov Logic Networks thus demonstrate a tolerance to varying thresholds in the underlying classifiers.

Finally, we consider the question of which modality is best for recognition. The MLNs learn weights for all possible combinations of reported identity and true identity. Because of this, in many cases the MLNs are capable of learning which individuals can be confused with one another and which individuals rarely are. The MLN can also learn a different weighting rule for the modalities depending on the participant. For example, if a person has a distinctive voice, the MLN might assign a much higher weight to voice than to the other modalities.

Fig. 4. Recall/precision curve showing the impact of varying thresholds.

The weighting of modalities for one participant is shown in Figure 5. At 1 meter, the most important modality is clothing, followed by appearance. This result may seem counterintuitive until one examines the data: the participant is wearing a black shirt with a bright green picture. This picture is distinctive among the participants but can only be clearly seen at a close distance; when the participant is farther away, the details of the picture become less clear.

At 2 meters, the most important modality is face, followed by appearance. This result is supported by psychologists, who have stressed the importance of appearance in face recognition. Speech is also important, and clothing is least important.

We also examined the weighting of each modality for a different participant. At 1 meter, the most significant modality is speaker identification, followed by face recognition. At 2 meters, the most significant modality is face recognition, while the remaining modalities are approximately comparable. This participant's voice appears to be quite distinctive, but its distinctiveness is only heard clearly when the participant is close to the robot; when farther away, the face becomes more reliable.


Fig. 5. The importance of modalities, varying by distance.

V. CONCLUSION AND FUTURE WORK

To make human-robot interaction natural, robots must be capable of quickly and accurately identifying individuals through the biometric information available during typical interactions. This necessitates a high level of recall without any sacrifice in precision. We demonstrated the fusion of face, speaker, clothing, and complexion for identification, and showed a dramatic improvement in recall rates by fusing these modalities using Markov Logic Networks. We have also shown the robustness of MLNs to missing information and to modalities with varying levels of precision.

MLNs provide an interesting framework both for biometrics and for understanding human actions and activities in general. This work could be extended in the future to incorporate other aspects, such as learning the habits and preferences of users and then identifying individuals based on their actions. We did not explore the role of contextual information on prior probability; we anticipate that contextual information such as where and when an interaction occurs can add a strong prior for identifying individuals.

In this work, we ran all modalities in parallel. However, this is not necessary: we could extend this work to a cascading classifier. In a laboratory environment, computational power may not be an issue; on a mobile robot, however, computational power may be limited, and selecting the one or more modalities that provide the best combination of identification power and computational efficiency becomes an important task. For example, if a subject is 10 meters away, we might select gait recognition to identify the subject; if the subject is closer, we might select face recognition. Intermediate distances might call for a combination of both modalities.

ACKNOWLEDGMENT

This work was partially supported by the Office of Naval Research under job order numbers N0001407WX20452 and N0001408WX30007, awarded to Greg Trafton.

REFERENCES

[1] P. Domingos and M. Richardson, "Markov Logic: A Unifying Framework for Statistical Relational Learning," in L. Getoor and B. Taskar (eds.), Introduction to Statistical Relational Learning, pp. 339-371. Cambridge, MA: MIT Press, 2007.
[2] M. Ji, S. Kim, and H. Kim, "Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments," IEEE Trans. on Consumer Electronics, vol. 54, no. 1, pp. 140-144, 2008.
[3] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection," Technical Report CRL 98/11, Compaq Cambridge Research Laboratory, 1998.
[4] B. Kamgar-Parsi, W. Lawson, and B. Kamgar-Parsi, "Towards Development of a Face Recognition System for Watch-List Surveillance," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), in press.
[5] S. Kok, P. Singla, M. Richardson, and P. Domingos, "The Alchemy system for statistical relational AI," Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2005.
[6] F. Li and H. Wechsler, "Open Set Face Recognition using Transduction," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 27, no. 11, pp. 1686-1697, 2005.
[7] S. J. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, and H. Wechsler, "Tracking Groups of People," Computer Vision and Image Understanding (CVIU), vol. 80, no. 1, pp. 42-56, October 2000.
[8] K. Nandakumar, A. K. Jain, and A. Ross, "Fusion in Multibiometric Identification Systems: What about the Missing Data?," in Proc. International Conference on Biometrics (ICB), 2009.
[9] http://robotic.media.mit.edu/projects/robots/mds/overview/overview.html
[10] S. Palanivel and B. Yegnanarayana, "Multimodal Person Authentication using Speech, Face, and Visual Speech," Computer Vision and Image Understanding (CVIU), pp. 44-55, 2008.
[11] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and J. M. Bone, "FRVT 2002: Overview and Summary," available at http://www.frvt.org/FVRT2002, 2003.
[12] M. Przybocki and A. Martin, "NIST Speaker Recognition Evaluation Chronicles," in Odyssey: The Speaker and Language Recognition Workshop, pp. 12-22, 2004.
[13] T. Quatieri, Discrete-Time Speech Signal Processing. Pearson Education Inc., Delhi, India, 2002.
[14] A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics. Springer, 2006.
[15] A. Salah et al., "Multimodal identification and localization of users in a smart environment," Journal of Multimodal User Interfaces, vol. 2, pp. 75-91, 2008.
[16] P. Singla, H. Kautz, A. Gallagher, and J. Luo, "Discovery of social relationships in consumer photo collections using Markov Logic," Computer Vision and Pattern Recognition Workshops, 2008.
[17] J. Sivic, C. L. Zitnick, and R. Szeliski, "Finding people in repeated shots of the same scene," in Proc. of the 16th British Machine Vision Conference (BMVC), pp. 909-918, 2006.
[18] P. Stone, T. Zickler, and T. Darrell, "Autotagging Facebook: Social network context improves photo annotation," Computer Vision and Pattern Recognition Workshops, 2008.
[19] J.-M. Valin, S. Yamamoto, J. Rouat, F. Michaud, K. Nakadai, and H. G. Okuno, "Robust Recognition of Simultaneous Speech by a Mobile Robot," IEEE Transactions on Robotics, vol. 23, no. 4, pp. 742-752, Aug. 2007.
[20] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust Face Recognition via Sparse Representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb. 2009.
