[ieee gesture recognition (fg 2011) - santa barbara, ca, usa (2011.03.21-2011.03.25)] face and...
Multimodal Identification using Markov Logic Networks
Wallace Lawson, Eric Martinson
Naval Research Laboratory
Center for Applied Research in Artificial Intelligence
Washington, DC 20375
{ed.lawson, eric.martinson.ctr}@nrl.navy.mil
Abstract—Human robot interaction presents a unique set of challenges for biometric person identification. During normal interactions between the robot and a user, a tremendous amount of information is available for identification. Our objective is to use this information to identify users quickly and accurately during interactions with a robot. We present our approach for multimodal person identification using Markov logic networks (MLN). We use appearance, clothing, speaker recognition, and face recognition to identify a person during an interaction where they are speaking to the robot. We demonstrate the effectiveness of our approach using sequences of individuals speaking freely on a topic of their choosing.
I. INTRODUCTION
An autonomous robot is expected to interact with humans in
a natural manner, making identification extremely important.
The robot must be able to recognize a set of individuals
(known users), while rejecting everybody else as unknown. A
known user will prompt one response, while an unknown user
will prompt a different response, depending on the scenario.
During a normal interaction between the robot and a user, a lot
of biometric information may be available for identification.
Among this information, face and speech have a low false
reject rate (FRR) and false accept rate (FAR) [11], [12].
However, face and speech represent only a fraction of the
available identifying information. A number of other indicators
including complexion, clothing, location, time of day, etc. can
provide invaluable clues for identification. Through repeated
interactions with individuals, we can model this type of
information and learn to associate it with identities.
The challenge in integrating multiple indicators is that some
indicators may be absent. We need a framework that
is robust enough to process every piece of
information that is available, yet flexible enough to still operate
when certain variables are missing. Likewise, we must be able
to attach an uncertainty to each available piece of information.
For example, face recognition is certainly more indicative of
identity than clothing recognition. That is, unless the clothing
is particularly distinctive in which case the clothing might have
less uncertainty.
Fig. 1. An MDS robot with a 4-element microphone array mounted on the body, cameras in the eyes, and an SR3000 in the forehead.
Identification must be performed quickly and accurately,
which means that we must have a high level of recall, but
this cannot come at the expense of decreased precision. Recall
indicates the percentage of time where a user was identified
when compared with the time where the user was present. Precision
indicates the identification accuracy. By necessity, recall
includes typical biometric problems such as failure to acquire
or failure to enroll. These problems affect the rate at which an
individual will be recognized during their interaction with the
robot, and can be treated as missing information. In this paper,
we discuss the use of face, speaker, clothing and complexion
for person identification in human-robot interaction (HRI).
We learn the uncertainties for each indicator using Markov
Logic Networks (MLN). We use the MLNs to perform MAP
identification of an individual. We demonstrate our results
using data collected in a laboratory environment. Our robotics
platform is the Mobile-Dextrous-Social (MDS) robot called
Octavia [9] (see figure 1). Perceptual inputs include two
color video cameras, an SR3000 camera to provide depth
information, and a 4-element microphone array.
We begin with a brief review of related work in section II.
We present each modality and high-level logical integration
in section III. We discuss data collection and experimental
results in section IV. We provide concluding remarks in
section V.
II. RELATED WORK
Face recognition is a widely studied biometric that has
attracted the attention of many researchers. Open-world face
recognition has attracted fewer researchers than the general
problem of face recognition. Li
and Wechsler [6] explored the use of transductive confidence
machines (TCM) to evaluate a face in terms of its closeness
to neighboring points. Under their approach, they evaluate the
“strangeness” of a classification decision by looking at the
k closest positive and negative data points. Recent work
by Wright et al. [20] explored the use of sparse coding for
face recognition. Features are extracted from a set of training
images using ℓ1 minimization. The quality of the match is
evaluated from reconstruction error.
Similar to face recognition, speaker recognition is the ability
to identify a speaker from their voice. In robotics, speaker
recognition has been most commonly tied to speaker position.
By separating out different speech streams and localizing the
speaker as they move, the speaker is identified for purposes
of interaction [19]. This focus on position is most common
because robots are noisy platforms. Motors, fans, and other
sources of ego-noise interfere with recorded speech, lowering
precision when using traditional approaches such as Gaussian
mixture models [13]. One solution to this problem is to
account for variable speaker positions and signal-to-noise
ratios by intelligently selecting microphone channels from a
disparate array [2]. Alternatively, we can incorporate other
perceptual mediums into the identification algorithm, to create
a multi-modal approach.
Combining face recognition with speaker
identification is one possible technique for improving recognition
rates. Palanivel and Yegnanarayana [10] use decision
level fusion to integrate speech, face recognition and visual
speech using a weighted confidence measure. This work uses
a database created from news anchor footage to demonstrate
an increase in overall recognition by combining modalities.
While the authors demonstrate excellent results, news anchor
footage does not necessarily reflect the type of data seen when
interacting with users on a robotics platform. Compared to the
robotics platform, the anchor is wearing a microphone, the face
is well lit, and the anchor has been trained to show very little
change in facial expression.
Salah et al. [15] integrated tracking, face recognition and
speaker recognition together, this time for simultaneous iden-
tification of multiple people in a “smart” environment. They
overcame poor visual angles and SNR problems by con-
tinuously tracking users from the moment they entered the
room. Unlike robots, which often have a limited field of view,
microphones and cameras covering all regions meant that users
were rarely lost, and recognition scores could be combined
over lengthy periods of time using a particle filter.
Nandakumar et al. [8] used a Bayesian approach to multi-
modal score-level fusion. This facilitates recognition of indi-
viduals in cases when information is missing. While their goals
are similar to ours, we instead are using an adaptive approach
to decision level fusion in order to allow each modality to be
weighted differently. As we will demonstrate, the learned
weighting changes with user, distance, and modality.
Contextual information has been recently explored as a
method to improve the accuracy of face recognition. Stone
et al. [18] used conditional random fields to improve face
recognition accuracy using contextual information such as
who an individual is “friends” with on a social network, who
they are typically photographed with, etc. Singla et al. [16]
extended this concept by learning relationships between indi-
viduals (friends, child, parent, spouse, etc.) using a set of rules
embedded into a Markov Logic Network.
III. METHODOLOGY
To identify a person interacting with the robot, we recognize
face, speech, complexion, and appearance. We also measure
distance in order to examine the effect of distance on the
various modalities. In this section, we describe each modality
in turn, then discuss our method of data fusion using MLNs.
A. Face Recognition
We use the face recognition approach developed by Kamgar-
Parsi et al. [4]. In this section, we present a brief description
of this approach while referring the interested reader to the
referenced publication for further details.
Face recognition on a robotics platform must be capable of
identifying a small set of individuals, i.e. those that the robot
“knows” while rejecting all others as unfamiliar, much like
humans. Unlike closed-world approaches to face recognition,
we cannot make the assumption that each person that we see
will be somebody the robot knows, so rejection of unknown
individuals becomes extremely important.
This approach recognizes individuals by identifying and
enclosing the region R_T in the human face space that belongs
to the target person T. If a face is projected inside R_T,
it is identified as the target person; otherwise it is rejected
as not being T. This is accomplished using a set of morphing
operations which build a large, dense set of positive and
negative borderline exemplars.
During the training phase, these morphing operators gener-
ate a large number of borderline exemplars which are used to
train a single dedicated classifier for each individual. During
the test phase, these dedicated classifiers are used to identify
individuals.
While there is some time involved to acquire, register, and
normalize faces, this approach provides the ability to very
rapidly identify individuals. It has also been shown to perform
extremely well on the task of open-world identification.
B. Speaker Recognition
Speaker recognition on Octavia is based on Gaussian mix-
ture models (GMM). We use a speaker verification based
approach [13]. A model created for an individual is compared
to an impostor model created from all other speakers in
the speaker set. To build a speaker model, recorded samples
already associated with the correct speaker are processed to ex-
tract the first 10 mel-frequency cepstral coefficients (MFCCs)
for each 10-msec frame. This produces a set of MFCC
vectors. As the first MFCC is correlated most strongly to the
energy of the audio segment, it is applied as a threshold for
separating speech from ambient noise. This speech detection
method assumes that all loud, wide spectrum sounds heard
by Octavia are actually speech, but is otherwise effective for
removing quiet frames. A GMM with 50 components is then
created from all remaining speech vectors using k-means. The
resulting model consists of a centroid μi, a covariance matrix
σi, and a prior probability pi for each component i. Except to
identify the presence of speech, the first two MFCCs are not
used in model creation, leaving only 8 dimensions (R = 8).
Given a speaker/impostor pair Sk, Ik, which will henceforth
be called a speaker verifier, the odds of an audio segment
belonging to speaker k are calculated as follows. First, MFCC
vectors xm are extracted for the entire segment and quiet
frames are removed. Next, for each vector, the probability of
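belonging to either model M is determined by
$$p(\bar{x}, M) = \sum_{i=1}^{50} \frac{p_i}{\sqrt{(2\pi)^R |\sigma_i|}} \exp\left(-\frac{1}{2}(\bar{x} - \bar{\mu}_i)^T \sigma_i^{-1} (\bar{x} - \bar{\mu}_i)\right) \qquad (1)$$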
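Equation (1) is a Gaussian mixture density. The following is a minimal sketch of its evaluation, using the standard Gaussian form with the factor of -1/2 in the exponent. It assumes diagonal covariances to stay within the Python standard library, which is a simplification not stated in the text, and the function names are illustrative only.

```python
import math

def gaussian_density(x, mean, var):
    """Gaussian density with diagonal covariance; var holds the diagonal entries."""
    r = len(x)                      # dimensionality R (8 in the text)
    det = math.prod(var)            # determinant of a diagonal covariance matrix
    quad = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** r * det)

def gmm_density(x, priors, means, variances):
    """Mixture density: prior-weighted sum of the component densities."""
    return sum(p * gaussian_density(x, m, v)
               for p, m, v in zip(priors, means, variances))
```

A speaker verifier would evaluate this function twice per frame, once with the speaker model and once with the impostor model.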
To avoid zeros in subsequent calculations, it is assumed that
each vector belongs to either the speaker or the impostor,
and that the two are equally likely. Therefore, given an audio
stream segment αt, which contains a set of speech frames m,
the odds of a speaker being present at time t are:
$$O_k(t) = O_k(t-1) + \sum_{m} \log\left(\frac{p(\bar{x}_m \mid S_k)}{p(\bar{x}_m \mid I_k)}\right), \quad \forall m \in \alpha_t \qquad (2)$$
To allow for multi-person interactions with the robot, where
other speakers may be present in the environment, some
additional modifications are made to the traditional speaker
recognition approach. First, Ok is restricted to the range
[-40, 40] to prevent extremely large values. Second, a decay
function is added to bring all speaker likelihoods back to
neutral (i.e. 50% or Ok = 0) when no one is talking. Whenever
an audio segment contains no speech, Ok is reduced by
0.2 ∗ Ok for every speaker. Finally, for use with Markov
Logic Networks, the final Ok value is thresholded at 20 to
indicate the presence of speaker k.
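The update rule for the speaker odds, together with the clamp, decay, and threshold described above, can be sketched as follows. This is an illustration rather than the authors' implementation. The function names are hypothetical, and treating "thresholded at 20" as a greater-than-or-equal test is an assumption.

```python
import math

O_MIN, O_MAX = -40.0, 40.0    # clamp range given in the text
DECAY = 0.2                   # fraction of O_k removed when a segment has no speech
PRESENT_THRESHOLD = 20.0      # value at which speaker k is reported to the MLN

def update_odds(prev_odds, frame_likelihoods):
    """Update O_k for one audio segment.

    frame_likelihoods holds one (p_speaker, p_impostor) pair per speech frame;
    an empty list means the segment contained no speech.
    """
    if not frame_likelihoods:
        # Decay toward neutral (O_k = 0) when nobody is talking.
        return prev_odds - DECAY * prev_odds
    log_ratio = sum(math.log(p_spk / p_imp) for p_spk, p_imp in frame_likelihoods)
    return max(O_MIN, min(O_MAX, prev_odds + log_ratio))

def speaker_present(odds):
    """Binary evidence passed on to the MLN (assumed: >= threshold)."""
    return odds >= PRESENT_THRESHOLD
```

The clamp bounds how long a strongly confirmed speaker takes to decay back toward neutral once speech stops.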
C. Appearance
We model the appearance of individuals whose face is
detected using the Viola-Jones face detector provided in
OpenCV. The appearance of the face is computed using the
region approximately from the forehead to the upper-lip, from
the left eye to the right eye (see figure 2). This region provides
some tolerance to different poses, while including as little of
the background as possible. For clothing, we use the region
below the detected face, approximately covering the region of
the individual’s shirt.
Appearance can be modeled using either a Gaussian mixture
model or a color histogram. For this work, we have selected
color histograms as they have been demonstrated to be an ef-
fective means of modeling both clothing [7] and skin color [3].
GMM may be more appropriate in cases where there is a need
to both model appearance and segment simultaneously [17].
The color histogram H is a measure of the frequency of
each color within the given region. For color histograms, we
quantize each channel (r, g, b) using the 4 most significant
bits. We concatenate the resulting quantized data, producing a
histogram of 4096 bins.
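The quantization step can be sketched as follows. Keeping the 4 most significant bits of each 8-bit channel gives 16 levels per channel, and concatenating the three 4-bit values gives a 12-bit bin index, hence 4096 bins. The order in which the channels are concatenated and the use of raw counts rather than normalized frequencies are assumptions made for this illustration.

```python
def color_histogram(pixels):
    """Build a 4096-bin histogram from (r, g, b) tuples with 8-bit channels."""
    hist = [0] * 4096
    for r, g, b in pixels:
        # Keep the 4 most significant bits of each channel, then concatenate
        # them into one 12-bit index (assumed order: r, g, b).
        index = ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4)
        hist[index] += 1
    return hist
```

If regions of different sizes are compared, dividing each bin by the pixel count first would make the histograms comparable.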
We compare histograms using the χ2 similarity measure
$$\chi^2(H_a, H_b) = 2 \sum_i \frac{(H_a(i) - H_b(i))^2}{H_a(i) + H_b(i)} \qquad (3)$$
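A direct transcription of equation (3) is given below as a sketch. Skipping bins that are empty in both histograms is an assumption added here to avoid division by zero; the text does not say how such bins are handled.

```python
def chi_square(ha, hb):
    """Chi-square measure between two histograms of equal length.

    Lower values mean more similar histograms; identical histograms give 0.
    """
    total = 0.0
    for a, b in zip(ha, hb):
        if a + b > 0:  # assumed handling of bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 2.0 * total
```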
1) Training: When a person is recognized using their
face, the system makes note of their appearance using a
color histogram. Since appearance may vary depending on
lighting and even pose, we maintain multiple clusters of color
histograms for each participant. We maintain these clusters
using streaming clustering. Streaming clustering is ideal for
situations such as these where we wish to cluster a large
amount of data but it is infeasible to make multiple passes
over the data.
When an individual has been recognized using face recog-
nition, we compute the χ2 measure for each existing cluster of
that individual. If an existing cluster matches this appearance
at a threshold of θh, we add support to the cluster. If no
cluster matches at the specified tolerance, then we create a
new cluster. We periodically prune clusters to remove those
that have not gained a sufficient amount of support within
a period of time. This provides tolerance to occasional false
positives, since they will likely not receive enough support
over a period of time.
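The training procedure can be sketched as a single-pass clustering loop. This is a simplified illustration, not the authors' implementation. The class name, the parameters `min_support` and `max_age`, and the choice to count support without updating the stored histogram are all assumptions.

```python
def chi_square(ha, hb):
    """Chi-square measure of equation (3); bins empty in both histograms are skipped."""
    return 2.0 * sum((a - b) ** 2 / (a + b) for a, b in zip(ha, hb) if a + b > 0)

class AppearanceClusters:
    """Streaming clusters of color histograms for one known person."""

    def __init__(self, theta_h, min_support, max_age):
        self.theta_h = theta_h          # matching threshold on the chi-square value
        self.min_support = min_support  # support needed to survive pruning
        self.max_age = max_age          # grace period before a weak cluster is pruned
        self.clusters = []              # dicts with keys "hist", "support", "born"

    def observe(self, hist, now):
        """Add one histogram seen at time `now` (called after a face match)."""
        best = None
        for cluster in self.clusters:
            d = chi_square(cluster["hist"], hist)
            if d <= self.theta_h and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            self.clusters.append({"hist": list(hist), "support": 1, "born": now})
        else:
            best[1]["support"] += 1

    def prune(self, now):
        """Drop clusters that are old but never gathered enough support."""
        self.clusters = [c for c in self.clusters
                         if c["support"] >= self.min_support
                         or now - c["born"] < self.max_age]
```

Because each histogram is examined once and then discarded, memory grows with the number of clusters rather than with the number of frames.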
2) Testing: Given an individual of an unknown identity,
we compare the color histogram against existing clusters of
all individuals. We return the identity of the individual whose
cluster histogram has the lowest χ2 value.
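The test step amounts to a nearest-cluster search over all known people, sketched below. The dictionary layout and the function name are hypothetical, and no rejection threshold is applied because the text does not describe one for this step.

```python
def chi_square(ha, hb):
    """Chi-square measure of equation (3); bins empty in both histograms are skipped."""
    return 2.0 * sum((a - b) ** 2 / (a + b) for a, b in zip(ha, hb) if a + b > 0)

def identify_by_appearance(hist, clusters_by_person):
    """Return (person, score) for the cluster with the lowest chi-square value.

    clusters_by_person maps each known person to a list of cluster histograms.
    """
    best_person, best_score = None, float("inf")
    for person, cluster_hists in clusters_by_person.items():
        for cluster_hist in cluster_hists:
            score = chi_square(cluster_hist, hist)
            if score < best_score:
                best_person, best_score = person, score
    return best_person, best_score
```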
D. Distance
The distance between the user and robot can be measured
using the SR-3000 infrared time of flight camera. In this
case, the face is detected in the visible spectrum. The visible
spectrum cameras and SR-3000 are aligned to center the
images. Within the region of the face, we can measure the
distance, returning the median distance in order to provide
some tolerance to noise. For the data used in this paper, however,
position was measured by hand. In an automated system, we would
likely use the SR-3000 to acquire distance.
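The median-based distance estimate can be sketched as below. The depth map is assumed to be a 2D list of distances in meters that is already aligned with the color image, and the face box format is hypothetical.

```python
from statistics import median

def face_distance(depth_map, face_box):
    """Median depth inside the detected face region.

    face_box is (top, left, bottom, right) in pixel indices, with bottom and
    right exclusive. The median keeps isolated noisy readings from dominating.
    """
    top, left, bottom, right = face_box
    values = [depth_map[row][col]
              for row in range(top, bottom)
              for col in range(left, right)]
    return median(values)
```

A single outlier, for example a background pixel at 9 m inside a face region at about 1 m, shifts the mean noticeably but leaves the median almost unchanged.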
E. Markov Logic Networks
Logical representations of AI are extremely adept at han-
dling complexity and making inferences about the real world.
The difficulty with logical representations is that in some cases
information cannot be stated without uncertainty. The power
of Markov Logic Networks [1] lies in their ability to fuse
a logical representation with statistical uncertainty.
Fig. 2. Windows used for color histograms of shirt and face.
We define relations between objects using first order logic,
and use Markov networks to assign a weight to each relation
in terms of relative importance. In formal inference, if one for-
mula is violated, the world has zero probability of occurring.
Weighting eases this restriction by saying that if a relationship
is violated, it is still possible, just less likely.
Formally, a Markov Logic Network (MLN) is a set L of
pairs (Fi, wi), where Fi is a formula and wi is its associated
real-valued weight. Combining L with a set of constants C gives
a ground Markov network M_{L,C}.
The probability of a set of ground predicates X is defined as
$$P(X = x) = \frac{1}{Z} \exp\left(\sum_{i=1}^{m} w_i n_i(x)\right) \qquad (4)$$
where ni(x) is the number of true groundings of formula Fi in
the world x, and Z is a normalizing constant that ensures
that the probabilities sum to 1.
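To make equation (4) concrete, the sketch below enumerates every world over a small set of Boolean ground atoms and normalizes the exponentiated weighted counts. It is a toy illustration and not how Alchemy computes probabilities. The single formula and its weight of 1.5 are made up for the example.

```python
import math
from itertools import product

def world_probabilities(num_atoms, weighted_formulas):
    """Probability of every truth assignment under equation (4).

    weighted_formulas is a list of (w_i, n_i) pairs, where n_i(world) returns
    the number of true groundings of formula i in that world.
    """
    worlds = list(product([False, True], repeat=num_atoms))
    scores = [math.exp(sum(w * n(world) for w, n in weighted_formulas))
              for world in worlds]
    z = sum(scores)  # normalizing constant Z
    return {world: score / z for world, score in zip(worlds, scores)}

def face_implies_id(world):
    """One grounding of FaceLike -> Id over the atoms (FaceLike, Id)."""
    face_like, identity = world
    return 1 if (not face_like) or identity else 0

probs = world_probabilities(2, [(1.5, face_implies_id)])
```

The world (True, False) violates the formula, so it is less likely than the other three worlds but still has nonzero probability, which is the softening of hard logical constraints described above.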
Our logical representation includes predicates:
• FaceLike(person, time) - Face identification,
• LooksLike(person, time) - Complexion comparison,
• DressedLike(person, time) - Clothing comparison,
• SoundsLike(person, time) - Speaker identification,
• Distance(time, range) - Distance to speaker (quantized to 1 m),
• Id(identity, time) - True identity.
Person indicates the identity of the participant; time is a
unique identifier of each time instant which may not be in
precise correlation with chronological time; range is quantized
to 1 or 2 meters. All values, with the exception of true identity
are known at the time of inference, although some data may
be omitted because it is not present. The true identity is the
only query predicate.
We define the following rules:
• SoundsLike(p, t) ∧ Distance(t, r) → Id(i, t)
• LooksLike(p, t) ∧ Distance(t, r) → Id(i, t)
• DressedLike(p, t) ∧ Distance(t, r) → Id(i, t)
• FaceLike(p, t) ∧ Distance(t, r) → Id(i, t)
The MLN with maximum a posteriori (MAP) inference finds
the most probable state, given evidence. In the case of person
identification, the evidence is provided by sensor modalities.
The MLN uses all of the available information to return the
state of the world that is most likely, i.e., the person most likely
to be interacting with the robot. All learning and inference
procedures are packaged together in the Alchemy package [5],
which we make use of in this paper.
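The effect of MAP inference over the four rules can be illustrated without Alchemy. The sketch below is not Alchemy's API. It adds the assumption that exactly one identity holds at each time instant, in which case the most probable world is the identity with the largest sum of weights over the rule groundings whose evidence is present. The weight table, modality names, and function name are hypothetical.

```python
def map_identity(evidence, distance, weights, identities):
    """Pick the identity with the highest total weight of satisfied rules.

    evidence maps a modality name to the person it reported, or to None when
    that modality is missing. weights maps (modality, reported_person,
    distance, identity) to a learned weight; absent entries count as 0.
    """
    def score(identity):
        return sum(weights.get((modality, reported, distance, identity), 0.0)
                   for modality, reported in evidence.items()
                   if reported is not None)  # missing modalities contribute nothing
    return max(identities, key=score)

# Hypothetical weights for two known users at a range of 1 meter.
weights = {
    ("face", "alice", 1, "alice"): 2.0,
    ("clothing", "alice", 1, "alice"): 1.0,
    ("speech", "bob", 1, "bob"): 0.8,
    ("speech", "bob", 1, "alice"): 0.3,  # speech reporting bob is sometimes alice
}
```

Because the reported person p and the true identity i are separate variables in the rules, a weight can be learned for every reported-versus-true pair, which is what lets one modality partially overrule another.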
IV. EXPERIMENTAL RESULTS
We demonstrate results on the problem of person identifica-
tion using internally collected data sequences. In the training
sequences, individuals speak freely for four different 20-30
second sessions on a topic of their choosing. In the validation
and test sets, participants speak for two 20-30 second
sessions. We collect approximately 3 frames per second. We
systematically vary lighting, angle and distance between the
robot and user.
During normal conversations, participants varied pose and
facial expression. Participants were asked to make an effort to
periodically look at the robot, but would often move
their eyes and head. To reinforce the perception of
interacting with the robot, a speaker was placed on the robot
and periodically the robot “spoke” to the user.
In the first experiment, we use MAP identification to
evaluate precision and recall, using fixed thresholds for face,
speaker, and appearance. Recall is defined in terms of the
percentage of images where the user was detected. Precision
represents the accuracy of the system. This dataset represents
a set of known users that the robot will interact with on a
regular basis.
We evaluated person identification with MAP inference
over MLNs trained on different combinations of modalities. Table I
shows the results of this experiment. As anticipated from
previously reported results, the precision rate is good for face
recognition (94.1%), but the recall rate is 47%. In many cases
individuals were not looking at the robot or in some cases their
eyes were closed. Face recognition was unable to register these
images.
Speaker recognition has a recall of just under 20% with a
precision of 79.6%. This reflects the difficulty of recogniz-
ing people with ambient noise in the environment and with
microphones located possibly up to 2 meters away from the
participant. Higher precision is achieved in this modality by
changing the threshold or by reducing the speaker set size.
Complexion and appearance by themselves produce
remarkably good recall (68.2%) and precision (92.2%). High recall
is to be expected since this uses the appearance from
any detected face. In some cases, however, the individuals
were too close to the camera to have any meaningful clothing
information.
By combining face with speech, we increase the recall
to 54%, with a precision of 92.4%. Face, complexion and
appearance produce a recall of 72% with a precision of
95.3%. Face, complexion, appearance, and speech produce
74% recall with 97% precision. What is somewhat remarkable
about these results is that as we add modalities, not only does
the recall increase, but the precision increases as well. This
trend is predicted by Ross et al. [14].
Fig. 3. Examples of challenging, yet correctly classified instances.
TABLE I
FUSION RESULTS FOR MAP INFERENCE ON DIFFERENT MODALITIES
Modalities                              Recall   Precision
Face                                    47.1%    94.1%
Speech                                  19.5%    79.6%
Face, Speech                            54%      92.35%
Complexion, Appearance                  68.2%    92.2%
Face, Complexion, Appearance            72%      95.3%
Face, Complexion, Appearance, Speech    74%      97%
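Recall and precision as used in Table I can be computed from per-frame decisions, as sketched below. This reading takes recall as the fraction of frames with the user present that received any identification, and precision as the fraction of those identifications that were correct. The list layout and function name are illustrative.

```python
def recall_precision(true_ids, predicted_ids):
    """Compute (recall, precision) over frames in which a user was present.

    predicted_ids[i] is None when the system made no identification for
    frame i, for example because no face could be registered.
    """
    identified = [(truth, pred)
                  for truth, pred in zip(true_ids, predicted_ids)
                  if pred is not None]
    recall = len(identified) / len(true_ids)
    if not identified:
        return recall, 0.0
    correct = sum(truth == pred for truth, pred in identified)
    return recall, correct / len(identified)
```

Under this definition a modality that rarely fires, such as speaker recognition at long range, can have low recall while still being precise on the frames where it does respond.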
Figure 3 shows a few correctly classified examples, illus-
trative of the variety in this data set. All three participants
have their eyes closed, and none of them are looking at the
camera. The participant on the left is looking upward, the
middle participant is looking off to the right, and the participant
on the right has tilted his head slightly. The combination
of atypical poses and closed eyes makes these very
challenging to recognize.
Next, we examine the impact of varying threshold on the
precision and recall. As the threshold goes down, we expect
recall to increase and precision to decrease. This is due to the
fact that there will be a greater rate of identification, but this
comes at the cost of decreased precision. The results of this
analysis are shown in figure 4. The figure shows the trend
for both face and speaker identification. In the experiment, we
fixed the thresholds for all other modalities except the one in
question. Note that in the worst case, the precision is at 94%;
in the best case the precision is 97.4%. The worst case for
recall is 71.3%; the best case is 77.2%. The Markov Logic
Networks demonstrate a tolerance to varying thresholds for
underlying classifiers.
Finally, we consider the question of which modality is best
for recognition. MLNs learn weights for all possible combinations of
reported identity and true identity. Because of this, in many cases
the MLNs are capable of learning which individuals could
be confused and which individuals often are not confused.
The MLN can also learn a different weighting for modalities
depending on the participant. For example, if a person had a
distinctive voice, the MLN might assign a much higher weight
for voice than for other modalities.
Fig. 4. Recall / precision curve showing the impact of varying thresholds.
The weighting of modalities for one participant is shown
in figure 5. At 1 meter, the most important modality is clothing,
followed by appearance. This result may seem counterintuitive
until we examine the data. The participant is wearing
a black shirt with a bright green picture. This picture is
distinctive among other participants and can only be clearly
seen at a close distance. When the participant is farther away,
the details of the picture become less clear.
At 2 meters, the most important modality is face, followed
by appearance. This result is supported by
psychologists who have stressed the importance of appearance
on face recognition. Speech is also important and clothing is
least important.
We again look at the weighting for each modality for a
different participant. At 1 meter, the most significant modality
is speaker identification, followed by face recognition. At 2
meters, the most significant modality is face recognition while
the rest of the modalities are approximately comparable. The
voice appears to be quite distinctive, but the distinctiveness is
only heard clearly when the participant is close to the robot.
When farther away, the face becomes more reliable.
Fig. 5. The importance of modalities, varying by distance.
V. CONCLUSION AND FUTURE WORK
To make human-robot interaction natural, robots must be
capable of quickly and accurately identifying individuals
through biometric information available during typical interactions.
This necessitates a high level of recall without
any sacrifice in precision. We demonstrated the fusion of
face, speaker, clothing, and complexion for identification. We
demonstrated a dramatic improvement in recall rates by fusing
these modalities using Markov Logic Networks. We have also
shown the robustness of MLNs to missing information and to
modalities of varying levels of precision.
MLNs provide an interesting framework for both biometrics
and understanding human actions and activities in general. This
could be extended in the future to incorporate other aspects,
such as learning habits and preferences of users, then identi-
fying individuals based on their actions. We did not explore
the role of contextual information on prior probability. We
anticipate that contextual information such as where and when
can add a strong prior probability for identifying individuals.
In this work, we ran all modalities in parallel. However, this
is not necessary. We could extend this work to a cascading
classifier. In a laboratory environment computational power
may not be an issue. However, on a mobile robot, we may
have limited computational power and selecting one or more
modalities that provide the best combination of identification
power and computational efficiency becomes an important
task. For example, if a subject is 10 meters away, we may
select gait recognition to identify the subject. If the subject is
closer, we may select face recognition. Intermediate distances
may include a combination of both modalities.
ACKNOWLEDGMENT
This work was partially supported by the Office of Naval
Research under job order numbers N0001407WX20452 and
N0001408WX30007 awarded to Greg Trafton.
REFERENCES
[1] Domingos, P. and Richardson, M. Markov Logic: A Unifying Framework for Statistical Relational Learning. In L. Getoor and B. Taskar (eds.), Introduction to Statistical Relational Learning (pp. 339-371), 2007. Cambridge, MA: MIT Press.
[2] Ji, M., Kim, S., Kim, H., “Text-Independent Speaker Identification using Soft Channel Selection in Home Robot Environments,” IEEE Trans. on Consumer Electronics, v 54(1), pp. 140-144, 2008.
[3] Jones, M.J., Rehg, J.M., “Statistical color models with application to skin detection”, Technical Report CRL 98/11, Compaq, Cambridge Research Laboratory, 1998.
[4] B. Kamgar-Parsi, W. Lawson, and B. Kamgar-Parsi, “Towards Development of a Face Recognition System for Watch-List Surveillance”, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), in press.
[5] Kok, S., Singla, P., Richardson, M., and Domingos, P., “The alchemy system for statistical relational AI”, Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2005.
[6] Li, F., Wechsler, H., “Open Set Face Recognition using Transduction”, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 27(11), 2005, pp. 1686–1697.
[7] McKenna, S.J., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H., “Tracking Groups of People”, Computer Vision and Image Understanding (CVIU), Vol. 80(1), October, 2000, pp. 42–56.
[8] Nandakumar, K., Jain, A.K., Ross, A., “Fusion in Multibiometric Identification Systems: What about the Missing Data?”, In Proc. International Conference on Biometrics (ICB), 2009.
[9] http://robotic.media.mit.edu/projects/robots/mds/overview/overview.html
[10] Palanivel, S., Yegnanarayana, B., “Multimodal Person Authentication using Speech, Face, and Visual Speech”, Computer Vision and Image Understanding (CVIU), 2008, pp. 44–55.
[11] Phillips, P. J., Grother, P., Micheals, R. J., Blackburn, D. M., Tabassi, E., Bone, J.M., FRVT2002: Overview and Summary, Available at http://www.frvt.org/FVRT2002, 2003.
[12] Przybocki, M., Martin, A., NIST Speaker Recognition Evaluation Chronicles. In Odyssey: The Speaker and Language Recognition Workshop, pp. 12-22, 2004.
[13] Quatieri, T., Discrete Time Speech Signal Processing. Pearson Education Inc., Delhi, India, 2002.
[14] Ross, A.; Nandakumar, K.; Jain, A.K.; Handbook of Multibiometrics. Springer, 2006.
[15] A. Salah et al., “Multimodal identification and localization of users in a smart environment,” Journal of Multimodal User Interfaces, vol. 2, pp. 75-91, 2008.
[16] Singla, P.; Kautz, H.; Gallagher, A.; Luo, J.; “Discovery of social relationships in consumer photo collections using Markov Logic”, Computer Vision and Pattern Recognition Workshops, 2008.
[17] Sivic, J., Zitnick, C.L., and Szeliski, R., “Finding people in repeated shots of the same scene”. In Proc. of the 16th British Machine Vision Conference (BMVC), pages 909-918, 2006.
[18] Stone, P.; Zickler, T.; Darrell, T.; “Autotagging Facebook: Social network context improves photo annotation,” Computer Vision and Pattern Recognition Workshops, 2008.
[19] Valin, J.-M.; Yamamoto, S.; Rouat, J.; Michaud, F.; Nakadai, K.; Okuno, H.G.; “Robust Recognition of Simultaneous Speech by a Mobile Robot,” IEEE Transactions on Robotics, vol. 23, no. 4, pp. 742-752, Aug. 2007.
[20] Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y., “Robust Face Recognition via Sparse Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb 2009.