TRANSCRIPT
Probabilistic Graphical Models and Their Applications
Bjoern Andres and Bernt Schiele
Max Planck Institute for Informatics
slides adapted from Peter Gehler
October 22, 2015
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 1 / 68
Intro
Organization
I Lecture 2 hours/week
  I Room E1.4 023, Thu 10:00–12:00 (except Nov 5: E1.4 021)
I Exercises 2 hours/week
  I Room E1.4 024, Mon 10:00–12:00 (except Oct 26: E1.4 023)
  I Starts Monday
I Course website: http://www.d2.mpi-inf.mpg.de/gm
  I Slides
  I Pointers to books and papers
  I Homework assignments
I Mailing list: See website for how to subscribe
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 2 / 68
Intro
Exercises & Exam
I Exercises
  I Typically one assignment per week
  I Typically Thursday → Thursday
  I Theoretical and practical exercises
  I Starts Monday with a primer on Scientific Programming
I Exam
  I Oral exam at the end of the semester (must be passed!)
  I Offered both in English and German
I Final grade
  I 50% exercises
  I 50% oral exam
I Tutors
  I Evgeny Levinkov ([email protected])
  I Eldar Insafutdinov ([email protected])
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 3 / 68
Intro
Related Classes @UdS
I High-Level Computer Vision (SS), Fritz & Schiele
I Machine Learning (WS), Hein
I Statistical Learning I+II (SS, WS), Lengauer
I Optimization I+II, Convex Optimization (SS, WS), . . .
I Pattern and Speech Recognition (WS), Klakow
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 4 / 68
Intro
Department: Computer Vision & Multimodal Computing
Offers
I Master’s theses
I Bachelor’s theses
I HiWi positions
I etc.
Topics
I Machine learning
I Computer vision
I Machine learning applied to computer vision
Come, talk to us!
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 5 / 68
Intro
Literature
I All books available from the library as part of the “Semesterapparat”
I Main book
  I Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2011, ISBN-13: 978-0521518147, http://tinyurl.com/3flppuo
I Additional references
  I Bishop, Pattern Recognition and Machine Learning, Springer New York, 2006, ISBN-13: 978-0387310732
  I Koller, Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009, ISBN-13: 978-0262013192
  I MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003, ISBN-13: 978-0521642989
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 6 / 68
Intro
Literature
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 7 / 68
Intro
Topics
I Recap: Probability and Decision Theory (today)
I Graphical Models
  I Basics (Directed, Undirected, Factor Graphs)
  I Inference
  I Learning
I Inference
  I Deterministic Inference (Sum-Product, Junction Tree)
  I Approximate Inference (Loopy BP, Sampling, Variational)
I Application to Computer Vision Problems
  I Body Pose Estimation
  I Object Detection
  I Semantic Segmentation
  I Image Denoising
  I . . .
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 8 / 68
Intro
Today’s topics
I Overview: Machine Learning
  I What is machine learning?
  I Different problem settings and examples
I Probability theory
I Decision theory, inference and decision
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 9 / 68
Machine Learning
Machine Learning
Overview
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 10 / 68
Machine Learning
Machine Learning – What’s that?
I Do you use machine learning systems already?
I Can you think of an application?
I Can you define the term “machine learning”?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 11 / 68
Machine Learning
I Goal of machine learning:
  I Machines that learn to perform a task from experience
I We can formalize this as
y = f(x;w) (1)
y is called the output variable, x the input variable, and w the model parameters (typically learned)
I Classification vs regression:
  I regression: y continuous
  I classification: y discrete (e.g. a class label)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 12 / 68
Machine Learning
I Goal of machine learning:
  I Machines that learn to perform a task from experience
I We can formalize this as
y = f(x;w) (2)
y is called the output variable, x the input variable, and w the model parameters (typically learned)
I ... learn ... adjust the parameter w
I ... a task ... the function f
I ... from experience ... using a training dataset D, either D = {x1, . . . , xn} or D = {(x1, y1), . . . , (xn, yn)} (see the sketch below)
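A minimal sketch of this setup, using ordinary least-squares linear regression as one possible (hypothetical) choice of f(x;w); the synthetic dataset D and the closed-form fit are illustrative assumptions, not the methods of this course.

```python
import numpy as np

# Training dataset D = {(x_1, y_1), ..., (x_n, y_n)}  (synthetic, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))                  # inputs x
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=100)   # noisy outputs y

# Assume a linear model f(x; w) = w_0 + w_1 * x and "learn" w by least squares
X_design = np.column_stack([np.ones(len(X)), X[:, 0]])
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predict y for a new input x (regression: y is continuous)
x_new = 0.3
y_pred = w[0] + w[1] * x_new
print("learned w:", w, " prediction f(0.3; w):", y_pred)
```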
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 13 / 68
Machine Learning
Different Scenarios
I Unsupervised Learning
I Supervised Learning
I Reinforcement Learning
I Let’s discuss
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 14 / 68
Machine Learning
Unsupervised Learning
I We are given some input data points
D = {x1, x2, . . . , xn} (3)
I Goals:
  I Determine the data distribution p(x) → density estimation
  I Visualize the data by projections → dimensionality reduction
  I Find groupings of the data → clustering (sketched below)
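As a concrete instance of the clustering goal, a minimal k-means sketch (k-means is listed among the models a few slides later) on synthetic 2-D data; the data, k = 2, and the number of iterations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabeled data D = {x_1, ..., x_n}: two synthetic 2-D blobs
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
for _ in range(20):
    # Assignment step: each point goes to its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centers)
```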
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 15 / 68
Machine Learning
Unsupervised Learning – Examples
Image Priors for Denoising
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 16 / 68
Machine Learning
Unsupervised Learning – Examples
Image Priors for Inpainting
Image from “A generative perspective on MRFs in low-level vision”, Schmidt et al., CVPR 2010
Black line: statistics from original images; blue and red: statistics after applying two different algorithms
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 17 / 68
Machine Learning
Unsupervised Learning – Examples
Human Shape Model
SCAPE: Shape Completion and Animation of People, Anguelov et al.
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 18 / 68
Machine Learning
Unsupervised Learning – Examples
I Clustering scientific publications according to topics
I A generative model for human motion
I Generating training data for the Microsoft Kinect Xbox controller
I Clustering Flickr images
I Novelty detection, predicting outliers
I Anomaly detection in visual inspection
I Video surveillance
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 19 / 68
Machine Learning
Unsupervised Learning – Models
Just flashing some keywords (→ Machine Learning)
I Mixture Models
I Neural Networks
I K-Means
I Kernel Density Estimation
I Principal Component Analysis (PCA)
I Graphical Models (here)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 20 / 68
Machine Learning
Supervised Learning
I Given are pairs of training examples from X × Y
D = {(x1, y1), (x2, y2), . . . , (xn, yn)} (4)
I Goal is to learn the relationship between x and y
I Given a new example point x, predict y
y = f(x;w) (5)
I We want to generalize to unseen data
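A minimal supervised-learning sketch under these definitions: a 1-nearest-neighbour classifier (a hypothetical choice of f, not the models used later in the course) fit on labelled pairs and evaluated on held-out data to check generalization to unseen points.

```python
import numpy as np

rng = np.random.default_rng(1)
# Labelled pairs D = {(x_i, y_i)}: two synthetic 2-D classes
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out part of D to measure generalization to unseen data
perm = rng.permutation(len(X))
train, test = perm[:150], perm[150:]

def predict(x_new, X_train, y_train):
    """1-nearest-neighbour prediction y = f(x; w); here 'w' is the stored training set."""
    i = np.linalg.norm(X_train - x_new, axis=1).argmin()
    return y_train[i]

y_hat = np.array([predict(x, X[train], y[train]) for x in X[test]])
print("held-out accuracy:", (y_hat == y[test]).mean())
```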
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 21 / 68
Machine Learning
Supervised Learning – Examples
Face Detection
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 22 / 68
Machine Learning
Supervised Learning – Examples
Image Classification
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 23 / 68
Machine Learning
Supervised Learning – Examples
Semantic Image Segmentation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 24 / 68
Machine Learning
Supervised Learning – Examples
Body Part Estimation (in Kinect)
Figure from Decision Tree Fields, Nowozin et al., ICCV 2011
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 25 / 68
Machine Learning
Supervised Learning – Examples
I Person identification
I Credit card fraud detection
I Industrial inspection
I Speech recognition
I Action classification in videos
I Human body pose estimation
I Visual object detection
I Predicting the survival rate of a patient
I ...
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 26 / 68
Machine Learning
Supervised Learning - Models
Flashing more keywords
I Multilayer Perceptron (Backpropagation)
I Linear Regression, Logistic Regression
I Support Vector Machine (SVM)
I Boosting
I Graphical models
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 27 / 68
Machine Learning
Reinforcement Learning
I Setting: given a situation, find an action to maximize a reward function
I Feedback:
  I we only get feedback on how well we are doing
  I we do not get feedback on what the best action would be (“indirect teaching”)
I Feedback given as reward:
  I each action yields a reward, or
  I a reward is given at the end (e.g. the robot has found its goal, the computer has won a game of Backgammon)
I Exploration: try out new actions
I Exploitation: use known actions that yield high rewards
I Find a good trade-off between exploration and exploitation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 28 / 68
Machine Learning
Variations of the general theme
I All problems fall in these broad categories
I But your problem will surely have some extra twists
I Many different variations of the aforementioned problems are studied separately
I Let’s look at some ...
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 29 / 68
Machine Learning
Semi-Supervised Learning
I We are given a dataset of l labeled examples Dl = {(x1, y1), . . . , (xl, yl)}, as in supervised learning
I Additionally we are given a set of u unlabeled examples Du = {x_{l+1}, . . . , x_{l+u}}, as in unsupervised learning
I Goal is y = f(x;w)
I Question: how can we utilize the extra information in Du?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 30 / 68
Machine Learning
Semi-Supervised Learning: Two Moons
I Two labeled examples (red and blue) and additional unlabeled black dots
Two moons
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 31 / 68
Machine Learning
Transductive Learning
I We are given a set of labeled examples
D = {(x1, y1), . . . , (xn, yn)} (6)
I Additionally we know the test data points {x_1^{te}, . . . , x_m^{te}} (not their labels!)
I Can we do better, including this knowledge?
I This should be easier than making predictions for the entire set X
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 32 / 68
Machine Learning
On-line Learning
I The training data is presented step-by-step and is never available entirely
I At each time-step t we are given a new datapoint x_t (or (x_t, y_t))
I When is online learning a sensible scenario?
  I We want to continuously update the model – we can train a model with little data, but the model should become better over time when more data is available (similar to how humans learn)
  I We have limited storage for data and the model – a viable setting for large-scale datasets (e.g. the size of the internet)
I How do we learn in this scenario?
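One common answer (an illustrative sketch, not the only option): update the parameters with a stochastic gradient step per datapoint, touching each point once and never storing the stream. The linear model, learning rate, and synthetic stream below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(2)        # parameters of f(x; w) = w[0] + w[1] * x
lr = 0.1               # learning rate (assumed)

# Data arrives one (x_t, y_t) at a time and is discarded after the update
for t in range(1000):
    x_t = rng.uniform(-1, 1)
    y_t = 3.0 * x_t + 0.5 + 0.1 * rng.normal()
    y_hat = w[0] + w[1] * x_t
    grad = (y_hat - y_t) * np.array([1.0, x_t])  # gradient of the squared error
    w -= lr * grad

print("parameters after the stream:", w)  # ends up near (0.5, 3.0) for this stream
```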
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 33 / 68
Machine Learning
Large-Scale Learning
I Learning with millions of examples
I Study fast learning algorithms (e.g. parallelizable, special hardware)
I Problems of storing the data, computing the features, etc.
I There is no strict definition for “large-scale”
I Small-scale learning: the limiting factor is the number of examples
I Large-scale learning: limited by T_max, the maximal time available for computation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 34 / 68
Machine Learning
Active Learning
I We are given a set of examples
D = {x1, . . . , xn} (7)
I Goal is to learn y = f(x;w)
I Each label yi costs something, e.g. Ci ∈ R+
I Question: How to learn well while paying little?
I This is almost always the case, labeling is expensive
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 35 / 68
Machine Learning
Structured Output Learning
I We are given a set of training examples D = {(x1, y1), . . . , (xn, yn)}, but y ∈ Y contains more structure than y ∈ R or y ∈ {−1, 1}
I Consider binary image segmentation
  I y is an entire image labeling
  I Y is the set of all labelings, |Y| = 2^{#pixels}
I Other examples: y could be a graph, a tree, a ranking, . . .
I Goal is to learn a function f(x, y;w) and predict y = argmax_{y ∈ Y} f(x, y;w) (see the sketch below)
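A toy sketch of this prediction step by brute force, with a made-up scoring function f(x, y; w) (per-pixel evidence plus a smoothness bonus for agreeing neighbours, both assumed). Enumerating all 2^{#pixels} labelings is only possible because the "image" is 2×2, which is exactly why the inference machinery of this course is needed for real images.

```python
import itertools
import numpy as np

# Toy 2x2 "image": x holds a per-pixel evidence score for label 1 (assumed values)
x = np.array([[0.9, 0.8],
              [0.2, 0.1]])
smoothness = 0.3  # bonus when 4-neighbours take the same label (assumed weight)

def f(x, y, w=smoothness):
    """Made-up structured score: data terms plus pairwise agreement terms."""
    unary = np.sum(np.where(y == 1, x, 1.0 - x))
    pair = sum(w for i in range(2) for j in range(2)
               for di, dj in [(0, 1), (1, 0)]
               if i + di < 2 and j + dj < 2 and y[i, j] == y[i + di, j + dj])
    return unary + pair

# Exhaustive argmax over Y: all 2^4 = 16 binary labelings of the 2x2 grid
labelings = [np.array(b).reshape(2, 2) for b in itertools.product([0, 1], repeat=4)]
best = max(labelings, key=lambda y: f(x, y))
print("argmax labeling:\n", best)
```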
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 36 / 68
Machine Learning
Some final comments
I All topics are under active development and research
I Supervised classification: basically understood
I Broad range of applications, many exciting developments
I Adopting a “ML view” has far-reaching consequences; it touches problems of empirical sciences in general
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 37 / 68
Probability Theory
Probability Theory
Brief Review
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 38 / 68
Probability Theory
Brief Review
I A random variable (RV) X can take values from some set X of outcomes.
I We usually use the short-hand notation
p(x) for p(X = x) ∈ [0, 1] (8)
for the probability that X takes value x
I With
p(X), (9)
we denote the probability distribution over X
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 39 / 68
Probability Theory
Brief Review
I Two random variables (RVs) are called independent if
p(X = x, Y = y) = p(X = x)p(Y = y) (10)
I Joint probability (of X and Y)
p(x, y) instead of p(X = x, Y = y) (11)
I Conditional probability
p(x|y) instead of p(X = x|Y = y) (12)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 40 / 68
Probability Theory
The Rules of Probability
I Sum rule
p(X) = ∑_{y ∈ Y} p(X, Y = y) (13)
we “marginalize out y”. p(X = x) is also called a marginal probability
I Product Rule
p(X, Y) = p(Y|X) p(X) (14)
I And as a consequence: Bayes Theorem or Bayes Rule
p(Y|X) = p(X|Y) p(Y) / p(X) (15)
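A numerical sanity check of these three rules on a small made-up joint distribution table (the numbers are arbitrary assumptions; rows index X, columns index Y).

```python
import numpy as np

# Made-up joint distribution p(X, Y): rows are values of X, columns values of Y
p_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginals are obtained by summing out the other variable
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) p(X)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule: p(X | Y) = p(Y | X) p(X) / p(Y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y, p_xy / p_y[None, :])
print("marginal p(X):", p_x, " posterior p(X | Y=0):", p_x_given_y[:, 0])
```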
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 41 / 68
Probability Theory
Vocabulary
I Joint Probability
p(xi, yj) = nij / N
I Marginal Probability
p(xi) = ci / N
I Conditional Probability
p(yj | xi) = nij / ci
where nij is the number of observations with X = xi and Y = yj (a grid of counts with xi on one axis and yj on the other), ci = ∑_j nij is the total count for xi, and N = ∑_{ij} nij is the total number of observations.
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 42 / 68
Probability Theory
Probability Densities
I Now X is a continuous random variable, e.g. taking values in R
I Probability that X takes a value in the interval (a, b) is
p(X ∈ (a, b)) = ∫_a^b p(x) dx (16)
and we call p(x) the probability density over x
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 43 / 68
Probability Theory
Probability Densities
I p(x) must satisfy the following conditions
p(x) ≥ 0 (17)
∫_{−∞}^{∞} p(x) dx = 1 (18)
I The probability that x lies in (−∞, z) is given by the cumulative distribution function
P(z) = ∫_{−∞}^{z} p(x) dx (19)
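A quick numerical illustration of (16)–(19), using a standard normal density as the example p(x) (an assumption, as are the grid and its width); simple Riemann sums stand in for the integrals.

```python
import numpy as np

def p(x, mu=0.0, sigma=1.0):
    """Standard normal density: an example density, nonnegative everywhere (17)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-8, 8, 100_001)   # wide, fine grid standing in for (-inf, inf)
dx = xs[1] - xs[0]

total = (p(xs) * dx).sum()                              # condition (18): should be ~1
cdf_1 = (p(xs[xs <= 1.0]) * dx).sum()                   # P(1.0), eq. (19)
prob_0_1 = (p(xs[(xs >= 0) & (xs <= 1)]) * dx).sum()    # p(X in (0, 1)), eq. (16)

print("integral of p:", total)        # ~1.0
print("P(1.0):", cdf_1)               # ~0.841
print("p(X in (0,1)):", prob_0_1)     # ~0.341
```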
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 44 / 68
Probability Theory
Probability Densities
Figure: Probability density of a continuous variable
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 45 / 68
Probability Theory
Expectation and Variances
I Expectation
E[f] = ∑_{x ∈ X} p(x) f(x) (20)
E[f] = ∫_{x ∈ X} p(x) f(x) dx (21)
I Sometimes we denote the distribution that we take the expectation over as a subscript, e.g.
E_{p(·|y)}[f] = ∑_{x ∈ X} p(x|y) f(x) (22)
I Variance
var[f] = E[(f(x) − E[f(x)])^2] (23)
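A small Monte Carlo illustration of (21) and (23): drawing samples from p(x) and averaging f(x) approximates the expectation. The choices p = standard normal and f(x) = x² are assumptions for illustration, for which E[f] = 1 and var[f] = 2.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2                 # example function (assumed)
x = rng.normal(size=1_000_000)       # samples from p(x) = standard normal (assumed)

E_f = f(x).mean()                          # Monte Carlo estimate of E[f], eq. (21)
var_f = ((f(x) - E_f) ** 2).mean()         # estimate of var[f], eq. (23)
print("E[f] ≈", E_f, "(exact: 1)")
print("var[f] ≈", var_f, "(exact: 2)")
```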
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 46 / 68
Decision Theory
Decision Theory
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 47 / 68
Decision Theory
Digit Classification
I Classify digits “a” versus “b”
Figure: The digits “a” and “b”
I Goal: classify new digits such that the error probability is minimized
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 48 / 68
Decision Theory
Digit Classification - Priors
Prior Distribution
I How often do the letters “a” and “b” occur?
I Let us assume
C1 = a, p(C1) = 0.75 (24)
C2 = b, p(C2) = 0.25 (25)
The prior has to be a distribution, in particular
∑_{k=1,2} p(Ck) = 1 (26)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 49 / 68
Decision Theory
Digit Classification - Class Conditionals
I We describe every digit using some feature vector
  I the number of black pixels in each box
  I the relation between width and height
I Likelihood: how likely is it that x was generated from p(· | a), resp. p(· | b)?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 50 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class a
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 51 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class b
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 52 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class a, since p(a)=0.75
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 53 / 68
Decision Theory
Bayes Theorem
I How do we formalize this?
I We already mentioned Bayes Theorem
p(Y|X) = p(X|Y) p(Y) / p(X) (27)
I Now we apply it
p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_j p(x|Cj) p(Cj) (28)
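A numerical version of (28) for the two-class digit example: the priors p(a) = 0.75 and p(b) = 0.25 are from the slides, while the likelihood values p(x | a) and p(x | b) for the observed feature vector are made-up numbers for illustration.

```python
import numpy as np

prior = np.array([0.75, 0.25])        # p(C1 = a), p(C2 = b) from eqs. (24)-(25)
likelihood = np.array([0.4, 0.8])     # p(x | a), p(x | b): assumed values for one observed x

evidence = (likelihood * prior).sum()            # p(x) = sum_j p(x | C_j) p(C_j)
posterior = likelihood * prior / evidence        # p(C_k | x), eq. (28)
print("p(a | x) =", posterior[0], " p(b | x) =", posterior[1])
# With these numbers the posterior still favours class a: the strong prior
# outweighs the likelihood advantage of b.
```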
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 54 / 68
Decision Theory
Bayes Theorem
I Some terminology! Repeated from last slide:
p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_j p(x|Cj) p(Cj) (29)
I We use the following names
Posterior = (Likelihood × Prior) / Normalization Factor (30)
I Here the normalization factor is easy to compute. Keep an eye out for it, it will haunt us until the end of this class (and longer :) )
I It is also called the Partition Function, common symbol Z
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 55 / 68
Decision Theory
Bayes Theorem
Figure: three plots showing the Likelihood, the Likelihood × Prior, and the Posterior = (Likelihood × Prior) / Normalization Factor
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 56 / 68
Decision Theory
How to Decide?
I Two class problem C1, C2, plotting Likelihood × Prior
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 57 / 68
Decision Theory
Minimizing the Error
p(error) = p(x ∈ R2, C1) + p(x ∈ R1, C2) (31)
= p(x ∈ R2|C1) p(C1) + p(x ∈ R1|C2) p(C2) (32)
= ∫_{R2} p(x|C1) p(C1) dx + ∫_{R1} p(x|C2) p(C2) dx (33)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 58 / 68
Decision Theory
General Loss Functions
I So far we considered misclassification error only
I This is also referred to as 0/1 loss
I Now suppose we are given a more general loss function
∆ : Y × Y → R+ (34)
(y, ŷ) ↦ ∆(y, ŷ) (35)
I How do we read this?
I ∆(y, ŷ) is the cost we have to pay if y is the true class but we predict ŷ instead
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 59 / 68
Decision Theory
Example: Predicting Cancer
∆ : Y × Y → R+ (36)
(y, ŷ) ↦ ∆(y, ŷ) (37)
I Given: X-ray image. Question: Cancer yes or no? Should we have another medical check of the patient?
                 diagnosis: cancer   diagnosis: normal
truth: cancer            0                 1000
truth: normal            1                    0
I For discrete sets Y this is a loss matrix
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 60 / 68
Decision Theory
Digit Classification
I Which class should we assign x to? (p(a) = p(b) = 0.5)
I The answer
I It depends on the loss
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 61 / 68
Decision Theory
Minimizing Expected Loss (or Error)
I The expected loss for x (averaged over all decisions)
E[∆] = ∑_{k=1,...,K} ∑_{j=1,...,K} ∫_{Rj} ∆(Ck, Cj) p(x, Ck) dx (38)
I And how do we predict? Decide on one y!
y* = argmin_{y ∈ Y} ∑_{k=1,...,K} ∆(Ck, y) p(Ck|x) (39)
= argmin_{y ∈ Y} E_{p(·|x)}[∆(·, y)] (40)
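A small numerical instance of (39), reusing the cancer loss matrix from the earlier slide; the posterior values p(cancer | x) and p(normal | x) are made-up numbers for one hypothetical X-ray.

```python
import numpy as np

classes = ["cancer", "normal"]
# Loss matrix Delta[k, j]: true class k, predicted class j (from the cancer slide)
Delta = np.array([[0, 1000],
                  [1,    0]])
# Posterior p(C_k | x) for one observed x: assumed values for illustration
posterior = np.array([0.02, 0.98])

# Expected loss of each possible decision y: sum_k Delta(C_k, y) p(C_k | x), eq. (39)
expected_loss = posterior @ Delta
y_star = classes[int(expected_loss.argmin())]
print("expected loss per decision:", dict(zip(classes, expected_loss)))
print("optimal decision y*:", y_star)
# Even with only 2% posterior probability of cancer, the asymmetric loss makes
# "cancer" (i.e. sending the patient for another check) the optimal decision here.
```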
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 62 / 68
Decision Theory
Inference and Decision
I We broke down the process into two steps
  I Inference: obtaining the probabilities p(Ck|x)
  I Decision: obtaining the optimal class assignment
I Two steps!!
I The probabilities p(·|x) represent our belief about the world
I The loss ∆ tells us what to do with it!
I 0/1 loss implies deciding for max probability (exercise)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 63 / 68
Decision Theory
Three Approaches to Solve Decision Problems
1. Generative models: infer the class conditionals
p(x|Ck), k = 1, . . . ,K (41)
then combine using Bayes Theorem: p(Ck|x) = p(x|Ck) p(Ck) / p(x)
2. Discriminative models: infer posterior probabilities directly
p(Ck|x) (42)
3. Find a discriminative function minimizing Expected Loss ∆
f : X → {1, . . . ,K} (43)
Let’s discuss these options
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 64 / 68
Decision Theory
Generative Models
Pros:
I The name generative is because we can generate samples from the learnt distribution
I We can infer p(x|Ck) (or p(x) for short)
Cons:
I With high dimensionality of x ∈ X we need a large training set to determine the class-conditionals
I We may not be interested in all quantities
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 65 / 68
Decision Theory
Discriminative Models
Pros:
I No need to model p(x|Ck) (i.e. in general easier)
Cons:
I No access to model p(x|Ck)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 66 / 68
Decision Theory
Discriminative Functions
“When solving a problem of interest, do not solve a harder / more general problem as an intermediate step.”
– Vladimir Vapnik
Pros:
I One integrated system, we directly estimate the quantity of interest
Cons:
I Need ∆ during training time – revision requires re-learning
I No access to probabilities or uncertainty, thus difficult to reject a decision
I Prominent example: Support Vector Machines (SVMs)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 67 / 68
Decision Theory
Next Time ...
I ... we will meet our new friends:
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 68 / 68