TRANSCRIPT
Probabilistic Graphical Models and Their Applications
Bjoern Andres and Bernt Schiele
Max Planck Institute for Informatics
slides adapted from Peter Gehler
October 22, 2015
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 1 / 68
Intro
Organization
I Lecture 2 hours/week
  I Room E1.4 023, Thu 10:00–12:00 (except Nov 5: E1.4 021)
I Exercises 2 hours/week
  I Room E1.4 024, Mon 10:00–12:00 (except Oct 26: E1.4 023)
  I Starts Monday
I Course website: http://www.d2.mpi-inf.mpg.de/gm
  I Slides
  I Pointers to books and papers
  I Homework assignments
I Mailing list: See website for how to subscribe
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 2 / 68
Intro
Exercises & Exam
I Exercises
  I Typically one assignment per week
  I Typically Thursday → Thursday
  I Theoretical and practical exercises
  I Starts Monday with a primer on Scientific Programming
I Exam
  I Oral exam at the end of the semester (must be passed!)
  I Offered both in English and German
I Final grade
  I 50% exercises
  I 50% oral exam
I Tutors
  I Evgeny Levinkov ([email protected])
  I Eldar Insafutdinov ([email protected])
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 3 / 68
Intro
Related Classes @UdS
I High-Level Computer Vision (SS), Fritz & Schiele
I Machine Learning (WS), Hein
I Statistical Learning I+II (SS, WS), Lengauer
I Optimization I+II, Convex Optimization (SS, WS), . . .
I Pattern and Speech Recognition (WS), Klakow
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 4 / 68
Intro
Department: Computer Vision & Multimodal Computing
Offers
I Master’s theses
I Bachelor’s theses
I HiWi positions
I etc.
Topics
I Machine learning
I Computer vision
I Machine learning applied to computer vision
Come, talk to us!
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 5 / 68
Intro
Literature
I All books available from the library as part of the “Semesterapparat”
I Main book
  I Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2011, ISBN-13: 978-0521518147, http://tinyurl.com/3flppuo
I Additional references
  I Bishop, Pattern Recognition and Machine Learning, Springer New York, 2006, ISBN-13: 978-0387310732
  I Koller, Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009, ISBN-13: 978-0262013192
  I MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003, ISBN-13: 978-0521642989
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 6 / 68
Intro
Literature
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 7 / 68
Intro
Topics
I Recap: Probability and Decision Theory (today)
I Graphical Models
  I Basics (Directed, Undirected, Factor Graphs)
  I Inference
  I Learning
I Inference
  I Deterministic Inference (Sum-Product, Junction Tree)
  I Approximate Inference (Loopy BP, Sampling, Variational)
I Application to Computer Vision Problems
  I Body Pose Estimation
  I Object Detection
  I Semantic Segmentation
  I Image Denoising
  I . . .
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 8 / 68
Intro
Today’s topics
I Overview: Machine Learning
  I What is machine learning?
  I Different problem settings and examples
I Probability theory
I Decision theory, inference and decision
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 9 / 68
Machine Learning
Machine Learning
Overview
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 10 / 68
Machine Learning
Machine Learning – What’s that?
I Do you use machine learning systems already?
I Can you think of an application?
I Can you define the term “machine learning”?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 11 / 68
Machine Learning
I Goal of machine learning:
  I Machines that learn to perform a task from experience
I We can formalize this as
y = f(x;w) (1)
y is called the output variable, x the input variable, and w the model parameters (typically learned)
I Classification vs regression:
  I regression: y continuous
  I classification: y discrete (e.g. a class label)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 12 / 68
Machine Learning
I Goal of machine learning:
  I Machines that learn to perform a task from experience
I We can formalize this as
y = f(x;w) (2)
y is called the output variable, x the input variable, and w the model parameters (typically learned)
I ... learn ... adjust the parameter w
I ... a task ... the function f
I ... from experience ... using a training dataset D, either D = {x1, . . . , xn} or D = {(x1, y1), . . . , (xn, yn)} (see the sketch below)
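A minimal sketch of this setup, using ordinary least-squares linear regression as one possible (hypothetical) choice of f(x;w); the synthetic dataset D and the closed-form fit are illustrative assumptions, not the methods of this course.

```python
import numpy as np

# Training dataset D = {(x_1, y_1), ..., (x_n, y_n)}  (synthetic, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))                  # inputs x
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=100)   # noisy outputs y

# Assume a linear model f(x; w) = w_0 + w_1 * x and "learn" w by least squares
X_design = np.column_stack([np.ones(len(X)), X[:, 0]])
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predict y for a new input x (regression: y is continuous)
x_new = 0.3
y_pred = w[0] + w[1] * x_new
print("learned w:", w, " prediction f(0.3; w):", y_pred)
```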
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 13 / 68
Machine Learning
Different Scenarios
I Unsupervised Learning
I Supervised Learning
I Reinforcement Learning
I Let’s discuss
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 14 / 68
Machine Learning
Unsupervised Learning
I We are given some input data points
D = {x1, x2, . . . , xn} (3)
I Goals:
  I Determine the data distribution p(x) → density estimation
  I Visualize the data by projections → dimensionality reduction
  I Find groupings of the data → clustering (sketched below)
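As a concrete instance of the clustering goal, a minimal k-means sketch (k-means is listed among the models a few slides later) on synthetic 2-D data; the data, k = 2, and the number of iterations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Unlabeled data D = {x_1, ..., x_n}: two synthetic 2-D blobs
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
for _ in range(20):
    # Assignment step: each point goes to its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centers)
```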
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 15 / 68
Machine Learning
Unsupervised Learning – Examples
Image Priors for Denoising
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 16 / 68
Machine Learning
Unsupervised Learning – Examples
Image Priors for Inpainting
Image from “A generative perspective on MRFs in low-level vision”, Schmidt et al., CVPR 2010
Black line: statistics from original images; blue and red: statistics after applying two different algorithms
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 17 / 68
Machine Learning
Unsupervised Learning – Examples
Human Shape Model
SCAPE: Shape Completion and Animation of People, Anguelov et al.
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 18 / 68
Machine Learning
Unsupervised Learning – Examples
I Clustering scientific publications according to topics
I A generative model for human motion
I Generating training data for the Microsoft Kinect Xbox controller
I Clustering Flickr images
I Novelty detection, predicting outliers
I Anomaly detection in visual inspection
I Video surveillance
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 19 / 68
Machine Learning
Unsupervised Learning – Models
Just flashing some keywords (→ Machine Learning)
I Mixture Models
I Neural Networks
I K-Means
I Kernel Density Estimation
I Principal Component Analysis (PCA)
I Graphical Models (here)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 20 / 68
Machine Learning
Supervised Learning
I Given are pairs of training examples from X × Y
D = {(x1, y1), (x2, y2), . . . , (xn, yn)} (4)
I Goal is to learn the relationship between x and y
I Given a new example point x, predict y
y = f(x;w) (5)
I We want to generalize to unseen data
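A minimal supervised-learning sketch under these definitions: a 1-nearest-neighbour classifier (a hypothetical choice of f, not the models used later in the course) fit on labelled pairs and evaluated on held-out data to check generalization to unseen points.

```python
import numpy as np

rng = np.random.default_rng(1)
# Labelled pairs D = {(x_i, y_i)}: two synthetic 2-D classes
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Hold out part of D to measure generalization to unseen data
perm = rng.permutation(len(X))
train, test = perm[:150], perm[150:]

def predict(x_new, X_train, y_train):
    """1-nearest-neighbour prediction y = f(x; w); here 'w' is the stored training set."""
    i = np.linalg.norm(X_train - x_new, axis=1).argmin()
    return y_train[i]

y_hat = np.array([predict(x, X[train], y[train]) for x in X[test]])
print("held-out accuracy:", (y_hat == y[test]).mean())
```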
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 21 / 68
Machine Learning
Supervised Learning – Examples
Face Detection
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 22 / 68
Machine Learning
Supervised Learning – Examples
Image Classification
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 23 / 68
Machine Learning
Supervised Learning – Examples
Semantic Image Segmentation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 24 / 68
Machine Learning
Supervised Learning – Examples
Body Part Estimation (in Kinect)
Figure from Decision Tree Fields, Nowozin et al., ICCV 2011
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 25 / 68
Machine Learning
Supervised Learning – Examples
I Person identification
I Credit card fraud detection
I Industrial inspection
I Speech recognition
I Action classification in videos
I Human body pose estimation
I Visual object detection
I Predicting the survival rate of a patient
I ...
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 26 / 68
Machine Learning
Supervised Learning - Models
Flashing more keywords
I Multilayer Perceptron (Backpropagation)
I Linear Regression, Logistic Regression
I Support Vector Machine (SVM)
I Boosting
I Graphical models
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 27 / 68
Machine Learning
Reinforcement Learning
I Setting: given a situation, find an action to maximize a reward function
I Feedback:
  I we only get feedback on how well we are doing
  I we do not get feedback on what the best action would be (“indirect teaching”)
I Feedback given as reward:
  I each action yields a reward, or
  I a reward is given at the end (e.g. the robot has found its goal, the computer has won a game of Backgammon)
I Exploration: try out new actions
I Exploitation: use known actions that yield high rewards
I Find a good trade-off between exploration and exploitation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 28 / 68
Machine Learning
Variations of the general theme
I All problems fall in these broad categories
I But your problem will surely have some extra twists
I Many different variations of the aforementioned problems are studied separately
I Let’s look at some ...
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 29 / 68
Machine Learning
Semi-Supervised Learning
I We are given a dataset of l labeled examples Dl = {(x1, y1), . . . , (xl, yl)}, as in supervised learning
I Additionally we are given a set of u unlabeled examples Du = {x_{l+1}, . . . , x_{l+u}}, as in unsupervised learning
I Goal is y = f(x;w)
I Question: how can we utilize the extra information in Du?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 30 / 68
Machine Learning
Semi-Supervised Learning: Two Moons
I Two labeled examples (red and blue) and additional unlabeled black dots
Two moons
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 31 / 68
Machine Learning
Transductive Learning
I We are given a set of labeled examples
D = {(x1, y1), . . . , (xn, yn)} (6)
I Additionally we know the test data points {x_1^{te}, . . . , x_m^{te}} (not their labels!)
I Can we do better, including this knowledge?
I This should be easier than making predictions for the entire set X
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 32 / 68
Machine Learning
On-line Learning
I The training data is presented step-by-step and is never available entirely
I At each time-step t we are given a new datapoint x_t (or (x_t, y_t))
I When is online learning a sensible scenario?
  I We want to continuously update the model – we can train a model with little data, but the model should become better over time when more data is available (similar to how humans learn)
  I We have limited storage for data and the model – a viable setting for large-scale datasets (e.g. the size of the internet)
I How do we learn in this scenario?
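One common answer (an illustrative sketch, not the only option): update the parameters with a stochastic gradient step per datapoint, touching each point once and never storing the stream. The linear model, learning rate, and synthetic stream below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(2)        # parameters of f(x; w) = w[0] + w[1] * x
lr = 0.1               # learning rate (assumed)

# Data arrives one (x_t, y_t) at a time and is discarded after the update
for t in range(1000):
    x_t = rng.uniform(-1, 1)
    y_t = 3.0 * x_t + 0.5 + 0.1 * rng.normal()
    y_hat = w[0] + w[1] * x_t
    grad = (y_hat - y_t) * np.array([1.0, x_t])  # gradient of the squared error
    w -= lr * grad

print("parameters after the stream:", w)  # ends up near (0.5, 3.0) for this stream
```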
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 33 / 68
Machine Learning
Large-Scale Learning
I Learning with millions of examples
I Study fast learning algorithms (e.g. parallelizable, special hardware)
I Problems of storing the data, computing the features, etc.
I There is no strict definition for “large-scale”
I Small-scale learning: the limiting factor is the number of examples
I Large-scale learning: limited by T_max, the maximal time available for computation
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 34 / 68
Machine Learning
Active Learning
I We are given a set of examples
D = {x1, . . . , xn} (7)
I Goal is to learn y = f(x;w)
I Each label yi costs something, e.g. Ci ∈ R+
I Question: How to learn well while paying little?
I This is almost always the case, labeling is expensive
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 35 / 68
Machine Learning
Structured Output Learning
I We are given a set of training examples D = {(x1, y1), . . . , (xn, yn)}, but y ∈ Y contains more structure than y ∈ R or y ∈ {−1, 1}
I Consider binary image segmentation
  I y is an entire image labeling
  I Y is the set of all labelings, |Y| = 2^{#pixels}
I Other examples: y could be a graph, a tree, a ranking, . . .
I Goal is to learn a function f(x, y;w) and predict y = argmax_{y ∈ Y} f(x, y;w) (see the sketch below)
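A toy sketch of this prediction step by brute force, with a made-up scoring function f(x, y; w) (per-pixel evidence plus a smoothness bonus for agreeing neighbours, both assumed). Enumerating all 2^{#pixels} labelings is only possible because the "image" is 2×2, which is exactly why the inference machinery of this course is needed for real images.

```python
import itertools
import numpy as np

# Toy 2x2 "image": x holds a per-pixel evidence score for label 1 (assumed values)
x = np.array([[0.9, 0.8],
              [0.2, 0.1]])
smoothness = 0.3  # bonus when 4-neighbours take the same label (assumed weight)

def f(x, y, w=smoothness):
    """Made-up structured score: data terms plus pairwise agreement terms."""
    unary = np.sum(np.where(y == 1, x, 1.0 - x))
    pair = sum(w for i in range(2) for j in range(2)
               for di, dj in [(0, 1), (1, 0)]
               if i + di < 2 and j + dj < 2 and y[i, j] == y[i + di, j + dj])
    return unary + pair

# Exhaustive argmax over Y: all 2^4 = 16 binary labelings of the 2x2 grid
labelings = [np.array(b).reshape(2, 2) for b in itertools.product([0, 1], repeat=4)]
best = max(labelings, key=lambda y: f(x, y))
print("argmax labeling:\n", best)
```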
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 36 / 68
Machine Learning
Some final comments
I All topics are under active development and research
I Supervised classification: basically understood
I Broad range of applications, many exciting developments
I Adopting a “ML view” has far-reaching consequences; it touches problems of empirical sciences in general
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 37 / 68
Probability Theory
Probability Theory
Brief Review
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 38 / 68
Probability Theory
Brief Review
I A random variable (RV) X can take values from some set X of outcomes.
I We usually use the short-hand notation
p(x) for p(X = x) ∈ [0, 1] (8)
for the probability that X takes value x
I With
p(X), (9)
we denote the probability distribution over X
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 39 / 68
Probability Theory
Brief Review
I Two random variables (RVs) are called independent if
p(X = x, Y = y) = p(X = x)p(Y = y) (10)
I Joint probability (of X and Y)
p(x, y) instead of p(X = x, Y = y) (11)
I Conditional probability
p(x|y) instead of p(X = x|Y = y) (12)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 40 / 68
Probability Theory
The Rules of Probability
I Sum rule
p(X) = ∑_{y ∈ Y} p(X, Y = y) (13)
we “marginalize out y”. p(X = x) is also called a marginal probability
I Product Rule
p(X, Y) = p(Y|X) p(X) (14)
I And as a consequence: Bayes Theorem or Bayes Rule
p(Y|X) = p(X|Y) p(Y) / p(X) (15)
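A numerical sanity check of these three rules on a small made-up joint distribution table (the numbers are arbitrary assumptions; rows index X, columns index Y).

```python
import numpy as np

# Made-up joint distribution p(X, Y): rows are values of X, columns values of Y
p_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginals are obtained by summing out the other variable
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) p(X)
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule: p(X | Y) = p(Y | X) p(X) / p(Y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
assert np.allclose(p_x_given_y, p_xy / p_y[None, :])
print("marginal p(X):", p_x, " posterior p(X | Y=0):", p_x_given_y[:, 0])
```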
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 41 / 68
Probability Theory
Vocabulary
I Joint Probability
p(xi, yj) = nij / N
I Marginal Probability
p(xi) = ci / N
I Conditional Probability
p(yj | xi) = nij / ci
where nij is the number of observations with X = xi and Y = yj (a grid of counts with xi on one axis and yj on the other), ci = ∑_j nij is the total count for xi, and N = ∑_{ij} nij is the total number of observations.
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 42 / 68
Probability Theory
Probability Densities
I Now X is a continuous random variable, e.g. taking values in R
I Probability that X takes a value in the interval (a, b) is
p(X ∈ (a, b)) = ∫_a^b p(x) dx (16)
and we call p(x) the probability density over x
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 43 / 68
Probability Theory
Probability Densities
I p(x) must satisfy the following conditions
p(x) ≥ 0 (17)
∫_{−∞}^{∞} p(x) dx = 1 (18)
I The probability that x lies in (−∞, z) is given by the cumulative distribution function
P(z) = ∫_{−∞}^{z} p(x) dx (19)
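A quick numerical illustration of (16)–(19), using a standard normal density as the example p(x) (an assumption, as are the grid and its width); simple Riemann sums stand in for the integrals.

```python
import numpy as np

def p(x, mu=0.0, sigma=1.0):
    """Standard normal density: an example density, nonnegative everywhere (17)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-8, 8, 100_001)   # wide, fine grid standing in for (-inf, inf)
dx = xs[1] - xs[0]

total = (p(xs) * dx).sum()                              # condition (18): should be ~1
cdf_1 = (p(xs[xs <= 1.0]) * dx).sum()                   # P(1.0), eq. (19)
prob_0_1 = (p(xs[(xs >= 0) & (xs <= 1)]) * dx).sum()    # p(X in (0, 1)), eq. (16)

print("integral of p:", total)        # ~1.0
print("P(1.0):", cdf_1)               # ~0.841
print("p(X in (0,1)):", prob_0_1)     # ~0.341
```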
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 44 / 68
Probability Theory
Probability Densities
Figure: Probability density of a continuous variable
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 45 / 68
Probability Theory
Expectation and Variances
I Expectation
E[f] = ∑_{x ∈ X} p(x) f(x) (20)
E[f] = ∫_{x ∈ X} p(x) f(x) dx (21)
I Sometimes we denote the distribution that we take the expectation over as a subscript, e.g.
E_{p(·|y)}[f] = ∑_{x ∈ X} p(x|y) f(x) (22)
I Variance
var[f] = E[(f(x) − E[f(x)])^2] (23)
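A small Monte Carlo illustration of (21) and (23): drawing samples from p(x) and averaging f(x) approximates the expectation. The choices p = standard normal and f(x) = x² are assumptions for illustration, for which E[f] = 1 and var[f] = 2.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2                 # example function (assumed)
x = rng.normal(size=1_000_000)       # samples from p(x) = standard normal (assumed)

E_f = f(x).mean()                          # Monte Carlo estimate of E[f], eq. (21)
var_f = ((f(x) - E_f) ** 2).mean()         # estimate of var[f], eq. (23)
print("E[f] ≈", E_f, "(exact: 1)")
print("var[f] ≈", var_f, "(exact: 2)")
```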
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 46 / 68
Decision Theory
Decision Theory
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 47 / 68
Decision Theory
Digit Classification
I Classify digits “a” versus “b”
Figure: The digits “a” and “b”
I Goal: classify new digits such that the error probability is minimized
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 48 / 68
Decision Theory
Digit Classification - Priors
Prior Distribution
I How often do the letters “a” and “b” occur?
I Let us assume
C1 = a, p(C1) = 0.75 (24)
C2 = b, p(C2) = 0.25 (25)
The prior has to be a distribution, in particular
∑_{k=1,2} p(Ck) = 1 (26)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 49 / 68
Decision Theory
Digit Classification - Class Conditionals
I We describe every digit using some feature vector
  I the number of black pixels in each box
  I the relation between width and height
I Likelihood: how likely is it that x was generated from p(· | a), resp. p(· | b)?
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 50 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class a
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 51 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class b
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 52 / 68
Decision Theory
Digit Classification
I Which class should we assign x to?
I The answer
I Class a, since p(a)=0.75
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 53 / 68
Decision Theory
Bayes Theorem
I How do we formalize this?
I We already mentioned Bayes Theorem
p(Y|X) = p(X|Y) p(Y) / p(X) (27)
I Now we apply it
p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_j p(x|Cj) p(Cj) (28)
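A numerical version of (28) for the two-class digit example: the priors p(a) = 0.75 and p(b) = 0.25 are from the slides, while the likelihood values p(x | a) and p(x | b) for the observed feature vector are made-up numbers for illustration.

```python
import numpy as np

prior = np.array([0.75, 0.25])        # p(C1 = a), p(C2 = b) from eqs. (24)-(25)
likelihood = np.array([0.4, 0.8])     # p(x | a), p(x | b): assumed values for one observed x

evidence = (likelihood * prior).sum()            # p(x) = sum_j p(x | C_j) p(C_j)
posterior = likelihood * prior / evidence        # p(C_k | x), eq. (28)
print("p(a | x) =", posterior[0], " p(b | x) =", posterior[1])
# With these numbers the posterior still favours class a: the strong prior
# outweighs the likelihood advantage of b.
```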
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 54 / 68
Decision Theory
Bayes Theorem
I Some terminology! Repeated from last slide:
p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_j p(x|Cj) p(Cj) (29)
I We use the following names
Posterior = (Likelihood × Prior) / Normalization Factor (30)
I Here the normalization factor is easy to compute. Keep an eye out for it, it will haunt us until the end of this class (and longer :) )
I It is also called the Partition Function, common symbol Z
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 55 / 68
Decision Theory
Bayes Theorem
Figure: three plots showing the Likelihood, the Likelihood × Prior, and the Posterior = (Likelihood × Prior) / Normalization Factor
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 56 / 68
Decision Theory
How to Decide?
I Two class problem C1, C2, plotting Likelihood × Prior
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 57 / 68
Decision Theory
Minimizing the Error
p(error) = p(x ∈ R2, C1) + p(x ∈ R1, C2) (31)
= p(x ∈ R2|C1) p(C1) + p(x ∈ R1|C2) p(C2) (32)
= ∫_{R2} p(x|C1) p(C1) dx + ∫_{R1} p(x|C2) p(C2) dx (33)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 58 / 68
Decision Theory
General Loss Functions
I So far we considered misclassification error only
I This is also referred to as 0/1 loss
I Now suppose we are given a more general loss function
∆ : Y × Y → R+ (34)
(y, ŷ) ↦ ∆(y, ŷ) (35)
I How do we read this?
I ∆(y, ŷ) is the cost we have to pay if y is the true class but we predict ŷ instead
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 59 / 68
Decision Theory
Example: Predicting Cancer
∆ : Y × Y → R+ (36)
(y, ŷ) ↦ ∆(y, ŷ) (37)
I Given: X-ray image. Question: Cancer yes or no? Should we have another medical check of the patient?
                 diagnosis: cancer   diagnosis: normal
truth: cancer            0                 1000
truth: normal            1                    0
I For discrete sets Y this is a loss matrix
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 60 / 68
Decision Theory
Digit Classification
I Which class should we assign x to? (p(a) = p(b) = 0.5)
I The answer
I It depends on the loss
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 61 / 68
Decision Theory
Minimizing Expected Loss (or Error)
I The expected loss for x (averaged over all decisions)
E[∆] = ∑_{k=1,...,K} ∑_{j=1,...,K} ∫_{Rj} ∆(Ck, Cj) p(x, Ck) dx (38)
I And how do we predict? Decide on one y!
y* = argmin_{y ∈ Y} ∑_{k=1,...,K} ∆(Ck, y) p(Ck|x) (39)
= argmin_{y ∈ Y} E_{p(·|x)}[∆(·, y)] (40)
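A small numerical instance of (39), reusing the cancer loss matrix from the earlier slide; the posterior values p(cancer | x) and p(normal | x) are made-up numbers for one hypothetical X-ray.

```python
import numpy as np

classes = ["cancer", "normal"]
# Loss matrix Delta[k, j]: true class k, predicted class j (from the cancer slide)
Delta = np.array([[0, 1000],
                  [1,    0]])
# Posterior p(C_k | x) for one observed x: assumed values for illustration
posterior = np.array([0.02, 0.98])

# Expected loss of each possible decision y: sum_k Delta(C_k, y) p(C_k | x), eq. (39)
expected_loss = posterior @ Delta
y_star = classes[int(expected_loss.argmin())]
print("expected loss per decision:", dict(zip(classes, expected_loss)))
print("optimal decision y*:", y_star)
# Even with only 2% posterior probability of cancer, the asymmetric loss makes
# "cancer" (i.e. sending the patient for another check) the optimal decision here.
```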
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 62 / 68
Decision Theory
Inference and Decision
I We broke down the process into two steps
  I Inference: obtaining the probabilities p(Ck|x)
  I Decision: obtaining the optimal class assignment
I Two steps!!
I The probabilities p(·|x) represent our belief about the world
I The loss ∆ tells us what to do with it!
I 0/1 loss implies deciding for max probability (exercise)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 63 / 68
Decision Theory
Three Approaches to Solve Decision Problems
1. Generative models: infer the class conditionals
p(x|Ck), k = 1, . . . ,K (41)
then combine using Bayes Theorem: p(Ck|x) = p(x|Ck) p(Ck) / p(x)
2. Discriminative models: infer posterior probabilities directly
p(Ck|x) (42)
3. Find a discriminative function minimizing Expected Loss ∆
f : X → {1, . . . ,K} (43)
Let’s discuss these options
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 64 / 68
Decision Theory
Generative Models
Pros:
I The name generative is because we can generate samples from the learnt distribution
I We can infer p(x|Ck) (or p(x) for short)
Cons:
I With high dimensionality of x ∈ X we need a large training set to determine the class-conditionals
I We may not be interested in all quantities
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 65 / 68
Decision Theory
Discriminative Models
Pros:
I No need to model p(x|Ck) (i.e. in general easier)
Cons:
I No access to model p(x|Ck)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 66 / 68
Decision Theory
Discriminative Functions
“When solving a problem of interest, do not solve a harder / more general problem as an intermediate step.”
– Vladimir Vapnik
Pros:
I One integrated system, we directly estimate the quantity of interest
Cons:
I Need ∆ during training time – revision requires re-learning
I No access to probabilities or uncertainty, thus difficult to reject a decision
I Prominent example: Support Vector Machines (SVMs)
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 67 / 68
Decision Theory
Next Time ...
I ... we will meet our new friends:
Andres & Schiele (MPII) Probabilistic Graphical Models October 22, 2015 68 / 68