
Understanding Social Media with Machine Learning

Xiaojin Zhu

[email protected]
Department of Computer Sciences

University of Wisconsin–Madison, USA

CCF/ADL Beijing 2013


Outline

1 Spatio-Temporal Signal Recovery from Social Media

2 Machine Learning Basics
   Probability
   Statistical Estimation
   Decision Theory
   Graphical Models
   Regularization
   Stochastic Processes

3 Socioscope: A Probabilistic Model for Social Media

4 Case Study: Roadkill


Spatio-Temporal Signal Recovery from Social Media



Spatio-temporal Signal: When, Where, How Much

Direct instrumental sensing is difficult and expensive


Humans as Sensors


Not “hot trend” discovery: We know what event we want to monitor

Not natural language processing for social media: We are given a reliable text classifier for "hit"

Our task: precisely estimating a spatio-temporal intensity function $f_{st}$ of a pre-defined target phenomenon.


Challenges of Using Humans as Sensors

Keyword doesn't always mean event:
- "I was just told I look like dead crow."
- "Don't blame me if one day I treat you like a dead crow."

Human sensors aren’t under our control

Location stamps may be erroneous or missing:
- 3% have GPS coordinates: (-98.24, 23.22)
- 47% have valid user profile location: Bristol, UK, New York
- 50% don't have valid location information: Hogwarts, In the traffic..blah, Sitting On A Taco


Problem Definition

Input: A list of time and location stamps of the target posts.

Output: $f_{st}$, the intensity of the target phenomenon at location s (e.g., New York) and time t (e.g., 0-1am)


Why Simple Estimation is Bad

$\hat{f}_{st} = x_{st}$, the count of target posts in bin (s, t)

Justification: MLE of the model x ∼ Poisson(f)

However:
- Population bias: assume $f_{st} = f_{s't'}$; if there are more users in (s, t), then $x_{st} > x_{s't'}$
- Imprecise location: posts without location stamps, noisy user profile locations
- Zero/low counts: if we don't see tweets from Antarctica, are there no penguins there?


Machine Learning Basics


Probability

The probability of a discrete random variable A taking the value a is $P(A = a) \in [0, 1]$.

Sometimes written as $P(a)$ when there is no danger of confusion.

Normalization: $\sum_{\text{all } a} P(A = a) = 1$.

Joint probability $P(A = a, B = b) = P(a, b)$: the two events both happen at the same time.

Marginalization: $P(A = a) = \sum_{\text{all } b} P(A = a, B = b)$, "summing out B".

Conditional probability: $P(a \mid b) = \frac{P(a, b)}{P(b)}$, a happens given b happened.

The product rule: $P(a, b) = P(a)P(b \mid a) = P(b)P(a \mid b)$.


Bayes Rule

Bayes rule: $P(a \mid b) = \frac{P(b \mid a) P(a)}{P(b)}$.

In general, $P(a \mid b, C) = \frac{P(b \mid a, C)\, P(a \mid C)}{P(b \mid C)}$, where C can be one or more random variables.

Bayesian approach: when θ is the model parameter and D the observed data, we have
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},$$
where
- $p(\theta)$ is the prior,
- $p(D \mid \theta)$ is the likelihood function (of θ, not normalized: $\int p(D \mid \theta)\, d\theta \neq 1$),
- $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ is the evidence,
- $p(\theta \mid D)$ is the posterior.
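As a toy illustration of this posterior computation (a sketch added here, not from the slides; the coin-flip data are made up), evaluate prior × likelihood on a grid over θ and normalize:

```python
import numpy as np

# Hypothetical example: posterior over a coin's head probability theta
# after observing 7 heads in 10 flips, computed on a grid.
theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter space
prior = np.ones_like(theta)              # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3   # p(D | theta) for 7 heads, 3 tails
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()        # normalize by the (discrete) evidence
print(theta[posterior.argmax()])         # posterior mode, approx 0.7
```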


Independence

The product rule simplifies to $P(a, b) = P(a)P(b)$ iff A and B are independent.

Equivalently, $P(a \mid b) = P(a)$ and $P(b \mid a) = P(b)$.


Probability density

A continuous random variable x has a probability density function (pdf) $p(x) \in [0, \infty]$.

$p(x) > 1$ is possible! The density integrates to 1:
$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$

$P(x_1 < X < x_2) = \int_{x_1}^{x_2} p(x)\, dx$

Marginalization: $p(x) = \int_{-\infty}^{\infty} p(x, y)\, dy$


Expectation and Variance

The expectation ("mean" or "average") of a function f under the probability distribution P is
$$E_P[f] = \sum_a P(a) f(a) \qquad E_p[f] = \int_x p(x) f(x)\, dx$$

In particular, if f(x) = x, this is the mean of the random variable x.

The variance of f is
$$\mathrm{Var}(f) = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2$$

The standard deviation is $\mathrm{std}(f) = \sqrt{\mathrm{Var}(f)}$.


Multivariate Statistics

When x, y are vectors, E[x] is the mean vector

$\mathrm{Cov}(x, y)$ is the covariance matrix with (i, j)-th entry $\mathrm{Cov}(x_i, y_j)$:
$$\mathrm{Cov}(x, y) = E_{x,y}[(x - E[x])(y - E[y])^\top] = E_{x,y}[x y^\top] - E[x]\, E[y]^\top$$


Some Discrete Distributions

Dirac or point mass distribution: $X \sim \delta_a$ if $P(X = a) = 1$.

Binomial, with parameters n (number of trials) and p (head probability):
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} \text{ for } x = 0, 1, \ldots, n; \quad 0 \text{ otherwise}$$

Bernoulli: Binomial with n = 1.

Multinomial, with $p = (p_1, \ldots, p_d)^\top$ (a d-sided die):
$$f(x) = \binom{n}{x_1, \ldots, x_d} \prod_{k=1}^{d} p_k^{x_k} \text{ if } \sum_{k=1}^{d} x_k = n; \quad 0 \text{ otherwise}$$


More Discrete Distributions

Poisson: $X \sim \mathrm{Poisson}(\lambda)$ if
$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$

λ is the rate or intensity parameter.

Mean: λ; variance: λ.

If $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$, then $X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$.

This is a distribution on unbounded counts with a probability mass function "hump" (mode at $\lceil \lambda \rceil - 1$).
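A quick numerical check of these Poisson properties (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.poisson(lam=3.0, size=100_000)   # X1 ~ Poisson(3)
x2 = rng.poisson(lam=5.0, size=100_000)   # X2 ~ Poisson(5)

print(x1.mean(), x1.var())   # both approx 3: mean = variance = lambda
s = x1 + x2                  # sum of independent Poissons
print(s.mean(), s.var())     # both approx 8, consistent with Poisson(3 + 5)
```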


Some Continuous Distributions

Gaussian (Normal): $X \sim N(\mu, \sigma^2)$ with parameters $\mu \in \mathbb{R}$ (the mean) and $\sigma^2$ (the variance):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

σ is the standard deviation.

If μ = 0 and σ = 1, X has a standard normal distribution.

(Scaling) If $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$.

(Independent sum) If $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum_i X_i \sim N\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$.


Some Continuous Distributions

Multivariate Gaussian: let $x, \mu \in \mathbb{R}^d$ and let $\Sigma \in S_+^d$ be a symmetric, positive definite matrix of size $d \times d$. Then $X \sim N(\mu, \Sigma)$ with pdf
$$f(x) = \frac{1}{|\Sigma|^{1/2} (2\pi)^{d/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right).$$

μ is the mean vector, Σ the covariance matrix, |Σ| its determinant, and $\Sigma^{-1}$ its inverse.


Marginal and Conditional of Gaussian

If two (groups of) variables x, y are jointly Gaussian:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right) \quad (1)$$

(Marginal) $x \sim N(\mu_x, A)$

(Conditional) $y \mid x \sim N\left( \mu_y + C^\top A^{-1}(x - \mu_x),\; B - C^\top A^{-1} C \right)$
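These formulas translate directly into code. A minimal sketch (the block values A, B, C and the observed x are made up for illustration):

```python
import numpy as np

# Joint Gaussian over (x, y) with hypothetical covariance blocks.
mu_x, mu_y = np.array([0.0]), np.array([1.0])
A = np.array([[2.0]])      # Cov(x, x)
B = np.array([[1.0]])      # Cov(y, y)
C = np.array([[0.8]])      # Cov(x, y)

x_obs = np.array([1.5])    # observed value of x

# Conditional y | x: mean mu_y + C^T A^{-1} (x - mu_x),
# covariance B - C^T A^{-1} C, as on the slide.
A_inv = np.linalg.inv(A)
cond_mean = mu_y + C.T @ A_inv @ (x_obs - mu_x)
cond_cov = B - C.T @ A_inv @ C
print(cond_mean, cond_cov)  # [1.6] and [[0.68]]
```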


More Continuous Distributions

The Gamma function (not a distribution) is $\Gamma(\alpha) = \int_0^\infty y^{\alpha - 1} e^{-y}\, dy$, with α > 0.

It generalizes the factorial: $\Gamma(n) = (n-1)!$ when n is a positive integer.

$\Gamma(\alpha + 1) = \alpha\, \Gamma(\alpha)$ for α > 0.

X has a Gamma distribution, $X \sim \mathrm{Gamma}(\alpha, \beta)$, with shape parameter α > 0 and scale parameter β > 0:
$$f(x) = \frac{1}{\beta^\alpha \Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta}, \quad x > 0.$$

It is the conjugate prior for the Poisson rate.


Statistical Estimation


Parametric Models

A statistical model H is a set of distributions.

In machine learning, we call H the hypothesis space.

A parametric model can be parametrized by a finite number of parameters: $f(x) \equiv f(x; \theta)$ with parameter $\theta \in \mathbb{R}^d$:
$$\mathcal{H} = \left\{ f(x; \theta) : \theta \in \Theta \subset \mathbb{R}^d \right\}$$
where Θ is the parameter space.


Parametric Models

We denote the expectation
$$E_\theta(g) = \int_x g(x)\, f(x; \theta)\, dx$$

$E_\theta$ means $E_{x \sim f(x; \theta)}$, not an expectation over different θ's.

For the parametric model $\mathcal{H} = \{N(\mu, 1) : \mu \in \mathbb{R}\}$, given iid data $x_1, \ldots, x_n$, the optimal estimator of the mean is $\hat\mu = \frac{1}{n} \sum x_i$.

All (parametric) models are wrong. Some are more useful than others.


Nonparametric model

A nonparametric model cannot be parametrized by a fixed number of parameters.

Model complexity grows indefinitely with sample size.

Example: $\mathcal{H} = \{P : \mathrm{Var}_P(X) < \infty\}$. Given iid data $x_1, \ldots, x_n$, the optimal estimator of the mean is again $\hat\mu = \frac{1}{n} \sum x_i$.

Nonparametric models make weaker assumptions and are thus preferred.

But parametric models converge faster and are more practical.


Estimation

Given $X_1, \ldots, X_n \sim F \in \mathcal{H}$, an estimator $\hat\theta_n$ is any function of $X_1, \ldots, X_n$ that attempts to estimate a parameter θ.

This is the "learning" in machine learning!

Example: in classification, $X_i = (x_i, y_i)$ and $\hat\theta_n$ is the learned model.

$\hat\theta_n$ is a random variable because the training set is random.

An estimator is consistent if $\hat\theta_n \xrightarrow{P} \theta$.

Consistent estimators eventually learn the correct model with more training data.


Bias

Since $\hat\theta_n$ is a random variable, it has an expectation $E_\theta(\hat\theta_n)$.

$E_\theta$ is w.r.t. the joint distribution $f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$.

The bias of the estimator is
$$\mathrm{bias}(\hat\theta_n) = E_\theta(\hat\theta_n) - \theta$$

An estimator is unbiased if $\mathrm{bias}(\hat\theta_n) = 0$.

The standard error of an estimator is $\mathrm{se}(\hat\theta_n) = \sqrt{\mathrm{Var}_\theta(\hat\theta_n)}$.

Example: let $\hat\mu = \frac{1}{n} \sum_i x_i$, where $x_i \sim N(0, 1)$. Then the standard deviation of $x_i$ is 1 regardless of n. In contrast, $\mathrm{se}(\hat\mu) = 1/\sqrt{n} = n^{-1/2}$, which decreases with n.


MSE

The mean squared error of an estimator is
$$\mathrm{mse}(\hat\theta_n) = E_\theta\left( (\hat\theta_n - \theta)^2 \right)$$

Bias-variance decomposition:
$$\mathrm{mse}(\hat\theta_n) = \mathrm{bias}^2(\hat\theta_n) + \mathrm{se}^2(\hat\theta_n) = \mathrm{bias}^2(\hat\theta_n) + \mathrm{Var}_\theta(\hat\theta_n)$$

If $\mathrm{bias}(\hat\theta_n) \to 0$ and $\mathrm{Var}_\theta(\hat\theta_n) \to 0$, then $\mathrm{mse}(\hat\theta_n) \to 0$.

This implies $\hat\theta_n \xrightarrow{P} \theta$, so $\hat\theta_n$ is consistent.
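A small simulation (a sketch, not from the slides) that verifies the decomposition numerically for the sample-mean estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 200_000

# Draw many training sets and compute the estimator on each one.
est = rng.normal(theta, 1.0, size=(trials, n)).mean(axis=1)

bias = est.mean() - theta
var = est.var()
mse = ((est - theta) ** 2).mean()
print(mse, bias**2 + var)   # the two quantities agree
```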


Maximum Likelihood

Let $x_1, \ldots, x_n \sim f(x; \theta)$ where $\theta \in \Theta$.

The likelihood function is
$$L_n(\theta) = f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

The log likelihood function is $\ell_n(\theta) = \log L_n(\theta)$.

The maximum likelihood estimator (MLE) is
$$\hat\theta_n = \mathop{\mathrm{argmax}}_{\theta \in \Theta} L_n(\theta) = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \ell_n(\theta)$$


MLE examples

The MLE for p(head) from n coin flips is count(head)/n

The MLE for $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ is $\hat\mu = \frac{1}{n} \sum_i X_i$ and $\hat\sigma^2 = \frac{1}{n} \sum_i (X_i - \hat\mu)^2$.

The MLE does not always agree with intuition. The MLE for $X_1, \ldots, X_n \sim \mathrm{uniform}(0, \theta)$ is $\hat\theta = \max(X_1, \ldots, X_n)$.
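To make these closed-form MLEs concrete (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=10_000)     # data from N(3, 4)
mu_hat = x.mean()                         # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (divides by n)
print(mu_hat, sigma2_hat)                 # approx 3 and 4

u = rng.uniform(0.0, 5.0, size=10_000)    # data from uniform(0, theta)
print(u.max())                            # MLE of theta, just below 5
```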


Properties of MLE

When $\mathcal{H}$ is identifiable, under certain conditions (see Wasserman, Theorem 9.13), the MLE $\hat\theta_n \xrightarrow{P} \theta^*$, where $\theta^*$ is the true value of the parameter θ. That is, the MLE is consistent.

Asymptotic normality: let $\mathrm{se} = \sqrt{\mathrm{Var}_\theta(\hat\theta_n)}$. Under appropriate regularity conditions, $\mathrm{se} \approx \sqrt{1/I_n(\theta)}$, where $I_n(\theta)$ is the Fisher information, and
$$\frac{\hat\theta_n - \theta}{\mathrm{se}} \rightsquigarrow N(0, 1)$$

The MLE is asymptotically efficient (it achieves the Cramér-Rao lower bound): "best" among unbiased estimators.


Frequentist statistics

Probability refers to limiting relative frequency.

Data are random.

Estimators are random because they are functions of data.

Parameters are fixed, unknown constants not subject to probabilisticstatements.

Procedures are subject to probabilistic statements; for example, 95% confidence intervals trap the true parameter value 95% of the time.

Classifiers, even learned with deterministic procedures, are random because the training set is random.

The PAC bound is frequentist. Most procedures in machine learning are frequentist methods.


Bayesian statistics

Probability refers to degree of belief.

Inference about a parameter θ is done by producing a probability distribution over it.

Start with a prior distribution p(θ).

Likelihood function $p(x \mid \theta)$: a function of θ, not x.

After observing data x, apply Bayes rule to obtain the posterior
$$p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int p(\theta')\, p(x \mid \theta')\, d\theta'} = \frac{1}{Z}\, p(\theta)\, p(x \mid \theta)$$

$Z \equiv \int p(\theta')\, p(x \mid \theta')\, d\theta' = p(x)$ is the normalizing constant or evidence.

Prediction by integrating parameters out:
$$p(x \mid \mathrm{Data}) = \int p(x \mid \theta)\, p(\theta \mid \mathrm{Data})\, d\theta$$


Frequentist vs Bayesian in machine learning

Frequentists produce a point estimate $\hat\theta$ from Data, and predict with $p(x \mid \hat\theta)$.

Bayesians keep the posterior distribution $p(\theta \mid \mathrm{Data})$, and predict by integrating over θ's.

Bayesian integration is often intractable; it needs either "nice" distributions or approximations.

The maximum a posteriori (MAP) estimate
$$\hat\theta_{\mathrm{MAP}} = \mathop{\mathrm{argmax}}_\theta\, p(\theta \mid x)$$
is a point estimate and not Bayesian.


Decision Theory


Comparing Estimators

Training set $D = (x_1, \ldots, x_n) \sim p(x; \theta)$

Learned model: $\hat\theta \equiv \hat\theta(D)$, an estimator of θ based on data D.

Loss function $L(\theta, \hat\theta) : \Theta \times \Theta \mapsto \mathbb{R}^+$:
- squared loss: $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$
- 0-1 loss: $L(\theta, \hat\theta) = 0$ if $\hat\theta = \theta$, and $1$ if $\hat\theta \neq \theta$
- KL loss: $L(\theta, \hat\theta) = \int p(x; \theta) \log\left( \frac{p(x; \theta)}{p(x; \hat\theta)} \right) dx$

Since D is random, both $\hat\theta(D)$ and $L(\theta, \hat\theta)$ are random variables.


Risk

The risk $R(\theta, \hat\theta)$ is the expected loss
$$R(\theta, \hat\theta) = E_D[L(\theta, \hat\theta(D))]$$

$E_D$ averages over training sets D sampled from the true θ.

The risk is the "average training set" behavior of a learning algorithm when the world is θ.

It is not computable: we don't know which θ the world is in.

Example: let $D = \{X_1\}$ with $X_1 \sim N(\theta, 1)$. Let $\hat\theta_1 = X_1$ and $\hat\theta_2 = 3.14$. Assume squared loss. Then $R(\theta, \hat\theta_1) = 1$ (hint: variance), while $R(\theta, \hat\theta_2) = E_D(\theta - 3.14)^2 = (\theta - 3.14)^2$.

$\hat\theta_1$ is a smart learning algorithm and $\hat\theta_2$ a dumb one. However, for tasks with $\theta \in (3.14 - 1, 3.14 + 1)$, the dumb algorithm is better.


Minimax Estimator

The maximum risk is
$$R_{\max}(\hat\theta) = \sup_\theta R(\theta, \hat\theta)$$

The minimax estimator $\hat\theta_{\mathrm{minimax}}$ minimizes the maximum risk:
$$\hat\theta_{\mathrm{minimax}} = \arg\inf_{\hat\theta} \sup_\theta R(\theta, \hat\theta)$$

The infimum is over all estimators $\hat\theta$.

The minimax estimator is the "best" in guarding against the worst possible world.


Graphical Models


The envelope quiz

Random variables $E \in \{1, 0\}$, $B \in \{r, b\}$

$P(E = 1) = P(E = 0) = 1/2$

$P(B = r \mid E = 1) = 1/2$, $P(B = r \mid E = 0) = 0$

We ask: is $P(E = 1 \mid B = b) \geq 1/2$?

$$P(E = 1 \mid B = b) = \frac{P(B = b \mid E = 1)\, P(E = 1)}{P(B = b)} = \frac{1/2 \times 1/2}{3/4} = 1/3$$

Switch.

The graphical model: $E \to B$
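A brute-force check of this posterior by enumerating the joint distribution (a sketch, not from the slides; E indexes the envelope, B the observed color):

```python
from itertools import product

# CPDs from the slide.
p_E = {1: 0.5, 0: 0.5}
p_B_given_E = {1: {'r': 0.5, 'b': 0.5}, 0: {'r': 0.0, 'b': 1.0}}

joint = {(e, b): p_E[e] * p_B_given_E[e][b]
         for e, b in product([1, 0], ['r', 'b'])}

p_b = joint[(1, 'b')] + joint[(0, 'b')]   # P(B = b) = 3/4
print(joint[(1, 'b')] / p_b)              # P(E = 1 | B = b) = 1/3
```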


Probabilistic Reasoning

The world is reduced to a set of random variables $x_1, \ldots, x_n$
- e.g. $(x_1, \ldots, x_{n-1})$ a feature vector, $x_n \equiv y$ the class label

Inference: given the joint distribution $p(x_1, \ldots, x_n)$, compute $p(X_Q \mid X_E)$ where $X_Q \cup X_E \subseteq \{x_1, \ldots, x_n\}$
- e.g. $Q = \{n\}$, $E = \{1, \ldots, n-1\}$; by the definition of conditional probability,
$$p(x_n \mid x_1, \ldots, x_{n-1}) = \frac{p(x_1, \ldots, x_{n-1}, x_n)}{\sum_v p(x_1, \ldots, x_{n-1}, x_n = v)}$$

Learning: estimate $p(x_1, \ldots, x_n)$ from training data $X^{(1)}, \ldots, X^{(N)}$, where $X^{(i)} = (x_1^{(i)}, \ldots, x_n^{(i)})$


It is difficult to reason with uncertainty

Joint distribution $p(x_1, \ldots, x_n)$:
- exponential naive storage ($2^n$ for binary r.v.)
- hard to interpret (conditional independence)

Inference $p(X_Q \mid X_E)$:
- often can't afford to do it by brute force

If $p(x_1, \ldots, x_n)$ is not given, estimate it from data:
- often can't afford to do it by brute force


Graphical models

Graphical models: efficient representation, inference, and learning on $p(x_1, \ldots, x_n)$, exactly or approximately

Two main "flavors":
- directed graphical models = Bayesian networks (often frequentist instead of Bayesian)
- undirected graphical models = Markov random fields

Key idea: make conditional independence explicit


Bayesian Network

Directed graphical models are also called Bayesian networks

A directed graph has nodes $X = (x_1, \ldots, x_n)$, some of them connected by directed edges $x_i \to x_j$.

A cycle is a directed path $x_1 \to \ldots \to x_k$ where $x_1 = x_k$.

A directed acyclic graph (DAG) contains no cycles.

A Bayesian network on the DAG is the family of distributions satisfying
$$\left\{ p \;\middle|\; p(X) = \prod_i p(x_i \mid \mathrm{Pa}(x_i)) \right\}$$
where $\mathrm{Pa}(x_i)$ is the set of parents of $x_i$.

$p(x_i \mid \mathrm{Pa}(x_i))$ is the conditional probability distribution (CPD) at $x_i$.

By specifying the CPDs for all i, we specify a particular distribution p(X).


Example: Alarm

Binary variables

P(B) = 0.001, P(E) = 0.002

P(A | B, E) = 0.95
P(A | B, ~E) = 0.94
P(A | ~B, E) = 0.29
P(A | ~B, ~E) = 0.001

P(J | A) = 0.9, P(J | ~A) = 0.05
P(M | A) = 0.7, P(M | ~A) = 0.01

[Figure: the alarm network, B → A ← E, with A → J and A → M]

$P(B, \sim E, A, J, \sim M)$
$= P(B)\, P(\sim E)\, P(A \mid B, \sim E)\, P(J \mid A)\, P(\sim M \mid A)$
$= 0.001 \times (1 - 0.002) \times 0.94 \times 0.9 \times (1 - 0.7)$
$\approx 0.000253$
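The factorized joint is easy to evaluate in code. A sketch (not from the slides) reproducing the calculation above:

```python
# Alarm-network CPDs from the slide (A's parents are B and E).
p_B, p_E = 0.001, 0.002
p_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
p_J = {1: 0.9, 0: 0.05}   # P(J = 1 | A)
p_M = {1: 0.7, 0: 0.01}   # P(M = 1 | A)

# P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
joint = p_B * (1 - p_E) * p_A[(1, 0)] * p_J[1] * (1 - p_M[1])
print(joint)   # approx 0.000253
```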


Example: Naive Bayes

[Figure: the naive Bayes graphical model $y \to x_1, \ldots, x_d$, shown both explicitly and in plate representation]

$$p(y, x_1, \ldots, x_d) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$$

Used extensively in natural language processing


No Causality Whatsoever

Network 1: A → B, with P(A) = a, P(B|A) = b, P(B|~A) = c

Network 2: B → A, with P(B) = ab + (1−a)c, P(A|B) = ab / (ab + (1−a)c), P(A|~B) = a(1−b) / (1 − ab − (1−a)c)

The two BNs are equivalent in all respects

Bayesian networks imply no causality at all

They only encode the joint probability distribution (hence correlation)

However, people tend to design BNs based on causal relations


Conditional Independence

Two r.v.s A, B are independent if $P(A, B) = P(A)P(B)$ or $P(A \mid B) = P(A)$ (the two are equivalent).

Two r.v.s A, B are conditionally independent given C if $P(A, B \mid C) = P(A \mid C)P(B \mid C)$ or $P(A \mid B, C) = P(A \mid C)$ (the two are equivalent).

This extends to groups of r.v.s

Conditional independence in a BN is precisely specified byd-separation (“directed separation”)


d-Separation Case 1: Tail-to-Tail

[Figure: $A \leftarrow C \rightarrow B$, with C unobserved (left) and observed (right)]

A, B in general dependent

A, B conditionally independent given C (observed nodes are shaded)

An observed C is a tail-to-tail node, blocks the undirected path A-B


d-Separation Case 2: Head-to-Tail

[Figure: $A \to C \to B$, with C unobserved (left) and observed (right)]

A, B in general dependent

A, B conditionally independent given C

An observed C is a head-to-tail node, blocks the path A-B


d-Separation Case 3: Head-to-Head

[Figure: $A \to C \leftarrow B$, with C unobserved (left) and observed (right)]

A, B in general independent

A, B conditionally dependent given C, or any of C’s descendants

An observed C is a head-to-head node, unblocks the path A-B


d-Separation

Any two groups of nodes A and B are conditionally independent given another group C if all undirected paths from any node in A to any node in B are blocked.

A path is blocked if it includes a node x such that either:
- the path is head-to-tail or tail-to-tail at x and $x \in C$, or
- the path is head-to-head at x, and neither x nor any of its descendants is in C.


d-Separation Example 1

The undirected path from A to B is unblocked by E (because of C), and is not blocked by F.

A, B dependent given C

[Figure: the example network over nodes A, B, C, E, F]


d-Separation Example 2

The path from A to B is blocked both at E and F

A, B conditionally independent given F

[Figure: the example network over nodes A, B, C, E, F]


Markov Random Fields

Undirected graphical models are also called Markov Random Fields

The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive.

A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!).

Define a nonnegative potential function $\psi_C : X_C \mapsto \mathbb{R}^+$.

An undirected graphical model is the family of distributions satisfying
$$\left\{ p \;\middle|\; p(X) = \frac{1}{Z} \prod_C \psi_C(X_C) \right\}$$

$Z = \int \prod_C \psi_C(X_C)\, dX$ is the partition function.


Example: A Tiny Markov Random Field

[Figure: two nodes $x_1, x_2$ joined by a single edge, forming clique C]

$x_1, x_2 \in \{-1, 1\}$

A single clique: $\psi_C(x_1, x_2) = e^{a x_1 x_2}$

$p(x_1, x_2) = \frac{1}{Z} e^{a x_1 x_2}$, with $Z = e^a + e^{-a} + e^{-a} + e^a$

$p(1, 1) = p(-1, -1) = e^a / (2e^a + 2e^{-a})$

$p(-1, 1) = p(1, -1) = e^{-a} / (2e^a + 2e^{-a})$

When the parameter a > 0, favor homogeneous chains

When the parameter a < 0, favor inhomogeneous chains
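A direct enumeration of this two-node model (a sketch, not from the slides):

```python
import numpy as np
from itertools import product

a = 1.0   # coupling parameter; a > 0 favors agreeing (homogeneous) states

# Unnormalized clique potentials over all joint states of (x1, x2).
states = list(product([-1, 1], repeat=2))
psi = {s: np.exp(a * s[0] * s[1]) for s in states}

Z = sum(psi.values())                    # partition function
p = {s: v / Z for s, v in psi.items()}   # normalized distribution
print(p[(1, 1)], p[(1, -1)])             # homogeneous state is more likely
```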


Log Linear Models

Real-valued feature functions $f_1(X), \ldots, f_k(X)$

Real-valued weights $w_1, \ldots, w_k$

$$p(X) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{k} w_i f_i(X) \right)$$


Example: The Ising Model

[Figure: a grid of nodes $x_s$ with node parameters $\theta_s$ and edge parameters $\theta_{st}$]

This is an undirected model with $x_s \in \{0, 1\}$:
$$p_\theta(x) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)$$

$f_s(X) = x_s$, $f_{st}(X) = x_s x_t$

$w_s = -\theta_s$, $w_{st} = -\theta_{st}$


Example: Image Denoising

[Figure from Bishop, PRML: a noisy image and the denoised reconstruction $\mathrm{argmax}_X P(X \mid Y)$]

$$p_\theta(X \mid Y) = \frac{1}{Z} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right)$$

$$\theta_s = \begin{cases} c & y_s = 1 \\ -c & y_s = 0 \end{cases}$$


Example: Gaussian Random Field

$$p(X) \sim N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu)^\top \Sigma^{-1} (X - \mu) \right)$$

Multivariate Gaussian.

The n × n covariance matrix Σ is positive semi-definite.

Let $\Omega = \Sigma^{-1}$ be the precision matrix.

$x_i, x_j$ are conditionally independent given all other variables if and only if $\Omega_{ij} = 0$.

When $\Omega_{ij} \neq 0$, there is an edge between $x_i, x_j$.


Conditional Independence in Markov Random Fields

Two groups of variables A and B are conditionally independent given another group C if A and B become disconnected by removing C and all edges involving C.

[Figure: removing C and its incident edges disconnects A from B]


Regularization


Regularization for Maximum Likelihood

Recall the MLE $\hat\theta_n = \mathop{\mathrm{argmax}}_{\theta \in \Theta} \ell_n(\theta)$.

It can overfit.

Regularized likelihood:
$$\hat\theta_n = \mathop{\mathrm{argmin}}_{\theta \in \Theta} -\ell_n(\theta) + \lambda \Omega(\theta)$$

$\Omega(\theta)$ is the regularizer, for example $\Omega(\theta) = \|\theta\|^2$.

This coincides with the MAP estimate under the prior distribution $p(\theta) \propto \exp(-\lambda \Omega(\theta))$.


Graph-based regularization

Nodes: $x_1, \ldots, x_n$, with $\theta = f = (f(x_1), \ldots, f(x_n))$

Edges: similarity weights computed from features, e.g.,
- k-nearest-neighbor graph, unweighted (0/1 weights)
- fully connected graph, weight decaying with distance: $w = \exp\left( -\|x_i - x_j\|^2 / \sigma^2 \right)$
- ε-radius graph

Assumption: nodes connected by a heavy edge tend to have the same value.



Graph energy

f incurs the energy
$$\sum_{i \sim j} w_{ij} (f(x_i) - f(x_j))^2$$

smooth f has small energy

constant f has zero energy


An electric network interpretation

Edges are resistors with conductance $w_{ij}$ (resistance $R_{ij} = 1/w_{ij}$)

Nodes clamped at voltages specified by f

Energy = heat generated by the network in unit time

[Figure: an electric network with nodes clamped at +1 and 0 volts]


The graph Laplacian

We can express the energy of f in closed-form using the graph Laplacian.

$n \times n$ weight matrix W on $X_l \cup X_u$: symmetric, non-negative

Diagonal degree matrix D: $D_{ii} = \sum_{j=1}^{n} W_{ij}$

Graph Laplacian matrix:
$$\Delta = D - W$$

The energy:
$$\sum_{i \sim j} w_{ij} (f(x_i) - f(x_j))^2 = f^\top \Delta f$$
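A numerical check of this identity (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2   # symmetric non-negative weights
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # graph Laplacian
f = rng.normal(size=5)

# Energy summed edge by edge (each pair counted once) ...
energy = sum(W[i, j] * (f[i] - f[j]) ** 2
             for i in range(5) for j in range(i + 1, 5))
# ... equals the quadratic form f^T (D - W) f.
print(np.isclose(energy, f @ L @ f))        # True
```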


Graph Laplacian as a Regularizer

Regression problem with training data $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$

Allow $f(x_i)$ to be different from $y_i$, but penalize the difference with a Gaussian log likelihood

Regularizer: $\Omega(f) = f^\top \Delta f$

$$\min_f \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda f^\top \Delta f$$

Equivalent to the MAP estimate with:
- Gaussian likelihood $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$, and
- Gaussian random field prior $p(f) = \frac{1}{Z} \exp\left( -\lambda f^\top \Delta f \right)$


Graph Spectrum and Regularization

Assumption: labels are "smooth" on the graph, as characterized by the graph spectrum (the eigenvalues/eigenvectors $(\lambda_i, \phi_i)_{i=1}^{n}$ of the Laplacian L):
- $L = \sum_{i=1}^{n} \lambda_i \phi_i \phi_i^\top$
- a graph has k connected components if and only if $\lambda_1 = \ldots = \lambda_k = 0$
- the corresponding eigenvectors are constant on individual connected components, and zero elsewhere
- any f on the graph can be represented as $f = \sum_{i=1}^{n} a_i \phi_i$
- graph regularizer: $f^\top L f = \sum_{i=1}^{n} a_i^2 \lambda_i$
- a smooth function f uses smooth basis vectors (those with small $\lambda_i$)


Example graph spectrum

[Figure: an example graph, with the eigenvalues and eigenvectors of its graph Laplacian; $\lambda_1 = \lambda_2 = 0.00$ (two connected components) up through $\lambda_{20} = 3.96$]


Stochastic Processes


Stochastic Process

An infinite collection of random variables indexed by x:
- $x \in \mathbb{R}$ for "time"
- more generally, $x \in \mathbb{R}^d$ (e.g., space and time)


Gaussian Process

A stochastic process where any finite number of random variables have a joint Gaussian distribution.

Index $\mathbf{x} \in \mathbb{R}^d$

$f(\mathbf{x}) \in \mathbb{R}$: the random variable indexed by $\mathbf{x}$

Mean function: $m(\mathbf{x}) = E[f(\mathbf{x})]$, e.g. $m(\mathbf{x}) = 0$ for all $\mathbf{x}$

Covariance function (kernel): $k(\mathbf{x}, \mathbf{x}') = E[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$, e.g. the RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{1}{2\sigma^2} \|\mathbf{x} - \mathbf{x}'\|^2 \right)$

Gaussian process (GP): $f(\mathbf{x}) \sim GP(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$

A draw from a GP is a function f over $\mathbb{R}^d$.


Gaussian Process vs. Gaussian Distribution

For any $\mathbf{x}_1, \ldots, \mathbf{x}_n$:
$$(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)) \sim N(\mu, \Sigma)$$

Mean vector: $\mu = (m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n))$

Covariance matrix:
$$\Sigma = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{bmatrix}$$
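This finite-dimensional view is exactly how one samples from a GP in practice: evaluate the kernel on a grid and draw from the resulting multivariate Gaussian. A sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 100)     # index points
sigma = 1.0

# RBF kernel matrix k(x_i, x_j) = exp(-(x_i - x_j)^2 / (2 sigma^2)).
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))
K += 1e-8 * np.eye(len(x))         # jitter for numerical stability

# One draw from N(0, K) is one random smooth function on the grid.
f = rng.multivariate_normal(np.zeros(len(x)), K)
print(f[:5])
```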


Poisson Process

Rate function λ(x):
- homogeneous Poisson process: λ(x) = λ, a constant
- inhomogeneous Poisson process: otherwise

Expected number of events in region $T \subset \mathbb{R}^d$: $\lambda_T = \int_T \lambda(\mathbf{x})\, d\mathbf{x}$

The number of events in T follows a Poisson distribution:
$$P(N(T) = k) = \frac{e^{-\lambda_T} \lambda_T^k}{k!}$$
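A quick simulation of a homogeneous Poisson process on an interval (a sketch, not from the slides): draw the count from $\mathrm{Poisson}(\lambda \cdot |T|)$, then place the events uniformly.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T = 4.0, 2.0                 # rate 4 events per unit length, region [0, T]

n = rng.poisson(lam * T)          # number of events in [0, T]
events = np.sort(rng.uniform(0.0, T, size=n))  # their locations
print(n, events)                  # on average about 8 events
```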


Cox Process

Doubly stochastic process

The rate λ(x) is itself sampled from another stochastic process.

E.g., the sigmoidal Gaussian Cox process [Adams et al., ICML 2009]:
$$\lambda(\mathbf{x}) = \lambda^* \sigma(f(\mathbf{x}))$$
- $\lambda^*$: an upper-bound intensity
- logistic "squashing": $\sigma(z) = \frac{1}{1 + e^{-z}}$
- $f(\mathbf{x}) \sim GP$


Socioscope: A Probabilistic Model for Social Media


Correcting Population Bias

Social media user activity intensity: $g_{st}$

$x \sim \mathrm{Poisson}(\eta(f, g))$

Link function (target post intensity): $\eta(f, g) = f \cdot g$

Count of all posts: $z \sim \mathrm{Poisson}(g)$

$g_{st}$ can be accurately recovered
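The intuition behind the correction can be sketched in a few lines (my illustration, not the paper's actual estimator; the bin intensities below are made up): estimate g from the total post counts z, then divide it out of the target counts x.

```python
import numpy as np

rng = np.random.default_rng(0)
g_true = np.array([50.0, 500.0, 5000.0])   # user activity per bin (made up)
f_true = np.array([0.02, 0.02, 0.02])      # equal target intensity everywhere

z = rng.poisson(g_true)            # counts of all posts
x = rng.poisson(f_true * g_true)   # counts of target posts

g_hat = z                          # MLE of g from total posts
print(x)                           # raw counts differ wildly across bins
print(x / g_hat)                   # population-corrected: approx 0.02 each
```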


Handling Imprecise Location

[Figure reproduced from Vardi et al. (1985), A statistical model for positron emission tomography]


Handling Imprecise Location: Transition


Handling Zero / Low Counts


The Graphical Model


Optimization and Parameter Tuning

$$\min_{\theta \in \mathbb{R}^n} \; -\sum_{i=1}^{m} (x_i \log h_i - h_i) + \lambda \Omega(\theta)$$

where $\theta_j = \log f_j$ and
$$h_i = \sum_{j=1}^{n} P_{ij}\, \eta(\theta_j, \psi_j)$$

Quasi-Newton method (BFGS)

Cross-validation: a data-based, objective approach to choosing the regularization; sub-sample events from the total observations.
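A minimal sketch of this style of penalized Poisson likelihood optimization (my illustration; the actual Socioscope link function η, transition matrix P, and regularizer Ω are as defined in the paper, while here P is taken as the identity and Ω as a simple ridge penalty):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 20
f_true = np.exp(rng.normal(2.0, 0.5, size=n))   # made-up intensities per bin
P = np.eye(n)                                   # toy transition matrix
x = rng.poisson(P @ f_true)                     # observed counts
lam = 0.1

def objective(theta):
    # Negative Poisson log likelihood in theta = log f, plus ridge penalty.
    h = P @ np.exp(theta)                       # h_i = sum_j P_ij f_j
    return -(x * np.log(h) - h).sum() + lam * (theta ** 2).sum()

res = minimize(objective, x0=np.zeros(n), method="BFGS")
f_hat = np.exp(res.x)                           # recovered intensities
print(np.round(f_hat[:5], 1), np.round(f_true[:5], 1))
```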


Theoretical Consideration

How many posts do we need to obtain reliable recovery?

If $x \sim \mathrm{Poisson}(h)$, then $E\left[ \left( \frac{x - h}{h} \right)^2 \right] = h^{-1} \approx x^{-1}$: more counts, less error.

Theorem: let f be a Hölder α-smooth d-dimensional intensity function, and suppose we observe N events from the distribution Poisson(f). Then there exists a constant $C_\alpha > 0$ such that
$$\inf_{\hat f} \sup_f \frac{E[\|\hat f - f\|_1^2]}{\|f\|_1^2} \geq C_\alpha N^{-\frac{2\alpha}{2\alpha + d}}$$

The best achievable recovery error decreases as an inverse power of N, with the exponent depending on the underlying smoothness.


Case Study: Roadkill


Roadkill Spatio-Temporal Intensity

The intensity of roadkill events within the continental US

Spatio-temporal resolution: state (48 continental US states) × hour-of-day (24 hours)

Data source: Twitter

Text classifier: trained with 1450 labeled tweets; cross-validation accuracy 90%.


Text preprocessing

Twitter streaming API: animal name + “ran over”

Remove false positives by text classification:
- "I almost ran over an armadillo on my longboard, luckily my cat-like reflexes saved me."

Feature representation:
- case folding, no stemming, keep stopwords
- @john → @USERNAME, http://wisc.edu → HTTPLINK, keep #hashtags, keep emoticons
- unigrams + bigrams

Linear SVM (a pipeline sketch follows below):
- trained on 1450 labeled tweets outside the study period
- cross-validation accuracy 90%
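A sketch of such a pipeline with scikit-learn (my illustration of the setup described above, not the authors' code; the example tweets and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled tweets: 1 = actual roadkill report, 0 = false positive.
tweets = [
    "ran over a chipmunk on my way 2 work this morning",
    "i almost ran over an armadillo on my longboard",
    "just ran over a squirrel, feel terrible",
    "don't blame me if one day i treat you like a dead crow",
]
labels = [1, 0, 1, 0]

# Unigrams + bigrams, case folding, stopwords kept (per the slide).
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True, stop_words=None),
    LinearSVC(),
)
clf.fit(tweets, labels)
print(clf.predict(["ran over a deer on highway 51 tonight"]))
```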


Chipmunk Roadkill Results


Roadkill Results on Other Species


More Interesting Things to Do

Incorporate text classification confidence in the input


Handle the time delay and spatial displacement between the target event and the generation of a post:
- "So the pigeon I ran over yesterday must have some bird friends in high places. Car is full of bird shit."
- "Ran over a chipmunk on my way 2 work this morning"

Incorporate psychological factors: will you post a tweet about running over a ...?


References

Xu, Bhargava, Nowak, Zhu. Socioscope: Spatio-temporal signal recovery from social media (extended abstract). IJCAI 2013.

Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

Bishop, Pattern Recognition and Machine Learning. Springer 2006.

Koller & Friedman, Probabilistic Graphical Models. MIT 2009.
