All the data and still not enough!

Claudia Perlich, Chief Scientist

@claudia_perlich

Predictive Modeling: Algorithms that Learn Functions

Estimating conditional probabilities

[Figure: scatter plot of Income (y-axis, mark at 50K) against Age (x-axis, mark at 45); points are labeled Buy or Not interested]

Logistic Regression

p(buy|37,78000) = 0.48

p(+|x)=

P(Buy|Age,Income)

Income Age Buy

123,000 30 yes

51,100 40 yes

68,000 55 no

74,000 46 no

23,000 47 yes

100,000 49 no
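A minimal sketch of how P(Buy|Age,Income) could be estimated from the six rows above, assuming the standard logistic form p(+|x) = 1 / (1 + exp(-(w0 + w1*income + w2*age))). The standardization, the small L2 penalty, the step size, and all function names are assumptions made for this illustration; the penalty is there because these six rows happen to be linearly separable, so an unpenalized fit would not settle on finite weights.

```python
import math

# Rows from the table above: (income, age, bought), with yes = 1 and no = 0.
rows = [
    (123_000, 30, 1),
    (51_100, 40, 1),
    (68_000, 55, 0),
    (74_000, 46, 0),
    (23_000, 47, 1),
    (100_000, 49, 0),
]

def mean_std(column):
    """Population mean and standard deviation of a list of numbers."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return mean, std

inc_mean, inc_std = mean_std([r[0] for r in rows])
age_mean, age_std = mean_std([r[1] for r in rows])

def features(income, age):
    """Intercept term plus standardized income and age."""
    return [1.0, (income - inc_mean) / inc_std, (age - age_mean) / age_std]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, income, age):
    """Estimated p(buy | age, income) under weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(income, age))))

def log_loss(w):
    """Average negative log-likelihood of the six rows."""
    total = 0.0
    for income, age, y in rows:
        p = predict(w, income, age)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)

def fit(steps=2000, lr=0.1, l2=0.1):
    """Plain gradient descent on the L2-penalized log loss (assumed settings)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] + [l2 * wi for wi in w[1:]]  # no penalty on the intercept
        for income, age, y in rows:
            err = predict(w, income, age) - y
            for j, xj in enumerate(features(income, age)):
                grad[j] += err * xj / len(rows)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

weights = fit()
p_new = predict(weights, 78_000, 37)  # same query as the slide: age 37, income 78,000
```

With only six rows the resulting estimate is purely illustrative and is not expected to reproduce the 0.48 shown on the slide.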

Data for Predictive Modeling

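[Diagram: data matrix with Examples as rows and Features as columns, plus a Target column (yes, yes, no, no, yes, no); the target of a new example is unknown (?)]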

Rules for Predictive Modeling

[Diagram: data matrix (Examples × Features) with a Target column]

Data should be:

Large enough

Independent and identically distributed (i.i.d.)

Paradox of Big Data: “You never have the data you want”

Art of making do with second best

IBM: Sales Force Optimization

Wallet is NEVER observed

[Diagram: IBM Sales to this Company and Company Revenue (D&B) are marked "We observe this in the data"; Wallet/Opportunity is marked "But we do not observe this"]

How can we make this a predictive modeling problem?

Data for Wallet Estimation?

[Diagram: data matrix (Examples × Features) with target column Wallet: 10, 5, 31, 17, 39, 4]

REALISTIC Wallets as Quantiles

Motivation:

Imagine 100 identical firms with identical IT needs

Consider the distribution of the IBM sales to these firms

The bottom firms should be able to spend as much as the top firms

Define the wallet as a high percentile of spending, conditional on the customer attributes

[Figure: histogram of IBM Sales (x-axis) against Frequency (y-axis), with the Wallet Estimate marked]

Data for Wallet Estimation

[Diagram: data matrix (Examples × Features) with target column Revenue: 10, 5, 31, 17, 39, 4]

Quantile Regression optimizing weighted absolute loss
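The weighted absolute loss behind quantile regression can be sketched as follows. For a quantile level q, under-predictions are weighted by q and over-predictions by 1 - q, so a constant that minimizes the loss sits at the q-th quantile of the targets. The function names and the choice q = 0.8 are assumptions for illustration; the six values are the target column shown on the wallet slides.

```python
def pinball_loss(y_true, y_pred, q):
    """Weighted absolute loss: weight q when y > prediction, 1 - q otherwise."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

def best_constant(y_true, q):
    """Observed value that minimizes the pinball loss as a constant prediction."""
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * len(y_true), q))

sales = [10, 5, 31, 17, 39, 4]

median_like = best_constant(sales, 0.5)   # q = 0.5 gives ordinary absolute loss
high_quantile = best_constant(sales, 0.8)  # a "high percentile" wallet-style target
```

A full quantile regression replaces the constant with a function of the customer attributes and minimizes the same loss over its parameters.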

[Figure: sequence of scatter plots of IBM Revenue (y-axis) against Firm Sales / Company Sales (x-axis, ticks from 10 to 80), each marking companies C1 and C2 with the labels "Opportunity for C1" and "Opportunity for C2"]

Medical Diagnosis: Breast Cancer

Siemens: Computer-Aided Detection of Breast Cancer in Mammograms

1712 patients, 6816 images

105,000 candidates

Image feature vector: [x1, x2, …, x117]

Target: Malignant?

[Figure: mammogram views labeled MLO, CC, MLO, CC]

Siemens Medical: fMRI breast cancer data

[Figure: Model score (y-axis) against Log of Patient ID (x-axis); every point is a candidate. The plot is annotated with four patient groups:]

245 patients: 36% cancer

414 patients: 1% cancer

1027 patients: 0% cancer

18 patients: 85% cancer

In essence, the most predictive variable is the patient ID
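A small arithmetic sketch of why the patient ID is so predictive, using only the four group sizes and cancer rates quoted above. It compares the expected squared error (Brier score) of predicting one overall rate for everyone with that of predicting each group's own rate. Treating the quoted percentages as exact per-patient rates is an assumption made for this illustration.

```python
# (number of patients, cancer rate) for the four patient-ID groups above.
groups = [(245, 0.36), (414, 0.01), (1027, 0.00), (18, 0.85)]

total_patients = sum(n for n, _ in groups)
expected_cases = sum(n * rate for n, rate in groups)
overall_rate = expected_cases / total_patients

# Expected Brier score when every patient gets the same predicted probability.
brier_constant = sum(
    n * (rate * (1 - overall_rate) ** 2 + (1 - rate) * overall_rate ** 2)
    for n, rate in groups
) / total_patients

# Expected Brier score when each patient gets the rate of their ID group.
brier_by_group = sum(n * rate * (1 - rate) for n, rate in groups) / total_patients

relative_gain = 1 - brier_by_group / brier_constant
```

Knowing nothing but the ID range already removes a large share of the error, which is the signature of data pooled from different sources rather than a genuine medical signal.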

Data for Diagnosis from Multiple Sources

[Diagram: data matrix (Examples × Features) with target column Cancer: yes, no, no, no, no, no]

Modeling the Sources …

[Diagram: data matrix (Examples × Features) with the columns Source and Cancer]

Source  Cancer
1       yes
2       no
1       no
1       no
4       no
3       no

Digital Advertising

[Diagram: target column Click?: yes, yes, no, no, yes, no]

Optimizing Clicks in Advertising?

Click Optimization: Fumbling in the Dark

[Figure: Top 10 Apps by CTR]

Digital Advertising

Post-view purchase conversion

Data collection for post-view purchase conversion

[Diagram: timeline (Time) for a cohort of random prospects; the outcome is marked "?"]

Data for Advertising

[Diagram: data matrix (Examples × Features) with target column PV Buy: no, no, no, no, yes, yes]

Multi-Armed Bandit: Exploration vs. exploitation

Show some random ads to learn a good model

Tradeoff between learning and using

Ignore the ad altogether …
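The exploration/exploitation tradeoff above can be sketched with an epsilon-greedy policy: with a small probability a random ad is shown to keep learning, otherwise the ad with the best observed rate so far is used. The simulated response rates, the value of epsilon, and every name below are hypothetical and chosen only for illustration.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over ads with the given response rates."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: show a random ad (also used until every ad was shown once).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: show the ad with the best observed response rate.
            arm = max(range(n_arms), key=lambda a: wins[a] / pulls[a])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    estimates = [w / p if p else 0.0 for w, p in zip(wins, pulls)]
    return pulls, estimates

# Hypothetical response rates for three ads.
pulls, estimates = epsilon_greedy([0.02, 0.05, 0.10])
```

Raising epsilon spends more impressions on learning; lowering it spends more on the currently best-looking ad, at the risk of locking in a poor early estimate.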

http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg

Sample people PRIOR to ad

[Diagram: data matrix (Examples × Features) with target column Buy: no, no, no, no, yes, yes]

Very few luxury cars are bought online

Maserati: $128,000

Reality of Online Purchases

[Diagram: data matrix (Examples × Features) with target column Buy: no, no, no, no, no, yes]

Predict other indicators: search, brand site visit, or scheduling a test drive

[Diagram: data matrix (Examples × Features) with target column Site Visit: no, no, no, yes, yes, yes]

Online Display Advertising

Who should you really advertise to???

Data for Effective Advertising

[Diagram: data matrix (Examples × Features) with target column Impact: 1, 0.3, 0.5, 0, 0, 0.1]

Measuring impact and causal effect?

A/B Testing

You only get an aggregate answer, not a label per example

Alternative Histories (Counterfactual)

Fundamentally Impossible!
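To make the A/B testing limitation concrete, here is a minimal sketch of what such a test returns: a single difference in conversion rates for the whole population, with a standard error, and nothing that could serve as a per-example Impact label. The counts and function names are hypothetical.

```python
import math

def ab_lift(conv_treated, n_treated, conv_control, n_control):
    """Aggregate lift: conversion rate with the ad minus the rate without it."""
    return conv_treated / n_treated - conv_control / n_control

def two_proportion_z(conv_treated, n_treated, conv_control, n_control):
    """z-score of the lift using the unpooled standard error of two proportions."""
    p_t = conv_treated / n_treated
    p_c = conv_control / n_control
    se = math.sqrt(p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control)
    return (p_t - p_c) / se

# Hypothetical experiment: 10,000 users per group.
lift = ab_lift(120, 10_000, 90, 10_000)
z = two_proportion_z(120, 10_000, 90, 10_000)
```

The output is one number per experiment; which individual users were actually influenced remains unobserved.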

[Diagram: data matrix (Examples × Features) with target column Impact: 1, 0.3, 0.5, 0, 0, 0.1]

Build two separate models and calculate impact as the difference

[Diagram: two data matrices, each with target column Site Visit (yes, no, no, yes, no, no): Examples 1 (seen ad) and Examples 2 (not seen ad)]

Expected Impact: p(SV|Ad) - p(SV|no ad)
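A minimal sketch of the two-model idea, assuming a single hypothetical categorical feature ("segment") and using a smoothed rate table as a stand-in for each predictive model. One model is fit on examples that saw the ad, one on examples that did not, and the expected impact for a segment is the difference of the two predictions. The data and all names are invented for illustration.

```python
from collections import defaultdict

def fit_rate_model(examples):
    """Stand-in 'model': add-one smoothed site-visit rate per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [visits, total]
    for segment, visited in examples:
        counts[segment][0] += visited
        counts[segment][1] += 1
    return {seg: (v + 1) / (n + 2) for seg, (v, n) in counts.items()}

# Hypothetical examples: (segment, site_visit).
seen_ad = [
    ("auto_reader", 1), ("auto_reader", 1), ("auto_reader", 0),
    ("other", 0), ("other", 0), ("other", 1),
]
not_seen_ad = [
    ("auto_reader", 1), ("auto_reader", 0), ("auto_reader", 0),
    ("other", 0), ("other", 1), ("other", 0),
]

p_sv_ad = fit_rate_model(seen_ad)         # estimates p(SV | Ad, segment)
p_sv_no_ad = fit_rate_model(not_seen_ad)  # estimates p(SV | no ad, segment)

# Expected impact per segment: p(SV|Ad) - p(SV|no ad).
impact = {seg: p_sv_ad[seg] - p_sv_no_ad[seg] for seg in p_sv_ad}
```

In this toy data the ad moves the "auto_reader" segment and leaves "other" unchanged; in practice the difference is only meaningful if the two groups are otherwise comparable.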

Use predictive models to measure impact

Negative Test: wrong ad

Positive Test: A/B comparison

Advertising Fraud

Is there really a person on the other end wanting to see the site?

Data for Fraud Detection

[Diagram: data matrix (Examples × Features) with target column Human?: yes, no, no, yes, yes, no]

Telling the difference between an algorithm and a human

Turing test, CAPTCHA

Bot traffic networks

Audiences in Video Advertising

Pleasing the advertising oracle …

Audience reports from matched populations in Facebook

68% of the ads were shown to females

Makeup for 32% of ads

The Oracle

Data for Audience Optimization

[Diagram: data matrix (Examples × Features) with target column Gender: male, female, female, male, male, female]

Weighted Logistic Regression on Aggregated Data

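[Diagram: data matrix (Examples × Features) in which every example appears twice, once per gender, with a Weight column next to the target Gender]

Weight  Gender
0.32    male
0.68    female
0.32    male
0.68    female
0.73    male
0.27    female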
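A minimal sketch of the weighting trick shown above: each example is entered once as male and once as female, weighted by the share from the aggregate report, and a logistic regression is fit with those weights. The single binary feature x, the gradient-descent settings, and the function names are assumptions for illustration; the weights are the ones on the slide.

```python
import math

# (x, is_female, weight). Each example appears twice, once per gender.
# x is a hypothetical binary feature, e.g. "served under the first report".
rows = [
    (1, 0, 0.32), (1, 1, 0.68),
    (1, 0, 0.32), (1, 1, 0.68),
    (0, 0, 0.73), (0, 1, 0.27),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted(rows, steps=5000, lr=0.5):
    """Gradient descent on the weighted log loss for p(female | x)."""
    w0, w1 = 0.0, 0.0
    total_weight = sum(wt for _, _, wt in rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y, wt in rows:
            err = sigmoid(w0 + w1 * x) - y
            g0 += wt * err       # gradient for the intercept
            g1 += wt * err * x   # gradient for the feature weight
        w0 -= lr * g0 / total_weight
        w1 -= lr * g1 / total_weight
    return w0, w1

w0, w1 = fit_weighted(rows)
p_female_x1 = sigmoid(w0 + w1)  # recovers the reported 68% share
p_female_x0 = sigmoid(w0)       # recovers the reported 27% share
```

With a single binary feature the model can only reproduce the reported shares; the approach becomes useful once real features let it redistribute those shares across individual examples.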

Hyperlocal Targeting?

Data for Location Reliability in Auction

[Diagram: data matrix (Examples × Features) with target column Reliable?: yes, no, no, yes, yes, no]

30% of smartphone users travel faster than the speed of sound …
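One way such a finding could be produced is a plausibility check on consecutive location reports from the same device: compute the great-circle distance, divide by the elapsed time, and flag anything faster than sound. This is a sketch under assumptions; the ping format, the threshold constant, and the sample coordinates are hypothetical and not taken from the talk.

```python
import math

SPEED_OF_SOUND_M_S = 343.0    # approximate value in air at about 20 °C
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def implied_speed_m_s(ping_a, ping_b):
    """Speed implied by two pings given as (unix_seconds, lat, lon)."""
    (t1, lat1, lon1), (t2, lat2, lon2) = ping_a, ping_b
    dist = haversine_m(lat1, lon1, lat2, lon2)
    dt = abs(t2 - t1)
    if dt == 0:
        return math.inf if dist > 0 else 0.0
    return dist / dt

def is_plausible(ping_a, ping_b):
    """True if the device did not need to exceed the speed of sound."""
    return implied_speed_m_s(ping_a, ping_b) <= SPEED_OF_SOUND_M_S

# Hypothetical pings from one device, ten minutes apart.
new_york = (0, 40.7128, -74.0060)
nearby = (600, 40.7218, -74.0060)    # roughly 1 km north
barcelona = (600, 41.3851, 2.1734)   # another continent
```

Here `is_plausible(new_york, nearby)` holds while `is_plausible(new_york, barcelona)` does not, which would mark the second location report as unreliable.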

Catalan traditions pop up everywhere …

Data for Location Reliability in Auction

[Diagram: data matrix (Examples × Features) with target column Reliable?: maybe, no, no, maybe, maybe, no]



Paradox of Big Data: “You never have the data you want”

Art of making do with second best

It is all a matter of how creative you are at cheating …
