all the data and still not enough
Estimating conditional probabilities
[Figure: scatter plot of Income vs. Age; points labeled "Buy" or "Not interested"; axis ticks at 50K (Income) and 45 (Age)]
Logistic Regression
p(buy|37,78000) = 0.48
p(+|x) = 1 / (1 + exp(-(b0 + b·x)))
P(Buy|Age,Income)
Income Age Buy
123,000 30 yes
51,100 40 yes
68,000 55 no
74,000 46 no
23,000 47 yes
100,000 49 no
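A minimal sketch of how a conditional probability such as P(Buy|Age,Income) could be estimated from the six rows above. It assumes plain batch gradient descent on standardized features; the learning rate, the iteration count, and all function names are illustrative choices, not taken from the slides. With only six rows, the result need not match the 0.48 shown on the slide.

```python
import math

# Rows from the table above: (income, age, bought)
rows = [
    (123_000, 30, 1),
    (51_100, 40, 1),
    (68_000, 55, 0),
    (74_000, 46, 0),
    (23_000, 47, 1),
    (100_000, 49, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_std(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

inc_mean, inc_std = mean_std([r[0] for r in rows])
age_mean, age_std = mean_std([r[1] for r in rows])

def features(age, income):
    # Standardize so both features are on a comparable scale.
    return [(age - age_mean) / age_std, (income - inc_mean) / inc_std]

X = [features(age, income) for income, age, _ in rows]
y = [bought for _, _, bought in rows]

def score(w, b, x):
    return w[0] * x[0] + w[1] * x[1] + b

def loss(w, b):
    # Mean logistic loss, written as log(1 + exp(-margin)) for stability.
    total = 0.0
    for x, t in zip(X, y):
        margin = score(w, b, x) if t == 1 else -score(w, b, x)
        total += math.log1p(math.exp(-margin))
    return total / len(X)

w, b = [0.0, 0.0], 0.0
learning_rate = 0.1   # assumption: hand-picked step size
for _ in range(1000):  # assumption: fixed iteration budget
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, t in zip(X, y):
        err = sigmoid(score(w, b, x)) - t
        grad_w[0] += err * x[0]
        grad_w[1] += err * x[1]
        grad_b += err
    w = [w[j] - learning_rate * grad_w[j] / len(X) for j in range(2)]
    b -= learning_rate * grad_b / len(X)

def p_buy(age, income):
    return sigmoid(score(w, b, features(age, income)))

print(f"p(buy|37,78000) estimate: {p_buy(37, 78_000):.2f}")
```

In practice a library implementation with regularization would be the more usual choice; the loop is spelled out here only to show what is being optimized.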
Data for Predictive Modeling
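[Diagram: matrix of Examples (rows) by Features (columns) with a Target column: yes, yes, no, no, yes, no; the target of a new example is unknown (?)]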
Rules for Predictive Modeling
[Diagram: matrix of Examples (rows) by Features (columns) with a Target column]
Data should be:
Large enough
Independent and identically distributed (IID)
Wallet is NEVER observed
[Figure labels: IBM Sales to this Company; Company Revenue (D&B); Wallet/Opportunity. Annotations: "We observe this in the data" and "But we do not observe this"]
How can we make this a predictive modeling problem?
REALISTIC Wallets as quantiles
Motivation
Imagine 100 identical firms with identical IT needs
Consider the distribution of the IBM sales to these firms
Bottom firms should spend as much as the top
Define wallet as high percentile of spending conditional on the customer attributes
[Figure: histogram of Frequency vs. IBM Sales, with the Wallet Estimate marked]
Quantile Regression optimizing weighted absolute loss
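A small sketch of the weighted absolute loss (often called pinball loss) that quantile regression optimizes. It uses the "100 identical firms" thought experiment: because the firms share the same attributes, the model reduces to a single constant, and the constant that minimizes the loss is a sample quantile. The sales figures and the choice q = 0.8 are hypothetical, not from the slides.

```python
def pinball_loss(y_true, y_pred, q):
    """Weighted absolute loss: under-predictions cost q per unit,
    over-predictions cost (1 - q) per unit."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

# Hypothetical IBM sales to 100 firms with identical attributes (arbitrary units).
sales = list(range(1, 101))
q = 0.8  # assumption: which "high percentile" defines the wallet

def total_loss(candidate):
    return sum(pinball_loss(y, candidate, q) for y in sales)

# Search the observed values for the constant with the lowest total loss.
wallet_estimate = min(sales, key=total_loss)
print(wallet_estimate)
```

The minimizer lands at about the 80th percentile of the sample; with q = 0.5 the same search would return a median. For firms with differing attributes, the constant would be replaced by a model of the features trained under the same loss.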
[Figure, repeated over several slides: two panels plotting IBM Revenue against Firm Sales and against Company Sales; companies C1 and C2 are marked, with "Opportunity for C1" and "Opportunity for C2" indicated]
© IBM Corporation 2008
Siemens: Computer-Aided Detection of Breast Cancer in Mammograms
1712 Patients 6816 Images
105,000 Candidates
[x1, x2, …, x117] Image feature vector
Malignant?
[Figure: mammogram views labeled MLO, CC, MLO, CC]
Siemens Medical: fMRI breast cancer data
245 Patients: 36% Cancer
414 Patients: 1% Cancer
1027 Patients: 0% Cancer
18 Patients: 85% Cancer
[Figure: Model score vs. Log of Patient ID; every point is a candidate]
In essence, the most predictive variable is the patient ID
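The arithmetic behind this observation can be sketched from the four patient-ID ranges listed above. The sketch assumes the percentages are patient-level cancer rates within each ID range; the variable names are illustrative.

```python
# (number of patients, cancer rate) for the four patient-ID ranges on the slide
id_groups = [(245, 0.36), (414, 0.01), (1027, 0.00), (18, 0.85)]

total_patients = sum(n for n, _ in id_groups)
overall_rate = sum(n * rate for n, rate in id_groups) / total_patients

print(f"overall cancer rate: {overall_rate:.3f}")
for n, rate in id_groups:
    # How far each ID range sits from the overall base rate.
    lift = rate / overall_rate
    print(f"{n:5d} patients: rate {rate:.2f}, {lift:.1f}x the overall rate")
```

Knowing only which ID range a patient falls in moves the estimated cancer rate from roughly 6% overall to anywhere between 0% and 85%, which is why a model can score well without using the image features at all.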
Data collection for post-view purchase conversion
[Figure: timeline labeled "Time" with a "Cohort of random prospects" and an unknown outcome (?)]
Multi-Armed Bandit: Exploration vs. exploitation
Show some random ads to learn a good model
Tradeoff between learning and using
Ignore the ad altogether …
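The tradeoff between learning and using can be illustrated with an epsilon-greedy policy, one simple bandit strategy (the slides do not say which strategy was used). The ad conversion rates, epsilon, and the function name below are hypothetical.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate showing ads: explore with probability epsilon, otherwise
    exploit the ad with the best observed conversion rate so far."""
    rng = random.Random(seed)
    n_ads = len(true_rates)
    shows = [0] * n_ads
    conversions = [0] * n_ads
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: show a random ad (also used until every ad was shown once).
            ad = rng.randrange(n_ads)
        else:
            # Exploit: show the ad with the highest observed rate.
            ad = max(range(n_ads), key=lambda a: conversions[a] / shows[a])
        shows[ad] += 1
        if rng.random() < true_rates[ad]:
            conversions[ad] += 1
    return shows, conversions

# Hypothetical true conversion rates for three ads.
shows, conversions = epsilon_greedy([0.02, 0.05, 0.10])
print(shows, conversions)
```

A larger epsilon spends more impressions on random ads (better data for learning), while a smaller one spends more on the current best guess (more use of what was learned).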
Very few luxury cars are bought online
Maserati $128,0000
Predict other indicators: search, brand site visit, or scheduling a test drive
[Diagram: matrix of Examples (rows) by Features (columns) with Target column "Site Visit": no, no, no, yes, yes, yes]
Measuring impact and causal effect?
A/B Testing
You only get an aggregate answer, not a label per example
Build two separate models and calculate impact as the difference
[Diagram: two matrices of Examples by Features, "Examples 1: seen ad" and "Examples 2: not seen ad", each with Target column "Site Visit": yes, no, no, yes, no, no]
Expected Impact: p(SV|Ad) - p(SV|no ad)
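A toy sketch of the two-model approach: fit one model on prospects who saw the ad, another on prospects who did not, and take the difference of their predictions. The "model" here is just a per-segment site-visit rate, standing in for whatever classifier would actually be used; the segments and data are hypothetical.

```python
from collections import defaultdict

def fit_rate_model(examples):
    """Toy 'model': observed site-visit rate for each feature value."""
    visits, counts = defaultdict(int), defaultdict(int)
    for feature, site_visit in examples:
        counts[feature] += 1
        visits[feature] += site_visit
    return {feature: visits[feature] / counts[feature] for feature in counts}

# Hypothetical examples: (segment, site_visit as 0 or 1)
seen_ad = [("A", 1), ("A", 0), ("B", 1), ("B", 1)]
not_seen_ad = [("A", 0), ("A", 0), ("B", 1), ("B", 1)]

p_sv_ad = fit_rate_model(seen_ad)         # estimates p(SV|Ad)
p_sv_no_ad = fit_rate_model(not_seen_ad)  # estimates p(SV|no ad)

def expected_impact(feature):
    return p_sv_ad[feature] - p_sv_no_ad[feature]

for segment in ("A", "B"):
    print(segment, expected_impact(segment))
```

Unlike a single A/B test result, the difference is available per example: here segment A shows an impact while segment B visits the site at the same rate with or without the ad.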
Telling the difference between an algorithm and a human
Turing test, CAPTCHA
Pleasing the advertising oracle …
Audience reports from matched populations in Facebook
68% of the ads were shown to females
Makeup for 32% of ads
[Figure label: The Oracle]
Data for Audience Optimization
[Diagram: matrix of Examples (rows) by Features (columns) with Target column "Gender": male, female, female, male, male, female]
Weighted Logistic Regression on aggregated
[Diagram: matrix of Examples (rows) by Features (columns) with Weight and Target (Gender) columns]
Weight Gender
0.32 male
0.68 female
0.32 male
0.68 female
0.73 male
0.27 female
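One reading of the table above is that each aggregate audience report is expanded into two training rows, one per gender, weighted by the reported share. The sketch below shows that expansion and the weighted log loss a weighted logistic regression would minimize. The campaign features and function names are hypothetical; only the shares come from the table.

```python
import math

def expand_report(features, female_share):
    """Turn one aggregate report into two weighted training rows."""
    return [
        (features, "male", 1.0 - female_share),
        (features, "female", female_share),
    ]

def weighted_log_loss(rows, p_female):
    """rows: (features, gender, weight); p_female maps features to P(female)."""
    total = 0.0
    for features, gender, weight in rows:
        p = p_female(features)
        total -= weight * math.log(p if gender == "female" else 1.0 - p)
    return total

# Hypothetical campaign features paired with the female shares from the table.
reports = [
    ({"campaign": "c1"}, 0.68),
    ({"campaign": "c2"}, 0.68),
    ({"campaign": "c3"}, 0.27),
]
rows = [row for features, share in reports for row in expand_report(features, share)]

for features, gender, weight in rows:
    print(f"{weight:.2f} {gender} {features}")
```

For a single report, this loss is smallest when the predicted probability equals the reported share, so the model is trained to reproduce the oracle's aggregate answers without ever seeing an individual label.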
Catalan traditions pop up everywhere …
Data for Location Reliability in Auction
[Diagram: matrix of Examples (rows) by Features (columns) with Target column "Reliable?": maybe, no, no, maybe, maybe, no]