cheap learning-dunning-9-18-2015
TRANSCRIPT
© 2014 MapR Technologies
Cheap Learning Complements Deep Learning
Ted Dunning
Me, Us
• Ted Dunning, Chief Application Architect, MapR
  – Committer PMC member Zookeeper, Drill
  – VP Incubator
  – Bought the beer at the first HUG
• MapR
  – Distributes more open source components for Hadoop
  – Adds major technology for performance, HA, industry standard API’s
• Info
  – Hash tag - #mapr #mlconfatl
  – See also - @ApacheDrill, @ted_dunning and @mapR
Agenda
• Rationale
• Why cheap isn't the same as simple-minded
• Some techniques
• Examples
Why is cheap better than deep (sometimes)
Greenfield problems can be
  – Easy (large number of these)
  – Impossible (large number of these)
  – Hard but possible (right on the boundary)
Mature problems can be
  – Easy (these are already done)
  – Impossible (still a large number of these)
  – Hard but possible (now the majority of the effort)
Most data isn’t worth much in isolation
First data is valuable
Later data is dregs
Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value
If we can handle the scale
It’s really big
With great scale comes great opportunity
• Increasing scale by 1000x changes the game
• We essentially have green fields opening up all around
• Most of the opportunities don’t require advanced learning
A simple example - security monitoring
• “Small” data
  – Capture IDS logs
  – Detect what you already know
• “Big” data
  – Capture switch, server, firewall logs as well
  – New patterns emerge immediately
Another example – fraud detection
• “Small” data
  – Maintain card profiles
  – Segment models
  – Evaluate all transactions
• “Big” data
  – Maintain card profiles, full 90 day transaction history
  – Per user hierarchical models
  – Evaluate all transactions
Easy != Stupid
• You still have to do things reasonably well
  – Techniques that are not well founded are still problems
• Heuristic frequency ratios still fail
  – Coincidences still dominate the data
  – Accidental 100% correlations abound
• Related techniques still broken for coincidence
  – Pearson’s χ²
  – Simple correlations
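One way to see why Pearson's χ² struggles with coincidences is the sketch below. The function name and the example counts are illustrative and not taken from the talk. It scores a single accidental co-occurrence (A and B seen together once, never apart) against backgrounds of different sizes, using the standard shortcut formula for a 2 x 2 table.

```python
def pearson_chi2(k11, k12, k21, k22):
    """Pearson's chi-squared statistic for a 2x2 table of counts.

    Assumes every row and column total is non-zero.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return n * (k11 * k22 - k12 * k21) ** 2 / (row1 * row2 * col1 * col2)


# One accidental co-occurrence, with a growing pile of unrelated events.
for background in (2, 10_000, 1_000_000):
    print(background, pearson_chi2(1, 0, 0, background))
```

When both off-diagonal cells are zero the statistic equals the total count, so a single coincidence looks more "significant" the more unrelated data is collected.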
Blast from the past
Scale does not cure wrong
It just makes easy more common
A core technique
• Many of these easy problems reduce to finding interesting coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables

         A      Other
B        k11    k12
Other    k21    k22
How do you do that?
• This is well handled using the G-test
  – See Wikipedia
  – See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in Elasticsearch, in Solr, in Mahout
• Available in R, C, Java, Python
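For reference, the G-test statistic for a 2 x 2 table of counts can be written as below. This is the standard textbook form rather than something shown on the slide; N and the row and column totals are notation introduced here, and k11 through k22 are the four cell counts.

```latex
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} k_{ij} \,
      \ln \frac{k_{ij}\, N}{k_{i\cdot}\, k_{\cdot j}},
\qquad
N = \sum_{i,j} k_{ij}, \quad
k_{i\cdot} = \sum_{j} k_{ij}, \quad
k_{\cdot j} = \sum_{i} k_{ij}
```

Cells with a zero count contribute nothing to the sum.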
Which one is the anomalous co-occurrence?

         A      not A
B        13     1000
not B    1000   100,000

         A      not A
B        1      0
not B    0      10,000

         A      not A
B        10     0
not B    0      100,000

         A      not A
B        1      0
not B    0      2
Which one is the anomalous co-occurrence?

         A      not A
B        13     1000
not B    1000   100,000
Score: 0.90

         A      not A
B        1      0
not B    0      10,000
Score: 4.52

         A      not A
B        10     0
not B    0      100,000
Score: 14.3

         A      not A
B        1      0
not B    0      2
Score: 1.95

Dunning, Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)
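The four scores on this slide (0.90, 1.95, 4.52, 14.3) appear consistent with the square root of the G-test statistic for each table. That reading is an assumption, since the slide does not name the score. A minimal sketch that checks it, with illustrative names (`g2`, `tables`) that do not come from the talk:

```python
import math


def g2(k11, k12, k21, k22):
    """G-test (log-likelihood ratio) statistic for a 2x2 table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return max(0.0, 2.0 * total)  # clamp tiny negative round-off


# The four tables from the slide, as (k11, k12, k21, k22).
tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
scores = [math.sqrt(g2(*t)) for t in tables]
for t, s in zip(tables, scores):
    print(t, round(s, 2))
```

Under this reading, ten co-occurrences against a large background score far higher than a single one, and a single co-occurrence in a tiny sample scores low, which is the behaviour a coincidence detector needs.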
So we can find interesting coincidence
and that gets us exactly what?
Cooccurrence Analysis
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
  – “hombres de paco” times 400
  – not much else
• Recommendation based search:
  – Flamenco guitar and dancers
  – Spanish and classical guitar
  – Van Halen doing a classical/flamenco riff
Real-life example
Any other domains?
Document classification
Language identification
OK … Works for language
Anything else?
Species identification
Anything useful?
Like, to do with money?
Common Point of Compromise
• Scenario:
  – Merchant 0 is compromised, leaks account data during compromise
  – Fraud committed elsewhere during exploit
  – High background level of fraud
  – Limited detection rate for exploits
• Goal:
  – Find merchant 0
• Meta-goal:
  – Screen algorithms for this task without leaking sensitive data
Simulation Setup
Simulation Strategy
• For each consumer
  – Pick consumer parameters such as transaction rate, preferences
  – Generate transactions until end of sim-time
    • If merchant 0 during compromise time, possibly mark as compromised
    • For all transactions, possibly mark as fraud, probability depends on history
    • Merchants are selected using hierarchical Pitman-Yor
• Restate data
  – Flatten transaction streams
  – Sort by time
• Tunables
  – Compromise probability, background fraud, detection probability
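The strategy above can be sketched roughly as follows. This is a simplified illustration, not the code behind the talk: it uses a single flat Pitman-Yor sampler shared by all consumers instead of a hierarchical one, reduces "history" to a per-consumer compromised flag, and every function name, parameter name and default value is hypothetical.

```python
import random


def pitman_yor_sampler(alpha, discount, rng):
    """Return a draw() function for a flat Pitman-Yor process over merchant ids."""
    counts = []  # counts[m] = number of times merchant m has been drawn

    def draw():
        n, k = sum(counts), len(counts)
        u = rng.random() * (n + alpha)
        new_mass = alpha + discount * k  # weight of opening a new merchant
        if u < new_mass:
            counts.append(1)
            return k
        u -= new_mass
        for m, c in enumerate(counts):
            u -= c - discount  # weight of existing merchant m
            if u < 0:
                counts[m] += 1
                return m
        counts[-1] += 1  # guard against floating-point round-off
        return k - 1

    return draw


def simulate(n_consumers=200, sim_days=100.0, compromise_window=(30.0, 40.0),
             p_compromise=0.5, p_background_fraud=0.001,
             p_exploit_fraud=0.05, p_detect=0.8, seed=1):
    """Return (time, consumer, merchant, fraud, detected) tuples sorted by time."""
    rng = random.Random(seed)
    draw_merchant = pitman_yor_sampler(alpha=5.0, discount=0.5, rng=rng)
    start, end = compromise_window
    transactions = []
    for consumer in range(n_consumers):
        rate = rng.uniform(0.2, 2.0)  # consumer parameter: transactions per day
        compromised = False
        t = rng.expovariate(rate)
        while t < sim_days:
            merchant = draw_merchant()
            if merchant == 0 and start <= t < end and rng.random() < p_compromise:
                compromised = True
            # Fraud probability depends on history (here: prior compromise).
            p_fraud = p_exploit_fraud if compromised else p_background_fraud
            fraud = rng.random() < p_fraud
            detected = fraud and rng.random() < p_detect
            transactions.append((t, consumer, merchant, fraud, detected))
            t += rng.expovariate(rate)
    transactions.sort()  # flatten the per-consumer streams and order by time
    return transactions
```

The first draw always creates merchant 0, which therefore tends to be a popular merchant. The three tunables named on the slide map to `p_compromise`, `p_background_fraud` and `p_detect`.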
But that isn’t very realistic!
• No details of the fraud
• No details of the fraudsters
• No details on the transactions
• No details on the models
• How can this be any good at all?
Secure Development is Hard
Outside collaborators are outside the security perimeter
They can’t see the data and they can’t tune new algorithms to fit reality
How To Make Realistic Data
Parametric Simulation
Parametric matching of failure signatures allows emulation of complex data properties
Matching on KPI’s and failure modes guarantees practical fidelity
Performance Indicators to Match
• User and merchant population
• Transaction count/consumer
• Merchant propensity skew
• Level of detected fraud
• Spectrum of meta-model scores
So how does it work in practice?
Really truly bad guys
Summary
• We live in a golden age of newly achieved scale
• That scale has lowered the tree
  – Hard problems are much easier
  – Lots of low-hanging fruit all around us
• Cheap learning has huge value
• Code available at http://github.com/tdunning