Finding Changes in Real Data

Upload: ted-dunning

Post on 22-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

Page 1: Finding Changes in Real Data

© 2017 MapR Technologies 1

Detecting Change

Page 2: Finding Changes in Real Data

Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board member, Apache Software Foundation

O’Reilly author

Email [email protected] [email protected]

Twitter @ted_dunning

Page 3: Finding Changes in Real Data

Who We Are

• MapR Technologies

– We make a kick-ass platform for big data computing

– Support many workloads including Hadoop / Spark / HPC / Other

– Extended to allow streams and tables in basic platform

– Free for academic research / training

• Apache Software Foundation

– Culture hub for building open source communities

– Shared values around openness for contribution as well as use

– Many major projects are part of Apache

– Even more minor ones!

Page 4: Finding Changes in Real Data

Basic Outline

• Goal Setting

• Basic Ideas

– LLR (finding changes in counts)

– Poisson rate change detection (finding changes in event timing)

– Distribution estimation / visualization

– Labeled events and adding labels

• Free Improvisation on Themes

Page 7: Finding Changes in Real Data

Why Is This Practically Important

• The novice came to the master and said “something is broken”

• The master replied “What has changed?”

• And the student was enlightened

Page 10: Finding Changes in Real Data

The Second Student

• Another student said to the master, “I see something has changed … something may have broken”

• The master replied, “You have no question to ask. You have no need of enlightenment”

• And thus the student was enlightened

Page 11: Finding Changes in Real Data

• There are some very powerful techniques available, some only very recently, that can make the detection of change much easier than you might think. I will describe the practical use of several of these techniques including t-digest, non-linear histograms, variable rate Poisson models and combinations of these.

Page 12: Finding Changes in Real Data

Comparing Counts

• Suppose we have two situations A and B, each with many observations, nA and nB

• And some event x occurred n1A times in A and n1B times in B

          x       other
    A     n1A     nA - n1A
    B     n1B     nB - n1B

Page 13: Finding Changes in Real Data

Comparing Counts

• Have we seen a change in the frequency of x?

• Frequency ratios?

– Breaks with small counts

• χ² test?

– Breaks with small counts

Page 14: Finding Changes in Real Data

Log-Likelihood Ratio Test (Root LLR)

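• In R

    # Entropy of the counts k, scaled by the total count (N * H)
    entropy = function(k) {
      -sum(k*log((k==0)+(k/sum(k))))
    }

    # Log-likelihood ratio statistic for a contingency table k
    llr = function(k) {
      (entropy(rowSums(k))+entropy(colSums(k))-entropy(k))*2
    }

• Like mutual information × 2N, where N is the total count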

Page 15: Finding Changes in Real Data

Spot the Anomaly

• Root LLR is roughly like standard deviations

              A       not A
    B         13      1000
    not B     1000    100,000       root LLR = 0.89

              A       not A
    B         1       0
    not B     0       2             root LLR = 1.95

              A       not A
    B         1       0
    not B     0       10,000        root LLR = 4.51

              A       not A
    B         10      0
    not B     0       100,000       root LLR = 14.29
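The scores quoted for these tables can be checked with a short R sketch. It repeats the `entropy` and `llr` definitions so that it runs on its own; the helper name `root_llr` is an assumption introduced here for illustration, and it returns the unsigned root.

```r
# Entropy of the counts k, scaled by the total count; (k == 0) avoids log(0).
entropy <- function(k) {
  -sum(k * log((k == 0) + (k / sum(k))))
}

# Log-likelihood ratio statistic for a contingency table of counts.
llr <- function(k) {
  2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))
}

# Hypothetical helper: unsigned root LLR, clamped at zero against round-off.
root_llr <- function(k) sqrt(max(llr(k), 0))

# The four 2x2 tables from the slide, row by row.
tables <- list(
  matrix(c(13, 1000, 1000, 100000), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 2), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 10000), nrow = 2, byrow = TRUE),
  matrix(c(10, 0, 0, 100000), nrow = 2, byrow = TRUE)
)

scores <- sapply(tables, root_llr)
print(round(scores, 2))
```

Note that the first table has by far the largest counts yet the smallest score, while the last table scores highest with only ten co-occurrences.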

Page 16: Finding Changes in Real Data

How Does it Work

Empirical fit to asymptotic distribution is very good

Page 17: Finding Changes in Real Data

How Does it Work?

Page 18: Finding Changes in Real Data

OK, we can detect changes in counts

Page 19: Finding Changes in Real Data

Real-life Example

• Query: “Paco de Lucia”

• Conventional meta-data search results:

– “hombres de paco” times 400

– not much else

• Recommendation based search:

– Flamenco guitar and dancers

– Spanish and classical guitar

– Van Halen doing a classical/flamenco riff

Page 20: Finding Changes in Real Data

Real-life Example

Page 21: Finding Changes in Real Data

Example 2 - Common Point of Compromise

• Scenario:

– Merchant 0 is compromised, leaks account data during compromise

– Fraud committed elsewhere during exploit

– High background level of fraud

– Limited detection rate for exploits

• Goal:

– Find merchant 0

• Meta-goal:

– Screen algorithms for this task without leaking sensitive data

Page 22: Finding Changes in Real Data

Example 2 - Common Point of Compromise

[Diagram: Merchant 0 → (skim) → skimmed data → (exploit) → Merchant n]

Card data is stolen from Merchant 0.

That data is used in frauds at other merchants.

Page 23: Finding Changes in Real Data

Simulation Setup

[Figure: daily counts of compromises and frauds versus day (0 to 100), with the compromise period and the exploit period marked]

Page 24: Finding Changes in Real Data

Detection Strategy

• Select histories that precede non-fraud

• And histories that precede fraud detection

• Analyze 2x2 cooccurrence of merchant n versus fraud detection
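The strategy above can be sketched in R on synthetic data. Everything in the simulation is an assumption made for illustration: the number of cards and merchants, the visit probability, the background fraud rate of 1%, the elevated rate of 20% for cards seen at the compromised merchant, and the use of index 1 to stand in for "merchant 0".

```r
entropy <- function(k) -sum(k * log((k == 0) + (k / sum(k))))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

set.seed(42)
n_cards <- 5000
n_merchants <- 50
bad <- 1   # index of the compromised merchant (hypothetical)

# visits[i, j] is TRUE when merchant j appears in the history of card i.
visits <- matrix(runif(n_cards * n_merchants) < 0.1, n_cards, n_merchants)

# High background fraud everywhere, more for cards exposed at the bad merchant.
p_fraud <- ifelse(visits[, bad], 0.20, 0.01)
fraud <- runif(n_cards) < p_fraud

# One 2x2 table per merchant: (merchant in history?) versus (fraud detected?).
score <- sapply(seq_len(n_merchants), function(j) {
  v <- visits[, j]
  k <- matrix(c(sum(v & fraud),  sum(v & !fraud),
                sum(!v & fraud), sum(!v & !fraud)),
              nrow = 2, byrow = TRUE)
  llr(k)
})

suspect <- which.max(score)
print(suspect)
print(round(sort(score, decreasing = TRUE)[1:3], 1))
```

Because only synthetic histories are used, a screen like this can be run without exposing any sensitive data.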

Page 25: Finding Changes in Real Data

Page 26: Finding Changes in Real Data

What about the real world?

Page 27: Finding Changes in Real Data

[Figure: LLR score for real data. x-axis: number of merchants (log scale, 10^0 to 10^6); y-axis: breach score (LLR), 0 to 80. The few highest-scoring merchants are labeled "Really truly bad guys".]

Page 28: Finding Changes in Real Data

What about time?

Page 29: Finding Changes in Real Data

Finding Changes in Timing

• Suppose our input is events embedded in time

• Suppose we want to find changes in our input in real-time

• Waiting and counting is fine if we don’t have to react now

• We can do much better

Page 30: Finding Changes in Real Data

Poisson Event Rate Change

• Detection of fallout

– Time since last is very sensitive for complete failure

• Detection of change relative to reference

– Time since n-th most recent

– LLR with time

• Have to trade off detection speed against false positive rate and size of change

• Can run multiple detectors at once

Page 31: Finding Changes in Real Data

Basic idea: time intervals are better than counts

Page 32: Finding Changes in Real Data

Sporadic Events: Finding Normal and Anomalous Patterns

• Time between events is much more usable than absolute times

• Counts don’t link as directly to probability models

• Time interval is log ρ

• This is a big deal

Page 33: Finding Changes in Real Data

Event Stream (timing)

• Events of various types arrive at irregular intervals

– we can assume Poisson distribution

• The key question is whether frequency has changed relative to expected values

– This shows up as a change in interval

• Want alert as soon as possible

Page 34: Finding Changes in Real Data

Converting Event Times to Anomaly

[Figure: event times converted to an anomaly score, with 99.9%-ile and 99.99%-ile thresholds marked]

Page 35: Finding Changes in Real Data

In the real world, event rates often vary

Page 36: Finding Changes in Real Data

Time Intervals Are Key to Modeling Sporadic Events

[Figure: inter-event interval dt (min) versus t (days), days 0 to 4]

Page 38: Finding Changes in Real Data

Poisson Distribution

• Time between events is exponentially distributed

• This means that long delays are exponentially rare

• If we know λ we can select a good threshold

– or we can pick a threshold empirically

$$\Delta t \sim \lambda e^{-\lambda t}$$

$$P(\Delta t > T) = e^{-\lambda T}$$

$$-\log P(\Delta t > T) = \lambda T$$
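Inverting the tail probability gives the threshold directly: a false-alarm probability p corresponds to T = -log(p) / λ. A minimal R sketch, where the rate, the tolerated false-alarm probability and the observed gap are all made-up values:

```r
lambda <- 2.5        # assumed known event rate (events per minute)
p_false <- 0.001     # tolerated false-alarm probability per interval

# Solve exp(-lambda * T) = p_false for T.
threshold <- -log(p_false) / lambda

# The same value from the built-in exponential quantile function.
check <- qexp(1 - p_false, rate = lambda)

# Anomaly score for an observed gap: -log P(dt > gap) = lambda * gap.
gap <- 4.0
score <- lambda * gap
alarm <- gap > threshold

print(c(threshold = threshold, score = score))
print(alarm)
```

When λ is not known, the empirical alternative on the slide amounts to replacing `qexp` with a quantile estimated from observed intervals.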

Page 39: Finding Changes in Real Data

After Rate Correction

[Figure: dt / rate versus t (days) after rate correction, days 0 to 4, with 99.9%-ile and 99.99%-ile thresholds marked]

Page 40: Finding Changes in Real Data

Detecting Anomalies in Sporadic Events

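[Diagram: incoming events with times $t_i$ pass through an n-th order difference (Δn). A rate predictor, driven by rate history, supplies $\lambda_t$, which scales the interval to give $\delta = \lambda (t_i - t_{i-n})$. A t-digest tracks the 99.97%-ile of $\delta$ as the threshold $t$, and an alarm is raised when $\delta > t$.]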
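The pipeline in this diagram can be sketched in a few lines of R. This is an illustration under stated assumptions, not the implementation behind the slides: the function name `detect` and its parameters are invented here, the predicted rate is simply passed in as a vector, and base R `quantile()` over the history stands in for the t-digest.

```r
# delta = lambda * (t_i - t_{i-n}); alarm when delta exceeds a high quantile
# of the previously seen delta values.
detect <- function(times, rate, n = 1, q = 0.9997, warmup = 100) {
  i <- seq(n + 1, length(times))
  delta <- rate[i] * (times[i] - times[i - n])
  alarm <- logical(length(delta))
  for (j in seq_along(delta)) {
    if (j > warmup) {
      # Stand-in for a t-digest quantile query on the history so far.
      threshold <- quantile(delta[1:(j - 1)], q, names = FALSE)
      alarm[j] <- delta[j] > threshold
    }
  }
  data.frame(t = times[i], delta = delta, alarm = alarm)
}

# Synthetic stream: Poisson events at 5 per unit time, with one long outage.
set.seed(1)
gaps <- rexp(2000, rate = 5)
gaps[1500] <- 10
times <- cumsum(gaps)
rate <- rep(5, length(times))   # a constant "predicted" rate for this example

res <- detect(times, rate)
print(which(res$alarm))
```

Recomputing the quantile from scratch at every event is quadratic in the stream length, which is exactly the cost a streaming quantile sketch avoids.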

Page 42: Finding Changes in Real Data

Seasonality Poses a Challenge

[Figure: Christmas Traffic. Hits / 1000 versus date, Nov 17 to Dec 27]

Page 43: Finding Changes in Real Data

Something more is needed …

[Figure: Christmas Traffic. Hits / 1000 versus date, Nov 17 to Dec 27]

Page 44: Finding Changes in Real Data

We need a better rate predictor…

[Diagram: the same detector pipeline as on the "Detecting Anomalies in Sporadic Events" slide: incoming events, Δn, rate predictor with rate history, t-digest 99.97%-ile threshold, alarm]

Page 47: Finding Changes in Real Data

Idea: Predict log(rate) from lagged log(rate)

• Predict log because

– Peak to valley ratio

– Traffic grew by 30 %

– All rates are positive

– Just because I said so

• Let model see many lagged values

• Use L1 regularized linear model to pick important historical values

– We would have moved to something fancier if this hadn’t worked
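A rough R sketch of this idea follows. It is not the model from the talk: the hourly data are synthetic, the lag count and penalty are arbitrary, and `soft` and `lasso_cd` are hypothetical helpers that implement a plain coordinate-descent lasso so that the example runs in base R without extra packages.

```r
set.seed(7)
hours <- 1:(24 * 28)   # four weeks of synthetic hourly data with a daily cycle
log_rate <- 1 + 0.8 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 0.05)

max_lag <- 48
E <- embed(log_rate, max_lag + 1)   # each row: x[t], x[t-1], ..., x[t-max_lag]
y <- E[, 1]
X <- E[, -1]                        # columns are lags 1..max_lag

# Soft-thresholding operator used by the lasso coordinate update.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimises (1 / 2n) * ||y - b0 - X b||^2 + lambda * ||b||_1.
lasso_cd <- function(X, y, lambda, iters = 100) {
  n <- nrow(X)
  mu <- colMeans(X)
  Xc <- sweep(X, 2, mu)             # centre each lag column
  y_bar <- mean(y)
  r <- y - y_bar                    # residual when all coefficients are zero
  beta <- numeric(ncol(X))
  col_ss <- colSums(Xc^2) / n
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      r <- r + Xc[, j] * beta[j]    # take coordinate j out of the fit
      rho <- sum(Xc[, j] * r) / n
      beta[j] <- soft(rho, lambda) / col_ss[j]
      r <- r - Xc[, j] * beta[j]    # put the updated coordinate back
    }
  }
  list(intercept = y_bar - sum(mu * beta), beta = beta)
}

fit <- lasso_cd(X, y, lambda = 0.01)
pred <- drop(fit$intercept + X %*% fit$beta)   # predicted log(rate)
rate_hat <- exp(pred)                          # always positive by construction

print(which(fit$beta != 0))                    # lags the penalty kept
print(mean((y - pred)^2))
```

Predicting in log space makes multiplicative effects such as a peak-to-valley ratio or a 30% growth additive, and exponentiating the prediction can never yield a negative rate.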

Page 48: Finding Changes in Real Data

A New Rate Predictor for Sporadic Events

Page 49: Finding Changes in Real Data

Improved Prediction with Adaptive Modeling

[Figure: Christmas Prediction. Hits (x 1000) versus date, Dec 17 to Dec 29]

Page 50: Finding Changes in Real Data

Some days the magic works. Some days ... we use slightly different magic.

Page 51: Finding Changes in Real Data

Detecting More Subtle Changes

• Time-since-last finds complete failures well

• Nth order time finds more subtle rate changes

• But that subtlety delays detection of complete failure

– First order delay has 99.9% confidence at 6.5 units

– 10th order delay has 99.9% confidence at 12.5 units

• But 10th order delay can find speedups, first order cannot
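The trade-off can be explored with the standard result that, for a unit-rate Poisson process, the time since the last event is exponential and the time spanning n events is gamma distributed with shape n. The R sketch below computes the tail quantiles for both. How "delay" is counted (for example, from the last observed event rather than from the n-th most recent one) is an assumption that changes the numbers, so the values printed here are not expected to reproduce the figures quoted above.

```r
conf <- 0.999

# Upper tails: how long an interval must be before it signals a slowdown.
first_upper <- qexp(conf, rate = 1)
tenth_upper <- qgamma(conf, shape = 10, rate = 1)

# Lower tails: how short an interval must be before it signals a speedup.
first_lower <- qexp(1 - conf, rate = 1)
tenth_lower <- qgamma(1 - conf, shape = 10, rate = 1)

# Express the lower thresholds as a fraction of the mean interval.
first_ratio <- first_lower / 1
tenth_ratio <- tenth_lower / 10

print(c(first_upper = first_upper, tenth_upper = tenth_upper))
print(c(first_ratio = first_ratio, tenth_ratio = tenth_ratio))
```

A single interval has to be about a thousand times shorter than average to be significant at this level, which is why a first-order detector is of little use for speedups, while the 10th-order interval only has to shrink to a moderate fraction of its mean.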

Page 52: Finding Changes in Real Data

[Figure: 10th order difference of Poisson distribution]

Page 53: Finding Changes in Real Data

Finding Changes in Time Series

• So far, we only have times

• What about when we have times and measurements together?

– These are called time-series!

• First step can be to discretize the measurement

– Quintiles or deciles are good candidates

– Multi-scale discretization is a fine thing to do

• That gives us arrival times for measurements in each bin

– And this is amenable to the rate model on previous slides
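The discretization step can be sketched in R as follows. The series is synthetic and the variable names are assumptions; quintiles are used, as suggested above.

```r
set.seed(3)
times <- cumsum(rexp(1000, rate = 2))   # irregular sample times (synthetic)
value <- rlnorm(1000)                   # the measurement at each time

# Quintile boundaries estimated from the measurements themselves.
breaks <- quantile(value, probs = seq(0, 1, by = 0.2), names = FALSE)
bin <- cut(value, breaks = breaks, include.lowest = TRUE,
           labels = paste0("Q", 1:5))

# One event stream per bin: the arrival times of measurements in that bin.
arrivals <- split(times, bin)

# Inter-arrival intervals per bin, ready for an event-rate model.
intervals <- lapply(arrivals, diff)

print(sapply(arrivals, length))
print(round(sapply(intervals, mean), 2))
```

A shift in the distribution of the measurement then shows up as a rate change in one or more bins, for example the top quintile suddenly arriving more often.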

Page 54: Finding Changes in Real Data

Finding Changes in Time Series

• Comprehensive approaches also possible (for counts)

• Time aware variant of G-test is possible

vs

Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993)

http://bit.ly/surprise-and-coincidence

Page 55: Finding Changes in Real Data

Propagation Anomalies

• What happens when something shadows part of the coverage field for mobile telecom?

– Can happen in urban areas with a construction crane

• Can solve heuristically

– Subtract from reference image composed by long term averages

– Doesn’t deal well with weak signal regions and low S/N

• Can solve probabilistically

– Compute anomaly for each measurement, use mean of log(p)

Page 56: Finding Changes in Real Data

Page 57: Finding Changes in Real Data

Page 58: Finding Changes in Real Data

Variable Signal/Noise Makes Heuristic Tricky

Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.

Page 59: Finding Changes in Real Data

Other Issues

• Finding changes in coverage area is similarly tricky

• Coverage area is roughly where tower signal strength is higher than neighbors

• Except for fuzziness due to hand-off delays

• Except for bias due to large-scale caller motions

– Rush hour

– Event mobs

Page 60: Finding Changes in Real Data

Simple Answer for Propagation Anomalies

• Cluster signal strength reports

• Cluster locations using k-means, large k

• Model report rate anomaly using discrete event models

• Model signal strength anomaly using percentile model

• Trade larger k against higher report rates, faster detection

• Overall anomaly is sum of individual log(p) anomalies
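The last step, combining per-cluster scores, can be illustrated with a tiny R sketch. The cluster names, predicted report rates and observed gaps are invented for the example, and only the report-rate part is shown; a signal-strength term would add its own log(p) values to the same sum.

```r
# Hypothetical predicted report rates per cluster (reports per minute)
# and the time since the last report in each cluster (minutes).
rates <- c(a = 4.0, b = 1.5, c = 0.5)
gaps  <- c(a = 0.2, b = 3.0, c = 1.0)

# log P(no report for at least this long) under an exponential model.
log_p <- pexp(gaps, rate = rates, lower.tail = FALSE, log.p = TRUE)

# Per-cluster anomaly is -log(p); the overall anomaly is their sum.
anomaly <- -log_p
composite <- sum(anomaly)
worst <- unname(which.max(anomaly))

print(round(anomaly, 2))
print(composite)
```

Summing log-probabilities corresponds to treating the clusters as independent, so a modest silence in many clusters at once can add up to a strong overall signal.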

Page 61: Finding Changes in Real Data

Tower Coverage Areas

Page 62: Finding Changes in Real Data

Just One Tower

Page 63: Finding Changes in Real Data

Cluster Reports for That Tower

Page 64: Finding Changes in Real Data

Cluster Reports for That Tower

[Figure: reports for the tower grouped into clusters numbered 1 to 9]

Can also sub-divide each cluster into signal strength ranges.

Multiple scales of clustering can also be used to trade off geographic versus temporal resolution.

Page 65: Finding Changes in Real Data

Example

[Figure: three panels of inter-event interval dt, each on its own scale]

Each cluster gives us a sequence of events.

Individual anomaly scores can be scaled and added to get composite anomaly score.

Optimality of combined signal derives from optimality of components.

Page 66: Finding Changes in Real Data

Characterizing Distributions

• What about sequences of values from arbitrary distributions?

– Can we find changes in the distribution?

– For instance, what about latencies?

• Non-linear histogram - FloatHistogram

• Fully Adaptive histogram – t-digest

Page 68: Finding Changes in Real Data

FloatHistogram

• Assume all measurements are in the range

• Divide this range into power of 2 sub-ranges

• Sub-divide each sub-range evenly with steps

• Relative error is bounded in measurement space

• Bin index can be computed using FP representation!
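The binning scheme can be sketched in R. The range and the step count did not survive in the bullets above, so `min_value` and `steps` below are assumptions, as is the function name `float_bin`. R has no convenient access to the bits of a double, so the sketch obtains the power-of-2 sub-range with `log2` where a real implementation would read the exponent field.

```r
# Bin index = (power-of-2 sub-range) * steps + (linear step inside it).
float_bin <- function(x, min_value = 1, steps = 8) {
  stopifnot(all(x >= min_value))
  r <- x / min_value
  octave <- floor(log2(r))          # which power-of-2 sub-range x falls in
  frac <- r / 2^octave - 1          # position inside the sub-range, in [0, 1)
  octave * steps + floor(frac * steps)
}

# Lower edge of a bin, the inverse of the mapping above.
bin_lower <- function(index, min_value = 1, steps = 8) {
  octave <- index %/% steps
  min_value * 2^octave * (1 + (index %% steps) / steps)
}

x <- c(1, 1.5, 2, 3, 1000)
idx <- float_bin(x)
print(idx)
print(bin_lower(idx))
```

Every bin is at most 1/steps wide relative to its lower edge, which is the bounded relative error claimed above.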

Page 71: Finding Changes in Real Data

T-digest

• Or we can talk about small errors in q

• Accumulate samples, sort, merge

• Merge if k-size < 1

• Interpolate using centroids in x

• Very good near extremes, no dynamic allocation

[Figure: scale function k versus q, with q from 0 to 1 and k from 0 to 10]
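A much simplified R sketch of the sort-and-merge step follows. It is not the t-digest library: it makes a single pass over already collected samples, it grows vectors dynamically, and it assumes the arcsine scale function k(q) = (δ / 2π) asin(2q - 1) with an arbitrary δ = 100. The names `k_scale` and `merge_digest` are invented for the example.

```r
# Scale function: steep near q = 0 and q = 1, flat in the middle.
k_scale <- function(q, delta) delta / (2 * pi) * asin(2 * q - 1)

merge_digest <- function(x, delta = 100) {
  x <- sort(x)
  n <- length(x)
  means <- numeric(0)
  counts <- numeric(0)
  cur_sum <- x[1]
  cur_n <- 1
  seen <- 0                        # samples in centroids already closed
  for (i in seq_along(x)[-1]) {
    q_left <- seen / n
    q_right <- (seen + cur_n + 1) / n
    # Merge the next sample only while the centroid's k-size stays within 1.
    if (k_scale(q_right, delta) - k_scale(q_left, delta) <= 1) {
      cur_sum <- cur_sum + x[i]
      cur_n <- cur_n + 1
    } else {
      means <- c(means, cur_sum / cur_n)
      counts <- c(counts, cur_n)
      seen <- seen + cur_n
      cur_sum <- x[i]
      cur_n <- 1
    }
  }
  means <- c(means, cur_sum / cur_n)
  counts <- c(counts, cur_n)
  data.frame(mean = means, count = counts)
}

set.seed(5)
digest <- merge_digest(rexp(10000))
print(nrow(digest))
print(head(digest$count))
print(max(digest$count))
```

Centroids near q = 0 and q = 1 hold only a few samples while those in the middle hold hundreds, which is why quantile estimates are most accurate near the extremes.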

Page 72: Finding Changes in Real Data

Finding Change with Histograms

• With fixed bins, we can simply count and compare counts for different bins

• Thus, histogram change reduces to count change

• Or to changes in event times
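As an illustration of reducing histogram change to count change, the R sketch below bins two samples with the same fixed, non-linear bin edges and scores the resulting 2 x bins table with the LLR statistic defined earlier in the deck (repeated here so the example runs alone). The latency-like data, the bin edges and the helper `count_bins` are assumptions.

```r
entropy <- function(k) -sum(k * log((k == 0) + (k / sum(k))))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

# Fixed bin edges, doubling in width.
breaks <- c(0, 1, 2, 4, 8, 16, Inf)
count_bins <- function(x) as.vector(table(cut(x, breaks, right = FALSE)))

set.seed(11)
before     <- rexp(5000, rate = 1)     # reference period
after_same <- rexp(5000, rate = 1)     # nothing changed
after_slow <- rexp(5000, rate = 0.7)   # values got larger

# Rows are periods, columns are bins.
score_same <- llr(rbind(count_bins(before), count_bins(after_same)))
score_slow <- llr(rbind(count_bins(before), count_bins(after_slow)))

print(round(c(same = score_same, slow = score_slow), 1))
```

With no change the score behaves roughly like a chi-squared variable with a handful of degrees of freedom, while the shifted sample scores far higher.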

Page 73: Finding Changes in Real Data

Visualizing Histograms

• We want to detect small changes

– Consider log-scale for Y

• Non-linear bin spacing is really good for increasing counts

– Reweight by bin-width

– Changing x axis changes y axis

Page 74: Finding Changes in Real Data

Good Results

Page 75: Finding Changes in Real Data

Bad Results

Page 76: Finding Changes in Real Data

Bad Results

Page 77: Finding Changes in Real Data

With Better Scaling

Page 78: Finding Changes in Real Data

Bad Results

Page 79: Finding Changes in Real Data

Page 80: Finding Changes in Real Data

With FloatHistogram

Page 81: Finding Changes in Real Data

Summary

• Counts – LLR

• Events – Poisson + nth-order diffs

• Decimate in space

• Decimate in measurement space

– t-digest, FloatHistogram

• Don’t forget visualization

[Figures: the detector pipeline diagram (incoming events, Δn, rate predictor, t-digest 99.97%-ile threshold, alarm) and the t-digest scale function plot of k versus q]

Page 82: Finding Changes in Real Data

Q & A
