
Page 1: Optimal Bayesian Networks

1

Optimal Bayesian Networks

Advanced Topics in Computer Science Seminar Supervisor: Dr. Herman Maya

Author: Kreimer Andrew

Page 2: Optimal Bayesian Networks

2

Data Mining

› Massive amounts of data: terabytes to petabytes

› Data evolution

› Multidisciplinary field

› Data Warehouse

› OLAP & OLTP

› Preprocessing

› KDD – Knowledge Discovery in Databases

› One truth – a single consistent version of the data

Page 3: Optimal Bayesian Networks

3

Data Mining Methods

› Clustering
  – Bank clients: private or business

› Association Rules
  – YouTube suggestions, Amazon checkout suggestions

› Classification and Prediction
  – SPAM mail classification, FX trend predictions
  – Payment power prediction

› Integration
  – A clustered client gets a specific classification model

Page 4: Optimal Bayesian Networks

4

The Bayesian Approach

› Probability & Statistics
  – Instances – the classical approach
  – A priori/a posteriori knowledge – the Bayesian approach

› Bayes' Theorem
  – P(A|B) = P(B|A)P(A)/P(B)

› MAP – Maximum A Posteriori
  – choose the hypothesis with the highest posterior probability

Page 5: Optimal Bayesian Networks

5

Bayesian Classifier

› Describe a client by age and income

› P(X) – probability that a client aged 25 with an income of 5000 exists

› P(H) – probability that a client buys a guitar

› P(X|H) – probability of observing client X given that the client bought a guitar

› P(H|X) – probability of client X buying a guitar

› P(H|X) = P(X|H)P(H)/P(X)

› The naïve approach
  – assume the variables are independent

Page 6: Optimal Bayesian Networks

6

Naïve Bayes Classifier

› The optimal classifier is not practical

› Assumes the variables are independent

› The zero-probability problem (see the sketch below):
  – Laplace – add one dummy record per value
  – m-estimate – assume there are m virtual records
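As a rough illustration, here is a minimal Python sketch of both corrections. The counts are the NewsUs values among the EURUSD = up instances of the example on the next slide, plus a hypothetical value "neutral" that never occurs with this class (its raw estimate would be 0); this is a sketch of the idea, not WEKA's implementation:

```python
from collections import Counter

def laplace_estimate(counts, total, k):
    """Add-one (Laplace) smoothing over k possible attribute values."""
    return {v: (c + 1) / (total + k) for v, c in counts.items()}

def m_estimate(count, total, prior, m):
    """m-estimate: blend the observed frequency with a prior via m virtual records."""
    return (count + m * prior) / (total + m)

# "neutral" is a made-up value never seen with this class (raw estimate: 0).
counts = Counter({"bad": 1, "good": 5, "neutral": 0})

print(laplace_estimate(counts, total=6, k=3))        # no zero probabilities left
print(m_estimate(count=0, total=6, prior=1/3, m=3))  # ~0.11 instead of 0
```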

Page 7: Optimal Bayesian Networks

7

Classification example using Naïve Bayes Classifier

NewsEU   NewsUS   EuGDP   UsGDP   EURUSD (class)
bad      bad      up      down    up
bad      good     down    down    up
good     bad      up      up      down
good     good     up      up      up
bad      bad      down    up      down
good     bad      down    up      down
bad      good     up      down    up
bad      bad      up      down    down
good     good     up      up      up
bad      good     down    down    up

NewsEu, NewsUs ∈ {bad, good}

EuGDP, UsGDP, EURUSD (class) ∈ {up, down}

Let's try to classify trends in the FX market using several attributes: news in Europe, news in the US, GDP in Europe and GDP in the US. Each instance is a monthly measurement. The news attributes describe the general market temperament; the GDP attributes describe the change relative to the previous period.

Page 8: Optimal Bayesian Networks

8

Classification example using Naïve Bayes Classifier

› Let's classify a new instance:
  – X = (NewsEU = good, NewsUS = bad, EuGDP = up, UsGDP = up)

› We start with the priors:
  – P(EURUSD = up) = 6/10 = 0.6
  – P(EURUSD = down) = 4/10 = 0.4

› Then we calculate the conditional probabilities:
  – P(NewsEu = good | EURUSD = up) = 2/6 ≈ 0.33
  – P(NewsUs = bad | EURUSD = up) = 1/6 ≈ 0.16
  – etc.

Page 9: Optimal Bayesian Networks

9

Classification example using Naïve Bayes Classifier

› The classification:

› P(X | EURUSD = up) = P(NewsEu = good | EURUSD = up) * P(NewsUs = bad | EURUSD = up) * P(EuGDP = up | EURUSD = up) * P(UsGDP = up | EURUSD = up) = 0.33 * 0.16 * 0.66 * 0.33 ≈ 0.0115

› P(X | EURUSD = down) = P(NewsEu = good | EURUSD = down) * P(NewsUs = bad | EURUSD = down) * P(EuGDP = up | EURUSD = down) * P(UsGDP = up | EURUSD = down) = 0.5 * 1 * 0.5 * 0.75 = 0.1875

› Using MAP, each likelihood is multiplied by its class prior:

› max{P(X|EURUSD=up)P(EURUSD=up), P(X|EURUSD=down)P(EURUSD=down)} = max{0.0115 * 0.6, 0.1875 * 0.4} = max{0.0069, 0.075} = 0.075

› Conclusion: the trend is down, we should sell EURUSD (a sketch reproducing this computation follows).
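The whole computation fits in a few lines of Python. The sketch below reproduces the example from the table with exact fractions, which give P(X|up)·P(up) ≈ 0.0074 rather than the 0.0069 obtained from the truncated decimals; the winning class is unchanged:

```python
# Training data from the table above: (NewsEU, NewsUS, EuGDP, UsGDP, class).
data = [
    ("bad", "bad", "up", "down", "up"),
    ("bad", "good", "down", "down", "up"),
    ("good", "bad", "up", "up", "down"),
    ("good", "good", "up", "up", "up"),
    ("bad", "bad", "down", "up", "down"),
    ("good", "bad", "down", "up", "down"),
    ("bad", "good", "up", "down", "up"),
    ("bad", "bad", "up", "down", "down"),
    ("good", "good", "up", "up", "up"),
    ("bad", "good", "down", "down", "up"),
]

def classify(x):
    scores = {}
    for cls in ("up", "down"):
        rows = [r for r in data if r[-1] == cls]
        posterior = len(rows) / len(data)        # start with the prior P(class)
        for i, value in enumerate(x):            # naive independence assumption
            matches = sum(1 for r in rows if r[i] == value)
            posterior *= matches / len(rows)     # times P(attribute | class)
        scores[cls] = posterior                  # the MAP numerator
    return max(scores, key=scores.get), scores

print(classify(("good", "bad", "up", "up")))
# -> ('down', {'up': 0.0074..., 'down': 0.075})
```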

Page 10: Optimal Bayesian Networks

10

Bayesian Network

› Graphical probabilistic model

› Structure: a DAG (directed acyclic graph)

› A CPT (conditional probability table) for each attribute

› d-separated, d-connected nodes

› A -> D, D -> A

› P(C|A,B,D,E) = P(C|A,B,D)
  – E and C are d-separated

Page 11: Optimal Bayesian Networks

11

Probability Inference

› Probability calculation:

› Given A, B, C, D & E, calculate P(A, B, C, D, E) using the chain rule:
  – P(A, B, C, D, E) = ∏ P(X | Parents(X)), the product running over the network's nodes

› Given A, B, D & E, calculate C by using MAP:
  – choose the value of C that maximizes P(C | A, B, D, E)

› Given A, C, D & E, calculate B by using Bayes' Theorem (a numeric sketch follows):
  – P(A|B) = P(B|A) * P(A) / P(B)
  – P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A) (law of total probability)
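A numeric sketch of the last two formulas on an assumed two-node network A → B with made-up CPT values (the slide's actual network appears only in a figure, so the numbers here are illustrative):

```python
# A hypothetical two-node network A -> B with made-up CPT values, illustrating
# the total-probability and Bayes-theorem steps above.
P_A = 0.3                 # P(A)
P_B_given_A = 0.8         # P(B | A)
P_B_given_notA = 0.1      # P(B | ~A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B

print(P_B)          # 0.31
print(P_A_given_B)  # ~0.774
```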

Page 12: Optimal Bayesian Networks

12

Dynamic Bayesian Network

› Bayesian Network extension

› Time slice attributes relations

› Matrix of attributes and time slices

› Time series

› Cycles are allowed across time slices (see the unrolling sketch below)

[Figure: a network over attributes X1–X4 with edges between time slices, and a matrix with one column per attribute (Attribute 1, Attribute 2, …) and one row per time slice (Time 1 … Time n)]
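A minimal sketch of why such "cycles" stay legal: each attribute is replicated per time slice, so a cyclic dependency like X1 ↔ X2 becomes acyclic edges between consecutive slices of the unrolled network. The two-slice template and attribute names below are made up, not the slide's figure:

```python
# Inter-slice template edges of an assumed 2-slice DBN: parent at t-1 -> child at t.
# X1 <-> X2 is a "cycle" in the template but not in the unrolled DAG.
template_edges = [("X1", "X2"), ("X2", "X1"), ("X3", "X3")]
n_slices = 3

unrolled = [(f"{parent}@{t-1}", f"{child}@{t}")
            for t in range(1, n_slices)
            for parent, child in template_edges]

print(unrolled)
# [('X1@0', 'X2@1'), ('X2@0', 'X1@1'), ('X3@0', 'X3@1'),
#  ('X1@1', 'X2@2'), ('X2@1', 'X1@2'), ('X3@1', 'X3@2')]
```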

Page 13: Optimal Bayesian Networks

13

Bayesian Network Example

› Let’s try to predict trends in EURUSD

› Binary class variable: Up or Down

› Attributes: Open, High, Low, Close, MA100, MA200 (100- and 200-period moving averages)

› Class: ClassTrend

Page 14: Optimal Bayesian Networks

14

Bayesian Network Example

[Figures: a conditional probability table (CPT) and the learned network structure (BN)]

Page 15: Optimal Bayesian Networks

15

Bayesian Network Learning

› Structure is given by a field expert (Wish You Were Here)

› Structure learning – a computational barrier:
  – Super-exponential number of possible structures
  – Heuristics
  – Metrics for evaluating structures: local, global, d-separation

› Conditional Probability Tables (CPT) calculation

Page 16: Optimal Bayesian Networks

16

Bayesian Network Learning

› Attributes ordering:
  – Xi is a candidate parent of Xj iff Xi comes before Xj in the order
  – Possible parents come before the node in the order

› Structure
  – DAG

[Figure: the ordering X1, X3, X2 (left) and a structure over X1, X2, X3 (right)]

Page 17: Optimal Bayesian Networks

17

Network Scoring

› Structures are evaluated by scoring (global/local)

› Bayesian Dirichlet – BD

› BDeu – equivalent uniform Bayesian Dirichlet

› MDL (Minimum Description Length) given model M and dataset D (a sketch follows):
  – Description cost: DL(M, D) = DL(M) + DL(D | M)
  – Looking for a minimum (or a maximum, depending on the score's sign)
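A rough sketch of the MDL decomposition above: model cost (bits to encode the CPTs) plus data cost (negative log-likelihood of the data given the model). This is a simplified illustration, not WEKA's exact score; the data format and variable names are assumptions:

```python
import math
from collections import Counter

def mdl_score(data, parents, card):
    """Total description length of a candidate structure (lower is better).
    `data` is a list of dicts, `parents` maps variable -> parent list, and
    `card` gives each variable's number of values."""
    n = len(data)
    total = 0.0
    for var, pa in parents.items():
        q = math.prod(card[p] for p in pa)     # number of parent configurations
        k = q * (card[var] - 1)                # free parameters in this CPT
        total += 0.5 * math.log2(n) * k        # model cost: DL(M)
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        for (cfg, _), c in joint.items():      # data cost: DL(D | M)
            total += -c * math.log2(c / marg[cfg])
    return total

# Tiny usage example with two binary variables, comparing A -> B against no edge.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
print(mdl_score(data, {"A": [], "B": ["A"]}, {"A": 2, "B": 2}))  # with the edge
print(mdl_score(data, {"A": [], "B": []}, {"A": 2, "B": 2}))     # without it
```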

Page 18: Optimal Bayesian Networks

18

Bayesian Network Learning Algorithms

› Gradient Descent
  – Structure is given, CPTs to be calculated
  – Some of the a priori probabilities are missing
  – Infinitesimal approximation

› K2 (see the sketch below)
  – Well known
  – Greedy algorithm
  – Each node has a maximum number of parents
  – Add parents gradually (from 0)
  – Attributes ordering is given
  – Look for the structure having the highest score
  – Stop when no better structure is found
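A compact sketch of the K2 loop just described; `score(var, parents, data)` stands for any local metric (e.g. a negated version of the MDL sketch above). This is an illustration of the greedy scheme, not the original algorithm's exact formulation:

```python
def k2(order, data, max_parents, score):
    """Greedy K2 sketch: for each node, in the given attribute ordering, add the
    single best-scoring parent at a time (parents must precede the node in the
    order) and stop when the score no longer improves or the limit is hit."""
    parents = {v: [] for v in order}
    for i, var in enumerate(order):
        best = score(var, parents[var], data)
        while len(parents[var]) < max_parents:
            candidates = [c for c in order[:i] if c not in parents[var]]
            if not candidates:
                break
            gain, choice = max((score(var, parents[var] + [c], data), c)
                               for c in candidates)
            if gain <= best:
                break                  # no candidate improves the score: stop
            parents[var].append(choice)
            best = gain
    return parents
```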

Page 19: Optimal Bayesian Networks

19

Bayesian Network Learning Algorithms

› Hill-Climbing Search (a skeleton follows the figure)
  – Local Search, Global Search
  – Global: incremental solution construction
  – Local: start with a random solution, optimize towards the optimum

[Figure: Global (right) vs. Local (left)]
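A minimal local hill-climbing skeleton; `neighbours` (e.g. all structures reachable by adding, removing, or reversing one edge) and `score` are caller-supplied assumptions, not fixed by the slide:

```python
def hill_climb(initial, neighbours, score):
    """Local hill-climbing: start from an initial solution (e.g. an empty or
    random DAG), move to the best-scoring neighbour, stop at a local optimum."""
    current = initial
    while True:
        candidates = list(neighbours(current))
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current        # no neighbour improves the score: local optimum
        current = best
```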

Page 20: Optimal Bayesian Networks

20

Bayesian Network Learning Algorithms

› Taboo Search (usually spelled Tabu Search; a sketch follows the figure)
  – List of forbidden solutions
  – Allow bad solutions to reveal good solutions
  – Avoid local max/min
  – Efficient data structures
  – Decisions made along 4 dimensions:
    › Past occurrences
    › Frequencies
    › Quality
    › Impact

[Figure: Taboo Search scheme – Initial Solution → Possible Solutions → Solutions Evaluation → Find Optimal Solution → Stop? → Update Taboo List → Optimal Solution]
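A minimal sketch of the scheme above: the search always moves to the best neighbour not on the taboo list, even when that move is worse than the current solution (this is how local optima are escaped), while remembering the best solution seen overall. The parameters and helper functions are made-up assumptions:

```python
from collections import deque

def tabu_search(initial, neighbours, score, tabu_len=10, max_iter=100):
    """Taboo-search skeleton with a fixed-length list of forbidden solutions."""
    current = best = initial
    tabu = deque([initial], maxlen=tabu_len)      # efficient forbidden list
    for _ in range(max_iter):
        allowed = [s for s in neighbours(current) if s not in tabu]
        if not allowed:
            break
        current = max(allowed, key=score)         # possibly a "bad" move
        tabu.append(current)
        if score(current) > score(best):
            best = current
    return best
```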

Page 21: Optimal Bayesian Networks

21

Bayesian Network Learning Algorithms

› TAN – Tree Augmented Naïve Bayes
  – Tree based
  – Conditional Mutual Information
  – Edges from the class to the attributes
  – Chow-Liu (1968)

› Genetic Algorithm (GA) (a sketch follows the figure)
  – Evolution
  – Mutation
  – Selection from several generations

[Figure: Genetic Algorithm scheme – Initialization → Solution Generation → Change → New Solutions Creation → Selection → Stop? → Optimal Solution. Source: P. Larranaga et al.]
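A minimal GA skeleton following the scheme above; `fitness`, `crossover`, and `mutate` are caller-supplied, problem-specific operators (e.g. acting on the edge sets of candidate DAGs) and are assumptions, not part of the slide:

```python
import random

def genetic_search(population, fitness, crossover, mutate, generations=50):
    """Genetic-algorithm skeleton: create new solutions by crossover, change
    them by mutation, and select the fittest across generations."""
    for _ in range(generations):
        parents = random.sample(population, k=len(population))   # shuffle
        children = [mutate(crossover(a, b))
                    for a, b in zip(parents[::2], parents[1::2])]
        # Selection: keep the best individuals among parents and children.
        population = sorted(population + children,
                            key=fitness, reverse=True)[:len(population)]
    return population[0]         # the fittest solution found
```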

 

Page 22: Optimal Bayesian Networks

22

Bayesian Network Learning Algorithms

› Simulated Annealing (a sketch follows)
  – Thermodynamics principle
  – May get stuck in a local minimum/maximum

› Ordering-Based Search
  – Attributes ordering is given
  – Each node has a maximum number of parents
  – The cardinality of orderings is lower than the cardinality of structures
  – There is a mapping from orderings to structures
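A minimal simulated-annealing skeleton; the acceptance rule exp(Δ/T) and the geometric cooling schedule are the standard textbook choices, and the parameters and helper functions here are made-up assumptions:

```python
import math
import random

def simulated_annealing(initial, neighbour, score, t0=1.0, cooling=0.99, steps=1000):
    """Always accept improvements; accept a worse neighbour with probability
    exp(delta / T), where T follows a geometric cooling schedule (the
    thermodynamics principle this slide refers to)."""
    current, t = initial, t0
    for _ in range(steps):
        candidate = neighbour(current)
        delta = score(candidate) - score(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = candidate      # occasionally accept a worse solution
        t *= cooling                 # cool down: worse moves become rarer
    return current
```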

Page 23: Optimal Bayesian Networks

23

Classifiers Comparison

› WEKA 3.6, votes.arff, 435 records, 17 attributes, 10 folds

Classifier                       Time       Accurate  Inaccurate  TP   FP   TN   FN
Naïve Bayes                      0.01 sec   90.11%    9.89%       238  29   154  14
J48                              0 sec      96.32%    3.68%       259  8    160  8
IB1                              0 sec      92.41%    7.59%       244  23   158  10
MLP                              1.75 sec   94.71%    5.29%       254  13   158  10
BN, K2, Local                    0.04 sec   90.11%    9.89%       238  29   154  14
BN, K2, Global                   0.01 sec   90.11%    9.89%       238  29   154  14
BN, Hill Climber, Local          0.02 sec   90.34%    9.66%       239  28   154  14
BN, Hill Climber, Global         2.87 sec   94.48%    5.52%       255  12   156  12
BN, Simulated Annealing, Local   1.34 sec   94.94%    5.06%       255  12   158  10
BN, Simulated Annealing, Global  52.04 sec  94.02%    5.98%       254  13   155  13
BN, Taboo Search, Local          0.02 sec   90.34%    9.66%       239  28   154  14
BN, Taboo Search, Global         1.92 sec   93.79%    6.21%       255  12   153  15
BN, TAN, Local                   0.04 sec   94.94%    5.06%       254  13   159  9
BN, TAN, Global                  3.24 sec   95.17%    4.83%       252  15   162  6

Page 24: Optimal Bayesian Networks

24

Classifiers Comparison

› WEKA 3.7, GBPAUD, 37 attributes, 10k records, 33%-66% split

Classifier                      Time        Correctly Classified  Incorrectly Classified
Naïve Bayes                     0.03 sec    63.38%                36.62%
J48                             0.48 sec    98.77%                1.23%
IB1                             0.01 sec    68.79%                31.21%
MLP                             >5 min      ?                     ?
BN, K2, Local                   0.11 sec    64.27%                35.73%
BN, K2, Global                  3.62 sec    64.27%                35.73%
BN, Hill Climber, Local         143.19 sec  62.7353%              37.2647%
BN, Simulated Annealing, Local  >5 min      ?                     ?
BN, Taboo Search, Local         144.19 min  64.4706%              35.5294%
BN, TAN, Local                  >5 min      ?                     ?

Page 25: Optimal Bayesian Networks

25

Optimal Bayesian Network

› Combinatorial optimization

› Inference is difficult if we must visit the whole structure

› Curse of dimensionality

› Feature selection – critical phase

› Attributes ordering – usually must be calculated

› Search space pruning by heuristics

› A priori knowledge, field experts (Wish You Were Here)

Page 26: Optimal Bayesian Networks

26

Summary

› Graphical classification model
  – Judea Pearl (1988)
  – Chow-Liu (1968)

› Easily fitted

› Easily interpreted

› Computational limit (as always!)

› Polynomial algorithms?
  – Time
  – Memory

Page 27: Optimal Bayesian Networks

27

That's all folks!

Kreimer Andrew
Algonell.com – Scientific FX
[email protected]