TRANSCRIPT

Bayesian Data Mining
University of Belgrade, School of Electrical Engineering
Department of Computer Engineering and Information Theory
Marko Stupar 11/3370 [email protected]

Data Mining problem
Too many attributes in training set (columns of table)
Existing algorithms need too much time to find a solution
We need to classify, estimate, and predict in real time
[Table residue: a row of attribute values Value 1, Value 2, ..., Value 100000, illustrating a table with very many attributes]

Problem importance
Find relation between: All Diseases, All Medications, All Symptoms
Existing solutions
CART, C4.5: too many iterations; continuous arguments need binning
Rule induction: continuous arguments need binning
Neural networks: high computational time
K-nearest neighbor: output depends only on distance-based close values

Naïve Bayes Algorithm
Classification, Estimation, Prediction
Used for large data sets
Very easy to construct
Not using complicated iterative parameter estimations
Often does surprisingly well
May not be the best possible classifier
Robust, fast, it can usually be relied on

Naïve Bayes algorithm - Reasoning
New information arrived
How to classify the Target?
Training set: columns Attribute 1, Attribute 2, ..., Attribute n, Target
New row: values a1, a2, ..., an, with the Target missing
Target can be one of the discrete values: t1, t2, ..., tn
Naïve Bayes algorithm - Reasoning
How to calculate P(Target = ti | A1 = a1, ..., An = an)?
By Bayes' theorem: P(ti | a1, ..., an) = P(a1, ..., an | ti) * P(ti) / P(a1, ..., an)
That did not help very much: P(a1, ..., an | ti) is still hard to estimate.
Naïve Bayes assumes that A1, ..., An are conditionally independent given the Target, so:
P(a1, ..., an | ti) = P(a1 | ti) * P(a2 | ti) * ... * P(an | ti)
Naïve Bayes - Discrete Target Example

 #  Age     Income  Student  Credit     Buys Computer (Target)
 1  Youth   High    No       Fair       No
 2  Youth   High    No       Excellent  No
 3  Middle  High    No       Fair       Yes
 4  Senior  Medium  No       Fair       Yes
 5  Senior  Low     Yes      Fair       Yes
 6  Senior  Low     Yes      Excellent  No
 7  Middle  Low     Yes      Excellent  Yes
 8  Youth   Medium  No       Fair       No
 9  Youth   Low     Yes      Fair       Yes
10  Senior  Medium  Yes      Fair       Yes
11  Youth   Medium  Yes      Excellent  Yes
12  Middle  Medium  No       Excellent  Yes
13  Middle  High    Yes      Fair       Yes
14  Senior  Medium  No       Excellent  No
Naïve Bayes - Discrete Target Example
Attributes = (Age=youth, Income=medium, Student=yes, Credit_rating=fair)
Target = Buys_Computer = [Yes | No] ?
P(Attributes, Buys_Computer=Yes)
= P(Age=youth | Buys_Computer=yes) * P(Income=medium | Buys_Computer=yes)
  * P(Student=yes | Buys_Computer=yes) * P(Credit_rating=fair | Buys_Computer=yes)
  * P(Buys_Computer=yes)
= 2/9 * 4/9 * 6/9 * 6/9 * 9/14 = 0.028
P(Attributes, Buys_Computer=No)
= P(Age=youth | Buys_Computer=no) * P(Income=medium | Buys_Computer=no)
  * P(Student=yes | Buys_Computer=no) * P(Credit_rating=fair | Buys_Computer=no)
  * P(Buys_Computer=no)
= 3/5 * 2/5 * 1/5 * 2/5 * 5/14 = 0.007
P(Buys_Computer=Yes | Attributes) > P(Buys_Computer=No| Attributes)
Therefore, the naïve Bayesian classifier predicts Buys_Computer = Yes for the previously given Attributes
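The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized classifier: it recounts the conditional frequencies from the 14-row training set and compares the two unnormalized scores P(Attributes, Target).

```python
# Naive Bayes on the Buys_Computer training set from the slides.
# (Age, Income, Student, Credit, Buys_Computer) per row.
rows = [
    ("Youth", "High", "No", "Fair", "No"),
    ("Youth", "High", "No", "Excellent", "No"),
    ("Middle", "High", "No", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Excellent", "No"),
    ("Middle", "Low", "Yes", "Excellent", "Yes"),
    ("Youth", "Medium", "No", "Fair", "No"),
    ("Youth", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "Yes", "Fair", "Yes"),
    ("Youth", "Medium", "Yes", "Excellent", "Yes"),
    ("Middle", "Medium", "No", "Excellent", "Yes"),
    ("Middle", "High", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Excellent", "No"),
]

def score(target, attrs):
    """Unnormalized P(attrs, target) under the naive independence assumption."""
    in_class = [r for r in rows if r[-1] == target]
    p = len(in_class) / len(rows)  # prior P(target)
    for i, value in enumerate(attrs):
        p *= sum(1 for r in in_class if r[i] == value) / len(in_class)
    return p

attrs = ("Youth", "Medium", "Yes", "Fair")
p_yes = score("Yes", attrs)  # 2/9 * 4/9 * 6/9 * 6/9 * 9/14, approx 0.028
p_no = score("No", attrs)    # 3/5 * 2/5 * 1/5 * 2/5 * 5/14, approx 0.007
print("Yes" if p_yes > p_no else "No")
```

Running this reproduces the slide's numbers and prints "Yes".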
Naïve Bayes - Discrete Target - Spam filter
Attributes = Text Document = w1, w2, w3, ... (array of words)
Target = Spam = [Yes | No] ?
P(wi | Spam) - probability that the i-th word of a given document occurs in documents, in the training set, that are classified as Spam
P(w1, ..., wn | Spam) = P(w1 | Spam) * ... * P(wn | Spam) - probability that all words of the document occur in Spam documents in the training set (naïve independence assumption)
Naïve Bayes - Discrete Target - Spam filter - Bayes factor
P(Spam | w1, ..., wn) / P(!Spam | w1, ..., wn) = [P(Spam) / P(!Spam)] * [P(w1 | Spam) / P(w1 | !Spam)] * ... * [P(wn | Spam) / P(wn | !Spam)]
Sample correction: if there is a word in the document that never occurred in the training set, the whole product P(w1, ..., wn | Spam) will be zero.
Sample correction solution: put some low value for that P(wi | Spam) instead of zero.
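A minimal sketch of the word-based spam score with the sample correction implemented as add-one (Laplace) smoothing. The tiny training corpus and word lists here are invented for illustration only:

```python
from collections import Counter

# Hypothetical training documents, each a bag of words.
spam_docs = [["buy", "cheap", "pills"], ["cheap", "offer", "buy"]]
ham_docs = [["meeting", "tomorrow", "offer"], ["project", "meeting", "notes"]]

def word_probs(docs):
    """P(word | class) with add-one smoothing over the combined vocabulary,
    so a word never seen in `docs` still gets a small nonzero probability."""
    counts = Counter(w for d in docs for w in d)
    vocab = {w for d in spam_docs + ham_docs for w in d}
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + len(vocab))

p_w_spam, p_w_ham = word_probs(spam_docs), word_probs(ham_docs)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))  # prior

def spam_score(words):
    """Unnormalized P(words, Spam) and P(words, not Spam); smoothing keeps an
    unseen word from zeroing out the whole product."""
    s, h = p_spam, 1 - p_spam
    for w in words:
        s *= p_w_spam(w)
        h *= p_w_ham(w)
    return s, h

s, h = spam_score(["buy", "cheap", "stuff"])  # "stuff" never seen in training
print("spam" if s > h else "ham")
```

Without the smoothing term, the unseen word "stuff" would force both products to zero and make the document unclassifiable.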
Gaussian Naïve Bayes - Continuous Attributes
Continuous attributes do not need binning (unlike CART and C4.5)
Choose an adequate PDF for each attribute in the training set
A Gaussian PDF is most likely to be used to estimate the attribute probability density function (PDF)
Calculate PDF parameters by using the Maximum Likelihood Method
Naïve Bayes assumption: each attribute is independent of the others, so the joint PDF of all attributes is the product of the single attributes' PDFs
Gaussian Naïve Bayes - Continuous Attributes - Example

Training set:
sex     height (feet)  weight (lbs)  foot size (inches)
male    6              180           12
male    5.92           190           11
male    5.58           170           12
male    5.92           165           10
female  5              100           6
female  5.5            150           8
female  5.42           130           7
female  5.75           150           9
Parameters estimated from the training set (mean and maximum-likelihood, divide-by-n, variance):

                    Target = male         Target = female
                    mean      variance    mean      variance
height (feet)       5.855     0.026275    5.4175    0.07291875
weight (lbs)        176.25    92.1875     132.5     418.75
foot size (inches)  11.25     0.6875      7.5       1.25
Validation set:
sex  height (feet)  weight (lbs)  foot size (inches)
?    6              130           8
Target = sex = [male | female] ?
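A minimal sketch of the Gaussian naïve Bayes classification of the validation sample, using the training set above. Means and variances are computed with the maximum-likelihood (divide-by-n) estimator, and equal priors follow from the 4-and-4 class split:

```python
import math

# (height ft, weight lbs, foot size in) per training row, grouped by class.
train = {
    "male":   [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}

def gauss(x, mean, var):
    """Gaussian probability density at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, sample):
    """Unnormalized P(sample, cls): prior times the product of per-attribute
    Gaussian densities (naive independence assumption)."""
    rows = train[cls]
    p = 0.5  # equal priors: 4 of the 8 training rows per class
    for i, x in enumerate(sample):
        vals = [r[i] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)  # ML estimate
        p *= gauss(x, mean, var)
    return p

sample = (6.0, 130, 8)
print(max(train, key=lambda c: score(c, sample)))
```

The low weight and small foot size dominate here, so the sample is classified as female despite the tall height.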
Naïve Bayes - Extensions
Easy to extend; Gaussian naïve Bayes is one sample extension.
Estimate Target: if the Target is a real number, but in the training set it has only a few acceptable discrete values t1, ..., tn, we can estimate the Target by the expectation:
Target ≈ t1 * P(t1 | Attributes) + ... + tn * P(tn | Attributes)
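The expectation-based estimate amounts to one weighted sum. A tiny sketch, with made-up target values and posterior probabilities standing in for the P(ti | Attributes) a naïve Bayes classifier would produce:

```python
# Hypothetical discrete target values seen in the training set,
# and their posteriors P(ti | Attributes) from a naive Bayes classifier.
targets = [10.0, 20.0, 30.0]
posteriors = [0.2, 0.5, 0.3]  # must sum to 1

# Real-valued estimate: expected value of the Target under the posterior.
estimate = sum(t * p for t, p in zip(targets, posteriors))
print(round(estimate, 2))
```

The estimate can fall between the discrete training values, which is exactly the point of the extension.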
A large number of modifications have been introduced, by the statistical, data mining, machine learning, and pattern recognition communities, in an attempt to make it more flexible. Modifications are necessarily complications, which detract from its basic simplicity.
Naïve Bayes - Extensions
Are attributes always really independent? A1 = Weight, A2 = Height, A3 = Shoe Size, Target = [male | female]?
How can that influence our naïve Bayes data mining?
Can we relax the independence assumption? Bayesian network

Bayesian Network
A Bayesian network is a directed acyclic graph (DAG) with a probability table for each node.
A Bayesian network contains nodes and arcs between them.
Nodes represent arguments from the database.
Arcs between nodes represent their probabilistic dependencies.
[Figure: example network with nodes Target, A1, ..., A7 and probability tables such as P(A2 | Target), P(A6 | A4, A5), P(A3 | Target, A4, ..., A6)]

Bayesian Network - What to do
Read Network
Construct Network (from the Training Set)
Or continue with naïve Bayes (use the joint probability distribution as in naïve Bayes)

Bayesian Network - Read Network
Chain rule of probability:
P(A1, ..., An) = P(A1) * P(A2 | A1) * ... * P(An | A1, ..., An-1)
A Bayesian network uses the Markov assumption: each node depends only on its parents, so P(Ai | A1, ..., Ai-1) = P(Ai | Parents(Ai))
[Figure: node A7 with parents A2 and A5 - A7 depends only on A2 and A5]
Joint probability distribution: P(A1, ..., An) = P(A1 | Parents(A1)) * ... * P(An | Parents(An))

Bayesian Network - Read Network - Example
How to get P(N | B), P(B | M, T)?
Expert knowledge
From data (relative frequency estimates)
Or a combination of both
Network: Medication (M) and Trauma (T) are parents of Blood Clot (B); B is the parent of Heart Attack (H), Nothing (N), and Stroke (S).

P(M) = 0.2,  P(!M) = 0.8
P(T) = 0.05, P(!T) = 0.95

M  T  | P(B)  P(!B)
T  T  | 0.95  0.05
T  F  | 0.3   0.7
F  T  | 0.6   0.4
F  F  | 0.9   0.1

B  | P(H)  P(!H)
T  | 0.4   0.6
F  | 0.15  0.85

B  | P(N)  P(!N)
T  | 0.25  0.75
F  | 0.75  0.25

B  | P(S)  P(!S)
T  | 0.35  0.65
F  | 0.1   0.9
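Reading this network can be sketched directly from its factorization, P(M, T, B, H, N, S) = P(M) P(T) P(B | M, T) P(H | B) P(N | B) P(S | B); any marginal follows by summing out the other variables. A minimal sketch computing P(Stroke) from the tables above (Heart Attack and Nothing sum out to 1, so they are omitted):

```python
from itertools import product

# Conditional probability tables from the slide, as P(variable = true | ...).
p_m = {True: 0.2, False: 0.8}
p_t = {True: 0.05, False: 0.95}
p_b = {(True, True): 0.95, (True, False): 0.3,
       (False, True): 0.6, (False, False): 0.9}  # P(B=true | M, T)
p_s = {True: 0.35, False: 0.1}                   # P(S=true | B)

def p(value, p_true):
    """Probability that a binary variable equals `value` given P(true)."""
    return p_true if value else 1 - p_true

# P(Stroke) = sum over M, T, B of P(M) * P(T) * P(B | M, T) * P(S=true | B).
p_stroke = sum(
    p_m[m] * p_t[t] * p(b, p_b[(m, t)]) * p_s[b]
    for m, t, b in product([True, False], repeat=3)
)
print(round(p_stroke, 6))  # approx 0.2936
```

The same enumeration pattern answers conditional queries too, by dividing two such sums.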
Bayesian Network - Construct Network
Manually
From database, automatically:
Heuristic algorithms
1. Use a heuristic search method to construct a model
2. Evaluate the model using a scoring method (Bayesian scoring method, entropy based method, minimum description length method)
3. Go to 1 if the score of the new model is not significantly better
Algorithms that analyze dependency among nodes
Measure dependency by conditional independence (CI) tests
Heuristic algorithms
Advantage: less time complexity in the worst case
Disadvantage: may not find the best solution due to the heuristic nature
Algorithms that analyze dependency among nodes
Advantage: usually asymptotically correct
Disadvantage: CI tests with large condition-sets may be unreliable unless the volume of data is enormous
Bayesian Network - Construct Network - Example
1. Choose an ordering of variables X1, ..., Xn
2. For i = 1 to n:
   add Xi to the network
   select parents from X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M); ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Create Network from database
d-(Directional) Separation
d-Separation is a graphical criterion for deciding, from a given causal graph (DAG), whether disjoint sets of nodes, an X-set and a Y-set, are independent when we know the realization of a third Z-set
The Z-set is instantiated (the values of its nodes are known) before we try to determine d-Separation (independence) between the X-set and the Y-set
The X-set and Y-set are d-Separated by a given Z-set if all paths between them are blocked
Example of a path: N1 N3 -> N4 -> N5