TRANSCRIPT

Bayesian Data Mining
University of Belgrade, School of Electrical Engineering
Department of Computer Engineering and Information Theory
Marko Stupar 11/3370 [email protected]

Data Mining problem
Too many attributes in training set (columns of table)
Existing algorithms need too much time to find a solution
We need to classify, estimate, and predict in real time
[Table residue: a row of attribute values Value 1, Value 2, ..., Value 100000, illustrating a table with very many attributes]

Problem importance
Find relation between: All Diseases, All Medications, All Symptoms
Existing solutions
CART, C4.5: too many iterations; continuous arguments need binning
Rule induction: continuous arguments need binning
Neural networks: high computational time
K-nearest neighbor: output depends only on distance-based close values

Naïve Bayes Algorithm
Classification, Estimation, Prediction
Used for large data sets
Very easy to construct
Not using complicated iterative parameter estimations
Often does surprisingly well
May not be the best possible classifier
Robust, fast, it can usually be relied on

Naïve Bayes algorithm - Reasoning
New information arrived
How to classify the Target?
Training set: columns Attribute 1, Attribute 2, ..., Attribute n, Target
New row: values a1, a2, ..., an, with the Target missing
Target can be one of the discrete values: t1, t2, ..., tn
Naïve Bayes algorithm - Reasoning
How to calculate P(Target = ti | A1 = a1, ..., An = an)?
By Bayes' theorem: P(ti | a1, ..., an) = P(a1, ..., an | ti) * P(ti) / P(a1, ..., an)
That did not help very much: P(a1, ..., an | ti) is still hard to estimate.
Naïve Bayes assumes that A1, ..., An are conditionally independent given the Target, so:
P(a1, ..., an | ti) = P(a1 | ti) * P(a2 | ti) * ... * P(an | ti)
Naïve Bayes - Discrete Target Example

 #  Age     Income  Student  Credit     Buys Computer (Target)
 1  Youth   High    No       Fair       No
 2  Youth   High    No       Excellent  No
 3  Middle  High    No       Fair       Yes
 4  Senior  Medium  No       Fair       Yes
 5  Senior  Low     Yes      Fair       Yes
 6  Senior  Low     Yes      Excellent  No
 7  Middle  Low     Yes      Excellent  Yes
 8  Youth   Medium  No       Fair       No
 9  Youth   Low     Yes      Fair       Yes
10  Senior  Medium  Yes      Fair       Yes
11  Youth   Medium  Yes      Excellent  Yes
12  Middle  Medium  No       Excellent  Yes
13  Middle  High    Yes      Fair       Yes
14  Senior  Medium  No       Excellent  No
Naïve Bayes - Discrete Target Example
Attributes = (Age=youth, Income=medium, Student=yes, Credit_rating=fair)
Target = Buys_Computer = [Yes | No] ?
P(Attributes, Buys_Computer=Yes)
= P(Age=youth | Buys_Computer=yes) * P(Income=medium | Buys_Computer=yes)
  * P(Student=yes | Buys_Computer=yes) * P(Credit_rating=fair | Buys_Computer=yes)
  * P(Buys_Computer=yes)
= 2/9 * 4/9 * 6/9 * 6/9 * 9/14 = 0.028
P(Attributes, Buys_Computer=No)
= P(Age=youth | Buys_Computer=no) * P(Income=medium | Buys_Computer=no)
  * P(Student=yes | Buys_Computer=no) * P(Credit_rating=fair | Buys_Computer=no)
  * P(Buys_Computer=no)
= 3/5 * 2/5 * 1/5 * 2/5 * 5/14 = 0.007
P(Buys_Computer=Yes | Attributes) > P(Buys_Computer=No| Attributes)
Therefore, the naïve Bayesian classifier predicts Buys_Computer = Yes for the previously given Attributes
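The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized classifier: it recounts the conditional frequencies from the 14-row training set and compares the two unnormalized scores P(Attributes, Target).

```python
# Naive Bayes on the Buys_Computer training set from the slides.
# (Age, Income, Student, Credit, Buys_Computer) per row.
rows = [
    ("Youth", "High", "No", "Fair", "No"),
    ("Youth", "High", "No", "Excellent", "No"),
    ("Middle", "High", "No", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Low", "Yes", "Excellent", "No"),
    ("Middle", "Low", "Yes", "Excellent", "Yes"),
    ("Youth", "Medium", "No", "Fair", "No"),
    ("Youth", "Low", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "Yes", "Fair", "Yes"),
    ("Youth", "Medium", "Yes", "Excellent", "Yes"),
    ("Middle", "Medium", "No", "Excellent", "Yes"),
    ("Middle", "High", "Yes", "Fair", "Yes"),
    ("Senior", "Medium", "No", "Excellent", "No"),
]

def score(target, attrs):
    """Unnormalized P(attrs, target) under the naive independence assumption."""
    in_class = [r for r in rows if r[-1] == target]
    p = len(in_class) / len(rows)  # prior P(target)
    for i, value in enumerate(attrs):
        p *= sum(1 for r in in_class if r[i] == value) / len(in_class)
    return p

attrs = ("Youth", "Medium", "Yes", "Fair")
p_yes = score("Yes", attrs)  # 2/9 * 4/9 * 6/9 * 6/9 * 9/14, approx 0.028
p_no = score("No", attrs)    # 3/5 * 2/5 * 1/5 * 2/5 * 5/14, approx 0.007
print("Yes" if p_yes > p_no else "No")
```

Running this reproduces the slide's numbers and prints "Yes".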
Naïve Bayes - Discrete Target - Spam filter
Attributes = Text Document = w1, w2, w3, ... (array of words)
Target = Spam = [Yes | No] ?
P(wi | Spam) - probability that the i-th word of a given document occurs in documents, in the training set, that are classified as Spam
P(w1, ..., wn | Spam) = P(w1 | Spam) * ... * P(wn | Spam) - probability that all words of the document occur in Spam documents in the training set (naïve independence assumption)
Naïve Bayes - Discrete Target - Spam filter - Bayes factor
P(Spam | w1, ..., wn) / P(!Spam | w1, ..., wn) = [P(Spam) / P(!Spam)] * [P(w1 | Spam) / P(w1 | !Spam)] * ... * [P(wn | Spam) / P(wn | !Spam)]
Sample correction: if there is a word in the document that never occurred in the training set, the whole product P(w1, ..., wn | Spam) will be zero.
Sample correction solution: put some low value for that P(wi | Spam) instead of zero.
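A minimal sketch of the word-based spam score with the sample correction implemented as add-one (Laplace) smoothing. The tiny training corpus and word lists here are invented for illustration only:

```python
from collections import Counter

# Hypothetical training documents, each a bag of words.
spam_docs = [["buy", "cheap", "pills"], ["cheap", "offer", "buy"]]
ham_docs = [["meeting", "tomorrow", "offer"], ["project", "meeting", "notes"]]

def word_probs(docs):
    """P(word | class) with add-one smoothing over the combined vocabulary,
    so a word never seen in `docs` still gets a small nonzero probability."""
    counts = Counter(w for d in docs for w in d)
    vocab = {w for d in spam_docs + ham_docs for w in d}
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + len(vocab))

p_w_spam, p_w_ham = word_probs(spam_docs), word_probs(ham_docs)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))  # prior

def spam_score(words):
    """Unnormalized P(words, Spam) and P(words, not Spam); smoothing keeps an
    unseen word from zeroing out the whole product."""
    s, h = p_spam, 1 - p_spam
    for w in words:
        s *= p_w_spam(w)
        h *= p_w_ham(w)
    return s, h

s, h = spam_score(["buy", "cheap", "stuff"])  # "stuff" never seen in training
print("spam" if s > h else "ham")
```

Without the smoothing term, the unseen word "stuff" would force both products to zero and make the document unclassifiable.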
Gaussian Naïve Bayes - Continuous Attributes
Continuous attributes do not need binning (unlike CART and C4.5)
Choose an adequate PDF for each attribute in the training set
A Gaussian PDF is most likely to be used to estimate the attribute probability density function (PDF)
Calculate PDF parameters by using the Maximum Likelihood Method
Naïve Bayes assumption: each attribute is independent of the others, so the joint PDF of all attributes is the product of the single attributes' PDFs
Gaussian Naïve Bayes - Continuous Attributes - Example

Training set:
sex     height (feet)  weight (lbs)  foot size (inches)
male    6              180           12
male    5.92           190           11
male    5.58           170           12
male    5.92           165           10
female  5              100           6
female  5.5            150           8
female  5.42           130           7
female  5.75           150           9
Parameters estimated from the training set (mean and maximum-likelihood, divide-by-n, variance):

                    Target = male         Target = female
                    mean      variance    mean      variance
height (feet)       5.855     0.026275    5.4175    0.07291875
weight (lbs)        176.25    92.1875     132.5     418.75
foot size (inches)  11.25     0.6875      7.5       1.25
Validation set:
sex  height (feet)  weight (lbs)  foot size (inches)
?    6              130           8
Target = sex = [male | female] ?
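A minimal sketch of the Gaussian naïve Bayes classification of the validation sample, using the training set above. Means and variances are computed with the maximum-likelihood (divide-by-n) estimator, and equal priors follow from the 4-and-4 class split:

```python
import math

# (height ft, weight lbs, foot size in) per training row, grouped by class.
train = {
    "male":   [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}

def gauss(x, mean, var):
    """Gaussian probability density at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, sample):
    """Unnormalized P(sample, cls): prior times the product of per-attribute
    Gaussian densities (naive independence assumption)."""
    rows = train[cls]
    p = 0.5  # equal priors: 4 of the 8 training rows per class
    for i, x in enumerate(sample):
        vals = [r[i] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)  # ML estimate
        p *= gauss(x, mean, var)
    return p

sample = (6.0, 130, 8)
print(max(train, key=lambda c: score(c, sample)))
```

The low weight and small foot size dominate here, so the sample is classified as female despite the tall height.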
Naïve Bayes - Extensions
Easy to extend; Gaussian naïve Bayes is one sample extension.
Estimate Target: if the Target is a real number, but in the training set it has only a few acceptable discrete values t1, ..., tn, we can estimate the Target by the expectation:
Target ≈ t1 * P(t1 | Attributes) + ... + tn * P(tn | Attributes)
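The expectation-based estimate amounts to one weighted sum. A tiny sketch, with made-up target values and posterior probabilities standing in for the P(ti | Attributes) a naïve Bayes classifier would produce:

```python
# Hypothetical discrete target values seen in the training set,
# and their posteriors P(ti | Attributes) from a naive Bayes classifier.
targets = [10.0, 20.0, 30.0]
posteriors = [0.2, 0.5, 0.3]  # must sum to 1

# Real-valued estimate: expected value of the Target under the posterior.
estimate = sum(t * p for t, p in zip(targets, posteriors))
print(round(estimate, 2))
```

The estimate can fall between the discrete training values, which is exactly the point of the extension.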
A large number of modifications have been introduced, by the statistical, data mining, machine learning, and pattern recognition communities, in an attempt to make it more flexible. Modifications are necessarily complications, which detract from its basic simplicity.
Naïve Bayes - Extensions
Are attributes always really independent? A1 = Weight, A2 = Height, A3 = Shoe Size, Target = [male | female]?
How can that influence our naïve Bayes data mining?
Can we relax the independence assumption? Bayesian network

Bayesian Network
A Bayesian network is a directed acyclic graph (DAG) with a probability table for each node.
A Bayesian network contains nodes and arcs between them.
Nodes represent arguments from the database.
Arcs between nodes represent their probabilistic dependencies.
[Figure: example network with nodes Target, A1, ..., A7 and probability tables such as P(A2 | Target), P(A6 | A4, A5), P(A3 | Target, A4, ..., A6)]

Bayesian Network - What to do
Read Network
Construct Network (from the Training Set)
Or continue with naïve Bayes (use the joint probability distribution as in naïve Bayes)

Bayesian Network - Read Network
Chain rule of probability:
P(A1, ..., An) = P(A1) * P(A2 | A1) * ... * P(An | A1, ..., An-1)
A Bayesian network uses the Markov assumption: each node depends only on its parents, so P(Ai | A1, ..., Ai-1) = P(Ai | Parents(Ai))
[Figure: node A7 with parents A2 and A5 - A7 depends only on A2 and A5]
Joint probability distribution: P(A1, ..., An) = P(A1 | Parents(A1)) * ... * P(An | Parents(An))

Bayesian Network - Read Network - Example
How to get P(N | B), P(B | M, T)?
Expert knowledge
From data (relative frequency estimates)
Or a combination of both
Network: Medication (M) and Trauma (T) are parents of Blood Clot (B); B is the parent of Heart Attack (H), Nothing (N), and Stroke (S).

P(M) = 0.2,  P(!M) = 0.8
P(T) = 0.05, P(!T) = 0.95

M  T  | P(B)  P(!B)
T  T  | 0.95  0.05
T  F  | 0.3   0.7
F  T  | 0.6   0.4
F  F  | 0.9   0.1

B  | P(H)  P(!H)
T  | 0.4   0.6
F  | 0.15  0.85

B  | P(N)  P(!N)
T  | 0.25  0.75
F  | 0.75  0.25

B  | P(S)  P(!S)
T  | 0.35  0.65
F  | 0.1   0.9
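Reading this network can be sketched directly from its factorization, P(M, T, B, H, N, S) = P(M) P(T) P(B | M, T) P(H | B) P(N | B) P(S | B); any marginal follows by summing out the other variables. A minimal sketch computing P(Stroke) from the tables above (Heart Attack and Nothing sum out to 1, so they are omitted):

```python
from itertools import product

# Conditional probability tables from the slide, as P(variable = true | ...).
p_m = {True: 0.2, False: 0.8}
p_t = {True: 0.05, False: 0.95}
p_b = {(True, True): 0.95, (True, False): 0.3,
       (False, True): 0.6, (False, False): 0.9}  # P(B=true | M, T)
p_s = {True: 0.35, False: 0.1}                   # P(S=true | B)

def p(value, p_true):
    """Probability that a binary variable equals `value` given P(true)."""
    return p_true if value else 1 - p_true

# P(Stroke) = sum over M, T, B of P(M) * P(T) * P(B | M, T) * P(S=true | B).
p_stroke = sum(
    p_m[m] * p_t[t] * p(b, p_b[(m, t)]) * p_s[b]
    for m, t, b in product([True, False], repeat=3)
)
print(round(p_stroke, 6))  # approx 0.2936
```

The same enumeration pattern answers conditional queries too, by dividing two such sums.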
Bayesian Network - Construct Network
Manually
From database, automatically:
Heuristic algorithms
1. Use a heuristic search method to construct a model
2. Evaluate the model using a scoring method (Bayesian scoring method, entropy based method, minimum description length method)
3. Go to 1 if the score of the new model is not significantly better
Algorithms that analyze dependency among nodes
Measure dependency by conditional independence (CI) tests
Heuristic algorithms
Advantage: less time complexity in the worst case
Disadvantage: may not find the best solution due to the heuristic nature
Algorithms that analyze dependency among nodes
Advantage: usually asymptotically correct
Disadvantage: CI tests with large condition-sets may be unreliable unless the volume of data is enormous
Bayesian Network - Construct Network - Example
1. Choose an ordering of variables X1, ..., Xn
2. For i = 1 to n:
   add Xi to the network
   select parents from X1, ..., Xi-1 such that P(Xi | Parents(Xi)) = P(Xi | X1, ..., Xi-1)
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M); ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Create Network from database
d-(Directional) Separation
d-Separation is a graphical criterion for deciding, from a given causal graph (DAG), whether disjoint sets of nodes, an X-set and a Y-set, are independent when we know the realization of a third Z-set
The Z-set is instantiated (the values of its nodes are known) before we try to determine d-Separation (independence) between the X-set and the Y-set
The X-set and Y-set are d-Separated by a given Z-set if all paths between them are blocked
Example of a path: N1 N3 -> N4 -> N5