Probability and Statistics for Data Mining COMP5318

Upload: etenia

Post on 12-Jan-2016


Page 1: Probability and Statistics for Data Mining

Probability and Statistics for Data Mining

COMP5318

Page 2: Probability and Statistics for Data Mining

Question 1

• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?

Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35

Page 3: Probability and Statistics for Data Mining

Probability

• Probability is the mathematical language to understand uncertainty.

• We need to make decisions in the presence of uncertainty, which is ever present.

• Example: The Earth is warming, a phenomenon known as Global Warming (GW). Is modern human activity the cause of GW?
  – Physics-driven approach
  – Data-driven approach

Page 4: Probability and Statistics for Data Mining

Experiments and Observation

• When an experiment is carried out we observe the outcome, which is often uncertain.
  – If not uncertain then why carry out the experiment?

• We look into a random shopping basket. Does it contain a packet of “Tofu”?

• We toss a coin. Does it land on “Heads”?

• We ask a question: “Is it raining in Broome, WA, right now?”

Page 5: Probability and Statistics for Data Mining

Building Blocks of Probability

• The space of all possible outcomes is called the sample space.
  – Non-trivial to decide.

• Single Coin Toss. The space is {H,T}.

• Shopping Basket. The space of all possible combinations of all items sold in the store.

• Shopping Basket: {Tofu, Not-Tofu}.

Page 6: Probability and Statistics for Data Mining

Events

• Events are subsets of the sample space. Events are often defined in familiar terms.

• In the shopping basket scenario
  – A vegetarian shopping basket is an event.
  – It is the set of all possible vegetarian item combinations.

• Throw of a die. The event we are looking for could be: Even Number = {2,4,6}, where the sample space = {1,2,3,4,5,6}

Page 7: Probability and Statistics for Data Mining

Events

• Let G be the set of all galaxies. Characterize each galaxy by three numbers
  – d: distance from earth
  – a: major axis
  – b: minor axis

• Elliptic Galaxies (EG)
  – EG = {(a,b,d) | a/b > 1.5}

• Distant Spiral Galaxies (DSG)
  – DSG = {(a,b,d) | a/b <= 1.5 and d > 10}

Page 8: Probability and Statistics for Data Mining

Events

• Let G be the set of all genes. Each gene can be “on” or “off”. Let E correspond to the event: all genes which are “on” when the skin cells are “starved”.

Page 9: Probability and Statistics for Data Mining

Events are Sets

• At the most basic level events are sets. Therefore we can carry out set union, difference and intersection on events.

• For example:
  – E1: shopping baskets which contain Tofu
  – E2: shopping baskets which contain Milk
  – E1 U E2: shopping baskets which contain Tofu or Milk (or both)

Page 10: Probability and Statistics for Data Mining

Probability

• Let S be the space of all possible elementary outcomes. Let F = Power(S) be the power set of S. Then the probability P is a function P : F → [0,1] that satisfies the following properties (axioms):
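The axioms themselves are not present in this transcript. As a hedged reconstruction, the standard (Kolmogorov) axioms that a definition of this form normally refers to are given below. The symbol \mathcal{F} for Power(S) and the labels are assumptions made here for illustration.

```latex
% \mathcal{F} = Power(S); symbol and labels are chosen here, not taken from the slide.
\begin{align*}
&\text{(1) Non-negativity:} && P(E) \ge 0 \quad \text{for every } E \in \mathcal{F} \\
&\text{(2) Normalisation:}  && P(S) = 1 \\
&\text{(3) Additivity:}     && P\Big(\bigcup_{i} E_i\Big) = \sum_{i} P(E_i)
   \quad \text{for pairwise disjoint } E_1, E_2, \ldots \in \mathcal{F}
\end{align*}
```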

Page 11: Probability and Statistics for Data Mining

Interpretation of Probability

• Physical or Ontological: Long-term frequency
  – 50% chance that a coin will land on heads.
  – 20% of all Woolworth shopping baskets are vegetarian.
  – 22% of all Woolworth shopping baskets in Northbridge plaza are vegetarian.

• Epistemological: Degree of Belief
  – 20% chance that my neighbours are watering their lawn on “dry” days.
  – 99% chance that the green immovable object outside my house is a Tree.
  – 90% chance that Australia will win the cricket world cup.

Page 12: Probability and Statistics for Data Mining

Consequences of Axioms
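The body of this slide is not present in the transcript. The identities below are well-known consequences of the probability axioms and are offered as a likely, but unconfirmed, summary of what the slide covered. The last one is the identity used in the two-coin example that follows.

```latex
% Standard consequences of the axioms; A and B are events in the sample space S.
\begin{align*}
P(\emptyset) &= 0 \\
P(A^{c}) &= 1 - P(A) \\
A \subseteq B &\implies P(A) \le P(B) \\
P(A \cup B) &= P(A) + P(B) - P(A \cap B)
\end{align*}
```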

Page 13: Probability and Statistics for Data Mining

Example

• Two coin tosses. Let H1 be the event that a heads occurs on toss 1 and H2 a heads on toss 2. All outcomes are equally likely.

• Sample space = {HH, HT, TH, TT}
  – H1 = {HH, HT}
  – H2 = {HH, TH}
  – P(H1 U H2) = P(H1) + P(H2) − P(H1 ∩ H2) = ½ + ½ − ¼ = ¾
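The two-coin calculation can be checked by enumerating the sample space. This is a minimal sketch: the helper name `prob` is made up for illustration, and it assumes every outcome is equally likely, as the slide states.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: {"HH", "HT", "TH", "TT"}.
sample_space = {"".join(t) for t in product("HT", repeat=2)}

def prob(event):
    """Probability of an event (a subset of the sample space), assuming equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

H1 = {o for o in sample_space if o[0] == "H"}  # heads on toss 1
H2 = {o for o in sample_space if o[1] == "H"}  # heads on toss 2

direct = prob(H1 | H2)                              # count the union directly
via_formula = prob(H1) + prob(H2) - prob(H1 & H2)   # inclusion-exclusion
print(direct, via_formula)  # 3/4 3/4
```

Representing events as Python sets keeps the code close to the slide's notation: `|` is union and `&` is intersection.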

Page 14: Probability and Statistics for Data Mining

Example

• Two events A and B are independent if
  – P(A ∩ B) = P(A)P(B)

• P(A ∩ B) is also written as P(AB) and P(A,B).

• If A and B are disjoint events with P(A) > 0 and P(B) > 0, then A and B cannot be independent.
  – P(A ∩ B) = 0, yet P(A)P(B) > 0

• Except for this case, you cannot determine independence by looking at a Venn diagram.
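The definition of independence can be tested directly on the two-coin sample space from the earlier example. The sketch below is illustrative only: `prob` and `independent` are hypothetical helper names, and the outcomes are assumed equally likely.

```python
from fractions import Fraction

sample_space = {"HH", "HT", "TH", "TT"}  # two fair coin tosses

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """Check the definition P(A ∩ B) = P(A) P(B) exactly."""
    return prob(a & b) == prob(a) * prob(b)

H1 = {"HH", "HT"}   # heads on toss 1
H2 = {"HH", "TH"}   # heads on toss 2
TT = {"TT"}         # both tails: disjoint from H1, with positive probability

print(independent(H1, H2))  # True:  1/4 == 1/2 * 1/2
print(independent(H1, TT))  # False: 0 != 1/2 * 1/4
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.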

Page 15: Probability and Statistics for Data Mining

Question

• A shopping basket can either be kosher or not. The probability that it will be kosher is 3/4. Examine 10 baskets at a checkout counter. What is the probability that there will be at least one kosher basket?

Page 16: Probability and Statistics for Data Mining

Answer

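• Let E be the event “At least one kosher basket.” Let NKi be the event that the i-th basket is non-kosher.

• Assuming the baskets are independent: P(E) = 1 − P(NK1 ∩ … ∩ NK10) = 1 − P(NK1)···P(NK10) = 1 − (1/4)^10 ≈ 0.999999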

Page 17: Probability and Statistics for Data Mining

Example

• For an Online Book Seller (OBS) the conversion rate is 1/100, i.e., on average one in every 100 visitors ends up making a purchase. What is the probability that at least one purchase will be made in 10 consecutive visits (by distinct customers)?
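Both the kosher-basket question and the bookseller example are "at least one success in n independent trials" problems, so the complement rule 1 − (1 − p)^n applies to each. A small sketch follows; the function name `at_least_one` is made up here, and independence between baskets and between visits is an assumption.

```python
from fractions import Fraction

def at_least_one(p, n):
    """P(at least one success in n independent trials, each with success probability p)."""
    return 1 - (1 - p) ** n

# Kosher baskets: p = 3/4, n = 10.
p_kosher = at_least_one(Fraction(3, 4), 10)

# Online Book Seller: p = 1/100, n = 10.
p_purchase = at_least_one(Fraction(1, 100), 10)

print(p_kosher, float(p_kosher))
print(float(p_purchase))  # roughly 0.0956
```

The contrast is worth noting: with p = 3/4 the answer is almost 1, whereas with p = 1/100 ten visits give less than a 10% chance of a sale.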

Page 18: Probability and Statistics for Data Mining

Example

• Two people take turns trying to sink a basketball. P1 succeeds with probability 1/3 and P2 with 1/4. What is the probability that P1 succeeds before P2?

• Requires clever setting up of the events.
  – Let E be the event that P1 succeeds before P2.
  – Let Ai be the event that P1 succeeds before P2 on the i-th trial.
  – Ai ∩ Aj = Ø for i ≠ j, and E = A1 ∪ A2 ∪ A3 ∪ … (the union of Ai over i = 1, 2, …, ∞)
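Because the Ai are disjoint, P(E) is the sum of the P(Ai), which is a geometric series. The sketch below evaluates it under two assumptions not stated on the slide: P1 shoots first in every round, and all shots are independent. Under those assumptions, Ai means that both players miss in rounds 1 to i−1 and P1 scores in round i.

```python
from fractions import Fraction

p1, p2 = Fraction(1, 3), Fraction(1, 4)

# Probability that a full round passes with no basket: (2/3) * (3/4) = 1/2.
both_miss = (1 - p1) * (1 - p2)

def prob_A(i):
    """P(A_i): both miss in rounds 1..i-1, then P1 scores in round i (P1 shoots first)."""
    return both_miss ** (i - 1) * p1

# Sum of the geometric series: p1 * (1 + r + r^2 + ...) = p1 / (1 - r).
closed_form = p1 / (1 - both_miss)

# Numerical check using the first 40 terms.
partial_sum = sum(prob_A(i) for i in range(1, 41))

print(closed_form)         # 2/3
print(float(partial_sum))  # very close to 0.6667
```

If P2 were assumed to shoot first instead, each P(Ai) would carry an extra factor of 3/4 and the answer would change accordingly.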

Page 19: Probability and Statistics for Data Mining

Conditional Probability

• Very Important Concept

• P(A|B) is the “fraction of occurrences of B in which A also occurs”
  – P(A|B) = P(A ∩ B)/P(B); P(B) > 0

• For a fixed B, P(.|B) is a probability
  – Therefore if A1 and A2 are disjoint then
  – P(A1 U A2 |B) = P(A1|B) + P(A2|B)

• Note: in general, P(A|B U C) ≠ P(A|B) + P(A|C)

• Also, in general, P(A|B) ≠ P(B|A)
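The asymmetry between P(A|B) and P(B|A) is easy to see on a small example. The fair-die setting below is an assumed illustration rather than something from the slides, and `prob` and `cond` are hypothetical helper names.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # one fair die (assumed example)

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond(a, b):
    """P(A|B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_most_two = {1, 2}

print(cond(even, at_most_two))  # 1/2
print(cond(at_most_two, even))  # 1/3
```

Both conditional probabilities share the numerator P(A ∩ B) = 1/6; they differ only because they divide by different conditioning events.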

Page 20: Probability and Statistics for Data Mining

Standard Example

          D        Dc
+         0.009    0.099
−         0.001    0.891

D is disease, Dc is no disease; +/−: test positive or negative.

P(+|D) = P(+ ∩ D)/P(D) = 0.009/(0.009 + 0.001) = 0.9

P(−|Dc) = P(− ∩ Dc)/P(Dc) = 0.891/(0.891 + 0.099) = 0.9

Suppose a test is positive. What is the probability of disease?

P(D|+) = P(D ∩ +)/P(+) = 0.009/(0.009 + 0.099) ≈ 0.08
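The three conditional probabilities in this example all come from the same joint table, so they can be computed with one small helper. This is a sketch only: the dictionary layout and the name `marginal` are choices made here for illustration.

```python
# Joint probabilities P(test result, disease status) from the table above.
joint = {
    ("+", "D"): 0.009, ("+", "Dc"): 0.099,
    ("-", "D"): 0.001, ("-", "Dc"): 0.891,
}

def marginal(index, value):
    """Sum the joint table over the other variable (index 0 = test, 1 = disease)."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_pos_given_d = joint[("+", "D")] / marginal(1, "D")     # P(+|D)
p_neg_given_dc = joint[("-", "Dc")] / marginal(1, "Dc")  # P(-|Dc)
p_d_given_pos = joint[("+", "D")] / marginal(0, "+")     # P(D|+)

print(round(p_pos_given_d, 3), round(p_neg_given_dc, 3), round(p_d_given_pos, 3))
```

The test is right 90% of the time in both directions, yet a positive result implies disease only about 8% of the time, because the disease itself is rare (P(D) = 0.01).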

Page 21: Probability and Statistics for Data Mining

Standard Data Mining Example

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Suppose the data above closely resembles the behaviour of the population at large.

What is the chance that those who buy a Diaper will also buy Beer?

P(Beer|Diaper) = P(Diaper ∩ Beer)/P(Diaper) = 0.6/0.8 = 0.75

Is Diaper an Event?
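The Diaper and Beer calculation can be reproduced from the five transactions. In the sketch below, the function name `support` and the word "confidence" follow common association-rule terminology; they are used here as assumptions, since the slide does not name these quantities.

```python
from fractions import Fraction

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for basket in transactions.values() if items <= basket)
    return Fraction(hits, len(transactions))

# P(Beer|Diaper) = P(Diaper ∩ Beer) / P(Diaper)
confidence = support({"Diaper", "Beer"}) / support({"Diaper"})
print(support({"Diaper", "Beer"}), support({"Diaper"}), confidence)  # 3/5 4/5 3/4
```

In this reading, "Diaper" stands for the event "the set of all baskets that contain a Diaper", which is one way to answer the question the slide poses.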

Page 22: Probability and Statistics for Data Mining

Conditional Independence

• If A and B are independent then P(A|B) = P(A)

• P(AB) = P(A|B)P(B)

• Law of Total Probability.
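The formula for the Law of Total Probability is not present in this transcript. Its standard statement is given below as a hedged reconstruction; the partition notation B_1, …, B_k is chosen here for illustration.

```latex
% B_1, ..., B_k partition the sample space S: they are pairwise disjoint,
% their union is S, and each has P(B_i) > 0.
\[
P(A) \;=\; \sum_{i=1}^{k} P(A \cap B_i) \;=\; \sum_{i=1}^{k} P(A \mid B_i)\, P(B_i)
\]
```

The second equality is just the product rule P(AB) = P(A|B)P(B) applied to each term.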

Page 23: Probability and Statistics for Data Mining

Bayes Theorem
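The body of this slide is not present in the transcript. The standard statement of Bayes' theorem is sketched below; the partition notation B_1, …, B_k is an assumption made here for illustration.

```latex
% B_1, ..., B_k partition the sample space S, with P(B_i) > 0 and P(A) > 0.
\[
P(B_j \mid A)
  \;=\; \frac{P(A \mid B_j)\, P(B_j)}{P(A)}
  \;=\; \frac{P(A \mid B_j)\, P(B_j)}{\sum_{i=1}^{k} P(A \mid B_i)\, P(B_i)}
\]
```

The numerator comes from the product rule and the denominator from the law of total probability.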

Page 24: Probability and Statistics for Data Mining

Question 1

• Question: Suppose you randomly select a credit card holder and the person has defaulted on their credit card. What is the probability that the person selected is a ‘Female’?

Gender    % of credit card holders    % of gender who default
Male      60                          55
Female    40                          35

Page 25: Probability and Statistics for Data Mining

Answer to Question 1

P(G=F|D=Y) = P(D=Y|G=F)P(G=F) / [P(D=Y|G=F)P(G=F) + P(D=Y|G=M)P(G=M)]

= (0.35 × 0.40) / (0.35 × 0.40 + 0.55 × 0.60) ≈ 0.30

But what do G=F and D=Y mean? We have not even formally defined them.
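The answer to Question 1 can be checked numerically with Bayes' theorem. This is a minimal sketch: the dictionary names `prior`, `default_rate` and `posterior` are chosen here for illustration, with the values taken from the table in the question.

```python
from fractions import Fraction

# From the table: share of card holders by gender, and default rate within each gender.
prior = {"M": Fraction(60, 100), "F": Fraction(40, 100)}         # P(G = g)
default_rate = {"M": Fraction(55, 100), "F": Fraction(35, 100)}  # P(D = Y | G = g)

# Law of total probability: P(D = Y) = sum over g of P(D = Y | G = g) P(G = g).
p_default = sum(default_rate[g] * prior[g] for g in prior)

# Bayes' theorem: P(G = g | D = Y) for each gender.
posterior = {g: default_rate[g] * prior[g] / p_default for g in prior}

print(p_default)                                   # 47/100
print(posterior["F"], float(posterior["F"]))       # 14/47, roughly 0.30
```

Although women make up 40% of card holders, they account for only about 30% of defaulters, because their default rate is lower than the men's.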