1 ISQS 3358 Business Intelligence Probability and Statistics (review) Zhangxi Lin ISQS 3358 Texas Tech University

Upload: clyde-williamson

Post on 11-Jan-2016

ISQS 3358 Business Intelligence

Probability and Statistics (review)

Zhangxi Lin

ISQS 3358

Texas Tech University

Agenda

Introductory probability and statistics
Data analysis with MS Excel
SAS Analytics
Cases

Probability and Uncertainty

Probability measures the amount of uncertainty of an event: a fact whose occurrence is uncertain.

Consider, as an example, the event R “Tomorrow, April 18th, it will rain in Lubbock”. The occurrence of R is difficult to predict — we have all been victims of wrong forecasts made by the “weather channel” — and we quantify this uncertainty with a number p(R), called the probability of R.

It is common to assume that this number is non-negative and it cannot exceed 1. The two extremes are interpreted as the probability of the impossible event: p(R) = 0, and the probability of the sure event: p(R) = 1. Thus, p(R) = 0 asserts that the event R will not occur while, on the other hand, p(R) = 1 asserts that R will occur with certainty.

Probability and Uncertainty (Cont’d)

Suppose now that you are asked to quote the probability of R, and your answer is p(R) = 0.7. There are two main interpretations of this number.

Subjective probability: 0.7 represents your degree of belief in R, equivalent to odds of 7 to 3 in favor of R. It measures your personal belief in R.

Objective probability: p(R) = 0.7 is interpreted as a relative frequency. Suppose, for instance, that in the last ten years it rained 7 times on April 18th. Then 0.7 = 7/10 is the relative frequency of occurrences of R, also given by the ratio between the favorable cases (7) and all possible cases (10).

There are other interpretations of p(R) = 0.7 arising, for instance, from logic or psychology. Here, we will simply focus on the rules for computing with probabilities.

Sample Space

Definition 1 (Sample Space) The set of all possible events is called the sample space and is denoted by S.

If we denote events by capital letters A, B, …, we write S = {A, B, …}. The identification of the sample space depends on the problem at hand. For instance, in the exercise of forecasting tomorrow's weather, the sample space consists of all meteorological situations: rain (R), sun (S), cloud (C), typhoon (T), etc.

The sample space is a set, on which we define some algebraic operations between events.

Algebraic Operations

Definition 2 (Algebraic Operations) Let A and B be two events of the sample space S. We will denote
“A does not occur” by Ā or ~A;
“either A or B occurs” by A ∪ B;
“both A and B occur” by A ∩ B;
“A occurs and B does not” by A \ B.

The events A and B are exhaustive if A ∪ B = S; in other words, we are sure that either A or B will occur. Thus, in particular, A ∪ ~A = S. The events A and B are exclusive if A ∩ B = ∅, where ∅ is the impossible event, that is, the event whose occurrence is known to be impossible. In this case, we are sure that if A occurs then B cannot. Clearly, we have A ∩ ~A = ∅.
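The operations in Definition 2 map directly onto set operations. The following is a minimal sketch, assuming events are modeled as subsets of a small set of elementary weather outcomes; the outcome names, the events A and B, and the helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical sample space of elementary weather outcomes (assumed for illustration).
S = frozenset({"rain", "sun", "cloud", "typhoon"})


def complement(a, space=S):
    """~A: every outcome of the sample space in which A does not occur."""
    return space - a


def is_exhaustive(a, b, space=S):
    """A and B are exhaustive when their union is the whole sample space."""
    return (a | b) == space


def is_exclusive(a, b):
    """A and B are exclusive when their intersection is the impossible event."""
    return len(a & b) == 0


# Two hypothetical events, each a subset of S.
A = frozenset({"rain", "typhoon"})
B = frozenset({"rain", "cloud"})

a_or_b = A | B          # either A or B occurs
a_and_b = A & B         # both A and B occur
a_not_b = A - B         # A occurs and B does not
not_a = complement(A)   # A does not occur

print(sorted(a_or_b), sorted(a_and_b), sorted(a_not_b), sorted(not_a))
print(is_exhaustive(A, not_a), is_exclusive(A, not_a))
```

An event and its complement are always both exhaustive and exclusive, which is what the last line checks.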

Conditional Probability

P(A|B) = P(A and B) / P(B)

Example: there are 40 female students in a class of 100, and 10 of them are from foreign countries. 20 male students are also foreign students.

Event A: the student is from a foreign country
Event B: the student is female

Suppose one student is picked at random to give a talk in the class.

The probability that the student is female: P(B) = 40 / 100 = 0.4
The probability that the student is from a foreign country: P(A) = (10 + 20) / 100 = 0.3
The probability that the student is female and from a foreign country: P(A and B) = 10 / 100 = 0.1

If a female student is chosen at random to present in the class, the probability that she is a foreign student is P(A|B) = 10 / 40 = 0.25, or equivalently P(A|B) = P(A and B) / P(B) = 0.1 / 0.4 = 0.25.
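The arithmetic of this example can be reproduced in a few lines. This is a small sketch using the class counts from the slide; the variable names and the `conditional` helper are assumptions introduced here, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Counts taken from the class example.
class_size = 100
female = 40
female_foreign = 10
male_foreign = 20


def conditional(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b


p_b = Fraction(female, class_size)                         # P(B): student is female
p_a = Fraction(female_foreign + male_foreign, class_size)  # P(A): student is foreign
p_a_and_b = Fraction(female_foreign, class_size)           # P(A and B)
p_a_given_b = conditional(p_a_and_b, p_b)                  # P(A|B)

print(float(p_b), float(p_a), float(p_a_and_b), float(p_a_given_b))
```

The same value follows from counting within the female students only, since 10 / 40 equals 1/4.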

Venn Diagrams

[Figure: Venn diagram of the sets Female and Foreign student; their intersection is labeled Female foreign student.]

Questions

What is the probability that a student picked at random from the whole class is a female student who is not a foreign student?

What is the probability that a student picked at random from the whole class is a male foreign student?

What is the probability that a student picked at random from the whole class is a male student who is not a foreign student?
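One way to approach these questions is to fill in the full two-by-two table of counts for the class example (100 students, 40 female, 10 female foreign, 20 male foreign) and divide each cell by the class size. The sketch below assumes those counts; the label "domestic" for non-foreign students and the dictionary layout are hypothetical conveniences.

```python
from fractions import Fraction

class_size = 100
females = 40

# Cells given directly in the class example: (gender, origin) -> count.
counts = {
    ("female", "foreign"): 10,
    ("male", "foreign"): 20,
}

# The remaining cells follow by subtraction from the known totals.
counts[("female", "domestic")] = females - counts[("female", "foreign")]
counts[("male", "domestic")] = (class_size - females) - counts[("male", "foreign")]

# Joint probability of each cell relative to the whole class.
joint = {cell: Fraction(n, class_size) for cell, n in counts.items()}

for cell, p in sorted(joint.items()):
    print(cell, float(p))
```

The four joint probabilities must add up to 1, which is a quick check that no student was counted twice or left out.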

Association Analysis Example

Use W, L and C to represent the items “Watch Promo”, “Life Ins Promo”, and “Credit Card Ins”.

List all itemsets based on {W, L, C}.

If (W, C) is not a frequent itemset, which itemsets will be eliminated?

Based on the following table, (a) identify a rule that has the highest support, (b) draw the contingency matrix, and (c) calculate its support, confidence and lift.

ID   Watch Promo (W)   Life Ins Promo (L)   Credit Card Ins. (C)
1    Yes               Yes                  No
2    Yes               Yes                  Yes
3    Yes               Yes                  No
4    No                Yes                  Yes
5    Yes               No                   Yes
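The quantities asked for in this exercise can be computed mechanically from the five rows of the table. The following is a sketch under the usual definitions (support as the share of transactions containing an itemset, confidence as support of the whole rule divided by support of the left-hand side, lift as confidence divided by support of the right-hand side); the function names are assumptions, and the rule W ⇒ L is used only as one example.

```python
from fractions import Fraction
from itertools import combinations

# The five transactions from the table, keyed by ID.
transactions = {
    1: {"W", "L"},
    2: {"W", "L", "C"},
    3: {"W", "L"},
    4: {"L", "C"},
    5: {"W", "C"},
}


def support(itemset):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs):
    """Support of the whole rule divided by support of its left-hand side."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Confidence relative to the baseline frequency of the right-hand side."""
    return confidence(lhs, rhs) / support(rhs)


# All non-empty itemsets over {W, L, C}.
items = ["W", "L", "C"]
itemsets = [set(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

for s in itemsets:
    print(sorted(s), float(support(s)))

print(float(support({"W", "L"})), float(confidence({"W"}, {"L"})), float(lift({"W"}, {"L"})))
```

A lift below 1 for a rule means the right-hand side is slightly less frequent among transactions containing the left-hand side than it is overall.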

Confusion Matrix

Model M1

                      PREDICTED CLASS
                      Yes     No      Total
ACTUAL CLASS   Yes    100     100     200
               No     50      200     250
               Total  150     300

Accuracy rate = 100 / 150 = 0.667. It is a conditional probability: p(Actual Yes|Predicted Yes). This measure is more commonly called precision.

Coverage rate = 100 / 200 = 0.5. It is also a conditional probability: p(Predicted Yes|Actual Yes). This measure is more commonly called recall.
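Both rates can be read off the four cells of the matrix. The sketch below assumes the Model M1 counts from the slide and stores them keyed by (actual, predicted); the function names follow the slide's terms and are otherwise hypothetical.

```python
# Model M1 counts, keyed by (actual class, predicted class).
confusion = {
    ("yes", "yes"): 100,
    ("yes", "no"): 100,
    ("no", "yes"): 50,
    ("no", "no"): 200,
}


def accuracy_rate(m):
    """p(Actual Yes | Predicted Yes): correct Yes predictions over all Yes predictions."""
    predicted_yes = m[("yes", "yes")] + m[("no", "yes")]
    return m[("yes", "yes")] / predicted_yes


def coverage_rate(m):
    """p(Predicted Yes | Actual Yes): correct Yes predictions over all actual Yes cases."""
    actual_yes = m[("yes", "yes")] + m[("yes", "no")]
    return m[("yes", "yes")] / actual_yes


print(round(accuracy_rate(confusion), 3), coverage_rate(confusion))
```

The two rates condition on different totals (the Predicted Yes column versus the Actual Yes row), which is why they differ even though they share the same numerator.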

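Contingency Table

                        Checking Account
                        No       Yes      Total
Saving Account   No     500      3,500    4,000
                 Yes    1,000    5,000    6,000
                 Total                    10,000

Support(SVG ⇒ CK) = 5,000 / 10,000 = 50%

Confidence(SVG ⇒ CK) = 5,000 / 6,000 = 83%

Lift(SVG ⇒ CK) = 0.83 / 0.85 < 1, where 0.85 = 8,500 / 10,000 is the share of all customers with a checking account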
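The three measures follow from the four cell counts alone. This is a minimal sketch assuming the counts shown in the table, keyed by (saving account, checking account); the variable names are introduced here for illustration.

```python
# Customer counts keyed by (has saving account, has checking account).
counts = {
    ("no", "no"): 500,
    ("no", "yes"): 3500,
    ("yes", "no"): 1000,
    ("yes", "yes"): 5000,
}

total = sum(counts.values())

# Marginal probabilities of each account type.
p_svg = (counts[("yes", "no")] + counts[("yes", "yes")]) / total
p_ck = (counts[("no", "yes")] + counts[("yes", "yes")]) / total

# Rule SVG => CK.
support = counts[("yes", "yes")] / total   # customers holding both accounts
confidence = support / p_svg               # checking holders among saving holders
lift = confidence / p_ck                   # confidence relative to the checking baseline

print(total, support, round(confidence, 2), round(lift, 2))
```

Because the lift is below 1, having a saving account makes a checking account slightly less likely than it is in the customer base as a whole, despite the high confidence.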

Arrest Rates

Description: The data set is a record of the arrests per 100,000 people in each age group in the United States from 1970 through 1999.

The variables in the data set are: year (YEAR), arrests per 100 thousand in population (RATE), and the age group (AGEGROUP). The age groups are defined as (1) 14 and under, (2) 15-17, (3) 18-20, (4) 21-24, and (5) 25 or over.

This data set is a subset of the data in the data set totarrests. The data set could be used to generate descriptive statistics by age group or to do a time series analysis to predict the arrest rates by age group. These predictions might be used in assessing the need for judicial system infrastructure changes. Finally, the data could be used to compare age groups with an ANOVA.

150 rows, 3 columns. Statistics: Descriptive Statistics, Graphic Analysis, Time Series

Candy

Description: Since 1994, the United States Food and Drug Administration (FDA) has required uniform, easy-to-read nutrition labeling for nearly all foods. The purpose of the new label is to reduce confusion and help consumers choose more healthful diets.

The United States Department of Agriculture (USDA) and the Department of Health and Human Services (HHS) have teamed up to produce the Food Guide Pyramid, which recommends eating a variety of foods, an appropriate number of calories, and a modest amount of fat. Specifically, 30% or fewer of your total calories per day should be calories from fat, and only a third of those should be calories from saturated fat. For adults consuming 2000 calories per day, that works out to no more than 65 grams of fat, no more than 20 grams of which are saturated fat.

We want to know how many candy bars can fit into this daily diet. We collected nutritional facts for every candy bar we could find. We also included some non-bar candies like M&Ms, Reese’s Pieces, Skittles, and S

75 rows, 17 columns. Statistics: Descriptive Statistics, Graphic Analysis, Confidence Intervals

Cars

Description: The data set contains information on cars such as weight, gas tank size, turning radius, horsepower and engine displacement for 116 cars from different countries.

116 rows, 8 columns. Statistics: Descriptive Statistics, ANOVA, Graphic Analysis

Corn

Description: The data was collected to examine the effect of weather-related phenomena on corn yield.

The data set includes information on the total precipitation (in inches) for the year prior to the start of the growing season, the average daily temperature (in degrees Fahrenheit) for each of the months of May through August, the total rain (in inches) during each of the months June through August, and the corn yield (in bushels per acre).

This information was collected for each of the years 1930 through 1962. The year is also included in the data set. You are interested in determining the relationship between the corn yield and the other variables.

33 rows, 10 columns. Statistics: Regression, Correlation, Graphic Analysis

Auction

Description: This is data from 19 livestock auction markets. The columns include: the number of head of different livestock sold (in thousands) including CATTLE, CALVES, HOGS, and SHEEP, the cost of operation of the auction market (in thousands of dollars) (COST), and the market identifier (MARKETID).

The object is to use multiple linear regression to describe the relationship between the cost of operations and the number of livestock sold in the various classes. COST will be the dependent variable and CATTLE, CALVES, HOGS, and SHEEP the independent variables.

An additional variable, VOLUME, is the total of all major livestock sold in each market. It is the sum of the variables CATTLE, CALVES, HOGS, and SHEEP, and can be used to demonstrate an exact linear dependency between independent variables.

rows, columns. Regression, Graphic Analysis

Control Chart for a Process within Statistical Control

Pareto Chart

Tube Angle

Description: You are in charge of the quality control effort at a bicycle manufacturer that specializes in limited production frames. The most popular model your company produces is a day touring model called the Arribe!, which is a racing-style frame for weekend warriors. The seat tube angle of a bicycle frame can dramatically affect the finished bicycle's handling characteristics. This is the angle formed by the intersection of the tube that holds the seat post with the top horizontal frame tube. A small seat tube angle endows the frame with forgiving handling characteristics. Weekend warriors want frames that are responsive and quick; they prefer frames with steep seat tube angles. The Arribe! is manufactured with these specifications in mind.

The purpose of this analysis is to determine if the manufacturing process is in control. The target angle is 74 degrees, with specification limits of 73.7 and 74.3 degrees.

100 rows, 2 columns. Quality control
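A first pass at this kind of check can be sketched in a few lines. The target and specification limits below come from the slide; the measurements are hypothetical (the actual data set is not reproduced here), and the three-sigma limits use the plain sample standard deviation as a simple stand-in, whereas control charts usually estimate sigma from subgroup ranges or moving ranges.

```python
from statistics import mean, stdev

TARGET = 74.0            # target seat tube angle in degrees (from the slide)
LSL, USL = 73.7, 74.3    # lower and upper specification limits (from the slide)


def out_of_spec(angles):
    """Return (index, angle) pairs that fall outside the specification limits."""
    return [(i, a) for i, a in enumerate(angles) if not (LSL <= a <= USL)]


def three_sigma_limits(angles):
    """Rough control limits: sample mean plus or minus three sample standard deviations."""
    center = mean(angles)
    spread = stdev(angles)
    return center - 3 * spread, center + 3 * spread


# Hypothetical measurements, for illustration only.
angles = [74.0, 73.9, 74.1, 74.2, 73.8, 74.0, 74.4, 73.9]

lower, upper = three_sigma_limits(angles)
print(out_of_spec(angles))
print(round(lower, 3), round(upper, 3))
```

Specification limits describe what the customer will accept, while control limits describe what the process itself typically produces, so a process can be in control and still produce out-of-spec frames, or the reverse.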

Tube Defects

Description: Often it is more cost-effective to simply evaluate whether an item is defective or not. This data is recorded from frame tubes prior to assembly. Frame tubes need to be meticulously filed, mitered and sanded before they are joined into a complete frame. The tube ends are then inspected to ensure that they fit together properly. Rather than base your analyses on each of the measures that affect whether tubes fit together, you will analyze a single characteristic, specifically whether each individual tube is defective or not.

960 rows, 2 columns. Graphic Analysis, Quality Control

Beer Sales

Description: The Beer Sales data set records monthly sales of beer in hectoliters, along with the average high and low temperatures in the region, over a period of five years.

The object is to see how beer sales change over time. You can also consider the relationships between beer sales and temperatures.

60 rows, 4 columns. Regression, Time Series

College

Description: The data is a collection of information on colleges and universities gathered in the early 1990s.

The primary interest is in predicting graduation rates, the percent of students who graduate from the institution in four years. Potential predictor variables are tuition, type of college (public or private), and region of the country.

200 rows, 6 columns. Regression, ANOVA, ANCOVA, Graphic Analysis

Oranges

Description: The data are from a study of the relationship between the price of oranges and sales per customer.

The hypothesis is that sales vary as a function of price differences for different stores (STORE) and days of the week (DAY).

The price is varied daily for two varieties of oranges. The variables P1 and P2 denote the prices for the two varieties, respectively.

The variables Q1 and Q2 are the sales per customer of the corresponding varieties. Q1 and Q2 are used as the dependent variables, with STORE, DAY, P1, and P2 as the independent variables.

36 rows, 3 columns. Usage: ANCOVA, Graphic analysis