statistics with r


Upload: ruruchowdhury

Post on 07-May-2015


Category:

Education



DESCRIPTION

Praxis Weekend Business Analytics Program Sessions 1 to 3

TRANSCRIPT

Page 1: Statistics with R
Page 2: Statistics with R

> x=11

> print(x)

[1] 11

> x

[1] 11

> X

Error: object 'X' not found

> y<-7

> y

[1] 7

> y<-9

> y

[1] 9

> ls()

[1] "x" "y"

> rm(y)

> y

Error: object 'y' not found

> y<-9

> x.1<-14

> x.1

[1] 14

> 1x<-22

Error: unexpected symbol in "1x"

Page 3: Statistics with R
Page 4: Statistics with R

Entering data with c

• c function for small datasets – combines or concatenates terms together

Example: we have a count of the number of typing mistakes of a word document:

0 2 1 3 2 0 1 1

To enter this into an R session we go like this:

> typo=c(0,2,1,3,2,0,1,1)

> typo

[1] 0 2 1 3 2 0 1 1
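Because c combines its arguments, it can also join whole vectors. The short sketch below reuses the typo counts from the slide; the vector `more_typo` is a hypothetical addition, invented only to show concatenation.

```r
# Counts of typing mistakes, as entered on the slide
typo <- c(0, 2, 1, 3, 2, 0, 1, 1)

# Hypothetical extra counts, invented only to illustrate concatenation
more_typo <- c(4, 0)

# c() joins whole vectors as well as single values
all_typo <- c(typo, more_typo)

length(typo)      # number of values entered
length(all_typo)  # number of values after joining
sum(typo)         # total number of mistakes
```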

Page 5: Statistics with R

Learning Objectives

• What is statistics?

• Become aware of the varied applications of statistics in business.

• Differentiate between descriptive and inferential statistics.

• Identify types of variables.

Page 6: Statistics with R

Statistics in Business

• Accounting: auditing and cost estimation

• Economics: local, regional, national, and international economic performance

• Finance: investments and portfolio management

• Management: human resources, compensation, and quality management

• Management Information Systems: performance of systems which gather, summarize, and disseminate information to various managerial levels

• Marketing: market analysis and consumer research

• International Business: market and demographic analysis

Page 7: Statistics with R

What is Statistics?

• Science dealing with collection, analysis, interpretation and presentation of data (with a view to making inferences)

• Branches of statistics:

– Descriptive – graphical or numerical summaries of data

– Inferential – making a decision based on data

Page 8: Statistics with R

What is Statistics?

Statistics in business is the study of VARIATIONS

Page 9: Statistics with R

Population Versus Sample

• Population: the whole

– a collection of all persons, objects, or items under study

• Census: gathering data from the entire population

• Sample: gathering data on a subset of the population

– Use information about the sample to infer about the population

Page 10: Statistics with R

Population Versus Sample

Page 11: Statistics with R

Population and Census Data

Identifier Color MPG

RD1 Red 12

RD2 Red 10

RD3 Red 13

RD4 Red 10

RD5 Red 13

BL1 Blue 27

BL2 Blue 24

GR1 Green 35

GR2 Green 35

GY1 Gray 15

GY2 Gray 18

GY3 Gray 17

Page 12: Statistics with R

Sample and Sample Data

Identifier Color MPG

RD2 Red 10

RD5 Red 13

GR1 Green 35

GY2 Gray 18
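A minimal sketch of the same idea in R: the census table becomes a data frame, and the four cars shown above form the sample. The names `fleet` and `smp` are assumptions made for this illustration.

```r
# Population (census) data from the slide: every car under study
fleet <- data.frame(
  id    = c("RD1", "RD2", "RD3", "RD4", "RD5", "BL1",
            "BL2", "GR1", "GR2", "GY1", "GY2", "GY3"),
  color = c(rep("Red", 5), rep("Blue", 2), rep("Green", 2), rep("Gray", 3)),
  mpg   = c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
)

# The sample shown on the slide
smp <- fleet[fleet$id %in% c("RD2", "RD5", "GR1", "GY2"), ]

mean(fleet$mpg)  # computed on the whole population
mean(smp$mpg)    # computed on the sample only

# A fresh random sample of 4 cars; the rows drawn vary from run to run
fleet[sample(nrow(fleet), 4), ]
```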

Page 13: Statistics with R

Population Versus Sample

[Diagram: select a random sample from the population (parameter); calculate the sample statistic to estimate the population parameter]

Page 14: Statistics with R

Parameter vs. Statistic

• Parameter: descriptive measure of the population

– Usually represented by Greek letters

• Statistic: descriptive measure of a sample

– Usually represented by Roman letters

μ denotes the population mean (a population parameter)

σ² denotes population variance

σ denotes population standard deviation

x̄ denotes sample mean

s² denotes sample variance

s denotes sample standard deviation

Page 15: Statistics with R

Statistics in Business

• Inferences about parameters are made under conditions of uncertainty (which are always present in statistics)

– Uncertainty can be caused by

• Randomness in selection of a sample

• Lack of knowledge about the source of the inferences

• Change in conditions not accounted for

Page 16: Statistics with R

Variables and Data

Variable: a characteristic of any entity being studied that is capable of taking on different values that can be used for analysis

e.g. stock price, ROI, market share, age of worker, income of a family, total sales, advertising cost, etc.

Measurement: is done when a standard process is used to assign numbers to particular characteristics of a variable; the process may be obvious or defined

e.g. age is obvious but ROI or labour productivity is defined

The source of each measurement is called a sampling unit

Data: recorded measurements

Page 17: Statistics with R

Levels of Data Measurement

What are 40 and 80? They may represent:

• Weights of two objects being shipped

• Ratings received in a consumer test by two different products

• Football jersey numbers of a fullback and a centre-forward

Appropriateness of data analysis depends on the level of measurement of the data gathered

Page 18: Statistics with R

Levels of Data Measurement

• Nominal: qualitative data; numbers are typically used only to classify or categorize the attribute. It is useful to retain the original verbal descriptions of the categories.

– 1 for “male” and 2 for “female”

– Employee identification number

– Religion, geographic location, PIN code, place of birth

– Demographic questions in a survey, etc.

Page 19: Statistics with R

Levels of Data Measurement

• Ordinal: a variable is measured on an ordinal scale if its values can be ranked or ordered.

– For example, a gold medal reflects superior performance to a silver or bronze medal in the Olympics. But can you say a gold and a bronze medal average out to a silver medal?

– Preference scales are typically ordinal: how much do you like this cereal? Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.

Page 20: Statistics with R

Levels of Data Measurement

• Interval: in interval measurement the distance between attributes does have meaning.

– Numerical data typically fall into this category

– For example, when measuring temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable.

Page 21: Statistics with R

Levels of Data Measurement

• Ratio: in ratio measurement there is always a reference point that is meaningful (either 0 for rates or 1 for ratios)

– This means that you can construct a meaningful fraction (or ratio) with a ratio variable.

– In applied social research most "count" variables are ratio, for example, the number of clients in the past six months.
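One way to carry the four levels of measurement into R is sketched below. All values are hypothetical and chosen only for illustration; the point is which operations make sense at each level.

```r
# Nominal: labels only, no order; counting categories is meaningful
color <- factor(c("Red", "Blue", "Green", "Red"))
table(color)

# Ordinal: ordered categories (the medal example from the slide)
medal <- factor(c("Gold", "Bronze", "Silver"),
                levels = c("Bronze", "Silver", "Gold"), ordered = TRUE)
medal[1] > medal[2]   # ranking works: Gold is above Bronze
# mean(medal) is not meaningful; R returns NA with a warning

# Interval: differences are meaningful (temperatures in Fahrenheit)
temp_f <- c(30, 40, 70, 80)
(temp_f[2] - temp_f[1]) == (temp_f[4] - temp_f[3])

# Ratio: a meaningful zero makes ratios meaningful (counts of clients)
clients <- c(10, 20)
clients[2] / clients[1]   # the second count is twice the first
```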

Page 22: Statistics with R

Visualizing the data

• Construct a frequency distribution

– For both grouped and ungrouped data

• Construct graphical summaries of qualitative data

• Construct graphical summaries of quantitative data

• Construct graphical summaries of two variables

Page 23: Statistics with R

Ungrouped vs. Grouped Data

• Ungrouped data

– have not been summarized in any way

– are also called raw data

• Grouped data

– logical groupings of data exist

• e.g. age ranges (20-29, 30-39, etc.)

– have been organized into a frequency distribution

Page 24: Statistics with R

Example of Ungrouped Data

42 30 53 50 52 30 55 49 61 74

26 58 40 40 28 36 30 33 31 37

32 37 30 32 23 32 58 43 30 29

34 50 47 31 35 26 64 46 40 43

57 30 49 40 25 50 52 32 60 54

Ages of a sample of Managers from Urban Child Care Centres in US

Page 25: Statistics with R

Frequency Distribution

• Frequency Distribution – summary of data presented in the form of class intervals and frequencies

– Vary in shape and design

– Constructed according to the individual researcher's preferences

Page 26: Statistics with R

Frequency Distribution

• Steps in Frequency Distribution

– Step 1 – Determine the range of the frequency distribution

• Range is the difference between the highest and the lowest numbers

– Step 2 – Determine the number of classes

• Do not use too many or too few classes

– Step 3 – Determine the width of the class interval

• Approximate class width can be calculated by dividing the range by the number of classes

• Values fit into only one class
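The three steps can be sketched in R with the managers' ages from the earlier slide. The choice of 6 classes and a width of 10 matches the table that follows; the variable names are assumptions made for this illustration.

```r
# Ages of the 50 child care managers from the slide
ages <- c(42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
          26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
          32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
          34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
          57, 30, 49, 40, 25, 50, 52, 32, 60, 54)

# Step 1: range = highest value - lowest value
data_range <- max(ages) - min(ages)

# Step 2: choose the number of classes (6 is a choice, not a rule)
n_classes <- 6

# Step 3: approximate width, then round up to a convenient number
approx_width <- data_range / n_classes
class_width <- 10

# Left-closed classes [20,30), [30,40), ... so each value fits only one class
breaks <- seq(20, 80, by = class_width)
freq <- table(cut(ages, breaks = breaks, right = FALSE))
freq
```

Setting `right = FALSE` is what makes a class read as "20-under 30": the lower endpoint is included and the upper endpoint is not.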

Page 27: Statistics with R

Frequency Distribution of ChildCare Manager’s Ages

Class Interval    Frequency

20-under 30       6

30-under 40       18

40-under 50       11

50-under 60       11

60-under 70       3

70-under 80       1

Page 28: Statistics with R

Relative Frequency

Class Interval    Frequency    Relative Frequency

20-under 30       6            .12

30-under 40       18           .36

40-under 50       11           .22

50-under 60       11           .22

60-under 70       3            .06

70-under 80       1            .02

Total             50           1.00

Relative frequency is the proportion of the total frequency that is in any given class interval in a frequency distribution. For example, the relative frequency of the first class is 6/50 = .12 and that of the second class is 18/50 = .36.
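As a quick check of the table, the relative frequencies can be computed from the class frequencies. This is a minimal sketch; the vector name `freq` is an assumption.

```r
# Class frequencies from the frequency distribution of managers' ages
freq <- c("20-under 30" = 6, "30-under 40" = 18, "40-under 50" = 11,
          "50-under 60" = 11, "60-under 70" = 3, "70-under 80" = 1)

# Each frequency divided by the total frequency
rel_freq <- freq / sum(freq)
rel_freq

# The relative frequencies of all classes add up to 1
sum(rel_freq)
```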

Page 29: Statistics with R

Cumulative Frequency

Class Interval    Frequency    Cumulative Frequency

20-under 30       6            6

30-under 40       18           24

40-under 50       11           35

50-under 60       11           46

60-under 70       3            49

70-under 80       1            50

Total             50

Cumulative frequency is a running total of frequencies through the classes of a frequency distribution. For example, 18 + 6 = 24 and 11 + 24 = 35.
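The running total is what the base R function `cumsum` computes. A minimal sketch with the class frequencies:

```r
# Class frequencies from the frequency distribution of managers' ages
freq <- c(6, 18, 11, 11, 3, 1)

# Running total: 6, 6 + 18, 6 + 18 + 11, and so on
cum_freq <- cumsum(freq)
cum_freq
```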

Page 30: Statistics with R

Cumulative Relative Frequencies

Class Interval    Frequency    Relative Frequency    Cumulative Frequency    Cumulative Relative Frequency

20-under 30       6            .12                   6                       .12

30-under 40       18           .36                   24                      .48

40-under 50       11           .22                   35                      .70

50-under 60       11           .22                   46                      .92

60-under 70       3            .06                   49                      .98

70-under 80       1            .02                   50                      1.00

Total             50           1.00

Cumulative relative frequency is a running total of the relative frequencies through the classes of a frequency distribution.
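The whole table can be assembled in one data frame. This is a sketch only; the column names are assumptions chosen for readability.

```r
# Class frequencies from the frequency distribution of managers' ages
freq <- c(6, 18, 11, 11, 3, 1)

dist <- data.frame(
  class        = c("20-under 30", "30-under 40", "40-under 50",
                   "50-under 60", "60-under 70", "70-under 80"),
  freq         = freq,
  rel_freq     = freq / sum(freq),          # proportion in each class
  cum_freq     = cumsum(freq),              # running total of frequencies
  cum_rel_freq = cumsum(freq) / sum(freq)   # running total of proportions
)
dist
```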

Page 31: Statistics with R

Common Statistical Graphs – Quantitative Data

• Histogram -- vertical bar chart of frequencies

• Frequency Polygon -- line graph of frequencies

• Ogive -- line graph of cumulative frequencies

• Dot Plots -- each data value is plotted

• Stem and Leaf Plot -- like a histogram, but shows individual data values. Useful for small data sets.

Page 32: Statistics with R

Histogram

• A histogram is a graphical summary of a frequency distribution

• It is constructed by labeling the x-axis with class endpoints and the y-axis with frequencies, then drawing a horizontal line between the two class endpoints of each class at its frequency value

• The number and location of rectangles (bars) should be determined based on the sample size and the range of the data
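In R, `hist` builds the histogram. The sketch below uses `plot = FALSE` so that it only returns the class endpoints and bar heights; the breaks at 20, 30, ..., 80 are the same choice as in the age tables.

```r
# Ages of the 50 child care managers from the slide
ages <- c(42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
          26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
          32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
          34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
          57, 30, 49, 40, 25, 50, 52, 32, 60, 54)

# right = FALSE gives left-closed classes such as "20-under 30"
h <- hist(ages, breaks = seq(20, 80, by = 10), right = FALSE, plot = FALSE)

h$breaks  # class endpoints for the x-axis
h$counts  # bar heights (class frequencies)
```

Dropping `plot = FALSE`, or calling `plot(h)`, draws the chart itself.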

Page 33: Statistics with R

Data Range

(Data: the ages of the 50 managers listed under Example of Ungrouped Data.)

Smallest = 23

Largest = 74

Range = Largest - Smallest

= 74 - 23

= 51

Page 34: Statistics with R

Number of Classes and Class Width

• The number of classes should be between 5 and 15.

– Fewer than 5 classes cause excessive summarization.

– More than 15 classes leave too much detail.

• Class Width

– Divide the range by the number of classes for an approximate class width

– Round up to a convenient number

Page 35: Statistics with R

Class midpoint or Class mark

The midpoint of each class interval is called the class midpoint or the class mark.
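A class midpoint is the average of the two class endpoints. A minimal sketch for the age classes used in this deck (the names `lower` and `upper` are assumptions):

```r
# Endpoints of the classes 20-under 30, 30-under 40, ..., 70-under 80
lower <- seq(20, 70, by = 10)
upper <- lower + 10

# Class midpoint (class mark) = average of the two endpoints
midpoint <- (lower + upper) / 2
midpoint
```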

Page 36: Statistics with R

Midpoints for Age Classes

Class Interval    Frequency    Midpoint    Relative Frequency    Cumulative Frequency

20-under 30       6            25          .12                   6

30-under 40       18           35          .36                   24

40-under 50       11           45          .22                   35

50-under 60       11           55          .22                   46

60-under 70       3            65          .06                   49

70-under 80       1            75          .02                   50

Total             50                       1.00


Page 38: Statistics with R

Histogram

Class Interval    Frequency

20-under 30       6

30-under 40       18

40-under 50       11

50-under 60       11

60-under 70       3

70-under 80       1

Page 39: Statistics with R

Frequency Polygon

Class Interval    Frequency

20-under 30       6

30-under 40       18

40-under 50       11

50-under 60       11

60-under 70       3

70-under 80       1

A graphical display of class frequencies

[Figure: frequency polygon; x-axis: Years (0 to 80); y-axis: Frequency]

Page 40: Statistics with R

Relative Frequency Ogive

Class Interval    Cumulative Relative Frequency

20-under 30       .12

30-under 40       .48

40-under 50       .70

50-under 60       .92

60-under 70       .98

70-under 80       1.00
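The points behind the frequency polygon and the ogive can be computed as below. This is a sketch under two assumptions: the polygon is drawn through the class midpoints, and the ogive starts at 0 at the lowest class endpoint and is plotted at the upper endpoints.

```r
# Class frequencies and endpoints for the managers' ages
freq  <- c(6, 18, 11, 11, 3, 1)
lower <- seq(20, 70, by = 10)
upper <- lower + 10

# Frequency polygon: frequency plotted against class midpoint
poly_x <- (lower + upper) / 2
poly_y <- freq

# Ogive: cumulative relative frequency plotted against upper class endpoint,
# with a starting point of 0 at the first lower endpoint
ogive_x <- c(lower[1], upper)
ogive_y <- c(0, cumsum(freq) / sum(freq))

data.frame(ogive_x, ogive_y)
```

In an interactive session, `plot(poly_x, poly_y, type = "b")` and `plot(ogive_x, ogive_y, type = "b")` would draw the two line graphs.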

Page 41: Statistics with R

Stem and Leaf plot: Safety Examination Scores for Plant Trainees

Raw Data

86 76 23 77 81 79 68

77 92 59 68 75 83 49

91 47 72 82 74 70 56

60 88 75 97 39 78 94

55 67 83 89 67 91 81

Stem    Leaf

2       3

3       9

4       7 9

5       5 6 9

6       0 7 7 8 8

7       0 2 4 5 5 6 7 7 8 9

8       1 1 2 3 3 6 8 9

9       1 1 2 4 7
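The same plot can be rebuilt in R by splitting each score into its tens digit (stem) and units digit (leaf). The manual construction below is a sketch; base R also has a built-in `stem` function, whose text layout may differ from the slide.

```r
# Safety examination scores from the slide
scores <- c(86, 76, 23, 77, 81, 79, 68, 77, 92, 59, 68, 75,
            83, 49, 91, 47, 72, 82, 74, 70, 56, 60, 88, 75,
            97, 39, 78, 94, 55, 67, 83, 89, 67, 91, 81)

stems  <- scores %/% 10   # tens digit
leaves <- scores %% 10    # units digit

# Group the leaves by stem and sort them within each stem
plot_rows <- lapply(split(leaves, stems), sort)
plot_rows

# Built-in text display of a stem and leaf plot
stem(scores)
```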

Page 42: Statistics with R

Stem and Leaf plot: Construction

Each score is split into a stem (its tens digit) and a leaf (its units digit), and the leaves are listed in order beside their stem. For example, the score 86 has stem 8 and leaf 6, and the score 23 has stem 2 and leaf 3. The raw data and the finished plot are the same as on the previous slide.

Page 43: Statistics with R

Histogram vs. Stem and Leaf?

• So, which one should you use?

• A Stem and Leaf plot is useful for small data sets. It shows the values of the data points.

• A histogram foregoes seeing the individual values of the data for the bigger picture of the distribution of the data

• The purpose of these graphs is to summarize a set of data. As long as that need is met, either one is okay to use.

Page 44: Statistics with R

Common Statistical Graphs – Qualitative Data

• Pie Chart -- proportional representation for categories of a whole

• Bar Chart – frequency or relative frequency of one or more categorical variables

Page 45: Statistics with R

Complaints by Amtrak Passengers

COMPLAINT            NUMBER    PROPORTION    DEGREES

Stations, etc.       28,000    .40           144.0

Train Performance    14,700    .21           75.6

Equipment            10,500    .15           54.0

Personnel            9,800     .14           50.4

Schedules, etc.      7,000     .10           36.0

Total                70,000    1.00          360.0
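The PROPORTION and DEGREES columns follow from the counts: each proportion is the count divided by the total, and each slice takes that share of 360 degrees. A minimal sketch (the vector name `complaints` is an assumption):

```r
# Complaint counts from the table
complaints <- c("Stations, etc." = 28000, "Train Performance" = 14700,
                "Equipment" = 10500, "Personnel" = 9800,
                "Schedules, etc." = 7000)

proportion <- complaints / sum(complaints)  # share of all complaints
degrees    <- proportion * 360              # angle of each pie slice

data.frame(number = complaints, proportion = proportion, degrees = degrees)
```

In an interactive session, `pie(complaints)` would draw the chart from the same counts.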

Page 46: Statistics with R

Complaints by Amtrak Passengers

Page 47: Statistics with R

Second Quarter U.S. Truck Production

Second Quarter Truck Production in the U.S. (Hypothetical values)

Company    2d Quarter Truck Production

A          357,411

B          354,936

C          160,997

D          34,099

E          12,747

Totals     920,190

Page 48: Statistics with R

Second Quarter U.S. Truck Production

Page 49: Statistics with R

Pie Chart Calculations for Company A

Company    2d Quarter Truck Production    Proportion    Degrees

A          357,411                        .388          140

B          354,936                        .386          139

C          160,997                        .175          63

D          34,099                         .037          13

E          12,747                         .014          5

Totals     920,190                        1.000         360
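The rounded proportions and degrees in this table can be reproduced as follows. This is a sketch; `production` is an assumed name for the slide's hypothetical values.

```r
# Second quarter truck production by company (hypothetical values from the slide)
production <- c(A = 357411, B = 354936, C = 160997, D = 34099, E = 12747)

proportion <- production / sum(production)

round(proportion, 3)     # proportions to three decimal places
round(proportion * 360)  # slice angles to the nearest degree
```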

Page 50: Statistics with R

Vertical Bar Graphs or Column Charts

[Figure: vertical bar graph (column chart); x-axis: years 2010 to 2013; y-axis: 0 to 6; series: Kolkata, Mumbai, Chennai]

Page 51: Statistics with R

Horizontal Bar Chart

[Figure: horizontal bar chart; y-axis: years 2010 to 2013; x-axis: 0 to 6; series: Chennai, Mumbai, Kolkata]

Page 52: Statistics with R

Pareto Chart

A Pareto chart is a bar chart, sorted from the most frequent to the least frequent, overlaid with a cumulative line graph (like an ogive). These data present the most common types of defects.

[Figure: Pareto chart of defect types (Poor Wiring, Short in Coil, Defective Plug, Other); left axis: Frequency, 0 to 100; right axis: cumulative percentage, 0% to 100%]

Page 53: Statistics with R

Scatter Plot

Registered Vehicles (1000's)    Gasoline Sales (1000's of Gallons)

5 60

15 120

9 90

15 140

7 60
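The table above can be entered as two vectors. The sketch below also computes the coefficient of correlation, which measures how closely the points follow a straight line; the names `vehicles` and `sales` are assumptions.

```r
# Registered vehicles (1000's) and gasoline sales (1000's of gallons)
vehicles <- c(5, 15, 9, 15, 7)
sales    <- c(60, 120, 90, 140, 60)

# One row per observation, as in the slide's table
gas <- data.frame(vehicles, sales)
gas

# Coefficient of correlation between the two variables
cor(vehicles, sales)
```

In an interactive session, `plot(vehicles, sales)` would draw the scatter plot, with vehicles on the horizontal axis and sales on the vertical axis.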

Page 54: Statistics with R

Common Statistical Graphs – Comparing Two Variables

• Scatter Plot -- type of display using Cartesian coordinates to display values for two variables for a set of data.

– The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

– A scatter plot is also called a scatter chart, scatter diagram and scatter graph.

Page 55: Statistics with R

Measures of Central Tendency & Dispersion:

Learning Objectives

• Distinguish between measures of central tendency, measures of variability, measures of shape, and measures of association.

• Understand the meanings of mean, median, mode, quartile, percentile, and range.

• Compute mean, median, mode, percentile, quartile, range, variance, standard deviation, and mean absolute deviation on ungrouped data.

• Differentiate between sample and population variance and standard deviation.

Page 56: Statistics with R

Measures of Central Tendency & Dispersion:

Learning Objectives - continued

• Understand the meaning of standard deviation as it is applied by using the empirical rule and Chebyshev’s theorem.

• Compute the mean, median, standard deviation, and variance on grouped data.

• Understand box and whisker plots, skewness, and kurtosis.

• Compute a coefficient of correlation and interpret it.

Page 57: Statistics with R

Measures of Central Tendency: Ungrouped Data

• Measures of central tendency yield information about “the centre, or middle part, of a group of numbers.”

• Measures of central tendency do not focus on the span of the data set or how far values are from the middle numbers

• Common Measures of Location

– Mode

– Median

– Mean

– Percentiles

– Quartiles

Page 58: Statistics with R

Mode

• Mode - the most frequently occurring value in a data set

– Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)

– Can be used to determine what categories occur most frequently

– Sometimes, no mode exists (no duplicates)

• Bimodal -- In a tie for the most frequently occurring value, two modes are listed

• Multimodal -- Data sets that contain more than two modes
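Base R has no built-in function for the statistical mode (the function named `mode` reports an object's storage type instead). The helper `stat_modes` below is a hypothetical sketch built on `table`; it returns the modes as text labels and NULL when no value repeats.

```r
# Hypothetical helper: all values tied for the highest frequency
stat_modes <- function(x) {
  counts <- table(x)
  if (max(counts) == 1) return(NULL)   # no duplicates, so no mode
  names(counts)[as.vector(counts == max(counts))]
}

# Typing-mistake counts from the earlier slide: a single mode
stat_modes(c(0, 2, 1, 3, 2, 0, 1, 1))

# Nominal data with a tie (hypothetical values): bimodal
stat_modes(c("Red", "Blue", "Red", "Blue", "Green"))

# No repeated value: no mode
stat_modes(c(1, 2, 3))
```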

Page 59: Statistics with R

Median

• Median - middle value in an ordered array of numbers.

– Half the data are above it, half the data are below it

– Mathematically, it is the (n+1)/2 th ordered observation

• For an array with an odd number of terms, the median is the middle number

– n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation

• For an array with an even number of terms, the median is the average of the middle two numbers

– n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th ordered observations
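The (n+1)/2 rule can be written out directly and compared with R's built-in `median`. The function `median_by_position` and both data vectors are hypothetical, made up for this illustration.

```r
# Median as the (n+1)/2 th ordered observation
median_by_position <- function(x) {
  s <- sort(x)
  pos <- (length(s) + 1) / 2
  # For odd n both indices are the same; for even n they are the middle two
  mean(s[c(floor(pos), ceiling(pos))])
}

odd_n  <- c(3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16)  # n = 11, uses the 6th value
even_n <- c(3, 4, 5, 7, 8, 9, 11, 14, 15, 16)      # n = 10, averages 5th and 6th

median_by_position(odd_n)
median_by_position(even_n)

# The built-in function gives the same answers
median(odd_n)
median(even_n)
```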

Page 60: Statistics with R

Arithmetic Mean

• Mean is the average of a group of numbers

• Applicable for interval and ratio data

• Not applicable for nominal or ordinal data

• Affected by each value in the data set, including extreme values

• Computed by summing all values in the data set and dividing the sum by the number of values in the data set

Page 61: Statistics with R

Demonstration Problem

The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows.

Company / Number of Cars in Service

Enterprise 643,000; Hertz 327,000; National/Alamo 233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget 144,000; Advantage 20,000; U-Save 12,000; Payless 10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000; Triangle 6,000

Compute the mode, the median, and the mean.

Page 62: Statistics with R

Demonstration Problem: Solution

Mode: 9,000 (two companies with 9,000 cars in service)

Median: With 13 different companies in this group, N = 13. The median is located at the (13 +1)/2 = 7th position. Because the data are already ordered, median is the 7th term, which is 20,000.

Mean: μ = ∑x/N = (1,791,000/13) = 137,769.23
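The solution can be checked in R. This sketch enters the thirteen values from the problem; the mode is found with `table` because base R has no built-in mode function for data.

```r
# Number of cars in service for the 13 rental companies
cars_in_service <- c(Enterprise = 643000, Hertz = 327000,
                     "National/Alamo" = 233000, Avis = 204000,
                     "Dollar/Thrifty" = 167000, Budget = 144000,
                     Advantage = 20000, "U-Save" = 12000, Payless = 10000,
                     ACE = 9000, Fox = 9000, "Rent-A-Wreck" = 7000,
                     Triangle = 6000)

# Mode: the value with the highest count
counts <- table(cars_in_service)
mode_value <- as.numeric(names(counts)[as.vector(counts == max(counts))])
mode_value

median(cars_in_service)  # 7th of the 13 ordered values
mean(cars_in_service)    # sum of the values divided by 13
```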

Page 63: Statistics with R

Percentile

• Percentile - measures of central tendency that divide a group of data into 100 parts

• At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile

• Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it

Page 64: Statistics with R

Calculating Percentiles

• To calculate the pth percentile:

– Order the data

– Calculate i = N (p/100)

– Determine the percentile

• If i is a whole number, then use the average of the ith and (i+1)th ordered observations

• Otherwise, round i up to the next whole number and use the ordered observation at that position
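The rule above can be sketched as a small R function. The name `percentile` is hypothetical, and the function is limited to 0 < p < 100 so that the (i+1)th observation always exists. R's own `quantile` uses a different interpolation rule by default; its `type = 2` option appears to match this slide's rule, but that correspondence is an assumption and worth checking against the R documentation.

```r
# pth percentile by the slide's rule
percentile <- function(x, p) {
  stopifnot(p > 0, p < 100)
  s <- sort(x)                 # order the data
  i <- length(s) * p / 100     # i = N (p/100)
  if (i == floor(i)) {
    mean(s[c(i, i + 1)])       # whole number: average the ith and (i+1)th
  } else {
    s[ceiling(i)]              # otherwise: round i up and use that position
  }
}

# Cars in service for the 13 rental companies from the demonstration problem
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
                     12000, 10000, 9000, 9000, 7000, 6000)

percentile(cars_in_service, 25)   # i = 3.25, so the 4th ordered observation
percentile(cars_in_service, 75)   # i = 9.75, so the 10th ordered observation

# For comparison (assumed to follow the same rule)
quantile(cars_in_service, probs = c(0.25, 0.75), type = 2)
```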

Page 65: Statistics with R

Quartiles

• Quartile - measures of central tendency that divide a group of data into four subgroups

• Q1: 25% of the data set is below the first quartile

• Q2: 50% of the data set is below the second quartile

• Q3: 75% of the data set is below the third quartile

[Figure: the ordered data divided by Q1, Q2 and Q3 into four groups of 25% each]

Page 66: Statistics with R

Quartiles for Demonstration Problem

For the cars in service data, n=13, so

Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation

Q1 = 9,000

Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation

Q3 = 204,000

Page 67: Statistics with R

Which Measure Do I Use?

• Which measure of central tendency is most appropriate?

– In general, the mean is preferred, since it has nice mathematical properties, which we shall discuss later

– The median and quartiles are resistant to outliers

• Consider the following three datasets

– 1, 2, 3 (median=2, mean=2)

– 1, 2, 6 (median=2, mean=3)

– 1, 2, 30 (median=2, mean=11)

– All have median=2, but the mean is sensitive to the outliers

• In general, if there are outliers, the median is preferred to the mean
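The three datasets can be compared directly in R. A minimal sketch:

```r
# The three datasets from the slide; only the largest value changes
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

sapply(datasets, median)  # the median is the same for all three
sapply(datasets, mean)    # the mean moves with the largest value
```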

……….. To continue