Statistics with R
> x=11
> print(x)
[1] 11
> x
[1] 11
> X
Error: object 'X' not found
> y<-7
> y
[1] 7
> y<-9
> y
[1] 9
> ls()
[1] "x" "y"
> rm(y)
> y
Error: object 'y' not found
> y<-9
> x.1<-14
> x.1
[1] 14
> 1x<-22
Error: unexpected symbol in "1x"
Entering data with c
• c function for small datasets – combines or concatenates terms together
Example: we have a count of the number of typing mistakes in a Word document:
0 2 1 3 2 0 1 1
To enter this into an R session, we type:
> typo=c(0,2,1,3,2,0,1,1)
> typo
[1] 0 2 1 3 2 0 1 1
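Once the counts are stored in a vector, base R functions can summarize them. The following is a minimal sketch that uses only the typo counts from the example above:

```r
# Typing-mistake counts from the example above
typo <- c(0, 2, 1, 3, 2, 0, 1, 1)

length(typo)  # number of observations
sum(typo)     # total number of mistakes
typo[2]       # second element of the vector
```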
Learning Objectives
• What is statistics?
• Become aware of the varied applications of statistics in business.
• Differentiate between descriptive and inferential statistics.
• Identify types of variables.
Statistics in Business
• Accounting: auditing and cost estimation
• Economics: local, regional, national, and international economic performance
• Finance: investments and portfolio management
• Management: human resources, compensation, and quality management
• Management Information Systems: performance of systems which gather, summarize, and disseminate information to various managerial levels
• Marketing: market analysis and consumer research
• International Business: market and demographic analysis
What is Statistics?
• Science dealing with the collection, analysis, interpretation, and presentation of data (with a view to making inferences)
• Branches of statistics:
  – Descriptive: graphical or numerical summaries of data
  – Inferential: making a decision based on data
What is Statistics?
Statistics in business is the study of VARIATIONS
Population Versus Sample
• Population: the whole
  – a collection of all persons, objects, or items under study
• Census: gathering data from the entire population
• Sample: gathering data on a subset of the population
  – Use information about the sample to infer about the population
Population Versus Sample
Population and Census Data
Identifier Color MPG
RD1 Red 12
RD2 Red 10
RD3 Red 13
RD4 Red 10
RD5 Red 13
BL1 Blue 27
BL2 Blue 24
GR1 Green 35
GR2 Green 35
GY1 Gray 15
GY2 Gray 18
GY3 Gray 17
Sample and Sample Data
Identifier Color MPG
RD2 Red 10
RD5 Red 13
GR1 Green 35
GY2 Gray 18
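Drawing a sample from the census table can be sketched in R as follows. The data frame name `cars`, its column names, and the seed are assumptions made for illustration; `sample()` draws row numbers without replacement, so a different seed gives a different sample.

```r
# Population (census) data from the table above
cars <- data.frame(
  id    = c("RD1", "RD2", "RD3", "RD4", "RD5", "BL1", "BL2",
            "GR1", "GR2", "GY1", "GY2", "GY3"),
  color = c(rep("Red", 5), rep("Blue", 2), rep("Green", 2), rep("Gray", 3)),
  mpg   = c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
)

set.seed(1)                            # arbitrary seed, only for reproducibility
rows <- sample(nrow(cars), size = 4)   # pick 4 of the 12 rows without replacement
cars_sample <- cars[rows, ]

mean(cars$mpg)          # computed from the whole population
mean(cars_sample$mpg)   # computed from the sample; changes from sample to sample
```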
Population Versus Sample
[Diagram: select a random sample from the population; calculate the sample statistic to estimate the population parameter]
Parameter vs. Statistic
• Parameter: descriptive measure of the population
  – Usually represented by Greek letters
• Statistic: descriptive measure of a sample
  – Usually represented by Roman letters
μ denotes the population mean
σ² denotes the population variance
σ denotes the population standard deviation
x̄ denotes the sample mean
s² denotes the sample variance
s denotes the sample standard deviation
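The difference between parameters and statistics can be illustrated with the MPG values from the population and sample tables shown earlier. This sketch assumes the standard definitions, which this section does not spell out: the population variance divides by N, while the sample variance divides by n - 1 (which is what R's `var()` and `sd()` do). The variable names are illustrative.

```r
population <- c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)  # MPG of all 12 cars
smpl       <- c(10, 13, 35, 18)                                  # MPG of RD2, RD5, GR1, GY2

# Parameters: computed from the whole population
N      <- length(population)
mu     <- sum(population) / N            # population mean
sigma2 <- sum((population - mu)^2) / N   # population variance (divide by N)
sigma  <- sqrt(sigma2)                   # population standard deviation

# Statistics: computed from the sample
xbar <- mean(smpl)   # sample mean
s2   <- var(smpl)    # sample variance; var() divides by n - 1
s    <- sd(smpl)     # sample standard deviation
```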
Statistics in Business
• Inferences about parameters are made under conditions of uncertainty (which are always present in statistics)
  – Uncertainty can be caused by
    • Randomness in the selection of a sample
    • Lack of knowledge about the source of the inferences
    • Change in conditions not accounted for
Variables and Data
Variable: a characteristic of any entity being studied that is capable of taking on different values that can be used for analysis
e.g. stock price, ROI, market share, age of worker, income of a family, total sales, advertising cost, etc.
Measurement: done when a standard process is used to assign numbers to particular characteristics of a variable; the process may be obvious or defined
e.g. age is obvious, but ROI or labour productivity is defined
The source of each measurement is called a sampling unit
Data: recorded measurements
Levels of Data Measurement
What are 40 and 80? They may represent:
• Weights of two objects being shipped
• Ratings received in a consumer test by two different products
• Football jersey numbers of a fullback and a centre-forward
Appropriateness of data analysis depends on the level of measurement of the data gathered
Levels of Data Measurement
• Nominal: qualitative data; numbers are typically used only to classify or categorize the attribute, although it is useful to retain the original verbal descriptions of the categories
  – 1 for “male” and 2 for “female”
  – Employee identification number
  – Religion, geographic location, PIN code, place of birth
  – Demographic questions in surveys, etc.
Levels of Data Measurement
• Ordinal: a variable is ordinal if ranking or ordering is possible for its values.
  – For example, a gold medal reflects superior performance to a silver or bronze medal in the Olympics. But can you say a gold and a bronze medal average out to a silver medal?
  – Preference scales are typically ordinal: how much do you like this cereal? Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.
Levels of Data Measurement
• Interval: in interval measurement the distance between attributes has meaning.
  – Numerical data typically fall into this category
  – For example, when measuring temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable.
Levels of Data Measurement
• Ratio: in ratio measurement there is always a meaningful reference point (either 0 for rates or 1 for ratios)
  – This means that you can construct a meaningful fraction (or ratio) with a ratio variable.
  – In applied social research most "count" variables are ratio, for example, the number of clients in the past six months.
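The four levels of measurement map loosely onto R data types. The sketch below is one possible illustration with made-up variable names: an unordered factor for nominal data, an ordered factor for ordinal data, and plain numeric vectors for interval and ratio data.

```r
# Nominal: categories with no inherent order
color <- factor(c("Red", "Blue", "Green", "Gray"))

# Ordinal: categories that can be ranked, but not averaged
medal <- factor(c("bronze", "gold", "silver"),
                levels = c("bronze", "silver", "gold"), ordered = TRUE)
medal[2] > medal[3]   # gold ranks above silver

# Interval: differences between values are meaningful
temp_f <- c(30, 40, 70, 80)
(temp_f[2] - temp_f[1]) == (temp_f[4] - temp_f[3])   # both distances are 10 degrees

# Ratio: ratios of values are meaningful
weights <- c(40, 80)
weights[2] / weights[1]   # the second object is twice as heavy
```

Comparison operators such as `>` work on the ordered factor but raise a warning on the unordered one, which mirrors the nominal versus ordinal distinction.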
Visualizing the data
• Construct a frequency distribution
  – For both grouped and ungrouped data
• Construct graphical summaries of qualitative data
• Construct graphical summaries of quantitative data
• Construct graphical summaries of two variables
Ungrouped vs. Grouped Data
• Ungrouped data
  – have not been summarized in any way
  – are also called raw data
• Grouped data
  – logical groupings of data exist, e.g. age ranges (20-29, 30-39, etc.)
  – have been organized into a frequency distribution
Example of Ungrouped Data
42 30 53 50 52 30 55 49 61 74
26 58 40 40 28 36 30 33 31 37
32 37 30 32 23 32 58 43 30 29
34 50 47 31 35 26 64 46 40 43
57 30 49 40 25 50 52 32 60 54
Ages of a sample of Managers from Urban Child Care Centres in US
Frequency Distribution
• Frequency distribution: summary of data presented in the form of class intervals and frequencies
  – Vary in shape and design
  – Constructed according to the individual researcher's preferences
Frequency Distribution
• Steps in constructing a frequency distribution
  – Step 1: Determine the range of the data
    • Range is the difference between the highest and the lowest numbers
  – Step 2: Determine the number of classes
    • Do not use too many or too few classes
  – Step 3: Determine the width of the class interval
    • Approximate class width can be calculated by dividing the range by the number of classes
    • Each value fits into only one class
Frequency Distribution of Child Care Managers' Ages
Class Interval   Frequency
20-under 30       6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70       3
70-under 80       1
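The frequency distribution above can be reproduced in R from the raw ages. This is a sketch using base R's `cut()` and `table()`; the names `ages`, `breaks`, and `classes` are illustrative. Setting `right = FALSE` makes each class include its lower endpoint and exclude its upper one, matching the "20-under 30" convention.

```r
# Ages of the 50 child care centre managers (raw, ungrouped data)
ages <- c(42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
          26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
          32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
          34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
          57, 30, 49, 40, 25, 50, 52, 32, 60, 54)

breaks  <- seq(20, 80, by = 10)                        # class endpoints 20, 30, ..., 80
classes <- cut(ages, breaks = breaks, right = FALSE)   # [20,30), [30,40), ...
freq    <- table(classes)                              # count of ages in each class
freq
```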
Relative Frequency
Class Interval   Frequency   Relative Frequency
20-under 30       6          .12
30-under 40      18          .36
40-under 50      11          .22
50-under 60      11          .22
60-under 70       3          .06
70-under 80       1          .02
Total            50         1.00
Relative frequency is the proportion of the total frequency that is in any given class interval in a frequency distribution.
For example, 6/50 = .12 and 18/50 = .36.
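Relative frequencies follow directly from the class frequencies. A minimal sketch, starting from the frequency column of the table (the vector names are illustrative):

```r
# Class frequencies for 20-under 30, 30-under 40, ..., 70-under 80
freq <- c(6, 18, 11, 11, 3, 1)

rel_freq <- freq / sum(freq)   # each class frequency divided by the total of 50
rel_freq
sum(rel_freq)                  # relative frequencies add up to 1
```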
Cumulative Frequency
Class Interval   Frequency   Cumulative Frequency
20-under 30       6           6
30-under 40      18          24
40-under 50      11          35
50-under 60      11          46
60-under 70       3          49
70-under 80       1          50
Total            50
Cumulative frequency is a running total of frequencies through the classes of a frequency distribution.
For example, 18 + 6 = 24 and 11 + 24 = 35.
Cumulative Relative Frequencies
Class Interval   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
20-under 30       6          .12                   6                     .12
30-under 40      18          .36                  24                     .48
40-under 50      11          .22                  35                     .70
50-under 60      11          .22                  46                     .92
60-under 70       3          .06                  49                     .98
70-under 80       1          .02                  50                    1.00
Total            50         1.00
Cumulative relative frequency is a running total of the relative frequencies through the classes of a frequency distribution.
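Both running totals can be computed with `cumsum()`. A short sketch starting from the class frequencies (vector names are illustrative):

```r
# Class frequencies for 20-under 30, 30-under 40, ..., 70-under 80
freq <- c(6, 18, 11, 11, 3, 1)

cum_freq     <- cumsum(freq)               # running total of frequencies
cum_rel_freq <- cumsum(freq) / sum(freq)   # running total of relative frequencies

cum_freq
cum_rel_freq
```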
Common Statistical Graphs – Quantitative Data
• Histogram: vertical bar chart of frequencies
• Frequency polygon: line graph of frequencies
• Ogive: line graph of cumulative frequencies
• Dot plot: each data value is plotted
• Stem and leaf plot: like a histogram, but shows individual data values. Useful for small data sets.
Histogram
• A histogram is a graphical summary of a frequency distribution
• Label the x-axis with class endpoints and the y-axis with frequencies, then draw a horizontal line between each pair of class endpoints at that class's frequency value
• The number and location of rectangles (bars) should be determined based on the sample size and the range of the data
Data Range
For the managers' ages, the smallest value is 23 and the largest is 74.
Range = Largest - Smallest
      = 74 - 23
      = 51
Number of Classes and Class Width
• The number of classes should be between 5 and 15.
  – Fewer than 5 classes cause excessive summarization.
  – More than 15 classes leave too much detail.
• Class width
  – Divide the range by the number of classes for an approximate class width
  – Round up to a convenient number
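Applied to the managers' ages (smallest 23, largest 74), the class width rule works out as sketched below. The choice of 6 classes and the "convenient" width of 10 are taken to match the 10-year classes used in the frequency tables; rounding to a convenient number is a judgment call, not something R decides.

```r
largest  <- 74
smallest <- 23
n_classes <- 6                            # between 5 and 15, as recommended

data_range   <- largest - smallest        # 51
approx_width <- data_range / n_classes    # 8.5
class_width  <- 10                        # rounded up to a convenient number

approx_width
seq(20, 80, by = class_width)             # resulting class endpoints
```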
Class midpoint or Class mark
The midpoint of each class interval is called the class midpoint or the class mark.
Midpoints for Age Classes
Class Interval   Frequency   Midpoint   Relative Frequency   Cumulative Frequency
20-under 30       6          25         .12                   6
30-under 40      18          35         .36                  24
40-under 50      11          45         .22                  35
50-under 60      11          55         .22                  46
60-under 70       3          65         .06                  49
70-under 80       1          75         .02                  50
Total            50                    1.00
Histogram
Class Interval   Frequency
20-under 30       6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70       3
70-under 80       1
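A histogram of these classes can be drawn with base R's `hist()`. This is a sketch: the `breaks` are set to the class endpoints from the table, and `right = FALSE` is used so that each bar covers "lower endpoint to under upper endpoint". The title and axis labels are illustrative choices.

```r
# Ages of the 50 child care centre managers
ages <- c(42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
          26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
          32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
          34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
          57, 30, 49, 40, 25, 50, 52, 32, 60, 54)

h <- hist(ages,
          breaks = seq(20, 80, by = 10),   # class endpoints
          right  = FALSE,                  # classes are [20,30), [30,40), ...
          main   = "Ages of Child Care Managers",
          xlab   = "Years",
          ylab   = "Frequency")

h$counts   # bar heights, which should agree with the frequency column
```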
Frequency Polygon
Class Interval   Frequency
20-under 30       6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70       3
70-under 80       1
[Figure: frequency polygon, a graphical display of class frequencies; x-axis: Years (0 to 80), y-axis: Frequency]
Relative Frequency Ogive
Class Interval   Cumulative Relative Frequency
20-under 30       .12
30-under 40       .48
40-under 50       .70
50-under 60       .92
60-under 70       .98
70-under 80      1.00
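Both line graphs can be sketched with base R's `plot()`. The assumptions here are the usual plotting conventions: the frequency polygon places each class frequency at its class midpoint, and the ogive places each cumulative relative frequency at the upper class endpoint, starting from zero at the lowest endpoint. Variable names and axis labels are illustrative.

```r
freq      <- c(6, 18, 11, 11, 3, 1)    # class frequencies
midpoints <- seq(25, 75, by = 10)      # class midpoints 25, 35, ..., 75
endpoints <- seq(20, 80, by = 10)      # class endpoints 20, 30, ..., 80

# Frequency polygon: frequencies plotted at the class midpoints
plot(midpoints, freq, type = "b",
     xlab = "Years", ylab = "Frequency", main = "Frequency Polygon")

# Relative frequency ogive: cumulative proportions plotted at the upper endpoints
cum_rel <- c(0, cumsum(freq) / sum(freq))
plot(endpoints, cum_rel, type = "b",
     xlab = "Years", ylab = "Cumulative relative frequency",
     main = "Relative Frequency Ogive")
```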
Stem and Leaf plot: Safety Examination Scores for Plant Trainees
Raw Data
86 76 23 77 81 79 68 77 92 59
68 75 83 49 91 47 72 82 74 70
56 60 88 75 97 39 78 94 55 67
83 89 67 91 81
Stem | Leaf
2    | 3
3    | 9
4    | 7 9
5    | 5 6 9
6    | 0 7 7 8 8
7    | 0 2 4 5 5 6 7 7 8 9
8    | 1 1 2 3 3 6 8 9
9    | 1 1 2 4 7
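The construction can be sketched in R in two ways. Base R's `stem()` prints a stem and leaf display directly, although its default choice of stems may differ from the table above and its `scale` argument may need adjusting. The second part builds the stems and leaves by hand, treating the tens digit as the stem and the units digit as the leaf; `leaves` is an illustrative name.

```r
# Safety examination scores for the 35 plant trainees
scores <- c(86, 76, 23, 77, 81, 79, 68, 77, 92, 59,
            68, 75, 83, 49, 91, 47, 72, 82, 74, 70,
            56, 60, 88, 75, 97, 39, 78, 94, 55, 67,
            83, 89, 67, 91, 81)

stem(scores)   # built-in display; layout is chosen by R

# By hand: the stem is the tens digit, the leaf is the units digit
sorted <- sort(scores)
leaves <- split(sorted %% 10, sorted %/% 10)
leaves
```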
Histogram vs. Stem and Leaf?
• So, which one should you use?
• A stem and leaf plot is useful for small data sets. It shows the values of the data points.
• A histogram forgoes the individual values of the data for the bigger picture of the distribution of the data.
• The purpose of these graphs is to summarize a set of data. As long as that need is met, either one is okay to use.
Common Statistical Graphs – Qualitative Data
• Pie chart: proportional representation for categories of a whole
• Bar chart: frequency or relative frequency of one or more categorical variables
Complaints by Amtrak Passengers
COMPLAINT           NUMBER   PROPORTION   DEGREES
Stations, etc.      28,000    .40         144.0
Train Performance   14,700    .21          75.6
Equipment           10,500    .15          54.0
Personnel            9,800    .14          50.4
Schedules, etc.      7,000    .10          36.0
Total               70,000   1.00         360.0
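The PROPORTION and DEGREES columns for the Amtrak complaints follow from the counts: each proportion is the count divided by the total, and each slice angle is the proportion times 360. A minimal sketch, with illustrative variable names, that also draws the pie chart with base R's `pie()`:

```r
# Number of complaints by category
complaints <- c("Stations, etc."    = 28000,
                "Train Performance" = 14700,
                "Equipment"         = 10500,
                "Personnel"         =  9800,
                "Schedules, etc."   =  7000)

proportion <- complaints / sum(complaints)   # share of all complaints
degrees    <- proportion * 360               # slice angle in a pie chart

round(proportion, 2)
round(degrees, 1)

pie(complaints, main = "Complaints by Amtrak Passengers")
```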
Second Quarter U.S. Truck Production
Second Quarter Truck Production in the U.S. (Hypothetical values)
Company   2d Quarter Truck Production
A         357,411
B         354,936
C         160,997
D          34,099
E          12,747
Totals    920,190
Pie Chart Calculations for Company A
Company   2d Quarter Truck Production   Proportion   Degrees
A         357,411                        .388        140
B         354,936                        .386        139
C         160,997                        .175         63
D          34,099                        .037         13
E          12,747                        .014          5
Totals    920,190                       1.000        360
Vertical Bar Graphs or Column Charts
[Figure: vertical bar (column) chart; x-axis: years 2010 to 2013; y-axis: 0 to 6; series: Kolkata, Mumbai, Chennai]
Horizontal Bar Chart
[Figure: horizontal bar chart; y-axis: years 2010 to 2013; x-axis: 0 to 6; series: Chennai, Mumbai, Kolkata]
Pareto Chart
A Pareto chart is a bar chart, sorted from the most frequent to the least frequent, overlaid with a cumulative line graph (like an ogive). These data present the most common types of defects.
[Figure: Pareto chart of defect types (Poor Wiring, Short in Coil, Defective Plug, Other); left axis: Frequency (0 to 100); right axis: cumulative percentage (0% to 100%)]
Scatter Plot
Registered Vehicles (1000's)   Gasoline Sales (1000's of Gallons)
5 60
15 120
9 90
15 140
7 60
Common Statistical Graphs – Comparing Two Variables
• Scatter plot: type of display using Cartesian coordinates to display values for two variables for a set of data.
  – The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
  – A scatter plot is also called a scatter chart, scatter diagram, or scatter graph.
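The registered vehicles and gasoline sales data listed earlier can be shown as a scatter plot with base R's `plot()`. In this sketch, registered vehicles are placed on the horizontal axis and gasoline sales on the vertical axis; that assignment and the variable names are choices made for illustration.

```r
vehicles  <- c(5, 15, 9, 15, 7)         # registered vehicles (1000's)
gas_sales <- c(60, 120, 90, 140, 60)    # gasoline sales (1000's of gallons)

# Each point is one (vehicles, gas_sales) pair
plot(vehicles, gas_sales,
     xlab = "Registered Vehicles (1000's)",
     ylab = "Gasoline Sales (1000's of Gallons)",
     main = "Scatter Plot")
```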
Measures of Central Tendency & Dispersion:
Learning Objectives
• Distinguish between measures of central tendency, measures of variability, measures of shape, and measures of association.
• Understand the meanings of mean, median, mode, quartile, percentile, and range.
• Compute mean, median, mode, percentile, quartile, range, variance, standard deviation, and mean absolute deviation on ungrouped data.
• Differentiate between sample and population variance and standard deviation.
Measures of Central Tendency & Dispersion:
Learning Objectives - continued
• Understand the meaning of standard deviation as it is applied by using the empirical rule and Chebyshev’s theorem.
• Compute the mean, median, standard deviation, and variance on grouped data.
• Understand box and whisker plots, skewness, and kurtosis.
• Compute a coefficient of correlation and interpret it.
Measures of Central Tendency: Ungrouped Data
• Measures of central tendency yield information about “the centre, or middle part, of a group of numbers.”
• Measures of central tendency do not focus on the span of the data set or how far values are from the middle numbers
• Common measures of location
  – Mode
  – Median
  – Mean
  – Percentiles
  – Quartiles
Mode
• Mode: the most frequently occurring value in a data set
  – Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
  – Can be used to determine which categories occur most frequently
  – Sometimes, no mode exists (no duplicates)
• Bimodal: in a tie for the most frequently occurring value, two modes are listed
• Multimodal: data sets that contain more than two modes
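Base R has no built-in function for the statistical mode (its `mode()` function reports an object's storage type instead), so a small helper is needed. The function below, with the hypothetical name `modes`, is one possible sketch for numeric data: it returns every value tied for the highest count and returns `NULL` when no value repeats. The example vectors are made up.

```r
# Hypothetical helper: all most-frequent values of a numeric vector
modes <- function(x) {
  counts <- table(x)                    # how often each value occurs
  if (max(counts) == 1) return(NULL)    # no duplicates, so no mode
  as.numeric(names(counts)[counts == max(counts)])
}

modes(c(1, 2, 2, 3))       # one mode
modes(c(1, 1, 2, 2, 3))    # bimodal: two values tie
modes(c(1, 2, 3))          # no mode
```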
Median
• Median: middle value in an ordered array of numbers.
  – Half the data are above it, half the data are below it
  – Mathematically, it is the (n+1)/2 th ordered observation
• For an array with an odd number of terms, the median is the middle number
  – n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation
• For an array with an even number of terms, the median is the average of the middle two numbers
  – n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th ordered observations
Arithmetic Mean
• Mean is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by the number of values in the data set
Demonstration Problem
The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows.
Company          Number of Cars in Service
Enterprise       643,000
Hertz            327,000
National/Alamo   233,000
Avis             204,000
Dollar/Thrifty   167,000
Budget           144,000
Advantage         20,000
U-Save            12,000
Payless           10,000
ACE                9,000
Fox                9,000
Rent-A-Wreck       7,000
Triangle           6,000
Compute the mode, the median, and the mean.
Demonstration Problem: Solution
Mode: 9,000 (two companies with 9,000 cars in service)
Median: With 13 different companies in this group, N = 13. The median is located at the (13 +1)/2 = 7th position. Because the data are already ordered, median is the 7th term, which is 20,000.
Mean: μ = ∑x/N = (1,791,000/13) = 137,769.23
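The three answers can be checked in R. `median()` and `mean()` are built in; the mode is found here by tabulating the values, since base R has no statistical mode function. The variable names are illustrative.

```r
# Cars in service for the 13 rental companies
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000,
                     20000, 12000, 10000, 9000, 9000, 7000, 6000)

counts   <- table(cars_in_service)
mode_val <- as.numeric(names(counts)[counts == max(counts)])   # most frequent value

mode_val                   # mode
median(cars_in_service)    # 7th of the 13 ordered values
mean(cars_in_service)      # sum divided by 13
```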
Percentile
• Percentile: measures of central tendency that divide a group of data into 100 parts
• At least n% of the data lie at or below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile
• Example: the 90th percentile indicates that at least 90% of the data are equal to or less than it, and at most 10% of the data lie above it
Calculating Percentiles
• To calculate the pth percentile,
  – Order the data
  – Calculate i = N (p/100)
  – Determine the percentile
    • If i is a whole number, use the average of the ith and (i+1)th ordered observations
    • Otherwise, round i up to the next whole number and use the observation at that position
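The steps above can be sketched as a small R function. The name `percentile` and the eight example values are hypothetical, chosen only to exercise both branches of the rule; the function is meant for whole-number p strictly between 0 and 100.

```r
# Hypothetical helper implementing the percentile rule described above
percentile <- function(x, p) {
  x <- sort(x)                  # step 1: order the data
  i <- length(x) * p / 100      # step 2: i = N (p/100)
  if (i == floor(i)) {
    mean(x[c(i, i + 1)])        # whole number: average the ith and (i+1)th values
  } else {
    x[ceiling(i)]               # otherwise: round i up and take that value
  }
}

values <- c(14, 12, 19, 23, 5, 13, 28, 17)   # made-up data, N = 8
percentile(values, 30)   # i = 2.4, so the 3rd ordered value is used
percentile(values, 25)   # i = 2, so the 2nd and 3rd ordered values are averaged
```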
Quartiles
• Quartile: measures of central tendency that divide a group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
[Diagram: Q1, Q2, and Q3 split the ordered data into four groups of 25% each]
Quartiles for Demonstration Problem
For the cars in service data, n=13, so
Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation
Q1 = 9,000
Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation
Q3 = 204,000
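R's `quantile()` function offers several quantile definitions through its `type` argument. As far as the documentation describes it, `type = 2` (inverse of the empirical distribution function, with averaging at discontinuities) corresponds to the rule used in these slides, so it is used in this sketch; other types can give different quartiles on other data sets.

```r
# Cars in service for the 13 rental companies
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000,
                     20000, 12000, 10000, 9000, 9000, 7000, 6000)

# type = 2: round the position up, or average two values when it is a whole number
q <- quantile(cars_in_service, probs = c(0.25, 0.75), type = 2)
q
```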
Which Measure Do I Use?
• Which measure of central tendency is most appropriate?
  – In general, the mean is preferred, since it has nice mathematical properties that we shall discuss later
  – The median and quartiles are resistant to outliers
• Consider the following three datasets
  – 1, 2, 3 (median=2, mean=2)
  – 1, 2, 6 (median=2, mean=3)
  – 1, 2, 30 (median=2, mean=11)
  – All have median=2, but the mean is sensitive to the outliers
• In general, if there are outliers, the median is preferred to the mean
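The three small datasets make the point easy to verify in R. A minimal sketch, with `datasets` as an illustrative name:

```r
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

medians <- sapply(datasets, median)   # unchanged as the largest value grows
means   <- sapply(datasets, mean)     # pulled upward by the largest value

medians
means
```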