1/19
StatTools Guide
Table of Contents
1. Data
a) Data Set Manager
b) Data Utilities
2. Analyses
a) Summary Statistics
• One-Variable Summary
• Correlation and Covariance
b) Summary Graphs
• Histogram
• Scatterplot
• Box-Whisker Plot
c) Statistical Inference
• Confidence Interval
• Hypothesis Test
• Sample Size Selection
• One-Way ANOVA
• Two-Way ANOVA
• Chi-Square Independence Test
d) Normality Tests
• Chi-Square Test
• Q-Q Normal Plot
e) Time Series and Forecasting
• Time Series Graph
• Runs Test for Randomness
• Forecast
f) Regression and Classification
• Regression
• Logistic Regression
g) Quality Control
• Pareto Chart
• X/R Charts
• P, C and U Charts
h) Nonparametric Tests
• Sign Test
• Wilcoxon Signed-Rank Test
• Mann-Whitney Test
3. Tools
a) User Manual
b) Example Spreadsheets
The StatTools toolbar is divided into 3 parts:
1. Data
Lets you define the dataset (i.e. “tell the computer where the data are and what they represent”).
a) Data Set Manager
This is the first step you need to take before being able to apply any analysis to the data.
Create a new dataset
Name the dataset or keep the original name
Where are the data?
Optional. If checked, your table will look a little nicer.
Are observations given in columns or rows?
Does your table contain a name (title) for each row/column?
b) Data Utilities
This is an optional step. It is used only if you want to apply additional processing to your data (e.g. calculate differences or draw random samples).
Add calculated columns/rows to your table
Select a random sample out of your population
Group values by category, for example MPG by type of car. You will always be able to stack/unstack your data when requesting an analysis, so you don’t need to decide right now.
2. Analyses
Once you have defined your dataset, you can apply a number of analyses to it.
a) Summary Statistics
Basic statistics for numerical variables.
• One-variable summary
Statistics (mean, variance, etc.) for each numerical variable.
Which summary statistics do you
Which variable(s) do you want to analyze? Stacked by which
If your data are unstacked, you will get statistics across all observations (e.g. mean MPG for all cars).
Stacking your data allows you to obtain statistics by category (e.g. mean MPG by type of car).
Example: One-variable summary for stacked and unstacked variables
Unstacked: MPG statistics for all cars
Stacked: MPG statistics for Midsizes and SUVs
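To make the difference between statistics across all observations and statistics by category concrete, here is a minimal Python sketch. It does not use StatTools; the car records, the `summarize` helper and the choice of statistics are illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical records: one category column (car type) and one value column (MPG).
cars = [
    ("Midsize", 28.0), ("Midsize", 31.0), ("Midsize", 25.0),
    ("SUV", 19.0), ("SUV", 22.0), ("SUV", 16.0),
]

def summarize(values):
    """Return count, mean and sample standard deviation of a list of numbers."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}

# Statistics across all observations (mean MPG for all cars).
overall = summarize([mpg for _, mpg in cars])

# Statistics by category (mean MPG by type of car).
groups = {}
for car_type, mpg in cars:
    groups.setdefault(car_type, []).append(mpg)
by_type = {car_type: summarize(values) for car_type, values in groups.items()}

print(overall)
print(by_type)
```

The first summary pools every car into one set of statistics, while the second produces one set per car type.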
• Correlation and Covariance
Statistics between each pair of numerical variables.
Example: Correlation and Covariance analyses between 3 numerical variables
StatTools (Core Analysis Pack)
Analysis: Correlation and Covariance
Performed By: Claire
Date: Friday, November 20, 2009
Updating: Live

Correlation Table (all variables from Data Set #1)
             Age      Income   Education
Age          1.000    0.084    -0.130
Income       0.084    1.000    -0.111
Education   -0.130   -0.111     1.000

Covariance Table (all variables from Data Set #1)
             Age      Income   Education
Age         54.822    4.324    -1.921
Income       4.324   48.516    -1.545
Education   -1.921   -1.545     4.000
For which pair of variables do you want correlation or covariance?
Correlation or Covariance?
Note: entries above the diagonal are equal to the corresponding entries below the diagonal (i.e. Cov(A,B)=Cov(B,A))
Note: entries on the diagonal are equal to the variance of each variable (i.e. Cov(A,A)=Var(A))
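The two notes about the covariance table can be checked with a short Python sketch. The data below are hypothetical (not the values behind the example output), and the sample (n - 1) form of the covariance is an assumption; the guide does not say which form StatTools uses.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5

# Hypothetical data for three numerical variables.
data = {
    "Age": [25, 32, 47, 51, 38],
    "Income": [30, 42, 58, 61, 45],
    "Education": [16, 12, 14, 18, 12],
}
names = list(data)
cov_table = {a: {b: covariance(data[a], data[b]) for b in names} for a in names}
corr_table = {a: {b: correlation(data[a], data[b]) for b in names} for a in names}

# Symmetry: Cov(A,B) equals Cov(B,A); diagonal: Cov(A,A) is the variance of A.
print(cov_table["Age"]["Income"], cov_table["Income"]["Age"])
print(cov_table["Education"]["Education"])
```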
b) Summary Graphs
• Histogram
Example: Histogram
Unstacked: one histogram for all data (variable has to be numerical)
Number of bars
Value variable has to be numerical
• Scatterplot
Example: Scatterplot
[Figure: Scatterplot of MPG vs Year of Data Set #1. Horizontal axis: Year / Data Set #1 (1975 to 2015); vertical axis: MPG / Data Set #1 (0 to 60).]
Both variables must be numerical
Always unstacked
• Box-Whisker Plot
Description of Plot Elements
Selected variable must be numerical
Includes the following description
c) Statistical Inference
• Confidence Interval
Confidence interval for mean, standard deviation or proportion
Type of data: one-sample, two-sample or paired-sample
Confidence level
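As an illustration of what a one-sample confidence interval for the mean involves, here is a minimal Python sketch. It uses a normal (z) approximation because the standard library has no t distribution; that choice, the function name and the data are assumptions, and StatTools may compute the interval differently.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate two-sided confidence interval for the mean (normal approximation)."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(sample) / n ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

# Hypothetical one-sample data (MPG readings).
mpg = [24.1, 26.3, 22.8, 27.5, 25.0, 23.9, 26.8, 24.4, 25.7, 23.5]
low, high = mean_confidence_interval(mpg, confidence=0.95)
print(low, high)
```

Raising the confidence level widens the interval; for small samples a t-based interval would be somewhat wider than this approximation.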
• Hypothesis Test
• Sample Size Selection
Determines the sample size needed for the requested confidence level.
Confidence interval for mean, standard deviation or proportion
H0 (null hypothesis)
Ha (alternative hypothesis)
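For the Sample Size Selection item above, the usual textbook calculation for a mean can be sketched as follows. The formula n = (z·σ / half-width)², the function name and the inputs are assumptions for illustration; the guide does not state the formula StatTools applies.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, half_width, confidence=0.95):
    """Smallest n so that a z-interval for the mean has at most the given half-width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / half_width) ** 2)

# Hypothetical inputs: standard deviation 5, desired half-width 1, 95% confidence.
n_needed = sample_size_for_mean(sigma=5.0, half_width=1.0, confidence=0.95)
print(n_needed)  # 97
```

A higher confidence level or a narrower interval both increase the required sample size.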
• One-Way ANOVA
A small p-value indicates that the means are not all equal across populations (H0: all means are equal). If the confidence interval for the difference between two means does not contain 0, those two means are not equal.
• Two-Way ANOVA
Variables must be stacked. Two category variables and one numerical variable must be selected.
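To show where the One-Way ANOVA p-value comes from, the following Python sketch computes the F statistic that the test is built on. The samples and the function name are hypothetical, and the p-value itself is omitted because the standard library has no F distribution.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic and its degrees of freedom for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical MPG samples for three types of car.
samples = [[28.0, 31.0, 25.0], [19.0, 22.0, 16.0], [35.0, 38.0, 32.0]]
f_stat, df1, df2 = one_way_anova_f(samples)
print(f_stat, df1, df2)
```

A large F statistic corresponds to a small p-value, that is, to evidence against H0 that all means are equal.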
• Chi-Square Independence Test
Tests whether the attribute in the rows and the attribute in the columns are independent of each other. For example, if rows are “gender” and columns are “favorite soft drink”, the chi-square independence test checks whether favorite soft drink is independent of gender. If the p-value is small, the attributes are not independent.
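The statistic behind the chi-square independence test compares each observed count with the count expected under independence. A minimal Python sketch, with hypothetical counts and a hypothetical function name:

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows = gender, columns = favorite soft drink.
counts = [
    [30, 20, 10],
    [20, 20, 20],
]
chi2, dof = chi_square_independence(counts)
print(chi2, dof)
```

The larger the statistic for the given degrees of freedom, the smaller the p-value and the stronger the evidence that the attributes are not independent.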
d) Normality Tests
• Chi-Square Test
Compares the histogram of the data with the histogram expected if the data were normally distributed. A small p-value is evidence that the data are not normally distributed; a large p-value means that normality cannot be rejected.
• Q-Q Normal Plot
Compares the data quantiles with the quantiles expected if the data were normally distributed. Points falling close to a straight line at a 45-degree angle indicate that the data are approximately normal.
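The points of a Q-Q normal plot can be computed as in the Python sketch below. The plotting positions (i - 0.5) / n are one common convention and an assumption here, as are the sample and the function name; StatTools may use a different convention.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each standard normal quantile with the matching standardized data quantile."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    standardized = sorted((x - m) / s for x in sample)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))

# Hypothetical sample.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 4.9, 5.2, 5.0]
points = qq_points(sample)
for normal_q, data_q in points:
    print(round(normal_q, 3), round(data_q, 3))
```

If the data are close to normal, the two numbers in each pair are close to each other, which is what a 45-degree line on the plot expresses.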
e) Time Series and Forecasting
• Time Series Graph
Displays the value of a numerical variable across observations.
• Runs Test for Randomness
Variables must be unstacked. Indicates whether the values of a variable seem random. If the number of runs above/below the mean is significantly different from E(R) (the expected number of runs under randomness), the observations are not random.
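The comparison between the observed number of runs and E(R) can be sketched in Python using the standard large-sample runs test. The series, the function name and the treatment of values equal to the mean (dropped) are assumptions.

```python
from statistics import NormalDist, mean

def runs_test(values):
    """Runs test about the mean: returns (runs, expected runs, z, two-sided p-value)."""
    cutoff = mean(values)
    # True = above the mean, False = below; values equal to the mean are dropped.
    signs = [v > cutoff for v in values if v != cutoff]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    n = n_above + n_below
    # A new run starts every time the sign changes.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n_above * n_below / n + 1
    variance = 2 * n_above * n_below * (2 * n_above * n_below - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / variance ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, expected, z, p_value

# A strictly alternating series has far more runs than randomness would predict.
alternating = [1, 9, 2, 8, 1, 9, 2, 8, 1, 9, 2, 8]
runs, expected, z, p_value = runs_test(alternating)
print(runs, expected, round(z, 3), round(p_value, 4))
```

Here the 12 observed runs are well above the 7 expected under randomness, so the small p-value flags the series as non-random.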
• Forecast
Forecasts future observations using a choice of forecasting methods.
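The guide does not list the forecasting methods, so the Python sketch below shows two common ones, a moving average and simple exponential smoothing, purely as assumed examples. The series, the function names and the parameter values are hypothetical.

```python
def exponential_smoothing_forecast(series, alpha=0.3):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = series[0]  # initialise the level with the first observation
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level

def moving_average_forecast(series, span=3):
    """Forecast the next value as the mean of the last `span` observations."""
    window = series[-span:]
    return sum(window) / len(window)

# Hypothetical time series.
sales = [120.0, 132.0, 128.0, 141.0, 150.0, 147.0]
next_ses = exponential_smoothing_forecast(sales, alpha=0.5)
next_ma = moving_average_forecast(sales, span=3)
print(next_ses, next_ma)  # 144.5 146.0
```

A larger `alpha` or a shorter `span` makes the forecast react faster to recent observations, at the cost of more noise.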
f) Regression and Classification
• Regression
Multiple regression considers all independent variables at once.
Other regression types (such as stepwise) enter or remove independent variables selectively, so the final equation may include only a subset of them.
• Logistic Regression
Used when the response variable is either 0 or 1 (failure or success). Independent variables can be continuous.
g) Quality Control
Indicates whether a process is in statistical control.
• Pareto Chart
This is a frequency histogram with the categories sorted from most to least frequent.
Example: Pareto Chart
• X/R Charts
Give the mean (X) and range (R) for each subset of observations. Two numerical variables need to be selected.
• P, C and U Charts
Is the process contained between the upper and lower limits?
Example: C-Chart
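The question of whether the process stays between the upper and lower limits can be illustrated for a C chart with the usual 3-sigma limits. The defect counts and the function name are hypothetical, and the limits assume Poisson-distributed counts; the guide does not state how StatTools sets them.

```python
def c_chart_limits(defect_counts):
    """Center line and 3-sigma control limits for a C chart (counts of defects)."""
    center = sum(defect_counts) / len(defect_counts)
    spread = 3 * center ** 0.5  # Poisson assumption: variance equals the mean
    lower = max(0.0, center - spread)  # a count cannot be negative
    upper = center + spread
    return lower, center, upper

# Hypothetical number of defects found in each of 10 inspected units.
counts = [4, 6, 3, 5, 7, 4, 5, 6, 4, 16]
lcl, center, ucl = c_chart_limits(counts)
out_of_control = [i for i, c in enumerate(counts) if c < lcl or c > ucl]
print(lcl, center, round(ucl, 3), out_of_control)
```

Only the last unit, with 16 defects, falls outside the limits, which suggests the process was not in statistical control at that point.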
h) Nonparametric Tests
Used for hypothesis testing about the median when a normal distribution is not assumed.
Both Sign and Wilcoxon Signed-Rank tests can be performed for the median of a single variable (one-sample analysis) or for the median of the differences between pairs of variables (paired-sample analysis).
• Sign Test
Is the median positive or negative?
• Wilcoxon Signed-Rank Test
Assumes that the distribution is symmetric (but not necessarily normal).
• Mann-Whitney Test
Can only be performed on two samples. Can be used to test whether two samples are drawn from the same probability distribution.
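As an illustration of the simplest of these tests, the Sign Test, the Python sketch below computes an exact two-sided p-value from the binomial distribution. The differences, the function name and the handling of values equal to the hypothesized median (dropped) are assumptions.

```python
from math import comb

def sign_test(sample, hypothesized_median=0.0):
    """Two-sided exact sign test for the median of one sample or of paired differences."""
    diffs = [x - hypothesized_median for x in sample if x != hypothesized_median]
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    # Under H0 the number of positive signs follows Binomial(n, 0.5).
    k = min(positives, n - positives)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return positives, n, p_value

# Hypothetical paired differences (after minus before).
differences = [1.2, 0.8, -0.3, 2.1, 1.7, 0.9, 1.4, -0.6, 1.1, 0.5]
positives, n, p_value = sign_test(differences)
print(positives, n, p_value)  # 8 10 0.109375
```

With 8 positive signs out of 10, the p-value of about 0.11 is not small enough to reject a median of 0 at the 5% level.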