the green lab - [07-a] data analysis
TRANSCRIPT
1 Het begint met een idee
Data AnalysisDescriptive Statistics and EDA
Giuseppe Procaccianti
Vrije Universiteit Amsterdam
2 Giuseppe Procaccianti / S2 group / The Green Lab
Quick Recap
Experiment scoping
Experiment planning
Idea
Experiment operation
Analysis & interpretation
Presentation & package
Vrije Universiteit Amsterdam
3 Giuseppe Procaccianti / S2 group / The Green Lab
Analysis and Interpretation
● Understanding the data
○ descriptive statistics
○ exploratory data analysis (EDA, e.g. boxplots, scatter plots)
● (Optional) data reduction
● Hypothesis testing
● Results interpretation
Vrije Universiteit Amsterdam
4 Giuseppe Procaccianti / S2 group / The Green Lab
Descriptive Statistics
● Goal: get a ‘feeling’ about how data is distributed
● Properties:
○ Central Tendency (e.g. Mean, Median)
○ Dispersion (e.g. Frequency, Standard Deviation)
○ Dependency (e.g. Correlation)
Vrije Universiteit Amsterdam
5 Giuseppe Procaccianti / S2 group / The Green Lab
Parameter vs. statistic
● Parameter: feature of the population
○ μ: mean
○ σ: standard deviation
● Statistic: feature of the sample
○ : mean
○ s: standard deviation
● Statistics are an estimation of parameters
Vrije Universiteit Amsterdam
6 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency
● Arithmetic mean:
● Geometric Mean:
Vrije Universiteit Amsterdam
7 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency: example
● Average of scores:
6 - 7 - 8 - 9 - 10
● Arithmetic mean: 8
● Geometric mean: ~7.87
Vrije Universiteit Amsterdam
8 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency: example
● Average of returns of investments:
90% ; 10% ; 20% ; 30% ; -90%
● Arithmetic mean:
(90+10+20+30-90)/5= 12%
● Geometric mean:
[(1.9 x 1.1 x 1.2 x 1.3 x 0.1) ^ 1/5] - 1 =0.2008= -20.08%
Vrije Universiteit Amsterdam
9 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency
● Median (or 50% percentile): middle value separating the
greater and lesser halves of a data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Xsort
= [13, 13, 13, 13, 14, 14, 16, 18, 21]
Vrije Universiteit Amsterdam
10 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency
● Mode: most frequent value in data set
X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Mox = 13
Vrije Universiteit Amsterdam
11 Giuseppe Procaccianti / S2 group / The Green Lab
Central Tendency - Skewness
Vrije Universiteit Amsterdam
12 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion
● Sample variance:
● Standard Deviation:
● Standard Deviation is dimensionally equivalent to the data
Vrije Universiteit Amsterdam
13 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion - three-sigma-rule
"Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG
Vrije Universiteit Amsterdam
14 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion - three-sigma-rule
● Range:
● Coefficient of variation:(in percentage of mean)
● Coefficient of variation only has meaning if all values are positive (ratio scale, not interval scale e.g. temperatures)
Vrije Universiteit Amsterdam
15 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion - example
● Dataset: [100, 100, 100]Mean: 100
● Variance: 0
● Standard Deviation: 0
● Coeff. Variation: 0
● Range: 0
Vrije Universiteit Amsterdam
16 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion - example
● Dataset: [90, 100, 110]Mean: 100
● Sample Variance: 100
● Standard Deviation: 10
● Coeff. Variation: 10%
● Range: 20
Vrije Universiteit Amsterdam
17 Giuseppe Procaccianti / S2 group / The Green Lab
Dispersion - example
● Dataset: [1, 5, 6, 8, 10, 40, 65, 88]Mean: 27.875
● Sample Variance: 1082.69
● Standard Deviation: 32.9
● Coeff. Variation: 1.18%
● Range: 87
Vrije Universiteit Amsterdam
18 Giuseppe Procaccianti / S2 group / The Green Lab
Basic visualizations
Box Plot
Median
3rd quartile
1st quartile
Vrije Universiteit Amsterdam
19 Giuseppe Procaccianti / S2 group / The Green Lab
Basic visualizations
Box Plot
Vrije Universiteit Amsterdam
20 Giuseppe Procaccianti / S2 group / The Green Lab
Basic visualizations
Box Plot
By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
outliers positive skewness
Vrije Universiteit Amsterdam
21 Giuseppe Procaccianti / S2 group / The Green Lab
Dependency: correlation
● Sample correlation coefficient (Pearson):
● Meaningful when comparing paired values/datasets
Vrije Universiteit Amsterdam
22 Giuseppe Procaccianti / S2 group / The Green Lab
Dependency: correlation
● Spearman’s rank correlation coefficient:
● Kendall’s rank correlation coefficient: ○ smaller values○ more accurate on small samples
● Pearson correlation coefficient assumes normally distributed data
Vrije Universiteit Amsterdam
23 Giuseppe Procaccianti / S2 group / The Green Lab
Dependency: example
Age vs. body fat %
● Pearson: r = 0.7921
● Spearman: = 0.7539
● Kendall: = 0.5762
Vrije Universiteit Amsterdam
24 Giuseppe Procaccianti / S2 group / The Green Lab
Basic Visualizations
Scatter Plot
Vrije Universiteit Amsterdam
25 Giuseppe Procaccianti / S2 group / The Green Lab
Basic Visualizations
Image Source: http://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/
Scatter plots per different values of r
Vrije Universiteit Amsterdam
26 Giuseppe Procaccianti / S2 group / The Green Lab
Correlation does NOT imply causation!
● Spurious Correlations: http://tylervigen.com/
Vrije Universiteit Amsterdam
Thank you!
[email protected] 27 Giuseppe Procaccianti / S2 group / The Green Lab