the green lab - [07-a] data analysis

27
1 Het begint met een idee Data Analysis Descriptive Statistics and EDA Giuseppe Procaccianti

Upload: giuseppe-procaccianti

Post on 14-Feb-2017

480 views

Category:

Education


5 download

TRANSCRIPT

Page 1: The Green Lab - [07-A] Data Analysis

1 Het begint met een idee

Data AnalysisDescriptive Statistics and EDA

Giuseppe Procaccianti

Page 2: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

2 Giuseppe Procaccianti / S2 group / The Green Lab

Quick Recap

Experiment scoping

Experiment planning

Idea

Experiment operation

Analysis & interpretation

Presentation & package

Page 3: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

3 Giuseppe Procaccianti / S2 group / The Green Lab

Analysis and Interpretation

● Understanding the data

○ descriptive statistics

○ exploratory data analysis (EDA, e.g. boxplots, scatter plots)

● (Optional) data reduction

● Hypothesis testing

● Results interpretation

Page 4: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

4 Giuseppe Procaccianti / S2 group / The Green Lab

Descriptive Statistics

● Goal: get a ‘feeling’ about how data is distributed

● Properties:

○ Central Tendency (e.g. Mean, Median)

○ Dispersion (e.g. Frequency, Standard Deviation)

○ Dependency (e.g. Correlation)

Page 5: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

5 Giuseppe Procaccianti / S2 group / The Green Lab

Parameter vs. statistic

● Parameter: feature of the population

○ μ: mean

○ σ: standard deviation

● Statistic: feature of the sample

○ : mean

○ s: standard deviation

● Statistics are an estimation of parameters

Page 6: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

6 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency

● Arithmetic mean:

● Geometric Mean:

Page 7: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

7 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency: example

● Average of scores:

6 - 7 - 8 - 9 - 10

● Arithmetic mean: 8

● Geometric mean: ~7.87

Page 8: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

8 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency: example

● Average of returns of investments:

90% ; 10% ; 20% ; 30% ; -90%

● Arithmetic mean:

(90+10+20+30-90)/5= 12%

● Geometric mean:

[(1.9 x 1.1 x 1.2 x 1.3 x 0.1) ^ 1/5] - 1 =0.2008= -20.08%

Page 9: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

9 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency

● Median (or 50% percentile): middle value separating the

greater and lesser halves of a data set

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

Xsort

= [13, 13, 13, 13, 14, 14, 16, 18, 21]

Page 10: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

10 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency

● Mode: most frequent value in data set

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

Mox = 13

Page 11: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

11 Giuseppe Procaccianti / S2 group / The Green Lab

Central Tendency - Skewness

Page 12: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

12 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion

● Sample variance:

● Standard Deviation:

● Standard Deviation is dimensionally equivalent to the data

Page 13: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

13 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion - three-sigma-rule

"Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG

Page 14: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

14 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion - three-sigma-rule

● Range:

● Coefficient of variation:(in percentage of mean)

● Coefficient of variation only has meaning if all values are positive (ratio scale, not interval scale e.g. temperatures)

Page 15: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

15 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion - example

● Dataset: [100, 100, 100]Mean: 100

● Variance: 0

● Standard Deviation: 0

● Coeff. Variation: 0

● Range: 0

Page 16: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

16 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion - example

● Dataset: [90, 100, 110]Mean: 100

● Sample Variance: 100

● Standard Deviation: 10

● Coeff. Variation: 10%

● Range: 20

Page 17: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

17 Giuseppe Procaccianti / S2 group / The Green Lab

Dispersion - example

● Dataset: [1, 5, 6, 8, 10, 40, 65, 88]Mean: 27.875

● Sample Variance: 1082.69

● Standard Deviation: 32.9

● Coeff. Variation: 1.18%

● Range: 87

Page 18: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

18 Giuseppe Procaccianti / S2 group / The Green Lab

Basic visualizations

Box Plot

Median

3rd quartile

1st quartile

Page 19: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

19 Giuseppe Procaccianti / S2 group / The Green Lab

Basic visualizations

Box Plot

Page 20: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

20 Giuseppe Procaccianti / S2 group / The Green Lab

Basic visualizations

Box Plot

By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

outliers positive skewness

Page 21: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

21 Giuseppe Procaccianti / S2 group / The Green Lab

Dependency: correlation

● Sample correlation coefficient (Pearson):

● Meaningful when comparing paired values/datasets

Page 22: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

22 Giuseppe Procaccianti / S2 group / The Green Lab

Dependency: correlation

● Spearman’s rank correlation coefficient:

● Kendall’s rank correlation coefficient: ○ smaller values○ more accurate on small samples

● Pearson correlation coefficient assumes normally distributed data

Page 23: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

23 Giuseppe Procaccianti / S2 group / The Green Lab

Dependency: example

Age vs. body fat %

● Pearson: r = 0.7921

● Spearman: = 0.7539

● Kendall: = 0.5762

Page 24: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

24 Giuseppe Procaccianti / S2 group / The Green Lab

Basic Visualizations

Scatter Plot

Page 26: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

26 Giuseppe Procaccianti / S2 group / The Green Lab

Correlation does NOT imply causation!

● Spurious Correlations: http://tylervigen.com/

Page 27: The Green Lab - [07-A] Data Analysis

Vrije Universiteit Amsterdam

Thank you!

[email protected]

[email protected] 27 Giuseppe Procaccianti / S2 group / The Green Lab