
Data Wrangling and Statistical Analysis Practicals

Ana Maria Heilman, Ph.D

Breeding Pipeline DB Mgr

2018

[Figure: workflow diagram. Labels: Collect, Clean, Data Wrangling, Selection, Extraction, Statistical Analysis, Interpret, Visualize, Model. Source: Trevor Bihl, 2017]


OBJECTIVES

• Identify different descriptive statistics and visualizations used to explore and interpret your data

• Identify the different steps of the data management cycle

• Understand the importance of QAQC in the cleaning of messy data


Data Wrangling: Data Collection

– “Data wrangling involves:

• Taking raw data

• Extracting it

• Cleaning it

• Developing data features for analysis”

Source: Trevor Bihl, 2017


Data Wrangling: Data Collection

Raw Data

• Real-world data is rarely orderly and clean (Bihl, 2017)

• Must establish standard definitions and protocols

Data collection

Data entry



QA vs QC

QA:

– Process oriented to eliminate errors

– Proactive process

– Define SOP (methods, standards, audits, checklist)

QC:

– Product oriented

– Reactive process

– Follow SOP steps to correct errors

QA = “Set of processes or steps that ensure protocols developed are followed to minimize errors in the data” (Campbell et al. 2013)

QC = “protective process to identify suspect data after it has been generated” (Campbell et al. 2013)

Quality Assurance / Quality Control

https://passel.unl.edu/communities/index.php?idinformationmodule=1130447290&topicorder=4&maxto=6&minto=1&idcollectionmodule=1130274258
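The QA/QC distinction can be sketched in code. This is a minimal illustration, not a prescribed SOP: the trait name `plant_height_cm`, the plausible range, and the records are all hypothetical. The range plays the QA role (a standard defined up front), and the flagging function plays the QC role (identifying suspect data after it has been generated).

```python
# Hypothetical SOP rule: plausible range for a made-up trait, in cm.
PLAUSIBLE_RANGE = {"plant_height_cm": (10.0, 400.0)}


def flag_suspect(records, trait, limits=PLAUSIBLE_RANGE):
    """Return the records whose value for `trait` is missing or out of range."""
    low, high = limits[trait]
    suspect = []
    for rec in records:
        value = rec.get(trait)
        if value is None or not (low <= value <= high):
            suspect.append(rec)
    return suspect


# Made-up field records, for illustration only.
records = [
    {"plot": 1, "plant_height_cm": 182.0},
    {"plot": 2, "plant_height_cm": 1820.0},  # likely a misplaced decimal
    {"plot": 3, "plant_height_cm": None},    # missing entry
    {"plot": 4, "plant_height_cm": 175.5},
]

suspect = flag_suspect(records, "plant_height_cm")
print([rec["plot"] for rec in suspect])
```

Flagged records are only marked as suspect here; whether to correct or discard them is a decision for the SOP.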


Data Life Cycle

[Figure: data life cycle diagram with stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]

• Plan:

– Description of the data

– Management steps (SOPs)

– Accessibility

• Collect:

– Field books

– Tablets/iPads

– Sensors

• Assure:

– QA through automatic check ups


Descriptive Statistics

• Descriptive statistics and visualizations should be the first methods you use to understand your data

• Descriptive Statistics:

– Compute basic quantitative information

• Means, variances

• Histograms and Pareto Charts (distribution of data)


Descriptive Statistics

Practicals using JMP and Excel


Descriptive Statistics: Histograms

• Can instantly give us a subjective assessment of a data set

• A histogram reflects:

– Distribution of the data, based on counting the number of observations within range bins
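Counting observations within range bins can be sketched outside JMP as follows. This is a minimal sketch: the weights are made-up values (not the Big Class data), and `histogram_counts` is a hypothetical helper that assumes half-open bins of equal width.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count observations per bin; each bin is [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up weights, for illustration only (not the Big Class data).
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128]

# Print a simple text histogram, one row per bin.
for start, n in histogram_counts(weights, bin_width=10).items():
    print(f"{start:>3}-{start + 10:<3} {'*' * n}")
```

Changing `bin_width` changes how coarse the picture is, which is why the assessment a histogram gives is subjective.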


Descriptive Statistics: Histograms

• Open Big Class.jmp

• Select Analyze > Distribution

• Select weight and age for Y, Columns (this tells JMP which columns of data to analyze)

• Click OK

Big Class.jmp


Descriptive Statistics: Histograms

• Below the histogram is the quantile info: the min, max, and median values for this data set

– Quantiles give us an idea of how the data is distributed

Big Class.jmp
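The same kind of quantile summary can be sketched with Python's standard library. The weights are made-up values, not the Big Class data, and the `method="inclusive"` setting is an assumption: quantile conventions differ between packages, so JMP may report slightly different quartiles for the same data.

```python
import statistics

# Made-up weights, for illustration only (not the Big Class data).
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128]

# Three cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")

print("min   ", min(weights))
print("Q1    ", q1)
print("median", median)
print("Q3    ", q3)
print("max   ", max(weights))
```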


Descriptive Statistics: Histograms

• Histogram Bin:

– Click on the red triangle next to Weight

– Select Histogram Options > Set Bin Width

– Type 10 as the new bin width

Big Class.jmp


Descriptive Statistics: Histograms

• Below the quantiles we have the Summary Statistics

– Mean: average of a set of numbers

– Std Dev: standard deviation

– N: number of observations

Big Class.jmp
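These summary statistics are straightforward to compute by hand. A minimal sketch follows, using made-up weights (not the Big Class data) and assuming the sample standard deviation, which divides by N - 1.

```python
import statistics

# Made-up weights, for illustration only (not the Big Class data).
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128]

n = len(weights)                        # N: number of observations
mean = statistics.mean(weights)         # Mean: sum of values divided by N
variance = statistics.variance(weights) # sample variance (N - 1 denominator)
std_dev = statistics.stdev(weights)     # Std Dev: square root of the variance

print(f"Mean     {mean:.2f}")
print(f"Std Dev  {std_dev:.2f}")
print(f"N        {n}")
```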


Descriptive Statistics: Histograms

• Distribution of categorical data:

– Example using Grocery Purchases.jmp

– The frequency tables do not correspond to quantiles

• They show how often each category appears

Grocery Purchases.jmp
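A frequency table for categorical data can be sketched as a simple count per category. The category names and observations below are made up for illustration and are not taken from Grocery Purchases.jmp.

```python
from collections import Counter

# Made-up categorical observations, for illustration only.
purchases = ["dairy", "produce", "dairy", "bakery", "produce", "dairy"]

freq = Counter(purchases)
total = sum(freq.values())

# One row per category: name, count, and share of all observations.
for category, count in freq.most_common():
    print(f"{category:<8} {count:>3} {count / total:.2f}")
```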


Descriptive Statistics: Histograms

• Box plots:

– Presented immediately next to the histogram is a box plot

• A box plot displays a representation of the distribution whereby:

– Box = location of the 1st and 3rd quartiles

– Line inside = median

– Whiskers (above/below) = extent of the data within 1.5× the interquartile range beyond the box

– Diamond = the mean and its upper and lower 95% confidence limits

– Bracket (red) = densest 50% of the data

Big Class.jmp
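The box and whisker rules can be sketched numerically. This is a minimal sketch with made-up weights (not the Big Class data), assuming the "inclusive" quartile convention; it covers the box, whiskers, and outliers only, not the confidence diamond or the densest-50% bracket.

```python
import statistics

# Made-up weights with one extreme value, for illustration only.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 172]

# Box: first and third quartiles, with the median as the line inside.
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond each end of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences.
inside = [w for w in weights if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual outlier point.
outliers = [w for w in weights if w < lower_fence or w > upper_fence]

print("box     ", q1, median, q3)
print("whiskers", whisker_low, whisker_high)
print("outliers", outliers)
```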


Descriptive Statistics: Histograms

• Quantile Box Plot:

– The outlier box plot conveys information about the distribution of the data BUT not the quantiles

– To add a quantile box plot:

• Click on the red triangle next to weight

• Select Quantile Box Plot

– Q1 = 91.25 (first quartile)

– Q2 = 105 (median)

– Q3 = 115.75 (third quartile)

Big Class.jmp


Descriptive Statistics: Histograms

• Stem and Leaf Plots:

– Another approach to visualize the distribution of the data

– Uses the same frequency bins

– Retains the quantifiable information

– To add a stem and leaf plot:

• Click on the red triangle next to weight > select Stem and Leaf

• The value 6 | 4 indicates a stem of 60 and a leaf of 4, that is, the observation 64

Big Class.jmp
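The stem | leaf reading can be sketched in a few lines. This is a minimal sketch with made-up weights (not the Big Class data); `stem_and_leaf` is a hypothetical helper that assumes non-negative integers with the tens as the stem and the ones as the leaf, and it lists stems in ascending order, which may differ from JMP's layout.

```python
def stem_and_leaf(values):
    """Group integers by stem (value // 10) and collect their leaves (value % 10)."""
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    return stems


# Made-up weights, for illustration only (not the Big Class data).
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128]

# Each row keeps the actual digits, so every observation can be read back.
for stem, leaves in stem_and_leaf(weights).items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

The first row printed is `6 | 4`, the observation 64, and the row lengths give the same shape a histogram with bin width 10 would.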


Descriptive Statistics: Histograms

• Pareto Charts:

– “Problem solving tools that show causes and quantiles in an ordered manner” (Burr, 1990 cited by Bihl, 2017)

– Data is organized by group and quantity from the largest to the smallest

– It includes a cumulative count that represents the overall contribution to the total number of occurrences (Bihl, 2017)

Failure2.jmp
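The ordering and cumulative count behind a Pareto chart can be sketched as follows. The counts are made up for illustration and are not the values in Failure2.jmp; the cause names are likewise only illustrative.

```python
# Made-up defect counts, for illustration only (not the Failure2.jmp data).
counts = {"corrosion": 2, "contamination": 14, "doping": 1, "oxide defect": 8}

# Order the causes from the largest count to the smallest.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())

# Build rows of (cause, count, cumulative percent of all occurrences).
pareto = []
running = 0
for cause, n in ordered:
    running += n
    pareto.append((cause, n, round(100 * running / total, 1)))

for cause, n, cum_pct in pareto:
    print(f"{cause:<14} {n:>3} {cum_pct:>6.1f}%")
```

The cumulative column shows how few causes account for most occurrences, which is the reading a Pareto chart is designed to support.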


Descriptive Statistics: Histograms

• Pareto Charts:

– Select the file called Failure2.jmp

– Click Analyze > Quality and Process > Pareto Plot

– Select failure and click Y, Cause.

– Select clean and click X, Grouping.

– Select N and click Freq.

– Click OK.

Failure2.jmp


Descriptive Statistics: Histograms

• Rearrange the order of the plots by clicking the title (after) in the first tile and dragging it to the title of the next tile (before)

– Order of the causes changes to reflect the order based on the first cell

• A reduction in the oxide defects is clear after cleaning

Failure2.jmp


Data Visualization Tools

Practicals using JMP and Excel


Data Visualization Tools

Types of Visualizations

• Scatter Plots

• Charts

• Multidimensional plots

– Parallel Plots

– Cell Plots

• Multivariate and Correlations Tool

– Correlations Table

– Heat maps

– Simple Statistics

• Graph Builder and Custom Figure
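As an illustration of what a correlations table contains, the sketch below computes pairwise Pearson correlations by hand. The measurements and column names are made up, and `pearson` is a hypothetical helper written for this example, not a JMP function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up measurements, for illustration only.
data = {
    "height": [59, 61, 62, 64, 66, 68],
    "weight": [84, 92, 99, 104, 112, 128],
    "age": [12, 13, 13, 14, 15, 17],
}

# Every pair of columns gets one coefficient; the diagonal is always 1.
names = list(data)
table = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

print(" " * 8 + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<8}" + "".join(f"{table[a][b]:>8.2f}" for b in names))
```

A heat map is the same table with each cell colored by the size of its coefficient.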


Questions?