Data Wrangling and Statistical Analysis
Practicals
Ana Maria Heilman, Ph.D.
Breeding Pipeline DB Mgr
2018
[Figure: workflow diagram. Labels: Data Wrangling, Collect, Clean, Selection, Extraction, Statistical Analysis, Visualize, Model, Interpret. Source: Trevor Bihl, 2017]
OBJECTIVES
• Identify different descriptive statistics and visualizations
used to explore and interpret your data
• Identify the different steps of the data management cycle
• Understand the importance of QAQC in the cleaning of
messy data
Data Wrangling: Data Collection
– “Data wrangling involves:
• Taking raw data
• Extracting it
• Cleaning it
• Developing data features for analysis”
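The four steps in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw text, the column names (plot_id, height_cm, weight_g) and the derived feature are hypothetical and do not come from the course data.

```python
import csv
import io

# Hypothetical raw field-book export; IDs, columns and values are invented.
RAW = """plot_id,height_cm,weight_g
P1, 102 ,35.5
P2,,41.0
P3,98,n/a
P4,110,38.2
"""


def extract(text):
    """Extraction: parse the raw text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: trim whitespace, convert to numbers, drop incomplete rows."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({
                "plot_id": row["plot_id"].strip(),
                "height_cm": float(row["height_cm"]),
                "weight_g": float(row["weight_g"]),
            })
        except ValueError:
            # Missing or non-numeric entry: skip the row here
            # (a real pipeline might flag it for review instead).
            continue
    return cleaned


def add_features(rows):
    """Feature development: derive a new variable from the cleaned columns."""
    for row in rows:
        row["weight_per_cm"] = row["weight_g"] / row["height_cm"]
    return rows


records = add_features(clean(extract(RAW)))
for rec in records:
    print(rec)
```

Rows P2 and P3 are dropped because one of their measurements is missing or not a number; whether to drop or flag such rows is a protocol decision rather than a coding one.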
Data Wrangling: Data Collection
Raw Data
• Real-world data is rarely orderly and clean (Bihl, 2017)
• Must establish standard definitions and protocols for:
– Data collection
– Data entry
QA vs QC
• QA:
– Process oriented to eliminate errors
– Proactive process
– Define SOP (methods, standards, audits, checklist)
• QC:
– Product oriented
– Reactive process
– Follow SOP steps to correct errors
QA – Process oriented
QC – Product oriented
QA = “Set of processes or steps that
ensure protocols developed are
followed to minimize errors in the
data” (Campbell et al. 2013)
QC = “protective process to identify
suspect data after it has been
generated” (Campbell et al. 2013)
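As a rough illustration of QC as a check applied after the data exist, the sketch below flags values that fall outside an allowed range. The records and the range limits are assumptions made up for the example; a real SOP would define its own limits.

```python
# Hypothetical plant-height records (cm); the valid range is an assumed protocol limit.
heights_cm = {"P1": 102.0, "P2": 1020.0, "P3": 98.0, "P4": -5.0}
VALID_RANGE = (20.0, 300.0)


def flag_suspect(values, valid_range):
    """Return the IDs whose value falls outside the allowed range."""
    low, high = valid_range
    return [key for key, value in values.items() if not (low <= value <= high)]


suspect = flag_suspect(heights_cm, VALID_RANGE)
print("Suspect records:", suspect)
```

The flagged records are only marked as suspect; correcting or discarding them would follow the steps laid out in the SOP.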
Quality Assurance/ Quality Control
https://passel.unl.edu/communities/index.php?idinformationmodule=1130447290&topicorder=4&maxto=6&minto=1&idcollectionmodule=1130274258
Data Life Cycle
[Figure: data life cycle stages: Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze]
• Plan:
– Description of the data
– Management steps (SOPs)
– Accessibility
• Collect:
– Field books
– Tablets/iPads
– Sensors
• Assure:
– QA through automated checks
Descriptive Statistics
• Descriptive statistics and visualizations should be the
first methods you use to understand your data
• Descriptive Statistics:
– Compute basic quantitative information
• Means, variances
• Histograms and Pareto Charts (distribution of data)
Descriptive Statistics
Practicals using JMP and Excel
Descriptive Statistics: Histograms
• Can instantly give us a subjective assessment about a
data set
• A histogram reflects:
– Distribution of the data, based on counting the number of
observations within each bin range
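The counting behind a histogram can be sketched without any plotting tool. The weights below are hypothetical (they are not the values in Big Class.jmp), and the bin width of 10 is just one possible choice.

```python
from collections import Counter

# Hypothetical weights; not the values in Big Class.jmp.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128, 145]
BIN_WIDTH = 10


def bin_counts(values, width):
    """Count observations in bins [k*width, (k+1)*width), keyed by bin start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


hist = bin_counts(weights, BIN_WIDTH)
for start, count in hist.items():
    # Text histogram: one asterisk per observation in the bin.
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'*' * count}")
```

Changing BIN_WIDTH changes the shape of the printed histogram, which is why the assessment a histogram gives is subjective.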
Descriptive Statistics: Histograms
• Open Big Class.jmp
• Select Analyze > Distribution
• Select weight and age as the Y
columns (this tells JMP
which columns of data to
analyze)
• Click OK
Big Class.jmp
Descriptive Statistics: Histograms
• Below the histogram are the
quantiles, including the minimum,
maximum, and median values for
this data set
– Quantiles give us an idea of
how the data is distributed
Big Class.jmp
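For readers who want to see quantiles computed outside JMP, here is a minimal sketch using Python's statistics module. The weights are hypothetical, and the "inclusive" interpolation method is an assumption: statistical packages use different quantile conventions, so the results may not match what JMP reports for the same data.

```python
import statistics

# Hypothetical weights; not the values in Big Class.jmp.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128, 145]

# Quartiles split the sorted data into four parts of roughly equal size.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")

print("min:   ", min(weights))
print("Q1:    ", q1)
print("median:", q2)
print("Q3:    ", q3)
print("max:   ", max(weights))
```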
Descriptive Statistics: Histograms
• Histogram Bin:
– Click on the red triangle next to
Weight
– Select Histogram Options
– Set Bin Width
– Enter 10 as the new bin width
Big Class.jmp
Descriptive Statistics: Histograms
• Below the quantiles we have
the Summary Statistics
– Mean: Average of a set of numbers
– Std Dev: Standard deviation (spread of the data around the mean)
– N: Number of observations
Big Class.jmp
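The three summary statistics can be reproduced with the Python standard library. The weights are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator) is the one wanted.

```python
import statistics

# Hypothetical weights; not the values in Big Class.jmp.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128, 145]

n = len(weights)                     # N: number of observations
mean = statistics.mean(weights)      # Mean: sum of the values divided by N
std_dev = statistics.stdev(weights)  # Std Dev: sample standard deviation (n - 1)

print(f"N = {n}, Mean = {mean:.2f}, Std Dev = {std_dev:.2f}")
```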
Descriptive Statistics: Histograms
• Distribution of categorical
data:
– Example using Grocery
Purchases.jmp
– The frequency tables do not
correspond to quantiles
• They show how often each category
appears
Grocery Purchases.jmp
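A frequency table for a categorical column is a count per category. The sketch below uses invented category labels, not the contents of Grocery Purchases.jmp.

```python
from collections import Counter

# Hypothetical categorical column; labels are invented for illustration.
purchases = [
    "dairy", "produce", "bakery", "produce", "dairy",
    "produce", "meat", "dairy", "produce", "bakery",
]

counts = Counter(purchases)
total = sum(counts.values())

# Print each category with its count and its share of all observations.
for category, count in counts.most_common():
    print(f"{category:<8} {count:>2}  {count / total:.2f}")
```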
Descriptive Statistics: Histograms
• Box plots:
– A box plot is presented immediately next to the
histogram
• A box plot displays a representation of
the distribution whereby:
– Box = location of the 1st and 3rd
quartiles
– Line inside = median
– Whiskers (above/below) = extent of the
data lying within 1.5 × the interquartile
range of the box; points beyond the
whiskers are possible outliers
– Diamond = the mean, with its upper
and lower 95% confidence limits
– Bracket (red) = densest 50% of the
data
Big Class.jmp
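The box, whisker, and outlier rules can be worked through numerically. This is a sketch on hypothetical weights, assuming the common 1.5 × IQR whisker rule and Python's "inclusive" quartile method; JMP may compute quartiles slightly differently. The confidence diamond and the red bracket are not computed here.

```python
import statistics

# Hypothetical weights with one unusually large value.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128, 172]

q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Fences sit 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = [w for w in weights if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual point.
outliers = [w for w in weights if w < lower_fence or w > upper_fence]

print("box:", q1, "to", q3, "median:", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```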
Descriptive Statistics: Histograms
• Quantile Box Plot:
– The outlier box plot conveys
information about the
distribution of the data BUT does not
show the quantiles
– To add a quantile box plot:
• Click on the red triangle next to
weight
• Select Quantile Box Plot
– Q1 = 91.25 (first quartile)
– Q2 = 105 (median)
– Q3 = 115.75 (third quartile)
Big Class.jmp
Descriptive Statistics: Histograms
• Stem and Leaf Plots:
– Another approach to visualizing
the distribution of the data
– Uses the same frequency bins
– Retains the quantifiable
information
– To add a stem and leaf plot:
• Click on the red triangle next to
weight > select Stem and Leaf
• The value 6 | 4 indicates a stem
of 6 (a value of 60) with a leaf
of 4, i.e. the observation 64
Big Class.jmp
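A stem and leaf display is easy to build by hand, which shows why it retains the individual values. The sketch below assumes integer data with tens as stems and units as leaves, and uses hypothetical weights rather than the Big Class.jmp values.

```python
from collections import defaultdict

# Hypothetical weights; not the values in Big Class.jmp.
weights = [64, 71, 79, 84, 85, 92, 95, 99, 104, 105, 112, 112, 128, 145]


def stem_and_leaf(values):
    """Group integer values by stem (tens) and collect the leaves (units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


plot = stem_and_leaf(weights)

# Print from the largest stem down, including empty stems, one row per stem.
for stem in range(max(plot), min(plot) - 1, -1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Each row is one bin of width 10, so the row lengths trace the same shape as a histogram while every original value can still be read back.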
Descriptive Statistics: Histograms
• Pareto Charts:
– “Problem solving tools that show
causes and quantiles in an ordered
manner” (Burr, 1990 cited by Bihl,
2017)
– Data is organized by group and
quantity from the largest to the smallest
– It includes a cumulative count that
represents the overall contribution to
the total number of occurrences (Bihl,
2017)
Failure2.jmp
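The ordering and cumulative count behind a Pareto chart can be sketched as follows. The cause names and counts are invented for illustration and are not taken from Failure2.jmp.

```python
from collections import Counter

# Hypothetical failure records; causes and counts are invented.
failures = (
    ["contamination"] * 14
    + ["oxide defect"] * 9
    + ["metallization"] * 4
    + ["silicon defect"] * 2
    + ["miscellaneous"] * 1
)

# Causes ordered from the largest count to the smallest.
counts = Counter(failures).most_common()
total = sum(count for _, count in counts)

# Build rows of (cause, count, cumulative percent of all occurrences).
pareto = []
running = 0
for cause, count in counts:
    running += count
    pareto.append((cause, count, 100 * running / total))

for cause, count, cum_pct in pareto:
    print(f"{cause:<15} {count:>3} {cum_pct:6.1f}%")
```

The bars of the chart correspond to the count column and the cumulative line to the last column, which always ends at 100%.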
Descriptive Statistics: Histograms
• Pareto Charts:
– Select the file called
Failure2.jmp
– Click Analyze > Quality and
Process > Pareto Plot
– Select failure and click Y, Cause.
– Select clean and click X,
Grouping.
– Select N and click Freq.
– Click OK.
Failure2.jmp
Descriptive Statistics: Histograms
• Rearrange the order of the
plots by clicking the title (after)
in the first tile and dragging it
to the title of the next tile
(before)
– The order of the causes changes to
reflect the order based on the
first cell
• A reduction in the oxide
defects is clear after cleaning
Failure2.jmp
Data Visualization Tools
Practicals using JMP and Excel
Data Visualization Tools
Types of Visualizations
• Scatter Plots
• Charts
• Multidimensional plots
– Parallel Plots
– Cell Plots
• Multivariate and Correlations Tool
– Correlations Table
– Heat maps
– Simple Statistics
• Graph Builder and Custom Figure
Questions?