Visualizing and Exploring Data


Page 1: Visualizing and Exploring Datasrihari/CSE626/Slide/Ch3-part1 1.pdfVisualizing and Exploring Data. Visual Methods for finding ... • For higher dimensions -- Principal Components Analysis

Visualizing and Exploring Data


Visual Methods for finding structures in data

• Power of human eye/brain to detect structures
– Product of eons of evolution

• Display data in ways that capitalize on human pattern processing abilities

• Can find unexpected relationships
– Limitation: very large data sets


Exploratory Data Analysis

• Explore the data without any clear ideas of what we are looking for

• EDA techniques are
– Interactive
– Visual

• Many graphical methods for low-dimensional data

• For higher dimensions: Principal Components Analysis


Topics in Visualization

1. Summarizing Data: Mean, Variance, Standard Deviation, Skewness
2. Tools for Single Variables (histogram)
3. Tools for Pairs of Variables (scatterplot)
4. Tools for Multiple Variables
5. Principal Components Analysis
– Reduced number of dimensions


1. Summarizing the data

• Centrality
– The mean minimizes the sum of squared errors to all samples
– If there are n data values, the mean is the value such that the sum of n copies of the mean equals the sum of the data values

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x(i)$

• Measures of Location
– Mean is a measure of location
– Median (value that has an equal number of points above and below)
– Quartile (value greater than a quarter of the data points)


Measures of Dispersion, or Variability

Variance: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2$
– Average squared error in the mean representing the data

Standard Deviation: $\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} [x(i) - \mu]^2}$

Skewness: $\frac{\sum (x(i) - \hat{\mu})^3}{\left( \sum (x(i) - \hat{\mu})^2 \right)^{3/2}}$
– Measures how much the data is one-sided (single long tail)
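
The summary statistics on these two slides can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the 1/(n-1) normalization for the variance and the ratio form of the skewness; the function name `summarize` and the sample values are hypothetical.

```python
import numpy as np

def summarize(x):
    """Return mean, variance, standard deviation, and skewness of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.sum() / n                       # (1/n) * sum of x(i)
    dev = x - mean                           # deviations from the mean
    variance = (dev ** 2).sum() / (n - 1)    # assumed 1/(n-1) normalization
    std_dev = np.sqrt(variance)
    # Skewness: sum of cubed deviations over (sum of squared deviations)^(3/2)
    skewness = (dev ** 3).sum() / (dev ** 2).sum() ** 1.5
    return mean, variance, std_dev, skewness

# Hypothetical sample with a single long right tail
sample = [1.0, 2.0, 2.0, 3.0, 12.0]
mean, variance, std_dev, skewness = summarize(sample)
median = np.median(sample)
```

For this sample the long right tail pulls the mean above the median and gives a positive skewness.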


2. Tools for Displaying Single Variables

• Basic display for univariate data is the histogram
– Number of values of the variable that lie in consecutive intervals
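
A histogram is just a table of counts per interval, which `numpy.histogram` computes directly. The sketch below uses hypothetical data (weeks of card usage per customer) and an assumed choice of ten equal-width bins.

```python
import numpy as np

# Hypothetical data: number of weeks (out of 52) in which each customer used a card
weeks_used = np.array([0, 0, 0, 1, 3, 10, 25, 40, 50, 50, 51])

# Ten consecutive, equal-width intervals covering 0..52 (the bin choice is an assumption)
counts, edges = np.histogram(weeks_used, bins=10, range=(0, 52))

# Text rendering: one row per interval, one mark per value that falls in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

The number of bins is a free choice; too few bins hide structure, too many leave mostly empty intervals.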


Histogram of supermarket credit card usage

[Figure: histogram; horizontal axis: weeks. Annotations: "Many did not use it at all"; "These used it every week except holidays".]


Histogram of Diastolic blood pressure of individuals (UCI ML archive)

Zero BP means data missing


Smoothing estimates

• Kernel Function K
• Estimated density at point x is

$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( \frac{x - x(i)}{h} \right)$

• Gaussian Kernel with std dev h

$K(t, h) = C e^{-\frac{1}{2} (t/h)^2}$, where $t = x - x(i)$
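
A kernel estimate with a Gaussian kernel can be sketched as follows. The constant C is not given on the slide; the code assumes C = 1/(h·sqrt(2π)), which makes each kernel integrate to 1 over t. The function names, sample values, and grid are hypothetical.

```python
import numpy as np

def gaussian_kernel(t, h):
    """K(t, h) = C * exp(-0.5 * (t / h)**2), with C chosen so K integrates to 1 over t."""
    c = 1.0 / (h * np.sqrt(2.0 * np.pi))    # assumed normalizing constant
    return c * np.exp(-0.5 * (t / h) ** 2)

def kernel_density(x_eval, data, h):
    """Estimated density at each point of x_eval: average of kernels centered on the data."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    t = x_eval[:, None] - data[None, :]      # t = x - x(i) for every (x, x(i)) pair
    return gaussian_kernel(t, h).mean(axis=1)

data = np.array([1.0, 1.5, 2.0, 2.2, 6.0])   # hypothetical sample
grid = np.linspace(-5.0, 12.0, 1701)
spiky = kernel_density(grid, data, h=0.1)    # small h: narrow spikes at the data points
smooth = kernel_density(grid, data, h=1.0)   # larger h: smoother estimate
```

Comparing `spiky` and `smooth` on the same grid shows the effect of the bandwidth h on the estimate.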


Kernel Estimates with different values of h

• Small values of h lead to spiky estimates; larger values give higher smoothing
• Data is right skewed with a hint of multimodality

$K(t, h) = C e^{-\frac{1}{2} (t/h)^2}$


3. Tools for Displaying Relationship between two variables

• Box Plots
• Scatter Plots
• Contour Plots
• Time as one of the two variables


Box Plot

[Figure: box plot. Labels: Median; Upper Quartile; Lower Quartile; 1.5 times inter-quartile range.]
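
The quantities labeled on a box plot can be computed with `numpy.percentile`. This sketch assumes the common convention that points lying more than 1.5 times the inter-quartile range beyond a quartile are flagged individually; the function name and sample are hypothetical.

```python
import numpy as np

def box_plot_stats(x):
    """Median, quartiles, inter-quartile range, and points beyond 1.5 * IQR."""
    x = np.asarray(x, dtype=float)
    lower_q, median, upper_q = np.percentile(x, [25, 50, 75])
    iqr = upper_q - lower_q
    low_fence = lower_q - 1.5 * iqr          # assumed convention for the lower limit
    high_fence = upper_q + 1.5 * iqr         # assumed convention for the upper limit
    outliers = x[(x < low_fence) | (x > high_fence)]
    return {
        "median": median,
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
        "iqr": iqr,
        "outliers": outliers.tolist(),
    }

# Hypothetical sample with one extreme value
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Other quartile definitions exist; `numpy.percentile` uses linear interpolation by default, so small samples can give slightly different quartiles than a textbook rule.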


[Figure: box plots of multiple variables for two groups: Healthy, Diabetic.]


Scatterplot

Credit card repayment data

• Highly correlated data
• Significant number depart from pattern: worth investigating


Scatterplot Disadvantages

1. With a large number of data points, reveals little structure
2. Can conceal overprinting, which can be significant for multimodal data


Contour plot

1. Overcomes some scatterplot problems
– Unimodality can be seen: not apparent in scatterplot
2. Requires a 2-D density estimate to be constructed with a 2-D kernel
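
The 2-D density estimate behind a contour plot can be sketched by evaluating a 2-D kernel on a grid. The code below assumes a product of two Gaussian kernels with a shared bandwidth h; the function name, the synthetic data, and the grid are hypothetical.

```python
import numpy as np

def density_2d(points, grid_x, grid_y, h):
    """2-D kernel density estimate on a grid, using a product of Gaussian kernels."""
    points = np.asarray(points, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)         # each has shape (len(grid_y), len(grid_x))
    dx = gx[..., None] - points[:, 0]            # x offsets from every grid node to every point
    dy = gy[..., None] - points[:, 1]            # y offsets from every grid node to every point
    kernel = np.exp(-0.5 * (dx ** 2 + dy ** 2) / h ** 2) / (2.0 * np.pi * h ** 2)
    return kernel.mean(axis=-1)                  # average kernel value over the data points

rng = np.random.default_rng(0)
points = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))   # synthetic unimodal data
grid_x = np.linspace(-4.0, 4.0, 81)
grid_y = np.linspace(-4.0, 4.0, 81)
density = density_2d(points, grid_x, grid_y, h=0.5)
```

The resulting `density` array is what a contouring routine (for example matplotlib's `contour`) would draw level curves of.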


Display when one of the variables is time

[Figure: time-series plots. Annotation: "Annual fees introduced". Horizontal axis: Jan 1963 to Dec 1970. Annotation: "Peaks in early and late summer and around new year".]


Tools for Displaying More than Two Variables

• Scatter plots for all pairs of variables
• Trellis Plot
• Parallel Coordinates Plot


More than two variables

• Sheets of Paper and Computer screens are fine for two variables

• Need projections from higher-dimensional data to 2-D plane

• Methods
– Examine all pairs of variables
• Scatterplot matrix
• Trellis plot
• Icons


CPU performance Scatter Plot Matrix

209 CPU data:
• Cycle Time
• Minimum Memory
• Maximum Memory
• Cache Size (Kb)
• Minimum Channels
• Maximum Channels
• Relative Performance
• Estimated rel perf (wrt IBM)

[Figure: scatter plot matrix. Panel annotations: "Independent"; "Correlated".]


Disadvantage of Scatter Plot Matrices

• Scatter Plot Matrices are multiple bivariate solutions
• Not a multivariate solution
• Such projections sacrifice information
– 3 variables, 8 cubes: alternately empty and full
– Each 1-D and 2-D projection is uniformly distributed!

[Figure: 2-D projection]
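
The eight-cube example can be reproduced numerically. The sketch below is an illustration under assumptions: points are sampled uniformly in a 2 x 2 x 2 block and kept only in alternate unit cubes (a 3-D checkerboard); the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 2.0, size=(20000, 3))
cells = np.floor(candidates).astype(int)     # each coordinate falls in cell 0 or cell 1
keep = cells.sum(axis=1) % 2 == 0            # keep alternate cubes only
points, cells = candidates[keep], cells[keep]

# Counts per cube in 3-D: half of the 8 cubes are empty
counts_3d = np.zeros((2, 2, 2), dtype=int)
np.add.at(counts_3d, (cells[:, 0], cells[:, 1], cells[:, 2]), 1)

# Counts per cell in each 2-D projection, obtained by summing out one variable
counts_xy = counts_3d.sum(axis=2)
counts_xz = counts_3d.sum(axis=1)
counts_yz = counts_3d.sum(axis=0)

print("empty cubes in 3-D:", int((counts_3d == 0).sum()))
print("2-D projection counts (x, y):", counts_xy.tolist())
```

Every pairwise projection fills its four cells about equally, so no panel of a scatter plot matrix would reveal the empty cubes.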


Trellis Plot

• Rather than displaying scatter plot for each pair of variables

• Fix a particular pair of variables and produce a series of scatter plots, histograms, time series plots, contour plots etc., one for each level of the other (conditioning) variables


Trellis Plot (with scatterplots)

[Figure: panels by sex (Male, Female) and age (Younger, Older). Axes: epileptic seizures in a 2-week period and epileptic seizures in a later 2-week period. Each panel shows a best fit line.]
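
The conditioning behind a trellis plot can be sketched without any plotting: split the data by the conditioning variables and fit a line within each panel. All data below are synthetic and hypothetical (they are not the seizure data from the slide), and the variable names are assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 200
sex = rng.choice(["Male", "Female"], size=n)          # first conditioning variable
age_group = rng.choice(["Younger", "Older"], size=n)  # second conditioning variable
first_count = rng.poisson(8.0, size=n).astype(float)  # synthetic count in one period
later_count = 0.8 * first_count + rng.normal(0.0, 1.0, size=n)  # synthetic later count

panels = {}
for s, a in product(["Male", "Female"], ["Younger", "Older"]):
    mask = (sex == s) & (age_group == a)              # rows that belong to this panel
    slope, intercept = np.polyfit(first_count[mask], later_count[mask], deg=1)
    panels[(s, a)] = (int(mask.sum()), slope, intercept)

for key, (size, slope, intercept) in panels.items():
    print(key, size, round(slope, 2), round(intercept, 2))
```

Each entry of `panels` corresponds to one scatterplot in the trellis, with its own best fit line.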


Icon Plot

• Star Plot: each direction corresponds to a variable; length corresponds to a value
• 53 samples of minerals, 12 chemical properties
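
The geometry of one star icon can be sketched as follows. The code assumes equally spaced directions and that each value is divided by a per-variable maximum so lengths lie in [0, 1]; the function name and the numbers are hypothetical.

```python
import numpy as np

def star_vertices(values, max_values):
    """Return the (x, y) end point of each spoke of a star icon for one sample."""
    values = np.asarray(values, dtype=float)
    radii = values / np.asarray(max_values, dtype=float)          # spoke length in [0, 1]
    angles = 2.0 * np.pi * np.arange(values.size) / values.size   # one direction per variable
    return np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))

# Hypothetical sample with four variables and their assumed maxima
sample = [2.0, 5.0, 1.0, 4.0]
max_values = [4.0, 5.0, 2.0, 8.0]
vertices = star_vertices(sample, max_values)
```

Joining the vertices in order gives the star outline; one such icon is drawn per sample.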


Parallel Coordinates Plot

• Each path represents an individual
• Each count represents a 2-week period
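
The paths of a parallel coordinates plot can be sketched as one row of axis heights per individual. Rescaling each axis independently to [0, 1] is an assumption (a plot of counts could equally use one shared scale); the function name and the counts are hypothetical.

```python
import numpy as np

def parallel_paths(data):
    """Rescale each column to [0, 1]; row i gives the height of path i on each axis."""
    data = np.asarray(data, dtype=float)
    low, high = data.min(axis=0), data.max(axis=0)
    span = np.where(high > low, high - low, 1.0)   # avoid division by zero for constant columns
    return (data - low) / span

# Hypothetical counts: 3 individuals (rows) over 4 consecutive 2-week periods (columns)
counts = np.array([[11, 14, 9, 8],
                   [3, 5, 3, 3],
                   [6, 6, 7, 4]])
paths = parallel_paths(counts)
```

Drawing row i of `paths` as a polyline across equally spaced vertical axes gives the path for individual i.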