data visualization at codetalks 2016

Post on 21-Feb-2017

118 Views

Category:

Data & Analytics

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

VisualizingHigh-dimensionalData

Dr. Stefan KühnLead Data Scientist

code.talks - 30.09.2016

Overview

• Data Visualization as you might know it• Main Properties of Graphics (and Humans)• A short story about Charts• Pair Plots and Correlations

• Data Visualization as you might not know it• Fundamental Problems• SVD, t-SNE and other approximations• Principal Components and Principal Curves • Parallel Coordinates• The Grand Tour

2

Takeaways - hopefully ;-)

• Data Visualization is complicated• There is always an approximation• There is always a bias• There is always a misinterpretation

• Data Visualization is simple• Lots of packages available• Lots of studies + literature• Lots of examples to learn from

3

Data Visualization Basics

4

The Modes of Perception

5

X

X

X

X

X

XX

O

O

O O

O

O

O

OX

X

XX

O

O

O

OX

X

X

X

X

X

XX

O

O

O O

O

O

O

O

X

X X

X

XO

OO

Fast Slow

Find the outlier

The Modes of Perception

6

• Pre-attentive• fast• parallel processing• effortless

• Pattern recognition• semi-fast• governed by laws of per by

• Attentive• slow• sequential• high effort (attention a very limited resource)

Main Properties of Graphics

7

Category ExamplePosition

Shape

Size

Color

Orientation (Line)

Length (Line)

Type and Size (Line)

Brightness

Main Properties of Graphics Humans

8

Category Amount of pre-attentive informationPosition very high

Shape ———

Size approx. 4

Color approx. 8

Orientation (Line) approx. 4

Length (Line) ———

Type and Size (Line) ———

Brightness approx. 8

Pre-attentive perception

9

• Position• fast• effective• high number of different positions

• Color• use with care

• Shape• Orientation

Pre-attentive perception is effortless.Exploit this as much as you can.

Pattern detection

10

„It is interesting to note that our brain […] subconsciously always prefers meaningful situations and objects.“

• Emergence• Reiification• Multi-stability• Invariance

Pattern detection can be trained.Exploit this for frequent visualizations.

What is this?

11

What is this?

12

What is this?

13

Laws of Gestalt

14

„It is interesting to note that our brain, inaccordance with the laws of Gestalt, subconsciously always prefers meaningful situations and objects.“

Accuracy of Graphics

15

Square Pie vs Stacked Bar vs Pie vs Donut

What do you think?

https://eagereyes.org/blog/2016/a-reanalysis-of-a-study-about-square-pie-charts-from-2009

Demo

16

Beyond the Basics

17

Fundamental Problems

• No accurate method in higher dimensions• Approximations methods• „Simulated“ dimensions (color, size, shape)• Animations?

• No notion of quality or accuracy for Visualizations• Information Theory?• „Stability“?

All Visualizations are wrong, but some are useful.

18

Approximation methods

• Pair Plots• Axis-aligned projections • Interpretable in terms of original variables

• Singular Value Decomposition• Optimal with respect to 2-norm (Euclidean norm)

and supremum norm• Comes with an error estimate

• Other methods• Stochastic Neighborhood Embedding (t-SNE)• „Manifold Learning“

19

Demo

20

Manifold Learning Methods

• Locally Linear Embedding• Neighborhood-preserving embedding

• Isomap• quasi-isometric

• Multi-dimensional scaling• quasi-isometric

• Spectral Embedding• Spectral clustering based on similarity

• Stochastic Neighborhood Embedding (SNE, t-SNE)• preserves probabilities

• Local Tangent Space Alignement (LTSA)

21

Local Tangent Space Alignement

22

Principal Components and Curves

• Principal Component Analysis• orthogonal decomposition based on SVD• linear in all variables• tries to preserve variance

• Principal Curves• minimize the Sum of Squared Errors with respect

to all variables (as PCA, preserve variance)• nonlinear• smooth

23

Principal Components and Curves

24

Parallel Coordinates

• Parallel Coordinates• especially useful for high-dimensional data• depends on ordering and scaling

25

Demo

26

The Grand Tour

• Animated sequence of 2-D projections• https://en.wikipedia.org/wiki/

Grand_Tour_(data_visualisation)• Asimov (1985): The grand tour: a tool for viewing

multidimensional data.• Underlying idea• Randomly generate 2-D projections (random

walk)• Over time generate a dense subset of all

possible 2-D projections• Optional: Follow a given path / guided tour

27

The Grand Tour

28

The Grand Tour

29

Demo

30

31

Thanks a lot!

www.codecentric.deblog.codecentric.de

stefan.kuehn@codecentric.dedatascience@codecentric.de

top related