data exploration and visualization - big data | big data · visualization visualization is the...

83
Data Exploration and Visualization Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no’lu İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi dahilinde gerçekleştirilmiştir. İçerik ile ilgili tek sorumluluk Bahçeşehir Üniversitesi’ne ait olup İSTKA veya Kalkınma Bakanlığı’nın görüşlerini yansıtmamaktadır.

Upload: others

Post on 28-May-2020

34 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Data Exploration and Visualization

Bu eğitim sunumları İstanbul Kalkınma Ajansı’nın 2016 yılı Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı kapsamında yürütülmekte olan TR10/16/YNY/0036 no’lu İstanbul Big Data Eğitim ve Araştırma Merkezi Projesi dahilinde gerçekleştirilmiştir. İçerik ile ilgili tek sorumluluk Bahçeşehir Üniversitesi’ne ait olup İSTKA veya Kalkınma Bakanlığı’nın görüşlerini yansıtmamaktadır.

Page 2: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Visualization

Visualization is the conversion of data into a visual or tabular format.

Visualization helps understand the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

– Humans have a well developed ability to analyze large amounts of information that is presented visually

– Can detect general patterns and trends

– Can detect outliers and unusual patterns

Page 3: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Examples of Business Questions

• Simple (descriptive) Stats • “Who are the most profitable customers?”

• Hypothesis Testing • “Is there a difference in value to the company of these

customers?”

• Segmentation/Classification • What are the common characteristics of these

customers?

• Prediction • Will this new customer become a profitable

customer? If so, how profitable?

adapted from Provost and Fawcett, “Data Science for Business”

Page 4: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Ask an interesting

question.

Get the data.

Explore the data.

Model the data.

Communicate and

visualize the results.

What is the scientific goal?

What would you do if you had all the data?

What do you want to predict or estimate?

How were the data sampled?

Which data are relevant?

Are there privacy issues?

Plot the data.

Are there anomalies?

Are there patterns?

Build a model.

Fit the model.

Validate the model.

What did we learn?

Do the results make sense?

Can we tell a story?

Page 5: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Data Exploration

Not always sure what we are looking for (until we find it)

Page 6: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

“The greatest value of a

picture is when it forces us

to notice what we never

expected to see.”

John Tukey

Exploratory Data Analysis

Page 7: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

What is Data?

● Collection of data objects and their attributes

● An attribute is a property or characteristic of an object – Examples: eye color of a

person, temperature, etc.

– Attribute is also known as

variable, field, characteristic,

or feature

● A collection of

attributes describe an

object – Object is also known as

record, point, case, sample,

entity, or instance

Tid Refund Marital Status

Taxable Income

Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes 10

Attributes

Objects

Page 8: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Type of variables (attributes)

● Descriptive (categorical) variables

– Nominal variables (no order between values):

gender, eye color, race group, ...

– Ordinal variables (inherent order among

values): response to treatment: none, slow,

moderate, fast

● Measurement variables

– Continuous measurement variable: height,

weight, blood pressure ...

– Discrete measurement variable (values are

integers): number of siblings, the number of times

a person has been admitted to a hospital ...

Page 9: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Properties of Attribute Values

● The type of an attribute depends on which of the following properties it possesses:

– Distinctness:

– Order:

– Addition:

– Multiplication:

=

< >

+ -

* /

– Nominal attribute: distinctness

– Ordinal attribute: distinctness & order

– Interval attribute: distinctness, order & addition

– Ratio attribute: all 4 properties

Page 10: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

The Trouble with Summary Stats

Page 11: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Looking at Data

Page 12: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Visualization in R

plot() is the main graphing function

Automatically produces simple plots for vectors,

functions or data frames

Many useful customization options…

Page 13: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

13

Chart types

• Single variable

• Dot plot

• Box-and-whisker plot

• Histogram

• Jitter plot

• Error bar plot

• Cumulative distribution function

Page 14: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

14

Chart types

• Two variables • Bar chart

• Scatter plot

• Line plot

• Log-log plot

• More than two variables • Stacked plots

• Parallel coordinate plot

Page 15: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Sample Data

Page 16: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Importing Data

How do we get data into R?

First make sure your data is in an easy to read

format such as CSV (Comma Separated Values).

Use code: > dataset<-

read.table("C:/bodyfat.csv",sep=",",header=TRUE)

> summary(dataset)

Page 17: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Plotting a Vector

plot(v) will print the elements of the vector v

according to their index

# Plot height for each observation

> plot(dataset$Height)

# Plot values against their ranks

> plot(sort(dataset$Height))

Page 18: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Parameters for plot()

Specifying labels:

– main – provides a title

– xlab – label for the x axis

– ylab – label for the y axis

Specifying range limits

– ylim – 2-element vector

gives range for x axis

– xlim – 2-element vector gives range for y axis

Page 19: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Plotting a Vector

Page 20: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Visualization Techniques: Histograms

Histogram

– Usually shows the distribution of values of a single variable

– Divide the values into bins and show a bar plot of the number of

objects in each bin.

– The height of each bar indicates the number of objects

– Shape of histogram depends on the number of bins

Example: Petal Width (10 and 20 bins, respectively)

Page 21: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Histogram

Page 22: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the
Page 23: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Histogram

Page 24: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Cause of death Frequency

Accidents 6,688

Homicide 2,093

Suicide 1,615

Malignant tumor 745

Heart disease 463

Congenital abnormalities 222

Chronic respiratory disease 107

Influenza and pneumonia 73

Cerebrovascular diseases 67

Other tumor 52

All other causes 1,653

Frequency table showing the ten most common causes

of death in Americans between 15 and 19 years of age

in 1999. The total number of deaths is n = 13,778.

Bar graph

Bar graph

Page 25: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Bar graphs and frequencies

No Yes

Frequency Bar Graph of Type F

req

uen

cy

0

20

40

60

80

100

120

Type

> type.freq <- table(Pima.tr$type)

> barplot(type.freq, xlab = "Type", ylab = "Frequency", main = "Frequency Bar Graph of Type")

Page 26: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Adding a Label Inside a Plot

> hist(dataset$Weight, xlab = "Weight", main = "Who will develop obesity?",

col = "blue")

> rect(90, 0, 120, 1000, border = "red", lwd = 4)

> text(105, 1100, "At Risk", col = "red", cex = 1.25)

Page 27: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Pie chart

2

3

• We can use a pie chart to visualize the relative frequencies

of different categories for a categorical variable.

• In a pie chart, the area of a circle is divided into sectors, each

representing one of the possible categories of the variable.

• The area of each sector c is proportional to its frequency.

Pie Chart of Races

1

slices <- c(11, 4, 6)

lbls <- c("1", "2", "3",)

pie(slices, labels = lbls, main="Pie Chart of Races")

Page 28: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Box Plots

Q3

maximum

IQR

minimum

median

0.0

33.3

66.7

100.0

AGE

Variables

Years

Q1

Page 29: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Example of Box Plots

Box plots can be used to compare attributes

Page 30: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Plotting Two Vectors

Page 31: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Tattersall et al. (2004) Journal of Experimental Biology 207:579-585

Scatter plots

plot(dataset$temp, dataset$mealsize,

xlab=‘Meal size (% of snake’ biomass)',ylab=‘Change in snake temperature')

Page 32: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Line Plots

plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',

xlab='Time',ylab='Price $'))

Page 33: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Adding a Legend

legend(60,45,c('Intel','Dell'),lty=c(1,2))

Page 34: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Association between reproductive

effort and avian malaria

Table 2.3A. Contingency table showing incidence of malaria in female great tits

subjected to experimental egg removal.

control group egg removal group row total

malaria

no malaria

7 15

28 15

22

43

column total 35 30 65

Mosaic plot

Page 35: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Mosaic plot

>library(vcd)

>mosaic(HairEyeColor, shade=TRUE, legend=FALSE)

Page 36: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Plotting Contents of a Dataset as Matrix

>plot(dataset[c(5,6,7)])

Page 37: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Effective Visualizations

1. Have graphical integrity

2. Keep it simple

3. Use the right display

4. Use color strategically

5. Tell a story with data

Page 38: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Graphical Integrity

Page 39: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Graphical Integrity

Flowing Data

Page 40: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Scale Distortions

Flowing Data

Page 41: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the
Page 42: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Scale Distortions

Page 43: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Keep It Simple

Page 44: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Edward Tufte

Page 45: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Maximize Data-Ink Ratio

0-

$24,99

9

$25,0

00+

0-

$24,99

9

$25,0

00+

Data ink

Total ink used in graphic Data-Ink Ratio =

Page 46: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Maximize Data-Ink Ratio

0

175

350

525

700

Males Females

0-

$24,99

9

$25,0

00+

0-

$24,99

9

$25,0

00+

Data ink

Total ink used in graphic Data-Ink Ratio =

Page 47: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Why 3D pie charts are bad

Kevin Fox

Page 48: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

ongoing,Tim Brey

Avoid Chartjunk Extraneous visual elements that distract from the

message

Page 49: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Avoid Chartjunk

ongoing,Tim Brey

Page 50: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Avoid Chartjunk

ongoing,Tim Brey

Page 51: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Don’t!

matplotlib

gallery

Excel Charts

Blog

Page 52: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Use The Right Display

Page 53: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

http://extremepresentation.typepad.com/blog/files/choosing_a_good_chart.pdf

Page 54: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Comparisons

Page 55: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Bar Chart How Much Does Beer Consumption Vary by Country?

Bottles per

person per week

Page 56: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Bars vs. Lines

Zacks

1999

Page 57: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Trends

Page 58: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Yahoo! Finance

Page 59: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Proportions

Page 60: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Pie Charts

Page 61: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

eagerpies.com

Page 62: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Stacked Bar Chart

S. Few

Page 63: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Don’t!

Page 64: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Correlations

Page 65: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Scatterplots

http://xkcd.com/388/

Page 66: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Don’t!

matplot3d tutorial

Page 67: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Perceptual Effectiveness

Page 68: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

How much longer?

A

B

Page 69: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

How much steeper slope?

B A

Page 70: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

How much larger area?

A B

Page 71: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

How much darker?

A B

Page 72: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

How much bigger value?

A

2 16

B

Page 73: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Most Efficient

Least Efficient

Quantitative

Ordered

Categories

C. Mulbrandon

VisualizingEconomics.com

Page 74: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Most Effective

VisualizingEconomics.com

Page 75: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Less Effective

VisualizingEconomics.com

Page 76: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Pie vs. Bar Charts

Page 77: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Cliff Mass

Least Effective

Page 78: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Use Color Strategically

Page 79: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Color Discriminability

Sinha 2007

Page 80: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Colors for Categories

Do not use more than 5-8 colors at once

Ware,“Information Visualization”

Page 81: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Colors for Ordinal Data

Zeilis et al, 2009,“Escaping RGBland: Selecting

Colors for Statistical Graphics”

Vary luminance and saturation

Page 82: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Avoid Rainbow Colors!

matplotlib

gallery Perceptually nonlinear

Page 83: Data Exploration and Visualization - Big Data | Big Data · Visualization Visualization is the conversion of data into a visual or tabular format. Visualization helps understand the

Color Blindness

Deuteranope Protanope

Based on slide from Stone

Red / green deficiencies

Tritanope

Blue / Yellow deficiency