Principles of Data Visualization

Rodrigo De Luna Lara · 2019. 2. 10.

Contents

The Importance of Data Visualization
Planar and Retinal Variables
    Retinal Variables for Qualitative Data
    Retinal Variables for Quantitative Data
The Importance of Color
    The Color Brewer Schemes
        Sequential Brewer Schemes
        Diverging Brewer Schemes
        Qualitative Brewer Schemes
Best Practices in Visualization
    Avoid 3D Visualizations
    Avoid Pie Charts
    Beware of Misleading Aspect Ratios
    Beware of Spurious Correlations
    Avoid dual-scaled axes
    Declutter your visualizations
    Emphasize what is important
Choosing the correct visualization
Storytelling with Data
    Understand the Context
    Tell a Story
Case Studies
Bibliography


The Importance of Data Visualization

Suppose we have 13 different datasets (Matejka & Fitzmaurice, 2017) and we compute some statistical measures for them, as shown in Table 1 (the r column is the Pearson correlation coefficient between x and y). If you were asked to describe these datasets, you could conclude that they are barely different; perhaps you would consider them measurements of the same process with different amounts of noise.

Table 1: Statistical Measures for 13 Distinct Datasets.

Dataset  Mean x    Mean y    σx        σy        r
1        54.26327  47.83225  16.76514  26.93540  -0.0644719
2        54.26610  47.83472  16.76983  26.93974  -0.0641284
3        54.26873  47.83082  16.76924  26.93573  -0.0685864
4        54.26732  47.83772  16.76001  26.93004  -0.0683434
5        54.26030  47.83983  16.76774  26.93019  -0.0603414
6        54.26144  47.83025  16.76590  26.93988  -0.0617148
7        54.26881  47.83545  16.76670  26.94000  -0.0685042
8        54.26785  47.83590  16.76676  26.93610  -0.0689797
9        54.26588  47.83150  16.76885  26.93861  -0.0686092
10       54.26734  47.83955  16.76896  26.93027  -0.0629611
11       54.26993  47.83699  16.76996  26.93768  -0.0694456
12       54.26692  47.83160  16.77000  26.93790  -0.0665752
13       54.26015  47.83972  16.76996  26.93000  -0.0655833

Having reached your conclusion from the table, you decide out of curiosity to plot the first dataset, and you come up with Figure 1.

Figure 1: Plot for Dataset 1 (axes: x, y)

From the raw data itself, or even the statistical measures in Table 1, you likely wouldn’t have thought the dataset described a dinosaur. Now, considering the statistical measures for the rest of the datasets, you could still conclude that their plots are fairly similar. However, just to confirm you are correct, you decide to plot them as well, coming up with Figure 2.

Figure 2: Plots for Datasets 2-13 (one panel per dataset; axes: x, y)

It is evident that all of them are extremely different datasets, yet each has practically the same statistical measures. This example is a modern take on a demonstration constructed by the statistician Francis Anscombe (1973) to show the importance of data visualization in analysis. Table 2 shows the same descriptive statistics as the previous example for the Anscombe datasets.

Table 2: Statistical Measures for Anscombe Datasets.

Dataset  Mean x  Mean y    σx        σy        r
1        9       7.500909  3.316625  2.031568  0.8164205
2        9       7.500909  3.316625  2.031657  0.8162365
3        9       7.500000  3.316625  2.030424  0.8162867
4        9       7.500909  3.316625  2.030578  0.8165214

We could also fit a simple linear model of the form y = β1x + β0 to the 4 datasets to see if they can be modeled by a common function. The resulting coefficients can be seen in Table 3.

Table 3: Linear Regression Coefficients for Anscombe Datasets.

Dataset  β1         β0
1        0.5000909  3.000091
2        0.5000000  3.000909
3        0.4997273  3.002454
4        0.4999091  3.001727

As in the previous example, at this point we could conclude that the datasets are extremely similar and that we can use the function y = 0.50x + 3.00 for each of them without any issue. But we already know that this approach can be misleading, which is confirmed by the plots in Figure 3.

Figure 3: Plots for Anscombe Datasets (one panel per dataset, 1 to 4; axes: x, y)

Although these 4 datasets share a common best linear model, each of them evidently has very distinct characteristics when visualized. This demonstration also shows the large effect outliers can have on model predictions and statistical properties, which further justifies the need to visualize the data before committing to any conclusions or insights. In the article that originated this demonstration, Anscombe (1973) explained that:

Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations are misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.
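The summary statistics and regression coefficients reported for the Anscombe datasets can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library; the data values are the commonly reproduced figures for Anscombe's quartet and are not listed in this document, so treat them as an assumption, and the helper name `summarize` is purely illustrative.

```python
from math import sqrt

# Anscombe's quartet as commonly reproduced (assumed values, not listed in this document).
# Datasets 1 to 3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample standard deviations, Pearson r and the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": sqrt(sxx / (n - 1)),
        "sd_y": sqrt(syy / (n - 1)),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows are nearly identical, which is exactly why the numbers alone cannot reveal how different the datasets look when plotted.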


Producing visualizations for the purpose of exploring the data in order to gain a better understanding of the interactions between variables is one of two main purposes of data visualization, which we will refer to as exploratory analysis. The second main purpose of data visualization is to communicate a message or convey an insight to a specific audience, which we will refer to as explanatory analysis. Figure 4 shows the progression from exploratory to explanatory analysis.

Figure 4: Exploratory vs Explanatory Analysis

Exploratory analysis is carried out to understand the data, with the purpose of detecting underlying patterns and relationships. Usually, it seeks to answer a specific question or test a hypothesis about the data. During this stage it is critical to understand the origin and characteristics of the data, which helps us decide how to process it in order to obtain clear, specific insights. On the other hand, the purpose of explanatory analysis is to take these insights and find the most effective way of presenting them to a specific audience, in the most concise, simple and clear way possible.

With this in mind, data visualization is most critical for explanatory analysis. Nonetheless, following good practices for data visualization during exploratory analysis can help us obtain insights more easily, and is very important to avoid reaching misleading interpretations. As the examples in this section showed, not visualizing the data correctly can lead to erroneous or biased conclusions about the data.


Planar and Retinal Variables

To make effective visualizations we must be aware of the different types of visual encoding variables that exist. Choosing the correct visual encoding for the data is an essential step in data visualization. There are two types of visual encoding variables: planar and retinal. Planar variables represent points in a coordinate system (usually Cartesian); on their own they can only encode the variables mapped to the axes. A scatter plot of a single group of observations, such as Figure 5, has only planar visual encoding.

Figure 5: Simple Scatter Plot. (Axes: Sepal length (cm), Sepal Width (cm); only the versicolor species is plotted)

Retinal variables are visual properties we use to express the data, such as size, color, shape or texture. The need for the inclusion of these visual properties arises mainly from the necessity of presenting more than one variable in a single visualization. The appropriate use of these variables depends on several factors, mainly the type of data (qualitative or quantitative), the number of variables/factors we want to plot and the medium of the presentation (printed or digital).

Table 4: Recommended Retinal Variables by Data Feature (✓ = recommended, × = not recommended).

                        Color  Shape  Size  Texture
Quantitative data       ✓      ×      ✓     ×
Qualitative data        ✓      ✓      ×     ✓
Data with many levels   ✓      ×      ×     ×
Data with few levels    ✓      ✓      ×     ✓
Printed media           ×      ✓      ✓     ✓
Digital media           ✓      ✓      ✓     ×
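One way to apply Table 4 is to intersect the recommendations for every feature that describes the data and the medium. The sketch below, in Python, transcribes the table into sets under the assumption that each mark in the table means "recommended" or "not recommended"; the names `RECOMMENDED` and `suggest_encodings` are hypothetical and exist only for this illustration.

```python
# Table 4 transcribed as the set of recommended retinal variables per data feature.
RECOMMENDED = {
    "quantitative": {"color", "size"},
    "qualitative": {"color", "shape", "texture"},
    "many levels": {"color"},
    "few levels": {"color", "shape", "texture"},
    "printed": {"shape", "size", "texture"},
    "digital": {"color", "shape", "size"},
}


def suggest_encodings(*features):
    """Return the retinal variables recommended for every one of the given features."""
    candidates = {"color", "shape", "size", "texture"}
    for feature in features:
        candidates &= RECOMMENDED[feature]
    return sorted(candidates)


print(suggest_encodings("qualitative", "few levels", "printed"))  # ['shape', 'texture']
print(suggest_encodings("quantitative", "digital"))               # ['color', 'size']
```

An empty result would signal that no single retinal variable satisfies all the constraints, in which case the constraints themselves (for example, the number of levels shown) are worth revisiting.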

Retinal Variables for Qualitative Data

Let’s start by looking at these variables for qualitative data. Figure 6 shows the sepal width versus the sepal length for all 3 species in the iris dataset. It is possible to differentiate between the 3 species by looking at the markers, but it is not easy to differentiate some of the points. Using varying shapes is recommended when we have few levels or factors in our data, as it becomes harder to differentiate between the markers as more of them are added.

Figure 7 shows the available markers in ggplot. Only markers 21-24 can have a different fill color; the rest of the markers act as symbols rather than geometries.

Figure 6: Visual Encoding by Shape (axes: Sepal length (cm), Sepal Width (cm); legend: Species: setosa, versicolor, virginica)

Figure 7: ggplot Markers (markers numbered 0 to 24)

Next we’ll look at texture. In this case the geometry of the points tends to remain constant, while the texture of its fill changes to reflect a different class. Using textures is not generally recommended, as most types of visualizations make it hard to differentiate between them. However, they can be useful for printed visualizations, as they are appropriate for greyscale color schemes. Figure 8 shows the same data as in Figure 6, but using different textures instead of different shapes.

Note: textures aren’t implemented in ggplot2 by default (nor, generally, in the most popular plotting libraries). However, some different markers can be used to give the same effect.

Figure 8: Visual Encoding by Texture (axes: Sepal length (cm), Sepal Width (cm); legend: Species: setosa, versicolor, virginica)

Finally, Figure 9 shows the same plot but with the same marker in different colors. Of the plots so far, this one differentiates the classes best. The human eye can distinguish about 10 million different colors (by contrast, think about how many different shapes you can differentiate as markers in a plot, or how many easily distinguishable textures can be generated), so color tends to be the natural choice for differentiating classes in visualization.

Figure 9: Visual Encoding by Color (axes: Sepal.Length, Sepal.Width; legend: Species: setosa, versicolor, virginica)

Nonetheless, there are still some drawbacks to using color in visualizations. First and foremost, we must consider that there are people with color vision deficiency, and that some people may be more adept than others at distinguishing between subtle variations in color. More considerations for the proper use of color are covered in The Importance of Color section.

The retinal variables can be combined to allow for an even better visualization. Figure 10 shows the result of combining shape and color. It is even easier to distinguish the differences between species. This is the most common combination of retinal variables, as shape/texture and texture/color are less than ideal combinations.

Figure 10: Visual Encoding by Shape & Color (axes: Sepal.Length, Sepal.Width; legend: Species: setosa, versicolor, virginica)

Retinal Variables for Quantitative Data

For quantitative data we’ll look at the mtcars dataset, which comprises fuel consumption and several aspects of automobile design and performance for 32 models (1973-1974 models). As was discussed previously, a simple scatter plot with planar encoding is enough to represent 2-variable relationships. A simple scatter plot of the horsepower of the engine vs its displacement can be seen in Figure 11.

Figure 11: Scatter Plot for mtcars Dataset (axes: Displacement (cu. in.), Gross horsepower)

The need for using retinal variables with quantitative data usually arises from including a third variable in a 2D visualization. Let’s consider that we want to look at how the gas mileage varies by engine displacement and horsepower. One way would be to make a 3D scatter plot, which isn’t the best option (the reasons why are discussed in the Best Practices in Visualization section).

A better option is to use retinal variables. Figure 12 shows the encoding of this variable in the size of the points. We can also use color to encode the variable, as seen in Figure 13. The choice of color is essential in this case; important considerations are covered in The Importance of Color section. Finally, we can combine both size and color to further emphasize the effect, as seen in Figure 14.
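Encoding a third variable by size amounts to rescaling its values into a range of marker sizes. The following Python sketch assumes a simple linear mapping; the mpg range of 10 to 35 is read from the figure legends, while the marker size range and the function name `rescale` are hypothetical choices made for illustration, not the settings used to produce the figures.

```python
def rescale(value, domain, out_range):
    """Linearly map a value from the data domain onto an output range."""
    lo, hi = domain
    out_lo, out_hi = out_range
    t = (value - lo) / (hi - lo)
    return out_lo + t * (out_hi - out_lo)


MPG_DOMAIN = (10, 35)     # legend range read from the figures
SIZE_RANGE = (1.0, 6.0)   # hypothetical marker sizes, smallest to largest

for mpg in (10, 22.5, 35):
    print(mpg, "->", rescale(mpg, MPG_DOMAIN, SIZE_RANGE))
```

The same function can map the variable onto a position along a color gradient instead of a size, which is how the two encodings can be combined on one plot.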

Figure 12: Visual Encoding by Size (axes: Displacement (cu. in.), Gross horsepower; legend: mpg, 10 to 35)

Figure 13: Visual Encoding by Color (axes: Displacement (cu. in.), Gross horsepower; legend: mpg, 10 to 35)

Figure 14: Visual Encoding by Size & Color (axes: Displacement (cu. in.), Gross horsepower; legend: mpg, 10 to 35)

The Importance of Color

In data visualization, the choice of color is not merely aesthetic: color has a function, and improperly selected colors can distort relationships between values. In general, the use of color in visualizations should follow these guidelines:

• Color is meant to convey meaning: it must be used sparingly and with a specific purpose in mind.

• Color affects how we perceive objects: the relative position/size/shape of an object is affected by color, so perceptive bias should be minimized by using appropriate color schemes.

• Color must direct the attention of the audience: color should be used to emphasize what we’re trying to tell about the data.

The excessive use of color can lead to unpleasant and ineffective visualizations. Consider the plot in Figure 15: the color scheme is eye-straining and makes it difficult to distinguish between values.

Figure 15: Example of Poor Choice of Colors (axes: Displacement (cu. in.), Gross horsepower; legend: mpg, 10 to 35)

Several considerations apply when choosing a color scheme for a visualization, including:

• Image background: most color schemes are designed to be displayed on white backgrounds. The only situation where dark backgrounds could be used is when the image will be viewed in darkness.

• Supporting elements: elements such as grid lines, text on axes, labels or legends should be color-neutral (greyscale).

• Legibility: everything in the visualization must be clearly legible at first glance.

• Color blindness: some members of the audience can have color vision deficiencies, making it harder for them to distinguish between certain colors.

• Consistent colors: when using several plots, the color schemes between them should be consistent.

The Color Brewer Schemes

There are standardized color schemes that are widely used in data visualization. Some of the most popular schemes are the Color Brewer schemes (Brewer, 2017). These schemes were hand-picked and crafted for cartography, although they are widely used in graphics in general. There are 3 types of Color Brewer schemes, depending on the nature of the data:

1. Sequential schemes: suited for data that is ascending in nature; light colors represent low values and dark colors represent high values.

2. Diverging schemes: these schemes put equal emphasis on the extremes of the data range, marked with dark colors, and on the class break in the middle, marked with the lightest color. The class break can represent a critical value in the data such as the mean or median. Different colors mark divergence from the class break in opposite directions.

3. Qualitative schemes: these schemes are designed for classes that don’t imply different magnitudes, and are best used for nominal or categorical data.

On the Color Brewer website, these schemes can be further filtered by colorblind safe, print friendly and photocopy safe. Figure 16, Figure 17 and Figure 18 show the Color Brewer palettes available in ggplot. They can be used for both discrete and continuous color scales.
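The difference between sequential and diverging schemes can be made concrete with a small sketch. The Python code below interpolates linearly between RGB colors; the endpoint colors are arbitrary values chosen for illustration and are not taken from the Color Brewer palettes, and all function names are hypothetical.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB triples; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


# Illustrative endpoint colors (assumptions, not Color Brewer values).
LIGHT = (240, 240, 255)
DARK_BLUE = (0, 0, 128)
DARK_RED = (128, 0, 0)
NEUTRAL = (245, 245, 245)


def sequential_color(value, vmin, vmax):
    """Sequential scale: light for low values, dark for high values."""
    return lerp_color(LIGHT, DARK_BLUE, (value - vmin) / (vmax - vmin))


def diverging_color(value, vmin, midpoint, vmax):
    """Diverging scale: lightest at the class break, darker toward either extreme."""
    if value >= midpoint:
        return lerp_color(NEUTRAL, DARK_RED, (value - midpoint) / (vmax - midpoint))
    return lerp_color(NEUTRAL, DARK_BLUE, (midpoint - value) / (midpoint - vmin))


print(sequential_color(22.5, 10, 35))
print(diverging_color(12, 10, 20, 30))
```

A qualitative scheme needs no interpolation at all: it is simply a fixed list of clearly distinct colors indexed by class, which is why it carries no implied ordering.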

Sequential Brewer Schemes

Figure 16: Sequential Brewer Palettes (colors numbered 1 to 9 in each palette: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd)

Diverging Brewer Schemes

Figure 17: Diverging Brewer Palettes (colors numbered 1 to 10 in each palette: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral)

Qualitative Brewer Schemes

Figure 18: Qualitative Brewer Palettes (8 colors: Accent, Dark2, Set2, Pastel2; 9 colors: Pastel1, Set1; 12 colors: Set3, Paired)

Best Practices in Visualization

This section presents some good and bad practices in data visualization, mainly as a reference of what to avoid when creating visualizations.

Avoid 3D Visualizations

Consider the plot in Figure 19. In comparison with Figure 14, it is harder to read the data: some points are lost in the perspective and the colorbar is not very effective. 3D plots have the issue of perspective; it is easier for people to visualize things in 2D than in 3D. Most of the time there are ways of circumventing 3D plots by encoding with retinal variables.

Figure 19: Example 3D Plot With mtcars Dataset (axes: Displacement (cu. in.), Gross horsepower, mpg)

Even the simplest 3D plots (3D scatter plots) have issues with perception, and are not optimal for static visualizations. 3D plots are more useful in exploratory analysis with interactive visualizations. Additionally, in most plotting libraries, producing a high quality 3D plot requires extensive customization and tinkering.


Avoid Pie Charts

In general, there is always a visualization that is more effective than a pie chart. Pie charts have an extremely unfavorable reputation in the world of data visualization, mostly because it is not easy to interpret and compare data in them. Take a look at Figure 20: can you compare the magnitude of deaths in Males and Females in the month of April? Can you tell which month had the most deaths across both genders?

Figure 20: Pie Charts for mdeaths and fdeaths Datasets (title: Deaths from Lung Diseases in the UK by Month (1974); panels: Males, Females; legend: Month, Jan to Dec)

Data in pie charts is noticeably hard to compare if there are more than 2-3 points. To aid in the visualization, the value or corresponding percentage of each “slice” could be added to the plot. But if you need to label each individual point, then the visualization is inappropriate and ineffective. Walter Hickey (2013), a reporter for Business Insider, states that “pie charts are the Aquaman of data visualization” in his article “The Worst Chart in the World”.

Consider Figure 21 as an example of a better visualization for the same data.

Figure 21: Alternative Visualization (axes: Month, Count; legend: Gender: Male, Female)

Beware of Misleading Aspect Ratios

Axes can be (unprofessionally) manipulated to change the story the data is telling. Let’s look at a real case of manipulation by tampering with the axes. In 2015 National Review tweeted a plot similar to the one in Figure 22.

Figure 22: Plot with Misleading Axes (title: Global Average Temperature by Year; axes: Year, 1880 to 2012, and Temperature (°F), 0 to 110)

It is a fact that a 1° increase in global temperature can have a huge impact on the global climate. A more appropriate visualization can be seen in Figure 23.

Figure 23: Plot with More Appropriate Axes (title: Global Average Temperature by Year; axes: Year, 1880 to 2012, and Temperature (°F), 56.0 to 59.0)

Manipulating axes to try to change the audience’s interpretation of the data is not only inappropriate, but also very unprofessional. Any effect can be maximized or minimized by disproportionately zooming in or out, and if the audience is not familiar with the data, their interpretation can be biased by doing this.
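The effect of the axis range can be quantified as the share of the axis that a given change occupies. The Python sketch below uses the two temperature axis ranges from the figures in this section; the size of the change (1.5 °F) is a hypothetical value picked only for illustration, and `visual_fraction` is an illustrative helper name.

```python
def visual_fraction(change, axis_min, axis_max):
    """Share of the axis length that a change of the given size occupies."""
    return change / (axis_max - axis_min)


change = 1.5  # hypothetical temperature change in °F, for illustration only

wide = visual_fraction(change, 0, 110)     # axis range of the misleading plot
narrow = visual_fraction(change, 56, 59)   # axis range of the more appropriate plot

print(f"0-110 axis: {wide:.1%} of the axis")    # 0-110 axis: 1.4% of the axis
print(f"56-59 axis: {narrow:.1%} of the axis")  # 56-59 axis: 50.0% of the axis
```

The same change fills about 1% of one plot and half of the other, so the choice of range should be driven by what magnitude of change is meaningful for the data.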


Beware of Spurious Correlations

Remember that correlation does not imply causation. The use of spurious correlations can range from dangerous to absurd. In engineering applications, assuming that a correlation implies causation can result in incorrect models, predictions or recommendations to customers. In more everyday situations, clearly unrelated phenomena can show high correlation, to comical effect.

Consider the plots in Figure 24: one shows a declining trend in US aviation accidents (U.S. General Services Administration, 2015), the other shows a rising trend in US consumption of ice cream (U.S. General Services Administration, 2017). Undoubtedly, the two are completely unrelated, but if we plot one against the other (Figure 25) we can see that the correlation is significant. One could (erroneously) conclude that greater consumption of ice cream is the cause of fewer aviation accidents in the US.

Figure 24: Aviation Deaths and Ice Cream Consumption in the US by year. (Axes: Year, 1992 to 2013; Thousand Short Tons of Ice Cream; Aviation Accidents)

Figure 25: Correlation Between Aviation Deaths and Ice Cream Consumption in the US. (Axes: Thousand Short Tons of Ice Cream, Aviation Accidents)

See http://tylervigen.com/spurious-correlations for more examples of ridiculous correlations (Vigen, n.d.).
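Any two series that simply trend over time will correlate strongly, whether or not they are related. The Python sketch below shows this with two synthetic series, one falling and one rising; the numbers are invented for illustration and are not the aviation or ice cream data behind the figures in this section.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Synthetic yearly series, invented for illustration only.
falling = [2100, 2010, 1950, 1830, 1760, 1650, 1590, 1480, 1370, 1300]
rising = [4100, 4300, 4700, 4900, 5400, 5600, 5900, 6300, 6600, 6900]

r = pearson(falling, rising)
print(round(r, 3))  # strongly negative, close to -1
```

The coefficient says only that the two series move together over the same period; the shared driver here is time itself, not any causal link between them.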


Avoid dual-scaled axes

Stephen Few (2008), a data visualization specialist, presents the following guidelines for using dual-scaled axes:

• Graphs should only include a dual-scaled axis when needed to compare datasets with different units of measure (and even then it is not encouraged).

• Magnitude comparisons between values with different units of measure and scales are not appropriate; for this reason nothing but lines should be used in graphs with dual-scaled axes.

• Given that only the slopes of the lines are meaningful in dual-scaled axes, it is inappropriate to use a dual-scaled axis in a graph that doesn’t display values along an interval scale (time).

• Using dual-scaled axes to show more than one quantitative scale encourages people to compare the magnitude of the values between them, which is meaningless.

Consider Figure 26 as an example of the common issues with dual axes on graphs. Attention is drawn towards the intersections between both plots, which have no real significance. While it is possible to infer some relationship from the plot, a more suitable visualization would look like Figure 24.

Figure 26: Example of Misleading Dual Axis Plot. (Title: Road Casualties in Great Britain; axes: Year, 1969 to 1984; Drivers Killed; Petrol Price)

The consensus is that perhaps the only acceptable use of dual-scaled axes is to display a rescaling of a single variable, as shown in Figure 27.

Figure 27: Example of Acceptable Use of Dual Axes. (Title: Castor canadensis Body Temperature; axes: Datetime; Temperature (°C), 36.2 to 37.6; Temperature (°F), 97.16 to 99.68)
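A rescaled secondary axis is safe because every tick on it is a fixed function of the matching tick on the primary axis. As a minimal Python sketch, the standard Celsius to Fahrenheit conversion reproduces the secondary tick labels of Figure 27 from the primary ones; the function name is illustrative.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9 / 5 + 32."""
    return celsius * 9 / 5 + 32


celsius_ticks = [36.2, 36.4, 36.6, 36.8, 37.0, 37.2, 37.4, 37.6]
fahrenheit_ticks = [round(celsius_to_fahrenheit(c), 2) for c in celsius_ticks]

print(fahrenheit_ticks)
# [97.16, 97.52, 97.88, 98.24, 98.6, 98.96, 99.32, 99.68]
```

Because both axes describe the same variable, there is only one line on the plot and no intersections or relative magnitudes to misread.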


Declutter your visualizations

In visualizations, less is more. The more elements in a visualization, the harder it is to direct the attention of the audience to what we want to emphasize. Cluttered graphs are harder to interpret. Let’s start by looking at the price of diamonds by carat, color and cut, as plotted in Figure 28. It is not easy to draw an insight from this plot, given how many elements it has.

Figure 28: Diamond Price by Carat, Cut and Color (axes: Carat, Price (USD); legends: Cut: Fair, Good, Very Good, Premium, Ideal; Color: D, E, F, G, H, I, J)

To remove clutter we can remove levels in the color and cut, focusing only on the extremes and class break. The resulting plot can be seen in Figure 29.

Figure 29: Diamond Price by Carat, Selected Color and Cut (axes: Carat, Price (USD); legends: Color: D, G, J; Cut: Fair, Very Good, Ideal)
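The selection of levels used for Figure 29 (the two extremes plus the class break in the middle) can be sketched in Python as follows. The level orderings are taken from the legend of Figure 28; the helper name `extremes_and_middle` is hypothetical, and for an even number of levels this sketch arbitrarily picks the upper of the two middle levels.

```python
def extremes_and_middle(levels):
    """Keep the first, middle and last level of an ordered list of levels."""
    if len(levels) < 3:
        return list(levels)
    return [levels[0], levels[len(levels) // 2], levels[-1]]


color_levels = ["D", "E", "F", "G", "H", "I", "J"]
cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

print(extremes_and_middle(color_levels))  # ['D', 'G', 'J']
print(extremes_and_middle(cut_levels))    # ['Fair', 'Very Good', 'Ideal']
```

Filtering the data to these levels before plotting is what reduces the number of color and shape combinations the reader has to decode.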

Figure 30, Figure 31 and Figure 32 show sequential decluttering to focus on what is important from the data in a simple visualization.

20

Page 21: Principles of Data Visualization - Rodrigo De Luna Lara · 2019. 2. 10. · Principles of Data Visualization Rodrigo De Luna Lara Contents TheImportanceofDataVisualization 2 PlanarandRetinalVariables

[Figure: one panel per clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). x-axis: Carat; y-axis: Price (USD); legends: Color (D, G, J) and Cut (Fair, Very Good, Ideal).]

Figure 30: Diamond Price by Carat, Clarity, Selected Color and Cut


[Figure: grid of panels with one column per cut (Fair, Very Good, Ideal) and one row per clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). x-axis: Carat; y-axis: Price (USD); legend: Color (D, G, J).]

Figure 31: Diamond Price by Carat, Clarity, Selected Color and Cut, with Regression Lines


[Figure: grid of panels with one column per cut (Fair, Very Good, Ideal) and one row per clarity (I1, VS2, IF). x-axis: Carat; y-axis: Price (USD); legend: Color (D, G, J).]

Figure 32: Diamond Price by Carat, Selected Color, Cut and Clarity, with Regression Lines

This simpler visualization allows us to quickly deduce the following insights:

• Larger diamonds (by carat) tend to have lower quality cut and color. The diamonds with ideal cut tend to have a lower range of carats.
• The diamonds’ color changes the rate at which they become more expensive with increasing carat.
• For some clarities and cuts, the price difference between premium color diamonds (D) and good color diamonds (G) is not very significant.
• There are practically no poor cut diamonds with premium clarity and vice versa.
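The second insight compares the slopes of the regression lines for each color. A minimal sketch of that comparison, assuming an ordinary least-squares fit of price on carat within each color group, is shown below. The original plots may use a different fitting method. The function names are hypothetical and the rows are made up for illustration.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x


def slopes_by_group(rows, group_key="color"):
    """Price increase per carat, fitted separately for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append((row["carat"], row["price"]))
    return {g: fit_line(*zip(*points))[0] for g, points in groups.items()}


# Hypothetical rows, for illustration only (not real records)
rows = [
    {"color": "D", "carat": 0.5, "price": 2000},
    {"color": "D", "carat": 1.0, "price": 6000},
    {"color": "J", "carat": 0.5, "price": 1000},
    {"color": "J", "carat": 1.0, "price": 3000},
]
slopes = slopes_by_group(rows)
```

A steeper slope for one color than for another is what "the color changes the rate" means in the list above.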

Emphasize what is important

Emphasizing what is important is better suited to explanatory analysis, when we want to convey specific information in a visualization. So far, all of the plots seen as examples have been exploratory. In this section, we will look at an example of explanatory analysis while focusing on the importance of emphasis in visualization.

Suppose you’re given exchange rate data versus the US Dollar for some currencies (shown in Table 5) for the year 2016 (Myfxbook Ltd, 2017), and are given the general task of analyzing interesting effects on the Mexican peso. The dataset contains how the exchange rate of each currency versus the US Dollar changes after each closing of the markets (in percentage).

Table 5: Currencies in the Dataset

Acronym   Currency
BRL       Brazilian Real
CAD       Canadian Dollar
CNY       Chinese Yuan
EUR       Euro
GBP       British Pound
MXN       Mexican Peso
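The percentage change after each market close can be computed from a series of closing rates. The sketch below assumes the usual definition, the difference from the previous close divided by the previous close. The source dataset does not state its exact formula, so treat this as an assumption. The function name and the sample rates are hypothetical.

```python
def pct_change_vs_previous_close(closes):
    """Percentage change of each close relative to the one before it."""
    return [
        100.0 * (current - previous) / previous
        for previous, current in zip(closes, closes[1:])
    ]


# Hypothetical closing rates (units of currency per US Dollar)
closes = [20.0, 20.5, 20.5]
changes = pct_change_vs_previous_close(closes)
```

The result has one fewer element than the input, because the first close has no previous close to compare against.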


You begin by plotting the data for the whole year (Figure 33) and notice unusual behavior around November: there is a very large increase in the exchange rate of both the Mexican Peso and the Brazilian Real. You then decide to zoom in on that month, as shown in Figure 34. The Latin American currencies in the dataset (MXN and BRL) both show a sharp increase between November 8 and November 10.

[Figure: x-axis: Date (01/06/16 to 01/01/17); y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, CNY, EUR, GBP, MXN).]

Figure 33: Change in Exchange Rates for 2016.

[Figure: x-axis: Date (01/11/16 to 15/11/16); y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, CNY, EUR, GBP, MXN).]

Figure 34: Change in Exchange Rates for November 2016

The US general election was held on November 8, and between November 8 and 10 some markets, especially in Latin America, reacted negatively to the preliminary results and to Trump becoming president-elect. To restrict the analysis, the Chinese Yuan and British Pound are dropped. The Canadian Dollar is useful to see the reaction in the rest of North America to the election, the Brazilian Real reflects the reaction in South America, and the Euro the reaction in Europe. The analysis is thus restricted to the trend in Western countries. The plot of the resulting dataset can be seen in Figure 35.


[Figure: x-axis: Date (01/11/16 to 15/11/16); y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, EUR, MXN).]

Figure 35:
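Restricting the analysis amounts to two filters: a date window and a set of currencies. The following is a minimal sketch, assuming each record is a dictionary with `date`, `currency` and `pct_change` keys. That layout is hypothetical, as are the sample values. The window matches the date range shown on the x-axis of Figure 34.

```python
from datetime import date

KEEP_CURRENCIES = {"BRL", "CAD", "EUR", "MXN"}
WINDOW_START, WINDOW_END = date(2016, 11, 1), date(2016, 11, 15)


def restrict(records):
    """Keep records inside the date window for the selected currencies."""
    return [
        r
        for r in records
        if r["currency"] in KEEP_CURRENCIES
        and WINDOW_START <= r["date"] <= WINDOW_END
    ]


# Hypothetical records, for illustration only
records = [
    {"date": date(2016, 11, 9), "currency": "MXN", "pct_change": 7.0},
    {"date": date(2016, 11, 9), "currency": "GBP", "pct_change": -0.5},
    {"date": date(2016, 10, 3), "currency": "BRL", "pct_change": 0.2},
]
restricted = restrict(records)
```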

So far we’ve been narrowing down the information only to restrict the analysis, and we now have the data we need to show how Trump’s election had an immediate effect on Latin American countries from the moment the polls favored him as president-elect. We now need to modify the plot to emphasize this conclusion and focus it on the Mexican peso.

There are several ways of emphasizing content:

Remove unnecessary elements from the plot:

Try to remove as much clutter and as many unnecessary elements as possible from the visualization. For this example, these were the actions taken:

• The x and y grid lines were set to blank.
• The top and right borders of the plot were set to blank.
• The frequency of the tick marks on the x-axis was reduced.
• The formatting of the tick labels on the x-axis was changed to avoid having them at an angle.
• The year in the tick labels was dropped as it gives unnecessary information.
• The label on the y-axis was simplified with the inclusion of a title and subtitle.
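Several of the actions listed above can be sketched in Python. This is an illustration, not the code behind the original figure: the function names are hypothetical, and `declutter_axes` assumes it receives a matplotlib Axes whose x-axis already holds dates.

```python
from datetime import date, timedelta


def thinned_tick_labels(start, end, every=2):
    """Return (date, label) pairs every `every` days, with the year dropped."""
    days = (end - start).days
    ticks = [start + timedelta(days=d) for d in range(0, days + 1, every)]
    return [(t, t.strftime("%b %d")) for t in ticks]


def declutter_axes(ax, ticks):
    """Apply the removals to a matplotlib Axes object."""
    ax.grid(False)                         # no x or y grid lines
    ax.spines["top"].set_visible(False)    # no top border
    ax.spines["right"].set_visible(False)  # no right border
    ax.set_xticks([t for t, _ in ticks])   # fewer tick marks
    # Short labels fit horizontally, so no rotation is needed
    ax.set_xticklabels([label for _, label in ticks], rotation=0)


ticks = thinned_tick_labels(date(2016, 11, 1), date(2016, 11, 15))
```

With the default English month names, the labels run from "Nov 01" to "Nov 15" in steps of two days, as on the x-axis of Figure 36.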

Focus on increasing the whitespace in the plot:

The color white is your friend when trying to create impactful visualizations. Removing the unnecessary elements listed above resulted in more whitespace in the plot. More whitespace helps the audience focus their attention on what we want them to see.

Use color to focus attention on what you want the audience to see first

The Color Brewer schemes are excellent in exploratory analysis. In explanatory analysis, you should restrict the palette to one or two colors and use greyscale for non-principal parts of the visualization. In this example these actions were taken:

• The dates corresponding to the elections were highlighted with a grey background.
• The line plot corresponding to the Mexican Peso was highlighted with a blue color.
• The series for the rest of the currencies were set to greyscale colors.
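The color assignment described above can be sketched as follows. The specific color values, line widths and function names are hypothetical choices for illustration. `plot_with_emphasis` assumes it receives a matplotlib Axes.

```python
FOCUS_COLOR = "tab:blue"          # hypothetical choice of highlight color
GREYS = ["0.35", "0.55", "0.75"]  # matplotlib reads these strings as grey levels


def emphasis_colors(series_names, focus):
    """Map the focus series to the highlight color and all others to greys."""
    if focus not in series_names:
        raise ValueError(f"unknown focus series: {focus}")
    others = [name for name in series_names if name != focus]
    colors = {name: GREYS[i % len(GREYS)] for i, name in enumerate(others)}
    colors[focus] = FOCUS_COLOR
    return colors


def plot_with_emphasis(ax, dates, series, focus, highlight):
    """Draw every series, with a shaded date range and one highlighted line."""
    colors = emphasis_colors(list(series), focus)
    # Grey background behind the dates we want the audience to notice
    ax.axvspan(highlight[0], highlight[1], color="0.9")
    for name, values in series.items():
        is_focus = name == focus
        ax.plot(
            dates,
            values,
            color=colors[name],
            linewidth=2.5 if is_focus else 1.0,
            zorder=3 if is_focus else 2,  # keep the focus line on top
        )


colors = emphasis_colors(["BRL", "CAD", "EUR", "MXN"], "MXN")
```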


Use annotations to reinforce the point of the visualization.

The inclusion of brief, concise texts in the form of takeaways helps you reinforce your point and allows the visualization to be more independent. In this example only a couple of annotations were added:

• The legend was completely replaced by labeling each individual series.
• The take-away of the visualization is also included in the top right corner of the plot with the same font color as the highlights.
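Both annotations can be sketched in Python. Placing each label at the last point of its series is an assumption made here for simplicity, since the original figure may position its labels differently. The function names and offsets are hypothetical, and `annotate_plot` assumes it receives a matplotlib Axes.

```python
def end_of_line_labels(dates, series):
    """Position for a direct label at the last point of each series."""
    return {name: (dates[-1], values[-1]) for name, values in series.items()}


def annotate_plot(ax, dates, series, takeaway, takeaway_color="tab:blue"):
    """Label each series directly and add the takeaway text."""
    for name, (x, y) in end_of_line_labels(dates, series).items():
        # Direct labels replace the legend; offset slightly to the right
        ax.annotate(
            name,
            xy=(x, y),
            xytext=(4, 0),
            textcoords="offset points",
            va="center",
        )
    # Takeaway in the top right corner, positioned in axes-fraction coordinates
    ax.text(
        0.98,
        0.98,
        takeaway,
        transform=ax.transAxes,
        ha="right",
        va="top",
        color=takeaway_color,
    )


# Hypothetical values, for illustration only
positions = end_of_line_labels([1, 2, 3], {"MXN": [0.1, 0.2, 7.5]})
```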

The resulting visualization can be seen in Figure 36. This plot takes into consideration all of the best practices for data visualization seen so far.

[Figure: title: Change in exchange rate vs USD (2016); subtitle: With respect to previous close. x-axis: Nov 01 to Nov 15; y-axis: Percent Change (−3 to 8); shaded region labeled "US Elections"; series labeled directly: BRL, CAD, EUR, MXN. Annotation: "LATAM markets suffered a sharp drop as the election favored Trump, reflecting the market's uncertainty over his presidency."]

Figure 36: Correct Use of Emphasis for Visualization


Choosing the correct visualization

Choosing the best visualization for a given dataset can be complex. Information can be presented in diverse ways, and there is no definitive guideline on how to choose the best one. One must become familiar with the capabilities of a plotting library to better decide what type of visualization to use. One good resource for selecting visualizations is The Dataviz Catalogue (Ribecca, 2017), which allows selecting an appropriate visualization with a simple-to-use wizard (Figure 37).

Figure 37: Dataviz Catalogue Search Interface


Storytelling with Data

Understand the Context

Tell a Story

Case Studies


Bibliography

Anscombe, F. (1973). Graphs in statistical analysis. The American Statistician, 27 (1), 17–21.

Brewer, C. (2017). Color brewer. Retrieved August 24, 2017, from http://colorbrewer2.org/

Few, S. (2008). Dual-scaled axes in graphs, are they ever the best solution. Retrieved August 25, 2017, from http://www.perceptualedge.com/articles/visual_business_intelligence/dual-scaled_axes.pdf

Hickey, W. (2013). The worst chart in the world. Retrieved August 24, 2017, from http://www.businessinsider.com/pie-charts-are-the-worst-2013-6

Matejka, J., & Fitzmaurice, G. (2017). Datasaurus dozen. Retrieved from https://www.autodeskresearch.com/sites/default/files/The%20Datasaurus%20Dozen.zip

Myfxbook Ltd. (2017). Forex currencies. Retrieved August 26, 2017, from http://www.myfxbook.com/forex-market/currencies

National Review. (2015). The only #climatechange chart you need to see. Retrieved August 24, 2017, from https://twitter.com/nro/status/676516015078039556

Reynolds, P. (1994). Case studies in biometry. John Wiley & Sons.

Ribecca, S. (2017). The data visualization catalogue. Retrieved August 26, 2017, from http://www.datavizcatalogue.com/search.html

U.S. General Services Administration. (2015). Accidents, fatalities, and rates, 1995 through 2014, U.S. general aviation. Retrieved August 25, 2017, from https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-u-s-general-aviation

U.S. General Services Administration. (2017). Sweetener market data historical deliveries by use - ice cream. Retrieved August 25, 2017, from https://catalog.data.gov/dataset/sweetener-market-data-historical-deliveries-by-use-ice-cream/

Vigen, T. (n.d.). Spurious correlations. Retrieved August 25, 2017, from http://tylervigen.com/spurious-correlations
