overheads visualizing data 2012


  • 7/31/2019 Overheads Visualizing Data 2012

    Statistics for Engineering

    Section 1: Visualizing data

    Kevin Dunn

    Copyright, and all rights reserved, Kevin Dunn, 2012

    http://stats4eng.connectmv.com

    2012


    Plot your data


    Usage examples

    Co-worker: Here are the yields from a batch system for the last 3 years (1256 data points). Can you help me:
      - understand more about the time trends in the past 3 years?
      - efficiently summarize the yield from all batches run in 2010?

    Manager: effectively summarize the (a) number and (b) types of defects on 17 aluminum grades for the past 12 months

    Yourself: 24 different measurements vs time (5 readings per minute, over 300 minutes) for each batch we produce; how can we visualize these 36,000 data points?


    References

    1. Edward Tufte, Envisioning Information, Graphics Press, 1990 (10th printing in 2005).

    2. Edward Tufte, The Visual Display of Quantitative Information, Graphics Press, 2001.

    3. Edward Tufte, Visual Explanations: Images and Quantities, Evidence and Narrative, 2nd edition, Graphics Press, 1997.

    4. William Cleveland, Visualizing Data, and The Elements of Graphing Data, Hobart Press; 2nd edition, 1994.

    5. Stephen Few, Show Me the Numbers, and Now You See It, Analytics Press.

    6. Su, "It's easy to produce chartjunk using Microsoft Excel 2007 but hard to make good graphs", Computational Statistics and Data Analysis, 52 (10), 4594-4601, 2008, http://dx.doi.org/10.1016/j.csda.2008.03.007


    Background

    This class might seem too easy, too obvious. It is!

    The human eye and brain are excellent at pattern recognition, sorting through signal and noise.

    We can easily cope with bad plots; but good plots save time and show a clearer, more honest picture.

    Cliches: "Let the data speak for themselves", "Plot the data"

    We will look at: how


    Time-series plots

    It is a 2-dimensional plot:
      - (usually) horizontal x-axis: time or sequence order
      - other axis: the data values

    Univariate plot

    Our eyes can deal with high data density:
      - sinusoids
      - spikes
      - outliers
      - separate noise from signal


    Time-series plots

    Good, automated labelling is important. Here's an example of bad labelling (and bad axis scaling and colour choices)


    Time-series plots

    Multiple lines (trajectories): should not cross and jumble

    Colours and markers help only slightly


    Time-series plots

    Rather use separate, parallel axes, and minimal ink

    These non-default settings can take a long time to set (10 minutes for this example)


    Time-series plots

    Sparklines

    Read the website link (in the notes)

    Used for financial trends (example)

    Built into Excel 2010

    Good for iPods, cell phones, tablet computers: high density, small size.
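To make the "high density, small size" idea concrete, here is a minimal sketch of a text-only sparkline in Python. It is an illustration only, not the Excel feature mentioned above: the function name `sparkline` and the yield values are hypothetical, and the eight Unicode block characters stand in for a drawn line.

```python
# Minimal text sparkline: map each value to one of eight block characters.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Return a one-line text sparkline for a sequence of numbers."""
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no trend to show; draw it at the lowest level.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    # Scale each value to an index 0..7, rounding to the nearest level.
    return "".join(BLOCKS[int((v - lo) * top / span + 0.5)] for v in values)


# Hypothetical batch yields (%), one per batch, in time order.
yields = [71, 74, 73, 78, 80, 79, 85, 83, 88, 90]
print(sparkline(yields))
```

One character per data point is what lets such a trend sit inline with text or in a table cell.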


    Time-series plots

    Example of sparklines in everyday use:

    Figure from Wikipedia


    Time-series plots

    Further tips

    Keep the x-axis spacing constant: helps interpretation
      - don't reposition the time-axis labels
      - don't use the "magnifying glass" concept

    Adjust for inflation when plotting money values against time
      - sales of polymer to DuPont over the past 10 years
      - example of car sales: http://www.duke.edu/~rnau/411infla.htm
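The inflation adjustment can be sketched in a few lines of Python. This is a minimal illustration, assuming a price index is available for every year plotted; the function name `to_constant_dollars`, the index values and the sales figures are all made up for the example.

```python
def to_constant_dollars(nominal, index_then, index_base):
    """Re-express a nominal amount in base-year money using a price index."""
    return nominal * index_base / index_then


# Hypothetical price index (values are made up for illustration).
price_index = {2002: 80.0, 2007: 90.0, 2012: 100.0}
# Hypothetical nominal sales, in thousands of dollars.
nominal_sales = {2002: 400.0, 2007: 450.0, 2012: 480.0}

base_year = 2012
real_sales = {
    year: to_constant_dollars(value, price_index[year], price_index[base_year])
    for year, value in nominal_sales.items()
}

for year in sorted(real_sales):
    print(year, nominal_sales[year], round(real_sales[year], 1))
```

With these made-up numbers the nominal series rises every period while the inflation-adjusted series is flat and then falls, which is the kind of reversal an unadjusted time-series plot would hide.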


    Time-series plots

    Show a reasonable amount of data for context


    Bar plots

    A univariate plot on a two-dimensional axis.

    Has a category axis and value axis

    Use a bar plot when:
      - there are many categories
      - interpretation does not change if the category axis is reordered


    Bar plots

    Rather use a time-series plot if the data have a sequence:

    You can see the trends more clearly.


    Bar plots

    Bar plots can be wasteful as each data point is repeated several times:

    1. left edge (line) of each bar

    2. right edge (line) of each bar

    3. the height of the colour in the bar

    4. the number's position (up and down along the y-axis)

    5. the top edge of each bar, just below the number

    6. the number itself


    Bar plots

    Maximize the data-ink ratio, within reason

    Data-ink ratio = (total ink for data) / (total ink for graphics)
                   = 1 - (proportion of ink that can be erased without loss of data information)
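As a small worked example of the ratio, the Python sketch below computes it from two ink estimates. The function name `data_ink_ratio` and the ink amounts are hypothetical; in practice "ink" would have to be estimated somehow, for instance by counting dark pixels in a rendered plot.

```python
def data_ink_ratio(data_ink, total_ink):
    """Fraction of a graphic's ink that presents data."""
    if total_ink <= 0 or not 0 <= data_ink <= total_ink:
        raise ValueError("need 0 <= data_ink <= total_ink and total_ink > 0")
    return data_ink / total_ink


# Hypothetical ink estimates in arbitrary units.
data_ink = 300
total_ink = 1200

ratio = data_ink_ratio(data_ink, total_ink)
erasable = 1 - ratio  # proportion that can be erased without losing data
print(ratio, erasable)
```

Here only a quarter of the ink carries data, so three quarters of it could in principle be erased.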

    Rather use a table for a handful of data points:


    Bar plots

    Don't use cross-hatching, textures, or unusual shading in the plots: it creates visual vibrations


    Bar plots

    Use horizontal bars if:
      - there is some ordering to the categories
      - the labels do not fit side-by-side

    You can place the labels inside the bars

    You should usually start the non-category axis at zero


    Box plots

    A graphical display of the 5-number summary for 1 variable

    minimum sample value

    25th percentile (1st quartile)

    50th percentile (median)

    75th percentile (3rd quartile)

    maximum sample value

    Notes:

    1. 25th percentile is the value below which 25 percent of the observations in the sample are found

    2. distance from 3rd to 1st quartile = interquartile range (IQR)

    Box plots are effective for comparing similar variables (same units of measurement)
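The 5-number summary and the IQR can be computed with Python's standard library, as in the minimal sketch below. The helper name `five_number_summary` and the five sample values are illustrative assumptions. The quartiles use the `inclusive` method of `statistics.quantiles`, which interpolates linearly between sorted values; other quartile conventions can give slightly different numbers.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles and max for a sample of at least two values."""
    data = sorted(values)
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Illustrative sample of five measurements.
sample = [1761, 1801, 1697, 1679, 1699]

summary = five_number_summary(sample)
iqr = summary["q3"] - summary["q1"]
print(summary, iqr)
```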


    Box plots

           Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
      1    1761  1739  1758  1677  1684  1692
      2    1801  1688  1753  1741  1692  1675
      3    1697  1682  1663  1671  1685  1651
      4    1679  1712  1672  1703  1683  1674
      5    1699  1688  1699  1678  1688  1705
      .       .     .     .     .     .     .
     96    1717  1708  1645  1690  1568  1688
     97    1661  1660  1668  1691  1678  1692
     98    1706  1665  1696  1671  1631  1640
     99    1689  1678  1677  1788  1720  1735
    100    1751  1736  1752  1692  1670  1671

    Video of data source


    Box plots

    > summary(boards[1:100, 2:7])
               Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
    Min.    :  1524  1603  1594  1452  1568  1503
    1st Qu. :  1671  1657  1654  1667  1662  1652
    Median  :  1680  1674  1672  1678  1673  1671
    Mean    :  1687  1677  1677  1679  1674  1672
    3rd Qu. :  1705  1688  1696  1693  1685  1695
    Max.    :  1822  1762  1763  1788  1741  1765


    Box plots


    Box plots

    Some variations:

    use the mean instead of the median

    outliers shown as dots, where an outlier is most commonly defined as any point more than 1.5 IQR distance units above the 3rd quartile or below the 1st quartile

    use the 2nd percentile (instead of 1st quartile - 1.5 IQR)

    use the 98th percentile (instead of 3rd quartile + 1.5 IQR)

    add the density histogram onto the box plot: violin plot
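The outlier rule can be sketched as follows. This is a minimal illustration that assumes the common Tukey convention, in which the 1.5 IQR distance is measured outward from the 1st and 3rd quartiles; the function name `tukey_outliers` and the readings are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using quartile +/- k * IQR."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in data if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical readings with one unusually low and one unusually high value.
readings = [1671, 1680, 1687, 1705, 1662, 1674, 1690, 1452, 1822]

outliers, fences = tukey_outliers(readings)
print(outliers, fences)
```

Changing `k` widens or narrows the fences. The percentile-based variations listed above would replace the two fence values instead.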


    Box plot variation: violin plot


    Scatter plots

    Used to help understand the relationship between two variables: a bivariate plot

    Collection of points in the 2 axes

    Each point is the intersection of the values on each axis

    Intention of a scatter plot

    Asks the viewer to draw a causal relationship between the two variables


    Scatter plots


    Scatter plots

    However, not all scatter plots show causal phenomena.


    Scatter plots

    Strive for graphical excellence by:

      - making each axis as tight as possible
      - avoiding heavy grid lines
      - using the least amount of ink
      - not distorting the axes


    Scatter plots

    There is an unfounded fear that others won't understand your 2D scatter plot.

    Tufte study (VDQI): no scatter plots in a sample (1974 to 1980) of Western dailies

    12 year olds can interpret such plots.

    Japanese newspapers frequently use scatter plots

    Plant control room: seldom see scatter plots.

    Key point

    Producers of charts often assume their audience is incapable of interpreting them. Rather, assume that if you can understand the plot, so will your audience.


    Scatter plots

    Add box plots or histograms to aid interpretation:


    Scatter plots

    Add a 3rd variable: different marker sizes

    Add a 4th variable: use colour or grayscale shading

    The GapMinder website allows you to play the graph over time (the 5th variable)


    Scatter plots

    Web-based demo from http://gapminder.org

    Demo by Hans Rosling (requires internet access)


    Tables

    Tables are for comparative data analysis on categorical objects.

    Note the rows are in default alphabetical order.

    We can make the table tell a story if we reorder the rows by some other variable, e.g. monthly insurance payment
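Reordering the rows is a one-line operation in most tools. The Python sketch below sorts a small table by a value column instead of leaving it alphabetical; the row labels and payment amounts are hypothetical placeholders, not the data from the original table.

```python
# Hypothetical table rows, initially in alphabetical order.
rows = [
    {"model": "Alpha", "monthly_insurance": 142},
    {"model": "Bravo", "monthly_insurance": 98},
    {"model": "Charlie", "monthly_insurance": 175},
    {"model": "Delta", "monthly_insurance": 120},
]

# Reorder by the variable of interest, largest payment first.
by_payment = sorted(rows, key=lambda r: r["monthly_insurance"], reverse=True)

for r in by_payment:
    print(f"{r['model']:<10}{r['monthly_insurance']:>6}")
```

After sorting, the reader's eye runs down the column from the most to the least expensive entry without any searching.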


    Tables

    Compare defect types (number of defects) for different product grades (categories):

    Which defects cost us the most money?


    Tables

    Defect frequency
      - If 1850 lots of grade A4636 (first row): defect A rate = 1/50
      - If 250 lots of grade A2610 (last row): defect A rate = 1/50
      - Redraw table on production rate basis

    If comparing defects over different grades: go down the table (show fraction within the column)

    If comparing defects within grade: go across table (show fraction within the row)
      - Could weight each column by cost of defect
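The three normalizations above (per lot produced, within a column, within a row) can be sketched in Python as follows. The lot totals come from the text. The defect A counts of 37 and 5 are assumptions chosen so that both rates reduce to 1/50, and the defect B counts and helper names are made up for illustration.

```python
from fractions import Fraction

# Lots produced per grade (from the text) and assumed defect A counts.
lots = {"A4636": 1850, "A2610": 250}
defect_a = {"A4636": 37, "A2610": 5}  # assumed counts, not from the source

# Production-rate basis: defects per lot produced.
rate = {grade: Fraction(defect_a[grade], lots[grade]) for grade in lots}

# Hypothetical count table: rows are grades, columns are defect types.
counts = {
    "A4636": {"A": 37, "B": 3},
    "A2610": {"A": 5, "B": 15},
}


def row_fractions(table):
    """Share of each defect type within a grade (read across a row)."""
    out = {}
    for grade, row in table.items():
        total = sum(row.values())
        out[grade] = {d: Fraction(c, total) for d, c in row.items()}
    return out


def column_fractions(table):
    """Share of each grade within a defect type (read down a column)."""
    totals = {}
    for row in table.values():
        for d, c in row.items():
            totals[d] = totals.get(d, 0) + c
    return {g: {d: Fraction(c, totals[d]) for d, c in row.items()}
            for g, row in table.items()}


print(rate)
print(row_fractions(counts))
print(column_fractions(counts))
```

The raw counts of 37 and 5 look very different, yet on a per-lot basis both grades have the same defect A rate.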


    Tables

    Three common pitfalls:

    1. using pie charts when tables will do


    Tables

    2. arbitrary ordering of the rows


    Tables

    3. using excessive grid lines


    Tables

    Interesting example: comparing two treatments


    Tables


    Data frames

    Frames are the basic containers that surround the data and give context to our numbers. Here are some tips:

    1. Use round numbers

    2. Tighten the axes as much as possible, except ...

    3. when showing comparison plots: all axes must have the same minima and maxima
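These three tips can be combined in a short Python sketch: choose the tightest limits that land on round numbers, and for comparison plots compute one pair of limits from all the series together. The helper names and the step size of 50 are assumptions, and the two short series are illustrative.

```python
import math


def round_limits(values, step):
    """Tightest axis limits that are multiples of `step` and contain the data."""
    lo = math.floor(min(values) / step) * step
    hi = math.ceil(max(values) / step) * step
    return lo, hi


def shared_limits(series, step):
    """One common pair of limits for several series shown side by side."""
    all_values = [v for s in series for v in s]
    return round_limits(all_values, step)


# Two illustrative series to be shown as comparison plots.
series_a = [1524, 1680, 1822]
series_b = [1452, 1678, 1788]

print(round_limits(series_a, 50))
print(round_limits(series_b, 50))
print(shared_limits([series_a, series_b], 50))
```

Each series alone would get its own tight frame, but the comparison plots should both use the shared limits so that equal distances mean equal differences.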


    Aesthetics and style

    I highly recommend reading Tufte's 4 books: they contain remarkable examples of how to bring data to life.


    Colour

    Colour is effective, but:
      - readers could be colour-blind
      - the document may be read from a gray-scale print out

    There is no standard colour progression (blues, greens, yellows, orange, red).

    Safest colour progression is the gray-scale axis, from black to white:
      - satisfies colour-blind readers
      - looks good in printed form
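A gray-scale progression is easy to generate programmatically. The sketch below is one possible way to do it, with a hypothetical helper `gray_ramp` that returns evenly spaced hex colours from black to white for use as plot colours.

```python
def gray_ramp(n):
    """Return n hex colours evenly spaced from black to white."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return ["#000000"]
    ramp = []
    for i in range(n):
        # Same level for red, green and blue gives a neutral gray.
        level = round(255 * i / (n - 1))
        ramp.append(f"#{level:02x}{level:02x}{level:02x}")
    return ramp


print(gray_ramp(5))
```

On a white page the final, pure-white step is invisible, so in practice one might request one more level than needed and drop the last entry.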


    General summary

    No general advice that applies in every instance. Useful tips nevertheless:

    To understand causality, you must show causality: use bivariate scatter plots (sometimes line plots also work well)

    Plots and text go together: a plot = paragraph of text
      - add labels to plots for outliers and interesting points
      - add equations
      - add small summary tables

    Avoid codes: A = grade TK133, B = grade RT231


    General summary

    Avoid unnecessary extras to enliven the plot

    If the statistics are boring, then you've got the wrong numbers.


    General summary

    Adjust for inflation if plot involves money and time

    Maximize the data-ink ratio = (ink for data) / (total ink for graphics).

    1. eliminate non-data ink

    2. erase redundant data-ink.

    Maximize data density: 250 data points per linear inch, and 625 data points per square inch.
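As a rough worked example of data density, the Python sketch below takes the batch data set from the usage examples (24 measurements, 5 readings per minute, 300 minutes) and computes its density in points per square inch. The 8 by 6 inch plot size and the function name `data_density` are assumptions made for illustration.

```python
def data_density(n_points, width_in, height_in):
    """Data points per square inch of plotting area."""
    return n_points / (width_in * height_in)


# 24 measurements x 5 readings per minute x 300 minutes.
n_points = 24 * 5 * 300

# Hypothetical plotting area of 8 x 6 inches.
density = data_density(n_points, 8, 6)
print(n_points, density)
```

The result can then be compared against the 625 points per square inch figure quoted above.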
