Chapman & Hall/CRC
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC, 4th Floor, Albert House, 1-4 Singer Street, London EC2A 4BQ, UK
Published Titles
Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson

Computational Statistics Handbook with MATLAB®, Second Edition
Wendy L. Martinez and Angel R. Martinez

Clustering for Data Mining: A Data Recovery Approach
Boris Mirkin

Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh

Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto

Exploratory Data Analysis with MATLAB®
Wendy L. Martinez and Angel R. Martinez

Interactive Graphics for Data Analysis: Principles and Examples
Martin Theus and Simon Urbanek

Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis

Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra

R Graphics
Paul Murrell

R Programming for Bioinformatics
Robert Gentleman

Semisupervised Learning for Computational Linguistics
Steven Abney

Statistical Computing with R
Maria L. Rizzo
Computer Science and Data Analysis Series
Interactive Graphics for Data Analysis Principles and Examples
Martin Theus Simon Urbanek
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
Apple is a registered trademark of Apple Computer, Inc. AT&T is a registered trademark of AT&T Corp. DataDesk is a registered trademark of Data Description, Inc. SAS, SAS-Insight, SAS-JMP, and SAS-Stat-Studio are registered trademarks of SAS Institute Inc. and/or its affiliates. PostScript is a registered trademark of Adobe Systems Incorporated. SPLUS is a registered trademark of the Insightful Corporation. SPSS is a registered trademark of SPSS Inc. UNIX is a registered trademark of The Open Group. Windows is a registered trademark of Microsoft Corporation. Other third-party trademarks belong to their respective owners.
The figures on pages 157 and 213 are reproduced under the GNU Free Documentation License (GFDL) and can be found in the Wikimedia Commons. The figure on page 173 is copyright 2003 by Fenton, licensee BioMed Central Ltd., and is taken from the Open Access article "A new growth chart for preterm babies: Babson and Benda's chart updated with recent data and a new format."
Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742
© 2009 by Taylor & Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-1-58488-594-8 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Theus, Martin.
Interactive graphics for data analysis : principles and examples / Martin Theus, Simon Urbanek.
p. cm. -- (Computer science and data analysis series)
Includes bibliographical references and index.
ISBN 978-1-58488-594-8 (hardcover : alk. paper)
1. Graphical modeling (Statistics)--Data processing. 2. Statistics--Graphic methods. 3. Computer graphics. I. Urbanek, Simon. II. Title.
QA276.3.T54 2008
006.6--dc22
2008038130
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
To John W. Tukey
Contents

Introduction ..... 1

I Principles ..... 9

1 Interactivity ..... 11
  1.1 Queries ..... 11
  1.2 Selection and Linked Highlighting ..... 13
  1.3 Linking Analyses ..... 25
  1.4 Interacting with Graphics ..... 26

2 Examining a Single Variable ..... 29
  2.1 Categorical Data ..... 29
  2.2 Continuous Data ..... 31
  2.3 Transforming Data ..... 40
  2.4 Weighted Plots ..... 43

3 Interactions between Two Variables ..... 49
  3.1 Two Categorical Variables ..... 49
  3.2 One Categorical Variable and One Continuous Variable ..... 53
  3.3 Two Continuous Variables ..... 57

4 Multidimensional Plots ..... 69
  4.1 Mosaic Plots ..... 69
  4.2 Parallel Coordinate Plots ..... 75
  4.3 Trellis Displays ..... 81

5 Plot Ensembles and Statistical Models ..... 89
  5.1 Response Models ..... 89
  5.2 ANOVA ..... 93
  5.3 Log-linear Models ..... 98

6 Geographical Data ..... 107

7 More Interactivity ..... 119
  7.1 Sorting and Ordering ..... 119
  7.2 Zooming ..... 122
  7.3 Multiple Views ..... 126
  7.4 Interactive Graphics ≠ Dynamic Graphics ..... 127

8 Missing Values ..... 133

9 Large Data ..... 143
  9.1 Unaffected, Summary-Based Plots ..... 143
  9.2 Glyph-Based Plots ..... 145

10 On the Examples ..... 151

II Examples ..... 155

A How to Pass an Exam ..... 157
B Washing - What Makes the Difference ..... 165
C The Influence of Smoking on Birth weight ..... 173
D The Titanic Disaster Revisited ..... 183
E Housing Rent Prices in Munich ..... 193
F What Makes a Tour de France Winner ..... 203
G How to Survive the Thirty Years' War ..... 213
H Classification of Italian Olive Oils ..... 221
I E-Voting in the 2004 Florida Election ..... 231

i Mondrian Reference ..... 241
  i.1 Quick Start Guide ..... 241
  i.2 Plots ..... 251
  i.3 Conventions ..... 257
  i.4 Reference Card ..... 266

References ..... 269
Author Index ..... 275
Subject Index ..... 277
Introduction
What is this book about?
This book is about exploratory data analysis (EDA) and how interactive graphical methods can help to gain further insights into a dataset and to generate new questions and hypotheses. John W. Tukey often referred to EDA as experimental work. Tukey and Wilk (1965) summarize data analysis by saying it " ... must be considered as an open-ended, highly interactive, iterative process, whose actual steps are segments of a stubbily branching, tree-like pattern of possible actions." Visualization of the data is probably one of the most powerful tools in this exploration process, as the role of the researcher in EDA is to explore the data in as many different ways as possible until a plausible "story" of the data emerges. Typical data analyses comprise the following eight steps:
1. Plan the study
A well-thought-out study design that respects the study goals should be the initial step of any data analysis. For very clear-cut questions, like optimizing the yield in a plant, optimal designs can be chosen; see Pukelsheim (2006), for instance. Unfortunately, statisticians are often consulted only after the data have been collected and thus cannot influence the study design. EDA methods can cope with the "here is the data" situation much more easily because they do not rely on a priori hypotheses and distributions.
2. Understand the background and collect questions
Analyzing data without a further understanding of the background is almost impossible. This is often neglected in classical teaching of mathematical statistics. It is only if procedures and techniques relate to actual data that we might find interpretable results and be able to give proper recommendations. Thus, a study of the background and the data sources is extremely important for conducting a successful data analysis.
3. Check the data for errors
From textbook examples, we are used to looking at what we regard as "clean" datasets. There are no obvious errors and the data seem to be consistent. But even for those datasets, the origin and the background of the data sometimes remain somewhat unclear and