

Chapman & Hall/CRC

Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC
4th Floor, Albert House
1-4 Singer Street
London EC2A 4BQ
UK

Published Titles

Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson

Computational Statistics Handbook with MATLAB®, Second Edition
Wendy L. Martinez and Angel R. Martinez

Clustering for Data Mining: A Data Recovery Approach
Boris Mirkin

Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh

Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto

Exploratory Data Analysis with MATLAB®
Wendy L. Martinez and Angel R. Martinez

Interactive Graphics for Data Analysis: Principles and Examples
Martin Theus and Simon Urbanek

Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis

Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra

R Graphics
Paul Murrell

R Programming for Bioinformatics
Robert Gentleman

Semisupervised Learning for Computational Linguistics
Steven Abney

Statistical Computing with R
Maria L. Rizzo

Computer Science and Data Analysis Series

Interactive Graphics for Data Analysis Principles and Examples

Martin Theus Simon Urbanek

Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business

A CHAPMAN & HALL BOOK

Apple is a registered trademark of Apple Computer, Inc. AT&T is a registered trademark of AT&T Corp. DataDesk is a registered trademark of Data Description, Inc. SAS, SAS-Insight, SAS-JMP, and SAS-Stat-Studio are registered trademarks of SAS Institute Inc. and/or its affiliates. PostScript is a registered trademark of Adobe Systems Incorporated. SPLUS is a registered trademark of the Insightful Corporation. SPSS is a registered trademark of SPSS Inc. UNIX is a registered trademark of The Open Group. Windows is a registered trademark of Microsoft Corporation. Other third-party trademarks belong to their respective owners.

The figures on pages 157 and 213 are reproduced under the GNU Free Documentation License (GFDL) and can be found in the Wikimedia Commons. The figure on page 173 is copyright 2003 by Fenton, licensee BioMed Central Ltd., and is taken from the Open Access article "A new growth chart for preterm babies: Babson and Benda's chart updated with recent data and a new format."

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-1-58488-594-8 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Theus, Martin.
Interactive graphics for data analysis : principles and examples / Martin Theus, Simon Urbanek.
p. cm. -- (Computer science and data analysis series)
Includes bibliographical references and index.
ISBN 978-1-58488-594-8 (hardcover : alk. paper)
1. Graphical modeling (Statistics)--Data processing. 2. Statistics--Graphic methods. 3. Computer graphics. I. Urbanek, Simon. II. Title.

QA276.3.T54 2008
006.6--dc22    2008038130

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com


To John W. Tukey


Contents

Introduction  1

I Principles  9

1 Interactivity  11
  1.1 Queries  11
  1.2 Selection and Linked Highlighting  13
  1.3 Linking Analyses  25
  1.4 Interacting with Graphics  26

2 Examining a Single Variable  29
  2.1 Categorical Data  29
  2.2 Continuous Data  31
  2.3 Transforming Data  40
  2.4 Weighted Plots  43

3 Interactions between Two Variables  49
  3.1 Two Categorical Variables  49
  3.2 One Categorical Variable and One Continuous Variable  53
  3.3 Two Continuous Variables  57

4 Multidimensional Plots  69
  4.1 Mosaic Plots  69
  4.2 Parallel Coordinate Plots  75
  4.3 Trellis Displays  81

5 Plot Ensembles and Statistical Models  89
  5.1 Response Models  89
  5.2 ANOVA  93
  5.3 Log-linear Models  98

6 Geographical Data  107

7 More Interactivity  119
  7.1 Sorting and Ordering  119
  7.2 Zooming  122
  7.3 Multiple Views  126
  7.4 Interactive Graphics ≠ Dynamic Graphics  127

8 Missing Values  133

9 Large Data  143
  9.1 Unaffected, Summary-Based Plots  143
  9.2 Glyph-Based Plots  145

10 On the Examples  151

II Examples  155

A How to Pass an Exam  157

B Washing - What Makes the Difference  165

C The Influence of Smoking on Birthweight  173

D The Titanic Disaster Revisited  183

E Housing Rent Prices in Munich  193

F What Makes a Tour de France Winner  203

G How to Survive the Thirty Years' War  213

H Classification of Italian Olive Oils  221

I E-Voting in the 2004 Florida Election  231

i Mondrian Reference  241
  i.1 Quick Start Guide  241
  i.2 Plots  251
  i.3 Conventions  257
  i.4 Reference Card  266

References  269

Author Index  275

Subject Index  277

Introduction

What is this book about?

This book talks about exploratory data analysis (EDA) and how interactive graphical methods can help to gain further insights into a dataset and to generate new questions and hypotheses. John W. Tukey often referred to EDA as experimental work. Tukey and Wilk (1965) summarize data analysis, saying it "... must be considered as an open-ended, highly interactive, iterative process, whose actual steps are segments of a stubbily branching, tree-like pattern of possible actions." Visualization of the data is probably one of the most powerful tools in this exploration process, as the role of the researcher in EDA is to explore the data in as many different ways as possible until a plausible "story" of the data emerges. Typical data analyses comprise the following eight steps:

1. Plan the study
A well-thought-out study design that respects the study goals should be the initial step of any data analysis. For very clear-cut questions like optimizing the yield in a plant, optimal designs can be chosen; see Pukelsheim (2006), for instance. Unfortunately, statisticians are often consulted after the data were collected and thus cannot influence the study design. EDA methods can cope with the "here is the data" situation much more easily because they do not rely on a priori hypotheses and distributions.

2. Understand the background and collect questions
Analyzing data without a further understanding of the background is almost impossible. This is often neglected in classical teaching of mathematical statistics. It is only if procedures and techniques relate to actual data that we might find interpretable results and be able to give proper recommendations. Thus, a study of the background and the data sources is extremely important for conducting a successful data analysis.

3. Check the data for errors
From textbook examples, we are used to looking at what we regard as "clean" datasets. There are no obvious errors and the data seem to be consistent. But even for those datasets, the origin and the background of the data sometimes remain somewhat unclear and
