
stat-computing.org/newsletter/issues/scgn-20-2.pdf

Volume 20, December 2009

A joint newsletter of the Statistical Computing & Statistical Graphics Sections of the American Statistical Association

A Word from our 2009 Section Chairs

JOSÉ PINHEIRO, COMPUTING

ANTONY UNWIN, GRAPHICS

The cross-disciplinary nature of Statistical Computing is one of its defining features, but may also be a challenge for people attempting to develop a career in the field. Even though statistical computing methods are increasingly driving innovation in key core areas of Statistics and Computer Sciences, researchers and practitioners specializing in Statistical Computing are at times not properly recognized for their contributions and expertise. Continued on page 2 . . .

"Description is Revelation" (Wallace Stevens, later also Seamus Heaney)

Graphic displays should reveal information and it is always a pleasure to come across graphics that do just that. They describe the information in a graphic way that is often more readily understandable than a set of statistics or the results of a model. Continued on page 2 . . .

Contents of this volume:

A Word from our 2009 Section Chairs . . . 1
Editor's Note . . . 3
Highlights from the Joint Statistical Meetings . . . 4
High-Flying Graphics at the 2009 Data Expo . . . 5
Hadoop for Statistical Analysis and Exploration . . . 12
Statistical Graphics! — who needs Visual Analytics? . . . 17
Reader Response . . . 24
Announcements . . . 25
Section Officers . . . 26


Computing Chair
Continued from page 1.

Establishing a professional identity as an expert in Statistical Computing may be challenging in some companies and academic departments, often leading those individuals to develop additional areas of expertise/activity that are more in line with those of their peers.

This problem is of course not new and has long been recognized by our section. Two of the awards presented by the Statistical Computing section, the Student Paper Competition (jointly sponsored with the Statistical Graphics section) and the John M. Chambers Award, are intended to attract and recognize young talent in the field. Now a new award, the Statistical Computing and Graphics Award, is being created to recognize individuals who, over their careers, have made innovative and impactful contributions in the areas of statistical computing, graphics, and/or software. Initially suggested by Deborah Nolan, the new award is jointly organized and sponsored by the Statistical Computing and Statistical Graphics sections, having been enthusiastically endorsed by the executive committees of both sections. It is intended to be presented biennially at the Joint Statistical Meetings (JSM), starting next year. Further details can be found in the award announcement, available at http://stat-computing.org/awards/comp-graphics.

On a different topic, the statistical computing program for next year's JSM is shaping up nicely under the leadership of Thomas Lumley, the 2010 program chair. In recent years, there have been fewer roundtables, Continuing Education (CE) courses, and computer technology workshops sponsored by our section. Those learning opportunities are highly valued by JSM attendees, provide great visibility, increase the understanding of statistical computing, and are a source of revenue for our section. Even though it is too late to submit CE proposals for next year's JSM, I strongly encourage our section members to consider submitting proposals for future meetings.

As this is my final column as section chair, I would like to take the opportunity to thank my fellow officers in the Stat Computing and Stat Graphics sections for making this such an enjoyable experience. Luke Tierney will be taking over as chair in January, ensuring that the future of the section will be in excellent hands.

Finally, on behalf of the section I would like to thank Elizabeth Slate, who is rotating out of the Secretary/Treasurer position at the year-end, for her outstanding contributions to the section over the past two years. A warm welcome to Usha Govindarajulu, who will be taking over the financial helm of the section.

José Pinheiro
Johnson & Johnson

Graphics Chair
Continued from page 1.

But graphics is still more of an art than a science and it is surprising how much tastes vary. I have been visiting Heike Hofmann and Di Cook at Iowa State and Di commented on how much progress computer graphics had made in the last twenty years compared to statistical graphics. She is right, and I think it is because we can all generally agree on which software renders a teapot best (to name one classic example), while few agree on which graphics are best for the iris dataset (to name another classic example of a different type). It is difficult to make progress when the goals are not easy to recognise.

If statistical graphics is to develop, there have to be more students working in the field, and the Student Paper Competition (stat-computing.org/awards/student/index.html) is one way the section encourages this. The competition is run jointly with the Statistical Computing section and usually there are far more computing entries than graphics entries. Does anyone have any ideas for changing this state of affairs? My own opinion, doubtless both controversial and contentious, is that it's harder to get a graphics paper accepted. A non-graphics paper will probably involve some new technical idea and be mathematically or computationally sophisticated. It looks impressive, even if it may not achieve a lot. A graphics paper is more likely to involve some insight or simplification that makes the reviewer think it is obvious and therefore not worth publishing, even if the idea has direct application. Whatever the reason, we have far too few graphics entries and I exhort you all to persuade students to think of submitting a paper.


Last year's Data Expo was a great success and I hope any of you who attended the JSM took a look at the submissions (you can still see some of them at http://stat-computing.org/dataexpo/2009/posters/). Hadley Wickham provided an excellent and unusually large dataset on flight delays. The contrast in size between this dataset (about 120 million cases) and one from a previous competition was pointed out by Jürgen Symanzik, who found these instructions for the 1983 Data Expo: "Because of the Committee's limited (zero) budget for the Exposition, we are forced to provide the data in hardcopy form only (enclosed). (Sorry!) There are 406 observations on the following 8 variables...". Note that each participating group had to enter the data by hand before doing any analysis! For the coming JSM in Vancouver Hadley has promised another fascinating dataset and I recommend you keep a look out for that. Bear in mind that the main aim of the Expo is not to give prizes (though prizes for the three most interesting entries are awarded), but to gather together different ways of visualizing the same data. Innovative views of a small aspect of the dataset can make an important contribution; there is no need for every entry to include a full analysis of the whole dataset.

And one final important note: our two sections have agreed to establish a new award, the Statistical Computing and Graphics Award, to recognize an individual or team for innovation in computing, software, or graphics that has had a great impact on statistical practice or research (stat-computing.org/awards/comp-graphics/). Whom would/will you nominate?

Antony Unwin
University of Augsburg

Editor's Note
Nicholas Lewin-Koh (Computing)
Andreas Krause (Graphics)

This issue of Volume 20 of our newsletter contains three articles:

— High-Flying Graphics at the 2009 Data Expo by Rick Wicklin

— Hadoop for Statistical Analysis and Exploration by Byron Ellis

— Statistical Graphics! – Who needs Visual Analytics? by Martin Theus

Our section heads provide interesting views in their column "A word from our section chairs", we look back at the highlights of the JSM 2009, and feature a reader response from Naomi Robbins. These articles all provide interesting reading and we hope you enjoy them.

We are always looking for interesting contributions. Please contact us if you have short articles, software, graphs, or anything that you think might be of interest to the ASA Sections for Statistical Computing and Graphics.

We are seeking a new editor for this newsletter to coordinate the graphics section. Serving as newsletter editor is a great opportunity to get to know members of the statistical computing and graphics communities, building up, enhancing, and revitalizing your network. It is also a valuable professional service function to both the ASA and the Statistical Computing and Graphics Sections.

Nicholas and Andreas will work closely with the new editor to manage the transition smoothly. Interested people should contact Nicholas ([email protected]) or Andreas ([email protected]).

Last but not least: Happy New Year to all of you!

Nicholas and Andreas


Highlights from the Joint Statistical Meetings

Highlights from the Stat Computing program

The record number of attendees at this year's JSM in Washington D.C. had an impact on the meeting program (including the Stat Computing program). The larger than usual number of Contributed sessions, eleven Paper sessions and one Poster session, gave a clear indication of the high degree of interest in statistical computing topics. The six invited sessions focused on computational topics in spatial and temporal data analysis, Bayesian inference, and collaborative research and applications in statistical computing. Together, these 18 sessions covered a broad range of relevant and interesting topics involving statistical computing and attracted a substantial number of JSM attendees. Kudos to the program chair, Robert McCulloch, the speakers, discussants, session organizers, and chairs.

A couple of areas in which the program could be strengthened for future JSMs are Topic Contributed sessions and Continuing Education (CE) events. Besides the traditional, and always excellent, session featuring the winners of the Student Paper Competition, there was a single additional Topic Contributed session sponsored by our section at this year's meeting, focusing on computing environments and large datasets. The submission of Topic Contributed session proposals for the 2010 JSM is still open, though the end-of-January deadline is quickly approaching. All are encouraged to submit proposals.

No CE courses or tutorials were sponsored by the Stat Computing Section at this year's JSM, ditto for roundtables. This is a missed opportunity for promoting and creating greater awareness about Statistical Computing, as well as for generating revenue for the Section. The submission deadline for CE courses has passed, but there is still time for submitting proposals for Computer Technology Workshops and roundtables.

Perhaps influenced by the record JSM attendance, this year's Student Paper Competition had a record number of entries. The large number of high quality submissions made it difficult for the judges to select just four winners, as traditionally done. Upon their request, the Stat Computing and Stat Graphics executive committees agreed to have five winners for this year's competition.

As usual, the Monday night Stat Computing and Stat Graphics mixer was one of the highlights of the meeting, at least for the crowd who attends it every year. It is a great opportunity to meet old friends and make new ones, while enjoying good food, getting the latest news from both sections, and, with a little bit of luck, winning one of the many door prizes that are given out. On the latter, a big thanks to the vendors, too many to list here, who, year after year, generously donate books, software, gadgets, etc., for the prizes.

José Pinheiro
Johnson & Johnson

Looking Back — the Graphics program at JSM 2009

As always there was so much going on at the JSM, so many talks, so many discussions, so many contacts, that it's hard to remember what actually took place. One thing I can remember was the annual mixer. For once in my long years of JSM experience I was not looking forward to it. Having part of the responsibility for organising an event is always a sobering influence. Fortunately everyone seemed to enjoy themselves, there were a lot of raffle goodies to distribute, and the main event, honouring our prizewinners, went off very well.

In the conference itself the section cosponsored seven invited sessions, two contributed sessions, a continuing education course (Martin Theus and Simon Urbanek on their super book "Interactive Graphics for Data Analysis"), two roundtable lunches and, most importantly, the Data Expo on flight delays. The Data Expo had a prominent position in the main conference building and was seen by a lot of people. (By the way, if you want to refresh your memory on what all these events were in detail, you can still access all the information from the JSM 2009 program on the web. Just enter under Search by sponsor "Section on Statistical Graphics".) I wish I could say that I went to all of these sessions and could tell you all about them. In fact, thanks to the regular scheduling rule that there are always some sessions that overlap that shouldn't, I was giving an Introductory Overview Lecture on Visualising high-dimensional data just as one of our invited sessions, "Data Display for Large Complex Data Sets", was taking place. The sessions I did manage to get to were very good and we should offer our congratulations to our Program Chair, Steve MacEachern, for his coordinating work and to all the organisers and speakers for their contributions. It would be nice to see more contributed paper sessions (our sister section Statistical Computing has far more of those), but Statistical Graphics is a tricky subject for research and publication, and the short time available for each paper in contributed sessions is not ideal. Do think about it in the future, though.

After the JSM in 2007 Hadley Wickham put up a lot of information about the meeting on the section's website (stat-graphics.org/graphics/). We should rekindle this shortstanding tradition and put stuff up about JSM 2009. We encourage everyone to send their talks, photos, related materials or URL links to Hadley Wickham <[email protected]> for inclusion in the website.

Antony Unwin
University of Augsburg

High-Flying Graphics at the 2009 Data Expo
Rick Wicklin

Introduction

The Data Expo is a biennial poster session usually sponsored jointly by the ASA Sections on Statistical Graphics and Statistical Computing. The purpose of the poster session is to distribute an interesting data set to many researchers and to challenge them to use statistical graphics to describe and visualize the data concisely on a single poster. The session is always well attended, and it helps the Sections highlight the importance of statistical graphics in data analysis.

This year's Data Expo was organized by Hadley Wickham, who assembled a truly massive set of data from the Research and Innovative Technology Administration (RITA), which coordinates the U.S. Department of Transportation (DOT) research programs. The data (available from http://stat-computing.org/dataexpo/2009/) consist of 123 million records of U.S. domestic commercial flights between 1987 and 2008. For each flight there is information about 29 variables, including the following:

• Dates: day of week, date, month, and year

• Arrival and departure times: actual and scheduled

• Origin and destination: airport code, latitude, and longitude

• Carrier: American, Aloha Air, . . ., US Air

It is fascinating to see the varied approaches and graphical displays presented for these data. You are invited to browse the electronic version of the posters at http://stat-computing.org/dataexpo/2009/posters/.

This article describes several graphs in my poster, which was joint work with Robert Allison (6). The poster graphically presents ways in which flight delays and cancellations vary in time, among airports, and among airline carriers. The article also describes graphical methods and features of the data that elicited the most comments from visitors to the Data Expo. For me, the discussions with colleagues during and after the poster session were the most gratifying part of creating the poster.

Data Reduction

When presenting data graphically, it is important to be able to quickly convey the main features in the data. But how can you summarize 21 years' worth of data on the delays or cancellations of 123 million flights?

The DOT defines a departing flight as "delayed" if it departs more than 15 minutes after its scheduled departure time. The same 15-minute window is used for defining when an arriving flight is delayed.

For these data, I found it useful to reduce the size of the data by computing a descriptive statistic such as the mean length of a delay or the percentage of flights delayed. These statistics are calculated over an appropriate aggregation unit such as a date, a year, an airport, or a carrier. This often proved to be an important first step in understanding and conveying gross trends in the data. All but two graphs in my poster displayed descriptive statistics, rather than the raw data.

For example, one quantity displayed in several graphs is the "percentage of flights delayed" (PFD) for a given day. (This is a standard quantity measured and reported by the DOT; see http://www.bts.gov/.) If a certain day has 30,000 flights and 6,000 of those are more than 15 minutes late, then the PFD for that day is 20%. Plotting the PFD for each day instead of the original data results in a significant reduction in the volume of data, but at the usual risk of losing details that might be important.
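The daily PFD aggregation is easy to sketch. The poster itself was produced with SAS; the following is an illustrative pandas version on a tiny made-up sample, with hypothetical column names rather than the Data Expo's exact schema:

```python
import pandas as pd

# Hypothetical miniature of the flight data: one row per flight,
# with the departure delay in minutes (column names are illustrative).
flights = pd.DataFrame({
    "date": ["2007-06-01"] * 5 + ["2007-06-02"] * 5,
    "dep_delay": [3, 20, 45, 0, 10, 2, 5, 16, 9, 1],
})

# DOT definition: a flight is "delayed" if it departs more than
# 15 minutes after its scheduled departure time.
flights["delayed"] = flights["dep_delay"] > 15

# Percentage of flights delayed (PFD), aggregated per day.
pfd = flights.groupby("date")["delayed"].mean() * 100
print(pfd)
```

The same pattern extends to the other aggregation units mentioned above: group by year, airport, or carrier instead of date.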

Temporal Effects: A 21-Year Summary

The main graphical technique we used to present how the PFD varies over 21 years is a calendar plot, shown in Figure 1 for five years of the data. For each year, there are seven rows that represent the days of the week and usually 53 columns that represent each week in the year. The color of each day represents the value of the variable for that day. The earliest reference I have seen for a calendar plot is Mintz, Fitz-Simons, and Wayland (3). A SAS macro for computing the plot was presented in Zdeb and Allison (8).

The calendar plot is a choropleth map: each year is a "country," each month is a "state," and each day is a "county." By thinking of the calendar plot as a choropleth map, we can use ideas from the cartographic literature to guide the design of the plot. For example, we can follow Pickle et al. (5) and choose a color for each day as determined by the quantile of the PFD for each day.

In Figure 1, the colors are determined by quintiles of the PFD for all days in the five-year time period. The colors are chosen to suggest a traffic-light color scheme: the first quintile of the data is colored green for "go"; the second and third quintiles are yellow and pale orange for "caution"; a darker orange and red are used for the upper quintiles to signify extreme delays. The actual colors are based on a diverging color scheme from ColorBrewer.org (1). This web site is a valuable resource for creating color schemes because Brewer's schemes have the important property that each color is equally perceived: no one color catches the eye and draws the viewer's attention. The colors are also designed to be distinguishable to the color blind.
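The quintile-based coloring can be sketched as follows. This is an illustrative pandas version (the poster used SAS), with made-up PFD values and plain color names standing in for the actual ColorBrewer palette:

```python
import pandas as pd

# Hypothetical daily PFD values (percent); in the poster these would be
# the daily values for 2004-2008.
daily_pfd = pd.Series([8, 12, 15, 18, 21, 24, 27, 30, 35, 42])

# "Traffic light" palette, lowest quintile (green) through highest (red).
# Names are placeholders for the diverging ColorBrewer colors.
colors = ["green", "yellow", "pale_orange", "dark_orange", "red"]

# Assign each day to a quintile of the PFD distribution and map to a color.
day_color = pd.qcut(daily_pfd, q=5, labels=colors)
print(day_color.value_counts())
```

Because the bins are quantiles of the observed distribution, each color covers the same number of days, which is what gives the calendar plot its balanced appearance.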

Figure 1: Calendar Plot for Percentage of Flights Delayed (2004–2008)

A few aspects of the data are apparent by looking at the color patterns of columns: the PFD is relatively low in the spring and fall, whereas summer and the last two weeks of December have many days with a high PFD. The PFD in January–March can be quite variable, presumably due to winter storms. If you average the values of the calendar plot for each week (that is, take the average of each column), you obtain the scatter plot in Figure 2, which plots the average weekly PFD versus the week of the year. A loess smoother (adjusted for periodicity in the data) is overlaid on the plot so that it is easier to see that, on average, the PFD is high during the summer and Christmas seasons but low during the spring and autumn. In a paneled layout, this figure can be placed underneath the calendar plot, similar to the arrangement used by Peng (4).

Figure 2: Average Weekly Percentage of Flights Delayed (2004–2008)

A calendar plot for the percentage of flights canceled (PFC) can be similarly constructed and shows similar features. However, the most striking feature of the calendar plot for canceled flights is related to the terrorist attacks on September 11, 2001, as shown in Figure 3. In the two weeks prior to 9/11, an average of 2.3% of flights were canceled. By the middle of October 2001, the average PFC was down to 1% of flights, due in part to a 17% reduction in the number of scheduled flights per day. The daily PFC remained low until 2004.

Figure 3: Calendar Plot for Percentage of Flights Canceled (1999–2003)

As suggested by Pickle et al. (5), it is a good idea to plot the distribution of the variable that you are using to color the map. Figure 4 shows the distribution of the PFC variable for the five years shown in Figure 3. (The distribution over the full 21 years is similar, but was not included in my poster due to space considerations.) This distribution exhibits a long tail.

Following Pickle et al. (5), there is a color bar beneath the density plot and the horizontal axis is truncated at the 99th percentile of the data. (The actual maximum is indicated in the graph.) The color bar shows the values that correspond to the colors used in the calendar plot; this gives the viewer a visual impression of the varying lengths of the quantiles. An investigation of the days with the most cancellations reveals that they are primarily related to the 9/11 attacks or to extreme weather events.

Figure 4: Distribution of the Percentage of Flights Canceled (1999–2003)

In summary, you can use the calendar plot to reveal features and seasonal trends for 21 years of airline delays and cancellations.

Carrier Effects: Multivariate Visualization of Time Series

The calendar plot is useful for showing how a single quantity varies during long periods of time. However, it is not so useful for comparing multiple quantities that vary in time. For example, suppose you are interested in comparing the mean length of delays for flights among the different carriers. Are there some carriers that are almost always on time, while others are habitually late? To investigate this question, we constructed a multivariate time series plot as described by Peng (4).

The plot is displayed in Figure 5 for 20 major airline carriers during 2007. That year is chosen because there was a major winter storm (the "Valentine's Day Blizzard") that affected the entire eastern half of North America and paralyzed transportation in all forms (7). The effect of the storm is visible as a dark brown vertical line in the middle of February.

Figure 5: Multivariate Time Series of Mean Delay for Each Carrier (2007)

Following Peng and a suggestion from John Sall (personal communication), the carriers are sorted according to the mean delay for all flights during the year. This makes it easy to see that Aloha Airlines (AQ) and Hawaiian Airlines (HA) have superior on-time performance, presumably because of short flight times and fair weather! The next best performing airlines in 2007 according to this metric were Southwest (WN), Frontier (F9), Delta (DL), AirTran (FL), Express (9E), and JetBlue (B6).
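Sorting the rows of a multivariate time-series display by a summary statistic is a small but effective step. A pandas sketch of the idea, under assumed column names and a toy three-carrier sample (the poster used SAS):

```python
import pandas as pd

# Hypothetical per-flight records: carrier code, date, delay in minutes.
flights = pd.DataFrame({
    "carrier": ["AQ", "AQ", "UA", "UA", "WN", "WN"],
    "date": pd.to_datetime(["2007-01-01", "2007-01-02"] * 3),
    "dep_delay": [1, 2, 30, 40, 5, 9],
})

# Mean daily delay per carrier: one row per carrier, one column per date,
# which is the layout of the multivariate time-series plot.
grid = flights.pivot_table(index="carrier", columns="date",
                           values="dep_delay", aggfunc="mean")

# Sort carriers by their overall mean delay so the best performers
# (here AQ) appear together, as in Figure 5.
order = flights.groupby("carrier")["dep_delay"].mean().sort_values().index
grid = grid.loc[order]
print(grid)
```

Each row of `grid` would then be rendered as one colored strip of the plot.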

Again, it is useful to use ideas from choropleth maps to choose colors for these data. For this plot the colors are based on seven quantiles and on a blue-brown color ramp developed by Cindy Brewer and used successfully in Pickle et al. (5). (This double-ended color scheme does a good job of simultaneously depicting both high and low values.) The blue shades correspond to days for which the mean daily delay for a particular carrier was less than five minutes. The brown colors correspond to days and carriers for which the mean delay was between 13 and 21 minutes (light brown) or more than 21 minutes (dark brown). The distribution of mean daily delays is shown in Figure 6. You can see that the median value of the distribution corresponds to a mean delay of seven minutes, but that the distribution has a long tail.

Figure 6: Distribution of the Mean Daily Delay for Each Carrier (2007)

A box plot (not shown) of the mean delays for each day of the year could be placed on the right side of Figure 5, as suggested by Peng (4) and used in a cartographic context by Carr, Wallin, and Carr (2). The box plot enables you to see the within-carrier variation of the daily means throughout the year and conclude, for example, that even though the delays for Frontier airlines are usually low, there is a large amount of variation in the daily means.

Two features of this multivariate time series plot are striking. The first is the "outlying" behavior of Aloha and Hawaiian Airlines. The second is the well-defined vertical bands of brown (long delays) for most carriers in February, in the summer, and in the latter half of December. This is in marked contrast to the faint blue bands in April and May, and the heavier blue bands in September through early November. These qualitative trends can be visualized further by smoothing the time series, similar to Figure 2. The graph (not shown) suggests that the mean delays in the spring and fall are shorter than the mean delays during the summer and during the latter half of December, and that this tends to be true for most carriers (although Atlantic Southeast Airlines (EV) seems an exception to the rule).

In summary, the multivariate time series plot enables you to see gross similarities and differences between the mean daily delays of the 20 carriers.

Airport Effects: Interactive Charts

You can create a graph similar to Figure 5 for airports. In fact, you can create two such graphs: one for flights that originate at the airport, and another for flights for which the airport is the destination. However, in my poster I chose to present the relationship between airports and delays by using interactive and dynamically linked graphics.

The interactive graphic takes the form of a map of the continental U.S. with a single airport selected, as shown in Figure 7. I nicknamed this graph the splay plot because it reminds me of splayed fingers. All flights that originate from the selected airport are grouped according to their destination and are colored according to the PFD for each destination. Note that some routes are consistently on time, whereas others are frequently delayed.

In the interactive version of Figure 7, you can click on the splay plot to select a different airport. The plot uses SAS/IntrNet® software to detect the click and to run a SAS program to create a splay plot for the new airport. (See my poster for an example.) So that the colors on the splay plot do not change from one airport to another, the colors are hard-coded into five categories: less than 10% of flights delayed (green), between 10% and 15% of flights delayed (purple), and so on, up to more than 25% of flights delayed (red).
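Fixed-threshold binning, as opposed to the quantile binning used for the calendar plots, keeps the legend stable across airports. An illustrative pandas sketch (the intermediate color names between the stated green, purple, and red are assumptions, since the article only gives the endpoints of the scheme):

```python
import pandas as pd

# Hypothetical PFD (percent delayed) for a few routes out of one airport.
route_pfd = pd.Series([4.0, 12.0, 17.0, 23.0, 31.0])

# Fixed thresholds, so the legend stays the same whichever airport is
# selected: <10% green, 10-15% purple, ..., >25% red.
bins = [0, 10, 15, 20, 25, 100]
labels = ["green", "purple", "orange", "brown", "red"]  # middle names assumed
route_color = pd.cut(route_pfd, bins=bins, labels=labels)
print(route_color)
```

With `pd.cut` the bin edges are absolute values; with `pd.qcut` they would shift with each airport's own distribution, which is exactly what the splay plot avoids.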

The splay plot could be improved in several ways. It could include a density estimate similar to Figure 4 that shows the distribution of the coloring variable for each route. At the Data Expo, Simon Urbanek suggested using semitransparent lines in Figure 7.

Figure 7: Percentage of Flights Delayed (Denver, 2008)

A second approach to presenting the relationship between airports and delays is to use dynamically linked graphs. Figure 8 shows this approach in the SAS/IML® Studio application (formerly named SAS® Stat Studio). This figure shows the PFD for each major airport and for each year. The linked graphics enable you to select observations according to certain criteria. For example, you can select airports with a large PFD by selecting markers in the scatter plot in the upper-left plot, as shown in the figure. Or you can select certain airports or certain years or both and examine the PFD for those selected observations.

Interactively exploring Figure 8 revealed a dif-ference between “hub” and “non-hub” airports.Hub airports appear to give preferential treatmentto inbound flights. An example of this appearsin the scatter plot shown in Figure 9 where thepercentage of inbound flights that are delayed

9

Page 10: A Word from our 2009 Section Chairsstat-computing.org/newsletter/issues/scgn-20-2.pdfVolume 20, December 2009 A joint newsletter of the Statistical Computing & Statistical Graphics

(PIFD) is plotted against the percentage of outbound flights that are delayed (POFD) for all major airports and all years. The large blue markers correspond to flights delayed at Chicago O’Hare during 1997–2008. Note that for each year the PIFD is less than the POFD. (The diagonal line is the identity line.)

Figure 8: Relationships between Delays, Airports, and Years (1987–2008)

Philip Easterling, a former analyst with a major US airline, informed me after the Data Expo that this is deliberate: hub carriers tend to hold outbound flights when an inbound connection is late, so that a single late inbound flight can result in dozens of outbound delays. For this same reason, air traffic controllers give preferential treatment to landing inbound flights at hubs.

Figure 9: Differences between Hub and Non-Hub Airports (1987–2008)

In contrast, the non-hub airports tend to try to get outbound flights off the ground on time so that they can arrive at the hubs on time. For the specific example of Seattle, shown in Figure 9, Easterling explained that arriving flights are often delayed due to clouds and fog, but that the ground crews work hard to “turn around” these planes so that they depart on time.

Conclusions

The data set used for the 2009 Data Expo is exceedingly rich. It would be easy to fill up two posters with graphs of these data! This article briefly describes a few graphs that elicited the most comments from viewers of my poster. The two main approaches used in this article are (1) to reduce the dimensions of the data by plotting descriptive statistics aggregated over dates, years, airports, or carriers; and (2) to use design principles from cartographic research to aid the visualization of multivariate time series. These techniques enable us to create graphs that exhibit relationships among a dozen variables in a data set with 123 million observations.

The ASA promotes poster sessions as a way to increase the interaction between presenters at JSM and their colleagues. The 2009 Data Expo succeeded admirably in that task. If you enjoy making graphs and analyzing real-world data, I hope to see you at the next Data Expo.

Acknowledgements

The graphs in this paper were created by using SAS/IML Studio 3.2, except for Figure 7, which was created by using SAS/GRAPH® software. I thank Robert Allison for help in preparing the original poster and for introducing me to calendar plots. Bob Rodriguez provided helpful comments on the graphics. Peter Borysov helped to write programs in SAS/IML Studio for graphs in the poster that were originally created with SAS/GRAPH software. Linda Pickle reviewed a draft of this article and made helpful suggestions. Finally, I thank the management at SAS, particularly Joe Hutchinson and Radhika Kulkarni, for supporting my participation in the Data Expo.


Bibliography

[1] Brewer, Cynthia A. 2006. ColorBrewer. URL http://www.ColorBrewer.org.

[2] Carr, Daniel B., Wallin, John F., and Carr, D. Andrew. 2000. Two new tools for epidemiological applications: Linked micromap plots and conditioned choropleth maps. Statistics in Medicine, 19:2521–2538.

[3] Mintz, D., Fitz-Simons, T., and Wayland, M. 1997. Tracking air quality trends with SAS/GRAPH. In Proceedings of the Twenty-Second Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.

[4] Peng, Roger. 2008. A method for visualizing multivariate time series data. Journal of Statistical Software, 25(1):1–17. URL http://www.jstatsoft.org/v25/c01.

[5] Pickle, L. W., Mungiole, M., Jones, G. K., and White, A. A. 1996. Atlas of United States Mortality. National Center for Health Statistics, Hyattsville, MD. URL http://www.cdc.gov/nchs/products/other/atlas/atlas.htm.

[6] Wicklin, Rick, and Allison, Robert. 2009. Congestion in the sky: Visualizing domestic airline traffic with SAS software. Poster presentation, Joint Statistical Meetings. URL http://stat-computing.org/dataexpo/2009/posters/.

[7] Wikipedia. 2009. February 2007 North America winter storm. URL http://en.wikipedia.org/wiki/February_2007_North_America_Winter_Storm.

[8] Zdeb, Mike, and Allison, Robert. 2005. Stretching the bounds of SAS/GRAPH software. In Proceedings of the Thirtieth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

Rick Wicklin
SAS Institute
[email protected]


Hadoop for Statistical Analysis and Exploration
Byron Ellis

Data are available in explosively growing volumes from a variety of sources, particularly social behavior data, ranging from the movie-watching habits recorded for the Netflix Prize to the millions of posts published on Twitter every day. In more traditional areas of research, such as biology, the introduction of high-throughput methods has also increased the amount of available data significantly.

While large, relatively inflexible data warehousing solutions have long been available, the last five years have seen the development of new, inexpensive data analysis and storage platforms specifically designed to handle these enormous data sets. One such system is Hadoop, originally developed at Yahoo! and now maintained as an Apache project.

Originally part of the Nutch web indexer, Hadoop provides a distributed storage system and a map-reduce processing model based on descriptions of the Google File System (GFS) and the MapReduce processing platform published by Google in a series of papers. Using only commodity hardware and networking, it allows organizations to construct distributed, multi-terabyte storage facilities coupled to a processing platform structured to take advantage of the storage system’s characteristics.

The processing model is fairly strict in the Hadoop environment and heavily informed by the features of the storage model, so we begin with a brief discussion of the Hadoop architecture. From there we present examples of several use cases. The first case is the most common: data filtering and summarization for downstream analysis and presentation. In the second example, we begin to take the analysis of very large datasets into the Hadoop cluster itself. Finally, we consider how we might use Hadoop to visualize very large datasets with a simple plotting example.

Architecture and Processing Model

Hadoop is deployed on commodity grid computing environments: a number of individual compute/data nodes controlled by a “master” node with standard (gigabit) networking. Scaling the compute environment is easy: compute/data nodes may enter and leave the system at will, and none of the compute hardware is considered to be reliable. (At the time of writing the master node is considered to be reliable, but efforts are being made to provide redundancy for master nodes as well.) Clusters of upwards of 4,000 compute nodes have been documented.

Unlike most grid computing environments, Hadoop does not make use of a standard network filesystem such as NFS. Instead it provides its own implementation of a distributed and replicated filesystem. The storage component of the filesystem is hosted on a set of data nodes, which are usually the same set of nodes that are used for computation. Filesystem metadata (directory and “inode” management) is handled by a single “namenode” that usually runs on one of the grid computing environment's master nodes. The primary task of the namenode is to manage the mapping of files to blocks and the ordering of blocks within a file, as well as the mapping of blocks to the data nodes.

This storage architecture has several implications. First, operations on the directory structures are not parallel and will have performance somewhat worse than a local filesystem. Second, this filesystem will prefer a small number of large files over a large number of small files. This is reinforced by the typical block size of 64 MB. The advantage is that once a file has been located, reads can be done in a highly parallel fashion.

The Hadoop processing model should be familiar to any R programmer who has ever used tapply. Like the distributed file system, each compute node runs a local compute manager that is designed to be robust to failure (if one piece of a compute job dies, it is rescheduled on a different node), controlled by a management daemon on the master


node, equivalent to the data node daemon and the namenode daemon. In fact, these will generally be the same set of machines (except perhaps the job control node and the name node).

The processing model consists of two distinct phases, a “map” phase and a “reduce” phase. The map phase takes advantage of the performance characteristics of the distributed file system by first transporting the map executable to each compute node and then executing parallel reads against each of the blocks of the input data files (i.e., the code is moved to the data rather than the other way around). When the data nodes and the compute nodes are shared hardware, the map phases are arranged such that data is kept on the local disk as much as possible. As a result, a given map task may be given several different blocks, out of order, that happen to reside on the same machine. This lack of ordering of input records is an important feature of the parallelism: it is assumed in the model that a map operation could be performed in parallel for each input record simultaneously, with a complete lack of synchronization.

The map phase itself is usually quite simple. Given input tuples, often simply lines of text, the map phase produces a key-value pair (k, v), where both the key and the value can be arbitrarily complex data structures. The only restriction is that the keys must be comparable and should be sortable. The key is used to distribute key-value pairs to a specific “reducer” task. Once all of the key-value pairs have been generated, each reducer sorts its assigned keys and then “reduces” the values for each key in this sorted order (note that it is the keys, not the values, that are sorted). The partition, sort, and grouping functions may all be overridden to control this behavior, as we will see in the examples.

The reduce itself operates on an iterator that allows the task to pass over each value. Prior to the latest versions of Hadoop there was a restriction that the reducer was only allowed to pass over each value in the iterator exactly once. With version 0.20.0 it seems that this restriction has been lifted, so that iterators now support mark/reset functionality. There are no restrictions on what is emitted by a reducer, and it also has general access to the filesystem, so it can be used for its side effects, as we will see in the final example.

In terms of tapply, the map phase corresponds to the INDEX argument, while the reduce phase corresponds to the FUN argument. To put it another way, for those familiar with the SQL language, the map phase is somewhat like the GROUP BY clause, while the reduce phase is like the SELECT clause. Using some more advanced techniques it is even possible to implement various kinds of JOIN clauses, though that is beyond the scope of this article.
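To make the tapply analogy concrete, here is a minimal, non-distributed sketch in plain Java (illustrative code only, not Hadoop's API; the class and method names are invented) of the map/shuffle/reduce contract: map emits (k, v) pairs, the framework groups and sorts values by key, and reduce folds each group, just as tapply groups values by INDEX and applies FUN.

```java
import java.util.*;
import java.util.function.*;

public class MiniMapReduce {
    // Group values by the key that "map" assigns, then fold each group with
    // "reduce": the same contract as tapply(values, INDEX, FUN) in R.
    static <T, K, V, R> Map<K, R> run(List<T> records,
                                      Function<T, Map.Entry<K, V>> map,
                                      Function<List<V>, R> reduce) {
        // "Shuffle" phase: a sorted map stands in for Hadoop's sort-by-key.
        SortedMap<K, List<V>> groups = new TreeMap<>();
        for (T record : records) {
            Map.Entry<K, V> kv = map.apply(record);
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<K, R> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet())
            result.put(group.getKey(), reduce.apply(group.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // Count events per class, as in the data-wrangling example below.
        List<String> lines = Arrays.asList("3600\tlogin", "7200\tlogin", "3700\terror");
        Map<String, Long> counts = run(lines,
            line -> Map.entry(line.split("\t")[1], 1L),   // map: (class, 1)
            ones -> (long) ones.size());                  // reduce: sum of ones
        System.out.println(counts);  // {error=1, login=2}
    }
}
```

On a cluster the grouping and the per-key folds run on different machines; the contract between the two phases, however, is exactly this.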

Example: Data Wrangling

The most common use for map-reduce platforms is so-called “data wrangling.” These tasks are essentially reorganization and summarization of large raw datasets for further analysis or presentation. This use case is very similar to using a relational database system to manage the raw data and extracting components of interest for further analysis in a statistical software package.

To demonstrate this, we will consider a stream of event data where an event type and a timestamp are written out to a file. We make no assumptions that the file is sorted.

SECONDS  EVENTCLASS
SECONDS  EVENTCLASS

Given this input data, we would like to do several things. First, we would simply like to get a count of the number of events per hour for each class of events, so that we can do some simple analysis in R. While Hadoop jobs can be written in any language that can read from stdin and write to stdout via the Streaming interface, the best performance is to be had in Hadoop's native language, Java, which is what we will use for the examples. For this task the map fragment¹ is quite simple:

protected LongWritable one = new LongWritable(1);

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Input line: "SECONDS\tEVENTCLASS"; output key: "EVENTCLASS\tHOUR".
    String[] fields = value.toString().split("\t");
    Text outKey = new Text(fields[1] + "\t" + Long.parseLong(fields[0]) / 3600);
    context.write(outKey, one);
}

Our reducer is also quite simple. With Hadoop’s default configuration, the end result is a

¹ Full source code is available on the website.


set of tab-delimited text files (one for each reduce node) containing the data. To read the data into, say, R, one might call:

out = read.delim(pipe("hadoop dfs -cat count_results/part-r-*"), header = FALSE)

Alternatively, there is an R package, Rhipe, that interacts directly with Hadoop; it should give superior performance to the pipe method, as well as support for implementing Hadoop jobs in R directly. Upon reading the file, readers will note that event types are not grouped within the output files. This is because the default partition function uses a hash of the entire output key, which also contains the hour. To ensure that all data for a given event class appears in the same reducer (and, as a result, the same output file), we must override the partition function:

public int getPartition(Text key, LongWritable value, int numReducers) {
    // Hash only the event class (the first field of the key); mask the sign bit
    // so that a negative hashCode cannot yield a negative partition number.
    return (key.toString().split("\t")[0].hashCode() & Integer.MAX_VALUE) % numReducers;
}

This ensures that all keys with the same event class are sent to the same reducer, where they will all be sorted together. While this is convenient for reading, if the keys are not well distributed by the partitioner, performance can suffer, as some reducer nodes will be underutilized or require more data transfer.

Next, rather than grouping events by hour, we are interested in calculating statistics about the waiting time between events. To see how this works, first we consider our desired reduce step:

public void reduce(EventTimeWritable key, Iterable<LongWritable> times, Context context)
        throws IOException, InterruptedException {
    double sum_x = 0.0, sum_x2 = 0.0;
    double n = -1.0;
    long last = 0;
    for (LongWritable l : times) {
        if (n < 0) {
            last = l.get();
        } else {
            long diff = l.get() - last;
            sum_x += diff;
            sum_x2 += diff * diff;
            last = l.get();  // advance so diff is the gap between consecutive events
        }
        n += 1.0;
    }
    double mu = sum_x / n;
    double sd = Math.sqrt(sum_x2 / n - mu * mu);
    context.write(new Text(key.getEvent()), new Text(mu + "\t" + sd));
}

From our reducer we can see that we need to modify our mapper to emit the event type and timestamp as before, but the value should now be the time instead of a counter. Recall from the previous section that the keys are sorted before the reduce begins, but that the values within each key are not necessarily sorted. By including the value in the overall key we effect a sorting of the values; when combined with a grouping function that considers only the event type, we iterate over the timestamps for each event type in sorted order.

This pattern of employing a matched partition and grouping function, along with including a value to be sorted in the key, is common in Hadoop jobs where a well-defined iteration order leads to a calculation that can be performed with only a small amount of state.
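This "secondary sort" can be simulated off-cluster. The sketch below is plain Java, not Hadoop code (all names are invented for illustration): it sorts records by the composite key (event class, timestamp), groups by event class only, and streams each group through a single-pass mean waiting-time calculation, mirroring what the partition, sort, and grouping functions arrange on the cluster.

```java
import java.util.*;

public class SecondarySortDemo {
    // Local simulation of Hadoop's "secondary sort": sort by the composite key
    // (event class, timestamp), but group by event class only, so each group's
    // timestamps arrive in order and the mean gap is a single-pass calculation.
    static Map<String, Double> meanWait(List<String> lines) {
        List<String[]> recs = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t");            // "SECONDS\tEVENTCLASS"
            recs.add(new String[]{f[1], f[0]});
        }
        recs.sort(Comparator.comparing((String[] r) -> r[0])
                            .thenComparing(r -> Long.parseLong(r[1])));
        Map<String, Double> out = new LinkedHashMap<>();
        int i = 0;
        while (i < recs.size()) {
            String event = recs.get(i)[0];
            long last = Long.parseLong(recs.get(i)[1]);
            double sum = 0.0, n = 0.0;
            for (i++; i < recs.size() && recs.get(i)[0].equals(event); i++) {
                long t = Long.parseLong(recs.get(i)[1]);
                sum += t - last;
                last = t;
                n += 1.0;
            }
            out.put(event, sum / n);                   // NaN if only one event
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("30\tping", "10\tping", "60\tping",
                                           "5\tpong", "15\tpong");
        System.out.println(meanWait(lines));  // {ping=25.0, pong=10.0}
    }
}
```

The point of the pattern is visible here: because each group arrives in timestamp order, the reducer needs only one previous value of state, not the whole group in memory.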

Example: Expectation-Maximization in Hadoop

While Hadoop is often used to obtain sufficient statistics for further analysis, sometimes we want to work with the entire dataset. To demonstrate this we will consider the classic mixture-model EM problem. In this case we have observations $\vec x_1 = (x_1, y_1), \ldots, \vec x_n = (x_n, y_n)$ generated from $k$ bivariate normal distributions. Each observation $\vec x_i = (x_i, y_i)$ has a latent variable $z_i$ identifying the mixture,

$$X_i \mid Z_i = j \sim N_2(\mu_j, \Sigma_j), \qquad P(Z_i = j) = \tau_j, \qquad \sum_{j=1}^{k} \tau_j = 1.$$

Each iteration $t$ takes the form of a Hadoop job that takes in the current parameters $(\tau_1, \ldots, \tau_k, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k)^{(t)}$ as configuration and calculates the parameters for $t + 1$.

As in the previous section, we begin with the reduce phase, which must produce new values for all parameters at iteration $t + 1$, to determine our needs for the map phase, as well as any modifications that might be required to the grouping, sorting, and combination. Recalling the usual M-step,


we have the following equations:

$$\tau_j^{(t+1)} = \frac{\sum_{i=1}^{n} \bar z_{ij}}{n}$$

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} \bar z_{ij}\,\vec x_i}{\sum_{i=1}^{n} \bar z_{ij}}$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} \bar z_{ij}\,(\vec x_i - \mu_j^{(t+1)})(\vec x_i - \mu_j^{(t+1)})'}{\sum_{i=1}^{n} \bar z_{ij}}$$

Calculation of $\tau_j^{(t+1)}$ and $\mu_j^{(t+1)}$ is straightforward, requiring only $\bar z_{ij}$, the posterior probability that $\vec x_i$ was drawn from mixture $j$ given the current estimates, and the observation $\vec x_i$ for each mixture. This suggests that the mapper should emit each observation $k$ times, with $1, \ldots, k$ as keys, to allow each mixture to be calculated in parallel. We can also take advantage of the fact that the calculation of $\bar z_{ij}$ requires no data other than $\vec x_i$ to move that calculation to the mapper, increasing the parallelism of that calculation from $k$ to the total number of map tasks.

Since we can calculate $\tau_j^{(t+1)}$ and $(\mu_j, \Sigma_j)^{(t+1)}$ separately, we can emit a key for each, with values $\bar z_{ij}$ and $\vec x_i$, to calculate each of the parameters for the next iteration.

Example: Data Visualization

As a final, lighter, example we consider the problem of plotting the results of the mixture distribution in the last example. If the data set is too large to efficiently perform the EM of the last section, it is probably not feasible to draw, say, a scatterplot of the data.

To do this we will use the map-reduce framework for its side effects rather than the primary output. The basic idea is to partition the dataset into fixed-size tiles. If our tile size is, say, 1000 points by 1000 points, we can implement the mapper very easily:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] pos = value.toString().split("\t");
    double x = Double.parseDouble(pos[0]);
    double y = Double.parseDouble(pos[1]);
    // The key names the tile's bounding box, in units of the 1000-point tile size.
    Text outKey = new Text(Math.floor(x / 1000.0) + "\t" + Math.floor(y / 1000.0)
            + "\t" + Math.ceil(x / 1000.0) + "\t" + Math.ceil(y / 1000.0));
    context.write(outKey, value);
}

This maps each point into a specific bounding box that can easily be rendered by a reducer and written to an output file on the Hadoop cluster:

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Set up a Graphics surface for this tile.
    BufferedImage tile = new BufferedImage(1000, 1000, BufferedImage.TYPE_INT_ARGB);
    Graphics g = tile.getGraphics();
    for (Text value : values) {
        String[] pos = value.toString().split("\t");
        double x = Double.parseDouble(pos[0]);
        double y = Double.parseDouble(pos[1]);
        // drawOval takes (x, y, width, height): center a dot of diameter pointSize.
        g.drawOval((int) (x - pointSize / 2.0), (int) (y - pointSize / 2.0),
                pointSize, pointSize);
    }
    // Write the graphic to HDFS as a separate file and record the file name.
    context.write(key, new Text(outputFile));
}

Since the number of tiles is small relative to the original data, a single Java program can simply place the tiles onto a large canvas, using the position of the bounding box to determine where to paint each tile. In this case each tile is non-overlapping, but a graphics format that supports transparency would allow overlapping regions to be easily composed into a single enormous canvas.
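Compositing the tiles is then ordinary single-machine AWT work. A minimal sketch (illustrative names and a fixed square tile size, assumed to match whatever the mapper used) might look like:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class TileComposer {
    static final int TILE = 1000;  // tile size in pixels, matching the mapper's boxes

    // Paint each tile onto the canvas at the offset implied by its bounding-box key
    // (here reduced to integer tile coordinates).
    static BufferedImage compose(BufferedImage[] tiles, int[][] tileCoords,
                                 int tilesWide, int tilesHigh) {
        BufferedImage canvas = new BufferedImage(tilesWide * TILE, tilesHigh * TILE,
                                                 BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = canvas.createGraphics();
        for (int i = 0; i < tiles.length; i++)
            g.drawImage(tiles[i], tileCoords[i][0] * TILE, tileCoords[i][1] * TILE, null);
        g.dispose();
        return canvas;
    }

    public static void main(String[] args) {
        BufferedImage t = new BufferedImage(TILE, TILE, BufferedImage.TYPE_INT_ARGB);
        t.setRGB(0, 0, 0xFF000000);  // one opaque pixel to track
        BufferedImage canvas = compose(new BufferedImage[]{t},
                                       new int[][]{{1, 0}}, 2, 1);
        System.out.println(canvas.getRGB(TILE, 0) == 0xFF000000);  // true
    }
}
```

Using TYPE_INT_ARGB keeps an alpha channel, which is what makes composing partially transparent tiles onto one canvas straightforward.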

While the scatterplot is quite simple, it is not much more difficult to draw a graph (assuming straight-line edges between vertices) using the same methods. Coupled with a map-reduce implementation of a graph layout algorithm (stress-majorization techniques seem like good candidates), this should be a powerful way to visualize very large graph structures.

Conclusions and Final Notes

With a large segment of the Hadoop literature primarily focused on the management of Hadoop clusters, we hope that we have given a few examples of the hows and whys of large-data analysis. Furthermore, we hope that we have given readers a few reasons to add Hadoop, or other map-reduce platforms, to their large-scale data analysis toolbox. To help develop this toolbox a little bit more quickly, code for all the examples in this article is available at insert URL here. Alternatively, a different set of algorithms (primarily related to non-hierarchical clustering at the moment) is available from the Mahout project (http://lucene.apache.org/mahout).

It would be interesting to see the development of a general set of implementations for large-scale analysis techniques. For example, it should be possible to implement backpropagation algorithms for neural networks using a technique similar to the one used for the EM example. In general, model-fitting approaches that can be implemented in terms of gradient descent seem to be amenable to map-reduce approaches, and there already exist methods for finding the first eigenvector of an adjacency matrix, courtesy of map-reduce implementations of Google’s PageRank algorithm.

Finally, while Hadoop can be easily integrated into an existing grid computing environment, it can also be easily deployed to cloud computing environments such as Amazon’s EC2 for as-needed analysis, where the expense of a full grid computing environment does not make sense. There is even a company called Cloudera that provides images for Hadoop clusters on EC2, which makes getting started with Hadoop on EC2 very straightforward. This may not be cost effective for groups doing a lot of analysis on an ongoing basis, but for a specific data set or occasional usage it could be quite convenient.

Byron Ellis
Add
[email protected]


Statistical Graphics! — who needs Visual Analytics?
Martin Theus

Introduction

Talking to a broader audience at the useR! 2009 conference about visualization probably does not call for a presentation showing bleeding-edge research work. Such a presentation should rather give some orienting overview of why and how statisticians may use graphics successfully, and how the way statisticians use graphics contrasts with that of other disciplines.

A first question should be what we mean by "visualization," as this term, as well as the field, is used within many disciplines. There is a broad range of terms connected with visualization, reaching from "Visual Communication" - which "only" communicates qualitative, mostly relational, information - to "Visual Analytics," which is the latest buzzword. From a statistician’s point of view, visual analytics is exploratory statistical data analysis utilizing the most modern visualization and interaction techniques. As statisticians, we are concerned with visualizations which can generate insights based on data and their distributional properties - an aspect which is only slowly gaining a foothold in the field of InfoVis.

Statistical Graphics and Visual Analytics


Figure 1: The standard representation of a tree in R

It is not surprising that the statistician’s efforts in the area of statistical graphics get less attention than the field of Visual Analytics when we look at a simple example in Figures 1 and 2. Figure 1 shows the standard plot of a tree in R, and Figure 2 the so-called beam trees (9). Admittedly, this is a very extreme example, and it is easy to see that more flashy does not necessarily mean "better". Nonetheless, when it comes to "selling" your work, the R tree will probably fall behind.


Figure 2: A so-called “beam tree,” as defined by (9)

For statistical graphics we usually focus on visualizing (mostly statistical) properties, rather than "only" the data itself. Figure 3 shows the Detergent data (7) in a parallel sets plot (5), and Figure 4 a mosaic plot (4).


(The Detergent data record four categorical variables: Water Softness (soft, medium, hard), Temperature (low, high), M-user (whether the person used brand M before the study), and Preference (the brand, X or M, preferred after the test).)

Figure 3: The detergent data set in a parallel sets display


Figure 4: The detergent data set in a mosaic plot

The parallel sets plot hierarchically visualizes the multivariate structure of the data with no particular model in mind. For finding outliers or special structures this might work well. From a statistical point of view, however, we want to analyze associations, which might be conditioned upon variables or even combinations of variables. For a dataset with four categorical variables the mosaic plot is clearly the best choice. The plot immediately reveals that the more critical the washing conditions are, i.e., the harder the water and the higher the temperature, the more people tend to stick to their well-known detergent ‘M’ and avoid experiments with the new brand ‘X’.

As trained statisticians we are used to looking at graphics and, by doing so, “virtually” performing statistical tests and estimations graphically, without actually calculating the exact numbers. There are graphics, or additions to graphics, which can support this process directly. E.g., a boxplot is a summary statistic itself, and adding notches will even facilitate elementary inference within the graphic, cf. Figure 5.


Figure 5: A notched boxplot combines a graphical display with a basic statistical test.

How do we use Graphics?

To better understand how and why we use graphics, we may classify types of graphics according to their intended use. There are:

• Exploration Graphics
Exploration graphics aim at gaining insights. They are mainly used by a single researcher. As we look at very many graphics in short succession during the exploration process, refined scales and legends are not of interest, but a highly interactive set-up which allows a


speedy modification of the graphics is important.

• Presentation Graphics
As presentation graphics intend to show interpreted results at the end of an exploration process, this kind of graphics makes extensive use of scales and legends. They are usually presented to a broader audience, and often very much information must be transported in one single graphic, a common problem well known to cartographers.

• Interactive Information Graphics
Somewhat in between exploration and presentation graphics (which also cover conventional static info graphics known from newspapers), we find interactive info graphics on the web. There are some degrees of freedom which allow changing groups, like states, etc., and further information may be queried from the graphics; the basic visualization, though, is predefined and cannot be changed.

• Diagnostics
For most statistical models and procedures, diagnostic plots are offered to assess the quality of the procedure. A drawback of such diagnostic plots is that they are often not very efficient, as they are based on standard plot types and thus do not visualize the properties of the model directly. Furthermore, these plots often make it hard to relate directly back to the raw data or the model/procedure properties, which would allow for an efficient improvement of the model.

Statistics has its greatest impact on exploration graphics and diagnostic graphics, although graphics for presentation can benefit from statistical thinking as well.

Whereas research on improved diagnostic plots may or may not lead to better graphical tools which can improve the building of efficient models, an interactive implementation of diagnostic plots may directly improve the ability to understand the influence of certain cases and/or variables on the model’s properties.

[Slide for Figure 6, “The Use of Graphics: Part I”, contrasting Exploration (aims at gaining insights; mainly personal use; few scales and legends; highly interactive and little persistent) with Presentation (presents interpreted results; tailored to fit a broad audience; extensive scales, grids and legends; static print without interactions — though there are interactive infographics by now). The slide reproduces Figure 0.1 of Theus and Urbanek (7): “The relation between the number of observers and the number of graphics in use is inverse for presentation graphics and graphics for exploration.”]

Figure 6: Exploration graphics and presentation graphics can best be distinguished by the number of graphics used and by the size of their audiences.

Figure 6 depicts the opposite relation between the number of graphics and the size of the audience for exploration graphics and presentation graphics — a depiction borrowed from Antony Unwin.


Construction Principles for Statistical Graphics

Although there is actually not much left to invent regarding standard statistical graphics, it is sometimes desirable to create a visualization which is specialized for a particular dataset. The general problem which must be solved by the design of any visualization is depicted in Figure 7.

Starting with a dataset, we code this data into a graph — nowadays usually with the help of a computer. Even if this coding is completely lossless, it still depends on the decoding process whether or not the visualization can successfully depict the phenomenon of interest. In many cases, the exact decoding of particular values is less desirable than the overall qualitative message which might be hidden in a dataset.

[Slide for Figure 7, “Processing Pipeline”: Data → Code → Graph → Decode. Perceptual aspects are central for the correct design and interpretation of graphics; as perception is subjective, we need to make it as unambiguous as possible; reusing common building blocks eases the decoding of graphics.]

Figure 7: The decoding of the graphics content is the most crucial step in the pipeline.

Cleveland’s (3) fundamental work on perceptual issues gives good advice on which graphical elements and which use of scales may be most successful. We cannot cover these principles here in depth, but even with these construction principles at hand, a successful graph design might still be hard to achieve. One of the most famous graphic designers, Milton Glaser, once put it this way in a BBC documentary on the well-known London Tube map (1); he said:

. . . All design basically is a strange combination of the intelligence and the intuition, where the intelligence only takes you so far and then your intuition has to reconcile some of the logic in some peculiar way. . . .


[Slide for Figure 8: Bertin’s (1967) proposal for displaying multidimensional discrete data, after Riedwyl and Schuepbach (1983). The data here: Titanic passengers, by class, age, gender, and survival. Can you tell “the story” — from the graph, not from the movie?]

Figure 8: Bertin’s proposal for visualizing multivariate categorical data

A good support for our intuition is to stick to certain graphical building blocks:

1. Points, put along a (common) scale, are usually the best way to depict continuous data.

2. Areas (mostly rectangles) should refer to counts of data which are grouped into certain categories.

3. Lines (as in profiles) should connect data values which all belong to one entity.
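These three building blocks can be sketched with base R graphics (the datasets and plot choices here are assumptions for illustration only, not from the original):

```r
# 1. Points along a common scale for continuous data
stripchart(faithful$eruptions, method = "jitter")

# 2. Areas (rectangles) for counts of data grouped into categories
barplot(table(cut(faithful$eruptions, breaks = 3)))

# 3. Lines connecting the values that belong to one entity (profiles)
matplot(t(iris[1:5, 1:4]), type = "l", ylab = "measurement")
```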

Figure 8 is a good example of what happens when different building blocks are mixed in one graphic, although the underlying data was all measured on the same scale. The example is taken from (2). The original graphic showed accident victims; the data shown in Figure 8 is the Titanic survivors data — try to tell the story, only from the graph!

How does R support Statistical Graphics?

There is no doubt that R was never meant to be a graphics package, or a statistical software tool which has its particular strength in creating graphics. Nonetheless, R’s structured language and package system allow for an easy extension of existing graphics as well as the creation of a completely new graphics system. E.g., to add a density estimator to an existing histogram, it only takes one line of code — something which is usually far harder to achieve in most other statistical packages.

> hist(faithful$eruptions, freq = FALSE)
> lines(density(faithful$eruptions))

There are three categories where R gets involved in the creation of graphics. First, there are the general-purpose packages: graphics, lattice based on grid, ggplot2 (10), and iplots (6) (with its successor iplots eXtreme (8)). Other packages deal with specific topics, like the party package, which offers advanced plotting methods for trees, or the vcd package, which has many graphic types for categorical data. The third category where R supports graphics can be found in graphics packages which connect to R: ggobi has a connection to R, Klimt offers advanced interactive graphics for tree models which are generated in R, and Mondrian enhances graphics with statistics which are calculated within R.

Whereas the flexibility to enhance or even create new plots has great potential, the growing number of graphics packages seems to lead to a certain lack of clarity about under which circumstances it is advisable to use which plot and/or graphics package.


Figure 9: The standard histogram as defined within the base package

The great variety of graphics packages in R has a drawback though. The simple question “How do I draw a histogram?” can (to my knowledge) be answered in five different ways in R right now. Figure 9 shows the standard histogram within the base package for the Old Faithful data, created by the command hist(faithful$eruptions).


Figure 10: The “truehist” function within the MASS package

Figure 10 shows the same data in a histogram generated by the function “truehist” (truehist(faithful$eruptions)), which can be found in the MASS package. Interestingly, the bar heights differ, although the breaks used in the two histograms in Figures 9 and 10 seem to be exactly the same.
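One likely explanation (my reading, not spelled out in the original): hist() plots frequencies by default, while truehist() defaults to a probability-density scale, so the bar heights need not match even with identical breaks. Forcing both onto the density scale makes them comparable:

```r
library(MASS)  # provides truehist()

# base hist() on the density scale, like truehist()'s default
hist(faithful$eruptions, freq = FALSE)

# truehist() with explicit density scaling (prob = TRUE is the default)
truehist(faithful$eruptions, prob = TRUE)
```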


Figure 11: The histogram function from the lattice package

The function histogram within the lattice package generates the histogram in Figure 11 when typing histogram(faithful$eruptions). Apart from the less appealing cyan — which was also used by the truehist function — the extensive scales around all four sides of the histogram effectively reduce the area used to plot the actual data.


Figure 12: The default histogram within the ggplot2 package

The histogram in Figure 12 was generated with the ggplot2 package. Contrary to the functions in the previous figures, ggplot2 uses a generic plot function, as can be seen from the command:

> qplot(eruptions, data = faithful,
+       geom = "histogram")


Figure 13: The interactive histogram from the iplots package

Finally, in Figure 13 we find the interactive version of a histogram, created with ihist(faithful$eruptions).

This “parade” of histograms is not meant to criticize any one of the particular implementations, or R’s flexibility in adding new functionality. It is rather meant to raise the question of whether or not we need some kind of consolidation of functionality within R. Depending on when you got to know R, and who introduced you to it, your answer to “How do I draw a histogram?” may be very different. This inconsistency can be an obstacle, and not only to those who are just starting to get to know R.

Obviously this problem is not restricted to graphics, but can also be found in the context of standard model fitting and other statistical procedures.

Conclusion

At least in the closing of this paper I need to get back to the question in the title.

Statistical graphics — especially interactive graphics for exploratory data analysis — are well developed, and the wealth of graphic types guarantees that almost any dataset can be explored with these graphics. The particular power of statistical graphics lies in their design, which is usually targeted towards visualizing statistical properties rather than “just” showing the data itself. I.e., the visualization needs to focus on the potential gain of insight and inference. In this respect, statistical graphics cannot learn very much from the recent developments in visual analytics. What definitely needs to be learned from this community are the technical skills which enable the creation of interactive visualizations that attract the potential audience.

iplots eXtreme can be the perfect toolbox to achieve this within a well-known and easy-to-handle statistical environment, namely R.

Bibliography

[1] BBC. London underground tube map video documentary. http://infosthetics.com/archives/2008/11/the_london_underground_map_tv_documentary.html

[2] J. Bertin. Semiology of Graphics. University of Wisconsin Press, Madison, 2nd edition, 1983.

[3] W. S. Cleveland. The Elements of Graphing Data. Wadsworth, Monterey, CA, 1985.

[4] H. Hofmann. Mosaic plots and their variants. In C.-h. Chen, W. Härdle, and A. Unwin, editors, Handbook of Data Visualization (Springer Handbooks of Computational Statistics), chapter III.13, pages 617–642. Springer-Verlag TELOS, Santa Clara, CA, USA, 2008.

[5] R. Kosara, F. Bendix, and H. Hauser. Parallel sets: Interactive exploration and visual analysis of categorical data. IEEE Trans. Vis. Comput. Graph., 12(4):558–568, 2006.

[6] M. Theus and S. Urbanek. iplots: Interactive Graphics for R. Statistical Computing and Graphics Newsletter, 15(1), 2004.

[7] M. Theus and S. Urbanek. Interactive Graphics for Data Analysis: Principles and Examples (Computer Science and Data Analysis). Chapman & Hall/CRC, 2008.

[8] S. Urbanek. iplots eXtreme — Next-generation interactive graphics for analysis of large data. http://www2.agrocampus-ouest.fr/math/useR-2009/slides/Urbanek.pdf

[9] F. van Ham and J. J. van Wijk. Beamtrees: Compact visualization of large hierarchies. In INFOVIS ’02: Proceedings of the IEEE Symposium on Information Visualization (InfoVis ’02), page 93, Washington, DC, USA, 2002. IEEE Computer Society.

[10] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Use R! Springer, New York, 2009.

Martin Theus
Telefónica O2 Germany
University of Augsburg, Department of Computer-oriented Statistics and Data Analysis
[email protected]


Reader Response

Comments on “Taking it to higher dimensions”

(3) provides a valuable service by reminding readers how to read pseudo-three-dimensional bar charts, since so many people misread these charts and are misled by them. He points out that the tops of the bars in Figure 1 are below the horizontal lines that represent their values. Therefore, the reader needs to project the bars onto the back wall of the chart that contains the grid lines, as shown in his Figure 2. This note provides additional observations on these charts. Although the gap shown in Figure 1 is common, not all pseudo-three-dimensional bar charts include this gap (5). Some software, such as PowerPoint by default, draws the figure so that the gap is zero and the back of the bar touches the horizontal line, as in Figure 2. Although Excel and PowerPoint come in the same suite of programs from the same company, they use different defaults for the gap. With other software it is the top of the front panel of the bar that touches the line. I find it unacceptable that the way to read a graph depends on the software used to produce it; this makes a strong argument, in addition to the gap argument of Krause, for not adding an unnecessary pseudo-third dimension. In Excel, the gap is an option which can be changed; (4) points this out in his blog. Click on the series, go to Format Data Series, then Options, and you will see Gap Depth. Figure 2 modifies Figure 1 so that the gap depth is zero. These steps apply to both Excel 2003 and Excel 2007.

Figure 1: Three-dimensional bar chart modeled after Figure 1 of [2]

Krause further points out that the problem of the gap is magnified when perspective is added to the plot, and that this was noted by (1). Haemer discusses perspective in the 1947 paper, but his examples use two-dimensional bar charts. He discusses pseudo-three-dimensional figures in a 1951 paper (2).

Figure 2: Three-dimensional bar graph with a gap of zero.

In summary, not all spreadsheet three-dimensional graphs include the gap, and in Excel the gap is an option that can be changed.

Bibliography

[1] Haemer, Kenneth W. (1947) The perils of perspective. The American Statistician 1(3):19

[2] Haemer, Kenneth W. (1951) The pseudo third dimension. The American Statistician 5(4):28

[3] Krause, Andreas (2009) Taking it to higher dimensions. Statistical Computing and Graphics Newsletter 20(1), June 2009

[4] Peltier, Jon (2009) PTS blog, http://peltiertech.com/WordPress/does-excel-suck/#comments

[5] Robbins, Naomi B. (2005) Creating More Effective Graphs. Hoboken: Wiley.

Naomi B. Robbins
NBR
http://www.nbr-graphs.com

[email protected]


Announcements

The Statistical Computing and Graphics Award

The ASA Sections of Statistical Computing and Statistical Graphics have established the Statistical Computing and Graphics Award to recognize an individual or team for innovation in computing, software, or graphics that has had a great impact on statistical practice or research. Typically, awards are granted bi-annually.

The prize carries with it a cash award of $5,000 plus an allowance of up to $1,000 for travel to the annual Joint Statistical Meetings (JSM), where the award will be presented.

Qualifications The prize-winning contribution will have had significant and lasting impact on statistical computing, software or graphics.

The Awards Committee depends on the American Statistical Association membership to submit nominations. Committee members will review the nominations and make the final determination of who, if anyone, should receive the award. The award may not be given to a sitting member of the Awards Committee or a sitting member of the Executive Committee of the Section on Statistical Computing or the Section on Statistical Graphics.

Nomination and Award Dates Nominations are due by December 15 of the award year. The award is presented at the Joint Statistical Meetings in August of the same year. The first award will be given in 2010, and subsequent awards are to be made at most bi-annually, at the discretion of the Awards Committee.

Nominations should be submitted as a complete packet, consisting of:

* a nomination letter, no longer than four pages, addressing points in the selection criteria
* the nominee’s curriculum vita(s)
* a maximum of four supporting letters, each no longer than two pages

Selection Process The Awards Committee will consist of the Chairs and Past Chairs of the Sections on Statistical Computing and Statistical Graphics.

The selection process will be handled by the Awards Chair of the Statistical Computing Section. Nominations and questions are to be sent to the e-mail address below.

Fei Chen
Avaya Labs
233 Mt Airy Rd
Basking Ridge, NJ 07920
[email protected]


Section Officers

Statistical Computing Section Officers 2009

Jose C. Pinheiro, Chair
[email protected]
(862) 778-8879

Deborah A. Nolan
[email protected]
()643-7097

Luke Tierney
[email protected]
(319) 335-3386

Elizabeth Slate, Secretary/Treasurer
[email protected]
(843) 876-1133

Montserrat Fuentes, COMP/COS
[email protected]
(919) 515-1921

Robert E. McCulloch, Program
[email protected]
(512) 471-9672

Thomas Lumley, Program
[email protected]
(206) 543-1044

Barbara A Bailey, Publications
[email protected]
(619) 594-4170

Jane Lea Harvill, Computing Section
[email protected]
(254) 710-1517

Donna F. Stroup (see right)
Monica D. Clark (see right)

Nicholas Lewin-Koh, Newsletter Editor
[email protected]

Statistical Graphics Section Officers 2009

Antony Unwin, Chair
[email protected]
+49-821-598-2218

Daniel J. Rope
[email protected]
(703) 740-2462

Simon Urbanek
[email protected]
(973) 360-7056

Rick Wicklin, Secretary/Treasurer
[email protected]
(919) 531-6629

Peter Craigmile, COS Rep 09-11
[email protected]
(614) 688-3634

Linda W. Pickle, COMP/COS Representative 08-10
[email protected]
(301) 402-9344

Steven MacEachern, Program
[email protected]
(614) 292-5843

Heike Hofmann, Program
[email protected]
(515) 294-8948

Donna F. Stroup, Council of Sections
[email protected]
(404) 218-0841

Monica D. Clark, ASA Staff
[email protected]
(703) 684-1221

Andreas Krause, Newsletter Editor
[email protected]


The Statistical Computing & Statistical Graphics Newsletter is a publication of the Statistical Computing and Statistical Graphics Sections of the ASA. All communications regarding the publication should be addressed to:

Nicholas Lewin-Koh
Editor, Statistical Computing
[email protected]

Andreas Krause
Editor, Statistical Graphics
[email protected]

All communications regarding ASA membership and the Statistical Computing and Statistical Graphics Sections, including change of address, should be sent to:

American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
TEL (703) 684-1221
FAX (703) [email protected]
