
DSC 201: Data Analysis & Visualization

Aggregation

Dr. David Koop

D. Koop, DSC 201, Fall 2016

Selection & Highlighting
• Selection: a user action (mouse, keyboard) on items, links, etc.
• Selection types: single vs. multiple, contiguous vs. non-contiguous
• Feedback is important!
• How? Change the selected item's visual encoding
  - Change color: want to achieve visual popout
  - Add outline mark: allows original color to be preserved
  - Change size (line width)
  - Add motion: marching ants


[Figure: screenshot of the dc.js example page for the Nasdaq 100 Index, 1985/11/01-2012/06/29. dc.js is a JavaScript charting library with native crossfilter support that uses d3 to render charts as SVG. Linked panels: Yearly Performance (radius: fluctuation/index ratio, color: gain/loss), Days by Gain/Loss, Quarters, Day of Week, Days by Fluctuation (%), Monthly Index Abs Move & Volume/500,000, and a table of daily records (Date, Open, Close, Change, Volume). Clicking on a chart applies filters; selecting a time range zooms in.]

Juxtaposition and Coordinated Views


[http://dc-js.github.io/dc.js/]


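[Figure: design space of multiple views, organized by how much data the views share (All, Subset, None) and whether the encoding is the Same or Multiform. Same encoding: Redundant (all), Overview/Detail (subset), Small Multiples (none). Multiform encoding: Multiform (all), Multiform Overview/Detail (subset), No Linkage (none).]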

Multiple Views


[Munzner (ill. Maguire), 2014]

[Figure: multi-series line chart of Temperature (ºF), 20 to 80, by month from October through September, with one line each for New York, San Francisco, and Austin]

Superimposition


[M. Bostock, http://bl.ocks.org/mbostock/3884955]


Flattening the Sphere?


[USGS Map Projections]


Choropleth Map: What are Marks and Channels?


[M. Ericson, New York Times]

What are the Marks and Channels?


[M. Ericson, New York Times]

House Races: More Geographic Data?


[New York Times, 2010]

House Races: Maps Aren't Always Best


[NYTimes]

http://elections.nytimes.com/2010/results/house/big-board

Assignment 4
• http://www.cis.umassd.edu/~dkoop/dsc201/assignment4.html
• Visualization using Tableau
• Year as a Dimension

Aggregation: Histograms
• Very similar to bar charts
• Often shown without space between (continuity)
• Choice of number of bins
  - Important!
  - Viewers may infer different trends based on the layout

[Figure: histogram with counts from 0 to 20 on the vertical axis and Weight Class (lbs) on the horizontal axis]


[Munzner (ill. Maguire), 2014]
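To see why the choice of bins matters, here is a minimal sketch that bins the same values at two widths. The `histogram` helper and the `weights` list are hypothetical, invented for illustration; they are not taken from the slide's figure.

```python
from collections import Counter


def histogram(values, bin_width, start=0):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w)."""
    counts = Counter((v - start) // bin_width for v in values)
    n_bins = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_bins)]


# Hypothetical weights in lbs (assumed data, for illustration only).
weights = [3, 4, 6, 7, 8, 9, 11, 12, 12, 13, 14, 16, 18, 19, 22, 24]

coarse = histogram(weights, bin_width=10)  # bins 0-9, 10-19, 20-29
fine = histogram(weights, bin_width=5)     # bins 0-4, 5-9, ..., 20-24

print(coarse)
print(fine)
```

The coarse layout suggests a single peak in the middle bin, while the finer layout shows how the counts rise and fall within that range, so the two pictures can lead viewers to different conclusions about the same data.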

Boxplots
• Show distribution
• Single value (e.g. mean, max, min, quartiles) doesn't convey everything
• Created by John Tukey who grew up in New Bedford!
• Show spread and skew of data
• Best for unimodal data
• Variations like vase plot for multimodal data
• Aggregation here involves many different marks


[Flowing Data]

http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/
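A boxplot aggregates a distribution into a handful of statistics, each drawn with its own mark. The sketch below computes them with Python's standard library. The data values are made up, and the 1.5 × IQR fence used to flag outliers is the common Tukey convention, assumed here rather than stated on the slide.

```python
import statistics

# Hypothetical sample (assumed data, for illustration only).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# Quartiles: the box spans q1..q3, with a line at the median.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers extend to the most extreme points inside the fences.
whisker_low = min(x for x in data if x >= low_fence)
whisker_high = max(x for x in data if x <= high_fence)

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

The box, the median line, the two whiskers, and the outlier points are separate marks, which is what the last bullet means by aggregation involving many different marks.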

Data Analysis Scenarios
• Often want to analyze data by some grouping:
  - Dogs vs. cats
  - Millennials vs. Gen-X vs. Baby Boomers
  - Physics vs. Chemistry
• Compute statistics based on those groupings
  - max, min, median
• Perform your own type of transformation (top-k, spread, etc.)
• Create visualizations for each group


Split-Apply-Combine
• Coined by H. Wickham, 2011
• Similar to Map (split+apply) Reduce (combine) paradigm
• The Pattern:
  1. Split the data by some grouping variable
  2. Apply some function to each group independently
  3. Combine the data into some output dataset
• The apply step is usually one of:
  - Aggregate
  - Transform
  - Filter


[T. Brandt]

http://nbviewer.jupyter.org/format/slides/github/snth/split-apply-combine/blob/master/The%20Split-Apply-Combine%20Pattern%20in%20Data%20Science%20and%20Python.ipynb#/
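The three steps of the pattern can be written out by hand before reaching for a library. This is a minimal sketch in plain Python; the `records` list (species and weight) is hypothetical, and the apply step here is an aggregation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (species, weight) records, assumed for illustration.
records = [("dog", 30), ("cat", 9), ("dog", 55), ("cat", 11), ("dog", 42)]

# 1. Split the data by the grouping variable (species).
groups = defaultdict(list)
for species, weight in records:
    groups[species].append(weight)

# 2. Apply a function to each group independently, and
# 3. Combine the per-group results into one output structure.
summary = {
    species: {"min": min(w), "median": median(w), "max": max(w)}
    for species, w in groups.items()
}

print(summary)
```

Swapping the aggregation for a function that returns one value per input row would give a transform, and one that keeps or drops whole groups would give a filter.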

Aggregation of time series data, a special use case of groupby, is referred to as resampling in this book and will receive separate treatment in Chapter 10.

GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for talking about group operations, and I think that’s a good description of the process. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 9-1 for a mockup of a simple group aggregation.

Figure 9-1. Illustration of a group aggregation

Each grouping key can take many forms, and the keys do not have to be all of the same type:

• A list or array of values that is the same length as the axis being grouped

• A value indicating a column name in a DataFrame

(From Chapter 9: Data Aggregation and Group Operations)

Split-Apply-Combine


[W. McKinney, Python for Data Analysis]

Journal of Statistical Software

Original data frame:

name     age  sex
John     13   Male
Mary     15   Female
Alice    14   Female
Peter    13   Male
Roger    14   Male
Phyllis  13   Female

Split by .(sex):

name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(age):

name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female

Figure 4: Two examples of splitting up a data frame by variables. If the data frame was split up by both sex and age, there would only be one subset with more than one row: 13-year-old males.

Output   Processing function restrictions   Null output
*aply    atomic array, or list              vector()
*dply    data frame, or atomic vector       data.frame()
*lply    none                               list()
*_ply    none                               -

Table 3: Summary of processing function restrictions and null output values for all output types. Explained in more detail in each output section.

3.2. Output

The output type defines how the pieces will be joined back together and how they will be labelled. The labels are particularly important as they allow matching up of input and output.

The input and output types are the same, except there is an additional output data type, _, which discards the output. This is useful for functions like plot() and write.table() that are called only for their side effects, not their return value.

The output type also places some restrictions on what type of results the processing function should return. Generally, the processing function should return the same type of data as the eventual output (i.e., vectors, matrices and arrays for *aply and data frames for *dply), but some other formats are accepted for convenience and are described in Table 3. These are explained in more detail in the individual output type sections.

Output: Array (*aply)

With array output the shape of the output array is determined by the input splits and the dimensionality of each individual result. Figures 5 and 6 illustrate this pictorially for simple

Splitting by Variables


[H. Wickham, 2011]
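The same split can be sketched in pandas. The six rows below are the example data frame from Figure 4; the variable names (`people`, `males`, `age_13`) are chosen here for illustration.

```python
import pandas as pd

# The example data frame from Figure 4.
people = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Peter", "Roger", "Phyllis"],
    "age": [13, 15, 14, 13, 14, 13],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Split by one variable and pull out a single piece of the split.
males = people.groupby("sex").get_group("Male")
age_13 = people.groupby("age").get_group(13)

print(males)
print(age_13)
```

Each piece is itself a DataFrame that keeps the original row labels, so it is still possible to tell where every row came from.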

The Split-Apply-Combine Strategy for Data Analysis

.(sex):

sex     value
Male    3
Female  3

.(age):

age  value
13   3
14   2
15   1

.(sex, age):

sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1

Figure 7: Illustrating the output from using ddply() on the example from Figure 4 with nrow(). Splitting variables shown above each example. Note how the extra labeling columns are added so that you can identify to which subset the results apply.

to further process the list the labels will appear as if you had used aaply, adply, daply or ddply directly. llply is convenient for calculating complex objects once (e.g., models), from which you later extract pieces of interest into arrays and data frames.

There are no restrictions on the output of the processing function. If there are no results, *lply will return a list of length 0.

Output: Discarded (*_ply)

Sometimes it is convenient to operate on a list purely for the side effects, e.g., plots, caching, and output to screen/file. In this case *_ply is a little more efficient than abandoning the output of *lply because it does not store the intermediate results.

The *_ply functions have one additional argument, .print, which controls whether or not each result should be printed. This is useful when working with lattice (Sarkar 2008) or ggplot2 (Wickham 2010) graphics.

4. Helpers

The plyr package also provides a number of helper functions which take a function (or functions) as input and return a new function as output.

splat() converts a function that takes multiple arguments to one that takes a list as its single argument. This is useful when you want a function to operate on a data frame, without manually pulling it apart. In this case, the column names of the data frame will match the argument names of the function. For example, compare the following two ddply calls, one with, and one without splat:

R> hp_per_cyl <- function(hp, cyl, ...) hp / cyl
R> splat(hp_per_cyl)(mtcars[1,])
R> splat(hp_per_cyl)(mtcars)
R> ddply(mtcars, .(round(wt)),
+    function(df) mean_hp_per_cyl(df$hp, df$cyl))
R> ddply(mtcars, .(round(wt)), splat(mean_hp_per_cyl))

Apply+Combine: Counting


[H. Wickham, 2011]
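A rough pandas counterpart to the ddply() counting in Figure 7 is sketched below, again using the six-row example data frame from Figure 4. The result column is named `value` only to mirror the figure; that name is a choice made here.

```python
import pandas as pd

# The example data frame from Figure 4.
people = pd.DataFrame({
    "name": ["John", "Mary", "Alice", "Peter", "Roger", "Phyllis"],
    "age": [13, 15, 14, 13, 14, 13],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Apply: count the rows in each group. Combine: one labeled result.
n_by_sex = people.groupby("sex").size()
n_by_age = people.groupby("age").size()
pair_counts = people.groupby(["sex", "age"]).size()

# Turn the group labels back into ordinary columns, as ddply() does.
table = pair_counts.reset_index(name="value")
print(table)
```

The labeling columns (`sex`, `age`) sit next to each count, which is what makes it possible to match every result back to its subset.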

In Pandas
• groupby method creates a GroupBy object
• groupby doesn't actually compute anything until there is an aggregation or we wish to examine the groups
• Choose keys (columns) to group by
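A small sketch of these points follows. The DataFrame and its column names (`key1`, `key2`, `data1`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data (assumed for illustration).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# groupby returns a GroupBy object; no statistics are computed yet.
grouped = df.groupby("key1")
print(type(grouped).__name__)

# Examining the groups: which row labels belong to each key.
rows_for_a = list(grouped.groups["a"])

# An aggregation triggers the actual computation.
means = grouped["data1"].mean()

# Several columns can serve as keys at once.
by_two_keys = df.groupby(["key1", "key2"])
print(by_two_keys.ngroups)
```

Because the GroupBy object is only a description of the split, it can be reused for several different aggregations without repeating the grouping call.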


Aggregation
• Operations:
  - size()
  - mean()
  - sum()
• May also wish to aggregate only certain subsets
  - Use square brackets with column names
• Can also write your own functions for aggregation and pass them to the agg function
  - def peak_to_peak(arr):
        return arr.max() - arr.min()
    grouped.agg(peak_to_peak)
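Put together, the operations on this slide might look like the sketch below. The DataFrame contents and column names are hypothetical; `peak_to_peak` is the function from the slide.

```python
import pandas as pd

# Hypothetical data (assumed for illustration).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 4.0, 2.0, 8.0, 7.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})
grouped = df.groupby("key1")

# Built-in aggregations.
sizes = grouped.size()

# Square brackets restrict the aggregation to selected columns.
data1_means = grouped["data1"].mean()


# A custom aggregation: applied to each column of each group.
def peak_to_peak(arr):
    return arr.max() - arr.min()


ranges = grouped.agg(peak_to_peak)
print(ranges)
```

Passing the function to `agg` yields one row per group and one column per non-key column, the same shape a built-in method such as `sum()` would produce.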


You’ll notice that some methods like describe also work, even though they are not aggregations, strictly speaking:

In [206]: grouped.describe()
Out[206]:
               data1     data2
key1
a    count  3.000000  3.000000
     mean   0.746672  0.910916
     std    1.109736  0.712217
     min   -0.204708  0.092908
     25%    0.137118  0.669671
...              ...       ...
b    min   -0.555730  0.281746
     25%   -0.546657  0.403565
     50%   -0.537585  0.525384
     75%   -0.528512  0.647203
     max   -0.519439  0.769023
[16 rows x 2 columns]

I will explain in more detail what has happened here in the next major section on group-wise operations and transformations.

You may notice that custom aggregation functions are much slower than the optimized functions found in Table 9-1. This is because there is significant overhead (function calls, data rearrangement) in constructing the intermediate group data chunks.

Table 9-1. Optimized groupby methods

Function name   Description
count           Number of non-NA values in the group
sum             Sum of non-NA values
mean            Mean of non-NA values
median          Arithmetic median of non-NA values
std, var        Unbiased (n - 1 denominator) standard deviation and variance
min, max        Minimum and maximum of non-NA values
prod            Product of non-NA values
first, last     First and last non-NA values

To illustrate some more advanced aggregation features, I’ll use a less trivial dataset, a dataset on restaurant tipping. I obtained it from the R reshape2 package; it was originally found in Bryant & Smith’s 1995 text on business statistics (and found in the book’s GitHub repository). After loading it with read_csv, I add a tipping percentage column tip_pct.

Optimized groupby methods


[W. McKinney, Python for Data Analysis]
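Most entries in the table are described in terms of non-NA values. The sketch below, on made-up data with one missing value, shows how that differs from `size()`, and that a built-in method and an equivalent custom function agree on the result even though the custom route is the slower one.

```python
import pandas as pd

# Hypothetical data with one missing value (assumed for illustration).
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, None, 3.0, 5.0],
})
g = df.groupby("key")["x"]

counts = g.count()   # number of non-NA values per group
sizes = g.size()     # number of rows per group, NA included

builtin_sum = g.sum()                  # optimized method, skips NA
custom_sum = g.agg(lambda s: s.sum())  # same answer via a Python function

print(counts, sizes, builtin_sum, sep="\n")
```

When an optimized method exists for the statistic you need, preferring it over a hand-written function avoids the per-group overhead the excerpt describes.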

Iterating over groups
• for name, group in df.groupby('key1'):
      print(name)
      print(group)
• Can also .describe() groups
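A runnable version of the loop is sketched below, extended to two keys. The DataFrame is hypothetical. With a list of keys, each group name is a tuple of key values.

```python
import pandas as pd

# Hypothetical data (assumed for illustration).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# One key: each iteration yields the key value and that group's rows.
group_sizes = {}
for name, group in df.groupby("key1"):
    group_sizes[name] = len(group)

# Two keys: the name is a (key1, key2) tuple.
pair_sums = {}
for (k1, k2), group in df.groupby(["key1", "key2"]):
    pair_sums[(k1, k2)] = group["data1"].sum()

print(group_sizes)
print(pair_sums)
```

Iterating this way is handy for per-group work that is not a simple aggregation, such as drawing one chart per group.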


Example: Tipping Data
• http://www.cis.umassd.edu/~dkoop/dsc201-2016fa/notebooks/tipping.ipynb
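As a taste of the tipping example, the sketch below derives a tip percentage column and aggregates it by group. The four rows are invented stand-ins rather than the real dataset, and the column names (`total_bill`, `tip`, `smoker`) are assumed to match it; the linked notebook is the authoritative version.

```python
import pandas as pd

# Invented rows standing in for the tipping dataset (assumed columns).
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.0, 3.0, 8.0, 10.0],
    "smoker": ["No", "No", "Yes", "Yes"],
})

# Derived column: tip as a fraction of the total bill.
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# Split by smoker, apply mean to tip_pct, combine into a Series.
mean_pct = tips.groupby("smoker")["tip_pct"].mean()
print(mean_pct)
```

With the real data, the same two lines work after loading the file with `pd.read_csv`, and the grouping key can be swapped for any other categorical column.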
