R Does Pivot Tables and More
2 Copyright @ 2009 Jim Holtman
Agenda
R Does Pivot Tables
Sparklines (Edward Tufte)
Misc. Graphics
Questions
Pivot Tables & More
John Van Wagenen’s CMG2008 paper “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale”.
Pivot tables are a nice way to slice/dice/aggregate data.
I had been doing similar things in R, so his paper motivated me to write a paper on another way to get the same information.
I have used his data to illustrate how to do these techniques in R.
The following slides walk through some examples.
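As a rough illustration of the idea, base R can build a pivot-table-style summary with `tapply()` or `xtabs()`. The data frame and its column names (`system`, `shift`, `cpu.sec`) are invented for this sketch and are not the data from John's paper.

```r
# Hypothetical sample data (not the CMG2008 data set)
jobs <- data.frame(
  system  = c("A", "A", "B", "B", "A", "B"),
  shift   = c("PRIME", "WEEKEND", "PRIME", "PRIME", "PRIME", "WEEKEND"),
  cpu.sec = c(10, 5, 20, 30, 15, 25)
)

# Rows = system, columns = shift, cells = sum of cpu.sec
pivot <- tapply(jobs$cpu.sec, list(jobs$system, jobs$shift), sum)
print(pivot)

# xtabs() gives the same sums through a formula interface
pivot2 <- xtabs(cpu.sec ~ system + shift, data = jobs)

# Add row and column totals, like the grand totals of a spreadsheet pivot table
print(addmargins(pivot2))
```

Swapping `sum` for `mean`, `max` or `length` in the `tapply()` call changes the aggregation, much like changing the "summarize by" setting of a pivot table.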
Excel Spreadsheet
CSV File Exported from above (10,696 data lines)
Excel Pivot Table Generated from the Data
Read John’s paper for the procedure for generating the pivot table in Excel
This is what the data objects in R look like.
“Casting” New Data
From the same ‘melt’ data, I can create a daily summary and add an indicator for PRIME time:
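The original script for this slide is not in the transcript. The slides use `melt`/`cast` from the reshape package; the sketch below swaps in base R (`xtabs()`) so it runs without extra packages. The sample rows, the column names and the definition of PRIME time (08:00 to 16:59) are assumptions made for illustration only.

```r
# Hypothetical long ("molten") data: one row per timestamp/variable/value
molten <- data.frame(
  time     = as.POSIXct(c("2008-05-05 09:00", "2008-05-05 22:00",
                          "2008-05-06 10:00", "2008-05-06 23:00"), tz = "UTC"),
  variable = "cpu.sec",
  value    = c(100, 40, 120, 30)
)

# Derive the day and a PRIME indicator (assumed here to be 08:00-16:59)
molten$day   <- as.Date(molten$time)
hour         <- as.integer(format(molten$time, "%H"))
molten$prime <- ifelse(hour >= 8 & hour < 17, "PRIME", "NONPRIME")

# "Cast": one row per day, one column per PRIME/NONPRIME, cells = sum of value
daily <- xtabs(value ~ day + prime, data = molten)
print(daily)
```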
Excel Spreadsheet (24,560 data points)
Pivot Table
Chart
R Script
[Pie chart: Breakdown by Shifts, with slices HOLIDAY, PERIOD2, PERIOD3, PRIME and WEEKEND]
R took 0.6 seconds to read in 24,560 lines of data, summarize by shift and create the pie chart.
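The script itself did not survive in the transcript, so here is a minimal sketch of the same three steps (read, summarize by shift, draw the pie chart), timed with `system.time()`. The tiny CSV file and its column names are made up so that the example is self-contained.

```r
# Write a tiny hypothetical CSV so the sketch is self-contained
csv <- tempfile(fileext = ".csv")
writeLines(c("job,shift,cpu.sec",
             "J1,PRIME,30",
             "J2,WEEKEND,10",
             "J3,PRIME,20",
             "J4,HOLIDAY,5"), csv)

elapsed <- system.time({
  shifts   <- read.csv(csv)
  by.shift <- tapply(shifts$cpu.sec, shifts$shift, sum)  # total per shift
  pdf(tempfile(fileext = ".pdf"))                        # draw to a throwaway file
  pie(by.shift, main = "Breakdown by Shifts")
  dev.off()
})

print(by.shift)
print(elapsed["elapsed"])  # wall-clock seconds for all three steps
```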
“batch” Data Object in R
EDA on the “batch” Data
[Figure: three histograms (y-axis: Frequency) that zoom in on the small values: batch$cpu.hrs (x-axis 0 to 20), batch$cpu.hrs[batch$cpu.hrs < 0.03], and batch$cpu.hrs[batch$cpu.hrs < 0.005]]
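The three histograms use the same subsetting expressions as their titles. A minimal sketch, using synthetic, heavily skewed values in place of the real `batch$cpu.hrs` column (the distribution and the five large values are assumptions):

```r
# Hypothetical, heavily skewed CPU hours (most jobs tiny, a few large)
set.seed(42)
cpu.hrs <- rexp(5000, rate = 200)      # mean of 0.005 hours
cpu.hrs[1:5] <- c(2, 5, 8, 12, 19)     # a handful of long-running jobs

pdf(tempfile(fileext = ".pdf"))
par(mfrow = c(3, 1))
hist(cpu.hrs)                    # dominated by the few large values
hist(cpu.hrs[cpu.hrs < 0.03])    # zoom in on the bulk of the jobs
hist(cpu.hrs[cpu.hrs < 0.005])   # zoom in again
dev.off()

print(summary(cpu.hrs))
```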
Summarize by Prod & Dev (3rd character)
Excel Spreadsheet Pivot Table
[Figure: Chart From Pivot Table. CPU seconds (0 to 2,500,000) by month from 5/1/2007 to 6/1/2008, with series DEV and PROD]
Summarize by Prod & Dev Using R
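The R code for this slide is not in the transcript. The sketch below shows one way such a summary could be done in base R. The job names, the dates and the rule that a third character of "P" means PROD and anything else means DEV are assumptions for illustration.

```r
# Hypothetical batch records; assume the 3rd character of the job name
# marks the environment ("P" = PROD, otherwise DEV)
batch <- data.frame(
  job     = c("ABP001", "ABD002", "XYP003", "XYD004", "ABP005"),
  date    = as.Date(c("2007-05-03", "2007-05-10", "2007-06-02",
                      "2007-06-15", "2007-06-20")),
  cpu.sec = c(1000, 200, 1500, 300, 500)
)

# Classify each job from the 3rd character of its name
code      <- substr(batch$job, 3, 3)
batch$env <- ifelse(code == "P", "PROD", "DEV")

# Month key (first day of the month) used as the pivot row
batch$month <- format(batch$date, "%Y-%m-01")

# Pivot: rows = month, columns = DEV/PROD, cells = total CPU seconds
by.env <- xtabs(cpu.sec ~ month + env, data = batch)
print(by.env)

# Stacked bars, one bar per month (barplot wants the series in rows, hence t())
pdf(tempfile(fileext = ".pdf"))
barplot(t(by.env), legend.text = TRUE, ylab = "Total CPU Seconds", las = 2)
dev.off()
```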
Chart from R
[Figure: Total CPU Seconds (0 to 2,000,000) by month from 2007-05-01 to 2008-06-01, with series DEV and PROD]
Pivot Table Summary
R & Excel (and other products) can produce summaries that are equivalent to “pivot tables”.
In R it is easy to automate the scripts, run through a set of files and quickly produce output in various formats: PDF, PNG for web pages, WMF for inclusion in Word/PowerPoint documents, …
The interactive nature of R makes it easy to do EDA (exploratory data analysis) on your data.
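A minimal sketch of that kind of automation: loop over every CSV file in a directory and write one chart per file. The directory, file names and columns (`hour`, `cpu`) are hypothetical, and the example creates its own input files so that it can run anywhere.

```r
# Create two small hypothetical CSV files so the loop has something to read
in.dir <- tempfile("logs")
dir.create(in.dir)
for (name in c("sys1", "sys2")) {
  write.csv(data.frame(hour = 0:23, cpu = (0:23 * 7) %% 100),
            file.path(in.dir, paste0(name, ".csv")), row.names = FALSE)
}

# Process every CSV in the directory and write one PDF chart per file
files   <- list.files(in.dir, pattern = "\\.csv$", full.names = TRUE)
outputs <- character(0)
for (f in files) {
  dat <- read.csv(f)
  out <- sub("\\.csv$", ".pdf", f)
  pdf(out)   # png() or, on Windows, win.metafile() could be used instead
  plot(dat$hour, dat$cpu, type = "l", main = basename(f),
       xlab = "Hour", ylab = "CPU %")
  dev.off()
  outputs <- c(outputs, out)
}
print(basename(outputs))
```

Only the device call changes between output formats; the plotting code stays the same.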
Sparklines
Invented by Edward Tufte, a well-known expert on data visualization; see www.edwardtufte.com for more examples.
Inspired by Ron Kaminski’s CMG2008 paper
Sparklines from ‘vmstat’ data
A script on the production systems logs the ‘vmstat’ data to a file every 30 seconds. This is used to create the daily and monthly utilization charts for a system.
The data was used to create “sparklines” of 19 variables in the log file shown below.
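The sparkline script is not part of the transcript. One way to get a similar effect with base graphics is to stack one small, axis-free line plot per variable. The data below is synthetic; the column names `us`, `sy` and `r` imitate typical ‘vmstat’ columns and are assumptions about the log format.

```r
# Hypothetical vmstat-style samples: one row per 30-second interval
set.seed(1)
n  <- 120
vm <- data.frame(
  us = pmin(100, pmax(0, 30 + cumsum(rnorm(n, sd = 3)))),  # user CPU %
  sy = pmin(100, pmax(0, 10 + cumsum(rnorm(n, sd = 1)))),  # system CPU %
  r  = rpois(n, 2)                                          # run queue length
)

pdf(tempfile(fileext = ".pdf"), width = 4, height = 3)
# One small panel per variable, no axes or boxes: just the line
par(mfrow = c(ncol(vm), 1), mar = c(0.2, 4, 0.2, 4))
for (v in names(vm)) {
  y <- vm[[v]]
  plot(y, type = "l", axes = FALSE, xlab = "", ylab = "")
  points(which.max(y), max(y), col = "red", pch = 20)    # mark the maximum
  mtext(v, side = 2, las = 1, line = 1, cex = 0.7)       # variable name at left
  mtext(round(y[n], 1), side = 4, las = 1, line = 0.5, cex = 0.7)  # last value
}
dev.off()
```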
Monthly Data
I have used “levelplot” to show 3D data: day of the month on the y-axis, time of day on the x-axis, and color to represent the z value, which is the CPU utilization.
Sparklines for the month’s performance of the system were plotted next to the levelplot for comparison.
Both presentation methods allow you to look for patterns. Which do you find the easiest to see patterns in?
Sparklines would make an interesting presentation of yearly data. The example just duplicates the monthly data to provide an idea of what it might look like.
‘levelplot’ and sparklines of the same monthly utilization data.
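A minimal sketch of such a level plot with `levelplot()` from the lattice package, which is included with standard R installations. The utilization values are synthetic (busier between 08:00 and 17:59, with a small weekly variation) and only stand in for the real monthly data.

```r
library(lattice)  # ships with standard R distributions

# Hypothetical month of hourly CPU utilization: busier during working hours
grid <- expand.grid(hour = 0:23, day = 1:30)
grid$cpu <- with(grid, 20 + 60 * (hour >= 8 & hour < 18) + (day %% 7))

pdf(tempfile(fileext = ".pdf"))
# x = time of day, y = day of month, color = CPU utilization
p <- levelplot(cpu ~ hour * day, data = grid,
               xlab = "Hour of day", ylab = "Day of month",
               main = "CPU utilization (%)")
print(p)   # lattice plots must be printed explicitly inside scripts
dev.off()
```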
Transaction Data
Consolidated ~79K transactions into 10 transaction groups and 10 user pools to make the reports easier to see.
Data has the user, transaction name, start and end time.
Response time was calculated from the start and end times.
Look at this data with some stacked barcharts and mosaic plots.
Pivot Table of User/Transaction Counts
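A small sketch of the two steps described on this slide, computing the response time and counting transactions by user, in base R. The five log records and the column names are invented for illustration; the real data had about 79K transactions.

```r
# Hypothetical transaction log: user, transaction name, start and end time
trans <- data.frame(
  user  = c("User.01", "User.01", "User.02", "User.02", "User.02"),
  tran  = c("Tran.01", "Tran.02", "Tran.01", "Tran.01", "Tran.02"),
  start = as.POSIXct(c("2009-01-05 09:00:00", "2009-01-05 09:00:10",
                       "2009-01-05 09:01:00", "2009-01-05 09:02:00",
                       "2009-01-05 09:03:00"), tz = "UTC"),
  end   = as.POSIXct(c("2009-01-05 09:00:02", "2009-01-05 09:00:15",
                       "2009-01-05 09:01:01", "2009-01-05 09:02:04",
                       "2009-01-05 09:03:03"), tz = "UTC")
)

# Response time in seconds = end - start
trans$response <- as.numeric(difftime(trans$end, trans$start, units = "secs"))

# Pivot table of counts: rows = transaction, columns = user
counts <- table(trans$tran, trans$user)
print(counts)

# Mean response time per transaction group
resp.mean <- tapply(trans$response, trans$tran, mean)
print(resp.mean)
```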
[Figure: Stacked Bar Chart of Transaction Count by User. x-axis: User.01 to User.10; y-axis: Total Transactions (0 to 15,000); stacked series: Tran.01 to Tran.10]
Stacked Bar Chart/Mosaic Chart
The stacked bar chart lets you see who the busy users are in terms of number of transactions.
A mosaic chart shows the same data, but the “area” of the boxes is proportional to the counts. The y-axis range is the same for all data elements.
It is sometimes easier to see the ratios (mix) of transactions for a user, which may denote a different role for that user. For example, User.06 (lowest count) has a higher ratio of Trans.06, Trans.09 and Trans.10 than User.08 (highest count).
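To make the "mix" idea concrete, the sketch below computes each user's transaction shares with `prop.table()` and draws both chart types with base R. The counts and the names `User.A` and `User.B` are hypothetical; they are not the values from the slides.

```r
# Hypothetical counts: rows = transaction type, columns = user
counts <- matrix(c(50, 30, 20,
                    5, 10, 25),
                 nrow = 3,
                 dimnames = list(Tran = c("Tran.01", "Tran.02", "Tran.03"),
                                 User = c("User.A", "User.B")))

# Transaction mix per user: each column sums to 1
mix <- prop.table(counts, margin = 2)
print(round(mix, 2))

pdf(tempfile(fileext = ".pdf"))
# Stacked bar chart: bar height = total transactions per user
barplot(counts, legend.text = TRUE, ylab = "Total Transactions")
# Mosaic plot: column width ~ user total, box height ~ transaction share
mosaicplot(as.table(t(counts)), main = "Transactions by User", color = TRUE)
dev.off()
```

In this made-up example `User.B` has far fewer transactions than `User.A`, but a much larger share of `Tran.03`, which is the kind of difference the mosaic plot makes visible.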
Mosaic Plot of the Number of Transactions by User - Area Proportional to Count
[Figure: mosaic plot. x-axis: User (User.01 to User.10); y-axis: Tran (Trans.01 to Trans.10)]
Summary
Short introduction to R that will hopefully whet your appetite to look at what R might be able to do for you.
Showed how R can be used to generate summaries equivalent to pivot tables in Excel.
Examples of sparklines and mosaic plots that help to visualize data in some different ways.
E-mail me some of your data, and an idea of what you would like summarized, and I will try to show how R can do some basic processing on it.
References
[1] J. Van Wagenen, “Pivot Tables/Charts – Magic Beans Without Living in a Fairy Tale”, CMG 2008
[2] R. Kaminski, “Automating Process Pathology Detection – Rule Engine Design Hints”, CMG 2008
[3] R Development Core Team, “R: A Language and Environment for Statistical Computing”, ISBN 3-900051-07-0, http://www.R-project.org
[4] J. Holtman, “Using R for System Performance Analysis”, CMG 2004
[5] J. Holtman, “Visualization Techniques for Analyzing Patterns in System Performance Data”, CMG 2005
[6] N. J. Gunther, “Guerrilla Capacity Planning”, Springer-Verlag, Heidelberg, Germany, 2007
[7] H. Wickham, “Reshaping data with the reshape package”, Journal of Statistical Software, 21(12), 2007
[8] W. N. Venables and B. D. Ripley, “Modern Applied Statistics with S”, Fourth Edition, Springer, 2002, ISBN 0-387-95458-0
[9] E. Tufte, “Beautiful Evidence”, Graphics Press, 2006
[10] P. Spector, “Data Manipulation with R (Use R)”, Springer, 2009, ISBN 978-0387747309