04 reports
If you’re using a laptop, start installing LaTeX by following the instructions on the website.
Thursday, 2 September 2010
Hadley Wickham
Stat405
Statistical reports
1. More subsetting.
2. Missing values.
3. Statistical reports: data, code, graphics & written report
Office hours
Me: before class, DH 2056
Garrett: Wednesday, 3pm, DH 1041
Lab access: you should now have it
Saving results
# Prints to screen
diamonds[diamonds$x > 10, ]
# Saves to new data frame
big <- diamonds[diamonds$x > 10, ]
# Overwrites existing data frame. Dangerous!
diamonds <- diamonds[diamonds$x < 10, ]
diamonds <- diamonds[1, 1]
diamonds
# Uh oh!
rm(diamonds)
str(diamonds)
# Phew!
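A minimal sketch of why rm() rescues you here. It uses the built-in mtcars dataset as a stand-in for diamonds (an assumption, made so the example runs without ggplot2 installed). Assigning to the name creates a copy in your workspace that masks the packaged dataset; removing that copy makes the original visible again.

```r
# mtcars ships with R's datasets package; it stands in for diamonds here.
mtcars <- mtcars[1, 1]          # Oops: the workspace copy is now one number
overwritten <- mtcars

rm(mtcars)                      # Deletes only the workspace copy
restored_rows <- nrow(mtcars)   # The packaged dataset is visible again

print(overwritten)
print(restored_rows)
```

This only works for datasets that come from a package. A data frame you loaded or built yourself is gone once you overwrite it.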
Your turn
Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.
Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)
equal_dim <- diamonds$x == diamonds$y
equal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0
zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zeros
good <- diamonds[!bad, ]
Missing values
Data errors
Typically removing the entire row because of one error is overkill. Better to selectively replace problem values with missing values.
In R, missing values are indicated by NA.
Expression Guess Actual
5 + NA
NA / 2
sum(c(5, NA))
mean(c(5, NA))
NA < 3
NA == 3
NA == NA
NA behaviour
Missing values propagate
Use is.na() to check for missing values
Many functions (e.g. sum and mean) have an na.rm argument to remove missing values prior to computation.
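The three points above can be checked directly at the console. This is a small sketch in base R; the variable names (x, missing, total, average) are illustrative only.

```r
# Missing values propagate through arithmetic and comparisons
5 + NA          # NA
NA / 2          # NA
sum(c(5, NA))   # NA
mean(c(5, NA))  # NA
NA < 3          # NA
NA == 3         # NA
NA == NA        # NA: two unknown values may or may not be equal

# is.na() is the way to test for missingness
x <- c(5, NA)
missing <- is.na(x)                # FALSE TRUE

# na.rm = TRUE drops missing values before computing
total <- sum(x, na.rm = TRUE)      # 5
average <- mean(x, na.rm = TRUE)   # 5
```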
# Can use subsetting + <- to change individual values
diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$z[diamonds$z == 0] <- NA

y_big <- !is.na(diamonds$y) & diamonds$y > 10
diamonds$y[y_big] <- diamonds$y[y_big] / 10
z_big <- !is.na(diamonds$z) & diamonds$z > 6
diamonds$z[z_big] <- diamonds$z[z_big] / 10
Your turn
What happens if you don’t remove the missing values during the subsetting replacement? Why?
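One way to explore this question, sketched on a small made-up vector y (hypothetical values, not the diamonds data). When the logical index contains NA and the replacement has more than one value, R cannot tell which elements to overwrite, so it stops with an error.

```r
y <- c(5, NA, 58.9, 31.8)   # Hypothetical values; NA marks a former zero

# The comparison propagates the missing value into the index
idx <- y > 10               # FALSE NA TRUE TRUE

# Replacement with NA in the index fails, so capture the error message
msg <- tryCatch({
  y[idx] <- y[idx] / 10
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Guarding with !is.na() turns the NA into FALSE, and the replacement works
ok <- !is.na(y) & y > 10    # FALSE FALSE TRUE TRUE
y[ok] <- y[ok] / 10
print(y)
```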
Statistical reports
Regardless of whether you go into academia or industry, you need to be able to present your findings.
You should be able to do more than just present them: you should be able to reproduce them.
Data (.csv) +
Code (.r) +
Graphics (.png, .pdf) +
Written report (.tex)
All in one directory
Working directory
Set your working directory to specify where files will be loaded from and saved to.
From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R.
On Windows: File | Change dir.
On the Mac: ⌘-D
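The working directory can also be inspected and changed from code with getwd() and setwd(). A minimal sketch, using tempdir() as a scratch location and a hypothetical file name example.csv so that it runs anywhere:

```r
old <- getwd()          # Where files are currently loaded from and saved to

# tempdir() is a scratch directory, used here so the sketch runs anywhere;
# in practice you would point setwd() at your report directory.
setwd(tempdir())
writeLines("x,y\n1,2", "example.csv")   # Hypothetical file, saved here
saved <- file.exists(file.path(tempdir(), "example.csv"))

setwd(old)              # Go back to where we started
```

Relative file names, as in writeLines() above, are always resolved against the current working directory.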
Data
So far we’ve just used built-in datasets
Next week we’ll learn how to use external data
Code
Workflow
At the end of each interactive session, you want a summary of everything you did
Two options:
Save everything that you did with savehistory("filename.r"), then remove the unimportant bits
Build up the important bits as you go
Up to you - I prefer the second
R editor
Linux: gedit (copy and paste - see website)
Windows: File | New Script (press F5 to send line)
Mac: File | New document (press command-enter to send)
Code is communication!
Code presentation
Use comments (#) to describe what you are doing and to create scannable headings in your code.
Every comma should be followed by a space, and every mathematical operator (+, -, =, *, / etc) should be surrounded by spaces. Parentheses do not need spaces.
Lines should be at most 80 characters. If you have to break up a line, indent the following piece.
qplot(table,depth,data=diamonds)
qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
qplot(table-depth,data=diamonds,geom="histogram")
qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)
# Table and depth -------------------------
qplot(table, depth, data = diamonds)
qplot(table, depth, data = diamonds) + xlim(50, 70) + ylim(50, 70)

# Is there a linear relationship?
qplot(table - depth, data = diamonds, geom = "histogram")

# This bin width seems the most revealing
qplot(table / depth, data = diamonds, geom = "histogram", binwidth = 0.01) +
  xlim(0.8, 1.2)
# Also tried: 0.05, 0.005, 0.002
Graphics
Saving graphics
# Uses size on screen:
ggsave("my-plot.pdf")
ggsave("my-plot.png")

# Specify size
ggsave("my-plot.pdf", width = 6, height = 6)

# Remember to set your working directory!
PDF: vector based (can zoom in infinitely). Good for most plots.
PNG: raster based (made up of pixels). Good for plots with thousands of points.
Your turn
Recreate some of the graphics from previous lectures and save them.
Experiment with the scale and height and width settings.
Modify the template to include them.
Written report
LaTeX
We are going to use LaTeX, an open source document typesetting system, to produce our reports.
It is widespread in statistics: if you ever write a journal article, you will probably write it in LaTeX.
(It is less useful if you’re not in grad school, but still an important skill.)
Edit-Compile-Preview
Edit: a text document with special formatting
Compile: to produce a pdf
Preview: with a pdf viewer
See web page for system specifics.
LaTeX
Template
Sections
Images
Figures and cross-references
Verbatim input (for code)
Your turn
# Get the sample report
wget http://had.co.nz/stat405/resources/sample-report.zip
unzip sample-report.zip

cd sample-report
gedit template.tex &
pdflatex template.tex
evince template.pdf
# Experiment!
Your turn
If not on linux, follow the instructions on the class website.
If you feel comfortable, start on homework 2.
Homework