04 reports

Upload: hadley-wickham

Post on 01-Nov-2014

Page 1: 04 Reports

If you’re using a laptop, start installing LaTeX now, following the instructions on the website

Thursday, 2 September 2010

Page 2: 04 Reports

Hadley Wickham

Stat405

Statistical reports

Page 3: 04 Reports

1. More subsetting.

2. Missing values.

3. Statistical reports: data, code, graphics & written report

Page 4: 04 Reports

Office hours

Me: before class, DH 2056

Garrett: Wednesday, 3pm, DH 1041

Lab access: you should now have it

Page 5: 04 Reports

Saving results

# Prints to screen

diamonds[diamonds$x > 10, ]

# Saves to new data frame

big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!

diamonds <- diamonds[diamonds$x < 10, ]

Page 6: 04 Reports

diamonds <- diamonds[1, 1]
diamonds

# Uh oh!

# rm() deletes your modified copy, so the original
# built-in dataset becomes visible again
rm(diamonds)
str(diamonds)

# Phew!

Page 7: 04 Reports

Your turn

Create a logical vector that selects diamonds with equal x & y. Create a new dataset that only contains these values.

Create a logical vector that selects diamonds with incorrect/unusual x, y, or z values. Create a new dataset that omits these values. (Hint: do this one variable at a time)

Page 8: 04 Reports

equal_dim <- diamonds$x == diamonds$y
equal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0
zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zeros
good <- diamonds[!bad, ]
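The same pattern can be tried on a tiny data frame, where every row is easy to check by eye. This is only a sketch: `toy` and its values are made up for illustration and are not part of the diamonds data.

```r
# Hypothetical toy data standing in for diamonds (x, y, z in mm)
toy <- data.frame(
  x = c(4.0, 0.0, 5.1, 6.2),
  y = c(4.0, 3.9, 58.9, 6.1),
  z = c(2.4, 2.3, 3.1, 0.0)
)

# One logical vector per kind of problem
y_big <- toy$y > 10
zeros <- toy$x == 0 | toy$y == 0 | toy$z == 0

# Combine with | (or), then negate with ! to keep the good rows
bad <- y_big | zeros
n_bad <- sum(bad)      # TRUE counts as 1, so this counts flagged rows
good <- toy[!bad, ]
```

Calling `sum()` on a logical vector is a quick way to see how many rows a condition flags before you drop them.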

Page 9: 04 Reports

Missing values

Page 10: 04 Reports

Data errors

Typically, removing an entire row because of one error is overkill. It is better to selectively replace problem values with missing values.

In R, missing values are indicated by NA

Page 11: 04 Reports

Expression        Guess    Actual

5 + NA

NA / 2

sum(c(5, NA))

mean(c(5, NA))

NA < 3

NA == 3

NA == NA

Page 12: 04 Reports

NA behaviour

Missing values propagate

Use is.na() to check for missing values

Many functions (e.g. sum and mean) have an na.rm argument to remove missing values prior to computation.
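A short sketch of these three rules, using only base R. The variable names are illustrative.

```r
# Missing values propagate through arithmetic and comparisons
plus_na <- 5 + NA        # NA
half_na <- NA / 2        # NA
less_na <- NA < 3        # NA
same_na <- NA == NA      # NA, so == cannot be used to find missing values

# is.na() is the reliable test
x <- c(5, NA)
missing <- is.na(x)      # FALSE TRUE

# Summaries return NA unless na.rm = TRUE
total_all  <- sum(x)                  # NA
total_seen <- sum(x, na.rm = TRUE)    # 5
mean_seen  <- mean(x, na.rm = TRUE)   # 5
```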

Page 13: 04 Reports

# Can use subsetting + <- to change individual values

diamonds$x[diamonds$x == 0] <- NA
diamonds$y[diamonds$y == 0] <- NA
diamonds$z[diamonds$z == 0] <- NA

y_big <- !is.na(diamonds$y) & diamonds$y > 10
diamonds$y[y_big] <- diamonds$y[y_big] / 10
z_big <- !is.na(diamonds$z) & diamonds$z > 6
diamonds$z[z_big] <- diamonds$z[z_big] / 10

Page 14: 04 Reports

Your turn

What happens if you don’t remove the missing values during the subsetting replacement? Why?
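One way to explore this question is on a three-element vector instead of the full dataset. The sketch below uses a made-up vector `y`; `tryCatch()` is there only so the script keeps running after the error.

```r
y <- c(5.2, NA, 31.8)

# Comparing against NA gives NA, so the logical index contains NA
y_big_raw <- y > 10            # FALSE NA TRUE

# Reading with an NA index returns an extra NA element
picked <- y[y_big_raw]         # NA 31.8

# Assigning several values through an NA index is an error in R
result <- tryCatch({
  y[y_big_raw] <- y[y_big_raw] / 10
  "ok"
}, error = function(e) conditionMessage(e))

# Guarding with !is.na() turns the NA into FALSE, so the assignment works
y_big <- !is.na(y) & y > 10    # FALSE FALSE TRUE
y[y_big] <- y[y_big] / 10
```

R cannot tell whether the NA position should receive a new value, so it refuses the assignment rather than guess.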

Page 15: 04 Reports

Statistical reports

Page 16: 04 Reports

Statistical reports

Regardless of whether you go into academia or industry, you need to be able to present your findings.

And you should be able to do more than just present them: you should be able to reproduce them.

Page 17: 04 Reports

Data (.csv) +

Code (.r) +

Graphics (.png, .pdf) +

Written report (.tex)

In one directory

Page 18: 04 Reports

Working directory

Set your working directory to specify where files will be loaded from and saved to.

From the terminal (Linux or Mac): the working directory is the directory you’re in when you start R

On Windows: File | Change dir.

On the Mac: ⌘-D
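The working directory can also be inspected and changed from code with the base R functions `getwd()` and `setwd()`. A minimal sketch, assuming a hypothetical project folder created under `tempdir()` so that it runs anywhere:

```r
old_wd <- getwd()                 # where R currently reads and writes files

# Hypothetical project folder; in practice this would be your report directory
report_dir <- file.path(tempdir(), "sample-report")
dir.create(report_dir, showWarnings = FALSE)

setwd(report_dir)
new_wd <- getwd()

# A relative path is resolved against the working directory
writeLines("x,y", "data.csv")
wrote_here <- file.exists(file.path(report_dir, "data.csv"))

setwd(old_wd)                     # restore the original directory
```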

Page 19: 04 Reports

Data

So far we’ve just used built-in datasets

Next week we’ll learn how to use external data

Page 20: 04 Reports

Code

Page 21: 04 Reports

Workflow

At the end of each interactive session, you want a summary of everything you did

Two options:

Save everything that you did with savehistory("filename.r"), then remove the unimportant bits

Build up the important bits as you go

Up to you - I prefer the second

Page 22: 04 Reports

R editor

Linux: gedit (copy and paste - see website)

Windows: File | New Script (press F5 to send line)

Mac: File | New document (press command-enter to send)

Page 23: 04 Reports

Code is communication!

Page 24: 04 Reports

Code presentation

Use comments (#) to describe what you are doing and to create scannable headings in your code

Every comma should be followed by a space, and every mathematical operator (+, -, =, *, /, etc.) should be surrounded by spaces. Parentheses do not need spaces

Lines should be at most 80 characters. If you have to break up a line, indent the following piece

Page 25: 04 Reports

qplot(table,depth,data=diamonds)
qplot(table,depth,data=diamonds)+xlim(50,70)+ylim(50,70)
qplot(table-depth,data=diamonds,geom="histogram")
qplot(table/depth,data=diamonds,geom="histogram",binwidth=0.01)+xlim(0.8,1.2)

Page 26: 04 Reports

# Table and depth -------------------------

qplot(table, depth, data = diamonds)
qplot(table, depth, data = diamonds) + xlim(50, 70) + ylim(50, 70)

# Is there a linear relationship?
qplot(table - depth, data = diamonds, geom = "histogram")

# This bin width seems the most revealing
qplot(table / depth, data = diamonds, geom = "histogram",
  binwidth = 0.01) + xlim(0.8, 1.2)
# Also tried: 0.05, 0.005, 0.002

Page 28: 04 Reports

Graphics

Page 29: 04 Reports

Saving graphics

# Uses size on screen:
ggsave("my-plot.pdf")
ggsave("my-plot.png")

# Specify size
ggsave("my-plot.pdf", width = 6, height = 6)

# Remember to set your working directory!
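A self-contained sketch of saving with an explicit size. It assumes the ggplot2 package is installed and skips the save otherwise. It also passes the plot object to `ggsave()` explicitly rather than relying on the last plot displayed, writes to `tempdir()` rather than the working directory, and builds the plot with `ggplot()` because recent ggplot2 releases mark `qplot()` as deprecated. The object names are illustrative.

```r
out_file <- file.path(tempdir(), "my-plot.pdf")
saved <- FALSE

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Use a small slice of diamonds to keep the file small
  small <- ggplot2::diamonds[1:500, ]
  p <- ggplot2::ggplot(small, ggplot2::aes(table, depth)) +
    ggplot2::geom_point()

  # width and height are in inches by default
  ggplot2::ggsave(out_file, plot = p, width = 6, height = 6)
  saved <- file.exists(out_file)
}
```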

Page 30: 04 Reports

PDF: vector based (can zoom in infinitely). Good for most plots

PNG: raster based (made up of pixels). Good for plots with thousands of points

Page 31: 04 Reports

Your turn

Recreate some of the graphics from previous lectures and save them.

Experiment with the scale and height and width settings.

Modify the template to include them.

Page 32: 04 Reports

Written report

Page 33: 04 Reports

LaTeX

We are going to use the open source document typesetting system called LaTeX to produce our reports.

This is widespread in statistics - if you ever write a journal article, you will probably write it in LaTeX.

(Not as useful if you’re not in grad school, but still an important skill)

Page 34: 04 Reports

Edit-Compile-Preview

Edit: a text document with special formatting

Compile: to produce a pdf

Preview: with a pdf viewer

See web page for system specifics.

Page 35: 04 Reports

LaTeX

Template

Sections

Images

Figures and cross-references

Verbatim input (for code)
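To give a feel for these pieces, the sketch below writes a small `.tex` file from R. It is not the course template: the file names `sketch-report.tex`, `my-plot.pdf` and `my-code.r` are hypothetical, and the report would only compile with `pdflatex` if the plot and code files exist next to it.

```r
# Hypothetical minimal report; the course's real template.tex may differ
tex <- c(
  "\\documentclass{article}",
  "\\usepackage{graphicx}  % provides \\includegraphics",
  "\\usepackage{verbatim}  % provides \\verbatiminput",
  "\\begin{document}",
  "",
  "\\section{Table and depth}",
  "Figure~\\ref{fig:table-depth} shows depth against table.",
  "",
  "\\begin{figure}[htbp]",
  "  \\centering",
  "  \\includegraphics[width=0.6\\linewidth]{my-plot.pdf}",
  "  \\caption{Depth vs. table for the diamonds data.}",
  "  \\label{fig:table-depth}",
  "\\end{figure}",
  "",
  "\\section{Code}",
  "\\verbatiminput{my-code.r}",
  "",
  "\\end{document}"
)

report_file <- file.path(tempdir(), "sketch-report.tex")
writeLines(tex, report_file)
```

The `\label` and `\ref` pair is what makes the cross-reference work: LaTeX fills in the figure number, usually after a second compile.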

Page 36: 04 Reports

Your turn

# Get the sample report
wget http://had.co.nz/stat405/resources/sample-report.zip
unzip sample-report.zip

cd sample-report
gedit template.tex &
pdflatex template.tex
evince template.pdf
# Experiment!

Page 37: 04 Reports

Your turn

If not on linux, follow the instructions on the class website.

If you feel comfortable, start on homework 2.

Page 38: 04 Reports

Homework
