Introduction to Graphics in R
3/12/2014
First, let’s get some data
• Load the Duncan dataset
• It’s in the car package. Remember how to get it?
– library(car)
– data(Duncan)
Getting started
• Okay, now plot income levels:
– plot(Duncan$income)
• What is this graph? Can you make it a line plot instead?
– plot(Duncan$income, type="l")
Histogram
• The X axis is useless. Wouldn’t a histogram be more informative?
• Make a histogram
• If you’re stuck, use Google
– hist(Duncan$income)
Fix the title
• ‘Histogram of Duncan$income’ is not a good title
• Change it to ‘Income Distribution in Duncan Dataset’
– hist(Duncan$income, main="Income Distribution in Duncan Dataset")
Another option
• There’s another way to set the title. Maybe some of you will have done this (my crystal ball is murky):
– hist(Duncan$income)
– title("Income Distribution in Duncan Dataset")
• But wait. That looks awful: title() draws the new title on top of the default one. We need to suppress the title in the hist() call. How do we do that?
– hist(Duncan$income, main="")
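Putting the two steps together, a minimal sketch of the full sequence might look like this. It assumes the car package is installed and supplies Duncan, as on the earlier slide.

```r
library(car)   # assumed installed; supplies the Duncan data
data(Duncan)

# Suppress the default title in hist(), then add our own with title().
hist(Duncan$income, main = "")
title("Income Distribution in Duncan Dataset")
```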
Scatterplot
• Okay, let’s look at income vs. prestige
• Make a scatterplot comparing income (x-axis) to prestige (y-axis)
– plot(Duncan$income, Duncan$prestige)
• Did you get the x- and y-axes right?
• Add a title: Income vs. Prestige
– title("Income vs. Prestige")
Scatterplot: Axis labels
• The axis labels display the variable names. Can we do better than that?
• Label the X axis “Income” and the Y axis “Prestige”
– plot(Duncan$income, Duncan$prestige, xlab="Income", ylab="Prestige")
Scatterplot: Axis range
• How come income doesn’t have ticks at 0 and 100 but prestige does?
• Make both axes run from 0 to 100
– plot(Duncan$income, Duncan$prestige, xlab="Income", ylab="Prestige", xlim=c(0,100))
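As for why the default ticks differ: R picks each axis range from the data (padded slightly) and places ticks at round values inside that range. A quick way to check is to compare the ranges of the two variables. This sketch assumes the car package is installed; income_range and prestige_range are names made up for illustration.

```r
library(car)   # assumed installed; supplies the Duncan data
data(Duncan)

# Smallest and largest value of each variable.
income_range   <- range(Duncan$income)
prestige_range <- range(Duncan$prestige)
print(income_range)
print(prestige_range)
```

If prestige nearly spans 0 to 100 and income does not, that accounts for the difference. Passing ylim=c(0,100) alongside xlim would make the y range explicit rather than relying on the default.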
Scatterplot: Axis tick marks
• Actually, your collaborator wants tick marks every 5 points on the X axis.
• DO IT
• Caveat: this is trickier. Suppress the default X axis with xaxt="n", then draw your own with axis():
– plot(Duncan$income, Duncan$prestige, xlab="Income", ylab="Prestige", xlim=c(0,100), xaxt="n")
– axis(1, at=seq(0,100, by=5))
Axis labels sideways
• Your collaborator still isn’t happy. Turn the x labels sideways.
– plot(Duncan$income, Duncan$prestige, xlab="Income", ylab="Prestige", xlim=c(0,100), xaxt="n")
– axis(1, las=2, at=seq(0,100, by=5))
More columns
• Now your collaborator wants to see how education affects this relationship. Create a dichotomous variable named ‘high_education’ categorizing education > 50 as TRUE and <= 50 as FALSE
– Duncan$high_education <- Duncan$education > 50
High education: sanity check
• How many high- and low-education jobs are there?
– table(Duncan$high_education)
• Plot education (y-axis) by high_education (x-axis)
– plot(Duncan$high_education, Duncan$education)
• Does it look right?
Adding color
• Okay, now color your income/prestige graph so high-education jobs are blue and low-education jobs are red
• This is a little tricky
– colors <- as.numeric(Duncan$high_education)+1
– plot(Duncan$income, Duncan$prestige, col=c("red", "blue")[colors], xlab="Income", ylab="Prestige", xlim=c(0,100), xaxt="n")
– axis(1, at=seq(0,100, by=5))
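The color trick rests on two things: as.numeric() turns FALSE/TRUE into 0/1, and indexing a vector by a vector of positions looks up one element per position. A tiny sketch with a made-up logical vector (high_education here is a hypothetical stand-in, not the Duncan column):

```r
# Hypothetical stand-in values, for illustration only.
high_education <- c(TRUE, FALSE, FALSE, TRUE)

# FALSE -> 0 -> 1 and TRUE -> 1 -> 2 (R vectors are indexed from 1).
colors <- as.numeric(high_education) + 1

# Position 1 picks "red", position 2 picks "blue".
point_colors <- c("red", "blue")[colors]
print(point_colors)   # "blue", "red", "red", "blue"
```

An equivalent one-liner would be ifelse(Duncan$high_education, "blue", "red"), which some readers may find easier to follow.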
Box plot
• Okay, now run this code:
– plot(Duncan$type, Duncan$income)
• What happened? Why didn't we get a scatterplot? Can you get one?
– plot(as.numeric(Duncan$type), Duncan$income)
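What happened is that plot() is generic: when x is a factor and y is numeric, it draws one box plot per factor level rather than a scatterplot. as.numeric() replaces each level with its integer code, so plot() sees plain numbers again. A small sketch with a made-up factor (job_type is hypothetical; its levels mimic those of Duncan$type):

```r
# Hypothetical stand-in for Duncan$type.
job_type <- factor(c("prof", "bc", "wc", "prof", "bc"))

# Levels are sorted alphabetically: "bc", "prof", "wc".
print(levels(job_type))

# Each value becomes the position of its level: 2 1 3 2 1.
type_codes <- as.numeric(job_type)
print(type_codes)
```

Note that the codes only reflect level order, so the spacing between them on the x axis carries no meaning.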
More than one plot at a time
• Now your collaborator wants your scatterplot and histogram side-by-side. (Don’t worry about color if you don't want to)
– opar <- par(no.readonly=TRUE)
– par(mfrow=c(1,2))
– hist(Duncan$income, main="Income Distribution in Duncan Dataset")
– plot(Duncan$income, Duncan$prestige, xlab="Income", ylab="Prestige", xlim=c(0,100), xaxt="n")
– axis(1, at=seq(0,100, by=5))
– par(opar)
ggplot
• ggplot is a whole different beast from base graphics
• ggplot is like R itself – some work to get oriented, but powerful once you do
• You don't have to know ggplot to be successful using R
– But you do have to experiment with it for this class
Load the ggplot library
• Hint: the package name, confusingly, is ggplot2
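In other words, something along these lines, assuming ggplot2 has already been installed (for example with install.packages("ggplot2")):

```r
# The package is called ggplot2, even though its main function is ggplot().
library(ggplot2)
```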
Plot income vs. prestige
• It will be easiest to start using qplot. qplot mimics plot(), but uses the ggplot layout engine.
– qplot(Duncan$income, Duncan$prestige)
ggplot
• qplot is the training wheels version of ggplot
• ggplot's syntax takes some getting used to. Try this:
– ggplot(Duncan) + aes(x=income, y=prestige) + geom_point()
• Huh? What are the pluses about?
ggplot syntax
• ggplot objects are weird
• You execute them (like a command) to draw their plot
• But you construct them by adding options to them
• Options specify data source, data columns, etc., resulting in code like this:
– p <- ggplot(Duncan)
– p <- p + aes(x=income, y=prestige)
– p + geom_point()
Where ggplot shines
• In my opinion, it's harder to think about doing simple plots in ggplot
• But when I want to do something multi-faceted (e.g. with different colors, sizes, etc.), ggplot makes it really easy
• I use it a lot to understand 3+-way relationships in data
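As a rough illustration with the Duncan data, here is a sketch that shows three variables at once by mapping education to color. It assumes the car and ggplot2 packages are installed; the choice of plot and the label text are illustrative, not taken from the original example.

```r
library(car)      # assumed installed; supplies the Duncan data
library(ggplot2)  # assumed installed
data(Duncan)

# Income vs. prestige, colored by whether education exceeds 50.
p <- ggplot(Duncan) +
  aes(x = income, y = prestige, color = education > 50) +
  geom_point() +
  labs(x = "Income", y = "Prestige", color = "High education")

p  # printing the object draws the plot
```

Doing the same in base graphics took a helper variable and a color lookup; here the third variable is just one more entry in aes().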
ggplot example (one of many)
ggplot code for that example
ggplot(data=nycnames) +
  aes(x=as.factor(race), y=n1_013002p, color=as.factor(nbhdarkwalk)) +
  geom_point(position="jitter") +
  scale_x_discrete(breaks=1:7, limits=1:7, name="Subject Race", labels=c('Asian', 'Black', 'First\nPeoples', 'Pacific\nIslander', 'Non-Hispanic\nWhite', 'Other', 'Hispanic')) +
  scale_color_discrete(breaks=1:4, limits=1:4, name="Neighborhood Safe After Dark", labels=c('Strongly Agree', 'Somewhat Agree', 'Somewhat disagree', 'Strongly Disagree')) +
  scale_y_continuous(name="Neighborhood percent white (1km buffer)")
Exercises