Chapter 3: Understanding and Describing Your Data
Tyson S. Barrett
Summer 2017
Utah State University
Introduction
Descriptive Statistics
Visualizations
Introduction
Understanding and Describing Your Data
We are going to take what we’ve learned from the previous two chapters and use them together to have simple but powerful ways to understand your data. This chapter will be broken down into:
1. Descriptive Statistics
2. Visualizations
The two go hand-in-hand in understanding what is happening in your data.
Exploring Data
We are often most interested in three things when exploring our data:
1. understanding distributions,
2. understanding relationships, and
3. looking for outliers or errors.
Descriptive Statistics
Several methods of discovering descriptives in a succinct way have been developed for R.
• My favorite (full disclosure: it is one that I made so I may be biased) is the table1 function in the furniture package.
• It has been designed to be simple and complete.
• It produces a well-formatted table that you can easily export and use as a table in a report or article.[1]
[1] It is called “table1” because a nice descriptive table is often found in the first table of many academic papers.
Table 1
We’ll first create a fictitious data set and then show the basic build of table1.
library(furniture)
df <- data.frame("A"=c(1,2,1,4,3,NA),
                 "B"=c(1.4,2.1,4.6,2.0,NA,3.4),
                 "C"=c(0,0,1,1,1,1),
                 "D"=rnorm(6))
table1 can quickly give you means and standard deviations (or counts and percentages for categorical variables).
“A” and “C” are supposed to be factors in this fake data set, so we convert them first.
df$A <- factor(df$A, labels=c("cat1", "cat2", "cat3", "cat4"))
df$C <- factor(df$C, labels=c("male", "female"))
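A quick check that the conversion did what we expect can be sketched in base R. The snippet rebuilds only the columns A and C so that it runs on its own; the object names counts_A and counts_C are just illustrative.

```r
## Rebuild the two categorical columns from the fake data set
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA),
                 "C" = c(0, 0, 1, 1, 1, 1))
df$A <- factor(df$A, labels = c("cat1", "cat2", "cat3", "cat4"))
df$C <- factor(df$C, labels = c("male", "female"))

levels(df$C)  ## labels follow the sorted codes: 0 = male, 1 = female

## Counts per level; useNA = "ifany" also counts the missing value in A
counts_A <- table(df$A, useNA = "ifany")
counts_C <- table(df$C)
counts_A
counts_C
```

Looking at the counts before building any table is a cheap way to catch labels that were assigned in the wrong order.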
Then we can use table1().
table1(df, A, B, C, D)
|==============================|
               Mean/Count (SD/%)
 Observations  6
 A
    cat1       2 (40%)
    cat2       1 (20%)
    cat3       1 (20%)
    cat4       1 (20%)
 B
               2.7 (1.3)
 C
    male       2 (33.3%)
    female     4 (66.7%)
 D
               -0.2 (0.5)
|==============================|
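To see where the numbers in a table like this come from, here is a small base R sketch that recomputes the summary for B and the percentages for A by hand. It is not how table1 is implemented internally, only an equivalent calculation; D is left out because it is random.

```r
## Same fake data as above (columns A and B only)
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA),
                 "B" = c(1.4, 2.1, 4.6, 2.0, NA, 3.4))
df$A <- factor(df$A, labels = c("cat1", "cat2", "cat3", "cat4"))

## Continuous variable: mean and SD over the non-missing values
mean_B <- mean(df$B, na.rm = TRUE)
sd_B   <- sd(df$B, na.rm = TRUE)
round(c(mean = mean_B, sd = sd_B), 1)

## Categorical variable: percentages among non-missing values
## (table() drops NA by default)
pct_A <- 100 * prop.table(table(df$A))
round(pct_A, 1)
```

Note that the percentages for A are out of the five non-missing values, which is why cat1 shows 40% even though there are six observations.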
So now we see the counts and percentages for the factor variables. We can take a step further and look for relationships. The code below shows the means/standard deviations or counts/percentages by a grouping variable, in this case C.
table1(df, A, B, D,
       splitby = ~C)
|=================================|
                      C
               male      female
 Observations  2         4
 A
    cat1       1 (50%)   1 (33.3%)
    cat2       1 (50%)   0 (0%)
    cat3       0 (0%)    1 (33.3%)
    cat4       0 (0%)    1 (33.3%)
 B
               1.8 (0.5) 3.3 (1.3)
 D
               0.1 (0.8) -0.3 (0.3)
|=================================|
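The same grouped summaries can be sketched with base R, which may help when checking a table by hand. This is an equivalent calculation under the same fake data, not table1's internal code; by_B, by_A, and pct_by_A are illustrative names.

```r
## Same fake data as above (D omitted because it is random)
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA),
                 "B" = c(1.4, 2.1, 4.6, 2.0, NA, 3.4),
                 "C" = c(0, 0, 1, 1, 1, 1))
df$A <- factor(df$A, labels = c("cat1", "cat2", "cat3", "cat4"))
df$C <- factor(df$C, labels = c("male", "female"))

## Mean of B within each level of C, ignoring missing values
by_B <- tapply(df$B, df$C, mean, na.rm = TRUE)
round(by_B, 1)

## Cross-tabulate A by C, then convert to column percentages
by_A <- table(df$A, df$C)
pct_by_A <- 100 * prop.table(by_A, margin = 2)
round(pct_by_A, 1)
```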
We can also test for differences by group (although this is not particularly meaningful with a sample this small: six rows, with one value missing in each of A and B). It produces a warning since the χ² approximation is not accurate with cells this small.
table1(df, A, B, D,
       splitby = ~C,
       test=TRUE)
|=========================================|
                      C
               male      female     P-Value
 Observations  2         4
 A                                  0.405
    cat1       1 (50%)   1 (33.3%)
    cat2       1 (50%)   0 (0%)
    cat3       0 (0%)    1 (33.3%)
    cat4       0 (0%)    1 (33.3%)
 B                                  0.162
               1.8 (0.5) 3.3 (1.3)
 D                                  0.605
               0.1 (0.8) -0.3 (0.3)
|=========================================|
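For a sense of what such p-values correspond to, here is a sketch using standard base R tests on the same fake data. The slides do not say which tests table1 runs, so treat the choice of chisq.test() and the default (Welch) t.test() as an assumption; the chi-square p-value for A does come out at about 0.405, in line with the table.

```r
## Same fake data as above (D omitted because it is random)
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA),
                 "B" = c(1.4, 2.1, 4.6, 2.0, NA, 3.4),
                 "C" = c(0, 0, 1, 1, 1, 1))
df$A <- factor(df$A, labels = c("cat1", "cat2", "cat3", "cat4"))
df$C <- factor(df$C, labels = c("male", "female"))

## Categorical by group: chi-square test on the cross-tabulation.
## Run without suppressWarnings(), R warns
## "Chi-squared approximation may be incorrect" because the cells are tiny.
tab <- table(df$A, df$C)
chi <- suppressWarnings(chisq.test(tab))
round(chi$p.value, 3)

## Continuous by group: two-sample t-test (Welch version is R's default)
tt <- t.test(B ~ C, data = df)
round(tt$p.value, 3)
```

With expected cell counts below one, neither p-value should be read as more than a demonstration of the mechanics.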
Finally, we can play around with the formatting a bit to match what we need:
table1(df, A, B, D,
       splitby = ~C,
       test=TRUE,
       type = c("simple", "condense"))
|=========================================|
                      C
               male      female     P-Value
 Observations  2         4
 A                                  0.405
    cat1       50%       33.3%
    cat2       50%       0%
    cat3       0%        33.3%
    cat4       0%        33.3%
 B             1.8 (0.5) 3.3 (1.3)  0.162
 D             0.1 (0.8) -0.3 (0.3) 0.605
|=========================================|
So with three or four short lines of code we can get a good idea about variables that may be related to the grouping variable and any missingness in the factor variables. There’s much more you can do with table1 and there are vignettes and tutorials available to learn more.[2]
[2] tysonstanley.github.io
Other Descriptive Functions
Other quick descriptive functions exist; here are a few of them.
summary(df) ## descriptives for each variable in the data
library(psych) ## install first
describe(df) ## produces summary statistics for continuous variables

library(Hmisc) ## install first
Hmisc::describe(df) ## gives summary for each variable separately
Visualizations
Understanding your data, in my experience, generally requires visualizations.
Visualizations can help us:
• understand the distributions and relationships
• catch errors in the data
• find any outliers that could be highly influencing any models
ggplot2
For simple but appealing visualizations we are going to be using ggplot2.
This package is used to produce professional-level plots for many journalism organizations (e.g. FiveThirtyEight). These plots quickly reach presentation quality and can be used to impress your boss, your advisor, or your friends.
This package has a straightforward syntax. It is built by adding layers to the plot.
library(ggplot2) ## first install using install.packages("ggplot2")
First, we have a nice qplot function that provides us a “quick plot.” It quickly decides what kind of plot is useful given the data and variables you provide.
qplot(df$A) ## Makes a simple bar chart of counts (a histogram for numeric variables)
[Figure: bar chart of counts (y-axis: count) for df$A, with categories cat1, cat2, cat3, cat4 and NA on the x-axis]
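The same quick plot can also be written with the full ggplot syntax, which is a useful bridge to the layered approach. This is a minimal sketch that rebuilds only column A so it runs on its own; the object name p is illustrative.

```r
library(ggplot2)

## Only column A from the fake data set is needed here
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA))
df$A <- factor(df$A, labels = c("cat1", "cat2", "cat3", "cat4"))

## geom_bar() counts the rows in each category by default,
## including a separate bar for the missing value
p <- ggplot(df, aes(x = A)) +
  geom_bar()
p
```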
qplot(df$D, df$B) ## Makes a scatterplot
Warning: Removed 1 rows containing missing values (geom_point).
[Figure: scatterplot of df$B (y-axis) against df$D (x-axis)]
For a bit more control over the plot, you can use the ggplot function. The first piece is the ggplot piece. From there, we add layers. These layers generally start with geom_ then have the type of plot.
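The layering idea can be sketched by saving the ggplot piece on its own and then adding different geom_ layers to it. The seed and the object names (base, p_points, p_both) are assumptions made only so the example is reproducible and self-contained.

```r
library(ggplot2)

set.seed(123)  ## assumption: fixed seed so the random column D is reproducible
df <- data.frame("C" = factor(c(0, 0, 1, 1, 1, 1), labels = c("male", "female")),
                 "D" = rnorm(6))

## The ggplot piece: data and axes only, no layers yet (an empty canvas)
base <- ggplot(df, aes(x = C, y = D))

## Adding one layer draws the raw points
p_points <- base + geom_point()

## Layers stack in the order they are added: boxplot first, points on top
p_both <- base + geom_boxplot() + geom_point()
p_both
```

Because the base object is reusable, it is easy to try several plot types on the same variables without retyping the data and axis mappings.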
Below, we start with telling ggplot the basics of the plot and then build a boxplot. The x-axis is the variable “C” and the y-axis is the variable “D” and then we color it by variable “C” as well.
ggplot(df, aes(x=C, y=D)) +
  geom_boxplot(aes(color = C))
[Figure: boxplots of D (y-axis) by C (x-axis: male, female), colored by C]
Here are a few more examples:
ggplot(df, aes(x=C)) +
  geom_bar(stat="count", aes(fill = C))
[Figure: bar chart of counts (y-axis: count) by C (x-axis: male, female), filled by C]
ggplot(df, aes(x=B, y=D)) +
  geom_point(aes(color = C))
Warning: Removed 1 rows containing missing values (geom_point).
[Figure: scatterplot of D (y-axis) against B (x-axis), points colored by C (male, female)]
Note that the warning saying it removed a row appears because we had a missing value in “B”.
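A quick way to locate missing values is to count them per column, sketched here in base R on the same fake data. D is left out because rnorm() never produces NA; na_counts and n_plotted are illustrative names.

```r
## Same fake data as above (D omitted because it is random and never missing)
df <- data.frame("A" = c(1, 2, 1, 4, 3, NA),
                 "B" = c(1.4, 2.1, 4.6, 2.0, NA, 3.4),
                 "C" = c(0, 0, 1, 1, 1, 1))

## Number of missing values in each column
na_counts <- colSums(is.na(df))
na_counts

## Rows that are complete on the variables used in the scatterplot
n_plotted <- sum(complete.cases(df[, c("B", "C")]))
n_plotted
```

Five of the six rows are complete on B and C, which matches the single row the warning reports as removed.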
We are going to make the first plot (the boxplot) again but with some aesthetic adjustments. Notice that we just added two extra lines telling ggplot2 how we want some things to look.[3]
[3] This is just scratching the surface of what we can change in the plots.
A final example
ggplot(df, aes(x=C, y=D)) +
  geom_boxplot(aes(color = C)) +
  theme_bw() +
  scale_color_manual(values = c("dodgerblue4", "coral2"))
[Figure: boxplots of D (y-axis) by C (x-axis: male, female) on a white background, with manually chosen colors for male and female]
• theme_bw() makes the background white,
• scale_color_manual() allows us to change the colors in the plot.
You can get a good idea of how many types of plots you can do by going to http://docs.ggplot2.org/current.
Almost any informative plot that you need to do as a researcher is possible with ggplot2.
We will be using ggplot2 extensively in the class to help understand our data and our models as well as communicate our results.