Descriptive Statistics and Visualizing Data in STATA
BIOS 514/517
R. Y. Coley
Week of October 7, 2013
Log Files, Getting Data in STATA
Log files save your commands
cd /home/students/rycoley/bios514-517
• To change directory
log using stata-section-oct7, replace text
• To name log file (change stata-section-oct7)
• capture log close to close log file
insheet using http://courses.washington.edu/b517/Datasets/FEVdata.csv
• To get FEV data in
Defining, Labeling Variables
table smoke
• Currently coded as 1 and 2
• No missing data (would be coded as 9)
label define smokelabel 1 "smoker" 2 "non-smoker"
label values smoke smokelabel
label define sexlabel 1 "male" 2 "female"
label values sex sexlabel
Labeling Variables
label variable age "Age (years)"
label variable fev "FEV (L/s)"
label variable height "Height (in)"
Descriptive Statistics
Basic commands detailed in this week’s lecture notes:
• summarize
• means
• centile
• tabstat
• tabulate
Descriptive Stats by Group
bysort sex: tabstat fev, stat(n mean sd min p25 med p75 max) col(stat) format
bysort sex: tabulate smoke
Defining New Variables
A few ways:
• gen age9over = age>=9
• gen age9over = 0
replace age9over=1 if age>=9
• gen age9over = age==9 | age==10 | age==11 ... | age==19
Measures of Spread
• Range: tabstat fev, stat(min max range)
• Variance: tabstat fev, stat(var)
• Standard Deviation: tabstat fev, stat(sd)
• Interquartile Range: tabstat fev, stat(p25 p75 iqr)
• IQR is the distance between the 25th and 75th percentiles of the data
Visualizing Data- Histograms
histogram fev
to save: graph export hist-fev.png, replace
The height of each bar is proportional to the proportion of observations in that bin’s range
Visualizing Data- Histograms
histogram fev, kdensity by(sex)
kdensity adds a smooth line estimating the density
Visualizing Data- Dotplots
dotplot fev
Each dot represents an observation
Visualizing Data- Box Plots
• a.k.a. “Box and whiskers” plots
• Box extends from lower quartile (25th percentile of data) to upper quartile (75th percentile) with a line at the median (50th percentile).
• Whiskers extend from lower quartile to “lower adjacent value” and from upper quartile to “upper adjacent value”
\[ \text{LAV} = \text{lower quartile} - \tfrac{3}{2}\,\text{IQR}, \qquad \text{UAV} = \text{upper quartile} + \tfrac{3}{2}\,\text{IQR} \tag{1} \]
• Observations outside the UAV and LAV plotted as points
• (Some box plots have whiskers extend to minimum and maximum observations.)
Visualizing Data- Box Plots
graph box fev
Visualizing Data- Box Plots
graph box fev, over(sex)
Visualizing Data- Scatterplots
scatter fev height
Visualizing Data- Bar Charts
gen one=1
graph bar (count) one, over(smoke) ytitle("frequency")
Another Example
log using cause-of-death, text replace
set obs 10
input float deaths str30 cause
700142 "Heart Disease"
553768 "Cancer"
163538 "Cerebrovascular Disease"
123013 "Chronic respiratory disease"
101537 "Accidental Death"
71372 "Diabetes"
62034 "Flu and pneumonia"
53852 "Alzheimer’s disease"
39480 "Kidney disorder"
32238 "Septicemia"
Visualizing Data- Bar Charts
gen dthou=deaths/1000
graph hbar dthou, over(cause) ytitle("Annual deaths (thousands)")
Visualizing Data- Bar Charts
gen dthou=deaths/1000
graph hbar dthou, over(cause, sort(1) descending) ytitle("Annual deaths (thousands)")
Visualizing Data- Pie Charts
graph pie deaths, over(cause) sort descending
Doing it all over again in R!
Look at the code I have posted on the discussion board. It is extensively commented (##)! Comments omitted here.
data<-read.csv("FEVdata.csv",header=TRUE)
names(data)
dim(data)
n<-dim(data)[1]
(Re-)defining variables
Variables don’t have labels like in Stata. But we can improve upon the current coding of "smoke" and "sex".
data$SMOKE[data$SMOKE==2]<-0
data$FEMALE<-data$SEX==2
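R has no value labels, but a factor can play a similar role. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical, not the FEV data). It assumes the original 1/2 coding and reuses the label text from the Stata `label define` commands earlier in these notes.

```r
# Hypothetical toy data using the same 1/2 coding as the FEV file
toy <- data.frame(SMOKE = c(1, 2, 2, 1, 2),
                  SEX   = c(1, 2, 2, 1, 1))

# Attach labels to the numeric codes, similar in spirit to
# `label define` followed by `label values` in Stata
toy$smoke.f <- factor(toy$SMOKE, levels = c(1, 2),
                      labels = c("smoker", "non-smoker"))
toy$sex.f <- factor(toy$SEX, levels = c(1, 2),
                    labels = c("male", "female"))

# Tables now print the labels instead of the codes
smoke.tab <- table(toy$smoke.f)
print(smoke.tab)
print(table(toy$smoke.f, toy$sex.f))
```

Whether to recode to 0/1 (as above in these notes) or to use a factor is a matter of taste; factors tend to give more readable tables and plot legends.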
Creating a new variable:
data$age9over<-data$AGE>=9
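The Stata slides list a few ways to define the same indicator. A rough R counterpart is sketched below on a made-up `age` vector (hypothetical values in whole years, not the FEV data). The upper limit of 19 in the third version follows the Stata slide, and that version only agrees with the others when ages are whole numbers.

```r
# Hypothetical ages in whole years; not the FEV data
age <- c(3, 8, 9, 12, 19)

way1 <- age >= 9                 # logical vector (TRUE/FALSE)
way2 <- ifelse(age >= 9, 1, 0)   # numeric 0/1 indicator
way3 <- age %in% 9:19            # explicit list of values (whole years only)

print(data.frame(age, way1, way2, way3))
```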
Descriptive Statistics
summary(data$FEV) #min, 1Q, Med, Mean, 3Q, Max
mean(data$FEV)
quantile(data$FEV, probs=c(0.25, 0.5, 0.75))
table(data$SMOKE)
xtabs(~data$SMOKE+data$FEMALE) #to get cross tabulation
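For statistics by group (what `bysort sex: tabstat` does in Stata), one option in R is `tapply`. This is only a sketch on a small made-up data frame; `toy` and its values are hypothetical, and the column names simply mirror the ones used in these notes.

```r
# Hypothetical toy data; column names mirror the FEV data
toy <- data.frame(FEV    = c(2.0, 3.0, 4.0, 1.5, 2.5, 3.5),
                  FEMALE = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))

# Apply a summary function to FEV within each level of FEMALE
group.means <- tapply(toy$FEV, toy$FEMALE, mean)
group.sds   <- tapply(toy$FEV, toy$FEMALE, sd)
print(group.means)
print(group.sds)

# summary() within each group gives min, quartiles, median, mean, max
print(tapply(toy$FEV, toy$FEMALE, summary))
```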
Measures of Spread
range(data$FEV) #gives min and max
var(data$FEV) #variance
sd(data$FEV) #standard deviation
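The interquartile range from the Stata slides can also be computed in R. The sketch below uses a small made-up vector `x` (hypothetical values) and checks the built-in `IQR()` against the difference of the two quartiles. R and Stata may use different percentile definitions by default, so results on real data may differ slightly between the two programs.

```r
# Hypothetical values, chosen so the arithmetic is easy to follow
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, probs = c(0.25, 0.75))  # 25th and 75th percentiles
iqr.by.hand <- unname(q[2] - q[1])       # distance between the quartiles
iqr.builtin <- IQR(x)                    # same quantity, built in

range.width <- diff(range(x))            # max minus min

print(c(iqr.by.hand = iqr.by.hand, iqr.builtin = iqr.builtin,
        range.width = range.width))
```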
Histograms
hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")
To save the graph:
pdf(file="fev-hist-R.pdf")
hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV")
graphics.off()
[Figure: "Histogram of FEV"; x-axis FEV (L/s), y-axis Frequency]
Histograms
hist(data$FEV, xlab="FEV (L/s)", main="Histogram of FEV",
prob=TRUE)
lines(density(data$FEV))
[Figure: "Histogram of FEV" with density curve; x-axis FEV (L/s), y-axis Density]
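To see what the bar heights mean, the histogram object can be inspected without plotting it. This is a sketch on simulated values (the vector `x` is hypothetical, not the FEV data). With `prob=TRUE` the heights are densities: each bin's proportion divided by its width, so the bar areas sum to 1.

```r
set.seed(1)
x <- rnorm(200, mean = 2.6, sd = 0.9)   # simulated values, not the FEV data

h <- hist(x, plot = FALSE)              # compute the bins without drawing

prop <- h$counts / length(x)            # proportion of observations per bin
width <- diff(h$breaks)                 # width of each bin
area <- sum(h$density * width)          # total area of the density-scale bars

print(data.frame(lower = head(h$breaks, -1), upper = h$breaks[-1],
                 count = h$counts, proportion = prop, density = h$density))
```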
Histogram
par(mfrow=c(1,2)) # two panels side by side
hist(data$FEV[data$FEMALE==0], xlab="FEV (L/s)",
main="Males", ylim=c(0,80))
hist(data$FEV[data$FEMALE==1], xlab="FEV (L/s)",
main="Females", xlim=c(0,6))
[Figure: two histograms, "Males" and "Females"; x-axis FEV (L/s), y-axis Frequency]
Boxplot
boxplot(data$FEV, ylab="FEV (L/s)")
[Figure: box plot of FEV; y-axis FEV (L/s)]
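The limits in equation (1) of the box plot slides can be computed by hand. The sketch below does so on a small made-up vector `x` (hypothetical values). Note that R's `boxplot` computes the box from hinges, which can differ slightly from `quantile()`, and draws each whisker to the most extreme observation inside the limits, so `boxplot.stats()` is included for comparison.

```r
# Hypothetical values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))   # lower quartile
q3 <- unname(quantile(x, 0.75))   # upper quartile
iqr <- q3 - q1

lower.limit <- q1 - 1.5 * iqr     # LAV as defined in equation (1)
upper.limit <- q3 + 1.5 * iqr     # UAV as defined in equation (1)

# Observations outside the limits would be plotted as points
outside <- x[x < lower.limit | x > upper.limit]
print(c(lower.limit = lower.limit, upper.limit = upper.limit))
print(outside)

# What R's boxplot() would use: whisker ends, hinges, median, and outliers
bs <- boxplot.stats(x)
print(bs$stats)
print(bs$out)
```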
Boxplot
boxplot(data$FEV~data$FEMALE, ylab="FEV (L/s)",
xaxt="n")
axis(1, at=c(1,2), labels=c("Male", "Female"))
[Figure: box plots of FEV by sex (Male, Female); y-axis FEV (L/s)]
Scatter Plot
plot(data$FEV~data$HEIGHT, ylab="FEV (L/s)",
xlab="Height (in)")
[Figure: scatter plot of FEV against height; x-axis Height (in), y-axis FEV (L/s)]
Bar Plot
counts<-table(data$SMOKE)
barplot(counts, xlab="Smoker", xaxt="n")
axis(1, at=c(1,2), labels=c("No","Yes"))
[Figure: bar plot of smoking status counts; x-axis Smoker (No, Yes)]
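One detail worth knowing: `barplot` returns the x-positions of the bar midpoints, which need not be 1 and 2. A possible way to center the labels under the bars is sketched below with made-up counts (the numbers are hypothetical). `pdf(NULL)` is used only so the sketch runs without opening a plot window.

```r
pdf(NULL)                      # null device: draw without creating a file

counts <- c(3, 5)              # hypothetical counts, not the FEV data
mids <- barplot(counts, xlab = "Smoker", xaxt = "n")  # returns bar midpoints
axis(1, at = mids, labels = c("No", "Yes"))           # labels centered on bars

dev.off()
print(mids)
```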
Cause of Death Example in R
n.deaths<-c(700142, 553768, 163538, 123013,
101537, 71372, 62034, 53852, 39480, 32238)
cause<-c("Heart Disease", "Cancer", "Cerebrovascular Disease",
"Chronic Respiratory Disease", "Accidental death", "Diabetes",
"Flu and Pneumonia", "Alzheimer’s Disease", "Kidney Disorder",
"Septicemia")
n.deaths<-n.deaths/1000
Cause of Death Example
par(mar=c(4,6.5,1,1))
barplot(n.deaths, horiz=T, yaxt="n",
xlab="Number of Deaths (Thousands)", main="Cause of Death")
text(y=seq(1,11.35, 1.15), par("usr")[1], labels=cause,
srt=45, pos=2, xpd=T, cex=0.75)
[Figure: horizontal bar plot "Cause of Death"; x-axis Number of Deaths (Thousands), one bar per cause]
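The Stata version sorted the bars with `sort(1) descending`. A comparable R plot might look like the sketch below, which is an alternative to the `text()` approach rather than the code posted for the course. It orders the causes with `order()` and lets `barplot` place the labels via `names.arg`; the margin and `cex.names` values are guesses that may need adjusting, and `pdf(NULL)` is only there so the sketch runs without opening a plot window.

```r
pdf(NULL)   # null device: draw without creating a file

n.deaths <- c(700142, 553768, 163538, 123013, 101537,
              71372, 62034, 53852, 39480, 32238) / 1000
cause <- c("Heart Disease", "Cancer", "Cerebrovascular Disease",
           "Chronic Respiratory Disease", "Accidental death", "Diabetes",
           "Flu and Pneumonia", "Alzheimer's Disease", "Kidney Disorder",
           "Septicemia")

# Horizontal bars are drawn from the bottom up, so ascending order
# puts the largest cause at the top of the plot
ord <- order(n.deaths)

par(mar = c(4, 10, 1, 1))      # wide left margin for the labels (a guess)
mids <- barplot(n.deaths[ord], horiz = TRUE, names.arg = cause[ord],
                las = 1, cex.names = 0.75,
                xlab = "Number of Deaths (Thousands)",
                main = "Cause of Death")

dev.off()
```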
Cause of Death Example
pie(n.deaths, cause, main="Cause of Death" )
[Figure: pie chart "Cause of Death", one slice per cause]