The Basics of Stata: Descriptive and Summary Statistics
Rachael Meager
MIT
Monitoring & Evaluation Training Course for the Indian Economic Service
July 29, 2013
Opening Your Data
One way to open your data is manually, by clicking on “File -> Open”.
Another way is to type in the command window. The command is “use” followed by the path to the file. The path should be entered inside quotation marks.
Stata opens data files in the “.dta” format.
If you want to open other formats you need to import them. We’ll do that
another time.
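Eg: a minimal sketch of the “use” command. It assumes the auto example dataset that ships with Stata (not the students data) and saves it to a temporary file only so that there is a real file path to open:

```stata
* Load Stata's bundled example dataset (an assumption; not the students data)
sysuse auto, clear

* Save a copy to a temporary file so there is a file path to open
tempfile mycopy
save "`mycopy'"

* Open the file by its path; "clear" drops whatever data is in memory first
use "`mycopy'", clear
describe, short
```

With your own data you would type the full path to the file instead of the temporary file name.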
Describe
Use the command “describe” or “d” to get a basic description of your data and variables. It gives output like this:
Contains data from D:\Divadhar\Downloads\students.dta
  obs:            30
 vars:            14                          8 Nov 2007 11:15
 size:         2,580 (99.9% of memory free)

                 storage  display   value
variable name    type     format    label   variable label
id               byte     %8.0g             ID
lastname         str5     %9s               Last Name
firstname        str6     %9s               First Name
city             str14    %14s              City
state            str14    %14s              State
gender           str6     %9s               Gender
studentstatus    str13    %13s              Student Status
major            str8     %9s               Major
country          str9     %9s               Country
age              byte     %8.0g             Age
sat              int      %8.0g             SAT
averagescoreg~e  byte     %8.0g             Average score (grade)
heightin         byte     %8.0g             Height (in)
newspaperread~k  byte     %8.0g             Newspaper readership (times/wk)
Codebook
“codebook” gives a quick overview of the variables in the data file, organised by variable: the type, range, number of unique and missing values, and summary statistics of each one.
sat                                                        SAT
  type: numeric (int)
  range: [1338,2309]            units: 1
  unique values: 30             missing .: 0/30
  mean: 1848.9
  std. dev: 275.112
  percentiles:    10%     25%     50%     75%     90%
                 1503    1643    1817    2041  2257.5

averagescoregrade                        Average score (grade)
  type: numeric (byte)
  range: [63,96]                units: 1
  unique values: 20             missing .: 0/30
  mean: 80.3667
  std. dev: 10.1114
  percentiles:    10%     25%     50%     75%     90%
                   66      71    79.5      88      95
List
Use “list” or “l” to display all (or some) of the variable values for all (or some) observations.
“l” on its own lists the value of every variable for every observation in the dataset (usually not a good idea!)
“list” can also be restricted to show certain information only:
Eg: l gender age (lists only the variables gender and age)
Eg: l gender if age>10 (lists gender for observations with age above 10)
Eg: l in 1/10 (lists the values in observations 1 through 10)
Side Note: We can also use “list” to give us information about a very small subset we might want to examine while cleaning data!
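Eg: a short sketch of restricting “list”, assuming Stata's bundled auto dataset (its variables make, price and mpg stand in for the students variables):

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Two variables, first five observations only
list make price in 1/5

* Two variables, only for observations that satisfy a condition
list make mpg if mpg > 30

* count reports how many observations the same condition selects
count if mpg > 30
```

Running “count” with the same condition first is a cheap way to check that a list will be short enough to read.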
Tab
“tab variable” breaks down any variable into its values and their frequencies. Very useful!
For example, to see the frequency of each gender, use: tab gender
If you want a graphical representation, use: tab gender, plot
. tab gender, plot
     Gender |  Freq.
     Female |     15  ***************
       Male |     15  ***************
      Total |     30
Tab1
If you want to run “tab” on more than one variable at the same time, the best way is to use “tab1” rather than entering “tab” multiple times!
For example: tab1 gender major country
Cross-tabulation
If you want to look at how data is grouped across two variables, use “tab row-variable col-variable”. The first variable forms the rows and the second forms the columns.
Eg: tab major gender
Eg: tab gender major
Crosstab - Percentages
To get row percentages (within each row, the share of observations falling in each column), add the options “, row nofreq” to tab. The “nofreq” part suppresses the raw counts. For example: tab major gender, row nofreq
To get column percentages (within each column, the share of observations falling in each row), add the options “, col nofreq” to tab. For example: tab major gender, col nofreq
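Eg: a sketch of both kinds of percentages side by side, assuming Stata's bundled auto dataset, with rep78 (repair record) as the row variable and foreign as the column variable:

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Row percentages: each row sums to 100
tab rep78 foreign, row nofreq

* Column percentages: each column sums to 100
tab rep78 foreign, col nofreq
```

Observations with a missing value in either variable are left out of the table unless the “missing” option is added.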
Summarize
“summarize” gives you a grid of all your variables with their number of observations, mean, standard deviation, min and max values. You can also use “summarize variable” to see just one.
Add the option “, detail” to summarize to see percentiles for each of your variables as well!
“return list” displays the results stored by the most recently run command, here the summary statistics of the last variable you summarized.
Useful information is stored as scalars in these ‘r’ results, such as r(mean) and r(sd), which would be hard to get at otherwise in Stata.
If you need to generate new information or transform your data using means or variances, the r results are a good way to make these quantities usable without having to enter them manually.
   age_diff |  Freq.   Percent     Cum.
       -7.2 |      5     16.67    16.67
       -6.2 |      5     16.67    33.33
       -5.2 |      2      6.67    40.00
       -4.2 |      3     10.00    50.00
        -.2 |      2      6.67    56.67
         .8 |      1      3.33    60.00
        2.8 |      1      3.33    63.33
        4.8 |      4     13.33    76.67
        5.8 |      1      3.33    80.00
        7.8 |      3     10.00    90.00
       11.8 |      1      3.33    93.33
       12.8 |      1      3.33    96.67
       13.8 |      1      3.33   100.00
For example, if I want to get a picture of how the age in my sample is spread around the average age, I could first run “summarize age”, then run “return list”, then create the new variable using r(mean) by typing: “gen age_diff_mean=age - r(mean)”. To see the results you can “tab” the new variable.
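Eg: the same steps as a sketch on Stata's bundled auto dataset, with mpg standing in for age (the variable name mpg_diff_mean is made up for this illustration):

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Step 1: summarize stores its results in r()
summarize mpg

* Step 2: see what was stored (r(N), r(mean), r(sd), ...)
return list

* Step 3: use r(mean) before another command overwrites the r() results
generate mpg_diff_mean = mpg - r(mean)

* Check: the deviations from the mean should average out to about zero
summarize mpg_diff_mean
```

The important design point is the ordering: r(mean) always refers to the most recent command that stored results, so the “generate” line should come straight after the “summarize” it depends on.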
Tabstat
“tabstat” is the command to use when you want to select certain variables and choose exactly which summary statistics to see for them.
The general format is “tabstat variables, s(statistics list)”
Eg: tabstat age sat averagescoregrade heightin newspaperreadershiptimeswk, s(mean semean median sd var skew k count sum range min max)
Another add-on to tabstat is “,by(variable)” which breaks your other variables up according to that particular variable.
The syntax is “tabstat variables, s(statistics) by(variable)”
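Eg: a sketch of tabstat with and without by(), assuming Stata's bundled auto dataset (price, mpg and weight broken down by foreign):

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Chosen statistics for three variables, whole sample
tabstat price mpg weight, s(mean sd min max)

* The same statistics computed separately for each value of foreign
tabstat price mpg weight, s(mean sd min max) by(foreign)
```

Unlike “by varname:” used as a prefix, the by() option of tabstat does not need the data to be sorted first.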
By/Bysort
“by” is a great prefix when combined with “tab” as well.
But be careful, because you can’t use “by” with a variable unless the data are already sorted by it.
For example, if you type “by major: tab age” on unsorted data you will get a “not sorted” error message.
You can use “sort variable” to sort the data by that variable.
But if you only want to sort in order to use “by”, why not just use “bysort”! Eg: bysort major: tab age
. bysort major: tab age

-> major = Econ
        Age |  Freq.   Percent     Cum.
         18 |      2     20.00    20.00
         19 |      2     20.00    40.00
         20 |      1     10.00    50.00
         21 |      1     10.00    60.00
         25 |      1     10.00    70.00
         28 |      1     10.00    80.00
         33 |      1     10.00    90.00
         37 |      1     10.00   100.00
      Total |     10    100.00

-> major = Math
        Age |  Freq.   Percent     Cum.
         18 |      3     30.00    30.00
         19 |      2     20.00    50.00
         20 |      1     10.00    60.00
         21 |      1     10.00    70.00
         26 |      1     10.00    80.00
         33 |      1     10.00    90.00
         38 |      1     10.00   100.00
      Total |     10    100.00
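Eg: a sketch of why “by” needs sorted data, assuming Stata's bundled auto dataset. The data are deliberately re-sorted by mpg first so that they are no longer in order of foreign:

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Put the data in a different order so it is not sorted by foreign
sort mpg

* "by" now fails; capture keeps the do-file running and stores the error code
capture noisily by foreign: summarize price
display "return code: " _rc

* Option 1: sort explicitly, then use by
sort foreign
by foreign: summarize price

* Option 2: let bysort do both steps in one line
bysort foreign: summarize price
```

Both options give the same output; bysort simply saves a line.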
Correlation
Stata computes correlations between variables for us very easily using the command “correlate variables”.
We can do two variables, eg: correlate sat age
We can do a lot more than two variables, try it for yourself!
. correlate sat age
(obs=30)

             |      sat      age
         sat |   1.0000
         age |  -0.1260   1.0000
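Eg: a sketch with more than two variables, assuming Stata's bundled auto dataset. The second command, “pwcorr”, is a related Stata command that is not covered in these slides; it is shown only as an alternative for when some variables have missing values:

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Correlation matrix for four variables; observations with a missing
* value in any listed variable are dropped from the whole matrix
correlate price mpg weight length

* Pairwise alternative: each correlation uses all observations available
* for that pair; sig adds p-values and obs adds the number of observations
pwcorr price mpg rep78, sig obs
```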
Regressions
Stata can do OLS regressions very easily. The basic command is “regress”, or just “reg” for short.
The syntax is “regress y x1 x2 x3”, where y is the dependent variable.
This gives us not only the OLS estimates but also their standard errors, t statistics for the null hypothesis that each coefficient is zero, and confidence intervals.
It is very easy to calculate Eicker-White (heteroskedasticity-robust) standard errors using the add-on “, robust”. But let’s leave that for later.
. regress sat age

      Source |       SS       df       MS          Number of obs =      30
                                                   F(  1,    28) =    0.45
       Model |  34846.8447     1  34846.8447       Prob > F      =  0.5070
    Residual |  2160067.86    28  77145.2805       R-squared     =  0.0159
                                                   Adj R-squared = -0.0193
       Total |   2194914.7    29  75686.7138       Root MSE      =  277.75

         sat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |  -5.045587   7.507316    -0.67   0.507    -20.42363    10.33245
       _cons |   1976.049   195.8628    10.09   0.000     1574.842    2377.256
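Eg: a sketch of the same kind of regression with and without robust standard errors, assuming Stata's bundled auto dataset (price regressed on mpg and weight; the scalar name b_classical is made up for this illustration):

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Ordinary OLS with the default standard errors
regress price mpg weight

* Coefficients are available afterwards as _b[varname]
scalar b_classical = _b[mpg]

* Same regression with heteroskedasticity-robust standard errors
regress price mpg weight, robust

* The point estimates do not change; only the standard errors do
display _b[mpg] - b_classical
```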
Converting Data for Regressions
Suppose we want to include gender in a regression, but it is stored as a string.
Step 1: convert values to numeric values
Step 2: convert variable to numeric variable
Step 3: include in regression
. replace gender="1" if gender=="Female"
(15 real changes made)
. replace gender="0" if gender=="Male"
(15 real changes made)
. destring gender, replace
gender has all characters numeric; replaced as byte
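Eg: the steps above overwrite the original text in gender. A sketch of two alternatives that keep the string variable intact is below. These are not the method used in the slides, and the four-observation dataset and the names female and gender_code are made up for this illustration:

```stata
* Tiny made-up dataset, for illustration only
clear
input str6 gender byte age
"Female" 19
"Male"   21
"Female" 25
"Male"   18
end

* Alternative 1: a 0/1 indicator, left missing where gender is missing
generate byte female = (gender == "Female") if !missing(gender)

* Alternative 2: encode numbers the categories alphabetically
* (Female = 1, Male = 2) and attaches the text as value labels
encode gender, generate(gender_code)

* Step 3: the indicator can go straight into a regression, eg:
* regress y age female   (y is a placeholder outcome variable)
tab gender female
```

The 0/1 indicator is usually the more convenient form for a regression, because its coefficient reads directly as the difference between the two groups.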
Graphs - Scatterplots
The command for scatter plots is “twoway scatter variable1 variable2”, which puts variable1 on the vertical axis and variable2 on the horizontal axis. Eg we can run: twoway scatter sat age
The add-on “,mlabel(variable)” will put a label on the data points telling you the value of that variable.
Eg we can do: twoway scatter sat age, mlabel(lastname)
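Another great addition is “lfit var1 var2”, which draws the OLS line of best fit for the two variables plus a constant. It is a separate plot type rather than an option, so it is overlaid on the scatter plot using “||”.
If combining it with an option such as mlabel, you can use the syntax “twoway scatter var1 var2, mlabel(var3) || lfit var1 var2”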
Eg: twoway scatter sat age, mlabel(last) || lfit sat age
We can also break up the scatter plots by a certain variable if we want to. We use our old friend “by(variable)” as an option to “twoway scatter”; adding “total” inside by() also draws a panel for the whole sample.
Eg: twoway scatter sat age, mlabel(lastname) by(gender, total)
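Eg: a sketch of the scatter plot variations, assuming Stata's bundled auto dataset (mpg against weight, with make as the label variable):

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* Scatter plot with labelled points and a fitted line overlaid;
* note the comma before mlabel() and the || before lfit
twoway scatter mpg weight, mlabel(make) || lfit mpg weight

* The same overlay written with parentheses instead of ||
twoway (scatter mpg weight, mlabel(make)) (lfit mpg weight)

* One panel per group, plus a panel for the whole sample
twoway scatter mpg weight, by(foreign, total)
```

The || form and the parentheses form are interchangeable; the parentheses make it easier to see which options belong to which plot.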
Histogram
The command we use for our basic histogram is “histogram variable, frequency” Eg we can do: histogram age, frequency
We can also get Stata to plot the “best fitting” Normal Distribution for our data. The syntax is “histogram variable, frequency normal”.
Eg: histogram age, frequency normal
We can also break up our histogram into several histograms by the categories in a certain variable.
The syntax is “histogram variable1, frequency by(variable2, total)” – the “total” ensures you see the whole picture too. Eg: histogram age, frequency by(gender, total)
Bar charts
Making bar charts is quite easy in Stata too!
The syntax is “graph bar variables, by(variable1)”: this gives a bar graph of the mean of each variable in the list, drawn separately for each value of the variable you specified. Eg: graph bar age heightin, by(major)
[Figure: bar chart of mean of age and mean of heightin by major (Econ, Math, Politics); vertical axis from 0 to 80]
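Eg: a sketch of two ways to group a bar chart, assuming Stata's bundled auto dataset. The over() option and the (median) statistic are standard parts of “graph bar” that are not covered in these slides:

```stata
* Example data shipped with Stata (an assumption; not the students data)
sysuse auto, clear

* by(): a separate panel for each value of foreign; bars show means
graph bar mpg trunk, by(foreign)

* over(): the groups sit side by side in a single panel instead
graph bar mpg trunk, over(foreign)

* A different statistic can be requested in parentheses before the variables
graph bar (median) mpg trunk, over(foreign)
```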
Questions?