the basics of stata: descriptive and summary statistics · 2/4/2014 · describe use the command...

The Basics of Stata: Descriptive and Summary Statistics

Rachael Meager

MIT

Monitoring & Evaluation Training Course for the Indian Economic Service

July 29, 2013

Opening Your Data

One way to open your data is manually, by clicking on “file -> open”.

Another way is to type in the command window. The command is “use” followed by the file path, which should be entered inside quotation marks.

Stata will open data files in the format “.dta”.

If you want to open other formats you need to import them. We’ll do that another time.
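As a minimal sketch of the “use” command, the lines below build a tiny made-up dataset, save it to a temporary .dta file and open it again. The values, the macro name students and the “clear” option (which discards whatever data is currently in memory) are illustrative choices, not part of the course data.

```stata
* Build a tiny illustrative dataset (values are made up).
clear
input byte id byte age int sat
1 19 1800
2 21 1650
3 18 2010
end

* Save it to a temporary .dta file; the macro name "students" is hypothetical.
tempfile students
save "`students'"

* Open the file again. The path goes inside quotation marks.
use "`students'", clear

* Confirm how many observations were loaded.
count
```

With a real file you would replace the temporary name with the full path to your .dta file.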

Describe

Use the command “describe” or “d” to get a basic description of your data and variables. It gives output like this:

    Contains data from D:\Divadhar\Downloads\students.dta
      obs:            30
     vars:            14                          8 Nov 2007 11:15
     size:         2,580 (99.9% of memory free)

                    storage  display    value
    variable name   type     format     label    variable label

    id              byte     %8.0g               ID
    lastname        str5     %9s                 Last Name
    firstname       str6     %9s                 First Name
    city            str14    %14s                City
    state           str14    %14s                State
    gender          str6     %9s                 Gender
    studentstatus   str13    %13s                Student Status
    major           str8     %9s                 Major
    country         str9     %9s                 Country
    age             byte     %8.0g               Age
    sat             int      %8.0g               SAT
    averagescoreg~e byte     %8.0g               Average score (grade)
    heightin        byte     %8.0g               Height (in)
    newspaperread~k byte     %8.0g               Newspaper readership (times/wk)

Codebook

“Codebook” gives a quick overview of the variables in the data file, organised by variable:

    sat                                          SAT

                 type:  numeric (int)
                range:  [1338,2309]          units:  1
        unique values:  30               missing .:  0/30
                 mean:  1848.9
             std. dev:  275.112
          percentiles:     10%     25%     50%     75%     90%
                          1503    1643    1817    2041  2257.5

    averagescoregrade                            Average score (grade)

                 type:  numeric (byte)
                range:  [63,96]              units:  1
        unique values:  20               missing .:  0/30
                 mean:  80.3667
             std. dev:  10.1114
          percentiles:     10%     25%     50%     75%     90%
                            66      71    79.5      88      95

List

Use “list” or “l” to display all (or some) of the variable values for all (or some) observations.

“l” on its own lists the value of every variable for every observation in the dataset (usually not a good idea!)

“list” can also be restricted to show certain information only:

Eg: l gender age (lists only gender and age)

Eg: l gender if age>10 (lists gender for observations with age above 10)

Eg: l in 1/10 (lists all variables for observations 1 through 10)

Side Note: We can also use “list” to give us information about a very small subset we might want to examine while cleaning data!
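A short sketch of these variants of “list”, run on a small made-up dataset (the values are illustrative only):

```stata
* Tiny illustrative dataset (values are made up).
clear
input str6 gender byte age
"Female" 19
"Male"    9
"Female" 25
"Male"   30
end

* Two variables, all observations.
list gender age

* One variable, only for observations with age above 10.
list gender if age > 10

* All variables, observations 1 through 2.
list in 1/2

* "if" and "in" can be combined to inspect a small subset while cleaning.
list gender age in 1/3 if age > 10
```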

Tab

“tab variable” breaks down any variable into its values and their frequencies. Very useful!

For example, if we want to know the frequency of each gender, we can use: tab gender

If you want a graphical representation use: tab gender, plot

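    . tab gender, plot

         Gender |      Freq.
    ------------+------------+---------------------
         Female |         15 |***************
           Male |         15 |***************
    ------------+------------+---------------------
          Total |         30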

Tab1

If you want to run “tab” on more than one variable at the same time, the best way is to use “tab1” rather than entering “tab” multiple times!

For example: tab1 gender major country
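The difference between “tab” and “tab1” can be sketched on a small made-up dataset (values are illustrative). “tab” is the short form of “tabulate”, which is used here in full.

```stata
* Tiny illustrative dataset (values are made up).
clear
input str6 gender str8 major
"Female" "Econ"
"Male"   "Math"
"Female" "Math"
"Male"   "Econ"
end

* One-way frequency table for a single variable.
tabulate gender

* Same table with a simple text bar chart of the frequencies.
tabulate gender, plot

* Separate one-way tables for each listed variable.
tab1 gender major
```

Note that “tabulate gender major” with two variables would produce a cross-tabulation instead, which is why “tab1” is the right tool for several one-way tables.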

Cross-tabulation

If you want to look at how the data are grouped across two variables, use “tab row-variable col-variable”.

Eg: tab major gender (major in the rows, gender in the columns)

Eg: tab gender major (gender in the rows, major in the columns)

Crosstab - Percentages

To see each cell as a percentage of its row total, use the add-on for tab: “, row nofreq”. The “nofreq” part hides the raw counts. For example: tab major gender, row nofreq

To see each cell as a percentage of its column total, use the add-on for tab: “, col nofreq”. For example: tab major gender, col nofreq
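A minimal sketch of a cross-tabulation with row and column percentages, using made-up values:

```stata
* Tiny illustrative dataset (values are made up).
clear
input str8 major str6 gender
"Econ" "Female"
"Econ" "Male"
"Econ" "Male"
"Math" "Female"
"Math" "Female"
"Math" "Male"
end

* Counts: major in the rows, gender in the columns.
tabulate major gender

* Each cell as a percentage of its row total; counts suppressed.
tabulate major gender, row nofreq

* Each cell as a percentage of its column total; counts suppressed.
tabulate major gender, column nofreq
```

Leaving out “nofreq” shows the counts and the percentages together in each cell.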

Summarize

“summarize” gives you a grid of all your variables with their number of observations, mean, standard deviation, min and max values. You can also use “summarize variable” to see just one. Stata commands are case-sensitive, so type them in lower case.

Use the add-on to summarize, “,detail”, to see percentiles for each of your variables as well!


“return list” displays the results that the most recent command, such as “summarize”, has stored for the variable it was run on:


“summarize” stores its results as scalars such as r(mean), r(sd) and r(N). These ‘r’ stats make the quantities available to later commands, which is hard to do otherwise in Stata.

If you need to generate new information or transform your data using means or variances, the r stats are a good way to make these quantities usable without having to enter them manually.

For example, if I want to get a picture of how the age in my sample is spread around the average age, I could first call age in “summarize”, then run “return list”, then create that variable using r(mean) by typing: “gen age_diff=age-r(mean)”. To see the results you can “tab” the new variable:

       age_diff |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           -7.2 |          5       16.67       16.67
           -6.2 |          5       16.67       33.33
           -5.2 |          2        6.67       40.00
           -4.2 |          3       10.00       50.00
            -.2 |          2        6.67       56.67
             .8 |          1        3.33       60.00
            2.8 |          1        3.33       63.33
            4.8 |          4       13.33       76.67
            5.8 |          1        3.33       80.00
            7.8 |          3       10.00       90.00
           11.8 |          1        3.33       93.33
           12.8 |          1        3.33       96.67
           13.8 |          1        3.33      100.00
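The same sequence can be sketched on three made-up ages whose mean is exactly 20. The variable name age_diff and the values are illustrative.

```stata
* Tiny illustrative dataset (values are made up); the mean of age is 20.
clear
input byte age
18
20
22
end

* Summarize stores its results in r().
summarize age

* Show everything that was stored: r(N), r(mean), r(sd), r(min), r(max), ...
return list

* Use the stored mean straight away. Running another command that stores
* results in r() before this line would overwrite r(mean).
generate age_diff = age - r(mean)

* Inspect the deviations from the mean.
tabulate age_diff
```

If other commands need to run in between, one option is to copy the value first, for example with “scalar age_mean = r(mean)”.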


Tabstat

“tabstat” is the command to use when you want to select certain variables and choose which summary statistics to see for them.

The general format is “tabstat variables, s(statistics list)”

Eg: tabstat age sat averagescoregrade heightin newspaperreadershiptimeswk, s(mean semean median sd var skew k count sum range min max)

Another add-on to tabstat is “,by(variable)” which breaks your other variables up according to that particular variable.

The syntax is “tabstat variables, s(statistics) by(variable)”
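A minimal sketch of “tabstat” with and without “by”, on made-up values. The option is written out in full as statistics(), of which s() is the short form.

```stata
* Tiny illustrative dataset (values are made up).
clear
input str8 major byte age int sat
"Econ" 18 1700
"Econ" 22 1900
"Math" 19 2000
"Math" 21 1800
end

* Chosen statistics for the chosen variables, whole sample.
tabstat age sat, statistics(mean sd min max)

* The same statistics computed separately for each major.
tabstat age sat, statistics(mean sd min max) by(major)
```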


By/Bysort

“by” is a great prefix when combined with “tab” as well.

But be careful, because you can’t use “by” unless the data are already sorted by that variable.

For example if you type “by major: tab age” on unsorted data you will get a “not sorted” error message.

You can use “sort variable” to sort by variable.

But if you only want to sort in order to use “by”, why not just use “bysort”! Eg: bysort major: tab age

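    . bysort major: tab age

    -> major = Econ

            Age |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             18 |          2       20.00       20.00
             19 |          2       20.00       40.00
             20 |          1       10.00       50.00
             21 |          1       10.00       60.00
             25 |          1       10.00       70.00
             28 |          1       10.00       80.00
             33 |          1       10.00       90.00
             37 |          1       10.00      100.00
    ------------+-----------------------------------
          Total |         10      100.00

    -> major = Math

            Age |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             18 |          3       30.00       30.00
             19 |          2       20.00       50.00
             20 |          1       10.00       60.00
             21 |          1       10.00       70.00
             26 |          1       10.00       80.00
             33 |          1       10.00       90.00
             38 |          1       10.00      100.00
    ------------+-----------------------------------
          Total |         10      100.00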
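The sorting requirement can be sketched on a small made-up dataset that is deliberately not sorted by major. The “capture noisily” prefix is used only so that the expected error does not stop a do-file.

```stata
* Tiny illustrative dataset (values are made up), not sorted by major.
clear
input str8 major byte age
"Math" 19
"Econ" 18
"Math" 21
"Econ" 22
end

* On unsorted data this reports a "not sorted" error; capture noisily
* shows the message but lets the remaining commands run.
capture noisily by major: tabulate age

* Option 1: sort first, then use by.
sort major
by major: tabulate age

* Option 2: bysort sorts and applies by in one step.
bysort major: tabulate age
```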

Correlation

Stata computes correlations between variables for us very easily using the command “correlate variables”.

We can do two variables, eg: correlate sat age

We can do a lot more than two variables, try it for yourself!

    . correlate sat age
    (obs=30)

                 |      sat      age
    -------------+------------------
             sat |   1.0000
             age |  -0.1260   1.0000
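A minimal sketch of “correlate” with two and then three variables, using made-up values:

```stata
* Tiny illustrative dataset (values are made up).
clear
input int sat byte age byte heightin
1800 19 66
1650 21 70
2010 18 64
1900 25 68
1750 30 72
end

* Correlation between two variables.
correlate sat age

* Correlation matrix for three variables.
correlate sat age heightin
```

Only the lower triangle of the matrix is printed, because the correlation of sat with age is the same as that of age with sat.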

Regressions

Stata can do OLS regressions very easily. The basic command is “regress”, and the abbreviation “reg” works too.

The syntax is “regress y x1 x2 x3”, with the dependent variable first.

This will give us not only the OLS estimates but also their standard errors, t statistics for the null hypothesis that each coefficient is zero, and confidence intervals.

It is very easy to calculate Eicker-White (heteroskedasticity-robust) standard errors using the add-on “,robust”. But let’s leave that for later.

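    . regress sat age

          Source |       SS       df       MS          Number of obs =      30
    -------------+------------------------------       F(  1,    28) =    0.45
           Model |  34846.8447     1  34846.8447       Prob > F      =  0.5070
        Residual |  2160067.86    28  77145.2805       R-squared     =  0.0159
    -------------+------------------------------       Adj R-squared = -0.0193
           Total |   2194914.7    29  75686.7138       Root MSE      =  277.75

    --------------------------------------------------------------------------
             sat |      Coef.   Std. Err.      t    P>|t|  [95% Conf. Interval]
    -------------+------------------------------------------------------------
             age |  -5.045587   7.507316    -0.67   0.507  -20.42363   10.33245
           _cons |   1976.049   195.8628    10.09   0.000   1574.842   2377.256
    --------------------------------------------------------------------------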
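A minimal sketch of the regression commands on made-up values. The numbers it produces will not match the output above, which comes from the course dataset.

```stata
* Tiny illustrative dataset (values are made up).
clear
input int sat byte age
1800 19
1650 21
2010 18
1900 25
1750 30
end

* OLS regression of sat on age plus a constant.
regress sat age

* The estimated slope and its standard error can be reused afterwards.
display "slope on age = " _b[age] ",  std. err. = " _se[age]

* Same regression with heteroskedasticity-robust standard errors.
* The coefficients are unchanged; only the standard errors differ.
regress sat age, robust
```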

Converting Data for Regressions

Suppose we want to include gender in the regression, but it is stored as a string.

Step 1: convert values to numeric values

Step 2: convert variable to numeric variable

Step 3: include in regression

    . replace gender="1" if gender=="Female"
    (15 real changes made)

    . replace gender="0" if gender=="Male"
    (15 real changes made)

    . destring gender, replace
    gender has all characters numeric; replaced as byte
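The three steps can be sketched end to end on made-up values, including the regression in Step 3:

```stata
* Tiny illustrative dataset (values are made up).
clear
input str6 gender byte age int sat
"Female" 19 1800
"Male"   21 1650
"Female" 18 2010
"Male"   25 1900
"Female" 30 1750
"Male"   22 1700
end

* Step 1: recode the string values as the characters "1" and "0".
replace gender = "1" if gender == "Female"
replace gender = "0" if gender == "Male"

* Step 2: convert the string variable to a numeric variable.
destring gender, replace

* Step 3: include the now-numeric variable in the regression.
regress sat age gender
```

After the conversion, gender equals 1 for women and 0 for men, so its coefficient is the estimated difference in sat between women and men at a given age.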

Graphs - Scatterplots

The command for scatter plots is “twoway scatter variable1 variable2”. Eg we can run twoway scatter sat age


The add-on “,mlabel(variable)” will put a label on the data points telling you the value of that variable.

Eg we can do: twoway scatter sat age, mlabel(lastname)


Another great addition is “lfit var1 var2”, which draws the OLS line of best fit (with a constant) for the two variables. It is a separate twoway plot rather than an option, so it is overlaid on the scatter plot with “||”.

When combining it with an option such as mlabel, the syntax is “twoway scatter var1 var2, mlabel(var3) || lfit var1 var2”

Eg: twoway scatter sat age, mlabel(lastname) || lfit sat age

We can also break up the scatter plots by a certain variable if we want to. We use our old friend “by(variable)” as an add-on to “twoway scatter”.

Eg: twoway scatter sat age, mlabel(lastname) by(gender, total)
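The scatter plot variants can be sketched on a small made-up dataset. The names and values are illustrative, and each command replaces the graph drawn by the previous one.

```stata
* Tiny illustrative dataset (names and values are made up).
clear
input str6 lastname str6 gender byte age int sat
"Smith" "Female" 19 1800
"Jones" "Male"   21 1650
"Lee"   "Female" 18 2010
"Khan"  "Male"   25 1900
"Diaz"  "Female" 30 1750
"Roy"   "Male"   22 1700
end

* Basic scatter plot: sat on the vertical axis, age on the horizontal.
twoway scatter sat age

* Label each point with the value of lastname.
twoway scatter sat age, mlabel(lastname)

* Overlay the OLS line of best fit; lfit is a second plot joined with ||.
twoway scatter sat age, mlabel(lastname) || lfit sat age

* One panel per gender, plus a panel for the whole sample.
twoway scatter sat age, mlabel(lastname) by(gender, total)
```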


Histogram

The command we use for our basic histogram is “histogram variable, frequency” Eg we can do: histogram age, frequency

We can also get Stata to plot the “best fitting” Normal Distribution for our data. The syntax is “histogram variable, frequency normal”.

Eg: histogram age, frequency normal


We can also break up our histogram into several histograms by the categories in a certain variable.

The syntax is “histogram variable1, frequency by(variable2, total)” – the “total” ensures you see the whole picture too. Eg: histogram age, frequency by(gender, total)
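A minimal sketch of the three histogram commands on made-up values:

```stata
* Tiny illustrative dataset (values are made up).
clear
input str6 gender byte age
"Female" 18
"Female" 19
"Female" 21
"Female" 30
"Male"   18
"Male"   20
"Male"   25
"Male"   33
end

* Bar heights show the number of observations rather than densities.
histogram age, frequency

* Add a normal curve with the same mean and standard deviation as the data.
histogram age, frequency normal

* One histogram per gender, plus one for the whole sample.
histogram age, frequency by(gender, total)
```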


Bar charts

Making bar charts is quite easy in Stata too!

The syntax is “graph bar variables, by(variable1)”: this gives a bar graph of the mean of each variable in the list, with a separate panel for each value of the variable you specified. Eg: graph bar age heightin, by(major)

[Figure: bar chart of mean of age and mean of heightin for each major (Econ, Math, Politics); vertical axis from 0 to 80]
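A minimal sketch of the bar chart command on made-up values. The second command, which uses the “over” option, is an alternative layout added here for comparison and is not part of the original example.

```stata
* Tiny illustrative dataset (values are made up).
clear
input str8 major byte age byte heightin
"Econ"     18 66
"Econ"     22 70
"Math"     19 64
"Math"     21 68
"Politics" 20 72
"Politics" 24 65
end

* Mean of age and mean of heightin, one panel per major.
graph bar age heightin, by(major)

* Alternative: the same means in a single panel, grouped by major.
graph bar (mean) age heightin, over(major)
```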

Questions?
