research methods lecture 3 more stata ian walker room s2.109 i.walker@warwick.ac.uk 02475...

Post on 11-Dec-2015

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Research MethodsLecture 3

More STATA

Ian WalkerRoom S2.109

i.walker@warwick.ac.uk 02475 23054

Slides available at:

http://www2.warwick.ac.uk/fac/soc/economics/pg/modules/rm/notes/iw_lectures/

Housekeeping announcement

Stat-Transfer• Use STAT-

TRANSFER to convert data.

• Click on• Stat-transfer is

“point and click”.

• Just tell it the file name and format

• and the format you want it in.

• Click “transfer”.

Stat Tran 6.lnk

Stat Transfer options• Useful options for creating a manageable

dataset from a large one:– Keep or drop variables– Change variable format

• E.g. float to integer

– Select observations• E.g. “where (income + benefits)/famsize < 4500”

• Can be used for reading a large STATA dataset and writing a smaller one

• Avoids doing this in STATA itself

Customising STATA• profile.do runs automatically when STATA

starts• Edit it to include commands you want to

invoke every time.set mem 200m.log using justincase.log, replace

• Define preferences for STATA’s look and feel– Click on Prefs in menu

• Colours, graph scheme, etc.• Save window positioning

Merging data - 1• file1 has id x1 x2 x3 , file2 has id x4 x4 x5.

• You can merge using “key” in BOTH files (id)

• But you need to sort both files first. use file1. sort id sorts file1 according to id variable

. save, replace

. use file 2

. sort id sorts file2 according to id variable

. merge using file1

. drop if _merge~=3 drops obs with any missing info

. save file3

Merging data - 2

• For each row (id) all vars in file1 added to corresponding row of file2 (if there is one).

• .merge creates a new variable, _merge – which =1 for those obs only in file1, =2 for

those only in file2, and =3 for those in both.

• So the syntax above drops those obs that don’t have data in both files– and saves the result containing x1-x6 in file3

• .append to add more obs on the same vars.

Collapsing data (use with care)

• Collapse converts the data in memory into a dataset of means (or sums, medians, etc.)

• This is useful when you want to provide summary info at a higher level of aggregation– For example, suppose a dataset contains data

on individuals – say their reg and whether u/e– To find the average u/e rates across reg type:

. collapse unemp, by(region)

leaves 1 obs for each reg and mean u/e rate.

Reshaping files• Data may be “long” but thin

– Eg each record is a household member– But there are few vars - say wage and hours

• Data may be “wide” but short– each record is a household and has lots of vars– (eg w1 w2 w3 hours1 hours2 hours3)

. reshape long inc ue, i(id) j(year) wide to long

. reshape wide inc ue, i(id) j(year) back to wide

. Handy for merging data together and for panel data

Syntax to remember>= means "greater or equal",  &     means    "and",  |       means    "or"  = means “set equal to”== means “is it equal to?”~= means “not equal” (or use != ). means missing valueFor example

. keep if x1>= 1 & x1<=3 | x1==7 & x1 ~= .

. gen x = log(y)

. reg y x if z == 1 & y != .

Using STATA as a calculator• .display command

– .dis 22/7– disp log(250)– di exp(3.6)– di chiprob(2,6.45) (i.e. 2 df, deviance 6.45)

• returns 0.398 (i.e. its significant at 5% level)– display _N

• Returns the sample size – (_N is the number of the last obs)

Using the data editor• Open a

datafile (eg auto.dta)

• Click on the icon

• Or type .edit • You can edit

datapoints!• Or just

browse the datafile

STATA’s editor• STATA has an editor that

allows you to create do files– Enter cmds – 1 per line– Save the commands in a

“do” file – Highlight commands and

click the button with page (or page with text) and down arrow to “run” (or “do”) commands.

Saving output• Scroll – best to open “log file” (and close it). • Click on file, log, begin . • Or type

. log using myoutputThen type some commands here and then. log close

• log command allows replace and append• Default is a .smcl file extension (to “view”)• It doesn’t save graphs

– Copy graphs (use cut and paste or use menus)

Saving commands• You might prefer own extension, say, .log

– then you get an ASCII file that anything can edit– you can translate files to and from smcl format

• click on file, log, translate and fill in the dialog box

• Logging your output is a good way of developing a .do file– since it saves the commands as well as output

• Or you can just log the commands– type .cmdlog using xxx

• You can turn logging off and back on– .log off then .log on when ready to resume

Useful tips for .do files.#delimit ; /*makes ; the end of line char) */.use mydata, clear ;.set more off ;.set mem 200m ;.set matsize 200 ;.log using xxxx.log, replace ;...log close ;.exit, clear ;

Handling string variablesencode • Use encode when the original var is a

character var (eg gender is "m" or "f’")• encode command does not produce dummy

variables, it just assigns numbers to each group defined by the character variable.

• In this example, gender was the original character var and sex is new numeric var:

. encode gender, gen(sex)• decode does the opposite

Extended generate (.egen)

egen• Useful when you need a new variable that

is the mean, median, etc. of another variable– for all observations or for groups of

observations. • Also useful when you need to simply

number groups of observations based on some classification variables.

• Great when you have panel data

.egen examples

. egen sumvar1 = sum(var1)creates sumvar1 as sum of values of var1

. egen meanvar1= mean(var1), by(var3) creates meanvar1 as mean of all values of var1

. egen counter = count(id), by(company)creates count as the number of companies with

nonmissing id’s

. egen groupid = group(month year)assigns a number to each month/year group

Saving typing - 1macro• For defining lists of vars (globally or locally).

.local macvar x1 x2 x3 x4 x5 x6

.reg y1 `macvar’

.reg y2 `macvar’

.reg y3 `macvar’

.reg y4 `macvar’

• macvar becomes string “x1 x2 x3 x4 x5 x6" • Pay careful attention to the different type of

quotation marks

Saving typing - 2for • Performs same command on several vars. • It can use several types of variable lists

.for var1-var25: replace @=. if @==99replaces 99 by missings)

.for var*: replace @=. if @=99 replaces 99 by missings

.for 1-3, ltype(numeric): gen q@==0creates q1=0, q2=0, q3=0

.for a b c, ltype(any): gen str2 @="x” creates a=x, b=x, c=x

Regression models - I

• Linear regression and related models when the outcome variable is continuous– OLS, 2SLS, 3SLS, IV, quantile reg, Box-Cox …

• Binary outcome data– the outcome variable is 0 or 1(or y/n)

• probit, logit, nested logit...;

• Multiple outcome data– the outcome variable is 1, 2, ...,

• conditional logit, ordered probit

Regression models - II• Count data

– the outcome variable is 0, 1, 2, ..., occurrences • Poisson regression, negative binomial

• Choice models– multinomial choice– A, B or C

• Multinomial logit, Random utility model, unordered probit, nested logit, ...etc

• Selection models– Truncated, censored

• Tobit, Heckman selection models; • linear regression or probit with selection

Regression models - III• STATA supports several special data types.• Once type is defined special commands

work• Time series

– Estimate ARIMA, and ARCH models– Estimators for autocorrelation and

heteroscedasticity– Estimate MA and other smoothers– Tests for auto, het, unit roots - h, d, LM, Q, ADF,

P-P …..– TS graphs

Special data types: survey

• Non-randomness induces OLS to be inefficient

• STATA can handle non-random survey data– see the “syv***” commands– Example (stratified sample of medical cases):

. webuse nhanes2f, clear

. svyset psuid [pweight=finalwgt], strata(stratid)

. svy: reg zinc age age2 weight female black orace rural

. reg zinc age age2 weight female black orace rural

Number of strata = 31 Number of obs = 9189 Number of PSUs = 62 Population size = 1.042e+08 Design df = 31 F( 7, 25) = 62.50 Prob > F = 0.0000 R-squared = 0.0698 ------------------------------------------------------------------------------ | Linearized zinc | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | -.1701161 .0844192 -2.02 0.053 -.3422901 .002058 age2 | .0008744 .0008655 1.01 0.320 -.0008907 .0026396 weight | .0535225 .0139115 3.85 0.001 .0251499 .0818951 female | -6.134161 .4403625 -13.93 0.000 -7.032286 -5.236035 black | -2.881813 1.075958 -2.68 0.012 -5.076244 -.687381 orace | -4.118051 1.621121 -2.54 0.016 -7.424349 -.8117528 rural | -.5386327 .6171836 -0.87 0.390 -1.797387 .7201216 _cons | 92.47495 2.228263 41.50 0.000 87.93038 97.01952 ------------------------------------------------------------------------------ . regress zinc age age2 weight female black orace rural Source | SS df MS Number of obs = 9189 -------------+------------------------------ F( 7, 9181) = 79.72 Model | 110417.827 7 15773.9753 Prob > F = 0.0000 Residual | 1816535.3 9181 197.85811 R-squared = 0.0573 -------------+------------------------------ Adj R-squared = 0.0566 Total | 1926953.13 9188 209.724982 Root MSE = 14.066 ------------------------------------------------------------------------------ zinc | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- age | -.090298 .0638452 -1.41 0.157 -.2154488 .0348528 age2 | -.0000324 .0006788 -0.05 0.962 -.0013631 .0012983 weight | .0606481 .0105986 5.72 0.000 .0398725 .0814237 female | -5.021949 .3194705 -15.72 0.000 -5.648182 -4.395716 black | -2.311753 .5073536 -4.56 0.000 -3.306279 -1.317227 orace | -3.390879 1.060981 -3.20 0.001 -5.470637 -1.311121 rural | -.0966462 .3098948 -0.31 0.755 -.7041089 .5108166 _cons | 89.49465 1.477528 60.57 0.000 86.59836 92.39093

Special data types: duration

• Survival time data– See the “st***” commands

.stset failtime /*sets the var that defines duration*/

• Estimates a wide variety of models to explain duration

Weibull example ….• ST regression supports Weibull, Cox PH

and other options. streg load bearings, distribution(weibull)

• After streg you can plot the estimated hazard with

. stcurve, cumhaz• STATA allows functions to be plotted by

specifying the function:– E.g. Weibull “hazard” model –

Special data types: Panel data

• STATA can handle “panel” data easily– see the “xt***” commands

• Common commands are.xtdes Describe pattern of xt data

.xtsum Summarize xt data

.xttab Tabulate xt data

.xtline Line plots with xt data

.xtreg Fixed and random effects

Panel data• An xt dataset looks like this:

pid yr_visit fev age sex height smokes ---------------------------------------------------------- 1071 1991 1.21 25 1 69 0 1071 1992 1.52 26 1 69 0 1071 1993 1.32 28 1 68 0 1072 1991 1.33 18 1 71 1 1072 1992 1.18 20 1 71 1 1072 1993 1.19 21 1 71 0

• xt*** cmds need vars identify person and “wave”: . iis pid . tis yr_visit• Or use the tsset command

. tsset pid yr_visit, yearly

Panel regression

• Once STATA has been told how to read the data it can perform regressions quite quickly:. xtreg y x, fe

. xtreg y x, re

Further advice• See Stephen Jenkins’ excellent course on

duration modelling in STATA• Steve Pudney’s excellent panel course

– Beware his example dataset is 30mb+

• To get up and running– Just have a go - you won’t break it!– Try some of the commands in this lecture

• To start to get proficient– Sign up for netcourses

top related