Stata handbook (Stata prirucnik)

Upload: jasmin-jasmin

Post on 04-Jun-2018

  • 8/13/2019 Stata prirucnik

    INTRODUCTION TO STATA SOFTWARE

    21/09/2013

    SOFTWARE OPTIONS

    (NO SUCH THING AS THE BEST)

    MS Excel

    SPSS

    STATA

    EViews

    R

    SAS (+ SAP, Oracle, Business Objects, etc.)

    MATLAB

    SHAZAM

    Compare statistical packages online

    Install student/trial versions

    OBJECTIVES

    As per prof. Verbi

    Introduction to the Stata software:

    Capabilities

    User interface

    Menus vs commands

    Statistics features

    Regression features

    USEFUL INTRO INFO

    o STATA Help for all STATA commands

    o the STATA User's Guide and Reference Manual

    o STATA Journal - a quarterly publication containing articles about statistics,

    data analysis, teaching methods, and effective use of STATA's language.

    o STATA Tutorial for Stock and Watson, Introduction to Econometrics,

    Pearson, 2003

    o University of Toronto, Department of Economics, Elena Capatina

    o www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm

    o www.iies.su.se/~masa/stata.htm

    o www.princeton.edu/~erp/stata/main.html

    HOW TO GET DATA

    World Bank databases

    Penn World Tables

    COMPUSTAT

    OECD National Accounts Database

    Greene, Wooldridge, Hsiao, Gujarati, Kennedy, etc.

    Websites of professors

    Google Scholar

    STATA WINDOWS

    The command window

    The viewer/results window

    The review of commands window

    The variable window

    [Screenshot: the Stata main window, with the drop-down menu and the Review, Variables, Command, and Results windows labeled]

    [Screenshot annotation: copy + later paste]

    DATA EDITOR VS. DATA BROWSER

    Data editor shows you your data and you can edit it

    Data browser shows you your data but you cannot edit it

    Check this frequently, especially after commands you are

    unsure about

    [Screenshot labels: data browser, data editor]

    TYPES OF COMMANDS

    1. Administrative commands that tell STATA where to

    save results, how to manage computer memory, and so on

    2. Commands that tell STATA to read and manage

    datasets

    3. Commands that tell STATA to modify existing

    variables or to create new variables

    4. Commands that tell STATA to carry out the

    statistical analysis

    WORKING WITH STATA

    Once you have started STATA, you will see a large window containing several

    smaller windows. At this point you can load the dataset and begin the statistical analysis.

    STATA can be operated interactively or in batch mode.

    When you use STATA interactively, you type each STATA command in the STATA command window and hit the Return/Enter key on your keyboard.

    STATA executes the command and the results are displayed in the STATA Results window.

    Then you enter the next command, STATA executes it, and so forth, until the analysis is complete.

    Even the simplest statistical analysis will involve several STATA commands.

    WORKING WITH STATA

    When STATA is used in batch mode, all of the commands for the analysis are listed in a file, and STATA is told to read the file and execute all of the commands.

    These files are called do files by STATA and are saved using a .do suffix. When STATA executes a .do file, all of the empirical results for some work/paper/research study are produced.

    WORKING WITH STATA

    Using STATA in batch mode has three important advantages over using STATA interactively.

    1. .do files provide an audit trail for your work. The file provides an exact

    record of each STATA command that allows you to be more efficient in your research.

    2. .do files allow others to learn from your work, replicate your work and find other ways to improve their/your research.

    3. Everyone makes errors when using STATA. When a command contains an error, it will not be executed by STATA, or if it is, it will produce the

    wrong result. Following an error, it is often necessary to start the analysis from the beginning.

    WORKING WITH STATA

    If you are using STATA interactively, you must retype all of the commands.

    If you are using a do file, then you only need to correct the command containing the error and rerun the file.

    For these reasons, you are strongly encouraged to use .do files.

    WORKING WITH STATA

    1. From the command window

    2. Using a .do file

    A text file that can be edited using any text editor (the

    STATA do-file editor, Notepad, Word, etc.), but you need

    to save it as filename.do for STATA to read it

    File > Do... for STATA to execute all commands

    Let's run our first STATA .do file

    EXAMPLE: STATA1.DO

    clear

    log using caschool.log, replace

    use caschool.dta

    describe

    generate income = avginc*1000

    summarize income

    log close

    exit

    THE LOG USING COMMAND

    The log file is an output file

    Creates and saves a log with all the actions performed by

    STATA and all the results

    How do I view this log file?

    From the drop-down menu: File > Log > View..., and then

    search for your filename, keeping in mind it has the extension .log

    LOADING YOUR DATA

    3 ways to enter your data:

    1. If your data is in STATA format, for example,

    filename.dta, then enter: use filename.dta

    2. If your data is a comma delimited file, then enter: insheet

    using filename.txt

    3. Or simply copy and paste your data from your .xls file.

    Warning: This is the most troublesome and error-prone way of loading your data.

    For other formats, you can use the Stat/Transfer software to convert

    to STATA format
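
    A minimal sketch of the comma-delimited route, assuming Stata 13 or later, where import delimited supersedes insheet. The file name scores_example.csv and the three data rows are made up for illustration; the sketch writes the file first so that it can be read back.

```stata
* Build a tiny dataset, write it out as CSV, and read it back
clear
input id score
1 10
2 20
3 30
end
export delimited using "scores_example.csv", replace   // hypothetical file name
import delimited using "scores_example.csv", clear     // first row becomes the variable names
describe
```

    On older versions, insheet using scores_example.csv, clear plays the role of the import delimited line.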

    USEFUL COMMANDS:

    describe will list all the variables, their labels, types,

    and tell you the number of observations

    Two types of variables:

    1. Numerical

    2. String (usually appear in red in the data browser)

    You can convert a string variable to numerical using the

    destring command. For example, destring var1, replace or destring var1, force replace

    ** NO UNDO OPTION **
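
    Because there is no undo, one cautious pattern is to write the converted values into a new variable with generate() instead of replace. The sketch below uses made-up values; price_str and price are illustrative names, not variables from the course data.

```stata
clear
input str5 price_str
"10"
"25"
"n/a"
end
* generate() keeps the original string variable intact;
* force turns non-numeric entries such as "n/a" into missing (.)
destring price_str, generate(price) force
list
count if missing(price)
```

    Checking the count of missing values after a forced conversion shows how many entries could not be read as numbers.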

    MORE COMMANDS:

    generate or gen creates a new variable

    e.g. generate income = avginc*1000

    e.g. generate log_inc = log(income)

    e.g. gen inc_sq = (income)^2

    MORE COMMANDS:

    summarize or summ tells STATA to compute

    summary statistics (mean, standard deviations,

    etc.) for all variables

    This is useful to identify outliers and get an idea of

    your data!

    e.g. summarize

    e.g. summ income inc_sq
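
    The generate and summarize examples can be tried without the course data. This sketch assumes the auto dataset that ships with Stata as a stand-in, with price playing the role of income; the new variable names are illustrative.

```stata
sysuse auto, clear
generate price_k   = price/1000      // rescale an existing variable
generate log_price = log(price)      // natural logarithm
gen price_sq = price^2               // gen is the short form of generate
summarize price_k log_price price_sq
summarize price, detail              // detail adds percentiles, useful for spotting outliers
```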

    ENDING THE DO FILE

    log close closes the log file (caschool.log in the example) that contains the

    output.

    The command exit tells STATA that the program has ended.
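
    A sketch of the same do-file flow that runs without caschool.dta, assuming the auto dataset bundled with Stata as a stand-in. The log file name stata1_sketch.log and the variable price_k are hypothetical. The final exit is left out here so that more commands can follow in the same file.

```stata
* stata1_sketch.do: open a log, load data, create a variable, summarize, close the log
clear
capture log close                        // close any log left open by an earlier run
log using stata1_sketch.log, replace     // replace lets the file be rerun without an error
sysuse auto, clear                       // stands in for: use caschool.dta
describe
generate price_k = price/1000            // mirrors: generate income = avginc*1000
summarize price_k
log close
```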

    Let's run our second STATA .do file

    COMMENTS IN .DO FILE:

    Star (*): STATA ignores the text on a line that starts with *

    These lines can be used to describe what the commands are doing

    Allows you to write comments (usually administrative commands)

    SOME USEFUL COMMANDS

    #delimit ;

    Tells STATA that each STATA command ends with a semicolon.

    Useful for long commands

    Do not forget the ; and write it even after the comment lines

    that start with *.

    set more off

    Ensures STATA executes all commands without pausing.

    If the output is too long, the Results window might be filled, and STATA

    will display --more-- at the bottom and wait for a keypress before executing the remaining commands

    set memory 600m

    Increases memory available for STATA to do work (needed only in Stata 11 and earlier; Stata 12 and later manage memory automatically)
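
    A short sketch of how these settings look inside a do-file, assuming the bundled auto dataset. #delimit works only in do-files, not in the command window; #delimit cr switches back to one command per line.

```stata
set more off
#delimit ;
* comment lines also need the terminator ;
sysuse auto, clear ;
summarize price mpg
    weight length ;
#delimit cr
describe, short
```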

    OTHER COMMANDS

    tabulate shows the frequency and percent of each value of a certain variable in the dataset

    e.g. tabulate county

    generate ... if

    e.g. generate teachers_new= teachers if teachers10
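
    A sketch of tabulate and of generate with an if condition, assuming the bundled auto dataset. The variable mpg_high and the cutoff mpg > 25 are illustrative choices only; they are not the condition used on the slide.

```stata
sysuse auto, clear
tabulate foreign                        // frequency and percent of each value
generate mpg_high = mpg if mpg > 25     // observations failing the condition get missing (.)
count if missing(mpg_high)
```

    Observations that do not satisfy the if condition are not dropped; the new variable is simply missing for them.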

    MORE COMMANDS

    by performs whatever command is given for each category of a variable (the data must be sorted by that variable; by ..., sort: does the sorting for you)

    e.g. by county: summarize income

    e.g. by county, sort: summarize income

    sort simply sorts data in ascending order (for descending order find gsort)

    e.g. sort income

    e.g. sort county income
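
    The same commands on the bundled auto dataset, as a sketch; foreign stands in for county and price for income.

```stata
sysuse auto, clear
sort foreign
by foreign: summarize price       // works because the data are sorted by foreign
bysort foreign: summarize mpg     // same as: by foreign, sort: summarize mpg
gsort -price                      // descending order of price
list make price in 1/3
```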

    DELETING VARIABLES AND OBSERVATIONS

    drop

    use this command to delete variables or observations

    e.g. drop avginc deletes average income variable

    e.g. drop if teachers=7

    BASIC STATISTICAL RELATIONSHIPS

    Correlation:

    correlate

    Remember: Correlation coefficients which are close to -1 or +1 indicate a strong linear correlation. Values close to 0 indicate a

    weak linear correlation; 0 indicates no linear correlation at all.

    e.g. correlate income teachers

    e.g. correlate income teachers computer

    Regression:

    reg performs OLS regression predicting value of dependent

    variable from one or more independent variables

    e.g. reg income teachers

    e.g. reg income teachers computer
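
    A runnable sketch of both commands, assuming the bundled auto dataset in place of the course data. After regress, results such as the R-squared are stored in e() and can be displayed or reused.

```stata
sysuse auto, clear
correlate price mpg weight
regress price mpg weight
display "Observations: " e(N)
display "R-squared:    " e(r2)
```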

    GRAPHS: SCATTER PLOTS

    e.g. graph twoway scatter income computer

    e.g. graph twoway scatter income computer || lfitci income computer

    [Screenshot: STATA graph editor]

    SAVING YOUR DATA

    Saving data in Stata format:

    (the usual way) File > Save As... or

    save filename.dta

    (on my PC the file is saved to C:\data)

    Export your data in another format:

    File > Export (choose file format)

    SOME DATA CLEANING COMMANDS

    reshape transforms (converts) data from long to wide format or from wide to long format

    Before using reshape, you need to determine whether the data are in long or wide form.

    Also determine the logical observation (i) and the

    subobservation (j) by which to organize the data.

    RESHAPE COMMANDS

    Let's practice: use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clear

    Then save example2.dta

    Ask yourself:

    Q: What is the stem of the variable going from wide to long?

    A: The stems are ht and wt

    Q: What variable uniquely identifies an observation when it is in the wide form?

    A: famid and birth together uniquely identify the wide observations.

    Q: What do we want to call the variable which contains the suffix of ht (and wt)?

    A: Let's call the suffix age.

    From wide to long:

    reshape long stem-of-wide-vars, i(wide-id-var) j(var-for-suffix)

    Example:

    browse

    list famid birth ht1 ht2 wt1 wt2

    reshape long ht wt, i(famid birth) j(age)

    list famid birth ht wt
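
    If the online dataset is not reachable, the same reshape can be tried on a few typed-in rows. The numbers below are made up for illustration and are not the values in kidshtwt; only the variable names follow the example.

```stata
clear
input famid birth ht1 ht2 wt1 wt2
1 1 2.8 3.4 19 28
1 2 2.9 3.8 21 28
2 1 2.2 2.9 20 23
end
list famid birth ht1 ht2 wt1 wt2
* stems ht and wt; i() identifies a wide row; j() receives the numeric suffix
reshape long ht wt, i(famid birth) j(age)
list famid birth age ht wt
```

    Three wide rows with two suffixes become six long rows, one per child and age.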

    EGEN COMMAND

    Extended generate (egen) is more powerful than ordinary gen

    Examples:

    egen age_mean = mean(age), by(year)

    egen stdage = std(age)
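
    A sketch of the two egen calls on a tiny made-up dataset, so the results can be checked by hand: the group means are 25 and 50, and the standardized variable has mean 0 and standard deviation 1.

```stata
clear
input year age
2001 20
2001 30
2002 40
2002 60
end
egen age_mean = mean(age), by(year)   // mean of age within each year
egen stdage = std(age)                // (age - overall mean) / overall sd
list
summarize stdage
```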

    LAGGED VARIABLES

    [_n-1] tells STATA this is the previous observation

    [_n-2] is 2 observations before

    Examples:

    First sort your data!

    gen GDP_lagged= GDP[_n-1]

    gen GDP_2= GDP[_n-2]

    Other uses: Filling in missing data

    by ID: replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1];
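
    A sketch of lagging within groups, using made-up numbers; id, year and gdp are illustrative names. With the by prefix, _n counts observations inside each group, so the first year of every id gets a missing lag instead of borrowing the last value of the previous id.

```stata
clear
input id year gdp
1 2000 100
1 2001 110
1 2002 121
2 2000 50
2 2001 55
end
sort id year                               // first sort your data
by id: generate gdp_lagged = gdp[_n-1]     // previous observation within the same id
list
```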

    COLLAPSE COMMAND

    Let's practice: use http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm

    Then save example3.dta

    Example:

    create one record per family (famid) with the average of age (avgage) and average weight (avgwt) within each family, and the number of kids (numkids) per family

    collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)

    NEW DATA AFTER COLLAPSE

    famid avgage avgwt numkids

    1 6 40 3

    2 5.333333 50 3

    3 4 40 3
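
    A self-contained sketch of the collapse step. The nine rows below are illustrative values chosen so that the family averages and counts agree with the table above; they are typed in here so that no download is needed.

```stata
clear
input famid birth age wt
1 1 9 60
1 2 6 40
1 3 3 20
2 1 8 80
2 2 6 50
2 3 2 20
3 1 6 60
3 2 4 40
3 3 2 20
end
* one record per family: mean age, mean weight, number of kids
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
```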

    PRESERVING DATA

    preserve tells STATA to keep a copy of your data, so if your next commands modify it, you can come back to your original data

    restore gives you back your original data

    Example:

    use data1.dta

    preserve

    collapse (mean) age, by(family)

    save data2.dta

    restore
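
    The same pattern as a runnable sketch, assuming the bundled auto dataset in place of data1.dta; the output file name mean_price_example.dta is hypothetical.

```stata
sysuse auto, clear
preserve                                  // keep a copy of the 74-observation dataset
collapse (mean) price, by(foreign)        // data in memory now has 2 observations
save mean_price_example.dta, replace      // hypothetical file name
restore                                   // back to the original 74 observations
describe, short
```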

    SIMPLE REGRESSION

    Example:

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2

    browse

    regress science math female socst read

    [Screenshot: regression OUTPUT with three labeled blocks: ANOVA table, Model fit, Parameter estimates]

    ANOVA TABLE

    Source: Looking at the breakdown of variance in the outcome variable,

    these are the categories we examine: Model, Residual, and Total.

    Total variance is partitioned into the variance which can be explained

    by the independent variables (Model) and the variance which is not

    explained by the independent variables (Residual, sometimes called Error).

    SS: These are the Sum of Squares associated with the three sources of

    variance: Total, Model and Residual.

    df: These are the degrees of freedom associated with the sources of

    variance.

    DF

    The total variance has N-1 degrees of freedom. The model degrees of freedom

    correspond to the number of coefficients estimated minus 1. Including the

    intercept, there are 5 coefficients, so the model has 5-1=4 degrees of freedom. The

    Residual degrees of freedom is the DF total minus the DF model, 199-4=195.

    DF is the number of free or linearly independent observations used in the

    calculation of the statistic. The DF of a statistic is the number of quantities that enter

    into the calculation of the statistic minus the number of constraints connecting these

    quantities. It can also be described as the number of independent pieces of information available to

    estimate another piece of information.

    That is, the number of degrees of freedom is the number of independent

    observations in a sample of data that are available to estimate a parameter of the

    population from which that sample is drawn.

    MS: These are the Mean Squares: the Sum of Squares divided by their respective DF.

    MODEL FIT

    Number of obs: This is the number of observations used in the regression analysis.

    F(4, 195): This is the F-statistic. It is the Mean Square Model (2385.93019) divided by the Mean Square Residual

    (51.0963039), yielding F=46.69. The numbers in parentheses are the Model and Residual degrees of freedom from the previous ANOVA table.

    Prob > F: This is the p-value associated with the F-statistic. It is used in testing the null hypothesis that all of the

    model coefficients (other than the intercept) are 0.

    R-squared: This is the proportion of variance in the dependent variable (science) which can be explained by the independent variables (math, female, socst and read).

    MODEL FIT

    R-squared is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

    Adj R-squared: This is an adjustment of the R-squared that

    penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed using the following formula:

    $1 - (1 - R^2)\frac{N - 1}{N - k - 1}$, where $k$ is the number of predictors.

    Root MSE: Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).
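
    The model-fit numbers can be recomputed from the results regress stores in e(). This sketch assumes the bundled auto dataset, not hsb2, so its numbers differ from the output discussed here; the scalar names are illustrative.

```stata
sysuse auto, clear
regress price mpg weight
* F = Mean Square Model / Mean Square Residual
scalar F_manual    = (e(mss)/e(df_m)) / (e(rss)/e(df_r))
* Adjusted R-squared, with k = e(df_m) predictors
scalar r2a_manual  = 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
* Root MSE = square root of the Mean Square Residual
scalar rmse_manual = sqrt(e(rss)/e(df_r))
display "F:        " F_manual    "  vs  " e(F)
display "Adj R2:   " r2a_manual  "  vs  " e(r2_a)
display "Root MSE: " rmse_manual "  vs  " e(rmse)
```

    Each hand-computed value should agree with the one reported in the regression header.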

    PARAMETER ESTIMATES

    science: This column shows the dependent variable at the top (science) with the predictor variables below it (math, female, socst, read and _cons).

    The last variable (_cons) represents the constant or

    intercept.

    Coef.: These are the values for the regression equation

    for predicting the dependent variable from the independent variables.

    The regression equation can have the following form

    $\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$

    PARAMETER ESTIMATES

    The column of estimates provides values for $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$.

    science(predicted) = 12.32529 + .3893102 math - 2.009765 female + .0498443 socst + .3352998 read

    math: The coefficient is .3893102. So for every unit increase in math, a .389 point increase in science is predicted, holding all other variables constant.

    Be careful! Since female is coded 0/1 (0=male, 1=female), we interpret the coefficient as follows: the predicted science score would be 2 points lower for a female than for a male, holding all other variables constant.

    PARAMETER ESTIMATES

    socst: The coefficient for socst is .0498443. So for every unit increase in socst, we expect an approximately .05 point increase in the science score, holding all other variables constant.

    read: The coefficient for read is .3352998. So for every unit increase in read, we expect a .34 point increase in the science score, holding all other variables constant.

    t and P>|t|: These columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.

    Remember: In significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level (alpha), which is often 0.01 or 0.05. When the null hypothesis is rejected, the result is said to be statistically significant.

    [95% Conf. Interval]: This shows a 95% confidence interval for the coefficient.

    Remember: The coefficient will not be statistically significant at the 5% level if the 95% confidence interval includes 0.

    PREDICTED VALUES

    After the regression, type predict yhat

    This command creates a new variable yhat with the predicted values

    for the dependent variable (science).

    Next ...

    Regression diagnostics is beyond the scope of our short intro course, but ...

    Some issues:

    heteroskedasticity (when disturbances do not all have the same variance),

    autocorrelation (when disturbances are correlated with one another),

    multicollinearity (two or more independent variables are approximately linearly related in the sample data)

    DIAGNOSTICS

    Let's check homoscedasticity of residuals

    predict r, residuals

    One of the main assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals.

    If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant, then the residual variance is said to be heteroscedastic.

    DIAGNOSTICS

    There are graphical and non-graphical methods for detecting heteroscedasticity.

    A commonly used graphical method is to plot the residuals versus fitted (predicted) values. We do

    this by issuing the rvfplot command. yline(0) puts a reference line at y=0.

    rvfplot, yline(0)

    DIAGNOSTICS

    estat imtest (White test)

    estat hettest (Breusch-Pagan test)

    DIAGNOSTICS

    White and Breusch-Pagan tests test the null hypothesis that the variance of the residuals is homogeneous.

    If the p-value is very small, we would have to reject the hypothesis and accept the alternative hypothesis that the

    variance is not homogeneous. In this case, the evidence is against the null hypothesis

    that the variance is homogeneous.

    These tests are very sensitive to model assumptions, such

    as the assumption of normality. So, it is a common practice to combine the tests with diagnostic plots to make a judgment on the severity of the heteroscedasticity and to decide if any correction is needed for heteroscedasticity.
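
    A sketch of the post-regression steps in one place, assuming the bundled auto dataset in place of hsb2. The residual variable is named resid here for readability. The white option asks estat imtest to report White's original test explicitly.

```stata
sysuse auto, clear
regress price mpg weight
predict yhat                  // fitted values (the default)
predict resid, residuals      // residuals
estat imtest, white           // White's test
estat hettest                 // Breusch-Pagan test
display "Breusch-Pagan p-value = " r(p)
```

    In an interactive session, rvfplot, yline(0) run after the same regression gives the residual-versus-fitted plot to read alongside the two tests.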

    LINEAR REGRESSION WITH PANEL DATA

    Declaring the data to be a panel:

    Example, where data consists of many firms, each observed over 5 years

    iis Firm ;

    tis Year ;

    (in Stata 10 and later, the single command xtset Firm Year ; replaces iis and tis)

    xt is the prefix for the commands in this class

    xtreg should be used for regressions with panel

    data

    FIXED EFFECTS:

    $y_{it} = a + x_{it} b + v_i + e_{it}$

    i.e. xtreg lnc lny, fe

    Equivalent to including a dummy variable for each case (i.e. firm). But not really!

    RANDOM EFFECTS (RE)

    If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using

    random effects.

    Stata's RE estimator is a weighted average of fixed

    and between effects

    i.e. xtreg lnc lny, re

    CHOOSING BETWEEN FIXED AND RANDOM EFFECTS

    Running a Hausman test: estimate the FE model, save the coefficients, estimate

    the RE model, and then do the comparison.

    Example:

    xtreg dependentvar var1 var2 var3 ... , fe

    estimates store fixed

    xtreg dependentvar var1 var2 var3 ... , re

    estimates store random

    hausman fixed random

    If the p-value is significant (for example, below 0.05), use FE; otherwise RE is preferred

    Source: http://dss.princeton.edu/online_help/analysis/panel.htm
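
    A sketch of the whole panel workflow on simulated data, so it runs without any download. The data-generating lines (10 firms, 5 years, the seed, and the coefficients) are assumptions made purely for illustration; lnc and lny reuse the variable names from the xtreg examples, and xtset is used in place of iis and tis.

```stata
clear
set seed 12345
set obs 50                                   // 10 firms x 5 years
generate firm = ceil(_n/5)
bysort firm: generate year = 2000 + _n       // years 2001 to 2005 within each firm
generate lny = rnormal()
generate lnc = 0.5*lny + firm/10 + rnormal() // made-up relationship with a firm effect
xtset firm year                              // declare the panel structure
xtreg lnc lny, fe
estimates store fixed
xtreg lnc lny, re
estimates store random
hausman fixed random
```

    With simulated data the test outcome itself carries no meaning; the point is the order of the commands.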

    TIME SERIES DATA

    tsset declares data to be time-series data

    Examples:

    tsset time, yearly (For an annual time series, time takes on values such as 1990, 1991, ...)

    tsset company year, yearly (For yearly panel data, variable company being the panel ID variable and year being a four-digit calendar year)
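
    Once the data are declared with tsset, time-series operators such as L. (lag) become available and respect the time variable. A sketch with three made-up yearly values; gdp and gdp_lag are illustrative names.

```stata
clear
input year gdp
1990 100
1991 104
1992 109
end
tsset year, yearly
generate gdp_lag = L.gdp     // value of gdp in the previous year
list
```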

    QUESTIONS

    Ensar Sehic, PhD, Assistant Professor

    Academic Unit for Quantitative Economics

    University of Sarajevo, School of Economics and Business

    Trg Oslobodjenja 1, office #69, 71000 Sarajevo, B&H

    Tel: +387 33 253 767

    Mob: +387 62 225 123

    Email: [email protected]: ensar.sehic