Key Data Management Tasks in Stata
FHSS Research Support Center
fhssrsc.byu.edu
115 and 116 SWKT
Investigate Duplicates in the Data (1a.)
If you suspect that duplicates exist in your data, as in this example…
. use http://fhssweb4:1019/duplicates.dta, clear
. list in 1/15, noobs compress separator(15)
  id   female      ses   read   write   math
   1   female      low     34      44     84
   1   female      low     34      44     40
   2   female   middle     39      41     33
   2   female   middle     39      41     33
   2   female   middle     39      41     33
   3     male      low     63      65     48
   3     male      low     63      65     48
   4   female      low     44      50     41
   4   female      low     44      50     41
   5     male      low     47      40     43
   5     male      low     47      40     43
   6   female      low     47      41     46
   7     male   middle     57      54     59
   8   female      low     39      44     52
   9     male   middle     48      49     52
. //Note different math score for 1st & 2nd obs
You can use duplicates report to investigate…
. duplicates report
Duplicates in terms of all variables
  copies   observations   surplus
       1            197         0
       2              6         3
       3              3         2
. duplicates report id female ses
Duplicates in terms of id female ses
  copies   observations   surplus
       1            195         0
       2              8         4
       3              3         2
Observations with 1, 2, or 3 copies
Most observations are unique
In terms of all variables, 3 distinct observations appear twice (6 rows) and 1 appears three times (3 rows)
When the report is given in terms of only some of the variables, there are more duplicated obs.
View the Duplicates in the Data (1b.)
. duplicates list, sepby(id) //new line when id changes
Duplicates in terms of all variables
  group:   obs:   id   female      ses   read   write   math
       1      3    2   female   middle     39      41     33
       1      4    2   female   middle     39      41     33
       1      5    2   female   middle     39      41     33
       2      6    3     male      low     63      65     48
       2      7    3     male      low     63      65     48
       3      8    4   female      low     44      50     41
       3      9    4   female      low     44      50     41
       4     10    5     male      low     47      40     43
       4     11    5     male      low     47      40     43
4 observations are completely duplicated in all variables: the first one 3 times and the others twice; Stata creates a different “group:” for each observation that appears duplicated
. duplicates list id female ses, sepby(id)
Duplicates in terms of id female ses
  group:   obs:   id   female      ses
       1      1    1   female      low
       1      2    1   female      low
       2      3    2   female   middle
       2      4    2   female   middle
       2      5    2   female   middle
       3      6    3     male      low
       3      7    3     male      low
       4      8    4   female      low
       4      9    4   female      low
       5     10    5     male      low
       5     11    5     male      low
5 observations are duplicated in id, female, and ses, because observations 1 and 2 only differ in math
Create a Variable to Tag Duplicates (1c.)
New variable is 0 if the observation is unique, 1 if there is one duplicate of it, 2 if there are two duplicates of it, etc.
. duplicates tag id female ses, gen(dup_id)
Duplicates in terms of id female ses
. list if dup_id >=1, sepby(id)
        id   female      ses   read   write   math   dup_id
   1.    1   female      low     34      44     84        1
   2.    1   female      low     34      44     40        1
   3.    2   female   middle     39      41     33        2
   4.    2   female   middle     39      41     33        2
   5.    2   female   middle     39      41     33        2
   6.    3     male      low     63      65     48        1
   7.    3     male      low     63      65     48        1
   8.    4   female      low     44      50     41        1
   9.    4   female      low     44      50     41        1
  10.    5     male      low     47      40     43        1
  11.    5     male      low     47      40     43        1
We can see the difference in math scores for observations 1 and 2, which is why duplicates report and duplicates report id female ses gave us different outputs. Let’s set them both equal to 84.
. replace math = 84 if id ==1
(1 real change made)
Drop the Duplicate Observations (1d.)
The command duplicates drop keeps the first observation in each group of duplicates and drops the surplus copies.
. duplicates drop
Duplicates in terms of all variables
(6 observations deleted)
Now we run duplicates report to check that all of the duplicate observations have been deleted.
. duplicates report
Duplicates in terms of all variables
  copies   observations   surplus
       1            200         0
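The steps in slides 1a. to 1d. can be collected into one short do-file. This is only a sketch: the five rows typed in with input are made up to mimic the workshop file (which is not assumed to be available), and the final isid check is an addition, not part of the slides.

```stata
* Minimal sketch with made-up rows (not the workshop's duplicates.dta)
clear
input id str6 female str6 ses read write math
1 "female" "low"    34 44 84
1 "female" "low"    34 44 40
2 "female" "middle" 39 41 33
2 "female" "middle" 39 41 33
3 "male"   "low"    63 65 48
end

duplicates report                                // exact duplicates on all variables
duplicates report id female ses                  // duplicates on these three variables only
duplicates tag id female ses, generate(dup_id)   // 0 = unique, 1 = one other copy, ...
list, sepby(id)                                  // inspect before deleting anything
duplicates drop                                  // keep the first copy of each exact duplicate
isid id read write math                          // stops with an error if rows are still not unique
```

Inspecting with duplicates list or duplicates tag before running duplicates drop is the safer order, because near-duplicates such as the two id 1 rows usually signal a data-entry problem rather than a true repeat.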
Label the Values of a Numeric Variable (2a.)
. use val_labels.dta, clear
. use http://fhssweb4:1019/valuelabels.dta, clear
. list in 1/10, noobs
  make               price   mpg   foreign
  Merc. Zephyr       3,291    20         0
  Chev. Chevette     3,299    29         0
  Chev. Monza        3,667    24         0
  Toyota Corolla     3,748    31         1
  Subaru             3,798    35         1
  AMC Spirit         3,799    22         0
  Merc. Bobcat       3,829    22         0
  Renault Le Car     3,895    26         1
  Chev. Nova         3,955    19         0
  Dodge Colt         3,984    30         0
The variable foreign is currently displayed as a binary numeric variable.
. label define foreign_lbl 0 "domestic car" 1 "foreign car"
Creates a labeling scheme called “foreign_lbl”, but nothing happens to the data yet
. label values foreign foreign_lbl
Applies the labeling scheme “foreign_lbl” to the variable foreign
. list in 1/10, noobs
  make               price   mpg   foreign
  Merc. Zephyr       3,291    20   domestic car
  Chev. Chevette     3,299    29   domestic car
  Chev. Monza        3,667    24   domestic car
  Toyota Corolla     3,748    31   foreign car
  Subaru             3,798    35   foreign car
  AMC Spirit         3,799    22   domestic car
  Merc. Bobcat       3,829    22   domestic car
  Renault Le Car     3,895    26   foreign car
  Chev. Nova         3,955    19   domestic car
  Dodge Colt         3,984    30   domestic car
The labels are now displayed for the variable foreign, which is more helpful, but the actual values in the data are still 0 and 1.
Now Let’s Look at the Code In-Depth (2a.)
. label define foreign_lbl 0 "domestic car" 1 "foreign car"
    label define: says we want to define a labeling scheme that will be stored in Stata’s memory, and later applied to variables
    foreign_lbl: name of the labeling scheme that we want to create
    0 "domestic car" 1 "foreign car": the actual labeling scheme (which labels go with which numbers)
. label values foreign foreign_lbl
    label values: says we want to apply a labeling scheme to a specific variable
    foreign: name of the variable to which we want to apply the labeling scheme
    foreign_lbl: name of the labeling scheme we want to apply
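A runnable sketch of the two-step labeling pattern follows. It assumes the auto.dta example file that ships with Stata rather than the workshop's valuelabels.dta; because auto.dta already labels foreign, the first step detaches that label so the define/values sequence starts from bare 0/1 codes.

```stata
sysuse auto, clear
label values foreign .                       // detach the label that ships with auto.dta
label define foreign_lbl 0 "domestic car" 1 "foreign car"
label values foreign foreign_lbl             // attach the new labeling scheme
list make foreign in 1/5, noobs              // displays the label text
list make foreign in 1/5, noobs nolabel      // displays the stored 0/1 values
label list foreign_lbl                       // shows the code-to-text mapping
count if foreign == 1                        // conditions still use the numeric codes
```

Conditions such as if foreign == 1 keep working after labeling because only the display changes, not the stored values.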
Create Variable Labels (2b.)
. webuse hbp4, clear
. label variable hbp "high blood pressure"
    hbp: variable we want to label
    "high blood pressure": label we want to give it
. describe
Contains data from http://www.stata-press.com/data/r12/hbp4.dta
  obs:         1,130
 vars:             7                          22 Jan 2011 11:12
 size:        19,210
                storage   display   value
variable name   type      format    label     variable label
id              str10     %10s                Record identification number
city            byte      %8.0g
year            int       %8.0g
age_grp         byte      %8.0g
race            byte      %8.0g
hbp             byte      %8.0g               high blood pressure
female          byte      %8.0g     sexlbl
Note the difference between a variable label and a value label: a variable label describes the variable as a whole (as on hbp), while a value label attaches text to the individual numeric codes (as sexlbl does for female).
Create a Labeled Categorical Variable from a Continuous Numeric Variable (3.)
. use http://fhssweb4:1019/recode.dta, clear
. list in 1/8, noobs
  make                 mpg   foreign
  Cad. Seville          21   domestic car
  Cad. Eldorado         14   domestic car
  Linc. Mark V          12   domestic car
  Linc. Versailles      14   domestic car
  Peugeot 604           14   foreign car
  Volvo 260             17   foreign car
  Linc. Continental     12   domestic car
  Cad. Deville          14   domestic car
We have a continuous numeric variable (mpg)…
…but instead we want a variable which groups observations into 3 categories, based on mpg …
. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)
(74 differences between mpg and efficiency)
. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=moderately efficient;
> over 30 mpg=efficient"
. list in 1/8, noobs
  make                 mpg   foreign        efficiency
  Cad. Seville          21   domestic car   moderately efficient
  Cad. Eldorado         14   domestic car   inefficient
  Linc. Mark V          12   domestic car   inefficient
  Linc. Versailles      14   domestic car   inefficient
  Peugeot 604           14   foreign car    inefficient
  Volvo 260             17   foreign car    moderately efficient
  Linc. Continental     12   domestic car   inefficient
  Cad. Deville          14   domestic car   inefficient
…note that the actual values of the new variable are numbers, but it will display value labels. This is what we need for analysis.
. list in 1/8, noobs nolabel ab(10)
  make                 mpg   foreign   efficiency
  Cad. Seville          21         0            2
  Cad. Eldorado         14         0            1
  Linc. Mark V          12         0            1
  Linc. Versailles      14         0            1
  Peugeot 604           14         1            1
  Volvo 260             17         1            2
  Linc. Continental     12         0            1
  Cad. Deville          14         0            1
Now Let’s Look at the Code In-Depth (3.)
. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)
    recode: change the values of a variable based on some coding rules
    mpg: variable whose values I want to change
    (min/14=1 …): first rule: if the value is between the lowest number and 14, make it a 1…
    "inefficient": …and give it a value label of “inefficient”
    >: this just means that the command took up more than one line
    gen(efficiency): says that rather than alter the values of mpg, we want to just create a new variable called efficiency
    label(effcny_lbl): the set of value labels that we are defining will be saved as effcny_lbl in Stata’s memory
. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=moderately efficient;
> over 30 mpg=efficient"
    label variable: create a variable label (not to be confused with a value label) describing how the coding rules work
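The recode pattern above can be tried on the auto.dta file that ships with Stata; that dataset is an assumption here, standing in for the workshop's recode.dta. This sketch writes the last range as 31/max instead of 30/max so that a car with exactly 30 mpg belongs to only one range, and it uses /// line continuation, which works in do-files.

```stata
sysuse auto, clear
recode mpg (min/14 = 1 "inefficient")            ///
           (15/30  = 2 "moderately efficient")   ///
           (31/max = 3 "efficient"),             ///
           generate(efficiency) label(effcny_lbl)
label variable efficiency "mpg category: 1-14, 15-30, over 30"
tabulate efficiency              // shows the value labels
tabulate efficiency, nolabel     // shows the stored codes 1, 2, 3
assert !missing(efficiency)      // every car was assigned a category
```

Because mpg holds whole numbers in this file, the three ranges leave no gaps; with fractional values the boundaries would need to be written differently (for example 14/30 in place of 15/30).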
Convert a String Variable Containing Digits into a Numeric Variable (4a.)
. use http://www.stata-press.com/data/r12/hbp2, clear
. destring id, generate(numid)
id has all characters numeric; numid generated as double
Create numeric variable
. list id numid in 1/10
               id       numid
   1.  8008238923   8.008e+09
   2.  8007143470   8.007e+09
   3.  8000468015   8.000e+09
   4.  8006167153   8.006e+09
   5.  8006142590   8.006e+09
   6.  8007340259   8.007e+09
   7.  8004411604   8.004e+09
   8.  8006962950   8.007e+09
   9.  8005012348   8.005e+09
  10.  8003187296   8.003e+09
Notice the default exponential format
. format numid %10.0f
Use fixed format to display
. list id numid in 1/10
               id        numid
   1.  8008238923   8008238923
   2.  8007143470   8007143470
   3.  8000468015   8000468015
   4.  8006167153   8006167153
   5.  8006142590   8006142590
   6.  8007340259   8007340259
   7.  8004411604   8004411604
   8.  8006962950   8006962950
   9.  8005012348   8005012348
  10.  8003187296   8003187296
. describe id numid
                storage   display   value
variable name   type      format    label     variable label
id              str10     %10s                Record identification number
numid           double    %10.0g              Record identification number
Automatically Create a Labeled Numeric Variable from a String Variable (4b.)
. use http://www.stata-press.com/data/r12/hbp2, clear
. encode sex, generate(gender)
Makes a new numeric variable, with value labels containing the text from the original variable
. describe sex gender
                storage   display   value
variable name   type      format    label     variable label
sex             str6      %9s
gender          long      %8.0g     gender
    sex: original string variable
    gender: new labeled numeric variable
. tab sex gender, nolabel
Data values
             gender
     sex        1        2     Total
  female      433        0       433
    male        0      695       695
   Total      433      695     1,128
. tab sex gender
Value labels
             gender
     sex   female     male     Total
  female      433        0       433
    male        0      695       695
   Total      433      695     1,128
Note: encode assigns integers beginning with 1, following the alphabetical order of the values of the original string variable
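Both conversions from slides 4a. and 4b. can be tried on a few typed-in rows. The three records below are a made-up miniature in the style of hbp2, used only so the sketch runs without a network connection.

```stata
clear
input str10 id str6 sex
"8008238923" "female"
"8007143470" "male"
"8000468015" "male"
end
destring id, generate(numid)     // numeric copy of the digit string
format numid %10.0f              // show all ten digits instead of 8.008e+09
encode sex, generate(gender)     // alphabetical coding: female = 1, male = 2
list                             // gender displays its value labels
list, nolabel                    // gender displays 1 and 2
describe
```

destring is the right tool when the string holds digits that should become the number itself, while encode is for category names that should become labeled codes.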
Reshape Wide to Long (5a.1)
When you have a wide dataset … but need a long one
. webuse reshape1, clear
. list
wide
       id   sex   inc80   inc81   inc82   ue80   ue81   ue82
  1.    1     0    5000    5500    6000      0      1      0
  2.    2     1    2000    2200    3300      1      0      0
  3.    3     0    3000    2000    1000      0      0      1
You can reshape the data from wide to long
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
Data                         wide   ->   long
Number of obs.                  3   ->   9
Number of variables             8   ->   5
j variable (3 values)               ->   year
xij variables:
                inc80 inc81 inc82   ->   inc
                   ue80 ue81 ue82   ->   ue
. list
long
       id   year   sex    inc   ue
  1.    1     80     0   5000    0
  2.    1     81     0   5500    1
  3.    1     82     0   6000    0
  4.    2     80     1   2000    1
  5.    2     81     1   2200    0
  6.    2     82     1   3300    0
  7.    3     80     0   3000    0
  8.    3     81     0   2000    0
  9.    3     82     0   1000    1
Why would you do this? Some Stata statistical procedures (e.g. xtreg for panel data) require the data to be in long form
Let’s Look at the Code In-Depth (5a.1)
. webuse reshape1, clear
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
    long: we want our data to end up in long form
    inc ue: the two vars that currently have numbers tacked on the end of their names; the ones we want to reshape. In Stata these are called “stubs”.
    i(id): this specifies a unique individual
    j(year): take the numbers off the end of the reshape vars, and put them in a new var called “year”
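For practice without a network connection, the three wide rows shown on the slide can be typed in directly. The sketch below reshapes them to long and then back to wide; the round trip is an addition here to show that the two forms hold the same information.

```stata
clear
input id sex inc80 inc81 inc82 ue80 ue81 ue82
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
3 0 3000 2000 1000 0 0 1
end
reshape long inc ue, i(id) j(year)   // stubs inc and ue; suffixes 80, 81, 82 go into year
list, sepby(id)                      // 9 rows: one per person per year
reshape wide inc ue, i(id) j(year)   // back to 3 rows, one per person
list
```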
Reshape Wide to Long Without ID (5a.2)
What if there is no ID variable?
. webuse reshape1, clear
. drop id
. list
       sex   inc80   inc81   inc82   ue80   ue81   ue82
  1.     0    5000    5500    6000      0      1      0
  2.     1    2000    2200    3300      1      0      0
  3.     0    3000    2000    1000      0      0      1
Let’s create one
. generate id=_n
. list
       sex   inc80   inc81   inc82   ue80   ue81   ue82   id
  1.     0    5000    5500    6000      0      1      0    1
  2.     1    2000    2200    3300      1      0      0    2
  3.     0    3000    2000    1000      0      0      1    3
. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)
Data                         wide   ->   long
Number of obs.                  3   ->   9
Number of variables             8   ->   5
j variable (3 values)               ->   year
xij variables:
                inc80 inc81 inc82   ->   inc
                   ue80 ue81 ue82   ->   ue
. list, sepby(id)
       id   year   sex    inc   ue
  1.    1     80     0   5000    0
  2.    1     81     0   5500    1
  3.    1     82     0   6000    0
  4.    2     80     1   2000    1
  5.    2     81     1   2200    0
  6.    2     82     1   3300    0
  7.    3     80     0   3000    0
  8.    3     81     0   2000    0
  9.    3     82     0   1000    1
Reshape Long to Wide (5b.)
When you have a long dataset… but need a wide dataset
long
       id   year   sex    inc   ue
  1.    1     80     0   5000    0
  2.    1     81     0   5500    1
  3.    1     82     0   6000    0
  4.    2     80     1   2000    1
  5.    2     81     1   2200    0
  6.    2     82     1   3300    0
  7.    3     80     0   3000    0
  8.    3     81     0   2000    0
  9.    3     82     0   1000    1
You can reshape the data from long to wide … and optionally reorder the variables
. reshape wide inc ue, i(id) j(year)
(note: j = 80 81 82)
Data                         long   ->   wide
Number of obs.                  9   ->   3
Number of variables             5   ->   8
j variable (3 values)        year   ->   (dropped)
xij variables:
                              inc   ->   inc80 inc81 inc82
                               ue   ->   ue80 ue81 ue82
. order id sex inc80 inc81 inc82 ue80 ue81 ue82
The order command only rearranges the sequence of the variables in the dataset; it does not change any values
. list
wide
       id   sex   inc80   inc81   inc82   ue80   ue81   ue82
  1.    1     0    5000    5500    6000      0      1      0
  2.    2     1    2000    2200    3300      1      0      0
  3.    3     0    3000    2000    1000      0      0      1
Let’s Look at the Code In-Depth (5b.)
. reshape wide inc ue, i(id) j(year)
(note: j = 80 81 82)
    wide: we want our data to end up in wide form
    inc ue: the two vars that change each year, that we want to stick numbers on the end of
    i(id): this specifies a unique individual
    j(year): take the values in the variable “year”, and stick them on the end of inc and ue
. order id sex inc80 inc81 inc82 ue80 ue81 ue82
. list
What We Will Cover After the Break (6.)
• Combining multiple datasets vertically (append and preserve/restore)
• Save subsets of observations to different datasets
• Combining multiple datasets horizontally (1:1 merge)
• Save subsets of variables to different datasets
• m:1 (many-to-one) merging of datasets
• Extract group and individual data from multilevel datasets (collapse)
• Execute commands by groups (bysort)
• Create new variables based on data summaries and functions (egen)
• Create standardized scores and deviation scores (sd and std)
• Automate the same tasks for multiple variables (foreach loops)
• Global and local macros and looping
Append Multiple Datasets and Generate a Labeled Source Identifier (7a.)
Combine several datasets with the same variables but different observations …
capop
       county            pop
  1.   Los Angeles   9878554
  2.   Orange        2997033
  3.   Ventura        798364
ilpop
       county            pop
  1.   Cook          5285107
  2.   DeKalb         103729
  3.   Will           673586
txpop
       county            pop
  1.   Brazos         152415
  2.   Johnson        149797
  3.   Harris        4011475
into a single dataset, while identifying the source of the data
. use capop, clear
. append using ilpop txpop, generate(state)
. label define statelab 0 "CA" 1 "IL" 2 "TX"
. label values state statelab
. list, sep(0)
       county            pop   state
  1.   Los Angeles   9878554      CA
  2.   Orange        2997033      CA
  3.   Ventura        798364      CA
  4.   Cook          5285107      IL
  5.   DeKalb         103729      IL
  6.   Will           673586      IL
  7.   Brazos         152415      TX
  8.   Johnson        149797      TX
  9.   Harris        4011475      TX
Appending Datasets (7a.)
. use capop, clear
    Open the master dataset
. append using ilpop txpop, generate(state)
    append using ilpop txpop: append the other datasets to the first one
    generate(state): generate a variable identifying the data source, coded as consecutive integers beginning with 0
. label define statelab 0 "CA" 1 "IL" 2 "TX"
    Define and name a label for the new source identifier variable
. label values state statelab
    Apply the label to the source identifier variable
. list, sep(0)
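The same append pattern can be run end to end without the capop and ilpop files by building two small datasets in memory. In this sketch the rows are copied from the slide, while the temporary file `il' is an illustrative stand-in for ilpop.dta.

```stata
clear
input str12 county long pop
"Will"   673586
"DeKalb" 103729
end
tempfile il
save `il'                            // stand-in for ilpop.dta

clear
input str12 county long pop
"Ventura" 798364
"Orange"  2997033
end                                  // stand-in for capop.dta (the master)
append using `il', generate(state)   // 0 = rows already in memory, 1 = first appended file
label define statelab 0 "CA" 1 "IL"
label values state statelab
list, sep(0)
```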
Save Subsets of Observations to Separate Datasets (7b.)
. list, nolabel sep(0)
       county            pop   state
  1.   Los Angeles   9878554       0
  2.   Orange        2997033       0
  3.   Ventura        798364       0
  4.   Cook          5285107       1
  5.   DeKalb         103729       1
  6.   Will           673586       1
  7.   Brazos         152415       2
  8.   Johnson        149797       2
  9.   Harris        4011475       2
. ***Save subsets of cases
. preserve
. keep if (state==0)
(6 observations deleted)
. save CA, replace
file CA.dta saved
. restore
. preserve
. keep if (state==1)
(6 observations deleted)
. save IL, replace
file IL.dta saved
. restore
. preserve
. keep if (state==2)
(6 observations deleted)
. save TX, replace
file TX.dta saved
. restore
. use CA, clear
. list
       county            pop   state
  1.   Los Angeles   9878554      CA
  2.   Orange        2997033      CA
  3.   Ventura        798364      CA
. use IL, clear
. list
       county            pop   state
  1.   Cook          5285107      IL
  2.   DeKalb         103729      IL
  3.   Will           673586      IL
. use TX, clear
. list
       county            pop   state
  1.   Brazos         152415      TX
  2.   Johnson        149797      TX
  3.   Harris        4011475      TX
Create Separate files Containing Subsets of the Observations (7b.)
. ***Save subsets of cases
. preserve
    Create a temporary backup of the dataset
. keep if (state==0)
(6 observations deleted)
    Keep only a subset of the observations
. save CA, replace
file CA.dta saved
    Save the subset dataset
. restore
    Restore the dataset to its original state from the temporary backup
The same four steps are repeated with state==1 (saved as IL) and state==2 (saved as TX).
Merge Files Containing the Same Observations but Different Variables (8a.)
Merge data from two datasets with the same observations, but different variables (except for the key)
autosize (master)
       make               weight   length
  1.   Toyota Celica       2,410      174
  2.   BMW 320i            2,650      177
  3.   Cad. Seville        4,290      204
  4.   Pont. Grand Prix    3,210      201
  5.   Datsun 210          2,020      165
  6.   Plym. Arrow         3,260      170
autoexpense (using)
       make                price   mpg
  1.   Toyota Celica       5,899    18
  2.   BMW 320i            9,735    25
  3.   Cad. Seville       15,906    21
  4.   Pont. Grand Prix    5,222    19
  5.   Datsun 210          4,589    35
The key is make.
. use autosize, clear
(1978 Automobile Data)
. merge 1:1 make using autoexpense
    Result                    # of obs.
    not matched                       1
        from master                   1  (_merge==1)
        from using                    0  (_merge==2)
    matched                           5  (_merge==3)
merged
       make               weight   length    price   mpg   _merge
  1.   BMW 320i            2,650      177    9,735    25   matched (3)
  2.   Cad. Seville        4,290      204   15,906    21   matched (3)
  3.   Datsun 210          2,020      165    4,589    35   matched (3)
  4.   Plym. Arrow         3,260      170        .     .   master only (1)
  5.   Pont. Grand Prix    3,210      201    5,222    19   matched (3)
  6.   Toyota Celica       2,410      174    5,899    18   matched (3)
1:1 (Match) Merging (8a.)
. use autosize, clear
(1978 Automobile Data)
    Open one of the datasets
. merge 1:1 make using autoexpense
    merge 1:1: do a match merge
    make: based on a common key variable which uniquely identifies each observation across both datasets
    using autoexpense: merge with the other dataset
    Result                    # of obs.
    not matched                       1
        from master                   1  (_merge==1)
        from using                    0  (_merge==2)
    matched                           5  (_merge==3)
    matched: observations with data from both datasets
    not matched: observations with data from just one dataset
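A self-contained version of the 1:1 merge can be built from a few of the rows on the slide. The temporary file `expense' stands in for autoexpense.dta, and the isid check is an addition that confirms the key really is unique before merging.

```stata
clear
input str16 make price mpg
"Toyota Celica" 5899 18
"BMW 320i"      9735 25
end
tempfile expense
save `expense'                     // stand-in for autoexpense.dta (the using data)

clear
input str16 make weight length
"Toyota Celica" 2410 174
"BMW 320i"      2650 177
"Plym. Arrow"   3260 170
end                                // stand-in for autosize.dta (the master data)
isid make                          // the key must identify each master row uniquely
merge 1:1 make using `expense'
tabulate _merge                    // 1 = master only, 2 = using only, 3 = matched
list, sep(0)
```

Checking _merge right after every merge is a cheap way to catch keys that fail to line up, such as the Plym. Arrow row with no price or mpg.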
Save Subsets of Variables to Separate Datasets (8b.)
       make               weight   length    price   mpg   _merge
  1.   BMW 320i            2,650      177    9,735    25   matched (3)
  2.   Cad. Seville        4,290      204   15,906    21   matched (3)
  3.   Datsun 210          2,020      165    4,589    35   matched (3)
  4.   Plym. Arrow         3,260      170        .     .   master only (1)
  5.   Pont. Grand Prix    3,210      201    5,222    19   matched (3)
  6.   Toyota Celica       2,410      174    5,899    18   matched (3)
. ***Save subsets of variables
. preserve
. keep make weight length
. save SIZE, replace
file SIZE.dta saved
. restore
. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore
. use SIZE, clear
(1978 Automobile Data)
. list, sep(0)
       make               weight   length
  1.   BMW 320i            2,650      177
  2.   Cad. Seville        4,290      204
  3.   Datsun 210          2,020      165
  4.   Plym. Arrow         3,260      170
  5.   Pont. Grand Prix    3,210      201
  6.   Toyota Celica       2,410      174
. use EXPENSE, clear
(1978 Automobile Data)
. list, sep(0)
       make                price   mpg
  1.   BMW 320i            9,735    25
  2.   Cad. Seville       15,906    21
  3.   Datsun 210          4,589    35
  4.   Plym. Arrow             .     .
  5.   Pont. Grand Prix    5,222    19
  6.   Toyota Celica       5,899    18
Save Subsets of Variables to Separate Datasets (8b.)
. ***Save subsets of variables
. preserve
    Backup before subsetting variables
. keep make weight length
    Keep the first variable subset
. save SIZE, replace
file SIZE.dta saved
    Save the first subset as a Stata data file
. restore
    Restore the backup dataset
. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore
Make sure the key variable (make) is included in both subsets
Distribute Group-level Information Across Individual-level Observations (9a.)
sforce
        region    name
   1.   N Cntrl   Krantz
   2.   N Cntrl   Phipps
   3.   N Cntrl   Willis
   4.   NE        Ecklund
   5.   NE        Franks
   6.   South     Anderson
   7.   South     Dubnoff
   8.   South     Lee
   9.   South     McNeil
  10.   West      Charles
  11.   West      Cobb
  12.   West      Grant
dollars
        region      sales      cost
   1.   N Cntrl   419,472   227,677
   2.   NE        360,523   138,097
   3.   South     532,399   330,499
   4.   West      310,565   165,348
The key is region. Look up the variable values in “dollars” and attach them to the records in “sforce”
. use sforce, clear
(Sales Force)
. merge m:1 region using dollars
(label region already defined)
    Result                    # of obs.
    not matched                       0
    matched                          12  (_merge==3)
merged
        region    name         sales      cost   _merge
   1.   N Cntrl   Krantz     419,472   227,677   matched (3)
   2.   N Cntrl   Phipps     419,472   227,677   matched (3)
   3.   N Cntrl   Willis     419,472   227,677   matched (3)
   4.   NE        Ecklund    360,523   138,097   matched (3)
   5.   NE        Franks     360,523   138,097   matched (3)
   6.   South     Anderson   532,399   330,499   matched (3)
   7.   South     Dubnoff    532,399   330,499   matched (3)
   8.   South     Lee        532,399   330,499   matched (3)
   9.   South     McNeil     532,399   330,499   matched (3)
  10.   West      Charles    310,565   165,348   matched (3)
  11.   West      Cobb       310,565   165,348   matched (3)
  12.   West      Grant      310,565   165,348   matched (3)
m:1 Many-to-One (Lookup) Merging (9a.)
. use sforce, clear
(Sales Force)
    sforce: level 1 dataset
. merge m:1 region using dollars
(label region already defined)
    merge m:1: lookup merging
    region: key variable
    dollars: level 2 dataset
    Result                    # of obs.
    not matched                       0
    matched                          12  (_merge==3)
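A compact lookup merge can be reproduced with two regions from the slide. Note one simplification, flagged as an assumption: region is typed in here as a plain string, whereas in sforce and dollars it appears to be a labeled numeric variable.

```stata
clear
input str8 region long sales long cost
"NE"   360523 138097
"West" 310565 165348
end
tempfile dollars
save `dollars'                     // level 2 (group) data: one row per region

clear
input str8 region str8 name
"NE"   "Ecklund"
"NE"   "Franks"
"West" "Charles"
"West" "Cobb"
"West" "Grant"
end                                // level 1 (individual) data: many rows per region
merge m:1 region using `dollars'   // many master rows share one using row
assert _merge == 3                 // every salesperson found a region record
drop _merge
list, sepby(region)
```

The m:1 form requires the key to be unique in the using data only; Stata stops with an error if dollars has two rows for the same region.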
Extract the Individual- and Group-Level Data from a Multilevel Data Set (9b.)
. ***Get and sort the multilevel data
. use http://www.ats.ucla.edu/stat/hlm/faq/hsball, clear
. sort id
. ***Write out the individual-level data
. preserve
. keep id minority female ses mathach
. codebook, compact
Variable    Obs   Unique       Mean      Min      Max   Label
id         7185      160          .        .        .
minority   7185        2    .274739        0        1
female     7185        2   .5281837        0        1
ses        7185      373   .0001434   -3.758    2.692
mathach    7185     6031   12.74785   -2.832   24.993
Number of students: 7185
. save lev1, replace
file lev1.dta saved
. restore
. ***Write out the school-level data
. collapse (mean) meanses size sector pracad disclim himinty, by(id)
. save lev2, replace
file lev2.dta saved
. codebook, compact
Variable    Obs   Unique        Mean      Min     Max   Label
id          160      160           .        .       .
meanses     160      150   -.0001875   -1.188    .831   (mean) meanses
size        160      149    1097.825      100    2713   (mean) size
sector      160        2       .4375        0       1   (mean) sector
pracad      160       73    .5139375        0       1   (mean) pracad
disclim     160      159    -.015125   -2.416   2.756   (mean) disclim
himinty     160        2        .275        0       1   (mean) himinty
Number of schools: 160
Note: Requires that the school-level variables in the original multilevel data have the same (constant) values for every student within a given school.
Separating Level 1 and Level 2 Data (9b.)
. ***Get and sort the multilevel data
. use http://www.ats.ucla.edu/stat/hlm/faq/hsball, clear
. sort id
    Sort by the group identifier
. ***Write out the individual-level data
. preserve
. keep id minority female ses mathach
    Keep the level 1 variables
. codebook, compact
. save lev1, replace
file lev1.dta saved
    Save the level 1 data
. restore
. ***Write out the school-level data
. collapse (mean) meanses size sector pracad disclim himinty, by(id)
    Get the group means of the level 2 variables
. save lev2, replace
file lev2.dta saved
    Save the level 2 dataset
. codebook, compact
Aggregating Data by Subgroups [With Frequency Weights] (10.)
Produce a new file with a single observation for each group of records in the original data set. This example produces the group means and medians.
. use college, clear
college (number holds the frequency weights)
        gpa   hour   year   number
   1.   3.2     30      1        3
   2.   3.5     34      1        2
   3.   2.8     28      1        9
   4.   2.1     30      1        4
   5.   3.8     29      2        3
   6.   2.5     30      2        4
   7.   2.9     35      2        5
   8.   3.7     30      3        4
   9.   2.2     35      3        2
  10.   3.3     33      3        3
  11.   3.4     32      4        5
  12.   2.9     31      4        2
. collapse (mean) gpa hour (median) medgpa=gpa medhour=hour [ fw = number ], by(year)
. list
aggregated
       year        gpa       hour   medgpa   medhour
  1.      1   2.788889   29.44444      2.8        29
  2.      2   2.991667   31.83333      2.9        30
  3.      3   3.233333   32.11111      3.3        33
  4.      4   3.257143   31.71428      3.4        32
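The weighted collapse can be checked by hand on the year 1 and year 2 rows from the slide, typed in here so that no external file is needed. With frequency weights each row counts number times, so the year 1 mean gpa is (3.2·3 + 3.5·2 + 2.8·9 + 2.1·4)/18 ≈ 2.789, which matches the aggregated output above.

```stata
clear
input gpa hour year number
3.2 30 1 3
3.5 34 1 2
2.8 28 1 9
2.1 30 1 4
3.8 29 2 3
2.5 30 2 4
2.9 35 2 5
end
collapse (mean) gpa hour (median) medgpa=gpa medhour=hour [fw=number], by(year)
list
```

Keep in mind that collapse replaces the data in memory, so wrap it in preserve and restore (or save first) if the original rows are still needed.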
Execute Commands by Subgroups (11a.)
• ‘bysort’ runs a command separately for each value of a variable
• Using just ‘by’ requires the data to be sorted by the variable in consideration. ‘bysort’ does that for you
Runs separate regressions for observations when foreign=“domestic” and when foreign=“foreign”
Summarizes the variables price & mpg when foreign=“domestic” and foreign=“foreign”
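The commands behind the two descriptions above did not survive in this transcript, so the following is a plausible reconstruction rather than the slide's exact code. It assumes the auto.dta file that ships with Stata and uses price and mpg as named in the descriptions.

```stata
sysuse auto, clear
bysort foreign: summarize price mpg        // one summary table per value of foreign
bysort foreign: regress price mpg          // one regression per value of foreign

* Equivalent two-step form with plain by, which needs sorted data:
sort foreign
by foreign: summarize price mpg

* The by prefix also works with generate; _N is then the group size
bysort foreign: generate n_in_group = _N
```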
Using bysort to Identify Duplicates (11b.)
4 groups of duplicates
It is important to note that bysort cannot be used with every Stata command (e.g. scatter, histogram).
Within-observation Across-variables Data Summaries (12a.)
Create new variables that are statistical functions of multiple original variables for each observation
. use http://www.stata-press.com/data/r12/egenxmpl4, clear
Example statistical functions
. egen rtot = rowtotal(a b c) //row total
. egen rn = rownonmiss(a b c) //row n
. egen rmean = rowmean(a b c) //row mean
. egen rmed = rowmedian(a b c) //row median
. egen rmin = rowmin(a b c) //row minimum
. egen rmax = rowmax(a b c) //row maximum
. list
        a    b    c   rtot   rn   rmean   rmed   rmin   rmax
  1.    .    2    3      5    2     2.5    2.5      2      3
  2.    4    .    6     10    2       5      5      4      6
  3.    7    8    .     15    2     7.5    7.5      7      8
  4.   10   11   12     33    3      11     11     10     12
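The table above shows that the egen row functions skip missing values. The sketch below types in the same four rows and adds a naive a + b + c sum for contrast; naive_tot is an illustrative name that does not appear in the slides.

```stata
clear
input a b c
 .  2  3
 4  .  6
 7  8  .
10 11 12
end
egen rtot  = rowtotal(a b c)       // treats missing values as 0
egen rmean = rowmean(a b c)        // averages the nonmissing values only
generate naive_tot = a + b + c     // missing whenever any input is missing
list
```

Which behavior is appropriate depends on the analysis: rowtotal() quietly produces a total from incomplete rows, so pairing it with rownonmiss() makes it easy to see how many values each total is based on.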
Within-variable Across-observations Data Summaries (12b.)
Create new variables that are statistical functions of individual original variables across all, or groups of, the observations
. sysuse autornd.dta, clear //get the data
(1978 Automobile Data)
. keep make mpg //keep make and mpg
. keep in 1/10 //keep the first 10 observations
(64 observations deleted)
. generate mfg=word(make,1) //extract manufacturer from make
. format mfg %-7s //left align the mfg variable
. egen vm_mpg=mean(mpg) //mpg dataset mean
    Means for the whole sample
. bysort mfg: egen gm_mpg=mean(mpg) //mpg group mean
    Means for subgroups
. list, sepby(mfg) //list by mfg
        make            mpg   mfg     vm_mpg     gm_mpg
   1.   AMC Concord      20   AMC         19   18.33333
   2.   AMC Pacer        15   AMC         19   18.33333
   3.   AMC Spirit       20   AMC         19   18.33333
   4.   Buick Century    20   Buick       19   19.28572
   5.   Buick Electra    15   Buick       19   19.28572
   6.   Buick LeSabre    20   Buick       19   19.28572
   7.   Buick Opel       25   Buick       19   19.28572
   8.   Buick Regal      20   Buick       19   19.28572
   9.   Buick Riviera    15   Buick       19   19.28572
  10.   Buick Skylark    20   Buick       19   19.28572
Creating Standardized Scores and Deviation Scores (13.)
. sysuse autornd.dta, clear //get the data
(1978 Automobile Data)
. keep make mpg //keep make and mpg
. keep in 1/10 //keep the first 10 observations
(64 observations deleted)
. egen vm_mpg=mean(mpg) //mpg dataset mean
. egen vs_mpg=sd(mpg) //mpg standard deviation
. egen vz_mpg=std(mpg) //mpg z-scores
    Standardized scores
. generate vd_mpg=mpg-vm_mpg //mpg deviation scores
    Deviations from the variable’s mean, AKA grand mean centering
. list
        make            mpg   vm_mpg     vs_mpg      vz_mpg   vd_mpg
   1.   AMC Concord      20       19   3.162278    .3162278        1
   2.   AMC Pacer        15       19   3.162278   -1.264911       -4
   3.   AMC Spirit       20       19   3.162278    .3162278        1
   4.   Buick Century    20       19   3.162278    .3162278        1
   5.   Buick Electra    15       19   3.162278   -1.264911       -4
   6.   Buick LeSabre    20       19   3.162278    .3162278        1
   7.   Buick Opel       25       19   3.162278    1.897367        6
   8.   Buick Regal      20       19   3.162278    .3162278        1
   9.   Buick Riviera    15       19   3.162278   -1.264911       -4
  10.   Buick Skylark    20       19   3.162278    .3162278        1
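A z-score is the deviation score divided by the standard deviation, which can be confirmed by building it by hand next to egen's std(). The sketch assumes the full auto.dta file that ships with Stata, and the variable names (m_mpg, s_mpg, z_hand and so on) are illustrative.

```stata
sysuse auto, clear
egen m_mpg = mean(mpg)                      // dataset mean
egen s_mpg = sd(mpg)                        // dataset standard deviation
egen z_mpg = std(mpg)                       // z-scores from egen
generate d_mpg  = mpg - m_mpg               // deviation scores (grand mean centering)
generate z_hand = (mpg - m_mpg) / s_mpg     // z-scores built from the definition
assert abs(z_mpg - z_hand) < 1e-4           // the two versions agree up to rounding
summarize z_mpg d_mpg                       // z_mpg has mean about 0 and sd about 1
```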
Create and Format Multiple Variables at Once (14a.)
. sysuse auto.dta, clear
(1978 Automobile Data)
. keep make price mpg headroom
. list in 1/10
        make             price   mpg   headroom
   1.   AMC Concord      4,099    22        2.5
   2.   AMC Pacer        4,749    17        3.0
   3.   AMC Spirit       3,799    22        3.0
   4.   Buick Century    4,816    20        4.5
   5.   Buick Electra    7,827    15        4.0
   6.   Buick LeSabre    5,788    18        4.0
   7.   Buick Opel       4,453    26        3.0
   8.   Buick Regal      5,189    20        2.0
   9.   Buick Riviera   10,372    16        3.5
  10.   Buick Skylark    4,082    19        3.5
. foreach v in price mpg headroom {
  2. egen z_`v'=std(`v')
  3. format z_`v' %6.2f
  4. }
Stata puts these line numbers in the output even though they are not in the do file
. list in 1/10
        make             price   mpg   headroom   z_price   z_mpg   z_head~m
   1.   AMC Concord      4,099    22        2.5     -0.70    0.12      -0.58
   2.   AMC Pacer        4,749    17        3.0     -0.48   -0.74       0.01
   3.   AMC Spirit       3,799    22        3.0     -0.80    0.12       0.01
   4.   Buick Century    4,816    20        4.5     -0.46   -0.22       1.78
   5.   Buick Electra    7,827    15        4.0      0.56   -1.09       1.19
   6.   Buick LeSabre    5,788    18        4.0     -0.13   -0.57       1.19
   7.   Buick Opel       4,453    26        3.0     -0.58    0.81       0.01
   8.   Buick Regal      5,189    20        2.0     -0.33   -0.22      -1.17
   9.   Buick Riviera   10,372    16        3.5      1.43   -0.92       0.60
  10.   Buick Skylark    4,082    19        3.5     -0.71   -0.40       0.60
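As it would appear in a do-file (without the 2., 3., 4. line numbers that Stata adds to the log), the loop looks like the sketch below. The label variable line is an addition for illustration, and of varlist is used in place of in so that Stata checks that each name is an existing variable.

```stata
sysuse auto, clear
foreach v of varlist price mpg headroom {
    egen z_`v' = std(`v')                  // `v' is replaced by each variable name in turn
    format z_`v' %6.2f
    label variable z_`v' "z-score of `v'"
}
describe z_*
```

Inside the loop the local macro is written with a left quote and an apostrophe (`v'), the same way any other local macro is referenced.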
Create and Check Dummy Variables (14b.)
. *Dummy variables
. use http://www.stata-press.com/data/r12/nlswork.dta,clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. tabulate year, generate(yr)
  interview
       year    Freq.   Percent     Cum.
         68    1,375      4.82     4.82
         69    1,232      4.32     9.14
         70    1,686      5.91    15.05
         71    1,851      6.49    21.53
         72    1,693      5.93    27.47
         73    1,981      6.94    34.41
         75    2,141      7.50    41.91
         77    2,171      7.61    49.52
         78    1,964      6.88    56.40
         80    1,847      6.47    62.88
         82    2,085      7.31    70.18
         83    1,987      6.96    77.15
         85    2,085      7.31    84.45
         87    2,164      7.58    92.04
         88    2,272      7.96   100.00
      Total   28,534    100.00
. *Tabulate to verify
. foreach x of varlist yr1-yr15{
  2. tab `x'
  3. }
     year==
    68.0000    Freq.   Percent     Cum.
          0   27,159     95.18    95.18
          1    1,375      4.82   100.00
      Total   28,534    100.00
--Some output omitted--
     year==
    88.0000    Freq.   Percent     Cum.
          0   26,262     92.04    92.04
          1    2,272      7.96   100.00
      Total   28,534    100.00
Macros (15.)
Macros can be used for many things. Two examples are: 1) Lists or other storage 2) Variables
Global: exists until Stata is closed or the macro is dropped (for example with macro drop). Note that clear all does not remove macros.
Local: temporary macro, disappears when the do file finishes running
. // Lists
. global names "Ballav Nick ChongMing Joe David"
. local names2 "Jake Steven Jose Tyrell Martin"
. di "`names2'"
Jake Steven Jose Tyrell Martin
. foreach x in $names {
  2. di "`x'"
  3. }
Ballav
Nick
ChongMing
Joe
David
. foreach x in `names2' {
  2. di "`x'"
  3. }
Jake
Steven
Jose
Tyrell
Martin
. // Macros can also be used to specify variables.
. global ind_vars "female read write math"
. reg ses $ind_vars
      Source          SS    df           MS      Number of obs =     206
                                                 F(  4,   201) =    6.01
       Model  11.6241389     4   2.90603473      Prob > F      =  0.0001
    Residual   97.137997   201   .483273617      R-squared     =  0.1069
                                                 Adj R-squared =  0.0891
       Total  108.762136   205   .530547004      Root MSE      =  .69518

         ses      Coef.   Std. Err.       t   P>|t|    [95% Conf. Interval]
      female    -.17103    .1049502   -1.63   0.105   -.3779747    .0359146
        read   .0135099    .0066418    2.03   0.043    .0004133    .0266065
       write   .0051675    .0073557    0.70   0.483   -.0093367    .0196717
        math   .0066498    .0067761    0.98   0.328   -.0067116    .0200113
       _cons   .8044773    .3080703    2.61   0.010     .197013    1.411942
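Both kinds of macro can be exercised in one short do-file. The sketch assumes the auto.dta file that ships with Stata; the macro names controls and outcome are illustrative and do not come from the slides.

```stata
sysuse auto, clear
local  controls "mpg weight"      // local: visible only inside this do-file or program
global outcome  "price"           // global: visible everywhere until dropped or Stata exits
regress $outcome `controls'       // expands to: regress price mpg weight
foreach v of local controls {
    display "`v'"                 // prints mpg, then weight
}
local n : word count `controls'
display "`n' control variables"
macro drop outcome                // remove the global when it is no longer needed
```

Locals are usually the safer default, since a global set in one do-file stays defined for everything run later in the same session.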