INFO 7470/ECON 7400/ILRLE 7400: Solutions to Lab 5, John M. Abowd and Lars Vilhuber, March 25, 2013
TRANSCRIPT
© John M. Abowd and Lars Vilhuber 2013, all rights reserved
LESSONS TO BE LEARNED
Subtitle organization
3/4/2013
Lessons
• Answering data-driven questions
• Identify tools to answer the question
• Correctly use available metadata
• Not all data on the same topic provide the same answer
Required tools
• SAS, Stata, R, Python, etc.
• Web browser
• Search engine…
NAICS
NAICS sub-sectors (NAICS3)
QCEW
After downloading ZIP file
• For historical data, BLS has packaged an entire year into a single ZIP file (151MB)
• We only need one file from there: county file for Pennsylvania
• What is the state code for PA?
  – PA -> FIPS=42
• We thus need cn42pa10.enb (note the extension, but no choice: only .enb files available)
• Extract it from the ZIP file, unpacked: 38MB
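The extraction step can also be scripted rather than done by hand. The following is a minimal sketch, assuming the yearly ZIP file has already been downloaded; the function name extract_member and the archive name in the usage comment are hypothetical, and only the member name cn42pa10.enb comes from the slide.

```python
import zipfile
from pathlib import Path


def extract_member(zip_path, member_suffix, dest_dir):
    """Extract the one archive member whose name ends with member_suffix.

    Matching on the suffix avoids having to know the folder structure
    inside the archive in advance.
    """
    with zipfile.ZipFile(zip_path) as archive:
        matches = [name for name in archive.namelist()
                   if name.endswith(member_suffix)]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one member ending in {member_suffix!r}, "
                f"found {matches}"
            )
        # extract() returns the path of the file it wrote
        return Path(archive.extract(matches[0], path=dest_dir))


# Usage (the archive name is a placeholder, not the real BLS file name):
# county_file = extract_member("qcew_2010.zip", "cn42pa10.enb", "data")
```

Pulling out a single member keeps the other files of the 151MB archive off the disk.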
How to read it in?
• No information in the ZIP file, but…
  – On the same FTP server: DOCUMENT/
  – On the Web page: “Flat file formatters”
  – On the Web page: “Tools and tutorials”
• Use the template files to construct a SAS program
  – For Stata: construct a dictionary file
  – For R: read a fixed format file
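For readers working in Python, reading a fixed-format file amounts to slicing each line at known column positions. The sketch below illustrates the idea only: the field names and positions in LAYOUT are HYPOTHETICAL placeholders, and the real ones have to be taken from the BLS layout documentation or template files.

```python
# HYPOTHETICAL layout: (field name, start, end), 0-based, end exclusive.
# Replace these with the positions given in the BLS documentation.
LAYOUT = [
    ("area_code", 0, 5),
    ("industry_code", 5, 11),
    ("year", 11, 15),
    ("employment", 15, 24),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


def read_fixed_width(path, layout=LAYOUT):
    """Yield one dict per non-blank line of a fixed-width text file."""
    with open(path, encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.strip():
                yield parse_fixed_width(line.rstrip("\n"), layout)
```

Keeping every field as a string at this stage preserves leading zeros in codes; numeric conversion can follow once the layout is confirmed.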
Solution to QCEW
• http://www.vrdc.cornell.edu/info7470/Data/lab5-qcew.sas.txt
• Compare it to the template program provided in BLS’ makesas.zip
QCEW Pitfalls
• Industry coding: ftp://ftp.bls.gov/pub/special.requests/cew/DOCUMENT/industry.map
• “Industry Code Map: This is for NAICS based Quarterly Census of Employment and Wages (QCEW) data.”
Mixed industry coding
Industry Code   Industry Title
10              10 Total, all industries
101             101 Goods-producing
1011            1011 Natural resources and mining
11              NAICS 11 Agriculture, forestry…
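The pitfall shown in the table is that a code such as 101 is three digits long but is not a NAICS sub-sector. A minimal sketch of a filter follows, assuming that the QCEW-specific aggregate codes are the ones beginning with "10" (as in the rows shown) and relying on the fact that NAICS sectors start at 11; the function name is hypothetical.

```python
def is_naics_subsector(code):
    """Return True for a three-digit NAICS sub-sector code.

    ASSUMPTION: QCEW aggregate codes (10, 101, 1011, ...) all begin
    with "10", which no true NAICS code does.
    """
    code = code.strip()
    return len(code) == 3 and code.isdigit() and not code.startswith("10")


codes = ["10", "101", "1011", "11", "113", "311"]
subsectors = [c for c in codes if is_naics_subsector(c)]
# subsectors == ["113", "311"]
```

Selecting on code length alone would let 101 (Goods-producing) through and double count employment.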
QWI
• Challenge: very large files
• http://www.vrdc.cornell.edu/qwipu/R2012Q2/pa/wia/qwi_pa_wia_county_naicssec_pri.csv.bz2 : 81MB compressed, 2.3GB uncompressed
• Read-in requires 8GB of RAM for R…
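One way around the memory requirement is to stream the compressed file and keep only the rows of interest, instead of loading everything at once. The sketch below uses only the Python standard library; the function name, and the column name and value in the usage comment, are hypothetical and must be checked against the file's header row.

```python
import bz2
import csv


def iter_matching_rows(path, column, value):
    """Stream a bz2-compressed CSV and yield rows where row[column] == value.

    Only one row is held in memory at a time, so the 2.3GB
    uncompressed size never has to fit in RAM.
    """
    with bz2.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row[column] == value:
                yield row


# Usage (column name and county code are placeholders):
# rows = list(iter_matching_rows(
#     "qwi_pa_wia_county_naicssec_pri.csv.bz2", "geography", "42027"))
```

The trade-off is speed: every call decompresses the file from the start, so it pays to extract all needed rows in a single pass.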
Metadata and data
• “How many data rows does the file you downloaded have?”
  – QCEW: as many as the .enb file has (no embedded metadata) (88,093)
  – QWI: count of lines minus 1: the header row is metadata, not data (8,482,131)
  – Same reasoning for CBP (2,155,389)
Reading in QWI
• http://www.vrdc.cornell.edu/qwipu/R2012Q2/pa/wia/sas_import_wia.sas in the same directory
• Very long program, but the very first section is for the file we want: qwi_pa_wia_county_naics3
• Alternatively, use “proc import”, but it may not yield correct results.
After read-in, same as for QCEW
• http://www.vrdc.cornell.edu/info7470/Data/lab5_qwi.sas.txt
Solution for QWI
County Business Patterns
• Straight CSV file, but for entire year (15.2MB ZIP file)
• But: employment refers to mid-March (the pay period including March 12), so comparable to the other two
• Caution: file contains all levels of NAICS, right-filled with “////”
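Because all NAICS levels sit in one file, the level of each row has to be derived from the code itself. A minimal sketch: strip the filler characters from the right and count the digits that remain. The function name is hypothetical, and stripping "-" in addition to "/" is an assumption that some aggregate rows use that filler instead.

```python
def naics_level(code):
    """Return the number of significant NAICS digits in a right-filled code.

    ASSUMPTION: filler characters are "/" (as on the slide) or "-".
    A code consisting only of filler is treated as level 0 (a total).
    """
    digits = code.strip().rstrip("/-")
    if digits and not digits.isdigit():
        raise ValueError(f"unexpected NAICS code: {code!r}")
    return len(digits)


codes = ["------", "11----", "113///", "1131//", "113110"]
levels = [naics_level(c) for c in codes]
# levels == [0, 2, 3, 4, 6]
```

Keeping only rows where naics_level(code) == 3 then yields the sub-sector records that line up with the NAICS3 data from the other two sources.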
Solution for CBP
Results
• Not all sources give the same answer…
  – Differences in source data
    • Count of individual wage records
    • Firm-level report of employment at a particular point in time to state reporting system
    • Establishment-level report of employment at a particular point in time to federal reporting system
  – Differences in data cleaning
  – Other…
Now that you know how
• Try it on Lewis and Clark County, MT
• Try it for earlier time periods
• Drill down