INFO 7470/ECON 7400/ILRLE 7400 Citing Literature, Citing Data John M. Abowd and Lars Vilhuber March 11, 2013


© John M. Abowd and Lars Vilhuber 2013, all rights reserved


CITING LITERATURE

3/11/2013


Citing Literature

• Why? To prevent plagiarism, to establish provenance of ideas

• How? Why do we cite as we do – publishing cycles, uniqueness of sources

• Plagiarism: appropriating other people’s ideas
• Examples (Bruno Frey)
• Citing literature today: does it still work?
  – Issues of versioning of articles
  – Revisions/retractions/corrections


Why Do We Cite Literature?

• To give credit to the original authors of ideas
  – To not give credit is plagiarism
• To allow readers to find the information cited
  – Trace the evolution of ideas
  – Document cited results


Plagiarism

• More easily detected nowadays
  – http://plagiarism.repec.org/offenders.html
  – http://ideas.repec.org/a/che/chepap/v20y2008i1p20-25.html
• Software
  – http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/
  – Turnitin
  – AEA uses http://www.aeaweb.org/crosscheck.php


Source: http://www.elsevier.com/authors/author-rights-and-responsibilities#responsibilities via RePEc


Prominent Recent Examples of Plagiarism

• Bruno Frey
  – AEA PP, others (see FreyPlag_Wiki but also responses by Frey)
• German ministers
  – Defense: Karl-Theodor Maria Nikolaus Johann Jacob Philipp Franz Joseph Sylvester Freiherr von und zu Guttenberg [German source]
  – Education… Annette Schavan [German source]
• Russian presidents? [2006]


How Do We Cite?

• Multiple typographical standards
• Generally enough unique keys to correctly identify the source
• Current conventions driven to a large extent by the publishing model in effect through the end of the 20th century (see also Margo Anderson’s Session 1 on data publishing)


Examples

Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)


Examples

Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)

[Images not reproduced; panel labels: “Declining uniqueness” and “Online documents”]


Permanent Links

• The URL (Uniform Resource Locator, or Web address) may be temporary and may stop working in the near or far future
• Links designated as “permanent”, “persistent” or “stable” are designed specifically to remain active and usable over time
• Permanent links
  – Digital Object Identifiers (DOIs), which are built on the Handle System
    • actionable, interoperable, persistent link
  – Other types of permanent links
    • JSTOR (old)
    • EBSCO

Adapted from http://library.concordia.ca/services/users/faculty/permanentlinks.php
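A minimal sketch of what “actionable” can mean in practice: a DOI name becomes a link by prefixing it with a resolver address. The sketch assumes the public resolver at https://doi.org/ (not named on the slide), and `doi_to_url` is a hypothetical helper name. The sample DOI is one cited later in this lecture.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver is reachable at this address.
RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI name such as 'doi:10.1038/482308d' into a resolver URL."""
    name = doi.strip()
    # Drop an optional 'doi:' prefix, as used in many reference lists.
    if name.lower().startswith("doi:"):
        name = name[4:].strip()
    # DOI names start with the directory indicator '10.'.
    if not name.startswith("10."):
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode unusual characters but keep the '/' separator.
    return RESOLVER + quote(name, safe="/")


print(doi_to_url("doi:10.1038/482308d"))
```

The point of the design is that the citation stores only the DOI name; where the object currently lives is left to the resolver.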


DOI


DOI in References


Up to Here …

• … nothing new, or mostly
• Starting in 5th grade, we’ve been thoroughly trained in citing our “sources”
• Or have we?


CITING DATA


Neal (1999)

• http://www.jstor.org/stable/10.1086/209919


References


No Data …


The Problem

• I want to replicate Neal’s analysis
• Process:
  – Download NLSY data (latest!)
  – Read article, replicate his described analysis in software of my choice
  – Get results, compare
• What happens if the results are not the same?
  – Qualitatively
  – Quantitatively
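The “compare” step can be sketched as follows. This is only an illustration: it assumes a qualitative difference means a sign change and a quantitative difference means a relative gap above a chosen tolerance. The function name, the 5% tolerance, and all numbers are hypothetical and are not taken from Neal (1999).

```python
def compare_estimates(original, replicated, tolerance=0.05):
    """Label each coefficient as a match or as a quantitative/qualitative difference."""
    report = {}
    for name, old in original.items():
        new = replicated[name]
        if (old > 0) != (new > 0):
            # The sign flipped: the substantive conclusion changes.
            report[name] = "qualitative difference"
        elif abs(new - old) > tolerance * abs(old):
            # Same sign, but the magnitude moved by more than the tolerance.
            report[name] = "quantitative difference"
        else:
            report[name] = "match"
    return report


# Hypothetical numbers for illustration only.
original = {"tenure": 0.120, "experience": -0.030, "schooling": 0.080}
replicated = {"tenure": 0.121, "experience": -0.045, "schooling": -0.010}

for name, verdict in compare_estimates(original, replicated).items():
    print(name, verdict)
```

With these made-up inputs, the first coefficient matches, the second differs quantitatively, and the third differs qualitatively.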


Attempts to Falsify

“5. Every genuine test of a theory is an attempt to falsify it, or to refute it [...]
6. Confirming evidence should not count except when it is the result of a genuine test of the theory; and this means that it can be presented as a serious but unsuccessful attempt to falsify the theory. (I now speak in such cases of ‘corroborating evidence.’)”

Karl Popper, Science: Conjectures and Refutations, pg. 47


Replication Study

• Different results may be driven by
  – Differences in data
  – Differences in software
  – Differences in implementation
  – Errors by the original author…
• Start by keeping the setup the same as far as possible
  – Same data
  – Same software
  – Same implementation (programs)


Data for Replication

• What does “same data” imply?
  – Ability to find the data
  – Assurance that the data are, in fact, the same
• Data curation and citation are critical to the replication exercise
• Increasing impetus by funding agencies
  – NSF
  – NIH
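One common way to obtain “assurance that the data are, in fact, the same” is to compare a cryptographic checksum of the file in hand against one recorded when the data were released. The sketch below assumes such a checksum was published, which the slides do not claim for any particular dataset. The function names and the throwaway demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_data(path, published_checksum):
    """True if the file matches the checksum recorded at release time."""
    return sha256_of(path) == published_checksum.lower()


# Demonstration on a throwaway file standing in for a data extract.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"id,wage\n1,10.5\n")
    demo_path = tmp.name

published = sha256_of(demo_path)           # what a curator would record
unchanged = same_data(demo_path, published)

with open(demo_path, "ab") as handle:
    handle.write(b"2,11.0\n")              # a silent revision of the file
revised = same_data(demo_path, published)

os.remove(demo_path)
print(unchanged, revised)
```

A checksum only shows that two files are byte-identical; it says nothing about which version a paper actually used unless the citation records it.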


Not Futile

• Neal’s JOLE article is much cited (60 citations on RePEc, which is an undercount)
• Only instance of a substantive correction of a JOLE article (as of 2013-03-08, search term: erratum)
• Notable because the author publishing the erratum
  – Was referring to a (seminal) article from 5 years earlier
  – Was the editor-in-chief at the time


Example: JOLE

• “In the April 1999 issue of this Journal, I published an article entitled “The Complexity of Job Mobility among Young Men” (Journal of Labor Economics 17, no. 2 [1999]: 237–61). Recently, I began a dialogue with another researcher who was attempting to replicate the empirical results in that article. Through this dialogue, I learned that, for some workers, I erred in constructing my original counts of the number of employer changes within specific careers. I have corrected this error and have found that, given correct variable constructions, several empirical results differ quantitatively, although not qualitatively, from the results reported in the original article.”


Other Items to Note

• Not available if not a subscriber …
• The original author’s publication count increased by 1
• The discrepancy’s reporter (Ronni Pavan) was not an author on the erratum (Pavan did publish in the same journal in 2011)
• Neither the original data nor the corrected data (and the associated programs) are available from the journal (they are probably available from the author).
  – The original data are public-use NLSY data, referenced as “1979-92”
• The online version of the original article contains a link to the erratum (and is found when searching for “Erratum”)


Why Do We Cite Data This Way?

• Used to be sufficient
  – Data were the same as a book (see Margo’s Session 1)
  – If not, then they were rarely modified (punch cards, tapes)
  – Example: “NLSY 1979-1992” was a well-defined CDROM
• No longer sufficient
  – Where is the NLSY CDROM?
  – Which version does your library have?


Publications by the Census Bureau

• Decennial Census: SF1, SF2, SF3 … once every ten years

• Economic Census: Limited number of tables every 5 years

• LEHD: 4860 tables every three months


Improvements

• https://usa.ipums.org/usa/cite.shtml


These Are the Easy Cases

• NLSY, IPUMS-USA, ICPSR data
  – Public-use datasets or
  – Data distributor is also data custodian – guarantees availability of the data
• Many other public-use datasets
  – QCEW – no (can be defined by latest date on file, but not officially defined)
  – QWI – version.txt, but hidden
  – BDS – “yearly” releases (two listed, in fact three)


Data Availability Not a New Issue

• “In its first issue, the editor of Econometrica (1933), Ragnar Frisch, noted the importance of publishing data such that readers could fully explore empirical results. Publication of data, however, was discontinued early in the journal’s history. [...] The journal arrived full-circle in late 2004 when Econometrica adopted one of the more stringent policies on availability of data and programs.”

http://www.econometricsociety.org/submissions.asp#4 as cited in Anderson et al (2005)


Citing Restricted-use Data

• Abowd, Kramarz, Margolis (1999): “The data used in this paper are confidential but the authors’ access is not exclusive.”
• But
  – No current statistical agency has in place a way to uniquely cite data
  – Black box of restricted-access data enclaves
  – Worries about “leakage” of confidential information


Declining Role of Public-use Data

[Figure not reproduced; source: (Chetty, 2012)]


Increasing Use of Administrative Data

[Figure not reproduced; source: (Chetty, 2012)]


Not Just in Social Sciences

• Nature, 2012 “Many of the emerging ‘big data’ applications come from private sources that are inaccessible to other researchers. The data source may be hidden, compounding problems of verification, as well as concerns about the generality of the results.”

Huberman, Nature 482, 308 (16 February 2012), doi:10.1038/482308d


Verification Is Important

• Falsifying data
  – Andrew Wakefield (autism and vaccines)
  – Yoshitaka Fujii (fabricated data in 172 out of 249 papers)
• “Believe it or not: how much can we rely on published data on potential drug targets?” doi:10.1038/nrd3439-c1
  – Drug maker cannot replicate more than 20-25% of findings
• “Why Most Published Research Findings Are False” Ioannidis JPA (2005) doi:10.1371/journal.pmed.0020124


But …

• Even studies that worry about replication… do not provide their own data in a replicable way: “The questionnaire can be obtained from the authors.” (doi:10.1038/nrd3439-c1)


Other Approaches: Replication for a Fee

• “The Reproducibility Initiative takes advantage of Science Exchange’s existing network of more than 1,000 core facilities and commercial research organizations. Researchers submit their studies (…) [which] will attempt to replicate the studies for a fee.

• Submitting researchers will have to pay for the replication studies (…) one-tenth that of the original study (…) 5 percent transaction fee to Science Exchange.

• Participants will remain anonymous unless they choose to publish the replication results in a PLoS ONE Special Collection (source)


CORE ISSUES


Core Issues

a. Insufficient curation (starting with archiving)
b. No consistent way to learn about the data (metadata)
c. No way to reference data (unique identifiers)


Core Requirements for Data Access

• Royal Society (2012)
  – Accessible (a researcher can easily find it);
  – Intelligible (to various audiences);
  – Assessable (researchers are able to make judgments about or assess the quality of the data);
  – Usable (at minimum, by other scientists).


Identifying Data

“DOI names are assigned to any entity for use on digital networks. They are used to provide current information, including where they (or information about them) can be found on the Internet. Information about a digital object may change over time, including where to find it, but its DOI name will not change.”

http://datacite.org/whatisdoi, accessed on Sept 26, 2012.


Data Curation

• First step: make (some of) the data accessible
• Repositories/data custodians can address the issue for some types of data
• Generally provide a way to identify data (DOI)


Repositories

• DataONE (bio sciences)
• Dryad (ecological data)
• Dataverse (data extracts and programs accompanying papers)
• University Libraries (DSpace)
• UK Data Archive
• ICPSR (researcher-initiated surveys)
• FRED (St. Louis Fed, time-series)


Journals and Data Curation

• PLOS ONE
  – Policy
  – Limitations: data limited to 10MB…
• AEA
  – Policy
  – Example
• Econometrica
  – Policy


PLoS ONE

• http://www.plosone.org/static/policies#sharing
• “PLOS is committed to ensuring the availability of data and materials that underpin any articles published in PLOS journals.”

• “PLOS reserves the right to post corrections on articles, to contact authors' institutions and funders, and in extreme cases to withdraw publication, if restrictions on access to data or materials come to light after publication of a PLOS journal article.”


PLoS ONE (cont.)

• “(…)appropriate accession numbers or digital object identifiers (DOIs) published with the paper”

• Also guidelines for software (in particular when it is critical to the paper)


AEA Policy

• http://www.aeaweb.org/aer/data.php
• “Authors of accepted papers that contain empirical work, simulations, or experimental work must provide to the Review, prior to publication, the data, programs, and other details of the computations sufficient to permit replication.”


AEA Policy (cont.)

• http://www.aeaweb.org/aer/data.php
• For econometric and simulation papers, the minimum requirement should
  – include the data set(s) and programs used to run the final models,
  – plus a description of how previous intermediate data sets and programs were employed to create the final data set(s).
  – Authors are invited to submit these intermediate data files and programs as an option
  – […] as well as instructing a user on how replication can be conducted.


AEA Example: Abowd and Vilhuber (2012)

• Article: http://www.aeaweb.org/articles.php?doi=10.1257/aer.102.3.589
• Appendix
  – Description at http://www.aeaweb.org/aer/data/may2012/2012_2790_app.pdf (note: no DOI!)
  – Tried to be careful about referencing data, but no DOIs available on any of the data
    • Even our own data (National QWI, 38MB compressed)
  – Only generic programs
  – Final dataset was too large – not accepted.


Econometrica Policy

• http://www.econometricsociety.org/submissionprocedures.asp#replication
• “Econometrica has the policy that all empirical, experimental and simulation results must be replicable. Therefore, authors of accepted papers must submit data sets, programs, and information on empirical analysis, experiments and simulations that are needed for replication and some limited sensitivity analysis”
• Limited-access/proprietary datasets: “detailed data description and the programs used to generate the estimation data sets must be provided, as well as information of the source of the data so that researchers who do obtain access may be able to replicate the results”


Limitation of Current Repositories

• Do not (yet) provide full provenance
  – For lack of citation tools
  – For lack of guidance
• Limitations when using “big data”
  – Repository not the solution (suggested size: <10MB, although Econometrica has some in the 400MB range)
  – Unique references to data publication, onus on publisher?
• Do not work (well) for restricted-access data


Metadata Access

• Information about the data
• Can be
  – Variable names
  – Formats
  – Values
  – Distribution of values
  – Description
  – Provenance
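To make the list concrete, here is a small sketch that derives such metadata from a toy dataset. Everything in it is hypothetical: the function name `describe`, the record layout, and the three made-up rows. It is not the format used by IPUMS, ICPSR, or any agency.

```python
from collections import Counter


def describe(rows, description, provenance):
    """Build a simple metadata record for a list of row dictionaries."""
    variables = {}
    for name in rows[0]:                          # variable names
        values = [row[name] for row in rows]
        variables[name] = {
            "format": type(values[0]).__name__,   # storage format
            "values": sorted(set(values)),        # distinct values
            "distribution": dict(Counter(values)),  # frequency of each value
        }
    return {
        "description": description,
        "provenance": provenance,
        "variables": variables,
    }


# Made-up rows for illustration only.
rows = [
    {"state": "NY", "year": 2010},
    {"state": "NY", "year": 2011},
    {"state": "CA", "year": 2011},
]
metadata = describe(
    rows,
    description="Hypothetical extract for illustration",
    provenance="Made-up rows, no real source",
)
print(metadata["variables"]["state"])
```

Structured records like this can be browsed and linked to the data, which a PDF codebook cannot easily offer.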


Metadata on Public-use Data

• IPUMS: Structured/browsable metadata
• Most other sites:
  – PDF or ASCII files
  – Generally not linked to actual data
• Restricted-access data in Census RDC
  – Generic information outside
  – PDF once access granted


IPUMS Metadata


IPUMS Metadata (Details)


ICPSR Metadata on ATUS


BLS Metadata on ATUS


Current Metadata on Confidential Data

• Mostly by inference
• Census Bureau (CES):
  – links to public-use tabulations, documents (some by yours truly), codebooks (Snapshot S2004)
  – PDFs of detailed data in RDC
  – Codebooks for a few data sets at ICPSR
    • 1960 (ICPSR 21980); 1970 (21981); 1980 (21982); 1990 (21983); 2000 (21820)
• NCHS:
  – what is in questionnaire (PDF) but not in public-use codebook (PDF) might be accessible


Approaches and Solutions

• NCRN-Cornell node: Comprehensive Extensible Data Documentation and Access Repository (CED²AR)
  – Based on existing metadata standards (DDI) with possible extensions
  – Provide structured mechanism to synchronize confidential and public-use metadata
  – Assign DOI where needed
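The idea of deriving public-use metadata from a confidential master record can be sketched as below. This is not the CED²AR implementation and does not use DDI. It assumes, purely for illustration, a dictionary-based record in which each variable carries a hypothetical `confidential` flag.

```python
import copy


def prune(metadata):
    """Return a public-use copy of a metadata record.

    Variables flagged as confidential are dropped, and the flag itself
    is removed from the variables that remain.
    """
    public = copy.deepcopy(metadata)   # never modify the master record
    kept = {}
    for name, info in public["variables"].items():
        if info.get("confidential"):
            continue                   # drop the variable entirely
        info.pop("confidential", None)
        kept[name] = info
    public["variables"] = kept
    return public


# Hypothetical master record for illustration only.
full = {
    "title": "Hypothetical restricted-use file",
    "variables": {
        "industry": {"label": "Industry code", "confidential": False},
        "employer_id": {"label": "Employer identifier", "confidential": True},
    },
}
public = prune(full)
print(sorted(public["variables"]))
```

Keeping one master record and generating the public view from it is one way to keep the two versions synchronized, since edits are made in a single place.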


NCRN-Cornell


Pruning Confidential Metadata


End Result (mid-2013)


EASE OF ACCESS/REPLICATION


FRED: Federal Reserve Economic Data

• http://research.stlouisfed.org/fred2/
  – Excellent job in providing easy access to a large number of data series
  – Also provide archival versions (data series ‘as-of’)
  – Online graphs


Issues with FRED

• No link back to original data provider’s unique ID (in large part because there is nothing to link back to)

• Archival versions identified by “publication” date (may be imprecise at times)

• Incomplete …


Accessing FRED

• Demo using Stata’s “freduse”
• Program used in this demo:
  – stata-recession-fred.do


Stata


Stata Results


Accessing FRED

• Demo using R’s “quantmod”
• Program used in this demo:
  – r-recession-fred.R


Using quantmod


Results with R


FRED Issues

• Positives: it’s available!
• Trains people to use keys to look up online references
• Issues:
  – Not able to link to archival versions (always the latest version),
  – But does store local copies (-> repository, onus back on ad hoc data archiving)
  – How to cite the data?
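One possible answer to “How to cite the data?” is to record, alongside the local copy, the series key, the retrieval date, and the vintage if one is known. The sketch below is only an illustration of that bookkeeping. The citation layout is an assumption, not an official FRED format, and the series key and title are hypothetical placeholders.

```python
from datetime import date


def cite_series(provider, series_id, title, url, retrieved, vintage=None):
    """Format a data citation that pins down what was downloaded and when."""
    parts = [f"{provider}.", f"{title} [{series_id}]."]
    if vintage is not None:
        # The 'as-of' date of the archival version, when one is known.
        parts.append(f"Vintage {vintage.isoformat()}.")
    parts.append(f"Retrieved {retrieved.isoformat()} from {url}")
    return " ".join(parts)


citation = cite_series(
    provider="Federal Reserve Bank of St. Louis",
    series_id="EXAMPLE",                     # hypothetical series key
    title="Hypothetical example series",     # hypothetical title
    url="http://research.stlouisfed.org/fred2/",
    retrieved=date(2013, 3, 8),
    vintage=date(2013, 3, 1),
)
print(citation)
```

Without a unique identifier from the original data provider, the retrieval date and vintage are the only handles a replicator has.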


Tools and Replicability

• Tools help to do replicability analysis
  – Ability to reference URL of data (handle, DOI, etc.)
  – Ability to access data through URL
    • Even if/when run in restricted-access environments
