INFO 7470 / ECON 7400 / ILRLE 7400
Citing Literature, Citing Data
John M. Abowd and Lars Vilhuber
March 11, 2013
© John M. Abowd and Lars Vilhuber 2013, all rights reserved
CITING LITERATURE
Citing Literature
• Why? To prevent plagiarism, to establish provenance of ideas
• How? Why we cite the way we do: publishing cycles, uniqueness of sources
• Plagiarism: appropriating other people’s ideas
• Examples (Bruno Frey)
• Citing literature today: does it still work?
– Issues of versioning of articles
– Revisions/retractions/corrections
Why Do We Cite Literature?
• To give credit to the original authors of ideas
– Failing to give credit is plagiarism
• To allow readers to find the cited information
– Trace the evolution of ideas
– Document cited results
Plagiarism
• More easily detected nowadays
– http://plagiarism.repec.org/offenders.html
– http://ideas.repec.org/a/che/chepap/v20y2008i1p20-25.html
• Software
– http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/
– Turnitin
– AEA uses http://www.aeaweb.org/crosscheck.php
Source: http://www.elsevier.com/authors/author-rights-and-responsibilities#responsibilities via RePEc
Prominent Recent Examples of Plagiarism
• Bruno Frey
– AEA PP, others (see the FreyPlag Wiki, but also responses by Frey)
• German ministers
– Defense: Karl-Theodor Maria Nikolaus Johann Jacob Philipp Franz Joseph Sylvester Freiherr von und zu Guttenberg [German source]
– Education: Annette Schavan [German source]
• Russian presidents? [2006]
How Do We Cite?
• Multiple typographical standards
• Generally enough unique keys to correctly identify the source
• Current conventions driven to a large extent by the publishing model in effect through the end of the 20th century (see also Margo Anderson’s Session 1 on data publishing)
Examples
Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)
Examples
Based on and using images from http://bcs.bedfordstmartins.com/resdoc5e/RES5e_ch09_s1-0002.html (2013-03-08)
Declining uniqueness:
Online documents:
Permanent Links
• The URL (Uniform Resource Locator, or Web address) may be temporary and may not function in the near or far future
• Links designated as “permanent”, “persistent”, or “stable” are designed specifically to remain active and usable over time
• Permanent links
– Digital Object Identifiers (DOIs) (more formally: the Handle System)
• actionable, interoperable, persistent links
– Other types of permanent links
• JSTOR (old)
• EBSCO
Adapted from http://library.concordia.ca/services/users/faculty/permanentlinks.php
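To make “actionable” concrete, here is a minimal Python sketch (not from the original slides): a DOI is just a key, and prefixing the standard handle resolver address turns it into a working link. The resolver URL shown is the dx.doi.org service in use at the time.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into an actionable, persistent link.

    The dx.doi.org resolver redirects to the current location of the
    object, so a citation can carry the stable DOI rather than a
    fragile publisher URL.
    """
    return "http://dx.doi.org/" + doi.strip()

# The DOI underlying the JSTOR stable URL for Neal (1999), discussed later:
print(doi_to_url("10.1086/209919"))
# http://dx.doi.org/10.1086/209919
```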
DOI
DOI in References
Up to Here …
• … nothing new, or mostly
• Starting in 5th grade, we’ve been thoroughly trained in citing our “sources”
• Or have we?
CITING DATA
Neal (1999)
• http://www.jstor.org/stable/10.1086/209919
References
No Data …
The Problem
• I want to replicate Neal’s analysis
• Process:
– Download NLSY data (latest!)
– Read the article; replicate his described analysis in software of my choice
– Get results, compare
• What happens if the results are not the same?
– Qualitatively
– Quantitatively
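One way to make the qualitative/quantitative distinction concrete (an illustration, not part of the original deck; all numbers are hypothetical): compare a replicated coefficient to the published one, treating a sign flip as a qualitative difference and a change in magnitude as a quantitative one.

```python
def compare_estimates(published: float, replicated: float,
                      rel_tol: float = 0.01) -> str:
    """Classify the difference between two coefficient estimates.

    'same'         : within rel_tol of the published value
    'quantitative' : same sign, but the magnitudes differ
    'qualitative'  : opposite signs, so the substantive conclusion changes
    """
    if published * replicated < 0:
        return "qualitative"
    if abs(published - replicated) <= rel_tol * abs(published):
        return "same"
    return "quantitative"

# Hypothetical coefficients, for illustration only:
print(compare_estimates(0.25, 0.2501))  # same
print(compare_estimates(0.25, 0.31))    # quantitative
print(compare_estimates(0.25, -0.05))   # qualitative
```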
Attempts to Falsify
“5. Every genuine test of a theory is an attempt to falsify it, or to refute it. [...]
6. Confirming evidence should not count except when it is the result of a genuine test of the theory; and this means that it can be presented as a serious but unsuccessful attempt to falsify the theory. (I now speak in such cases of ‘corroborating evidence.’)”
Karl Popper, Science: Conjectures and Refutations, p. 47
Replication Study
• Different results may be driven by
– Differences in data
– Differences in software
– Differences in implementation
– Errors by the original author…
• Start by keeping as much of the setup as possible the same
– Same data
– Same software
– Same implementation (programs)
Data for Replication
• What does “same data” imply?
– The ability to find the data
– Assurance that the data are, in fact, the same
• Data curation and citation are critical to the replication exercise
• Increasing impetus from funding agencies
– NSF
– NIH
Not Futile
• The Neal JOLE article is much cited (60 citations on RePEc, an undercount)
• The only instance of a substantive correction of a JOLE article (as of 2013-03-08, search term: erratum)
• Notable because the author publishing the erratum
– was referring to a (seminal) article from 5 years earlier
– was the editor-in-chief at the time
Example: JOLE
• “In the April 1999 issue of this Journal, I published an article entitled “The Complexity of Job Mobility among Young Men” (Journal of Labor Economics 17, no. 2 [1999]: 237–61). Recently, I began a dialogue with another researcher who was attempting to replicate the empirical results in that article. Through this dialogue, I learned that, for some workers, I erred in constructing my original counts of the number of employer changes within specific careers.1 I have corrected this error and have found that, given correct variable constructions, several empirical results differ quantitatively, although not qualitatively, from the results reported in the original article.”
Other Items to Note
• Not available if not a subscriber…
• The original author’s publication count increased by 1
• The discrepancy’s reporter (Ronni Pavan) was not an author on the erratum (Pavan did publish in the same journal in 2011)
• Neither the original data nor the corrected data (and the associated programs) are available from the journal (they are probably available from the author)
– The original data are public-use NLSY data, referenced as “1979-92”
• The online version of the original article contains a link to the erratum (and is found when searching for “Erratum”)
Why Do We Cite Data This Way?
• Used to be sufficient
– Data were the same as a book (see Margo’s Session 1)
– If not, then they were rarely modified (punch cards, tapes)
– Example: “NLSY 1979-1992” was a well-defined CD-ROM
• No longer sufficient
– Where is the NLSY CD-ROM?
– Which version does your library have?
Publications by the Census Bureau
• Decennial Census: SF1, SF2, SF3 … once every ten years
• Economic Census: Limited number of tables every 5 years
• LEHD: 4860 tables every three months
Improvements
• https://usa.ipums.org/usa/cite.shtml
These Are the Easy Cases
• NLSY, IPUMS-USA, ICPSR data
– Public-use datasets, or
– The data distributor is also the data custodian, guaranteeing availability of the data
• Many other public-use datasets
– QCEW: no version (can be defined by the latest date on file, but not officially defined)
– QWI: version.txt, but hidden
– BDS: “yearly” releases (two listed; in fact three)
Data Availability Not a New Issue
• “In its first issue, the editor of Econometrica (1933), Ragnar Frisch, noted the importance of publishing data such that readers could fully explore empirical results. Publication of data, however, was discontinued early in the journal’s history. [...] The journal arrived full-circle in late 2004 when Econometrica adopted one of the more stringent policies on availability of data and programs.”
http://www.econometricsociety.org/submissions.asp#4 as cited in Anderson et al (2005)
Citing Restricted-use Data
• Abowd, Kramarz, Margolis (1999): “The data used in this paper are confidential but the authors’ access is not exclusive.”
• But
– No statistical agency currently has in place a way to uniquely cite data
– The black box of restricted-access data enclaves
– Worries about “leakage” of confidential information
Declining Role of Public-use Data
(Chetty, 2012)
Increasing Use of Administrative Data
(Chetty, 2012)
Not Just in Social Sciences
• Nature, 2012 “Many of the emerging ‘big data’ applications come from private sources that are inaccessible to other researchers. The data source may be hidden, compounding problems of verification, as well as concerns about the generality of the results.”
Huberman, Nature 482, 308 (16 February 2012), doi:10.1038/482308d
Verification Is Important
• Falsifying data
– Andrew Wakefield (autism and vaccines)
– Yoshitaka Fujii (fabricated data in 172 out of 249 papers)
• “Believe it or not: how much can we rely on published data on potential drug targets?” doi:10.1038/nrd3439-c1
– Drug maker cannot replicate more than 20-25% of findings
• “Why Most Published Research Findings Are False”, Ioannidis JPA (2005) doi:10.1371/journal.pmed.0020124
But …
• Even studies that worry about replication… do not provide their own data in a replicable way: “The questionnaire can be obtained from the authors.” (doi:10.1038/nrd3439-c1)
Other Approaches: Replication for a Fee
• “The Reproducibility Initiative takes advantage of Science Exchange’s existing network of more than 1,000 core facilities and commercial research organizations. Researchers submit their studies (…) [which] will attempt to replicate the studies for a fee.”
• “Submitting researchers will have to pay for the replication studies (…) one-tenth that of the original study (…) 5 percent transaction fee to Science Exchange.”
• “Participants will remain anonymous unless they choose to publish the replication results in a PLoS ONE Special Collection” (source)
CORE ISSUES
Core Issues
a. Insufficient curation (starting with archiving)
b. No consistent way to learn about the data (metadata)
c. No way to reference data (unique identifiers)
Core Requirements for Data Access
• Royal Society (2012)
– Accessible (a researcher can easily find it)
– Intelligible (to various audiences)
– Assessable (researchers are able to make judgments about, or assess the quality of, the data)
– Usable (at a minimum, by other scientists)
Identifying Data
“DOI names are assigned to any entity for use on digital networks. They are used to provide current information, including where they (or information about them) can be found on the Internet. Information about a digital object may change over time, including where to find it, but its DOI name will not change.”
http://datacite.org/whatisdoi, accessed on Sept 26, 2012.
Data Curation
• First step: make (some of) the data accessible
• Repositories/data custodians can address the issue for some types of data
• They generally provide a way to identify data (DOI)
Repositories
• DataOne (bio sciences)
• Dryad (ecological data)
• DataVerse (data extracts and programs accompanying papers)
• University libraries (DSpace)
• UK Data Archive
• ICPSR (researcher-initiated surveys)
• FRED (St. Louis Fed, time series)
Journals and Data Curation
• PLOS ONE
– Policy
– Limitations: data limited to 10MB…
• AEA
– Policy
– Example
• Econometrica
– Policy
PLoS ONE
• http://www.plosone.org/static/policies#sharing
• “PLOS is committed to ensuring the availability of data and materials that underpin any articles published in PLOS journals.”
• “PLOS reserves the right to post corrections on articles, to contact authors' institutions and funders, and in extreme cases to withdraw publication, if restrictions on access to data or materials come to light after publication of a PLOS journal article.”
PLoS ONE (cont.)
• “(…) appropriate accession numbers or digital object identifiers (DOIs) published with the paper”
• Also guidelines for software (in particular when it is critical to the paper)
AEA Policy
• http://www.aeaweb.org/aer/data.php
• “Authors of accepted papers that contain empirical work, simulations, or experimental work must provide to the Review, prior to publication, the data, programs, and other details of the computations sufficient to permit replication.”
AEA Policy (cont.)
• http://www.aeaweb.org/aer/data.php
• For econometric and simulation papers, the minimum requirement should
– include the data set(s) and programs used to run the final models,
– plus a description of how previous intermediate data sets and programs were employed to create the final data set(s).
– Authors are invited to submit these intermediate data files and programs as an option
– […] as well as instructing a user on how replication can be conducted.
AEA Example: Abowd and Vilhuber (2012)
• Article: http://www.aeaweb.org/articles.php?doi=10.1257/aer.102.3.589
• Appendix
– Description at http://www.aeaweb.org/aer/data/may2012/2012_2790_app.pdf (note: no DOI!)
– We tried to be careful about referencing data, but no DOIs were available for any of the data
• Even our own data (National QWI, 38MB compressed)
– Only generic programs
– The final dataset was too large and was not accepted.
Econometrica Policy
• http://www.econometricsociety.org/submissionprocedures.asp#replication
• “Econometrica has the policy that all empirical, experimental and simulation results must be replicable.”
• “Therefore, authors of accepted papers must submit data sets, programs, and information on empirical analysis, experiments and simulations that are needed for replication and some limited sensitivity analysis.”
• Limited-access/proprietary datasets: “detailed data description and the programs used to generate the estimation data sets must be provided, as well as information of the source of the data so that researchers who do obtain access may be able to replicate the results”
Limitations of Current Repositories
• Do not (yet) provide full provenance
– For lack of citation tools
– For lack of guidance
• Limitations when using “big data”
– A repository is not the solution (suggested size: <10MB, although Econometrica has some in the 400MB range)
– Unique references to data publications: onus on the publisher?
• Do not work (well) for restricted-access data
Metadata Access
• Information about the data
• Can be
– Variable names
– Formats
– Values
– Distribution of values
– Description
– Provenance
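The fields listed above can be made machine-readable. A minimal sketch in Python (illustrative only; the variable name, coding, and provenance string are hypothetical, not taken from any real codebook):

```python
from dataclasses import dataclass, field

@dataclass
class VariableMetadata:
    """A minimal machine-readable metadata record for one variable."""
    name: str                                   # variable name
    fmt: str                                    # storage format
    description: str                            # human-readable label
    values: dict = field(default_factory=dict)  # code -> meaning
    provenance: str = ""                        # where the variable came from

# A hypothetical variable, for illustration:
educ = VariableMetadata(
    name="EDUC",
    fmt="int8",
    description="Highest level of schooling completed (hypothetical coding)",
    values={0: "none", 1: "primary", 2: "secondary", 3: "college"},
    provenance="derived from survey item Q12 (illustrative)",
)
print(educ.name, len(educ.values))
```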
Metadata on Public-use Data
• IPUMS: structured/browsable metadata
• Most other sites:
– PDF or ASCII files
– Generally not linked to actual data
• Restricted-access data in a Census RDC
– Generic information outside
– PDF once access is granted
IPUMS Metadata
IPUMS Metadata (Details)
ICPSR Metadata on ATUS
BLS Metadata on ATUS
Current Metadata on Confidential Data
• Mostly by inference
• Census Bureau (CES):
– Links to public-use tabulations, documents (some by yours truly), codebooks (Snapshot S2004)
– PDFs of detailed data in the RDC
– Codebooks for a few data sets at ICPSR
• 1960 (ICPSR 21980); 1970 (21981); 1980 (21982); 1990 (21983); 2000 (21820)
• NCHS:
– What is in the questionnaire (PDF) but not in the public-use codebook (PDF) might be accessible
Approaches and Solutions
• NCRN-Cornell node: Comprehensive Extensible Data Documentation and Access Repository (CED²AR)
– Based on existing metadata standards (DDI) with possible extensions
– Provide a structured mechanism to synchronize confidential and public-use metadata
– Assign DOIs where needed
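Because DDI is XML-based, structured metadata of this kind can be extracted programmatically. A simplified sketch (the fragment below is loosely modeled on the DDI Codebook `var`/`labl` elements, with namespaces and most attributes omitted; the variables shown are illustrative):

```python
import xml.etree.ElementTree as ET

# A simplified, DDI-like codebook fragment (illustrative content).
ddi = """
<codeBook>
  <dataDscr>
    <var name="AGE"><labl>Age in years</labl></var>
    <var name="SEX"><labl>Sex of respondent</labl></var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(ddi)
# Extract a variable-name -> label mapping: the kind of structured,
# searchable metadata a repository such as CED²AR could expose.
variables = {v.get("name"): v.findtext("labl") for v in root.iter("var")}
print(variables)
# {'AGE': 'Age in years', 'SEX': 'Sex of respondent'}
```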
NCRN-Cornell
Pruning Confidential Metadata
End Result (mid-2013)
EASE OF ACCESS/REPLICATION
FRED: Federal Reserve Economic Data
• http://research.stlouisfed.org/fred2/
– Excellent job in providing easy access to a large number of data series
– Also provides archival versions (data series ‘as-of’ a date)
– Online graphs
Issues with FRED
• No link back to original data provider’s unique ID (in large part because there is nothing to link back to)
• Archival versions identified by “publication” date (may be imprecise at times)
• Incomplete …
Accessing FRED
• Demo using Stata’s “freduse”
• Program used in this demo:
– stata-recession-fred.do
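The same kind of keyed access works from other languages. A hedged Python sketch (not the deck’s Stata demo; the CSV-download URL pattern is an assumption based on FRED’s public download links of the time, and the sample rows below are illustrative, not real data):

```python
import csv
import io

def fred_csv_url(series_id: str) -> str:
    """Construct the CSV download URL for a FRED series.

    The URL pattern is an assumption based on FRED's public
    "Download Data" links; the series ID is the lookup key.
    """
    return ("http://research.stlouisfed.org/fred2/series/"
            + series_id + "/downloaddata/" + series_id + ".csv")

def parse_fred_csv(text: str) -> list:
    """Parse FRED's two-column DATE,VALUE CSV into (date, float) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["DATE"], float(row["VALUE"])) for row in reader]

# Sample rows in FRED's CSV layout (illustrative values):
sample = "DATE,VALUE\n2012-01-01,8.3\n2012-02-01,8.3\n"
print(fred_csv_url("UNRATE"))
print(parse_fred_csv(sample))
# [('2012-01-01', 8.3), ('2012-02-01', 8.3)]
```

Note that, as the later slides point out, a URL built this way always returns the latest vintage of the series; it is a lookup key, not a citation to a fixed version of the data.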
Stata
Stata Results
Accessing FRED
• Demo using R’s “quantmod”
• Program used in this demo:
– r-recession-fred.R
Using quantmod
Results with R
FRED Issues
• Positives: it’s available!
• Trains people to use keys to look up online references
• Issues:
– Not able to link to archival versions (always serves the latest version),
– but it does store local copies (→ repository; onus back on ad hoc data archiving)
– How to cite the data?
Tools and Replicability
• Tools help to do replicability analysis
– Ability to reference the URL of the data (handle, DOI, etc.)
– Ability to access data through a URL
• Even if/when run in restricted-access environments