strata 2012: big data and bibliometrics
Description: Mendeley Talk
Big Data and Bibliometrics
William Gunn
Head of Academic [email protected]
@mrgunn
Crowdsourcing the World’s Largest Open Database of Research
“The state of knowledge of the human race is sitting in the scientists’ computers, and is currently not shared […] We need to get it unlocked so we can tackle those huge problems.”
A Big Problem
https://secure.flickr.com/photos/mharvey75/2493468041/
https://secure.flickr.com/photos/bfishadow/4237025430/
$31.2B $????
Journal Impact Factor
Number of citations / Citable items = Impact Factor
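A minimal sketch of the ratio above, for illustration only. The function name and the journal figures are made up; in the usual two-year form, the numerator counts citations received in one year to items the journal published in the two preceding years, and the denominator counts the citable items from those two years.

```python
def impact_factor(citations: int, citable_items: int) -> float:
    """Citations received divided by the number of citable items."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


# Hypothetical journal: 1,200 citations to 400 citable items.
print(impact_factor(1200, 400))  # 3.0

# Same citations, but 100 items reclassified as non-citable: the score rises.
print(impact_factor(1200, 300))  # 4.0
```

The second call shows why the definition of a citable item matters: shrinking the denominator raises the score without a single extra citation.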
It's inaccurate
Problems with Impact Factor
http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0030291
“Thomson Scientific, the sole arbiter of the impact factor game...has no obligation to be accountable to... the authors and readers of scientific research. During discussions with Thomson Scientific over which article types in PLoS Medicine the company deems as “citable,” it became clear that the process of determining a journal's impact factor is unscientific and arbitrary.”
Highly Tweeted articles are 11x more likely to be highly cited. (Eysenbach 2011) http://www.jmir.org/2011/4/e123/
The higher the impact factor, the more likely the research is to be retracted, partly due to intense competition.
http://bjoern.brembs.net/news766.html.11
What matters is who is reading your work!
https://secure.flickr.com/photos/fireflythegreat/2845637227/
Building the black box
Watch research as it happens in real-time.
Mendeley makes science more collaborative and transparent:
Mendeley extracts research data…
...and aggregates research data in the cloud
Install Mendeley Desktop
Collecting rich signals from domain experts.
160 million documents uploaded
1.7 million users
Cambridge
Stanford University
MIT
Imperial College London
University of Oxford
Harvard University
University of Michigan
University College London
University of California at Berkeley
Columbia University
The world's largest open database of research
Rich user profile data
Big data problems we've had to solve
Metadata extraction – We trained a 2-stage SVM that achieves a precision of 0.91 and a recall of 0.94, beating all other approaches.
Deduplication – To build the web catalog, we've got to cluster and de-duplicate 17TB+ of documents daily.
Author name disambiguation – aka the “big Wang problem”. We tried a variety of approaches, settling on a method of hierarchical agglomerative clustering.
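To make the clustering idea concrete, here is a small sketch of hierarchical agglomerative clustering applied to author mentions. It is not Mendeley's implementation: the co-author feature sets, the Jaccard similarity, the average-linkage rule, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1, c2, features):
    """Average linkage: mean pairwise similarity between two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(features[i], features[j]) for i, j in pairs) / len(pairs)


def agglomerate(features, threshold=0.3):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        (a, b), best = max(
            (((x, y), cluster_similarity(clusters[x], clusters[y], features))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # a < b always holds here, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical records all signed "J. Wang", each described by its co-authors.
mentions = [
    {"li", "chen", "zhao"},
    {"li", "chen"},
    {"smith", "jones"},
    {"smith", "jones", "patel"},
]
print(agglomerate(mentions))  # [[0, 1], [2, 3]]
```

Each resulting cluster is treated as one real person. The threshold decides where merging stops, so it trades off splitting one author into several against lumping different authors together.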
Solving our big data problems
We needed something that provided scalable processing as well as data storage, which made HDFS + MapReduce on AWS a pretty obvious choice.
Stats
Search
good user experience
enterprise-class search with easy setup
vibrant open source community
Scale
SSDs for catalog search
Caching index in RAM
Deduplication
PDFs are easier than pictures or audio because the descriptors are already text strings
OCR doesn't work well
Hashing works for trivial modifications, but not for discriminating pre-prints vs. post-prints.
You don't necessarily know when you don't have a complete record.
File hash check (SHA-1)
Identifier check (e.g. PubMed ID)
Document fingerprint (full text)
Metadata similarity check
Update individual article page
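The staged check above can be sketched as a cascade that tries the cheapest test first and stops at the first match. Only the stage order, SHA-1, and the PubMed identifier come from the slide. The record layout, the word-shingle fingerprint, the `difflib` title comparison, and both cut-off values are assumptions made for this example.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(text: str, k: int = 5) -> frozenset:
    """Hashed k-word shingles; an assumed stand-in for a full-text fingerprint."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))
    return frozenset(hashlib.sha1(s.encode()).hexdigest()[:12] for s in shingles)


def overlap(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two fingerprints."""
    return len(a & b) / len(a | b) if a | b else 0.0


def make_record(data: bytes, text: str, title: str, pmid=None) -> dict:
    """Hypothetical record layout used only in this sketch."""
    return {"sha1": hashlib.sha1(data).hexdigest(), "fp": fingerprint(text),
            "title": title, "pmid": pmid}


def find_match(doc, catalog, text_cut=0.8, meta_cut=0.9):
    """Return (entry, stage) for the first stage that finds a match, else (None, None)."""
    stages = [
        ("file hash", lambda e: doc["sha1"] == e["sha1"]),
        ("identifier", lambda e: doc["pmid"] is not None and doc["pmid"] == e["pmid"]),
        ("fingerprint", lambda e: overlap(doc["fp"], e["fp"]) >= text_cut),
        ("metadata", lambda e: SequenceMatcher(
            None, doc["title"].lower(), e["title"].lower()).ratio() >= meta_cut),
    ]
    for name, matches in stages:
        for entry in catalog:
            if matches(entry):
                return entry, name  # caller would update this entry's article page
    return None, None


body = "deep sequencing of the gut microbiome in mice"
title = "Deep sequencing of the gut microbiome"
catalog = [make_record(b"pdf-bytes-v1", body, title, pmid="100")]

same_file = make_record(b"pdf-bytes-v1", body, title)
postprint = make_record(b"pdf-bytes-v2", "a revised text body with new results", title, pmid="100")
resaved = make_record(b"pdf-bytes-v3", body, "Untitled")
scan = make_record(b"pdf-bytes-v4", "scanned copy with unrelated extracted words only", title + ".")
other = make_record(b"pdf-bytes-v5", "quantum dots for solar cells", "Quantum dots in photovoltaics")

for doc in (same_file, postprint, resaved, scan, other):
    print(find_match(doc, catalog)[1])
```

Ordering the stages from cheapest to most expensive means most duplicates are caught by an exact lookup, and the fuzzy comparisons only run on what is left.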
Recommendations
“searches you haven't run yet”
Google Analytics for research
Mendeley was 3rd largest UK OpenURL referrer in April 2011, beating Medline and Scopus, with 34k click-throughs.
http://dev.mendeley.com
Tim O’Reilly, O’Reilly Media
James Powell, CTO, Thomson Reuters
Juan Enriquez, MD, Excel Venture Management
John Wilbanks, VP Science, Creative Commons
Werner Vogels, CTO, Amazon.com
Mendeley/PLoS API Binary Battle
$16,001 for the best app
Select relation:
supports
refutes
complements
uses same method
...
Result: A human-curated, constantly evolving semantic article database