Strata 2012: Big Data and Bibliometrics

Big Data and Bibliometrics William Gunn Head of Academic Outreach [email protected] @mrgunn Crowdsourcing the World’s Largest Open Database of Research

Upload: william-gunn

Post on 03-Jul-2015


DESCRIPTION

Mendeley Talk

TRANSCRIPT

Page 1: Strata 2012: Big Data and Bibliometrics

Big Data and Bibliometrics

William Gunn
Head of Academic Outreach
[email protected]

@mrgunn

Crowdsourcing the World’s Largest Open Database of Research

Page 2: Strata 2012: Big Data and Bibliometrics

“The state of knowledge of the human race is sitting in the scientists’ computers, and is currently not shared […] We need to get it unlocked so we can tackle those huge problems.”

A Big Problem

Page 3: Strata 2012: Big Data and Bibliometrics

https://secure.flickr.com/photos/mharvey75/2493468041/
https://secure.flickr.com/photos/bfishadow/4237025430/

$31.2B $????

Page 4: Strata 2012: Big Data and Bibliometrics

Journal Impact Factor

Impact Factor = Number of citations / Citable items
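As a worked illustration of the ratio above, here is a minimal Python sketch. The slide only gives citations divided by citable items; the two-year window used here (citations received in a given year to items published in the two preceding years) comes from the standard definition of the Journal Impact Factor, not from the slide. The function name and the example counts are hypothetical.

```python
def impact_factor(citations_by_pub_year, citable_items_by_pub_year, year):
    """Compute a journal impact factor for `year`.

    citations_by_pub_year: citations received during `year`, keyed by the
        publication year of the cited item (hypothetical input shape).
    citable_items_by_pub_year: number of citable items, keyed by publication year.
    """
    # Standard two-year window: items published in the two preceding years.
    window = (year - 1, year - 2)
    citations = sum(citations_by_pub_year.get(y, 0) for y in window)
    citable = sum(citable_items_by_pub_year.get(y, 0) for y in window)
    if citable == 0:
        raise ValueError("no citable items in the two-year window")
    return citations / citable


# Made-up counts for a fictional journal.
citations_2011 = {2010: 300, 2009: 200}
citable_items = {2010: 120, 2009: 80}
print(impact_factor(citations_2011, citable_items, 2011))
```

The denominator is the contested part: which article types count as "citable" changes the result even when the citation count stays fixed.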

Page 5: Strata 2012: Big Data and Bibliometrics

It's inaccurate

Problems with Impact Factor

Page 6: Strata 2012: Big Data and Bibliometrics

Problems with Impact Factor

Page 7: Strata 2012: Big Data and Bibliometrics

http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0030291

Problems with Impact Factor

“Thomson Scientific, the sole arbiter of the impact factor game...has no obligation to be accountable to... the authors and readers of scientific research. During discussions with Thomson Scientific over which article types in PLoS Medicine the company deems as “citable,” it became clear that the process of determining a journal's impact factor is unscientific and arbitrary.”

Page 8: Strata 2012: Big Data and Bibliometrics

Highly Tweeted articles are 11x more likely to be highly cited. (Eysenbach 2011) http://www.jmir.org/2011/4/e123/

The higher the impact factor, the more likely the research is to be retracted, partly due to intense competition. http://bjoern.brembs.net/news766.html.11

What matters is who is reading your work!

Page 9: Strata 2012: Big Data and Bibliometrics

https://secure.flickr.com/photos/fireflythegreat/2845637227/

Page 10: Strata 2012: Big Data and Bibliometrics

Building the black box

Watch research as it happens, in real time.

Page 11: Strata 2012: Big Data and Bibliometrics

Mendeley makes science more collaborative and transparent:

Install Mendeley Desktop

Mendeley extracts research data…

...and aggregates research data in the cloud

Collecting rich signals from domain experts.

Page 12: Strata 2012: Big Data and Bibliometrics

160 million documents uploaded
1.7 million users

Cambridge
Stanford University
MIT
Imperial College London
University of Oxford
Harvard University
University of Michigan
University College London
University of California at Berkeley
Columbia University

The world's largest open database of research

Page 13: Strata 2012: Big Data and Bibliometrics

Rich user profile data

Page 14: Strata 2012: Big Data and Bibliometrics

Big data problems we've had to solve

Metadata extraction – We trained a two-stage SVM that achieves a precision of 0.91 and a recall of 0.94, beating all other approaches.

Deduplication – To build the web catalog, we have to cluster and de-duplicate more than 17 TB of documents daily.

Author name disambiguation – aka the “big Wang problem”. We tried a variety of approaches, settling on hierarchical agglomerative clustering.
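To make the author name disambiguation idea concrete, here is a minimal sketch of hierarchical agglomerative clustering in plain Python. The slides do not say which features, distance measure, or linkage were used; the feature sets (co-authors and field), the Jaccard distance, the average linkage, and the 0.5 cut-off below are all assumptions for illustration only.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two feature sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(features, threshold):
    """Merge the two closest clusters repeatedly until none is within `threshold`.

    features: list of feature sets, one per authorship record.
    Returns a list of clusters, each a list of record indices.
    """
    clusters = [[i] for i in range(len(features))]

    def average_linkage(c1, c2):
        dists = [jaccard_distance(features[i], features[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b by construction).
        (a, b), best = min(
            (((x, y), average_linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Four hypothetical papers all signed "J. Wang"; features are made up.
records = [
    {"coauthor:li", "coauthor:chen", "field:genomics"},
    {"coauthor:li", "coauthor:chen", "field:genomics", "coauthor:zhao"},
    {"coauthor:smith", "field:optics"},
    {"coauthor:smith", "field:optics", "coauthor:jones"},
]
print(agglomerative_clusters(records, threshold=0.5))
```

Each resulting cluster stands for one presumed real-world author. This quadratic pairwise search is only workable for the small groups of records that share a name; it is not a statement about how the production system scales.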

Page 15: Strata 2012: Big Data and Bibliometrics

Solving our big data problems

We needed something that provided scalable processing as well as data storage, which made HDFS + MapReduce on AWS a pretty obvious choice.
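For readers unfamiliar with the MapReduce model, the following single-process Python sketch shows its three phases (map, shuffle, reduce) on a toy job that counts readers per document. It is only an illustration of the programming model, not Hadoop code and not Mendeley's actual job; the record layout and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (document_id, 1) for each (user_id, document_id) library entry."""
    _user_id, doc_id = record
    yield doc_id, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one document."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up library entries: (user, document).
library_entries = [("u1", "docA"), ("u2", "docA"), ("u1", "docB")]
print(run_job(library_entries))
```

In a real cluster the map and reduce functions run in parallel on many machines and the input is read from HDFS; the structure of the computation is the same.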

Stats

Search
good user experience
enterprise-class search with easy setup
vibrant open source community

Scale
SSDs for catalog search
Caching index in RAM

Page 16: Strata 2012: Big Data and Bibliometrics
Page 17: Strata 2012: Big Data and Bibliometrics

Solving our big data problems

Deduplication

PDFs are easier than pictures or audio because the descriptors are already text strings.

OCR doesn't work well

Hashing works for trivial modifications, but not for discriminating pre-prints vs. post-prints.

You don't necessarily know when you don't have a complete record.

File hash check (SHA-1)
Identifier check (e.g. PubMed ID)
Document fingerprint (full text)
Metadata similarity check
Update individual article page
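The cascade of checks listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the production pipeline: the record fields (`file_bytes`, `pmid`, `fulltext`, `title`), the normalised-text fingerprint, the title-only metadata comparison, and the 0.9 threshold are all hypothetical stand-ins for whatever the real system uses.

```python
import hashlib
import re
from difflib import SequenceMatcher


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fingerprint(fulltext: str) -> str:
    """Stand-in fingerprint: SHA-1 of lowercased text with punctuation and spacing collapsed."""
    normalized = re.sub(r"[^a-z0-9]+", " ", fulltext.lower()).strip()
    return sha1_hex(normalized.encode("utf-8"))


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_document(doc, catalog, title_threshold=0.9):
    """Return (catalog_id, stage) for the first stage that finds a match, else None."""
    stages = [
        # Cheapest and strictest check first, fuzziest check last.
        ("file hash", lambda e: sha1_hex(e["file_bytes"]) == sha1_hex(doc["file_bytes"])),
        ("identifier", lambda e: doc.get("pmid") is not None and e.get("pmid") == doc.get("pmid")),
        ("fingerprint", lambda e: fingerprint(e["fulltext"]) == fingerprint(doc["fulltext"])),
        ("metadata", lambda e: title_similarity(e["title"], doc["title"]) >= title_threshold),
    ]
    for stage, matches in stages:
        for entry_id, entry in catalog.items():
            if matches(entry):
                return entry_id, stage
    return None


# Made-up catalog entry and upload.
catalog = {
    "a1": {
        "file_bytes": b"%PDF original",
        "pmid": "12345",
        "fulltext": "Deep  Learning for cats.",
        "title": "Deep learning for cats",
    }
}
upload = {
    "file_bytes": b"%PDF re-saved copy",
    "pmid": None,
    "fulltext": "deep learning for CATS",
    "title": "Deep learning for cats",
}
print(match_document(upload, catalog))
```

A match at any stage would lead to updating the existing article page; a `None` result would mean creating a new catalog entry. An exact hash of normalised text shares the limitation noted above for hashing: it cannot tell a pre-print from a post-print, which is why the fuzzier metadata check comes last.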

Page 18: Strata 2012: Big Data and Bibliometrics
Page 19: Strata 2012: Big Data and Bibliometrics
Page 20: Strata 2012: Big Data and Bibliometrics

Recommendations
“searches you haven't run yet”

Page 21: Strata 2012: Big Data and Bibliometrics
Page 22: Strata 2012: Big Data and Bibliometrics

Google Analytics for research

Page 23: Strata 2012: Big Data and Bibliometrics
Page 24: Strata 2012: Big Data and Bibliometrics
Page 25: Strata 2012: Big Data and Bibliometrics

Mendeley was the third-largest UK OpenURL referrer in April 2011, ahead of Medline and Scopus, with 34k click-throughs.

Page 26: Strata 2012: Big Data and Bibliometrics
Page 27: Strata 2012: Big Data and Bibliometrics
Page 28: Strata 2012: Big Data and Bibliometrics

http://dev.mendeley.com

Tim O’Reilly, O’Reilly Media

James Powell, CTO, Thomson Reuters

Juan Enriquez, MD, Excel Venture Management

John Wilbanks, VP Science, Creative Commons

Werner Vogels, CTO, Amazon.com

Mendeley/PLoS API Binary Battle
$16,001 for the best app

Page 29: Strata 2012: Big Data and Bibliometrics
Page 30: Strata 2012: Big Data and Bibliometrics
Page 31: Strata 2012: Big Data and Bibliometrics
Page 32: Strata 2012: Big Data and Bibliometrics
Page 33: Strata 2012: Big Data and Bibliometrics
Page 34: Strata 2012: Big Data and Bibliometrics

Select relation:

supports
refutes
complements
uses same method
...

Result: A human-curated, constantly evolving semantic article database
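One way to picture such a database is as a set of typed links between articles, each recorded by a human curator. The Python sketch below is an assumption-laden illustration: the class names, the `curator` field, and the identifiers are hypothetical, and only the four relation types named on the slide are modelled (the slide's "..." suggests there are more).

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    COMPLEMENTS = "complements"
    USES_SAME_METHOD = "uses same method"


@dataclass(frozen=True)
class Link:
    source: str        # identifier of the article making the claim
    relation: Relation
    target: str        # identifier of the article it relates to
    curator: str       # who recorded the relation (hypothetical field)


class RelationStore:
    def __init__(self):
        self._links = set()

    def add(self, link: Link) -> None:
        self._links.add(link)

    def related(self, article_id: str, relation: Relation) -> list:
        """Targets linked from `article_id` by the given relation, sorted."""
        return sorted(
            link.target
            for link in self._links
            if link.source == article_id and link.relation is relation
        )


store = RelationStore()
store.add(Link("doi:A", Relation.SUPPORTS, "doi:B", "alice"))
store.add(Link("doi:A", Relation.REFUTES, "doi:C", "bob"))
print(store.related("doi:A", Relation.SUPPORTS))
```

Because links are frozen dataclasses stored in a set, recording the same relation twice by the same curator has no effect, while different curators can each assert it.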

Page 35: Strata 2012: Big Data and Bibliometrics
Page 36: Strata 2012: Big Data and Bibliometrics