
Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers

Kris Jack, PhD, Chief Data Scientist, @_krisjack

Outline

1.  What’s Mendeley?

2.  Under the Bonnet

3.  Opening up Data

4.  Working with Academia

5.  Conclusions

What's Mendeley?

Mendeley’s not just a reference manager

•  Mendeley is a platform that connects researchers, research data and apps

Mendeley Open API

research catalogue


Mendeley provides tools to help users...

...organise their research

•  Reference management
•  Cite-as-you-write
•  Full-text article search
•  Digitalised annotations

...collaborate with one another

•  Professional research groups
•  Social network
•  Annotation sharing

...discover new research

•  Explore crowdsourced research catalogue
•  Document statistics
•  Personalised article recommendations
•  Related research
•  Research contact suggestions

Our community from a data perspective

•  Social network (>2.4M users)
•  Research catalogue (~85M unique articles)
•  Research groups (~240K groups)
•  Personal libraries (>425M articles)

Logging a massive set of usage data

Under the Bonnet

Lots of features to build & support

features

Crowdsourcing (deduplication, metadata aggregation, statistics)
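
As a rough illustration of what crowdsourcing a catalogue can involve, the sketch below groups entries from users' personal libraries by a normalised title (deduplication), keeps the most common value of each field (metadata aggregation) and counts distinct readers (statistics). The field names, the title-normalisation rule and the sample entries are all hypothetical; the slides do not describe Mendeley's actual algorithms.

```python
import re
from collections import Counter, defaultdict


def dedup_key(title):
    """Normalise a title so trivially different copies share one key (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def most_common_value(values):
    """Return the value that occurs most often."""
    return Counter(values).most_common(1)[0][0]


def build_catalogue(library_entries):
    """Turn user-supplied library entries into deduplicated catalogue records."""
    # Deduplication: entries with the same normalised title form one group.
    groups = defaultdict(list)
    for entry in library_entries:
        groups[dedup_key(entry["title"])].append(entry)

    catalogue = []
    for entries in groups.values():
        catalogue.append({
            # Metadata aggregation: majority vote per field.
            "title": most_common_value(e["title"] for e in entries),
            "year": most_common_value(e["year"] for e in entries),
            # Statistics: number of distinct users holding the article.
            "readers": len({e["user"] for e in entries}),
        })
    return catalogue


# Made-up sample data for illustration only.
entries = [
    {"user": "u1", "title": "A Study of Widgets", "year": 2003},
    {"user": "u2", "title": "A study of widgets.", "year": 2003},
    {"user": "u3", "title": "A Study of Widgets", "year": 2002},
    {"user": "u4", "title": "Gadgets Revisited", "year": 2004},
]
catalogue = build_catalogue(entries)
```

Here the three "widgets" entries collapse into one catalogue record with three readers, and the outlier year is outvoted. A production system would need a far more robust matching key than a normalised title.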

The curse of success

•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
•  Algorithms relied on global counts
•  Iterating over MySQL tables was slow
•  Needed to shard tables to grow catalogue
•  In short, our backend system didn’t scale

Please try again later

~0.5 million users; ~30m research articles

The 20 largest user bases:

•  University of Cambridge
•  Stanford University
•  MIT
•  University of Michigan
•  Harvard University
•  University of Oxford
•  Sao Paulo University
•  Imperial College London
•  University of Edinburgh
•  Cornell University
•  University of California at Berkeley
•  RWTH Aachen
•  Columbia University
•  Georgia Tech
•  University of Wisconsin
•  UC San Diego
•  University of California at LA
•  University of Florida
•  University of North Carolina

The system started to become slow.

How long did it take to generate our daily readership statistics?

23 hours!

We had serious needs

•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
    •  Statistics
    •  Search
    •  Recommendations
    •  Sharing
•  Data
    •  Freshness
    •  Consistency
•  Business context
    •  Agile development (rapid prototyping)
    •  Cost effective
    •  Going viral
    •  Technical debt stacking up

Enter Hadoop

What is Hadoop?

The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing

www.hadoop.apache.org

Hadoop

•  Designed to operate on a cluster of computers
    •  From 1 to thousands of machines
    •  Commodity hardware (low-cost units)
•  Each node offers local computation and storage
•  Provides a framework for working with big data (beyond petabytes)
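
One programming model that Hadoop provides is MapReduce. The sketch below mimics its map, shuffle and reduce phases in plain Python to count readers per article, the kind of global count that daily readership statistics depend on. It is an illustration only: it does not use the Hadoop API, the record fields and sample data are hypothetical, and the slides do not show Mendeley's actual jobs.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (article_id, 1) pair for each library entry."""
    yield record["article_id"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(article_id, values):
    """Sum the emitted counts for a single article."""
    return article_id, sum(values)


def readership_counts(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up library records for illustration only.
library_records = [
    {"user": "u1", "article_id": "a1"},
    {"user": "u2", "article_id": "a1"},
    {"user": "u2", "article_id": "a2"},
]
counts = readership_counts(library_records)
```

Because each map call looks at one record and each reduce call looks at one key, the work can be spread across the nodes of a cluster rather than iterating over a single database table.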

New tech stack for backend

23-hour computations now took 15 minutes
