mendeley’s research catalogue: building it, opening it up and making it even more useful for...

Post on 05-Dec-2014

Category:

Technology

DESCRIPTION

Presentation given at Workshop on Academic-Industrial Collaborations for Recommender Systems 2013 (http://bit.ly/114XDsE), JCDL'13. A walk through Mendeley as a platform, growing pains involved with engineering at a large scale, the data that we're making publicly available and some demos that have come out of academic collaborations.

TRANSCRIPT

Mendeley's Research Catalogue: building it, opening it up and making it even more useful for researchers

Kris Jack, PhD Chief Data Scientist, @_krisjack

Outline

1.  What's Mendeley?

2.  Under the Bonnet

3.  Opening up Data

4.  Working with Academia

5.  Conclusions

What's Mendeley?

Mendeley's not just a reference manager

→ Mendeley is a platform that connects researchers, research data and apps

[Diagram labels: Mendeley Open API, research catalogue]

Mendeley provides tools to help users...

...organise their research
→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations

...collaborate with one another
→ Professional research groups
→ Social network
→ Annotation sharing

...discover new research
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions

Our community from a data perspective

Social network (>2.4M users)

Research catalogue (~85M unique articles)

Research groups (~240K groups)

Personal libraries (>425M articles)

Logging a massive set of usage data

Under the Bonnet

Lots of features to build & support

→ Reference management
→ Cite-as-you-write
→ Full-text article search
→ Digitalised annotations
→ Professional research groups
→ Social network
→ Annotation sharing
→ Explore crowdsourced research catalogue
→ Document statistics
→ Personalised article recommendations
→ Related research
→ Research contact suggestions

[Diagram labels: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]
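The slides name deduplication, metadata aggregation and statistics as the crowdsourcing steps but do not say how they were implemented. Purely as an illustration, the sketch below groups library entries by a normalised title, keeps the most common value of each field, and counts entries per group. The schema, the `normalise_title` key and the majority-vote rule are all assumptions made for this example, not Mendeley's actual method.

```python
from collections import Counter, defaultdict
import re


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation/whitespace (hypothetical dedup key)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_catalogue(library_entries):
    """Turn entries from personal libraries into deduplicated catalogue records.

    library_entries: iterable of dicts with 'title' and 'year' keys
    (a made-up schema for this sketch, one entry per user library).
    """
    # Deduplication: entries sharing a normalised title form one cluster.
    clusters = defaultdict(list)
    for entry in library_entries:
        clusters[normalise_title(entry["title"])].append(entry)

    catalogue = []
    for entries in clusters.values():
        # Metadata aggregation: keep the most common value of each field.
        title = Counter(e["title"] for e in entries).most_common(1)[0][0]
        year = Counter(e["year"] for e in entries).most_common(1)[0][0]
        # Statistics: number of library entries that point at this record.
        catalogue.append({"title": title, "year": year, "readers": len(entries)})
    return catalogue


# Toy data, invented for the example.
entries = [
    {"title": "Scaling a Research Catalogue", "year": 2013},
    {"title": "Scaling a Research Catalogue", "year": 2013},
    {"title": "scaling a  research catalogue.", "year": 2012},
    {"title": "Another Toy Paper", "year": 2010},
]
catalogue = build_catalogue(entries)
```

Here three noisy copies collapse into one record with three readers, which is the kind of global count the later slides refer to.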

The curse of success

•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
    •  Algorithms relied on global counts
    •  Iterating over MySQL tables was slow
    •  Needed to shard tables to grow catalogue
•  In short, our backend system didn't scale

Please try again later

~0.5 million users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

~30m research articles

The system started to become slow.

How long did it take to generate our daily readership statistics?

23 hours!

We had serious needs

•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
    •  Statistics
    •  Search
    •  Recommendations
    •  Sharing
•  Data
    •  Freshness
    •  Consistency
•  Business context
    •  Agile development (rapid prototyping)
    •  Cost effective
    •  Going viral
    •  Technical debt stacking up

Enter Hadoop

What is Hadoop?

The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing

www.hadoop.apache.org

Hadoop

•  Designed to operate on a cluster of computers
    •  1…thousands
    •  Commodity hardware (low cost units)
•  Each node offers local computation and storage
•  Provides framework for working with big data (beyond petabytes)
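To make the programming model concrete, here is a minimal single-process sketch of a map/reduce-style readership count in plain Python. It does not use Hadoop; the function names, the `libraries` data and the idea that readership is "number of libraries containing an article" are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_library(user_id, article_ids):
    """Map step: emit (article_id, 1) once for each distinct article in one library."""
    return [(article_id, 1) for article_id in set(article_ids)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(article_id, ones):
    """Reduce step: sum the values emitted for one article."""
    return article_id, sum(ones)


# Toy input, invented for the example: user id -> articles in that user's library.
libraries = {
    "user_a": ["doc1", "doc2"],
    "user_b": ["doc2", "doc3", "doc2"],
    "user_c": ["doc2"],
}

mapped = chain.from_iterable(map_library(u, docs) for u, docs in libraries.items())
readership = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
```

Because each map call only sees one library and each reduce call only sees one article, both steps can in principle be spread across the nodes of a cluster, which is the property the bullets above describe.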

New tech stack for backend

[Diagram labels: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]

23 hr computations now took 15 minutes

recommended reading

Mendeley Suggest

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
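The idea of generating recommendations through matrix multiplication can be sketched in a few lines of plain Python. This is not Mahout's `RecommenderJob`, which is a distributed Hadoop job with its own similarity measures and options; the sketch assumes simple co-occurrence counts as the item-item similarity, and the function names and toy libraries are invented for the example.

```python
def cooccurrence_matrix(libraries, items):
    """Count, for every pair of distinct items, how many libraries contain both."""
    index = {item: i for i, item in enumerate(items)}
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for library in libraries.values():
        present = [index[item] for item in set(library)]
        for i in present:
            for j in present:
                if i != j:
                    matrix[i][j] += 1
    return matrix


def recommend(user, libraries, items, top_n=2):
    """Score items as (item-item matrix) x (user's preference vector)."""
    matrix = cooccurrence_matrix(libraries, items)
    owned = set(libraries[user])
    preferences = [1 if item in owned else 0 for item in items]
    # Matrix-vector product: one score per item.
    scores = [sum(row[j] * preferences[j] for j in range(len(items))) for row in matrix]
    # Keep only unseen items with a positive score, best first (ties broken by name).
    candidates = [(score, item) for score, item in zip(scores, items)
                  if item not in owned and score > 0]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in candidates[:top_n]]


# Toy data, invented for the example.
items = ["doc1", "doc2", "doc3", "doc4"]
libraries = {
    "alice": ["doc1", "doc2"],
    "bob": ["doc1", "doc2", "doc3"],
    "carol": ["doc2", "doc3"],
    "dave": ["doc1", "doc4"],
}
suggestions = recommend("alice", libraries, items)
```

The similarity matrix is built from items alone, so it can be computed once and reused for every user; only the final multiplication is per-user.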

Running on Amazon's Elastic Map Reduce

On demand use and easy to cost

Mahout's Performance

[Scatter plot. Vertical axis: Normalised Amazon Hours (0 to 7K). Horizontal axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Orig. item-based → 6.5K, 1.5
Cust. item-based → 2.4K, 1.5
Orig. user-based → 1K, 2.5
Cust. user-based → 0.3K, 2.5

Orig. item-based to cust. item-based: -4.1K (63%)
Cust. item-based to orig. user-based: -1.4K (58%), +1 (67%)
Orig. user-based to cust. user-based: -0.7K (70%)
Orig. item-based to cust. user-based: -6.2K (95%), +1 (67%)
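The percentages annotated on the chart follow directly from the plotted points. As a quick check, the sketch below recomputes them, assuming "K" means thousand and that each point is (normalised Amazon hours, good recommendations out of 10); the `change` helper is introduced here only for this calculation.

```python
def change(before, after):
    """Return (absolute change, percentage change relative to `before`, rounded)."""
    delta = after - before
    return delta, round(100 * abs(delta) / before)


# (normalised Amazon hours, good recommendations out of 10), as read off the slide.
orig_item = (6500, 1.5)
cust_item = (2400, 1.5)
orig_user = (1000, 2.5)
cust_user = (300, 2.5)

item_tuning = change(orig_item[0], cust_item[0])      # hours saved by customising item-based
switch_to_user = change(cust_item[0], orig_user[0])   # hours saved by moving to user-based
user_tuning = change(orig_user[0], cust_user[0])      # hours saved by customising user-based
overall_hours = change(orig_item[0], cust_user[0])    # total hours saved
overall_quality = change(orig_item[1], cust_user[1])  # gain in good recommendations
```

Taken together, the final configuration uses about 5% of the original compute while returning one more good recommendation out of ten.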

Disclaimer: these advantages have costs

•  Migrating to a new system (data consistency)
•  Setup costs
    •  Learn black magic to configure
    •  Hardware for cluster
•  Administrative costs
    •  Steep learning curve to administer Hadoop
    •  Still an immature technology
    •  You may need to debug the source code
•  Developing against Mahout
    •  Still needs lots of love

Big data backend

[Diagram labels: features; Research catalogue (~30M unique articles); Personal libraries (>100M articles); Crowdsourcing (deduplication, metadata aggregation, statistics)]

Opening up Data

Our community from a data perspective

Social network (>2.4M users)

Research catalogue (~85M unique articles)

Research groups (~240K groups)

Personal libraries (>425M articles)

Logging a massive set of usage data

PLoS/Mendeley's Binary Battle

Challenge: Build an application with our data and make science more open.

More details at http://dev.mendeley.com/api-binary-battle/

ScienceRec Challenge 2012

Challenge: Build an off-line system for scientific recommendations with our API and the DataTEL data set

More details at http://2012.recsyschallenge.com/tracks/sciencerec/

The Next Challenge…?

Challenge: Metadata Extraction Challenge

Working with Academia

We have a history of academic collaborations

Duration    Project
2009-2011   MAKIN'IT
2010-2014   TEAM
2010-2011   DURA
2012-2012   CSL Editor
2012-2014   CODE
2012-2014   ERASM
2013-2015   EEXCESS

Demo

CSL Editor http://editor.citationstyles.org/

Demo

CODE Mendeley Desktop http://code-research.eu/results

Demo

Mendeley Labs http://labs.mendeley.com/

Want to collaborate?

Conclusions

→ Mendeley is far more than a reference manager – it's a platform that connects researchers, data and apps
→ Starting small is good, but be prepared for the cost of scaling up
→ We're opening up our data for you to build apps on our platform
→ We're always keen to collaborate with academic groups

Kris Jack, PhD Chief Data Scientist, @_krisjack
