Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers

Kris Jack, PhD, Chief Data Scientist, @_krisjack

Upload: kris-jack

Post on 05-Dec-2014


DESCRIPTION

Presentation given at Workshop on Academic-Industrial Collaborations for Recommender Systems 2013 (http://bit.ly/114XDsE), JCDL'13. A walk through Mendeley as a platform, growing pains involved with engineering at a large scale, the data that we're making publicly available and some demos that have come out of academic collaborations.

TRANSCRIPT

Page 1: Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers

Mendeley’s Research Catalogue: building it, opening it up and making it even more useful for researchers

Kris Jack, PhD Chief Data Scientist, @_krisjack

Page 2:

Outline

1.  What’s Mendeley?

2.  Under the Bonnet

3.  Opening up Data

4.  Working with Academia

5.  Conclusions

Page 3:

What's Mendeley?

Page 4:

Mendeley’s not just a reference manager

Page 5:

→  Mendeley is a platform that connects researchers, research data and apps

Mendeley Open API

Page 6:

→  Mendeley is a platform that connects researchers, research data and apps

Mendeley Open API

research catalogue

Page 7:

Mendeley provides tools to help users...

...organise their research

→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations

Page 8:

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

→  Professional research groups
→  Social network
→  Annotation sharing

Page 9:

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

...discover new research

→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions

Pages 10-11:

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

...discover new research

Page 12:

Our community from a data perspective

Social network (>2.4M users)

Research catalogue (~85M unique articles)

Research groups (~240K groups)

Personal libraries (>425M articles)

Logging a massive set of usage data

Page 13:

Under the Bonnet

Pages 14-16:

Lots of features to build & support

→  Reference management
→  Cite-as-you-write
→  Full-text article search
→  Digitalised annotations
→  Professional research groups
→  Social network
→  Annotation sharing
→  Explore crowdsourced research catalogue
→  Document statistics
→  Personalised article recommendations
→  Related research
→  Research contact suggestions

Page 17:

Lots of features to build & support

features

Page 18:

Lots of features to build & support

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Page 19:

Lots of features to build & support

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Crowdsourcing (deduplication, metadata aggregation, statistics)
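
As a rough illustration of what crowdsourcing the catalogue involves, the sketch below groups entries from personal libraries that look like the same article, takes a majority vote per metadata field and counts distinct readers. The matching rule (normalised title plus year), the field names and the sample entries are all assumptions made for illustration; the slides do not describe Mendeley's actual deduplication logic.

```python
from collections import Counter, defaultdict
import re


def dedup_key(entry):
    """Build a crude matching key from a library entry (hypothetical rule)."""
    title = re.sub(r"[^a-z0-9]+", " ", entry["title"].lower()).strip()
    return (title, entry.get("year"))


def build_catalogue(library_entries):
    """Group user entries by key, then aggregate metadata by majority vote."""
    groups = defaultdict(list)
    for entry in library_entries:
        groups[dedup_key(entry)].append(entry)

    catalogue = []
    for entries in groups.values():
        record = {}
        for field in ("title", "year", "venue"):
            values = [e[field] for e in entries if e.get(field)]
            if values:
                # Most common value wins; ties go to the first one seen.
                record[field] = Counter(values).most_common(1)[0][0]
        # Readership statistic: number of distinct users holding the article.
        record["readers"] = len({e["user"] for e in entries})
        catalogue.append(record)
    return catalogue


# Made-up sample data: three users hold slightly different copies of one paper.
entries = [
    {"user": "u1", "title": "Latent Dirichlet Allocation", "year": 2003, "venue": "JMLR"},
    {"user": "u2", "title": "Latent dirichlet allocation.", "year": 2003, "venue": "JMLR"},
    {"user": "u3", "title": "Latent Dirichlet Allocation", "year": 2003, "venue": "J. Mach. Learn. Res."},
    {"user": "u1", "title": "Item-based collaborative filtering", "year": 2001, "venue": "WWW"},
]
catalogue = build_catalogue(entries)
```

Here the three noisy copies collapse into one catalogue record with three readers, which is the same shape of reduction as going from personal-library entries to unique catalogue articles.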

Page 20:
Page 21:
Page 22:
Page 23:
Page 24:

The curse of success

•  More articles came
•  More users came
•  Keeping catalogue data fresh was a burden
    •  Algorithms relied on global counts
    •  Iterating over MySQL tables was slow
    •  Needed to shard tables to grow catalogue
•  In short, our backend system didn’t scale

Page 25:

Please try again later

Page 26:

~0.5 million users; the 20 largest user bases:

University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina

~30m research articles

Page 27:

The system started to become slow.

How long did it take to generate our daily readership statistics?

Page 28:

How long did it take to generate our daily readership statistics?

23 hours!

Page 29:

We had serious needs

•  Build a catalogue based on billions of articles
•  Support many features that rely on the catalogue
    •  Statistics
    •  Search
    •  Recommendations
    •  Sharing
•  Data
    •  Freshness
    •  Consistency
•  Business context
    •  Agile development (rapid prototyping)
    •  Cost effective
    •  Going viral
    •  Technical debt stacking up

Page 30:

Enter Hadoop

What is Hadoop?

The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing

www.hadoop.apache.org

Page 31:

Hadoop

•  Designed to operate on a cluster of computers
    •  1…thousands
    •  Commodity hardware (low cost units)
•  Each node offers local computation and storage
•  Provides framework for working with big data (beyond petabytes)
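
The programming model Hadoop provides for this kind of cluster is MapReduce. A minimal single-process sketch of the idea in plain Python follows, using reader counts per article as a stand-in for readership statistics. The function names and sample libraries are hypothetical; on a real cluster the map and reduce calls run in parallel on the nodes that hold the data, and the framework performs the grouping step.

```python
from collections import defaultdict


def map_phase(library):
    """Emit (article_id, 1) for every article in one user's library."""
    for article_id in library["articles"]:
        yield article_id, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(article_id, counts):
    """Sum the partial counts for one article."""
    return article_id, sum(counts)


# Made-up input: each record is one user's personal library.
libraries = [
    {"user": "u1", "articles": ["a1", "a2"]},
    {"user": "u2", "articles": ["a1"]},
    {"user": "u3", "articles": ["a1", "a3"]},
]

pairs = (pair for library in libraries for pair in map_phase(library))
readership = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call only looks at one library and each reduce call only looks at one article, the work splits cleanly across machines, unlike a job that iterates over a whole table on a single database server.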

Page 32:

New tech stack for backend

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Crowdsourcing (deduplication, metadata aggregation, statistics)

Page 33:

New tech stack for backend

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Crowdsourcing (deduplication, metadata aggregation, statistics)

23 hr computations now took 15 minutes

Page 34:

New tech stack for backend

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Crowdsourcing (deduplication, metadata aggregation, statistics)

recommended reading

Page 35:

Mendeley Suggest

Page 36:
Page 37:

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
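
One common way to read "recommendations through matrix multiplication" is: build an item-by-item similarity matrix from users' libraries, then multiply it by a user's own item vector to score candidate articles. The sketch below does this in plain Python on made-up data. It is not Mahout's implementation, and the raw co-occurrence count used as the similarity, along with all names and sample libraries, is an assumption for illustration.

```python
def cooccurrence_matrix(libraries, items):
    """Count how often each pair of items appears in the same library."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for library in libraries:
        for a in library:
            for b in library:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


def recommend(matrix, items, library, top_n=2):
    """Score items as matrix x user vector, dropping items already owned."""
    user_vector = [1 if item in library else 0 for item in items]
    scores = [sum(m * u for m, u in zip(row, user_vector)) for row in matrix]
    ranked = sorted(
        (item for item in items if item not in library),
        key=lambda item: scores[items.index(item)],
        reverse=True,
    )
    return ranked[:top_n]


# Made-up data: four articles and four users' libraries.
items = ["a1", "a2", "a3", "a4"]
libraries = [{"a1", "a2"}, {"a1", "a2", "a3"}, {"a2", "a3"}, {"a3", "a4"}]
matrix = cooccurrence_matrix(libraries, items)
suggestions = recommend(matrix, items, {"a1"})
```

A user who only holds a1 is offered a2 first and a3 second, because a2 appears alongside a1 in two libraries and a3 in one. The similarity lives between items, so the matrix can be computed once and reused for every user.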

Page 38:

Running on Amazon's Elastic MapReduce

On-demand use and easy to cost

Pages 39-47:

Mahout's Performance

[Chart built up over nine slides. Vertical axis: Normalised Amazon Hours (0 to 7K). Horizontal axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]

Recommender         Normalised Amazon Hours   No. Good Recommendations/10
Orig. item-based    6.5K                      1.5
Cust. item-based    2.4K                      1.5
Orig. user-based    1K                        2.5
Cust. user-based    0.3K                      2.5

Annotated changes:

•  Orig. item-based to cust. item-based: -4.1K hours (63%)
•  Cust. item-based to orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
•  Orig. user-based to cust. user-based: -0.7K hours (70%)
•  Orig. item-based to cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)
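
The differences annotated on the Mahout's Performance chart follow directly from the four plotted points. The short check below recomputes them; the dictionary keys and the helper name are made up for illustration, and "K" is read as thousands of normalised Amazon hours.

```python
# (normalised Amazon hours, good recommendations out of 10), read off the chart
results = {
    "orig_item": (6500, 1.5),
    "cust_item": (2400, 1.5),
    "orig_user": (1000, 2.5),
    "cust_user": (300, 2.5),
}


def change(before, after):
    """Return (hours saved, % hours saved, recs gained, % recs gained)."""
    hours_before, recs_before = results[before]
    hours_after, recs_after = results[after]
    saved = hours_before - hours_after
    gained = recs_after - recs_before
    return (
        saved,
        round(100 * saved / hours_before),
        gained,
        round(100 * gained / recs_before),
    )


item_tuning = change("orig_item", "cust_item")  # customising the item-based job
overall = change("orig_item", "cust_user")      # first configuration to last
```

`overall` comes out as 6200 hours saved (95%) and one extra good recommendation in ten (67%), matching the figures on the final build of the chart.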

Page 48:

Disclaimer: these advantages have costs

•  Migrating to a new system (data consistency)
•  Setup costs
    •  Learn black magic to configure
    •  Hardware for cluster
•  Administrative costs
    •  Steep learning curve to administer Hadoop
    •  Still an immature technology
    •  You may need to debug the source code
•  Developing against Mahout
    •  Still needs lots of love

Page 49:

Big data backend

features

Research catalogue (~30M unique articles)

Personal libraries (>100M articles)

Crowdsourcing (deduplication, metadata aggregation, statistics)

Page 50:

Opening up Data

Page 51:

Our community from a data perspective

Social network (>2.4M users)

Research catalogue (~85M unique articles)

Research groups (~240K groups)

Personal libraries (>425M articles)

Logging a massive set of usage data

Page 52:
Page 53:
Page 54:
Page 55:

PLoS/Mendeley's Binary Battle

Challenge: Build an application with our data, make science more open.

More details at http://dev.mendeley.com/api-binary-battle/

Page 56:
Pages 57-58:

ScienceRec Challenge 2012

Challenge: Build an off-line system for scientific recommendations with our API and the DataTEL data set

More details at http://2012.recsyschallenge.com/tracks/sciencerec/

Page 59:

The Next Challenge…?

Challenge: Metadata Extraction Challenge

Page 60:

Working with Academia

Page 61:

We have a history of academic collaborations

Duration     Project
2009-2011    MAKIN’IT
2010-2014    TEAM
2010-2011    DURA
2012-2012    CSL Editor
2012-2014    CODE
2012-2014    ERASM
2013-2015    EEXCESS

Page 62:

Demo

CSL Editor http://editor.citationstyles.org/

Page 63:

Demo

CODE Mendeley Desktop http://code-research.eu/results

Page 64:

Demo

Mendeley Labs http://labs.mendeley.com/

Page 65:

We have a history of academic collaborations

Want to collaborate?

Page 66:

Conclusions

Page 67:

Conclusions

→  Mendeley is far more than a reference manager – it’s a platform that connects researchers, data and apps
→  Starting small is good, but be prepared for the cost of scaling up
→  We’re opening up our data for you to build apps on our platform
→  We’re always keen to collaborate with academic groups

Kris Jack, PhD Chief Data Scientist, @_krisjack