The Role of “Big Data” in Scientific Publishing

The Role of “Big Data” in Scientific Publishing. Bradley P. Allen, Chief Architect, Elsevier. Presentation for panel on “Giving Voice to Content: Emerging Technologies”, NFAIS 56th Annual Conference, Philadelphia, PA, USA, 2014-02-24.



TRANSCRIPT

Page 1: The Role of “Big Data” in Scientific Publishing

The Role of “Big Data” in Scientific Publishing

Bradley P. Allen
Chief Architect, Elsevier
Presentation for panel on “Giving Voice to Content: Emerging Technologies”
NFAIS 56th Annual Conference
Philadelphia, PA, USA
2014-02-24

Page 2: The Role of “Big Data” in Scientific Publishing

Why the scare quotes?

Reference: http://ajharmony.tumblr.com/post/65901268958/mostlysignssomeportents-big-data-is-like, from a quote by Dan Ariely in https://www.facebook.com/dan.ariely/posts/904383595868

Page 3: The Role of “Big Data” in Scientific Publishing

Audience poll: current data scales

How large is the amount of data your organization currently manages to produce its online products and services?
1. Gigabytes
2. Terabytes
3. Petabytes
4. Exabytes

Page 4: The Role of “Big Data” in Scientific Publishing


Scientific content in the context of big data

Page 5: The Role of “Big Data” in Scientific Publishing

What does big data mean to scientific publishing?

• Scientific publishing is the act of compressing a universe’s worth of data into small pieces of content that people can consume
• In essence, this is the ultimate big data problem
• But it is one in which until recently publishers have played a very simple role
• That is beginning to change

Page 6: The Role of “Big Data” in Scientific Publishing

What are we beginning to do with big data?

• Create more useful content by enhancing it with data extracted from content
• Make the researcher’s life better by exploiting data about how content is used to improve her experience of using our online applications
• Enable research itself by supporting the care and feeding of experimental data at scale

Page 7: The Role of “Big Data” in Scientific Publishing

Audience poll: big data use cases

Which of these uses of big data is most important for your organization?
1. Extracting data from content
2. Improving user experience through usage analytics
3. Managing experimental data
4. All of the above
5. None of the above

Page 8: The Role of “Big Data” in Scientific Publishing

Sources of data in scientific publishing

Type of data: Data extracted from content
Inputs: XML; long-form free text; short-form free text; tables; images; video; audio
Outputs: asset metadata; citations; classifications; clusters; entities; relations; language models; probabilistic graphical models
Benefits: advances scientific understanding; provides publishers with raw material for linking content with task-specific solutions

Type of data: Data about how content is used
Inputs: article views; search queries; user behavior; social media streams
Outputs: article-level metrics; sentiment analysis; ranking and impact metrics; user interest profiles; collaborative filtering models
Benefits: provides the researcher insight about her career; provides institutions data about their performance and impact; provides publishers with data for optimizing content delivery

Type of data: Experimental data
Inputs: sensor and instrumentation feeds; crowdsourced data (e.g. user surveys)
Outputs: data records; curated datasets
Benefits: provides input to research analytics; provides archival management of research data assets

Page 9: The Role of “Big Data” in Scientific Publishing

Example: collaborative filtering in ScienceDirect

• When users look at articles on ScienceDirect (SD), they are provided links to other articles of interest
• Related Articles was originally implemented using bag-of-words similarity via a search engine query
• Goal: increase click-through rate on Recommended Articles over the previous Related Articles offering; drive usage, engagement & revenue
• Pilot: ran from March to July 2013, with 9 variants A/B tested on ~5% of SD traffic
• Production: since Aug 2013

Inputs
• 5 years of SD usage data/events
• All SD XML articles
• SNIP2 journal rankings

[Diagram labels: Thor; Roxie; co-download matrix; similarity; attribute ranking; 6 billion events; ~12M articles; daily updates; pii-739156; pii-684259, pii_585346, pii_491635]
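To make the co-download idea on this slide concrete, here is a minimal Python sketch. It is an illustration only, not the ScienceDirect pipeline, which the slide attributes to Thor and Roxie. It assumes that usage events have already been grouped into per-user download lists and that similarity is the cosine of co-download counts. The slide names a "co-download matrix" and "similarity" but does not say which measure was used, and the attribute-ranking step is left out. The function names and the article ids `A` to `D` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def co_download_counts(sessions):
    """Count downloads per article and co-downloads per article pair."""
    downloads = defaultdict(int)
    pairs = defaultdict(int)
    for articles in sessions:
        unique = sorted(set(articles))  # ignore repeat downloads in a session
        for a in unique:
            downloads[a] += 1
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1  # sorted order keeps one key per pair
    return downloads, pairs


def recommend(article, sessions, top_n=3):
    """Rank other articles by cosine similarity of co-download counts."""
    downloads, pairs = co_download_counts(sessions)
    scores = {}
    for (a, b), together in pairs.items():
        if article in (a, b):
            other = b if a == article else a
            # Assumed measure: co-downloads normalised by both popularities.
            scores[other] = together / sqrt(downloads[a] * downloads[b])
    # Highest score first; ties broken by article id for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _score in ranked[:top_n]]


# Hypothetical sessions: each list is one user's downloads.
sessions = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["B", "C"]]
recommendations = recommend("A", sessions)
```

Normalising by popularity keeps a heavily downloaded article from dominating every recommendation list, which is one common reason to prefer cosine similarity over raw co-download counts.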

Page 10: The Role of “Big Data” in Scientific Publishing

Audience poll: big data tools and platforms

Which big data tools/platforms are you currently using?
1. Apache Hadoop
2. A Hadoop distribution (Cloudera, MapR, Amazon EMR, …)
3. LexisNexis HPCC
4. Twitter Storm
5. Rolling our own
6. None of the above

Page 11: The Role of “Big Data” in Scientific Publishing

How big data infrastructure works

• All of these tools and platforms basically make the following easy to do:
– Break data up into many chunks, each of which can fit into memory on a given machine
– Send each chunk to a machine where it is processed into chunks containing intermediate results
– Combine the intermediate results into a single aggregate data set
– Lather, rinse, repeat…
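The chunk, process, combine pattern described on this slide can be sketched on a single machine in a few lines of Python. This is a toy illustration under stated assumptions, not how Hadoop, HPCC or Storm are implemented: the "machines" are ordinary function calls, the task (counting views per article) is chosen for the example, and the names `chunked`, `map_chunk`, `combine` and `count_views` are hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def chunked(items, size):
    """Break the data into successive chunks of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def map_chunk(events):
    """Process one chunk into an intermediate result: views per article."""
    return Counter(article_id for _user, article_id in events)


def combine(left, right):
    """Merge two intermediate results into one."""
    return left + right


def count_views(events, chunk_size=2):
    """Chunk the events, process each chunk, then combine the results."""
    partials = (map_chunk(chunk) for chunk in chunked(events, chunk_size))
    return reduce(combine, partials, Counter())


# Hypothetical (user, article) view events.
events = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "a"), ("u3", "c")]
totals = count_views(events)
```

Because `combine` is associative, the intermediate results can be merged in any grouping, which is what lets a real platform spread the work across many machines and repeat the cycle on the aggregated output.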

Page 12: The Role of “Big Data” in Scientific Publishing

Big data technologies within Elsevier

Type of processing: Batch
Timeframe: Minutes to hours
Data volume: TBs to PBs
Key platforms: HPCC Thor, Hadoop
Projects/products: SciVal Spotlight, Scopus author profile deduplication, ScienceDirect related articles recommendation

Type of processing: Stream
Timeframe: Neverending and continuous
Data volume: Unbounded
Key platforms: HPCC Roxie, Twitter Storm
Projects/products: Internal content analytics and text mining tools

Type of processing: Ad-hoc query
Timeframe: Milliseconds to minutes
Data volume: GBs to PBs
Key platforms: HPCC Roxie
Projects/products: ScienceDirect usage analytics

Page 13: The Role of “Big Data” in Scientific Publishing

Big data technology issues (in no particular order)

• Talent acquisition
– What training is needed to make big data platforms usable by our existing teams?
– Who/what is a data scientist?
• Best practices and design patterns for big data
– @nathanmarz’ Lambda Architecture
• The proliferation of big data platforms
– HPCC, MapR, Cloudera…
• Cloud-based vs. hosted solutions
– Amazon Elastic MapReduce, Redshift
• Data formats and practice for scaling ETL/ELT
– Apache Avro, Google Protocol Buffers, zlib-compressed JSON
• Numerical computing frameworks for optimization
– High-performance computing using GPUs
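Of the data formats listed on this slide, zlib-compressed JSON is the simplest to show. The following Python sketch uses only the standard library and assumes one JSON object per record; the `pack` and `unpack` helpers and the event fields are hypothetical, not a format the presentation specifies.

```python
import json
import zlib


def pack(record):
    """Serialise a record to compact JSON and compress it with zlib."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))


def unpack(blob):
    """Decompress a zlib blob and parse the JSON inside it."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical usage event.
event = {"article": "example-id", "action": "download", "count": 1}
blob = pack(event)
restored = unpack(blob)
```

Unlike Avro or Protocol Buffers, this approach carries no schema, so producers and consumers have to agree on field names and types by convention.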

Page 14: The Role of “Big Data” in Scientific Publishing

Can we use big data to enable new business models?

• These technologies can yield a wealth of infrastructure, tools, workflows and business models to clone and adapt to the special circumstances of scientific publishing
• Big data can open the door to optimizing the value exchange between author, publisher and reader
• This will require us to walk away from legacy preconceptions
– Ask yourself: is it this way because it was done on paper?
• A thought experiment: gold open access as computational advertising

Page 15: The Role of “Big Data” in Scientific Publishing


Big data is key to computational advertising

Reference: S. Yuan, A.Z. Abidin, M. Sloan and J. Wang. Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users. arXiv:1206.1754v1 [cs.IR] 8 Jun 2012.

Page 16: The Role of “Big Data” in Scientific Publishing

Can big data enable computational publishing?

[Diagram labels: Authors; Researchers; Publishers; Article exchanges; knowledge; article inventories; credit; time & focus; $$$; $$; ($)]

The simplified ecosystem of author-pays scientific publishing. Authors spend budget to buy article inventories from article exchanges and publishers; article exchanges serve as matchers for articles and journals; publishers provide valuable information to satisfy and keep researchers; researchers read articles and exchange credit for knowledge from the authors. Note that normally researchers would not receive cash from publishers.

Page 17: The Role of “Big Data” in Scientific Publishing

Summary

• Big data can play a role in creating new value for researchers and institutions
• Ways in which big data is currently exploited in the consumer Internet provide guidance for its use by scientific publishers