research library, los alamos national laboratory @ march, 2008, u. indiana, school of informatics...

46
Quick TIFF are ne Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team QuickTime™ TIFF (Uncompr are needed t Usage-based models of science: applications to community mapping and scholarly assessment Johan Bollen Digital Library Research & Prototyping Team Los Alamos National Laboratory - Research Library jbollen@lanl . gov Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL) Research supported by the Andrew W. Mellon Foundation.

Upload: mitchell-haynes

Post on 02-Jan-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage-based models of science:

applications to community mapping and scholarly assessment

Johan BollenDigital Library Research & Prototyping Team

Los Alamos National Laboratory - Research Library

[email protected]

Acknowledgements:

Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL),

Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)

Research supported by the Andrew W. Mellon Foundation.

Page 2: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Scholarly evaluation

• Qualitative/subjective:o Peer reviewo Tenure committeeso Networks

• Quantitative:o Citation countso Many proposals, little clarity

Page 3: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Impact evaluation from citation data.

IF is part of Journal Citation Reports (JCR)

JCR = citation graph

2005 journal citation network• +- 8,560 journals• 6,370,234 weighted citation edges

Impact Factor = mean 2 year citation rate

Lowinfluence

Highinfluence

20032001 2002

Journal x All (2003)

Citation data:• Golden standard of scholarly evaluation• Citation = scholarly influences.• Extracted from published materials.• Main bibliometric data source for scholarly

evaluation.

Page 4: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Citation graph: other metrics?

Possibility to calculate other metrics of impact.

How about PageRank?

• IF = normalized indegree in citation grapho Popularityo Favors review journals

• Are all citations created equal?o Transfer citer influence to citeeo Normalize edge weight o Modulate transfer for citer influenceo Indicator of node prestige

Pinski, G., & Narin, F. (1976). Citation influence for journal aggregates of scientific publications: theory, with application to the literature of physics. Information processing and management, 12(5), 297-312.Chen, P., Xie, H., Maslov, S., & Redner, S. (2007). Finding scientific gems with Google. Journal of Informetrics, 1(1), arxiv.org/abs/physics/0604130.

Page 5: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Popularity vs. prestige

rho=0.61

Outliers reveal differences in aspects of “status”

IF ~ general popularity

PR ~ prestige, influence

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (DOI: 10.1007/s11192-006-0176-z)

Philip Ball. Prestige is factored into journal ratings. Nature 439, 770-771, February 2006 (doi:10.1038/439770a)

Page 6: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Domain specific

Page 7: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

A few issues…

On the basis of citation graph, many alternatives:• Normalized citation statistics• Social network indicators (centrality)• Semantic web hybrids• Which one to choose?

Semantics:• Validity: does indicator express what it is intended to

express?• Reliability: sensitivity to changes in network structure?

Another question: need they be based on citation data?• Other means of expressing “influence”• Readership, services requested, etc.• Usage!

Lowinfluence

Highinfluence

Page 8: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage data

Citation data pertain to 4 levels in the scholarly communication process:

• Community: authors of journal articles.• Artifacts: journal articles.• Data: citation data (+1 year publication delay).• Metrics: mean citation rate rules supreme.• Scale: expensive to extract.

Hence, various initiatives focused on usage data: COUNTER, IRS, SUSHI, CiteBase.But where are the metrics?

However, for usage data:• Community: all users including most authors.• Artifacts: all that is accessible.• Data: recorded upon publication.• Metrics: a range of web and web2.0 inspired metrics, e.g. clickstream and datamining.• Scale: automatically recorded at point of service.

Page 9: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Challenges to usage-based metrics.

Usage data is here:• Routinely recorded by library, publisher and aggregators

services• Large-scale and longitudinal• Highly detailed

o Requestero Referento Sessionso Service type

Usage-based metrics have lagged development. Here’s why:• Multiple communities• Multiple collection (artifacts)• Data: usage data limited to particular sub-communities

and collections of artifacts.• Metrics: various metrics studied. Different results

because of sample, collection or metric definition? Aspects of scholarly status?

Page 10: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Our experience: divergence and convergence.

CSU Usage PR IF (2003) Title (abbv.)

1 78.565 21.455 JAMA-J AM MED ASSOC

2 71.414 29.781 SCIENCE

3 60.373 30.979 NATURE

4 40.828 3.779 J AM ACAD CHILD PSY

5 39.708 7.157 AM J PSYCHIAT

LANL Usage PR IF (2003) Title (abbv.)

1 60.196 7.035 PHYS REV LETT

2 37.568 2.950 J CHEM PHYS

3 34.618 1.179 J NUCL MATER

4 31.132 2.202 PHYS REV E

5 30.441 2.171 J APPL PHYS

MSR Usage PR IF (2005) Title (abbv.)

1 15.830 30.927 SCIENCE

2 15.167 29.273 NATURE

3 12.798 10.231 PNAS

4 10.131 0.402 LECT NOTES COMP SCI

5 8.409 5.854 J BIOL CHEM

Convergence!• Is this guaranteed?• To what? A common-baseline?• What we do know:

o Institutional perspective can be contrasted to baseline.

o As aggregation increases in size, so does value.

o Cross-validation is key.

Page 11: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

MESUR1: Metrics from Scholarly Usage of Resources.

1. Pronounced “measure”

Andrew W. Mellon Foundation funded study of usage-based metrics (2006-2008)

Executed at the Digital Library Research and Prototyping team, Los Alamos National Laboratory Research Library

Objectives:1. Create a model of the scholarly

communication process.2. Create a large-scale reference data

set (semantic network) that relates all relevant bibliographic, citation and usage data according to (1).

3. Characterize reference data set.4. Survey usage-based metrics on

basis of reference data set.

Page 12: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The MESUR project.

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture. QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Johan Bollen (LANL): Principal investigatorHerbert Van de Sompel (LANL): Architectural consultantMarko Rodriguez (LANL): PhD student (Computer Science, UCSC)Ryan Chute (LANL): Software development and database managementLyudmila Balakireva (LANL): Database management and HCIAric Hagberg (LANL): Mathematical and statistical consultantLuis Bettencourt (LANL): Mathematical and statistical consultant

“The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data.”

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Page 13: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Project data flow and work plan.

1

2 3 4

Page 14: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Project timeline.

We are here!

Page 15: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

12

3

4

Page 16: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

Page 17: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

A tapestry of usage data providers:

Main players:• Individual institutions

o Link resolver datao EZ proxy

• Aggregatorso Ad hoc formatso COUNTER reports

• Publisherso Ad hoco COUNTER reports

Each represent different, and possibly overlapping, samples of the scholarly community.

Institutions:• Institutional communities• Many collectionsAggregators:• Many communities• Many collectionsPublishers:• Many communities• Publisher collection

Page 18: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Negotiation resultsType Asked Positive Negative % pos

Publishers 14 8 6 57%

Aggregators 3 3 0 100%

Institutions 4 4 0 100%

Totals 21 14 7

• Data: > 1B usage events and 1B citationso At this point, 247,083,481 usage events loadedo Another +1,000,000,000 on the way

• Documents: > 50M documents• Journals: 326,000

o Includes newspapers, magazineso Professional magazineso Obscure material

• Community: > 100M users and authors combined

Page 19: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Data acquired: timelines

Span:• Majority: -1 years• Some minor 2002-2003 data

Sharing models:• Historical data for period [t-x,t]• Periodical updates

Main issues:• Restoration of archives• Digital preservation issues

o All data fields intacto Integration of various sources of usage data

Page 20: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Data flow

http://www.mesur.org/schemas/2007- 01/mesur/

Page 21: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

An ontology of the scholarly communication process?

document4

person3

uses

document2cites

document3

publishes

publishes

isA

journal1

person2authors

person1authors

document1cites

person4

person5

uses

uses

instances

ontology

journal

document

publishes

documentcites

personauthors

uses

Usage data

Bibliographicdata

Citation data

Page 22: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Modeling the scholarly communication process:the MESUR ontology.

Basic concepts:1) OWL RDF/XML representation2) Three basic notions: Documents,

Agents and Contexts3) Context: n-ary relationship between

documents and agent.4) Subclassed to Events and States to

express action, e.g. “Uses” vs. continuous state, e.g. “hasImpact”

Previous efforts: The ScholOnto project, ABC ontology, VA. Tech: Goncalves (2002), Web Scholars, etc.

Requirements:- Combined representation of usage data with bibliographic and citation data.- Fine granularity- Pragmatism in modeling

Page 23: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Modeling the scholarly communication process:the MESUR ontology.

Based on OntologyX5 frameworkdeveloped by Rights.com

Examples:An author (Agent) publishes

(Context:Event) an article (Document)

A user (Agent) uses (Context:Event) a journal (Document)

Rodriguez, Bollen & Van de Sompel. A Practical Ontology for the Large-Scale

Modeling of Scholarly Artifacts and their Usage. JCDL07

http://www.mesur.org/schemas/2007-01/mesur/

Page 24: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

Page 25: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

MESUR’s usage data representation framework.

Assumptions:1. Sessions identify sequences of events

(same user - same document)2. Documents tied to aggregate request

objects3. Request objects consist of series of

service requests and the date and time at which request took place.

Implications:1. Sequence preserved.2. Most usage data and statistics can be

reconstructed from framework3. Lends itself to XML and RDF format4. Permits request type filtering

Example:1. COUNTER stats= aggregate request

counts for each document (journal) grouped by date/time (month)

2. Usage graph: overlay sessions with same document pairs

• Out of 13 MESUR providers so far, only 3 natively follow this model.• The usage data of another 8 contains the necessary information for conversion

Page 26: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Implications for structural analysis of usage data

• Sequence preservation allows:o Reconstruction of user behavioro Usage graphs!

• Statistics do not allow this type of analysis BUT are useful for:

o validating resultso rankings

Page 27: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

How to generate a usage graph.

Documents are associated by co-occurrence in same session

• Same session, same user: common interest

• Frequency of co-occurrence in same session: degree of relationship

• Normalized: conditional probability

Usage data:• Works for journals and articles• Anything for which usage was recorded

Options:• Strict pair-wise sequence?• All within session?• Take “distance” into account?

Note: not something we invented. Association rule learning in data mining. Beer and diapers!

Page 28: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage graphs

journal1journal2

MESUR graph created:• 200M usage events• Usage restricted to 2006• Journals clipped to 7600 2004

JCR journals• Pair-wise sequences

o Within session, only consecutive pairs

o Raw frequency weights

Network analysis now on-going

• Network properties

• Clustering

Page 29: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lay of the land: flow of information.

Page 30: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

Page 31: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Metric types

Note:• Metrics can be calculated

both on citation and usage data

• Structural metrics require graphs

o Citation graph, e.g. 2004 JCR

o Usage graph, e.g. created by MESUR

Page 32: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Frequentist metrics

Raw cites:• Count number of

citations to document or journal

• Count number of times document or journal was accessed

Normalized:

•Journal Impact factor:o Number of citations to journalo Divided by number of articles published in journal

•Usage Impact Factor

• Number of request for journal or article

• Divided by number of articles published in journal

Johan Bollen. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59(1), 2008

Page 33: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structural metrics calculated from usage graph

Classes of metrics:• Degree• Shortest path• Random walk• Distribution

Degree• In-degree• Out-degree

Shortest path• Closeness• Betweenness• Newman

Random walk• PageRank• Eigenvector

Distribution• In-degree entropy• Out-degree entropy• Bucket Entropy

Each can be defined to take into account weights by e.g. means of weighted shortest path definition

Page 34: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Social network metrics: different aspects of impact I

In-degree/IFDegree

centrality

Closenesscentrality

Betweennesscentrality

Shortest pathmetrics

Degreemetrics

Page 35: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Social network metrics: different aspects of impact II

Basic idea:- Random walkers follow edges- + Probability of random teleportation- Visitation numbers converge ~ PageRank

- “Stationary Probability distribution”

From wikipedia.org

Random walkMetrics, e.g. PageRank

Page 36: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

Page 37: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Set of metrics calculated on MESUR data set

List of metrics:JCR 2004 • CITE-BE• CITE-ID• CITE-IE• CITE-IF• CITE-OD• CITE-OE• CITE-PG• CITE-UBW• CITE-UBW-UN• CITE-UCL• CITE-UCL-UN• CITE-UNM• CITE-UNM-UN• CITE-UPG• CITE-UPR• CITE-WBW• CITE-WBW-UN• CITE-WCL• CITE-WCL-UN• CITE-WID• CITE-WNM• CITE-WNM-UN• CITE-WOD• CITE-WPR

Usage-based metrics:

MESUR 2006• USES-BE, • USES-ID• USES-IE• USES-OD• USES-OE• USES-PG• USES-UBW• USES-UBW-UN• USES-UCL• USES-UCL-UN• USES-UNM• USES-UNM-UN• USES-UPG• USES-UPR• USES-WBW• USES-WBW-UN• USES-WCL• USES-WCL-UN• USES-WID• USES-WNM• USES-WNM-UN• USES-WOD• USES-WPR

Usage graph creation: Wenzhong Zhao

Metrics: Marko Rodriguez and Aric Hagberg

Page 38: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Overlaps and discrepancies

Rankings and correlation structure will reveal components of notion of “scholarly impact” across citation and usage data.

Page 39: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Citation rankings

2004 Impact Factorvalue journal1 49.794 CANCER2 47.400 ANNU REV IMMUNOL3 44.016 NEW ENGL J MED4 33.456 ANNU REV BIOCHEM5 31.694 NAT REV CANCER

Citation Pagerankvalue journal1 0.0116 SCIENCE2 0.0111 J BIOL CHEM3 0.0108 NATURE4 0.0101 PNAS5 0.006 PHYS REV LETT

betweennessvalue journal1 0.076 PNAS2 0.072 SCIENCE3 0.059 NATURE4 0.039 LECT NOTES COMPUT SC5 0.017 LANCET

Closenessvalue journal1 7.02e-05 PNAS2 6.72e-05 LECT NOTES COMPUT SC3 6.43e-05 NATURE4 6.37e-05 SCIENCE5 6.37e-05 J BIOL CHEM

In-Degree value journal1 3448 SCIENCE2 3182 NATURE3 2913 PNAS4 2190 LANCET5 2160 NEW ENGL J MED

In-degree entropy Value journal1 9.849 LANCET2 9.748 SCIENCE3 9.701 NEW ENGL J MED4 9.611 NATURE5 9.526 JAMA

Page 40: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage rankings

2004 Impact Factorvalue journal1 49.794 CANCER2 47.400 ANNU REV IMMUNOL3 44.016 NEW ENGL J MED4 33.456 NNU REV BIOCHEM5 31.694 NAT REV CANCER

In-Degree value journal1 4195 SCIENCE2 4019 NATURE3 3562 PNAS4 2438 J BIOL CHEM5 2432 LECT NOTES COMPUT SC

In-degree entropy Value journal1 9.364 MED HYPOTHESES2 9.152 PNAS3 9.027 LIFE SCI4 8.939 LANCET5 8.858 INT J BIOCHEM CELL B

betweennessvalue journal1 0.035 SCIENCE2 0.032 NATURE3 0.020 PNAS4 0.017 LECT NOTES COMPUT SC5 0.006 LANCET

Closenessvalue journal1 0.670 SCIENCE2 0.665 NATURE3 0.644 PNAS4 0.591 LECT NOTES COMPUT SC5 0.587 BIOCHEM BIOPH RES CO

Pagerankvalue journal1 0.0016 SCIENCE2 0.0015 NATURE3 0.0013 PNAS4 0.0010 LECT NOTES COMPUT SC5 0.0008 J BIOL CHEM

Page 41: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Metrics relationship

Page 42: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Metrics relationships

• Citation and usage metrics reveal an entirely different pattern

o Citation is split in 2 section:- Degree metrics (right)

- Shortest path and random walk (left)

o Usage is split in 4 clusters:- Degree metrics

- PageRank and entropy

- Closeness

- Betweenness

• Usage pattern can be causedo Noise in usage grapho Higher density of usage/nodes

Page 43: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Hierarchical cluster analysis

CitationDegree

Citationcloseness

Citationbetweenness

ImpactFactor

CitationPageRank

Usagedegree

UsageCloseness

Usagebetweenness

UsagePageRank

Page 44: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Presentation structure:an update on the MESUR project

1) Usage data characterization2) Analysis

1) Usage graphs2) Metrics analysis

3) Results4) Discussion

Page 45: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

MESUR: an update

Usage data:• Creation of single largest reference data set of usage, citation and bibliographic data• +1,000,000,000 usage events loaded in next month• Usage data obtained from multiple publishers, aggregators and institutions• Infrastructure for a continued research program in this domain• Results will guide scholarly evaluation and may help produce standards for usage data representation

Usage graphs:• Adequate data model for item-level usage data naturally leads to this• Reduced distortion compared to raw usage: structure counts, not raw hits• Several options on how to create: MESUR investigates option

Metrics:• Frequentist and structural metrics• Each can represent different facets of scholarly impact• Simple metrics can produce adequate results. Law of diminishing returns?• Hybrid metrics based on triple store functionality• Note increasing convergence of usage-metrics to citation metrics as sample increases.

Reference data set will provide years of exciting research:• Let me know what you think.

Page 46: Research Library, Los Alamos National Laboratory @ March, 2008, U. Indiana, School of Informatics Digital Library Research & Prototyping Team Usage-based

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory@ March, 2008, U. Indiana, School of Informatics

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Some relevant publications.

Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008

Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007

Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)

Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.