SubSift web services and workflows for profiling and comparing scientists and their published works

Simon Price, Peter Flach, Sebastian Spiegler, Christopher Bailey and Nikki Rogers


Post on 11-Apr-2017


Outline of this paper

1. SubSift – submission sifting

2. Background Theory: Vector Space Model

3. SubSift REST API

4. Demonstration Workflows

5. Conclusions


1. SubSift – submission sifting


SubSift

SubSift is a prototype application to support academic peer review.

SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.

Website: http://subsift.ilrt.bris.ac.uk


SubSift has been used for...15


Contribution of this work

SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol

Re-usable workflows for profiling and comparing scientists and their published works.

Tool for constructing, manipulating and publishing document-centric datasets.

Related Work

• SubSift uses techniques more normally associated with Information Retrieval.
• Full text search tools support text matching on large-scale document collections, e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch. These are designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
  • Exposes detailed metadata throughout.
  • Partly a research tool: need to plug in + instrument new algorithms.
  • Fewer licensing restrictions and dependencies for open source.


2. Background Theory: Vector Space Model


Vector Space Model (from Information Retrieval)

Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting

For a query (q), rank the documents (dj) in collection (D) by descending similarity to the query.


Vector Space Model: bag-of-words representation

no. terms in each abstract

no. terms in DBLP author page of each PC member
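
As a minimal sketch of the bag-of-words representation, the snippet below reduces a text to a mapping from terms to counts, discarding word order. The tokenizer (lowercased alphabetic runs) and the two sample texts are assumptions made for illustration; the transcript does not state how SubSift tokenizes its input.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return term counts for a text, ignoring word order (illustrative tokenizer)."""
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)

# Hypothetical inputs: a paper abstract and text from a reviewer's publication list
abstract = bag_of_words("Learning to rank documents with kernel methods")
reviewer = bag_of_words("Kernel methods for learning; kernel machines")

print(abstract["kernel"], reviewer["kernel"])
```

Each document then becomes a vector with one dimension per distinct term, which is the form the similarity measure operates on.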


Vector Space Model: cosine similarity
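
Cosine similarity is the dot product of two term vectors divided by the product of their lengths. The sketch below computes it for sparse vectors stored as dictionaries; the term weights are made-up values, and returning 0.0 for an empty vector is a convention chosen here rather than something taken from the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: an empty vector matches nothing
    return dot / (norm_u * norm_v)

# Made-up term weights for a query and two documents
query = {"kernel": 1.0, "learning": 1.0}
doc_a = {"kernel": 2.0, "learning": 2.0}   # same direction as the query
doc_b = {"ontology": 3.0}                  # shares no terms with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the measure depends only on direction, a long document and a short one with the same term proportions score as identical.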


Vector Space Model: tf-idf weighting
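
The following sketch illustrates tf-idf weighting: a term scores highly in a document when it is frequent there but rare across the collection. It assumes one common variant, tf = count / document length and idf = log(N / df); the exact formula used by SubSift is not given in this transcript, and the sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by term frequency times inverse document frequency.

    docs maps a document id to its list of terms. Uses tf = count / len(doc)
    and idf = log(N / df), one common variant (an assumption here).
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

# Invented two-document collection
docs = {
    "d1": ["kernel", "learning", "kernel"],
    "d2": ["learning", "ontology"],
}
w = tf_idf(docs)
print(w["d1"])
```

In this toy collection "learning" appears in every document, so its idf is log(1) = 0 and it contributes nothing to any comparison.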


Representational State Transfer (REST)

“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to usual Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV

REST is a design pattern for web services based on HTTP using its familiar URIs, requests, responses, authentication, etc.
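
To make the CRUD-to-HTTP correspondence concrete, here is a small sketch that builds (but does not send) a request for a resource URI using Python's standard library. The base URI `http://example.org/api`, the resource path `documents/42` and the helper `build_request` are hypothetical placeholders, not SubSift's actual API.

```python
import urllib.request

# Hypothetical base URI, for illustration only (not the SubSift endpoint)
BASE = "http://example.org/api"

# The usual correspondence between CRUD operations and HTTP methods
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}

def build_request(operation, resource, body=None):
    """Build (but do not send) an HTTP request for a CRUD operation on a resource URI."""
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(
        f"{BASE}/{resource}",
        data=data,
        method=CRUD_TO_HTTP[operation],
        headers={"Accept": "application/json"},  # ask for a JSON response
    )

req = build_request("update", "documents/42", body="text=revised+abstract")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted here because the endpoint is a placeholder.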


3. SubSift REST API


SubSift System Architecture


SubSift REST API


Profiles


Matches


SubSift – canonical workflow


4. Demonstration Workflows


Workflow 1 – Submission Sifting

Workflow 1 – Web 2.0 Client Implementation


Workflow 1 – Papers is just a list of URLs (e.g. Yahoo! Pipes)


Workflow 2 – Finding an Expert


Finding an expert


Workflow 3 – Visualising Similarity


Clustering staff based on homepage similarity

Dendrogram produced in Matlab from SubSift-generated similarity matrix


Precision-recall at different thresholds
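
A sketch of how precision and recall could be evaluated at different similarity thresholds: pairs scoring at or above the threshold are treated as predicted matches and compared with a ground-truth set. The scores and the ground truth below are made up, and treating an empty prediction set as precision 1.0 is a convention chosen for this example.

```python
def precision_recall(scores, relevant, threshold):
    """Precision and recall when pairs scoring >= threshold are predicted as matches."""
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & relevant)
    # Convention chosen here: nothing predicted means no false positives
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall

# Made-up similarity scores for (person, person) pairs and a made-up ground truth
scores = {("a", "b"): 0.9, ("a", "c"): 0.6, ("b", "c"): 0.2, ("a", "d"): 0.4}
relevant = {("a", "b"), ("a", "d")}

for t in (0.1, 0.5, 0.8):
    p, r = precision_recall(scores, relevant, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold typically trades recall for precision, which is the trade-off a precision-recall plot makes visible.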


Similarity networks

Diagram created by Graphviz from SubSift-generated dot file
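
As an illustration of how a similarity matrix can become a network diagram, the sketch below writes Graphviz dot text with one edge for every pair whose similarity is at or above a threshold. The function `similarity_to_dot`, the names and the matrix values are assumptions for this example; it is not the dot output SubSift itself produces.

```python
def similarity_to_dot(names, matrix, threshold):
    """Emit an undirected Graphviz graph with an edge for each pair at or above threshold."""
    lines = ["graph similarity {"]
    for name in names:
        lines.append(f'  "{name}";')
    # Visit each unordered pair once (upper triangle of the symmetric matrix)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                lines.append(
                    f'  "{names[i]}" -- "{names[j]}" [label="{matrix[i][j]:.2f}"];'
                )
    lines.append("}")
    return "\n".join(lines)

# Made-up names and pairwise similarities
names = ["alice", "bob", "carol"]
matrix = [
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
dot_text = similarity_to_dot(names, matrix, threshold=0.3)
print(dot_text)
```

Saving the output to a file such as `graph.dot` and running `dot -Tpng graph.dot -o graph.png` would render it.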


Connectivity

Diagram created by Graphviz from SubSift-generated dot file


Workflow 4 – Profiling Reading Lists


Profiling a research group by its publications

Diagram produced in Wordle using SubSift profile data


Workflow 5 – Ranking News Stories


And finally...

Future Work

• Scaling-up
  • Currently a small-scale web application running on modest hardware.
  • Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
  • Mining and mapping the University of Bristol’s research landscape.
  • Crawling the University’s web pages to profile and visualise research interests of and similarities between faculty, departments, research groups and researchers.
  • Plans to apply to websites of other Universities.


5. Conclusions


Conclusion

• SubSift services are useful outside of the peer review domain.
• Workflows for profiling/comparing scientists
  • Promising e-Science and e-Research use cases for profiling and comparing scientists and their published works.
• Tool for constructing, manipulating and publishing document-centric datasets
  • E.g. information retrieval, data mining, pattern analysis research.
  • Publication of datasets in this way supports reproducibility of science.
  • Connects data through Linked Data and the Semantic Web.