academic recommendation using citation analysis with ...esaule/public-website/... · academic...
TRANSCRIPT
Academic Recommendation using Citation Analysis withtheadvisor
Erik Saulejoint work with Onur Kucuktunc, Kamer Kaya, Umit V. Catalyurek
Department of Biomedical InformaticsThe Ohio State University
CSTA 2013
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
:: 1 / 37
Table of Contents
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
:: 2 / 37
Once upon a time : a survey paper
The Jimmy John’s scheduling problem
schedulingpartitioningmapping
×pipelineworkflowdata flowtask graph
×linearchainsequences(tree)(serial parallel)
But also...“Scheduling problems in parallel query optimization”“Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallelprogramming”
After 6 months, unknown papers where still uncovered
Develop software to make the search easier!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 3 / 37
Once upon a time : a survey paper
The Jimmy John’s scheduling problem
schedulingpartitioningmapping
×pipelineworkflowdata flowtask graph
×linearchainsequences(tree)(serial parallel)
But also...“Scheduling problems in parallel query optimization”“Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallelprogramming”
After 6 months, unknown papers where still uncovered
Develop software to make the search easier!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 3 / 37
Once upon a time : a survey paper
The Jimmy John’s scheduling problem
schedulingpartitioningmapping
×pipelineworkflowdata flowtask graph
×linearchainsequences(tree)(serial parallel)
But also...“Scheduling problems in parallel query optimization”“Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallelprogramming”
After 6 months, unknown papers where still uncovered
Develop software to make the search easier!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 3 / 37
Once upon a time : a survey paper
The Jimmy John’s scheduling problem
schedulingpartitioningmapping
×pipelineworkflowdata flowtask graph
×linearchainsequences(tree)(serial parallel)
But also...“Scheduling problems in parallel query optimization”“Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallelprogramming”
After 6 months, unknown papers where still uncovered
Develop software to make the search easier!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 3 / 37
Design Goals
Personalized
The user should be able to make a query that describes precisely what sheis looking for.
Conceptual
The system should free of linguistic problems. Ambiguity and synonymyshould be taken into accounts.
Exploratory
Different perspective should be available. The system should enhance theuser’s search.
Easy to use
The user should not need to know anything about data mining oralgorithms.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 4 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
The Academic Web Service Ecosystem
DBLPList of CS papers with clean reference anddisambiguated names.
Citeseer, {Ref,Ack,Collab}Seer
Automatically crawled papers in CS. GivePDFs. Contain citation information, fulltext. Compute similarity.
CiteUlikeSocial paper tagging application. Findpaper from researchers with similar interest.
ArnetMinerAcademic network analysis.
MendeleyApplication for managing references.Database of reference.
Google Scholar
Keyword-based search engine (with citationinformations).
Microsoft Academic SearchKeyword-based search engine andAcademic network analysis.
IEEE, ACM, Elsevier, JSTOR, ...Publishers or digital libraries with completetext and references. Some suggestions.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Why? 5 / 37
A Use Case
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Overview 6 / 37
System Overview
Architecture
A web-server as a front end. A cluster in the back-end. New instances aredynamically created as the load varies.
Functional
.bib
.ris
.xml
paper IDs
parameters{k,d,κ}
πVisualization
Relevance Feedback
RecommendationEngine
Venue Rec.
Reviewer Rec.
DiversificationEngine
PaperMapper
venues
reviewers
papers
ππ
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Introduction::Overview 7 / 37
Outline
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis:: 8 / 37
Using the Citation Graph
Hypothesis
If two papers are related or treat the same subject, then they will be closeto each other in the citation graph (and reciprocal)
Benefits
No linguistic => no synonymy, no ambiguity
Automatically crowd source by researchers
Drawbacks
Difficult to gather the data (But thanks Citeseer)
Relies on researcher already having made similar connections
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis:: 9 / 37
Local Approaches
t
20032002 2004 20052001 2006
v
xu
reference edges of v citation edges of v
Bibliographic coupling [Kessler63]: papers having similarreferences are related
Cocitation [Small73]: papers which are cited by the same papersare related
CCIDF [Lawrence99]: cocitations weighted with inverse frequencies
Problem: Considers only level-2 papers based on level-1 information.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Previous Approaches 10 / 37
Local Approaches
t
20032002 2004 20052001 2006
v
xu
reference edges of v citation edges of v
Bibliographic coupling [Kessler63]: papers having similarreferences are related
Cocitation [Small73]: papers which are cited by the same papersare related
CCIDF [Lawrence99]: cocitations weighted with inverse frequencies
Problem: Considers only level-2 papers based on level-1 information.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Previous Approaches 10 / 37
Global Approaches
Graph distance-based
Katz: number of paths between two papers [Strohman07]
Random walk with restarts (RWR) based
ArticleRank [Li09] (PageRank [Brin98] extension)
PaperRank [Gori06] (Personalized PageRank [Haveliwala02]extension)
RWR treats the citations and references in the same way
This is not exploratory!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Previous Approaches 11 / 37
Global Approaches
Graph distance-based
Katz: number of paths between two papers [Strohman07]
Random walk with restarts (RWR) based
ArticleRank [Li09] (PageRank [Brin98] extension)
PaperRank [Gori06] (Personalized PageRank [Haveliwala02]extension)
RWR treats the citations and references in the same way
This is not exploratory!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Previous Approaches 11 / 37
PageRank
Let G = (V ,E ) be the citation graph
PageRank [Brin98]
πi (u) = d 1|V | + (1− d)
∑v∈N(u)
πi−1(v)δ(v)
with d ∈ (0 : 1) is the damping factor.It converges to a stable distribution.
source: wikipedia
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 12 / 37
PageRank
Let G = (V ,E ) be the citation graph
PageRank [Brin98]
πi (u) = d 1|V | + (1− d)
∑v∈N(u)
πi−1(v)δ(v)
with d ∈ (0 : 1) is the damping factor.It converges to a stable distribution.But it is not personalized.
source: wikipedia
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 12 / 37
PageRank
Let G = (V ,E ) be the citation graph
PageRank [Brin98]
πi (u) = d 1|V | + (1− d)
∑v∈N(u)
πi−1(v)δ(v)
with d ∈ (0 : 1) is the damping factor.It converges to a stable distribution.But it is not personalized.
Personalized PageRank [Jeh03]
πi (u) = dp∗(u)+(1−d)∑
v∈N(u)πi−1(v)δ(v)
with∑
p∗(u) = 1.source: wikipedia
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 12 / 37
PageRank
Let G = (V ,E ) be the citation graph
PageRank [Brin98]
πi (u) = d 1|V | + (1− d)
∑v∈N(u)
πi−1(v)δ(v)
with d ∈ (0 : 1) is the damping factor.It converges to a stable distribution.But it is not personalized.
Personalized PageRank [Jeh03]
πi (u) = dp∗(u)+(1−d)∑
v∈N(u)πi−1(v)δ(v)
with∑
p∗(u) = 1.But it is not exploratory.
source: wikipedia
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 12 / 37
Direction Awareness
Time exploration
What if we are interested in searching papers per years. Recent papers?Traditional papers?
Let M be a set of known relevant papers.
Direction Aware Random Walk with Restart
πi (u) = dp∗(u) + (1− d)(κ∑
v∈N+(u)πi−1(v)δ−(v) + (1− κ)
∑v∈N−(u)
πi−1(v)δ+(v) )
d ∈ (0 : 1) is the damping factor.
κ ∈ (0 : 1).
p∗(u) = 1|M| , if u ∈ M, p∗(u) = 0, otherwise
a b c d
restartedge
reference edge back-reference(citation) edgev
d (1-κ)δ+(v)
d κδ-(v)
(1-d)m
qm
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 13 / 37
Exploring in Depth
0 0.2 0.4 0.6 0.8 1κ
0.5
0.6
0.7
0.8
0.9
1
d
1
1.5
2
2.5
3
aver
age
shor
test
dis
tanc
e
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 14 / 37
Exploring in Time
0 0.2 0.4 0.6 0.8 1κ
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
d
1980
1985
1990
1995
2000
2005
2010
aver
age
publ
icat
ion
year
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 15 / 37
Hidden Reference Discovery
The recovery test
Let’s hide some references from a paper and see if an algorithm can findthem
Results of the experiments with mean average precision (MAP@50) and95% confidence intervals.
hide random hide recent hide earliermean interval mean interval mean interval
DaRWR 48.00 46.80 49.20 42.22 40.95 43.50 60.64 59.48 61.80P.R. 56.56 55.31 57.80 38.75 37.50 40.00 58.93 57.76 60.10Katzβ 46.33 45.16 47.50 34.56 33.42 35.70 44.19 42.97 45.40Cocit 44.60 43.39 45.80 14.22 13.25 15.20 55.97 54.64 57.30Cocoup 17.28 16.36 18.20 17.56 16.61 18.50 2.93 2.57 3.30CCIDF 18.05 17.11 19.00 18.97 17.94 20.00 3.55 3.10 4.00
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Citation Analysis::Direction Awareness 16 / 37
Outline
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem:: 17 / 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
πi (u) = dp∗(u) + (1− d)
κ ∑v∈N+(u)
πi−1(v)
δ−(v)+ (1− κ)
∑v∈N−(u)
πi−1(v)
δ+(v)
πi (u) = dp∗(u) +
∑v∈N+(u)
(1− d)κ
δ−(v)πi−1(v) +
∑v∈N−(u)
(1− d)(1− κ)
δ+(v)πi−1(v)
πi = dp∗ + Aπi−1
CRS Full
Traverse A column per column.
Skip columns where πi−1(v) = 0.
Per edge: 2 non-zeros (2 indices, 2 values)
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 18 / 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
πi (u) = dp∗(u) + (1− d)
κ ∑v∈N+(u)
πi−1(v)
δ−(v)+ (1− κ)
∑v∈N−(u)
πi−1(v)
δ+(v)
πi (u) = dp∗(u) +
∑v∈N+(u)
(1− d)κ
δ−(v)πi−1(v) +
∑v∈N−(u)
(1− d)(1− κ)
δ+(v)πi−1(v)
πi = dp∗ + Aπi−1
CRS Full
Traverse A column per column.
Skip columns where πi−1(v) = 0.
Per edge: 2 non-zeros (2 indices, 2 values)
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 18 / 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
πi (u) = dp∗(u) +∑
v∈N+(u)
(1− d)κ
δ−(v)πi−1(v) +
∑v∈N−(u)
(1− d)(1− κ)
δ+(v)πi−1(v)
πi = dp∗ + B−(1− d)κ
δ−πi−1 +B+ (1− d)(1− κ)
δ+πi−1
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 19 / 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
πi (u) = dp∗(u) +∑
v∈N+(u)
(1− d)κ
δ−(v)πi−1(v) +
∑v∈N−(u)
(1− d)(1− κ)
δ+(v)πi−1(v)
πi = dp∗ + B−(1− d)κ
δ−πi−1 +B+ (1− d)(1− κ)
δ+πi−1
CRS Half
pre-compute: (1−d)κδ− πi−1 and (1−d)(1−κ)
δ+ πi−1
B− and B+ are 0/1 and symmetric
Traverse the matrix twice (B− and B+)
Skip columns where πi−1(v) = 0.
Per edge: 1 non-zeros (1 index, 0 values)
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 19 / 37
A Sparse Matrix-Vector Multiplication (SpMV)
Rewriting DaRWR
πi (u) = dp∗(u) +∑
v∈N+(u)
(1− d)κ
δ−(v)πi−1(v) +
∑v∈N−(u)
(1− d)(1− κ)
δ+(v)πi−1(v)
πi = dp∗ + B−(1− d)κ
δ−πi−1 +B+ (1− d)(1− κ)
δ+πi−1
COO Half
pre-compute: (1−d)κδ− πi−1 and (1−d)(1−κ)
δ+ πi−1
B− and B+ are 0/1 and symmetric
Traverse the matrix once (B− and B+)
Arbitrary order. Don’t skip anything.
Per edge: 1 non-zeros (2 indices, 0 values)
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 19 / 37
Number of updates
0
2M
4M
6M
8M
10M
12M
2 4 6 8 10 12 14 16 18 20
# up
date
s
iteration
CRS-FullCRS-HalfCOO-Half
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 20 / 37
Runtimes
1
1.5
2
2.5
3
1 2 4 8 16 32 64
exec
utio
n tim
e (s
)
#partitions
CRS-FullCRS-Full (RCM)CRS-Full (AMD)CRS-Full (SB)CRS-HalfCRS-Half (RCM)CRS-Half (AMD)CRS-Half (SB)COO-HalfCOO-Half (RCM)COO-Half (AMD)COO-Half (SB)HybridHybrid (RCM)Hybrid (AMD)Hybrid (SB) 1
1.5
2
2.5
3
1 2 4 8 16 32 64
exec
utio
n tim
e (s
)
#partitions
CRS-FullCRS-Full (RCM)CRS-Full (AMD)CRS-Full (SB)CRS-HalfCRS-Half (RCM)CRS-Half (AMD)CRS-Half (SB)COO-HalfCOO-Half (RCM)COO-Half (AMD)COO-Half (SB)HybridHybrid (RCM)Hybrid (AMD)Hybrid (SB)
CRS-FullCRS-Half
COO-Half
Hybrid
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::SpMV 21 / 37
Ordering
LocalitySpMV is sensitive to non-zero locality.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 22 / 37
Ordering
LocalitySpMV is sensitive to non-zero locality.
Reverse Cuthill-McKee [Cuthill, McKee, 69]
Order with respect to a Breadth First Search ordering. (Do 10 times, pick best)
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 22 / 37
Ordering
LocalitySpMV is sensitive to non-zero locality.
Reverse Cuthill-McKee [Cuthill, McKee, 69]
Order with respect to a Breadth First Search ordering. (Do 10 times, pick best)
Approximate Minimum Degree [Amestoy et al.,96]
Greedily, add the vertex whose degree is minimum.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 22 / 37
Ordering
LocalitySpMV is sensitive to non-zero locality.
Reverse Cuthill-McKee [Cuthill, McKee, 69]
Order with respect to a Breadth First Search ordering. (Do 10 times, pick best)
Approximate Minimum Degree [Amestoy et al.,96]
Greedily, add the vertex whose degree is minimum.
Slashburn [Kang, Faloutsos,11]
Order by connected components. Remove the highest degree vertex. Repeat.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 22 / 37
Partitioning
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 23 / 37
Runtimes
1
1.5
2
2.5
3
1 2 4 8 16 32 64
exec
utio
n tim
e (s
)
#partitions
CRS-FullCRS-Full (RCM)CRS-Full (AMD)CRS-Full (SB)CRS-HalfCRS-Half (RCM)CRS-Half (AMD)CRS-Half (SB)COO-HalfCOO-Half (RCM)COO-Half (AMD)COO-Half (SB)HybridHybrid (RCM)Hybrid (AMD)Hybrid (SB) 1
1.5
2
2.5
3
1 2 4 8 16 32 64
exec
utio
n tim
e (s
)
#partitions
CRS-FullCRS-Full (RCM)CRS-Full (AMD)CRS-Full (SB)CRS-HalfCRS-Half (RCM)CRS-Half (AMD)CRS-Half (SB)COO-HalfCOO-Half (RCM)COO-Half (AMD)COO-Half (SB)HybridHybrid (RCM)Hybrid (AMD)Hybrid (SB)
CRS-FullCRS-Half
COO-Half
Hybrid
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
A HPC computing problem::Ordering 24 / 37
Outline
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 25 / 37
Principle
The goal of diversity is to avoid clustered answers.
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 26 / 37
Principle
The goal of diversity is to avoid clustered answers.
Relevant
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 26 / 37
Principle
The goal of diversity is to avoid clustered answers.
Relevant Relevant Diverse
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 26 / 37
A Modelization problem
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
better
Here is a distribution ofknown algorithms
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 27 / 37
A Modelization problem
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
better
Would such an algorithm beof interest?
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 27 / 37
A Modelization problem
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
dens
2
rel
10-RLMBC1BC1VBC2BC2VBC1100BC11000BC150BC2100BC21000BC250CDivRankDRAGON
GRASSHOPPERGSPARSEIL1IL2LMPDivRankPRk-RLMtop-90%+randomtop-75%+randomtop-50%+randomtop-25%+randomAll Random
better
Would such an algorithm beof interest?That algorithm is random!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 27 / 37
What to do?
See later talk!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 28 / 37
Results
GPU
Multicore
Generic SpMV
Eigensolvers
Partitioning
Compression
Graph mining
references
recommendations
top-100
Multicore
GPU
Multicore
Generic SpMV
Eigensolvers
Partitioning
Compression
Graph mining
references
recommendations
top-100
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Result Diversification:: 29 / 37
Outline
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 30 / 37
Relevance Feedback
Papers can be tagged are relevantor irrelevant.
Positive feedback (+RF):Relevant results are added to Q
Negative feedback (-RF):Irrelevant results are removedfrom the graph
How long does it take to find thefirst level-3 paper?
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
1 10 100
Tau
Ratio
No RFPos RFNeg RF
Pos+Neg RF
More exploration!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 31 / 37
Relevance Feedback
Papers can be tagged are relevantor irrelevant.
Positive feedback (+RF):Relevant results are added to Q
Negative feedback (-RF):Irrelevant results are removedfrom the graph
How long does it take to find thefirst level-3 paper?
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1
1 10 100
Tau
Ratio
No RFPos RFNeg RF
Pos+Neg RF
More exploration!Erik Saule
Ohio State University, Biomedical InformaticsHPC Lab http://bmi.osu.edu/hpc
theadvisor: http://theadvisor.osu.edu/Other Features:: 31 / 37
Visualization
More exploration!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 32 / 37
Visualization
More exploration!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 32 / 37
Application Programming Interface
Web service
theadvisor can be accessed programmatically. Emit HTTP requests andobtain JSON encoded replies.
Potential Applications
Interfacing with article editors (e.g., TexShop)
Recommendation in bibliography manager (e.g., Mendeley)
Suggesting reviewers to program committees (e.g., EasyChair)
Suggesting sessions of interest at conferences (e.g., iConference )
Easier to use!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 33 / 37
Application Programming Interface
Web service
theadvisor can be accessed programmatically. Emit HTTP requests andobtain JSON encoded replies.
Potential Applications
Interfacing with article editors (e.g., TexShop)
Recommendation in bibliography manager (e.g., Mendeley)
Suggesting reviewers to program committees (e.g., EasyChair)
Suggesting sessions of interest at conferences (e.g., iConference )
Easier to use!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 33 / 37
Application Programming Interface
Web service
theadvisor can be accessed programmatically. Emit HTTP requests andobtain JSON encoded replies.
Potential Applications
Interfacing with article editors (e.g., TexShop)
Recommendation in bibliography manager (e.g., Mendeley)
Suggesting reviewers to program committees (e.g., EasyChair)
Suggesting sessions of interest at conferences (e.g., iConference )
Easier to use!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Other Features:: 33 / 37
Outline
1 IntroductionWhy?Overview
2 Citation Analysis for Document RecommendationPrevious ApproachesDirection Aware Recommendation
3 A High Performance Computing ProblemA specialization of SpMVOrdering and Partitioning
4 Result Diversification
5 Other Features
6 Final ThoughtsConclusionFuture Works
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Final Thoughts:: 34 / 37
Design Goals - Are they matched?
Personalized
The query is expressed in a very precise way.
Conceptual
Using the citation graph allows to avoid all linguistic. Though, it may notbe enough to find all relevant papers.
Exploratory
Direction Awareness (to choose time), Diversification (to see more topics),Visualization (for manual crawling)
Easy to use
Efficient (recommendation in less than 2 seconds), web-based.
Is it good enough? Tell us!
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/Final Thoughts::Conclusion 35 / 37
Design Goals - Are they matched?
Personalized
The query is expressed in a very precise way.
Conceptual
Using the citation graph allows to avoid all linguistic. Though, it may notbe enough to find all relevant papers.
Exploratory
Direction Awareness (to choose time), Diversification (to see more topics),Visualization (for manual crawling)
Easy to use
Efficient (recommendation in less than 2 seconds), web-based.
Is it good enough? Tell us!Erik Saule
Ohio State University, Biomedical InformaticsHPC Lab http://bmi.osu.edu/hpc
theadvisor: http://theadvisor.osu.edu/Final Thoughts::Conclusion 35 / 37
Future works
Clustering
Let’s assume for an instant that we have accurate disambiguated tags forevery document. We could restrict analysis to some fields. Improvediversification.
Betweenness Centrality
DaRWR provides recommendation around the query set. What aboutrecommending what is between it?
Contextual information
Distinguishing types of papers and citations. Survey, Method,Application...
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Final Thoughts::Future Works 36 / 37
Thank you
More information
contact : [email protected]: http://theadvisor.osu.edu
(or http://bmi.osu.edu/hpc/
or http://bmi.osu.edu/~esaule)
Research at HPC lab is supported by
Erik SauleOhio State University, Biomedical Informatics
HPC Lab http://bmi.osu.edu/hpctheadvisor: http://theadvisor.osu.edu/
Final Thoughts::Future Works 37 / 37