Graph and Topological Structure Mining on Scientific Articles

Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska
The Ohio State University; Kent State University
Presenter: Fan Wang, The Ohio State University

Upload: shawn-porter

Post on 29-Dec-2015


Page 1: Graph and Topological Structure Mining on Scientific Articles

Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska

The Ohio State University; Kent State University

Presenter: Fan Wang, The Ohio State University

Page 2: Outline

• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Page 3: Introduction

• Huge number of genes in literature
• Associated with targeted disease or functionality
• Finding interaction among genes manually
  – Time consuming
  – Error prone

Page 4: Introduction

[Figure: graphs over the nodes CCL5, CCL4, CCL3, CCR5, CD4 and T; the edges are not recoverable from the transcript]

• Well-known relationship among chemokine ligands
• Mining these relations from literature documents
• Mining frequent patterns from graph datasets
  – Convenient representation
  – Lots of research in subgraph mining

Page 5: Introduction

• Our Goal
  – Find commonly occurring interactions
  – Represent them visually
• Capture the co-occurrence of scientific terms
• Graph representation of scientific document
• Mining frequent topological structures

Page 6: Outline

• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Page 7: Topological Structure Mining

• Disadvantages of subgraph mining
  – Exact matching
  – Missing potential patterns
• Focusing on the topological relationship
• Incorporating approximate matching

Page 8: Topological Structure Mining

[Figure: three graphs labeled Y, G and X]

G is a subgraph of Y

X is a (0,3) topological structure of Y

Page 9: Topological Structure Mining

• Definition
  – Given a collection of graphs, two parameters l and h, and a threshold θ, an (l,h)-topological structure whose support is greater than or equal to θ is called a frequent topological structure.
• TSMiner, the algorithm implemented in our KDD05 paper, finds frequent topological structures in a given set of graphs
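The frequency test in the definition can be sketched as follows. This is a minimal illustration, not TSMiner: the slides do not spell out how an (l,h)-topological structure is matched against a graph, so the matcher is passed in as a parameter, and the default `exact_contains` is a stand-in that only does exact labeled-edge matching. Support is taken to be the fraction of graphs that contain the pattern, which is consistent with the percentage thresholds (sup=15%, sup=20%) used later in the talk. All function and type names here are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set

Edge = FrozenSet[str]   # undirected edge between two keyword labels
Graph = Set[Edge]       # one document graph, stored as a set of edges


def edge(a: str, b: str) -> Edge:
    """Build an undirected edge between two labeled nodes."""
    return frozenset((a, b))


def exact_contains(pattern: Graph, graph: Graph) -> bool:
    """Stand-in matcher (assumption): every labeled edge of the pattern
    appears in the graph. A real (l,h) matcher would be plugged in here."""
    return pattern <= graph


def support(pattern: Graph, graphs: Iterable[Graph],
            contains: Callable[[Graph, Graph], bool] = exact_contains) -> float:
    """Fraction of graphs in the collection that contain the pattern."""
    graph_list = list(graphs)
    if not graph_list:
        return 0.0
    hits = sum(1 for g in graph_list if contains(pattern, g))
    return hits / len(graph_list)


def is_frequent(pattern: Graph, graphs: Iterable[Graph], theta: float,
                contains: Callable[[Graph, Graph], bool] = exact_contains) -> bool:
    """A pattern is frequent when its support is at least theta."""
    return support(pattern, graphs, contains) >= theta


# Toy collection of three document graphs (illustrative data only).
graphs = [
    {edge("CCL5", "CCL3"), edge("CCL3", "CCL4")},
    {edge("CCL5", "CCL3")},
    {edge("CCL4", "CCR5")},
]
pattern = {edge("CCL5", "CCL3")}
print(support(pattern, graphs), is_frequent(pattern, graphs, theta=0.5))
```

Keeping the matcher as a parameter means the same support computation would serve exact subgraph matching and an approximate topological matcher alike.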

Page 10: Our Work

• Using topological structure mining
• Challenges
  – How to create graphs?
  – What are the keywords?
  – How to insert edges into graphs?

Page 11: Outline

• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Page 12: Data Preprocessing and Graph Representation

• One graph for each document
• Nodes are keywords of interest
• Edges inserted based on occurrence of the keywords
• Run topological structure mining algorithm

Page 13: Data Preprocessing

• Four dictionaries of keywords
  – Short Dictionary
    • 321 genes expressed between prostate epithelial and stromal cells
  – Long Dictionary
    • 2600 human genes found in SuperArray's DNA microarray experiment
  – Confusion Dictionary
    • Gene names easily confused with ordinary words
  – GO Dictionary
    • GO terms (molecular function, biological process and cellular component)
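One way dictionary-based keyword spotting could work is sketched below. The slides do not say how the confusion dictionary is applied, so the rule used here is an assumption: a name listed as confusable only counts when it appears in its exact dictionary casing, while other names match case-insensitively. The function name `find_keywords`, the tokenizer, and the sample dictionary entries are all hypothetical.

```python
import re
from typing import List, Set


def find_keywords(text: str, dictionary: Set[str], confusable: Set[str]) -> List[str]:
    """Return dictionary keywords in order of appearance in the text.

    Assumed rule: names in `confusable` must match with exact casing;
    every other name matches regardless of case.
    """
    lookup = {name.upper(): name for name in dictionary}
    found: List[str] = []
    # Simple alphanumeric tokenizer; hyphenated gene names would need more care.
    for token in re.findall(r"[A-Za-z0-9]+", text):
        name = lookup.get(token.upper())
        if name is None:
            continue  # not a dictionary keyword
        if name in confusable and token != name:
            continue  # likely an ordinary word, not the gene name
        found.append(name)
    return found


# Illustrative dictionaries only; not the talk's actual dictionary contents.
dictionary = {"CCL5", "IGF1", "CAT"}
confusable = {"CAT"}
text = "The cat study linked ccl5 to IGF1, and CAT expression rose."
print(find_keywords(text, dictionary, confusable))
```

The keywords returned for a document would become the nodes of that document's graph.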

Page 14: Graph Representations

• Edge Construction Methods
  – Sentence-based Method
    • Two keywords in one sentence
  – Mutual Information Method
    • The mutual information of two keywords greater than a threshold
  – Sliding Window Method
    • Two keywords located within a sliding window with a pre-defined size
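The three edge construction methods can be sketched as below, under several assumptions that are not in the slides: sentences are split naively on end punctuation, the window size is counted in tokens, and mutual information is taken to be pointwise mutual information estimated from sentence-level co-occurrence counts within one document. All function names are hypothetical.

```python
import math
import re
from itertools import combinations
from typing import FrozenSet, List, Set

Edge = FrozenSet[str]


def tokens_of(text: str) -> List[str]:
    """Simple alphanumeric tokenizer (assumption)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def sentences_of(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_edges(text: str, keywords: Set[str]) -> Set[Edge]:
    """Connect two keywords when they occur in the same sentence."""
    edges: Set[Edge] = set()
    for sentence in sentences_of(text):
        present = sorted(set(tokens_of(sentence)) & keywords)
        edges.update(frozenset(pair) for pair in combinations(present, 2))
    return edges


def window_edges(text: str, keywords: Set[str], window: int) -> Set[Edge]:
    """Connect two keywords when they fall within `window` consecutive tokens."""
    tokens = tokens_of(text)
    edges: Set[Edge] = set()
    for i, first in enumerate(tokens):
        if first not in keywords:
            continue
        for second in tokens[i + 1:i + window]:
            if second in keywords and second != first:
                edges.add(frozenset((first, second)))
    return edges


def mi_edges(text: str, keywords: Set[str], threshold: float) -> Set[Edge]:
    """Connect two keywords when their pointwise mutual information,
    estimated from sentence counts (assumption), exceeds `threshold`."""
    per_sentence = [set(tokens_of(s)) & keywords for s in sentences_of(text)]
    n = len(per_sentence)
    edges: Set[Edge] = set()
    for a, b in combinations(sorted(keywords), 2):
        count_a = sum(1 for s in per_sentence if a in s)
        count_b = sum(1 for s in per_sentence if b in s)
        count_ab = sum(1 for s in per_sentence if a in s and b in s)
        if count_ab == 0:
            continue  # never co-occur, so no edge
        pmi = math.log(count_ab * n / (count_a * count_b))
        if pmi > threshold:
            edges.add(frozenset((a, b)))
    return edges


keywords = {"CCL5", "CCR5", "CCL3", "CCL4"}
text = ("CCL5 binds CCR5. CCL3 and CCL4 were also measured. "
        "Later work revisited CCL5 and CCR5.")
print(sentence_edges(text, keywords))
print(window_edges(text, keywords, window=3))
print(mi_edges(text, keywords, threshold=1.0))
```

In this toy text the sliding window links CCR5 and CCL3 across a sentence boundary, an edge the sentence-based method cannot produce.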

Page 15: Outline

• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Page 16: Experiment Results

• Focusing on articles containing at least one of the 5 genes
  – CCL5, TF, IGF1, MYLK, IGFBP3
• Generating graph for each article
• Finding frequent topological structures

Page 17: Three Edge Construction Methods

[Figure: two charts, "Number of Patterns Found, MYLK, sup=15%" and "Number of Patterns Found, MYLK, sup=20%". x-axis: L and H values, (0,0), (1,1), (2,2). y-axis: Total Number of Patterns (0 to 1200 at sup=15%, 0 to 500 at sup=20%). Series: Sentence, MI, Window.]

Page 18: Three Edge Construction Methods

[Figure: two charts, "Number of Patterns Found, IGFBP3, sup=15%" and "Number of Patterns Found, IGFBP3, sup=20%". x-axis: L and H values, (0,0), (1,1), (2,2). y-axis: Total Number of Patterns (0 to 400 at sup=15%, 0 to 350 at sup=20%). Series: Sentence, MI, Window.]

Page 19: Three Edge Construction Methods

Page 20: Results

• Sliding window method wins
  – Largest number of frequent patterns
  – Best scalability
• Topological structure mining gives us more frequent patterns
• A large number of patterns doesn't mean high biological significance

Page 21: Pattern Analysis

[Figure: pattern graphs over the nodes CCL5, CCL4, CCL3, CCR5, CCBP2, PSMB6, CXCR4 and T, with panels labeled "Underlying Graph 1" and "Underlying Graph 2"; the edges are not recoverable from the transcript]

• Can ONLY be found by topological structure mining
• Can ONLY be found by sliding window method
• Restoring nodes reveals interesting patterns

Page 22: Outline

• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion

Page 23: Conclusion

• Sliding window method is the best
  – The largest number of frequent patterns
  – The highest quality of frequent patterns
• Topological structures found correspond well to known relationships
• Topological mining is a very valuable tool for biological researchers

Page 24: Three Edge Construction Methods

• Interestingness of Edges
  – Counting the number of distinct edges
  – Computing the average interestingness of edges for all patterns found by using each edge construction method