Graph and Topological Structure Mining on Scientific Articles
Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska
The Ohio State University; Kent State University
Presenter: Fan Wang
The Ohio State University
Outline
• Introduction
• Topological Structure Mining
• Data Preprocessing and Graph Representations
• Experiment Results and Pattern Analysis
• Conclusion
Introduction
• Huge number of genes in literature
• Associated with targeted disease or functionality
• Finding interaction among genes manually
– Time consuming
– Error prone
Introduction
[Figure: two graphs, one with nodes CCL5, CCL4, T, CCL3, CCR5 and one with nodes CCL5, T, CCL3, CCL4, CD4]
• Well-known relationship among chemokine ligands
• Mining these relations from literature documents
• Mining frequent patterns from graph datasets
– Convenient representation
– Lots of research in subgraph mining
Introduction
• Our Goal
– Find commonly occurring interactions
– Represent them visually
• Capture the co-occurrence of scientific terms
• Graph representation of scientific document
• Mining frequent topological structures
Topological Structure Mining
• Disadvantages of subgraph mining
– Exact matching
– Missing potential patterns
• Focusing on the topological relationship
• Incorporating approximate matching
Topological Structure Mining
[Figure: a graph Y, a graph G, and a graph X]
G is a subgraph of Y
X is a (0,3) topological structure of Y
Topological Structure Mining
• Definition
– Given a collection of graphs, two parameters l and h, and a threshold θ, an (l,h)-topological structure whose support is greater than or equal to θ is called a frequent topological structure.
• Our KDD05 paper implements TSMiner, an algorithm that finds frequent topological structures in a given set of graphs
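The support threshold in the definition can be sketched in a few lines. This is only an illustration, not TSMiner: the names `make_graph`, `exact_match`, `support` and `is_frequent` are hypothetical, support is assumed to be the fraction of graphs in which a structure occurs (the experiments quote it as a percentage), and the default matching predicate is plain exact edge matching on graphs with uniquely labeled nodes, so it ignores l and h. An (l,h)-aware matcher would be passed in as `occurs_in`.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Edge = FrozenSet[str]    # undirected edge between two labeled nodes
Graph = FrozenSet[Edge]  # a graph is represented by its set of edges


def make_graph(pairs: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected graph from (node, node) pairs."""
    return frozenset(frozenset(p) for p in pairs)


def exact_match(pattern: Graph, graph: Graph) -> bool:
    """Assumed stand-in matcher: every pattern edge appears in the graph."""
    return pattern <= graph


def support(pattern: Graph, graphs: List[Graph],
            occurs_in: Callable[[Graph, Graph], bool] = exact_match) -> float:
    """Fraction of graphs in the collection that contain the pattern."""
    if not graphs:
        return 0.0
    hits = sum(1 for g in graphs if occurs_in(pattern, g))
    return hits / len(graphs)


def is_frequent(pattern: Graph, graphs: List[Graph], theta: float,
                occurs_in: Callable[[Graph, Graph], bool] = exact_match) -> bool:
    """A structure is frequent when its support is at least theta."""
    return support(pattern, graphs, occurs_in) >= theta


# Made-up collection of three document graphs, for illustration only.
docs = [
    make_graph([("CCL5", "CCL3"), ("CCL3", "CCL4")]),
    make_graph([("CCL5", "CCL3"), ("CCL5", "CCR5")]),
    make_graph([("CCL4", "CD4")]),
]
pattern = make_graph([("CCL5", "CCL3")])
print(support(pattern, docs), is_frequent(pattern, docs, theta=0.5))
```

Keeping the matcher as a parameter separates the support bookkeeping from the harder question of how a structure is matched against a graph.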
Our Work
• Using topological structure mining
• Challenges
– How to create graphs?
– What are the keywords?
– How to insert edges into graphs?
Data Preprocessing and Graph Representation
• One graph for each document
• Nodes are keywords of interest
• Edges inserted based on occurrence of the keywords
• Run topological structure mining algorithm
Data Preprocessing
• Four dictionaries of keywords
– Short Dictionary
• 321 genes expressed between prostate epithelial and stromal cells
– Long Dictionary
• 2600 human genes found in SuperArray’s DNA microarray experiment
– Confusion Dictionary
• Gene names easily confused with ordinary words
– GO Dictionary
• GO terms (molecular function, biological process and cellular component)
Graph Representations
• Edge Construction Methods
– Sentence-based Method
• Two keywords in one sentence
– Mutual Information Method
• The mutual information of two keywords greater than a threshold
– Sliding Window Method
• Two keywords located within a sliding window with a pre-defined size
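The three edge construction methods can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function names, the naive sentence splitter and tokenizer, and the sample text are all made up here. The slides do not say how mutual information is estimated, so `mi_edges` assumes pointwise mutual information computed from sentence-level co-occurrence within one document. The window size is counted in tokens.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # keyword pair stored in sorted order


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text: str) -> List[str]:
    # Naive tokenizer (assumption): runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def sentence_edges(text: str, keywords: Set[str]) -> Set[Edge]:
    """Sentence-based method: connect keywords that share a sentence."""
    edges: Set[Edge] = set()
    for sentence in split_sentences(text):
        present = sorted(set(tokenize(sentence)) & keywords)
        edges.update(combinations(present, 2))
    return edges


def window_edges(text: str, keywords: Set[str], window: int) -> Set[Edge]:
    """Sliding window method: connect keywords that fit in `window` tokens."""
    hits = [(i, t) for i, t in enumerate(tokenize(text)) if t in keywords]
    edges: Set[Edge] = set()
    for (i, a), (j, b) in combinations(hits, 2):
        if a != b and j - i < window:
            lo, hi = sorted((a, b))
            edges.add((lo, hi))
    return edges


def mi_edges(text: str, keywords: Set[str], threshold: float) -> Set[Edge]:
    """Mutual information method, assumed here to be sentence-level PMI."""
    sentences = [set(tokenize(s)) & keywords for s in split_sentences(text)]
    n = len(sentences)
    single, pair = Counter(), Counter()
    for present in sentences:
        single.update(present)
        pair.update(combinations(sorted(present), 2))
    edges: Set[Edge] = set()
    for (a, b), c_ab in pair.items():
        # PMI = log( p(a,b) / (p(a) p(b)) ) with probabilities count / n.
        pmi = math.log(c_ab * n / (single[a] * single[b]))
        if pmi > threshold:
            edges.add((a, b))
    return edges


# Made-up text and keyword set, for illustration only.
text = "CCL5 binds CCR5. CCL3 and CCL4 are related ligands of CCR5."
keywords = {"CCL5", "CCR5", "CCL3", "CCL4"}
print(sentence_edges(text, keywords))
print(window_edges(text, keywords, window=3))
print(mi_edges(text, keywords, threshold=0.5))
```

In this sketch the sliding window can cross a sentence boundary, so it may produce edges that the sentence-based method never sees.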
Experiment Results
• Focusing on articles containing at least one of the 5 genes
– CCL5, TF, IGF1, MYLK, IGFBP3
• Generating graph for each article
• Finding frequent topological structures
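The article selection step might look like the sketch below. It assumes articles are available as plain text keyed by an identifier and that a gene counts as mentioned when its symbol appears as a whole token. The names `TARGET_GENES` and `articles_by_gene` and the sample articles are hypothetical. Grouping per gene is also an assumption, suggested by the per-gene result charts.

```python
import re
from typing import Dict, List, Set

# The five genes named on the slide.
TARGET_GENES: Set[str] = {"CCL5", "TF", "IGF1", "MYLK", "IGFBP3"}


def articles_by_gene(articles: Dict[str, str],
                     genes: Set[str] = TARGET_GENES) -> Dict[str, List[str]]:
    """Map each target gene to the ids of articles that mention it."""
    selected: Dict[str, List[str]] = {g: [] for g in sorted(genes)}
    for article_id, text in articles.items():
        # Whole-token, case-sensitive match (assumption).
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        for gene in tokens & genes:
            selected[gene].append(article_id)
    return selected


# Made-up articles, for illustration only.
articles = {
    "a1": "MYLK expression was compared with IGFBP3 levels.",
    "a2": "No gene of interest is named here.",
    "a3": "CCL5 and its receptors were examined.",
}
print(articles_by_gene(articles))
```

Each per-gene article list would then be turned into one graph per article and passed to the mining step.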
Three Edge Construction Methods
[Figure: Number of Patterns Found, MYLK, sup=15%. x-axis: L and H values at (0,0), (1,1), (2,2); y-axis: Total Number of Patterns; series: Sentence, MI, Window]
[Figure: Number of Patterns Found, MYLK, sup=20%. Same axes and series]
Three Edge Construction Methods
[Figure: Number of Patterns Found, IGFBP3, sup=15%. x-axis: L and H values at (0,0), (1,1), (2,2); y-axis: Total Number of Patterns; series: Sentence, MI, Window]
[Figure: Number of Patterns Found, IGFBP3, sup=20%. Same axes and series]
Three Edge Construction Methods
Results
• Sliding window method wins
– Largest number of frequent patterns
– Best scalability
• Topological structure mining gives us more frequent patterns
• A large number of patterns doesn’t mean high biological significance
Pattern Analysis
[Figure: two patterns, one with nodes CCL5, T, CCL3, CCL4, PSMB6 and one with nodes CCL5, CCL4, T, CCL3, CCR5, CCBP2]
• Can ONLY be found by topological structure mining
• Can ONLY be found by sliding window method
• Restoring nodes reveals interesting patterns
[Figure: Underlying Graph 1 and Underlying Graph 2, over the nodes CCL5, T, CCL3, CCL4, PSMB6 and CXCR4]
Conclusion
• Sliding window method is the best
– The largest number of frequent patterns
– The highest quality of frequent patterns
• Topological structures found correspond well to known relationships
• Topological mining is a very valuable tool for biological researchers
Three Edge Construction Methods
• Interestingness of Edges
– Counting the number of distinct edges
– Computing the average interestingness of edges for all patterns found by using each edge construction method
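The two measurements above can be sketched as below. The slides do not define how an edge's interestingness is scored, so the score is passed in by the caller, and the lookup table used here is invented for illustration. Averaging over distinct edges rather than over edge occurrences is also an assumption. All names (`edge`, `distinct_edges`, `average_interestingness`) are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set

Edge = FrozenSet[str]       # undirected edge between two labeled nodes
Pattern = FrozenSet[Edge]   # a mined pattern, represented by its edges


def edge(a: str, b: str) -> Edge:
    return frozenset((a, b))


def distinct_edges(patterns: Iterable[Pattern]) -> Set[Edge]:
    """Union of the edges over all patterns found by one method."""
    found: Set[Edge] = set()
    for pattern in patterns:
        found |= pattern
    return found


def average_interestingness(patterns: Iterable[Pattern],
                            score: Callable[[Edge], float]) -> float:
    """Mean score over the distinct edges (assumed averaging scheme)."""
    edges = distinct_edges(patterns)
    if not edges:
        return 0.0
    return sum(score(e) for e in edges) / len(edges)


# Made-up patterns and scores, for illustration only.
patterns = [
    frozenset({edge("CCL5", "CCL3"), edge("CCL3", "CCL4")}),
    frozenset({edge("CCL3", "CCL4"), edge("CCL5", "PSMB6")}),
]
scores = {
    edge("CCL5", "CCL3"): 1.0,
    edge("CCL3", "CCL4"): 0.5,
    edge("CCL5", "PSMB6"): 0.0,
}
print(len(distinct_edges(patterns)))
print(average_interestingness(patterns, lambda e: scores.get(e, 0.0)))
```

Running the same two functions on the patterns from each edge construction method gives one edge count and one average per method to compare.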