Current Research in Data Mining Research Group
Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs, LinkedIn, HP Lab & Boeing
April 19, 2023
Outline
An Introduction to Data Mining Research Group
Pattern Discovery Methods
Mining Heterogeneous Information Networks
Construction of Heterogeneous Information Networks from Unstructured Data
TextCube and OLAP heterogeneous networks
Mining Cyber-Physical Systems and Networks
Conclusions
Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC
Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining
Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan, RankingCube, TruthFinder, NetClus, RankClass, …
600+ research papers, most cited author/group in data mining
ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. McDowell Award
Students: ACM KDD Dissertation Awards (2008, 2013), …
Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
Funded as NSCTA (Network Science Collaborative Technology Alliance) by ARL [09-14, 15-19], ARO, NIH KnowEnG, NSF, Boeing, MSR, Google, Yahoo!, HP Labs, …
Graduated 40+ Ph.D.s: joined Google, Microsoft Research, Yahoo! Labs, Facebook, Twitter, as well as professors (14)
Supervising 17 Ph.D., 4 M.S. students & 5 visitors/postdocs
Data Mining Research Group in CS, Univ. Illinois
• Student Prominent Awards
– SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
– 10-year impact paper awards
– Best student paper awards, best papers, best posters, …
– KDDCUP 2013 Runner Up Award
– IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships
• Graduation:
– Professors at UVA, UCSB, PSU, U. Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
– Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.
Mining Sequential Patterns from Shopping Sequences
Sequential pattern mining: Given a set of (shopping) sequences, find the complete set of frequent subsequences
A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc>: a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
Our innovation:
(1) PrefixSpan (TKDE’04): 1598 citations
(2) CloSpan (SDM’03): 568 (reduce redundancy)
(3) FPgrowth (SIGMOD’00): 4956
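[Figure: Idea of PrefixSpan. For s = <a(abc)(ac)d(cf)>, the projection on prefix <a> is <(abc)(ac)d(cf)>, recorded as s|<a>: ( , 2); the projection on prefix <ab> is <(_c)(ac)d(cf)>, recorded as s|<ab>: ( , 4)]
[Figure: Idea of CloSpan]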
Difficult to generalize to biosequence mining: approximate patterns & noise
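The subsequence and support definitions on this slide can be checked with a short sketch. It is not PrefixSpan itself, only a direct containment test. The helper names `seq`, `is_subsequence`, and `support` are illustrative assumptions, and each item is assumed to be a single character so that a string such as "abc" stands for the element (abc).

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*elements: str) -> Sequence:
    """Build a sequence; each string is one element (itemset) of 1-char items."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """Greedy left-to-right match: every pattern element must be a subset
    of a distinct sequence element, preserving order."""
    pos = 0
    for element in pattern:
        # Skip sequence elements that do not contain this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # The next pattern element must match a later element.
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database shown on the slide.
database = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

contains = is_subsequence(seq("a", "bc", "d", "c"), database[10])
sup = support(seq("ab", "c"), database)
```

Here `contains` confirms that <a(bc)dc> is a subsequence of sequence 10, and `sup` counts the sequences (10 and 30) that contain <(ab)c>, which meets min_sup = 2.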
Mining Frequent Subgraph Patterns from Graph DBs
Graph pattern mining: Given a set of graphs, find the complete set of frequent subgraphs
[Figure: GRAPH DATASET (e.g., Chemical Compound Database) and FREQUENT PATTERNS (Let MIN SUPPORT = 2)]
Our innovation:
(1) gSpan (ICDM’02): 1319 citations
(2) CloseGraph (KDD’03): 520 (not to mine subgraphs covered by their super-patterns)
Idea of gSpan: Graph pattern growth + completeness of right-most extension
[Figure: a k-edge graph G grown into (k+1)-edge graphs G1, G2, …, Gn]
CloseGraph: Under what condition can we stop searching their children, i.e., early termination?
[Figure: NCI/NIH AIDS antiviral screen compound data, minsup = 5%]
Extend to mine structures in large single networks (VLDB’11)
Graph Indexing and Graph Similarity Search
Graph Search: Given a query graph Q, find all the graphs in graph DB containing Q
[Figure: query graph Q, graph DB, and graph index; the graph index helps search]
Our innovation:
gIndex (SIGMOD’04): 419 citations
grafil (SIGMOD’05): similarity search
gIndex key idea: index on frequent and discriminative substructures (mined)
[Figure: two plots. Legends: Path, Frequent Structure, Discriminative Frequent Structure; GraphGrep, gIndex, Actual Match. Axes: # candidates / query size; # indices / DB size]
grafil key idea: explore feature similarity
[Figure: query Q compared with graph G through features and approximate features]
Mining Heterogeneous Information Networks
Heterogeneous networks: Multiple object types and/or multiple link types
[Figure: DBLP Bibliographic Network (Venue, Paper, Author); The IMDB Movie Network (Actor, Movie, Director, Studio); The Facebook Network]
Homogeneous networks are information-loss projections of heterogeneous networks!
Directly mining information-richer heterogeneous networks
Current work: Mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …
Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
Power of het. network modeling: Treat Author, Venue, Term, Paper all as first-class citizens!
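A small sketch of why a homogeneous projection loses information: two bibliographic networks that differ in venues and terms collapse to the same co-author graph. The records, author names, and the function `coauthor_projection` are hypothetical and only for illustration.

```python
from collections import Counter
from itertools import combinations


def coauthor_projection(papers):
    """Project author/venue/term records onto a homogeneous co-author graph.

    Only author pairs survive; venue and term information is dropped.
    """
    edges = Counter()
    for record in papers:
        for a, b in combinations(sorted(record["authors"]), 2):
            edges[(a, b)] += 1
    return edges


# Two hypothetical heterogeneous networks with different venues and terms.
network_1 = [
    {"authors": ["Ann", "Bob"], "venue": "KDD", "terms": ["graph", "mining"]},
    {"authors": ["Ann", "Cat"], "venue": "VLDB", "terms": ["query", "index"]},
]
network_2 = [
    {"authors": ["Ann", "Bob"], "venue": "SIGIR", "terms": ["text", "retrieval"]},
    {"authors": ["Ann", "Cat"], "venue": "SIGIR", "terms": ["web", "search"]},
]

# Both networks yield the same co-author graph after projection.
same_projection = coauthor_projection(network_1) == coauthor_projection(network_2)
```

Because `same_projection` is true, anything mined from the co-author graph alone cannot tell the two networks apart, whereas keeping Venue and Term as node types can.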
RankClus: Rank-Based Clustering
RankClus (EDBT’09)/NetClus (KDD’09): Integrate ranking & clustering for mining heterogeneous info networks
[Figure: DBLP Schema. Research Paper (P) linked to Term (T) by Contain, to Author (A) by Write, and to Venue (V) by Publish; NetClus splits the Computer Science network into clusters such as Database, Hardware, Theory]
RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINE
RankClass: Integration of Ranking and Classification
Knowledge propagation via multi-typed heterogeneous networks
Our innovation:
ECMLPKDD'10/KDD’11: integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class

                      Database   Data Mining      AI          IR
Top-5 ranked conf.s   VLDB       KDD              IJCAI       SIGIR
                      SIGMOD     SDM              AAAI        ECIR
                      ICDE       ICDM             ICML        CIKM
                      PODS       PKDD             CVPR        WWW
                      EDBT       PAKDD            ECML        WSDM
Top-5 ranked terms    data       mining           learning    retrieval
                      database   data             knowledge   information
                      query      clustering       reasoning   web
                      system     classification   logic       search
                      xml        frequent         cognition   text

Potential applications: Biological network mining
Meta-Path Guided Similarity Search in Networks
Similarity search: Find similar objects in networks
Meta-Path: Meta-level description of a path between two objects
Different meta-paths carry rather different semantics
Who are most similar to AnHai Doan? Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

Name             Affiliation     Area            PhD
AnHai Doan       CS, Wisconsin   Database area   2002
Jignesh Patel    CS, Wisconsin   Database area   1998
Amol Deshpande   CS, Maryland    Database area   2004
Jun Yang         CS, Duke        Database area   2001

[Figure: DBLP Network Schema]
Our innovation
PathSim (VLDB’11): Similarity search in heterogeneous networks; a balanced similarity measure; user-guidance by selecting different meta-paths
Application in biomedical domain
IBM: search for close relationships among disease, drugs, treatments, side-effects, and explanations
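A minimal sketch of a PathSim-style score under the APVPA meta-path. It assumes the commonly cited normalization s(x, y) = 2 * paths(x, y) / (paths(x, x) + paths(y, y)), where paths(x, y) counts APVPA path instances between x and y. The author names and publication counts below are hypothetical.

```python
from typing import Dict

# papers_per_venue[author][venue] = number of papers (hypothetical counts).
papers_per_venue: Dict[str, Dict[str, int]] = {
    "author_a": {"SIGMOD": 2, "VLDB": 1},
    "author_b": {"SIGMOD": 50, "VLDB": 20},
    "author_c": {"SIGMOD": 2, "VLDB": 2},
    "author_d": {"KDD": 3},
}


def path_count(w: Dict[str, Dict[str, int]], x: str, y: str) -> int:
    """Number of Author-Paper-Venue-Paper-Author path instances from x to y."""
    return sum(count * w[y].get(venue, 0) for venue, count in w[x].items())


def pathsim(w: Dict[str, Dict[str, int]], x: str, y: str) -> float:
    """Balanced similarity: shared paths normalized by each object's self-paths."""
    denom = path_count(w, x, x) + path_count(w, y, y)
    return 2 * path_count(w, x, y) / denom if denom else 0.0


sim_ab = pathsim(papers_per_venue, "author_a", "author_b")
sim_ac = pathsim(papers_per_venue, "author_a", "author_c")
sim_ad = pathsim(papers_per_venue, "author_a", "author_d")
```

With these counts, `author_c` scores higher than the far more prolific `author_b` when compared with `author_a`, which is the sense in which the measure is balanced: it favors peers with a similar publication profile over merely highly connected objects.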
PathPredict: Meta-Path Based Relationship Prediction
Meta path-guided prediction: Infer or predict new relationships among multi-typed links
[Figure: Network schema with node types paper, topic, venue, author and link types publish/publish-1, mention/mention-1, write/write-1, contain/contain-1, cite/cite-1]
PathPredict (ASONAM’11): Co-author prediction (A-P-A) using topological features encoded by meta paths, e.g., (A-P→P-A). Which meta-path is more important?
Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Trained based on data collected in [1996, 2002]; testing period: [2003, 2009])
Our contribution: Different meta-paths have different prediction power: p-values obtained from the DBLP data
Applications: Who will be your new coauthors?
Truth Analysis: Enhancing the Quality of Heterogeneous Information Networks
Motivation: Info. provided can be untrustworthy, error-prone, missing, …
Application: handling conflicting claims on biomedical properties
[Figure: info providers (w1, w2, w3, w4) make claims (f1, f2, f3, f4) about objects (o1, o2)]
Multiple facts, two-sided claims:
[Figure: IMDB, Netflix, and BadSource make positive and negative claims about Harry Potter; claims are marked correct or incorrect]
Our contribution
TruthFinder (TKDE’08): mutual enhancement of trustworthiness of info providers and claims
Latent Truth Model (VLDB’12): modeling two-sided truth
Experimental datasets: Large and real datasets
Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
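The mutual enhancement between providers and claims can be illustrated with a simplified iteration: a claim is more credible when trusted providers assert it, and a provider is more trusted when its claims are credible. This is a sketch in the spirit of TruthFinder, not the published algorithm; it leaves out refinements such as interactions between conflicting claims. The source names, claims, and the initial trust of 0.8 are assumptions.

```python
from typing import Dict, Set, Tuple


def mutual_enhancement(
    claims: Dict[str, Set[str]], iterations: int = 20, initial_trust: float = 0.8
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Iteratively estimate provider trust and claim confidence.

    claims maps each provider to the (non-empty) set of claims it asserts.
    """
    trust = {provider: initial_trust for provider in claims}
    all_claims = set().union(*claims.values())
    providers_of = {
        c: [p for p, asserted in claims.items() if c in asserted] for c in all_claims
    }
    confidence: Dict[str, float] = {}
    for _ in range(iterations):
        # A claim is wrong only if every provider asserting it is wrong.
        for claim, providers in providers_of.items():
            all_wrong = 1.0
            for provider in providers:
                all_wrong *= 1.0 - trust[provider]
            confidence[claim] = 1.0 - all_wrong
        # A provider's trust is the average confidence of its claims.
        for provider, asserted in claims.items():
            trust[provider] = sum(confidence[c] for c in asserted) / len(asserted)
    return trust, confidence


# Hypothetical conflicting claims about one book's author.
claims = {
    "source_1": {"book1: author X"},
    "source_2": {"book1: author X"},
    "source_3": {"book1: author X"},
    "bad_source": {"book1: author Y"},
}
trust, confidence = mutual_enhancement(claims)
```

In this toy run the claim backed by three providers ends with higher confidence than the lone conflicting claim, and the three agreeing providers end with higher trust than `bad_source`.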
Hierarchical Relationship Discovery
From partially ordered objects to hierarchy (tree)
Based on NLP or other techniques to extract partially ordered objects
Using constraints to discover relationships
[Figure: Singleton Potential; Pairwise Potential Function: Cases]
Discovery of the Kenny Family Tree
Recursive Construction of a Topical Hierarchy by Phrase Mining
The Framework of CATHY (Constructing A Topical HierarchY)
Term co-occurrence network
Topic discovery
Topical phrase mining and ranking
Recursive construction
Growing Parallel Paths (WWW 2011)
[Figure: DOM tag paths (HTML, DIV, UL, LI, A; TABLE, TR, TD, A; DIV, P, A) on Pages A to F, with links labeled 1 to 6 and X, Y, Z, W, U, V; result: path]
WinaCS: Web Information Network Analysis for Computer Science
[Figure: mappings between structured data (Name, URL, Zipcode, …) and web pages under the cs.illinois.edu root, e.g., link paths /people, /people/faculty, /people/faculty/vikram-adve; People, Faculty, Research, and Data Mining pages; faculty names (Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, Jiawei Han) and personal sites (www.cs.illinois.edu/homes/hanj/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, rsim.cs.illinois.edu/~sadve/)]
Database records can be found on link paths!
Research-Insight [SIGMOD’13 Demo]
Advisor-Advisee result for “Kevin Chang”
Potential collaborators for “Jiawei Han”
Query on “Jim Gray”
Query on “Machine Learning”
Event Cube: An Overview
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
[Figure: a multidimensional text database is organized into an event cube representation with dimensions Time (98.01, 98.02, 99.01, 99.02; rolled up to 1998, 1999), Location (LAX, SJC, MIA, AUS; rolled up to CA, FL, TX), and Topic (overshoot, undershoot, birds, turbulence; rolled up to Deviation, Encounter), with drill-down and roll-up]
Analysis support for the analyst: Multidimensional OLAP, Ranking, Cause Analysis, Topic Summarization/Comparison, …
Funded by NASA (2008-2010)
Text/Topic Cube: General Idea
Heterogeneous: categorical attributes + unstructured text
How to combine? Our solution:
[Figure: each event report (ACN) has categorical attributes (Time, Location, Place, Environment, …) that form the cube dimensions, plus text data; the cube measure is a text/topic model over the unstructured text, given as term/topic weights (T1: W1, T2: W2, T3: W3, …)]
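One way to picture the combination of categorical attributes and text is a cube whose cells hold aggregated term weights. The sketch below uses raw term counts as the weight and a hypothetical airport-to-state mapping for roll-up; the report texts and the function names `build_cells` and `roll_up` are illustrative assumptions, not the Event Cube implementation.

```python
from collections import Counter, defaultdict

# Hypothetical event reports: categorical attributes plus free text.
reports = [
    {"time": "1998", "location": "LAX", "text": "overshoot runway"},
    {"time": "1998", "location": "SJC", "text": "birds runway"},
    {"time": "1999", "location": "MIA", "text": "turbulence"},
    {"time": "1998", "location": "LAX", "text": "overshoot turbulence"},
]
STATE_OF = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}


def build_cells(records, dims):
    """Group reports by the chosen dimensions; each cell's measure is a
    term-weight table (here simply term counts)."""
    cells = defaultdict(Counter)
    for record in records:
        key = tuple(record[d] for d in dims)
        cells[key].update(record["text"].split())
    return cells


def roll_up(cells, dim_index, mapping):
    """Merge cells along one dimension using a child-to-parent mapping."""
    rolled = defaultdict(Counter)
    for key, terms in cells.items():
        parent = mapping[key[dim_index]]
        new_key = key[:dim_index] + (parent,) + key[dim_index + 1:]
        rolled[new_key].update(terms)  # Counter.update adds the counts
    return rolled


cells = build_cells(reports, ("time", "location"))
by_state = roll_up(cells, 1, STATE_OF)
```

After the roll-up, the cell ("1998", "CA") merges the LAX and SJC reports, so its term weights summarize all 1998 California events; drill-down is the reverse view back to the airport-level cells.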
Effective OLAP Exploration
TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube
TEXplorer (CIKM’11): Integrating keyword-based ranking and OLAP exploration
[Figure: Healthcare Reform example]
EventCube Snapshot: Query Result
MoveMine: Mining Moving Object Databases
A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases", SIGMOD’10 (system demo)
Mining Spatiotemporal and Mobility Data
[Figure: raw movement data (time series view of longitude and latitude over time in hours) and a density map with four spots. Spot #1: Office; Spot #2: Commuting city; Spot #3: Home; Spot #4: Vacation place]
Mining Periodicity in Sparse Data [KDD12]
Event has a period of 20
Occurrences of the event happen between 20k+5 and 20k+10
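The example above can be reproduced with a simple folding heuristic: fold the timestamps modulo a candidate period and check how strongly they concentrate in a short window, compared with what a uniform spread would give. This is a swapped-in illustration, not the method of the KDD12 paper. The synthetic observations, the window width of 6, and the candidate range are assumptions.

```python
from typing import Iterable, List


def window_score(timestamps: List[int], period: int, width: int) -> float:
    """Fraction of folded timestamps inside the best circular window of the
    given width, minus the fraction expected under a uniform spread."""
    if width >= period:
        return 0.0  # The window covers the whole period; nothing to learn.
    residues = [t % period for t in timestamps]
    best = 0
    for start in range(period):
        inside = sum(1 for r in residues if (r - start) % period < width)
        best = max(best, inside)
    return best / len(residues) - width / period


def estimate_period(timestamps: List[int], candidates: Iterable[int], width: int) -> int:
    """Return the candidate period with the highest concentration score."""
    return max(candidates, key=lambda p: window_score(timestamps, p, width))


# Sparse synthetic observations: every third cycle is missing, and each
# occurrence falls between 20k+5 and 20k+10.
observations = [20 * k + 5 + (k % 6) for k in range(50) if k % 3 != 0]

best_period = estimate_period(observations, range(2, 61), width=6)
```

Subtracting the uniform baseline `width / period` keeps short candidate periods (for example 10) from scoring well simply because the window covers most of the cycle.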
GeoTopic Discovery: Mining Spatial Text
Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
[Figure: results compared across LDM, TDM, GeoFolk, and LGTA]
Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
LPTA: Latent Periodic Topic Analysis: Discovery of Temporal Patterns of Topics
Periodic topic: repeating in regular intervals
Background topic: covered uniformly over the entire period
Bursty topic: a transient topic that is intensively covered only in a certain time period
Time distribution of topics
Integration of both text and time in analysis
Social Relationship Mining from Sensor Trace Data
T-Motif: a time interval [S,T] such that
many positive pairs meet at that time
few negative pairs meet at that time
Ex.: MIT Reality Mining dataset: 94 people tracked for 10 months
Use only spatiotemporal info
Algs. for efficient mining of T-motifs and effective classification
Mining RFID Data to Explore Trajectories
(Factory, T1,T2) (Shipping,T3,T4) (Warehouse, T5,T6)
(Shelf, T7,T8)(Checkout,T9,T10)
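The trajectory above is a list of (location, time in, time out) stays. A minimal sketch of how such stay records could be derived from raw tag readings follows; the raw format of (location, timestamp) pairs for a single tag and the function name `to_stay_records` are assumptions for illustration.

```python
from typing import List, Tuple

Reading = Tuple[str, int]    # (location, timestamp) for one tag
Stay = Tuple[str, int, int]  # (location, time_in, time_out)


def to_stay_records(readings: List[Reading]) -> List[Stay]:
    """Collapse consecutive readings at the same location into one stay."""
    stays: List[Stay] = []
    for location, t in sorted(readings, key=lambda r: r[1]):
        if stays and stays[-1][0] == location:
            # Same location as the previous reading: extend the stay.
            stays[-1] = (location, stays[-1][1], t)
        else:
            # New location: start a new stay.
            stays.append((location, t, t))
    return stays


# Hypothetical raw readings for one item moving through the supply chain.
readings = [
    ("Factory", 1), ("Factory", 2),
    ("Shipping", 3), ("Shipping", 4),
    ("Warehouse", 5), ("Warehouse", 6),
]
trajectory = to_stay_records(readings)
```

Collapsing readings this way shrinks the data considerably when a tag is read many times while sitting at one location, which is what makes trajectory-level warehousing practical.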
Warehousing and mining RFID data
Conclusions
An Introduction to Data Mining Research Group
Pattern Discovery Methods
Mining Heterogeneous Information Networks
Construction of Heterogeneous Information Networks from Unstructured Data
TextCube and OLAP heterogeneous networks
Mining Cyber-Physical Systems and Networks
Lots to be done in this promising research frontier!