Current Research in Data Mining Research Group

Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign

Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs, LinkedIn, HP Lab & Boeing

February 7, 2022

Upload: jane-morrison

Post on 27-Dec-2015



Outline

• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Conclusions

Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC

• Mining patterns and knowledge discovery from massive data
• Data mining in heterogeneous information networks
• Exploring broad applications of data mining
• Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan, RankingCube, TruthFinder, NetClus, RankClass, …
• 600+ research papers; most cited author/group in data mining
• ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. McDowell Award
• Students: ACM KDD Dissertation Awards (2008, 2013), …
• Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
• Funded as NSCTA (Network Science Collaborative Technology Alliance) by ARL [09-14, 15-19], ARO, NIH KnowEnG, NSF, Boeing, MSR, Google, Yahoo!, HP Labs, …
• Graduated 40+ Ph.D.’s: joined Google, Microsoft Research, Yahoo! Labs, Facebook, Twitter, as well as professors (14)
• Supervising 17 Ph.D., 4 M.S. students & 5 visitors/postdocs

Data Mining Research Group in CS, Univ. Illinois

• Student Prominent Awards
– SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
– 10-year impact paper awards
– Best student paper awards, best papers, best posters, …
– KDDCUP 2013 Runner Up Award
– IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships

• Graduation:
– Professors at UVA, UCSB, PSU, U. Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
– Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.


Mining Sequential Patterns from Shopping Sequences

• Sequential pattern mining: Given a set of (shopping) sequences, find the complete set of frequent subsequences
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Our innovation:
(1) PrefixSpan (TKDE’04): 1598 citations
(2) CloSpan (SDM’03): 568 citations (reduces redundancy)
(3) FPgrowth (SIGMOD’00): 4956 citations

Idea of PrefixSpan (figure): for s = <a(abc)(ac)d(cf)>, the projection on prefix <a> is <(abc)(ac)d(cf)> and the projection on prefix <ab> is <(_c)(ac)d(cf)>; the figure records them as s|<a>: ( , 2) and s|<ab>: ( , 4).

Idea of CloSpan (figure)

Difficulty: generalizing it to biosequence mining (approximate patterns & noise)
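The containment and support definitions on this slide can be checked with a short Python sketch. The list-of-sets encoding and the names `is_subsequence` and `support` are illustrative assumptions; this is a brute-force check of the definitions, not PrefixSpan itself, which grows patterns over projected databases instead of rescanning every sequence.

```python
# Each sequence is a list of itemsets; <a(abc)(ac)d(cf)> becomes
# [{"a"}, {"a","b","c"}, {"a","c"}, {"d"}, {"c","f"}].
DATABASE = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}


def is_subsequence(pattern, sequence):
    """True if every itemset of `pattern` is contained, in order,
    in a distinct element of `sequence` (greedy earliest match)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a later element
    return True


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# <a(bc)dc> is contained in sequence 10.
print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"c"}], DATABASE[10]))
# <(ab)c> is supported by sequences 10 and 30.
print(support([{"a", "b"}, {"c"}], DATABASE))
```

With min_sup = 2, a pattern is frequent when `support` returns at least 2.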

Mining Frequent Subgraph Patterns from Graph DBs

• Graph pattern mining: Given a set of graphs, find the complete set of frequent subgraphs
• Figure: GRAPH DATASET (e.g., chemical compound database) and FREQUENT PATTERNS (min support = 2)

Our innovation:
(1) gSpan (ICDM’02): 1319 citations
(2) CloseGraph (KDD’03): 520 citations (does not mine subgraphs covered by their super-patterns)

Idea of gSpan: graph pattern growth + completeness of right-most extension

CloseGraph (figure: a k-edge graph G and its (k+1)-edge extensions G1, G2, …, Gn): under what condition can we stop searching their children, i.e., terminate early?

• Data: NCI/NIH AIDS antiviral screen compound data, minsup = 5%
• Extended to mine structures in large single networks (VLDB’11)

Graph Indexing and Graph Similarity Search

• Graph search: Given a query graph Q, find all the graphs in the graph DB containing Q
• A graph index helps search (figure: query graph Q, graph DB, graph index)

Our innovation:
• gIndex (SIGMOD’04): 419 citations
• grafil (SIGMOD’05): similarity search

gIndex key idea: index on frequent and discriminative substructures (mined)

[Figure: # indices vs. DB size (1k to 16k) for Path, Frequent Structure, and Discriminative Frequent Structure; # candidates vs. query size (4 to 24) for GraphGrep, gIndex, and Actual Match]

grafil key idea: explore feature similarity (figure: features of query Q matched approximately against features of graph G)


Mining Heterogeneous Information Networks

• Heterogeneous networks: multiple object types and/or multiple link types
• Examples (figures): the DBLP Bibliographic Network (Venue, Paper, Author); the IMDB Movie Network (Actor, Movie, Director, Movie Studio); the Facebook Network
• Homogeneous networks are information-loss projections of heterogeneous networks!
• Directly mining information-richer heterogeneous networks
• Current work: Mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!

• DBLP: a computer science bibliographic database
• A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
• Power of het. network modeling: treat Author, Venue, Term, and Paper all as first-class citizens!

RankClus: Rank-Based Clustering

• RankClus (EDBT’09)/NetClus (KDD’09): Integrate ranking & clustering for mining heterogeneous info networks
• DBLP schema (figure): Research Paper linked to Term (Contain), Author (Write), and Venue (Publish)
• NetClus (figure): the Computer Science network of papers (P), terms (T), authors (A), and venues (V) is split into net-clusters such as Database, Hardware, Theory, …
• RankCompete: Organize your photo album automatically!
• Rank treatments for AIDS from MEDLINE

RankClass: Integration of Ranking and Classification

• Knowledge propagation via multi-typed heterogeneous networks
• Our innovation (ECMLPKDD’10/KDD’11): integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
• DBLP: 4-field data set (DB, DM, AI, IR) forming a heterog. info. network
• Rank objects within each class (with extremely limited label information)
• Obtain high classification accuracy and excellent rankings within each class
• Potential applications: biological network mining

                     Database   Data Mining      AI          IR
Top-5 ranked conf.s  VLDB       KDD              IJCAI       SIGIR
                     SIGMOD     SDM              AAAI        ECIR
                     ICDE       ICDM             ICML        CIKM
                     PODS       PKDD             CVPR        WWW
                     EDBT       PAKDD            ECML        WSDM
Top-5 ranked terms   data       mining           learning    retrieval
                     database   data             knowledge   information
                     query      clustering       reasoning   web
                     system     classification   logic       search
                     xml        frequent         cognition   text

Meta-Path Guided Similarity Search in Networks

• Similarity search: Find similar objects in networks
• Who are most similar to AnHai Doan?
• Meta-path: meta-level description of a path between two objects
• Different meta-paths carry rather different semantics
• Figure: DBLP network schema

Authors shown for meta-path Author-Paper-Venue-Paper-Author (APVPA):

Name             Affiliation     Area            PhD
AnHai Doan       CS, Wisconsin   Database area   2002
Jignesh Patel    CS, Wisconsin   Database area   1998
Amol Deshpande   CS, Maryland    Database area   2004
Jun Yang         CS, Duke        Database area   2001

Our innovation:
• PathSim (VLDB’11): similarity search in heterogeneous networks; a balanced similarity measure; user guidance by selecting different meta-paths

Application in biomedical domain:
• IBM: search for close relationships among diseases, drugs, treatments, side-effects, and explanations
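To make the "balanced similarity measure" concrete, here is a minimal Python sketch of PathSim under the symmetric meta-path APVPA. PathSim scores a pair (x, y) as twice the number of meta-path instances between x and y, divided by the number of instances from x to itself plus those from y to itself. The author names and paper counts below are hypothetical toy data, and the function name `pathsim` is an assumption made for illustration.

```python
def pathsim(counts, x, y):
    """PathSim of authors x and y under the meta-path A-P-V-P-A.

    counts[a][v] is the number of papers author a published in venue v,
    so the number of A-P-V-P-A path instances between two authors is the
    dot product of their venue-count vectors.
    """
    venues = set(counts[x]) | set(counts[y])

    def paths(p, q):
        return sum(counts[p].get(v, 0) * counts[q].get(v, 0) for v in venues)

    return 2 * paths(x, y) / (paths(x, x) + paths(y, y))


# Hypothetical toy data: papers per venue for three authors.
COUNTS = {
    "Alice": {"SIGMOD": 2, "VLDB": 1},
    "Bob": {"SIGMOD": 50, "VLDB": 20},
    "Carol": {"SIGMOD": 2, "KDD": 1},
}

print(pathsim(COUNTS, "Alice", "Carol"))  # peers with similar visibility
print(pathsim(COUNTS, "Alice", "Bob"))    # same venues, very different volume
```

The normalization by self-paths is what makes the measure balanced: a highly prolific author does not automatically become everyone's nearest neighbor.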

PathPredict: Meta-Path Based Relationship Prediction

• Meta path-guided prediction: infer or predict new relationships among multi-typed links
• Network schema (figure): object types paper, topic, venue, and author; link types publish/publish⁻¹, mention/mention⁻¹, write/write⁻¹, contain/contain⁻¹, and cite/cite⁻¹
• PathPredict (ASONAM’11): co-author prediction (A-P-A) using topological features encoded by meta-paths, e.g., (A-P→P-A)
• Which meta-path is more important?
• Our contribution: different meta-paths have different prediction power (p-values obtained from the DBLP data)
• Co-author prediction for Jian Pei: only 42 among 4809 candidates are true first-time co-authors! (Trained on data collected in [1996, 2002]; testing period: [2003, 2009])
• Applications: Who will be your new coauthors?

Truth Analysis: Enhancing the Quality of Heterogeneous Information Networks

• Motivation: information provided can be untrustworthy, error-prone, missing, …
• Application: handling conflicting claims on biomedical properties
• Figure: info providers (w1 to w4) make claims (f1 to f4) about objects (o1, o2)
• Multiple facts, two-sided claims (figure): sources IMDB, Netflix, and BadSource make positive and negative claims about Harry Potter; claims are marked correct or incorrect

Experimental datasets: large and real datasets
• Book authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
• Movie directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)

Our contribution:
• TruthFinder (TKDE’08): mutual enhancement of trustworthiness of info providers and claims
• Latent Truth Model (VLDB’12): modeling two-sided truth
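The mutual-enhancement idea can be illustrated with a deliberately simplified Python sketch. It is not the published TruthFinder algorithm (which adds dampening and claim-implication terms) and it does not model two-sided claims; it only alternates two steps, scoring a claim by the trust of the sources that assert it and scoring a source by the average confidence of its claims. The provider-to-claim assignments, the initial trust of 0.5, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict
from math import prod


def truth_analysis(claims, iterations=3, initial_trust=0.5):
    """claims[source][obj] = value asserted by that source for that object.

    Returns (trust per source, confidence per (obj, value) claim).
    """
    # Which sources assert each (object, value) claim?
    supporters = defaultdict(list)
    for source, asserted in claims.items():
        for obj, value in asserted.items():
            supporters[(obj, value)].append(source)

    trust = {source: initial_trust for source in claims}
    confidence = {}
    for _ in range(iterations):
        # A claim is believable unless every source asserting it is wrong.
        confidence = {
            fact: 1.0 - prod(1.0 - trust[s] for s in sources)
            for fact, sources in supporters.items()
        }
        # A source is as trustworthy as its claims are believable on average.
        trust = {
            source: sum(confidence[(o, v)] for o, v in asserted.items())
            / len(asserted)
            for source, asserted in claims.items()
        }
    return trust, confidence


def most_confident(confidence):
    """Pick, for each object, the value with the highest confidence."""
    best = {}
    for (obj, value), score in confidence.items():
        if obj not in best or score > confidence[(obj, best[obj])]:
            best[obj] = value
    return best


# Hypothetical assignments of providers w1..w4 to claims f1..f4.
CLAIMS = {
    "w1": {"o1": "f1", "o2": "f3"},
    "w2": {"o1": "f1", "o2": "f3"},
    "w3": {"o1": "f2", "o2": "f3"},
    "w4": {"o1": "f2", "o2": "f4"},
}

trust, confidence = truth_analysis(CLAIMS)
print(most_confident(confidence))
print(sorted(trust, key=trust.get))  # least to most trusted source
```

In this toy setting w4 ends up least trusted because it alone asserts f4, which in turn lowers the confidence of f2 relative to f1.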


Hierarchical Relationship Discovery

• From partially ordered objects to hierarchy (tree)
• Based on NLP or other techniques to extract partially ordered objects
• Using constraints to discover relationships
• Figures: singleton potential; pairwise potential function (cases); discovery of the Kenny family tree

Recursive Construction of a Topical Hierarchy by Phrase Mining

The Framework of CATHY (Constructing A Topical HierarchY):
• Topic discovery
• Topical phrase mining and ranking
• Recursive construction
• Term co-occurrence network

Growing Parallel Paths (WWW 2011)

[Figure: HTML tag paths (HTML, DIV, UL, LI, TABLE, TR, TD, P, A) across Pages A to F, with numbered steps 1 to 6 and link targets X, Y, Z, W, U, V; result: a path]

WinaCS: Web Information Network Analysis for Computer Science

[Figure: link paths under / (root) [cs.illinois.edu], such as /people, /people/faculty, /people/faculty/jiawei-han, /people/faculty/dan-roth, and /people/faculty/vikram-adve, plus research-area pages; pages are typed as People, Faculty, Research, DataMining, and PersonalSite (e.g., www.cs.illinois.edu/homes/hanj/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, rsim.cs.illinois.edu/~sadve/). Structured data (Name, URL, Zipcode) for Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, and Jiawei Han is mapped to web pages.]

• Database records can be found on link paths!

Research-Insight [SIGMOD’13 Demo]

• Advisor-advisee result for “Kevin Chang”
• Potential collaborators for “Jiawei Han”
• Query on “Jim Gray”
• Query on “Machine Learning”


Event Cube: An Overview

• Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
• Funded by NASA (2008-2010)
• Pipeline (figure): multidimensional text database, Event Cube representation, analysis support for the analyst
• Event Cube representation (figure): dimensions Time (98.01, 98.02, 99.01, 99.02; 1998, 1999), Location (LAX, SJC, MIA, AUS; CA, FL, TX), and Topic (overshoot, undershoot, birds, turbulence; Deviation, Encounter), navigated by drill-down and roll-up
• Analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …

Text/Topic Cube: General Idea

• Heterogeneous: categorical attributes + unstructured text
• How to combine? Our solution:
– Cube: categorical attributes
– Measure: text/topic model over the unstructured text, i.e., term/topic weights (T1, W1), (T2, W2), (T3, W3), …
• Record schema (figure): ACN, Time, Location, Place, Environment, …, Event Report (text data)
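The combination of a categorical cube with a text measure can be sketched in Python as follows. Plain term counts stand in for the term/topic weights; the report texts are hypothetical, the names `build_text_cube`, `REPORTS`, and `AIRPORT_TO_STATE` are made up for illustration, and the reading of time labels such as 98.01 as belonging to year 1998 is an assumption.

```python
from collections import Counter, defaultdict

# Roll-up hierarchy for the Location dimension (airport code to state).
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}

# Hypothetical event reports: (time, location, report text).
REPORTS = [
    ("98.01", "LAX", "overshoot on landing"),
    ("98.02", "SJC", "birds then overshoot"),
    ("99.01", "MIA", "turbulence on approach"),
    ("99.02", "AUS", "birds and turbulence"),
]


def build_text_cube(reports, roll_up=False):
    """Map each (time, location) cell to a term-count measure.

    With roll_up=True, cells are aggregated to (year, state).
    """
    cells = defaultdict(Counter)
    for period, airport, text in reports:
        if roll_up:
            # Assumes labels of the form "YY.NN" fall in year 19YY.
            key = ("19" + period.split(".")[0], AIRPORT_TO_STATE[airport])
        else:
            key = (period, airport)
        cells[key].update(text.lower().split())
    return cells


base = build_text_cube(REPORTS)
rolled = build_text_cube(REPORTS, roll_up=True)
print(rolled[("1998", "CA")].most_common(3))
```

Because the measure is additive, a rolled-up cell is simply the sum of the term counts of its child cells, which is what makes drill-down and roll-up cheap for this kind of measure.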

Effective OLAP Exploration

• TopCells (ICDE’10): ranking aggregated cells (objects) in TextCube
• TEXplorer (CIKM’11): integrating keyword-based ranking and OLAP exploration
• Example (figure): HealthcareReform

EventCube Snapshot: Query Result

[Figure: screenshot of an EventCube query result]


MoveMine: Mining Moving Object Databases

• A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases,” SIGMOD’10 (system demo)

Mining Spatiotemporal and Mobility Data

[Figure: raw movement data (time series view of longitude and latitude against time in hours) and a density map with spots #1 to #4]

• Spot #1: Office
• Spot #2: Commuting city
• Spot #3: Home
• Spot #4: Vacation place

Mining Periodicity in Sparse Data [KDD’12]

• Example: the event has a period of 20
• Occurrences of the event happen between 20k+5 and 20k+10
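The example above can be made concrete with a small folding heuristic in Python. This is not the method of the KDD’12 paper; it only illustrates why the right period makes sparse observations line up: when timestamps are taken modulo the true period, they concentrate in a narrow band of phases. The observation times are hypothetical (each falls in some window 20k+5 to 20k+10), and `phase_spread` is a name made up for this sketch.

```python
def phase_spread(times, period):
    """Fraction of the period covered by the tightest circular window
    that contains every observation folded modulo `period`.

    Small values mean the observations concentrate in a narrow phase band.
    """
    phases = sorted({t % period for t in times})
    # Gaps between consecutive phases, including the wrap-around gap.
    gaps = [b - a for a, b in zip(phases, phases[1:])]
    gaps.append(phases[0] + period - phases[-1])
    return (period - max(gaps) + 1) / period


# Hypothetical sparse observations of an event with period 20.
OBSERVED = [5, 27, 48, 66, 110, 129, 150, 187, 206, 245]

for candidate in (7, 10, 20, 30):
    print(candidate, round(phase_spread(OBSERVED, candidate), 3))
```

Among these four candidates, period 20 gives the tightest band (phases 5 to 10). A heuristic like this needs a cap on candidate periods, since any period longer than the whole observation span folds nothing and can look deceptively compact.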

GeoTopic Discovery: Mining Spatial Text

• Geo-tagged photos w. landscape (coast vs. desert vs. mountain)
• Methods compared (figure): LDM, TDM, GeoFolk, LGTA
• Z. Yin, et al., GeoTopic Discovery and Comparison, WWW’11

LPTA: Latent Periodic Topic Analysis: Discovery of Temporal Patterns of Topics

• Periodic topic: repeating in regular intervals
• Background topic: covered uniformly over the entire period
• Bursty topic: a transient topic that is intensively covered only in a certain time period
• Time distribution of topics
• Integration of both text and time in analysis

Social Relationship Mining from Sensor Trace Data

• T-Motif: a time interval [S, T] such that
– many positive pairs meet at that time
– few negative pairs meet at that time
• Ex.: MIT Reality Mining dataset: 94 people tracked for 10 months
• Use only spatiotemporal info
• Algs. for efficient mining of T-motifs and effective classification

Mining RFID Data to Explore Trajectories

• Warehousing and mining RFID data
• Trajectory (figure): (Factory, T1, T2), (Shipping, T3, T4), (Warehouse, T5, T6), (Shelf, T7, T8), (Checkout, T9, T10)
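One way to read the trajectory above is as a list of stay records of the form (location, time in, time out). The Python sketch below shows how raw tag readings could be collapsed into such records; the reading format, the integer timestamps standing in for T1 to T10, and the name `to_stays` are assumptions made for illustration.

```python
from itertools import groupby


def to_stays(readings):
    """Collapse (location, timestamp) readings of one tag into
    (location, time_in, time_out) stay records, in time order."""
    ordered = sorted(readings, key=lambda reading: reading[1])
    stays = []
    # Consecutive readings at the same location form one stay.
    for location, group in groupby(ordered, key=lambda reading: reading[0]):
        times = [timestamp for _, timestamp in group]
        stays.append((location, times[0], times[-1]))
    return stays


# Hypothetical readings for one tag; timestamps 1..10 stand in for T1..T10.
READINGS = [
    ("Factory", 1), ("Factory", 2),
    ("Shipping", 3), ("Shipping", 4),
    ("Warehouse", 5), ("Warehouse", 6),
    ("Shelf", 7), ("Shelf", 8),
    ("Checkout", 9), ("Checkout", 10),
]

for stay in to_stays(READINGS):
    print(stay)
```

Storing stays instead of individual readings keeps one record per location visit, however many times the tag was read while it sat there.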


Conclusions

• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Lots to be done in this promising research frontier!