Construction and Mining of Heterogeneous Information Networks: Will It Be a Key to Web-Aged Information Management and Mining

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign

Collaborated with many students, especially Yizhou Sun, Ming Ji, Chi Wang, Ahmed El-Kishky, Bo Zhao

Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), ARO, Microsoft, IBM, Yahoo!, LinkedIn, Google & Boeing

July 3, 2014


Outline

Why Heterogeneous Information Networks?

On the Power of Mining Structured Heterogeneous Info. Networks

Clustering & Classification: A Ranking-Based Approach

Similarity Search & Meta Path in Heterogeneous Info. Net

Prediction & Recommendation: A Meta Path-Based Approach

From Unstructured Data to Structured Networks

From Unstructured Text to Structured Hierarchies

Type Extraction and Role Discovery: Enrich Network Structures

Truth Analysis: Enhance Quality of Structured Network

Looking Forward to the Future

Where There Is Information, There Are Networks!

Social Networking Websites
Biological Network: Protein Interaction
Research Collaboration Network
Product Recommendation Network via Emails

The Real World: Heterogeneous Networks

Multiple object types and/or multiple link types

DBLP Bibliographic Network: Venue, Paper, Author
The IMDB Movie Network: Actor, Movie, Director, Movie Studio
The Facebook Network

Homogeneous networks are information-lossy projections of heterogeneous networks!

Directly mining information-richer heterogeneous networks

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!

DBLP: A Computer Science bibliographic database

A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …

Knowledge hidden in DBLP Network | Mining Functions
How are CS research areas structured? | Clustering
Who are the leading researchers on Web search? | Ranking
What are the most essential terms, venues, authors in AI? | Classification + Ranking
Who are the peer researchers of Jure Leskovec? | Similarity Search
Whom will Christos Faloutsos collaborate with? | Relationship Prediction
Which types of relationships are most influential for an author to decide her topics? | Relation Strength Learning
How has the field of Data Mining emerged and evolved? | Network Evolution
Which authors are rather different from their peers in IR? | Outlier/anomaly detection

Power of het. network modeling: Treat Author, Venue, Term & Paper all first-class citizens!


Ranking-Based Clustering & Classification

Basic functions for networks: Clustering, classification & ranking

Why not integrate ranking with clustering & classification?
• High-ranked objects should be more important (thus more weighted) in a cluster/class than low-ranked ones
• Ranking will make more sense within a particular cluster/class: Einstein in physics vs. Turing in Computer Science

Integrate ranking with clustering/classification in the same process
• Ranking, as the feature, is conditional (i.e., relative) to a specific cluster/class
• Ranking and clustering/classification mutually enhance each other
• Ranking-based clustering: RankClus [EDBT’09], NetClus [KDD’09], PathSelClus [KDD’12]
• Ranking-based classification: GNetMine (for classification) [ECMLPKDD’10], RankClass [KDD’11]

Presentation Notes
Better clustering: rank distributions for clusters are more distinguishing from each other
Better ranking: better metric for objects is learned from the ranking

RankClus: Integrating Clustering with Ranking

An E-M framework for iterative enhancement
• Initialization: Randomly partition
• Repeat
  • Ranking: Ranking objects in each sub-network induced from each cluster
  • Generating new measure space: Estimate mixture model coefficients for each target object
  • Adjusting cluster
• Until stable

[Figure: Sub-Network, Ranking, Clustering loop]

Difference on ranking rules:
1. Simple ranking rule: Ranking high if more papers
2. Highly ranked if publishing more in highly ranked venues
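The loop above can be sketched on a toy bi-typed network. This is only an illustration under stated assumptions: the count matrix `W`, the function names, the fixed initial partition, and the argmax-based cluster adjustment are all hypothetical simplifications, and only the simple ranking rule is used. The published RankClus method is more elaborate.

```python
import numpy as np

def simple_rank(W, members):
    """Simple ranking rule: an author ranks high in a cluster if it has more
    papers in that cluster's venues. Returns a distribution over authors."""
    counts = W[members].sum(axis=0) + 1e-9  # tiny smoothing keeps every entry positive
    return counts / counts.sum()

def rankclus_sketch(W, init_labels, n_clusters, n_iter=20, em_steps=5):
    """W[v, a] = number of papers author a published in venue v (venues are the targets)."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(init_labels).copy()
    for _ in range(n_iter):
        # Ranking: one conditional rank distribution per cluster sub-network
        R = np.stack([simple_rank(W, labels == k) for k in range(n_clusters)])
        # New measure space: mixture coefficients Pi[v, k] for each venue, fitted by EM
        Pi = np.full((W.shape[0], n_clusters), 1.0 / n_clusters)
        for _ in range(em_steps):
            joint = Pi[:, :, None] * R[None, :, :]           # (venue, cluster, author)
            resp = joint / joint.sum(axis=1, keepdims=True)  # responsibility of each cluster
            Pi = (resp * W[:, None, :]).sum(axis=2)
            Pi /= Pi.sum(axis=1, keepdims=True)
        # Cluster adjustment (simplified assumption): move each venue to its dominant component
        new_labels = Pi.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    R = np.stack([simple_rank(W, labels == k) for k in range(n_clusters)])
    return labels, R

# Hypothetical toy data: venues 0-1 share authors 0-2, venues 2-3 share authors 3-5
W = [[5, 4, 3, 0, 0, 1],
     [4, 5, 2, 1, 0, 0],
     [0, 1, 0, 6, 4, 3],
     [1, 0, 0, 3, 5, 4]]
labels, R = rankclus_sketch(W, init_labels=[0, 0, 1, 0], n_clusters=2)
```

Starting from a deliberately wrong partition, the venue that was misplaced should move to the cluster whose rank distribution explains its authors better, after which the partition stops changing.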


Step-by-Step Running of RankClus

Initially, ranking distributions are mixed together

Two clusters of objects mixed together, but preserve similarity somehow

Improved a little

Two clusters are almost well separated

Improved significantly

Stable

Well separated

Clustering and ranking two fields: DB/DM & HW/CA


NetClus: Ranking-Based Clustering with Star Network Schema

Beyond bi-typed network: capture more semantics with multiple types
Split a network into multi-subnets, each a net-cluster [KDD’09]

[Figure: star network schema with Research Paper (P) at the center, linked to Term (T) via Contain, Author (A) via Write, and Venue (V) via Publish; NetClus splits the Computer Science network into net-clusters such as Database, Hardware, and Theory]

DBLP network: using terms, venues & authors to jointly distinguish a sub-field, e.g., database

NetClus: Database System Cluster

Term
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707

Venue
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849

Author
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185

One-level deeper: Authors in XML, XQuery cluster

Rank-Based Clustering: Works in Multiple Domains

RankCompete: Organize your photo album automatically!
Use multi-typed image features to build up heterog. networks

MedRank: Rank treatments for AIDS from Medline
Explore multiple types in a star schema network


Classification: Knowledge Propagation Across Heterogeneous Typed Networks

Labeled knowledge propagates through multi-typed objects across heterogeneous networks [ECMLPKDD'10, KDD’11]

RankClass: Ranking-Based Classification

Class: a group of multi-typed nodes sharing the same topic + within-class ranking

[Figure: The RankClass Framework (nodes and classes)]

Methodology of RankClass

Classification in heterogeneous networks
• Knowledge propagation: Class label knowledge propagated across multi-typed objects through a heterogeneous network
• GNetMine [Ji et al., PKDD’10]: Objects are treated equally
• RankClass [Ji et al., KDD’11]: Ranking-based classification
  • Highly ranked objects will play a greater role in classification
  • An object can only be ranked high in some focused classes
  • Class membership and ranking are statistical distributions
  • Let ranking and classification mutually enhance each other!
  • Output: Classification results + ranking list of objects within each class
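Knowledge propagation across typed objects can be sketched as label propagation over normalized relation matrices. The code below is a simplified illustration in the spirit of GNetMine (all objects treated equally, no ranking); the update rule, the weights `lam` and `alpha`, the function names, and the toy network are assumptions made for this sketch, not the published formulation.

```python
import numpy as np

def normalize(R):
    """Symmetric degree normalization of a relation matrix between two object types."""
    d_row = np.maximum(R.sum(axis=1, keepdims=True), 1e-12)
    d_col = np.maximum(R.sum(axis=0, keepdims=True), 1e-12)
    return R / np.sqrt(d_row * d_col)

def propagate(relations, sizes, seeds, n_classes, lam=1.0, alpha=1.0, n_iter=50):
    """relations: {(type_i, type_j): matrix}; seeds: {(type, index): class label}."""
    S = {}
    for (ti, tj), R in relations.items():
        Sij = normalize(np.asarray(R, dtype=float))
        S[(ti, tj)] = Sij
        S[(tj, ti)] = Sij.T  # links are used in both directions
    Y = {t: np.zeros((n, n_classes)) for t, n in sizes.items()}
    for (t, idx), c in seeds.items():
        Y[t][idx, c] = 1.0
    F = {t: Y[t].copy() for t in sizes}
    for _ in range(n_iter):
        new_F = {}
        for t in sizes:
            acc, weight = alpha * Y[t], alpha
            for (ti, tj), Sij in S.items():
                if ti == t:  # pull class scores from every neighboring type
                    acc = acc + lam * Sij @ F[tj]
                    weight += lam
            new_F[t] = acc / weight
        F = new_F
    return F

# Hypothetical toy network: papers 0-1 and papers 2-3 form two separate groups
PA = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]  # paper-author
PV = [[1, 0], [1, 0], [0, 1], [0, 1]]              # paper-venue
F = propagate({("paper", "author"): PA, ("paper", "venue"): PV},
              sizes={"paper": 4, "author": 3, "venue": 2},
              seeds={("paper", 0): 0, ("paper", 3): 1},
              n_classes=2)
```

With only two labeled papers, the unlabeled papers, authors, and venues pick up the class of the labeled paper they are connected to, which is the behavior the slide describes as labels propagating through multi-typed objects.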


Experiments on DBLP

Class: Four research areas (communities)
• Database, data mining, AI, information retrieval

Four types of objects
• Paper (14376), Conf. (20), Author (14475), Term (8920)

Three types of relations
• Paper-conf., paper-author, paper-term

Algorithms for comparison
• Learning with Local and Global Consistency (LLGC) [Zhou et al. NIPS 2003], also the homogeneous version of our method
• Weighted-vote Relational Neighbor classifier (wvRN) [Macskassy et al. JMLR 2007]
• Network-only Link-based Classification (nLB) [Lu et al. ICML 2003, Macskassy et al. JMLR 2007]


Comparing Classification Accuracy on the DBLP Data

Presentation Notes
and the within-class ranking can be accurately performed mainly within the sub-network instead of the global network.

Object Ranking in Each Class: Experiment

• DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
• Rank objects within each class (with extremely limited label information)
• Obtain high classification accuracy and excellent ranking within each class

Top-5 ranked conferences
Database | Data Mining | AI | IR
VLDB | KDD | IJCAI | SIGIR
SIGMOD | SDM | AAAI | ECIR
ICDE | ICDM | ICML | CIKM
PODS | PKDD | CVPR | WWW
EDBT | PAKDD | ECML | WSDM

Top-5 ranked terms
Database | Data Mining | AI | IR
data | mining | learning | retrieval
database | data | knowledge | information
query | clustering | reasoning | web
system | classification | logic | search
xml | frequent | cognition | text


Similarity Search: Find Similar Objects in Networks

Who are most similar to Christos Faloutsos?

Meta-Path: Meta-level description of a path between two objects

[Figure: Schema of the DBLP Network]

• Meta-Path: Author-Paper-Author (APA): Christos’s students or close collaborators
• Meta-Path: Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues

Different meta-paths lead to very different results!

Different meta-paths carry rather different semantics

PathSim: A Het. Network Similarity Measure

Popular object similarity measures in networks
• Random walk (RW) (or Personalized PageRank): Favors highly visible objects (i.e., objects with large degrees)
• Pairwise random walk (PRW) (or SimRank): Favors “pure” objects (i.e., objects with highly skewed distribution in their in-links or out-links)
• PathSim [VLDB’11]: Favors “peers”: objects with strong connectivity and similar visibility under the given meta-path

Note: P-PageRank and SimRank do not distinguish object type or relationship type
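The "peer" intuition can be made concrete with a small computation. For a symmetric meta-path, PathSim is commonly written as s(x, y) = 2 · |paths from x to y| / (|paths from x to x| + |paths from y to y|), where the path counts come from the commuting matrix of the meta-path. The sketch below assumes that form; the author names and venue counts are illustrative toy data, and `pathsim` is a hypothetical helper name.

```python
import numpy as np

def pathsim(M):
    """PathSim from a commuting matrix M, where M[x, y] counts the path
    instances between x and y that follow a symmetric meta-path."""
    diag = np.diag(M)
    denom = diag[:, None] + diag[None, :]
    return np.divide(2.0 * M, denom, out=np.zeros_like(M, dtype=float), where=denom > 0)

# Toy author-by-venue paper counts (rows: authors, columns: venues)
authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
W_AV = np.array([[2, 1, 0, 0],
                 [50, 20, 0, 0],
                 [2, 0, 1, 0],
                 [2, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)

M = W_AV @ W_AV.T   # commuting matrix for the venue-based meta-path between authors
S = pathsim(M)
```

In this toy data, Jim publishes in the same venues as Mike but far more often, so his score with Mike stays low, while Bob, whose counts match Mike's exactly, gets the maximum score. A degree-favoring measure would rank Jim first instead.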

Which Similarity Measure Is Better?

Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)

Anhai Doan
• CS, Wisconsin
• Database area
• PhD: 2002

Jignesh Patel
• CS, Wisconsin
• Database area
• PhD: 1998

Amol Deshpande
• CS, Maryland
• Database area
• PhD: 2004

Jun Yang
• CS, Duke
• Database area
• PhD: 2001

Presentation Notes
May replace with Jure Leskovec


PathPredict: Meta-Path Based Relationship Prediction

Meta path-guided prediction of links and relationships

Philosophy: Meta path relationships among similar typed links share similar semantics and are comparable and inferable

[Figure: bibliographic network schema with object types paper, topic, venue, and author, and relations publish/publish-1, mention/mention-1, write/write-1, contain/contain-1, cite/cite-1]

Bibliographic network: Co-author prediction (A-P-A)
• Use topological features encoded by meta paths, e.g., citation relations between authors (A-P→P-A)

Meta-Path Based Co-authorship Prediction

Co-authorship prediction problem
• Whether two authors are going to collaborate for the first time

Co-authorship encoded in meta-path
• Author-Paper-Author

Topological features encoded in meta-paths
• Meta-paths between authors under length 4

[Table: Meta-Path | Semantic Meaning]
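One simple topological feature per meta-path is the path count, which can be read off products of adjacency matrices. The sketch below is illustrative only: the tiny network, the matrix names, and the `features` helper are assumptions, and the slide's full feature set and the classifier are not reproduced.

```python
import numpy as np

# Hypothetical toy bibliographic network: 3 authors, 3 papers, 2 venues
A_AP = np.array([[1, 1, 0],    # author 0 wrote papers 0 and 1
                 [0, 1, 0],    # author 1 wrote paper 1
                 [0, 0, 1]])   # author 2 wrote paper 2
A_PV = np.array([[1, 0],       # paper 0 appeared in venue 0
                 [0, 1],       # paper 1 appeared in venue 1
                 [1, 0]])      # paper 2 appeared in venue 0
C_PP = np.zeros((3, 3), dtype=int)
C_PP[2, 0] = 1                 # paper 2 cites paper 0

A_AV = A_AP @ A_PV
path_counts = {
    "A-P-A": A_AP @ A_AP.T,            # shared papers (existing co-authorship)
    "A-P-V-P-A": A_AV @ A_AV.T,        # papers published in the same venues
    "A-P->P-A": A_AP @ C_PP @ A_AP.T,  # first author cites the second author
}

def features(a, b):
    """Path-count feature vector for an author pair, one entry per meta-path."""
    return {name: int(M[a, b]) for name, M in path_counts.items()}

pair_02 = features(0, 2)
```

Authors 0 and 2 have never co-authored, yet they share a venue, and author 2 cites author 0. Features like these could then feed a classifier such as logistic regression to score candidate first-time collaborations.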

The Power of PathPredict: Experiments

• Explain the prediction power of each meta-path: Wald Test for logistic regression
• Higher prediction accuracy than using projected homogeneous network: 11% higher in prediction accuracy

Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Feature collected in [1996, 2002]; Test period in [2003, 2009])

Enhancing the Power of Recommender Systems by Heterog. Info. Network Analysis

Personalized recommendation with heterog. networks [WSDM’14]

[Figure: movie network linking Avatar, Titanic, Aliens, and Revolutionary Road to James Cameron, Kate Winslet, Leonardo DiCaprio, and Zoe Saldana, and to the genres Adventure and Romance]

Collaborative filtering methods suffer from the data sparsity issue
• A small set of users & items have a large number of ratings
• Most users and items have a small number of ratings
[Figure: # of ratings vs. # of users or items]

Heterogeneous relationships complement each other
• Users and items with limited feedback can be connected to the network by different types of paths
• Connect new users or items in the information network

Different users may require different models: Relationship heterogeneity makes personalized recommendation models easier to define

Performance Study: Personalized Recommendation in Heterogeneous Networks

[Table: Datasets]

Methods to be compared:
• Popularity: Recommend the most popular items to users
• Co-click: Conditional probabilities between items
• NMF: Non-negative matrix factorization on user feedback
• Hybrid-SVM: Use Rank-SVM with plain features (utilize both user feedback and information network)

Winner: HeteRec personalized recommendation (HeteRec-p)


Automated Construction of Heterogeneous Info. Networks

Rich semantics can be mined from structured heterog. networks

Unfortunately, much of the real world data is unstructured

News, Wikipedia, blogs, tweets, multimedia data, …

Too expensive to be structured by humans: automated & scalable methods are needed

How to turn unstructured data into structured heterog. networks?

From Unstructured Text to Structured Hierarchies

Type Extraction and Role Discovery: Enrich Network Structures

Truth Analysis: Enhance Quality of Structured Network

Methodology: Incrementally & progressively explore links and structures to construct quality heterogeneous networks


From Unstructured Text to Structured Hierarchies

Our recent effort:
• KERT [Danilevsky et al., SDM’14]: LDA + keyphrase extraction and ranking by topic (4 criteria: coverage, purity, phraseness, completeness)
• CATHY [Wang et al., KDD’13]: Recursive generation of topic hierarchy based on term co-occurrence network
• CATHYHIN [Wang et al., ICDM’13]: Use hetero. network to help topic hierarchy generation
• ToPMine [El-Kishky et al., 2014]: Exploring frequent contiguous pattern mining

[Figure: Input collection (text, entity) → 1. Hierarchical topic discovery with entities → 2. Phrase mining → 3. Rank phrases & entities per topic → Output hierarchy w/ phrases & entities (topics o, o/1, o/1/1, o/1/2, o/2, o/2/1)]

Hierarchy so generated can be used for network mining, exploration, etc.

Automatic Discovery of Clusters and Term Hierarchies from Collection of Documents

KERT: Keyphrase Extraction and Ranking by Topic [SDM’14]
1. Use background LDA (bLDA) to cluster unigrams into T topics + one background topic
2. Extract candidate keyphrases for each foreground topic according to the word topic assignment; candidates should be order-free and non-consecutive
   E.g., ‘mining frequent patterns’ vs. ‘frequent pattern mining’
   E.g., ‘mining top-k frequent closed patterns’
3. Rank the keyphrases in each foreground topic
   • Coverage: ‘information retrieval’ vs. ‘cross-language information retrieval’
   • Purity: only frequent in documents about topic t
   • Phraseness: ‘active learning’ vs. ‘learning classification’
   • Completeness: ‘vector machine’ vs. ‘support vector machine’

Experiment: Machine Learning: Top Keyphrases

• Dataset: DBLP paper titles
• KERT variations: ignore each criterion in turn
• Baselines: two variations of an approach from a recent work


CATHYHIN: Topic Hierarchy Construction by Integration of Heterogeneous Info. Networks

Using DBLP heterog. Info. network to enhance topical hierarchy generation

Hierarchies generated not only on topical phrases but also on authors & venues

Presentation Notes
I’m not getting into this too much right now. The generative model is altered to account for the different types of nodes and links. Mining and ranking are nearly identical. There is also a component for learning the link type weights, which slightly improves results.

ToPMine: Phrase Mining Then Topic Modeling

Frequent Contiguous Phrase Mining
• Frequency: A low-frequency phrase is likely unimportant (a generalization of the most likely unigrams in LDA)
• Collocation: The co-occurrence of tokens at a frequency that is significantly higher than expected due to chance
• Completeness: Do not double count sub-phrases (e.g., “support vector” is not a phrase when inside “support vector machine”)

General frequent pattern mining may place words far apart in the same phrase

The pattern for a phrase: each phrase is a contiguous sequence of ordered words

Example: Markov Blanket Feature Selection for Support Vector Machines
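The frequency criterion over contiguous word sequences can be sketched with a brute-force n-gram counter. This is a deliberately naive illustration: the function name, the `min_support` and `max_len` parameters, and the toy documents are assumptions, and the pruning that makes the real miner scalable is omitted.

```python
from collections import Counter

def frequent_contiguous_phrases(docs, min_support=2, max_len=3):
    """Count every contiguous word sequence of up to max_len tokens and
    keep those that occur at least min_support times in the collection."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_support}

# Hypothetical toy collection
docs = ["support vector machine training",
        "a support vector machine",
        "vector space model"]
frequent = frequent_contiguous_phrases(docs)
```

Note that "support vector" and "support vector machine" both pass the frequency threshold here. Deciding that the shorter one should not be counted separately is the job of the completeness criterion, which this counter does not address.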


Good Phrases Should Be Stat. Significant


Under the assumption that the text data was generated by a sequence of Bernoulli trials, we can derive a significance measure for potential mergings.

Good Phrases!
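A significance measure for a candidate merge can be sketched as a standardized difference between observed and expected counts. The form below, (observed − expected) / sqrt(observed), with the expected count taken from independent occurrences of the two parts, is an assumption about the measure this slide refers to; the function name, argument names, and example counts are hypothetical.

```python
import math

def merge_significance(f_merged, f_left, f_right, total_tokens):
    """Score how far the observed count of 'left right' exceeds what
    independent occurrences of the two parts would produce by chance."""
    if f_merged == 0:
        return float("-inf")
    expected = f_left * f_right / total_tokens  # mean count under independence
    return (f_merged - expected) / math.sqrt(f_merged)

# Hypothetical counts in a 100,000-token corpus
strong = merge_significance(f_merged=100, f_left=200, f_right=150, total_tokens=100_000)
chance = merge_significance(f_merged=3, f_left=1000, f_right=300, total_tokens=100_000)
```

A pair that co-occurs far more often than chance gets a high score and is a good candidate to merge into one phrase, while a pair whose co-occurrence matches the chance level scores about zero.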


Bag-of-Phrase Construction: An Example


The ToPMine Framework

1. Preprocessing: Perform stemming to reduce vocabulary size and increase collisions of phrases; split on phrase-invariant punctuation

2. Frequent Contiguous Pattern Mining: Collect aggregate counts and candidate phrases

3. Phrase Construction: Agglomerative construction of significant phrases (addresses frequency, collocation, and completeness)

4. Use the phrase construction to form a “bag of phrases”: This can be considered an enriched input for later algorithms (not exclusive to topic modeling)

5. Bag-of-phrases Topic Modeling

Presentation Notes
Document-based co-occurrence (top) and adjacency-based co-occurrence (bottom). Document-based co-occurrence yields better quality phrases, especially at lower levels.

Experiments on AP News


Experiments on Yelp Reviews


Scalability: Runtime Results


Role Discovery: Mining Advisor-Advisee Relationships in DBLP Network

Propagation of simple, commonly accepted constraints in Time-Constrained Probabilistic Factor Graph (TPFG)
• “Advisor has more publications and longer history than advisee at the time of advising”
• “Once an advisee becomes advisor, s/he will not become advisee again”

[Figure: Input: temporal collaboration network among Ada, Bob, Jerry, Ying, and Smith (1999 to 2004). Output: relationship analysis, with candidate advisor links labeled by (score, [start year, end year]): (0.9, [/, 1998]), (0.8, [1999, 2000]), (0.7, [2000, 2001]), (0.65, [2002, 2004]), (0.5, [/, 2000]), (0.49, [/, 1999]), (0.4, [/, 1998]), (0.2, [2001, 2003]). Visualized chronological hierarchies]

Role Discovery: Performance & Case Study

DBLP data: 654,628 authors, 1,076,946 publications, years provided
Labeled data: Math Genealogy Project; AI Genealogy Project; Homepage

Datasets | RULE (heuristics) | SVM (supervised learning) | IndMAX (empirical parameter) | IndMAX (optimized parameter) | TPFG (empirical parameter) | TPFG (optimized parameter)
TEST1 | 69.9% | 73.4% | 75.2% | 78.9% | 80.2% | 84.4%
TEST2 | 69.8% | 74.6% | 74.6% | 79.0% | 81.5% | 84.3%
TEST3 | 80.6% | 86.7% | 83.1% | 90.9% | 88.8% | 91.3%

Case study
Advisee | Top Ranked Advisor | Time | Note
David M. Blei | 1. Michael I. Jordan | 01-03 | PhD advisor, 2004 grad
 | 2. John D. Lafferty | 05-06 | Postdoc, 2006
Hong Cheng | 1. Qiang Yang | 02-03 | MS advisor, 2003
 | 2. Jiawei Han | 04-08 | PhD advisor, 2008
Sergey Brin | 1. Rajeev Motwani | 97-98 | “Unofficial advisor”

Presentation Notes
Testing data: 3 sets.
• Manually copied from advisor’s homepage
• Manually labeled by Jingyi Guo
• Crawled from mathematics genealogy project by Yuntao Jia
TEST1 = Teacher(257) + MathGP(1909) + Colleague(2166)
TEST2 = PHD(100) + MathGP(1909) + Colleague(4351)
TEST3 = AIGP(666) + Colleague(459)
How to do prediction according to ranking? P@(k, θ): Top k potential advisors, with r > θ

Hierarchical Relationship Discovery

From partially ordered objects to hierarchy (tree)
• Based on NLP or other techniques to extract partially ordered objects
• Using constraints to discover relationships

Singleton Potential
Type | Cognitive description
Homophile | Parent and child are similar
Polarity | Parent is superior to child
Support pattern | Patterns frequently occurring with child-parent pairs
Forbidden pattern | Patterns rarely occurring with child-parent pairs

Pairwise Potential Function: Cases
Type | Cognitive description
Attribute augment | Use inherited attributes from parents or children
Label propagate | Similar nodes share similar parents (or children)
Reciprocity | Patterns altering in child-parent & parent-child pairs
Constraints | Restrict certain patterns

[Both tables also have a “Potential definition” column whose formulas were not recovered]

Discovery of the Kenny Family Tree


Truth Analysis: Enhancing the Quality of Heterogeneous Info. Networks

Info. networks could be untrustworthy, error-prone, missing, …

TruthFinder [KDD’07]: Inference on trustworthiness by constructing heterogeneous info. networks

[Figure: network linking Web sites w1 to w4 (Book Sellers), Facts f1 to f4 (Book Authors), and Objects o1, o2 (Books)]

True facts and trustable websites mutually enhance each other and will become apparent after some iterations
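
The mutual reinforcement between facts and web sites can be sketched as a small fixed-point iteration. This is a simplified illustration, not the full TruthFinder algorithm: it assumes a site's trustworthiness is the average confidence of the facts it provides, and a fact's confidence is one minus the probability that all of its providers are wrong. The site-to-fact links below are hypothetical and are not read off the slide's figure.

```python
from math import prod


def truth_iteration(site_facts, rounds=10, init_trust=0.5):
    """Alternate between fact confidence and site trustworthiness.

    site_facts: dict mapping web site -> list of facts it provides (assumed input).
    Returns (trust, conf) dictionaries after the given number of rounds.
    """
    facts = {f for fs in site_facts.values() for f in fs}
    trust = {w: init_trust for w in site_facts}
    conf = {}
    for _ in range(rounds):
        # A fact is wrong only if every site providing it is wrong.
        for f in facts:
            providers = [w for w, fs in site_facts.items() if f in fs]
            conf[f] = 1.0 - prod(1.0 - trust[w] for w in providers)
        # A site is as trustworthy as the facts it provides, on average.
        for w, fs in site_facts.items():
            trust[w] = sum(conf[f] for f in fs) / len(fs)
    return trust, conf


# Hypothetical links: f1 and f2 are competing facts about object o1,
# f3 and f4 are competing facts about object o2.
site_facts = {
    "w1": ["f1", "f3"],
    "w2": ["f1"],
    "w3": ["f1"],
    "w4": ["f2", "f4"],
}
trust, conf = truth_iteration(site_facts)
```

In this toy network f3 and f4 each have a single provider, yet f3 ends up with higher confidence because its provider w1 agrees with other sites on o1, which is the "mutually enhance" effect the slide describes.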

Truth Discovery: Multiple Truth Values and Handling False Negatives

• Voting may not always work well: Some sources tend to miss true values (False Negatives), while some others tend to produce false claims (False Positives)

• Why Latent Truth Model (LTM)? Modeling two-sided quality to support multiple true values per entity for truth finding [VLDB’12]

Generating Implicit Negative Claims:

[Figure: claims about Harry Potter from three sources: IMDB (High Precision, High Recall), Netflix (High Precision, Low Recall), BadSource (Low Precision, Low Recall). Legend: Positive Claim, Negative Claim, Correct Claim, Incorrect Claim]

Presenter
Presentation Notes
Generate negative claims: If a source A claims some attributes of an entity to be true, we call those claims positive claims. Meanwhile, for attributes of the same entity that are claimed true by some other sources but not by source A, we generate negative claims from source A to those attributes, indicating that A is (implicitly) claiming them to be false.

Importance of negative claims: In the example, actress Emma and actor Johnny each get one positive claim. Although we can predict that the precision of IMDB is higher than that of BadSource, this may not be adequate to distinguish the two. However, if we consider negative claims and know that IMDB has very high recall, which means its negative claims are very accurate, we can easily predict that Johnny is false. Moreover, if we know that Netflix and BadSource have fairly low recall, we will not punish Emma for the negative claims from those two sources. In this way, true attributes can be predicted more accurately.
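
The negative-claim generation rule described in these notes can be sketched as follows. This is a minimal illustration for a single entity: the function name `generate_claims` is hypothetical, and the cast values per source are invented to loosely mirror the Emma/Johnny example rather than copied from the slide.

```python
def generate_claims(positive):
    """Expand positive claims into explicit positive and implicit negative claims.

    positive: dict mapping source -> set of attribute values it claims true
              for ONE entity (only sources that mention the entity are listed).
    Returns a list of (source, value, is_positive) triples.
    """
    # Every value claimed true by at least one source for this entity.
    all_values = set().union(*positive.values())
    claims = []
    for source in sorted(positive):
        for value in sorted(all_values):
            # A value asserted by others but not by this source
            # becomes an implicit negative claim from this source.
            claims.append((source, value, value in positive[source]))
    return claims


# Hypothetical positive claims about the cast of one movie.
positive = {
    "IMDB": {"Daniel", "Emma"},
    "Netflix": {"Daniel"},
    "BadSource": {"Daniel", "Johnny"},
}
claims = generate_claims(positive)
negatives = [(s, v) for s, v, is_pos in claims if not is_pos]
```

Here Emma receives negative claims from Netflix and BadSource, and Johnny receives them from IMDB and Netflix; how much each negative claim counts then depends on the recall of the source that issued it.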

Truth Discovery: Effectiveness of Latent Truth Model

Experimental datasets: Large and real
Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)

Effectiveness of Latent Truth Model:

Model source quality in other data integration tasks, e.g. entity resolution
Trustworthiness in multi-genre networks (text-rich networks, social networks, etc.)

Presenter
Presentation Notes
Effectiveness: Our methods LTM and LTMinc (the incremental version) achieve the highest accuracy and F1 score when the threshold is 0.5. Some methods, such as Voting and 3-Estimate, are too conservative (high precision but low recall), since many true attributes do not have many positive claims. Other methods are too optimistic, such as TruthFinder, Investment, and LTMpos (our model considering only positive claims). Since these methods do not consider negative claims when they propagate probabilities, the resulting truth probability is much higher than it should be. If we vary the cutoff threshold, our method is consistently better, which indicates that LTM gives true facts very high scores and false facts very low scores.

Efficiency: Batch-mode LTM is slightly slower than the other methods due to random sampling, but the incremental model LTMinc is as fast as simple voting without sacrificing effectiveness. The running time is linear in the number of claims in the dataset.
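
The threshold-based evaluation mentioned in these notes can be sketched as below. This is an illustrative helper only: it assumes each fact has an inferred truth probability and a ground-truth label, and the function name `evaluate` and the four sample facts are hypothetical, not results from the talk.

```python
def evaluate(truth_prob, labels, threshold=0.5):
    """Compute precision, recall, F1, and accuracy at a cutoff threshold.

    truth_prob: dict mapping fact -> inferred probability of being true.
    labels:     dict mapping fact -> True/False ground truth (labeled subset).
    """
    tp = fp = fn = tn = 0
    for fact, is_true in labels.items():
        predicted_true = truth_prob[fact] >= threshold
        if predicted_true and is_true:
            tp += 1
        elif predicted_true and not is_true:
            fp += 1
        elif not predicted_true and is_true:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical probabilities and labels for four facts.
truth_prob = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}
labels = {"a": True, "b": False, "c": True, "d": False}
metrics = evaluate(truth_prob, labels, threshold=0.5)
```

A conservative method in the sense of the notes would show high precision and low recall under this measure, while an overly optimistic one would push most probabilities above the cutoff and lose precision.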

On-Going and Future Work

From small, structured data to real, big, unstructured data

From the DBLP network to networks constructed from news, PubMed, tweets, …

Construction of quality, semi-structured heterogeneous networks from massive, real data sets

Integrating information networks with data/text cube and OLAP technologies

Data/text cubes have shown their power in multi-dimensional OLAP analysis, e.g., EventCube

Integration of heterogeneous network mining ⇒ OLAP mining on multi-dimensional social and information networks

Network exploration of mission- or query-based hidden networks

From Data Mining to Mining Info. Networks

Han, Kamber and Pei, Data Mining, 3rd ed. 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012

Conclusions

Heterogeneous information networks are ubiquitous

Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks

Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …

Surprisingly rich knowledge can be mined from structured heterogeneous info. networks

Clustering, ranking, classification, path prediction, …

Knowledge is power, but knowledge is hidden in massive, but “relatively structured” nodes and links!

Key issue: Construction of trusted, semi-structured heterogeneous networks from unstructured data

From data to knowledge: Much more to be explored but heterogeneous network mining has shown high promise!

References
M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
F. Tao, et al., "EventCube: Multi-Dimensional Search and Mining of Structured and Text Data" (system demo), KDD'13
C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10
C. Wang, M. Danilevsky, et al., "A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy", KDD'13
C. Wang, M. Danilevsky, et al., "Constructing Topical Hierarchies in Heterogeneous Information Networks", ICDM'13