
Probabilistic Logic Learning

Ramya Ramakrishnan

Link Mining

Link Mining - Overview

• Introduction

• Background

• Tasks

• Authorities and Hubs

• Challenges

• Prob. Model of Group Membership and Link Generation

• Summary

• Reference

Introduction

• Traditional data mining: a random sample of homogeneous objects from a single relation.

• Big challenge of data mining: tackling the problem of mining heterogeneous datasets that are multi-relational.

• The domain consists of a variety of object types, which are linked by:

- explicit links
- constructed links

Introduction - Contd...

• Naively applying statistical inference procedures that assume independent instances leads to inappropriate conclusions.

• Potential correlations due to links must be handled.

• Record linkage should be exploited, i.e. it is information that can be used to improve the predictive accuracy of the learned models.

Link Mining

• A newly emerging research area at the intersection of work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining.

• An instance of multi-relational data mining.

• Encompasses tasks such as descriptive and predictive modelling.

• Examples are:

- predicting the type of link between two objects
- inferring the existence of a link

Linked Data

• Represented as a graph or network

• Nodes are objects
  - May have different kinds of objects
  - Objects have attributes
  - Objects may have labels or classes

• Edges are links
  - May have different kinds of links
  - Links may have attributes
  - Links may be directed and are not required to be binary
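The representation above can be sketched as a small typed graph. This is only an illustration: the class names `Node` and `Link`, the helper `links_of`, and the toy data are assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An object in the linked data set."""
    node_id: str
    kind: str                          # object type, e.g. "paper" or "author"
    attributes: dict = field(default_factory=dict)
    label: Optional[str] = None        # optional class label


@dataclass
class Link:
    """A typed link; `members` may hold more than two node ids."""
    kind: str                          # link type, e.g. "citation"
    members: tuple
    directed: bool = False
    attributes: dict = field(default_factory=dict)


# Hypothetical toy data: two papers and one author.
nodes = {
    "P1": Node("P1", "paper", {"year": 2001}, label="theory"),
    "P2": Node("P2", "paper", {"year": 2003}),
    "A1": Node("A1", "author", {"name": "A. Author"}),
}
links = [
    Link("citation", ("P2", "P1"), directed=True),   # P2 cites P1
    Link("author-of", ("A1", "P1")),
    Link("author-of", ("A1", "P2")),
]


def links_of(node_id, kind=None):
    """Return the links that involve `node_id`, optionally of one type."""
    return [l for l in links
            if node_id in l.members and (kind is None or l.kind == kind)]


print([l.kind for l in links_of("P1")])
```

Storing `members` as a tuple rather than a fixed pair is one way to allow links that are not binary.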

Example: Linked Bibliographic Data

[Figure: graph of linked bibliographic data]

Objects:
- Papers (P1, P2, P3, P4)
- Authors (A1)
- Institutions (I1)

Links:
- Citation
- Co-Citation
- Author-of
- Author-affiliation

Background

• To improve information retrieval results.

• PageRank measure and hubs & authority scores.

• Hypertext and web page classification.

• Combines techniques of ILP with statistical learning algorithms.

• Identifies certain types of hypertext regularities.

• Identification of communities or groups based on link structure.

• Social and collaborative filtering.

• Probabilistic models for linked data.

Link Mining Tasks

1. Link Based Classification

2. Link Based Cluster Analysis

3. Identifying Link Type

4. Predicting Link Strength

5. Link Cardinality

6. Record Linkage

Domains used are web page collections (web), the bibliographic domain (bib) and epidemiological studies (epi).

Link-based Object Classification

• Predicting the category of an object based on its attributes and its links and the attributes of linked objects

web: predict the category of a web page

bib: predict the topic of a paper

epi: predict disease type based on characteristics of the people

Link Type

• Predicting the type or purpose of a link

web: predict whether a link is an advertising link or a navigational link

bib: predict whether a co-author is also an advisor; predict an advisor-advisee relationship

epi: predict whether a contact is familial, a co-worker or an acquaintance

Predicting Link Existence

• Predicting whether a link exists between two objects

web: predict whether there will be a link between two pages

bib: predict whether a paper will cite another paper

epi: predict who a patient's contacts are

Link Cardinality Estimation I

• Predicting the number of links to an object

web: predict the authoritativeness of a page based on the number of in-links; identify hubs based on the number of out-links

bib: predict the impact of a paper based on the number of citations

epi: predict the infectiousness of a disease based on the number of people diagnosed

Link Cardinality Estimation II

• Predicting the number of objects reached along a path from an object

• Important for estimating the number of objects that will be returned by a query

web: predict the number of pages retrieved by crawling a site

bib: predict the number of citations of a particular author in a specific journal

epi: predict the number of elderly contacts for a particular patient

Object Identity

• Predicting when two objects are the same, based on their attributes and their links

• e.g. record linkage, duplicate elimination

web: predict when two sites are mirrors of each other

bib: predict when two citations refer to the same paper

epi: predict when two disease strains are the same

Authorities and Hubs

[Figure: hubs and authorities]

• Authoritative pages are pages that have a large in-degree.

• Hub pages are pages that have links to multiple relevant authoritative pages.

• A good hub is a page that points to many good authorities.

• A good authority is a page that is pointed to by many good hubs.

• Therefore hubs and authorities exhibit a mutually reinforcing relationship.
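The mutually reinforcing relationship can be sketched as an iterative score update in the spirit of Kleinberg's hubs-and-authorities computation. This is a minimal sketch under assumptions: the function name `hits`, the fixed iteration count, and the toy link structure are illustrative and not taken from the slides.

```python
import math


def hits(out_links, iterations=50):
    """Compute hub and authority scores by repeated mutual reinforcement.

    out_links maps each page to the list of pages it points to.
    """
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy web: h1 and h2 both point to a1; h1 also points to a2.
toy = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(toy)
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

In this toy graph a1 ends up with a higher authority score than a2 because both hubs point to it, and h1 ends up as the stronger hub because it points to both authorities.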

Related work

• For defining notions of standing, impact and influence

• Standing: in the study of social networks

• Impact: in bibliometrics; also used in hypertext and WWW rankings

• Influence: in bibliometrics

Challenges

1. Logical vs. Statistical Dependences

2. Feature Construction

3. Collective Classification

4. Effective Use of Unlabeled Data

5. Link Prediction

6. Object Identity

7. Statistical Challenges to Inductive Inference in Linked Data

• These challenges are common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees and Stochastic Logic Programming, to name a few)

Logical vs. Statistical Dependences

• A challenge in link mining and multi-relational data mining is coherently handling two different types of dependence structures:

- link structure: logical relationships between objects
- probabilistic dependency: statistical relationships between attributes of objects

• Probabilistic dependence is limited to objects that are logically related.

• In learning statistical models for multi-relational data, we must not only search over probabilistic dependencies, but also search over the different possible logical relationships between objects.

Feature Construction

• The attributes of an object provide a basic description of the object.

• Traditional classification algorithms were based on these types of object features.

• In a link-based approach, it may also make sense to use attributes of linked objects. Further, if the links themselves have attributes, these may also be used.

• This is the idea behind propositionalisation.

• A main issue is how to deal with relationships that are not one-to-one; it may be appropriate to compute aggregate features over the set of related objects.
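Aggregate features over a one-to-many relationship can be sketched as follows. The data, the function name `aggregate_features`, and the choice of count, mode and mean as aggregates are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: each paper's topic and year, plus citation links.
papers = {
    "P1": {"topic": "logic", "year": 1999},
    "P2": {"topic": "logic", "year": 2001},
    "P3": {"topic": "vision", "year": 2002},
    "P4": {"topic": "logic", "year": 2003},
}
cites = {"P4": ["P1", "P2", "P3"], "P3": ["P1"], "P2": [], "P1": []}


def aggregate_features(paper_id):
    """Flatten a one-to-many citation relation into fixed-length features."""
    cited = [papers[p] for p in cites[paper_id]]
    topics = Counter(p["topic"] for p in cited)
    return {
        # Count of related objects.
        "n_cited": len(cited),
        # Most frequent topic among the cited papers.
        "mode_cited_topic": topics.most_common(1)[0][0] if cited else None,
        # Mean publication year of the cited papers.
        "mean_cited_year": mean(p["year"] for p in cited) if cited else None,
    }


print(aggregate_features("P4")["mode_cited_topic"])
```

Each paper gets the same fixed set of features regardless of how many papers it cites, which is what lets a traditional attribute-based classifier consume them.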

Collective Classification

• The challenge is classification using a learned model.

• A learned link-based model specifies a distribution over link and content attributes, which may be correlated based on the links between them.

• Intuitively, for linked objects, updating the category of one object can influence our inference about the categories of its linked neighbours.

• Iterative classification algorithms have been proposed for hypertext categorization and relational learning.

• Such algorithms have been studied in various fields, such as relaxation labeling in computer vision, inference in Markov random fields and loopy belief propagation in Bayesian networks.

• Allows us to learn the notion of hubs.
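The idea that updating one object's category influences its neighbours can be sketched with a deliberately simplified iterative scheme. Note the assumption: a plain majority vote over neighbour labels stands in for the learned link-based model, and the function name and toy graph are hypothetical.

```python
from collections import Counter


def iterative_classification(neighbours, known, max_rounds=10):
    """Label unlabeled nodes by repeated majority vote over linked neighbours.

    neighbours: node -> list of linked nodes (undirected)
    known:      node -> label for the labeled subset (kept fixed)
    """
    labels = dict(known)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            if node in known:
                continue
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            if not votes:
                continue  # no labeled neighbour yet; revisit next round
            best = votes.most_common(1)[0][0]
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable, so stop iterating
    return labels


# Hypothetical pages: b is linked to two "course" pages and to c,
# c is linked to two "faculty" pages and to b.
graph = {
    "a": ["b"], "f": ["b"], "b": ["a", "f", "c"],
    "c": ["b", "d", "e"], "d": ["c"], "e": ["c"],
}
seed = {"a": "course", "f": "course", "d": "faculty", "e": "faculty"}
result = iterative_classification(graph, seed)
print(result["b"], result["c"])
```

The loop re-evaluates every unlabeled node until no label changes, so a label assigned in one round can alter the votes seen by its neighbours in the next.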

Effective Use of Unlabeled Data

• Unique ways in which unlabeled data can be used to improve classification performance in relational domains:

1. Links among the unlabeled data (or test set) can provide information that can help with classification.

2. Links between the labeled training data and unlabeled data induce dependencies that should not be ignored.

3. Just as in the classical machine learning framework, in which there are no links among the data, unlabeled data can help us learn the distribution over object descriptions.

Link Prediction

• The challenge here is link discovery, or predicting the existence of links between objects.

• A range of the tasks that we have described fall under the category of link prediction.

• The difficulty here is that the prior probability of a link among any set of individuals is typically very low.

• A further challenge is the discovery of common relational patterns or subgraphs; some progress has been made, but this is an inherently difficult problem.

Object Identity

• The challenge is identity detection.

• How do we infer aliases, i.e. determine that two objects refer to the same individual?

• A related question is whether our statistical models refer explicitly to individuals or only to classes or categories of objects.

• We would like to model that a connection to a particular object or individual is highly predictive.

• On the other hand, we'd like our models to generalize and be applicable to new, unseen objects.

Statistical Challenges to Inductive Inference in Linked Data

1. Statistical dependences

2. Sampling density

3. Feature combinatorics

Statistical Dependencies

• Instance Linkage

• Independent Instances

• Dependent Instances

[Figure: independent versus dependent instances, with nodes A1 ... An and B1 ... Bn]

Sampling Density

[Figure: partial sampling of a network with nodes A0 to A8]

Feature Combinatorics

• Linked data intensify a challenge: adjusting for multiple comparisons.

• Other induction algorithms use a procedure:
  - Generate n items
  - Calculate a score for each item based on the training set
  - Select the item with the maximum score

• Linked data intensify these challenges.

• Techniques to adjust for this include:
  - new data samples
  - cross-validation
  - randomization tests
  - Bonferroni adjustment
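Why selecting the best of n scored items calls for an adjustment can be illustrated with the Bonferroni correction. This is a sketch under the assumption that the n tests are independent; the function names and the p-values are hypothetical.

```python
def family_wise_error(alpha, n):
    """Chance of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n


def bonferroni_alpha(alpha, n):
    """Per-test significance level that keeps the family-wise rate <= alpha."""
    return alpha / n


def select_significant(p_values, alpha=0.05):
    """Keep the items whose p-value passes the Bonferroni-adjusted level."""
    cutoff = bonferroni_alpha(alpha, len(p_values))
    return {item: p for item, p in p_values.items() if p <= cutoff}


# Hypothetical p-values for four candidate features.
candidates = {"f1": 0.04, "f2": 0.30, "f3": 0.001, "f4": 0.02}
kept = select_significant(candidates)

print(kept)
print(family_wise_error(0.05, 20))                        # unadjusted, 20 items
print(family_wise_error(bonferroni_alpha(0.05, 20), 20))  # adjusted, 20 items
```

With four candidates the per-test level drops from 0.05 to 0.0125, so only f3 survives even though f1 and f4 would each pass an unadjusted test.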

Probabilistic Model of Group Membership and Link Generation

• The model considers both observed link evidence and demographic information about the entities.

• Parameters of the model are learned via a maximum likelihood search.

• The system takes 2 types of input data:
1. A database of entities and their demographic information
2. A database of link data

• Outputs a set of group memberships, which is used to answer queries such as:

1. List all members of group G1
2. List all the groups for which E1 and E2 are both members
3. List a set of suspected aliases (entities that are in the same group(s), but never appear in the same link).
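The three queries can be sketched as set operations over the inferred memberships. The data and function names are hypothetical, and reading "in the same group(s)" as "sharing at least one group" is an assumption.

```python
from itertools import combinations

# Hypothetical output of the model: group -> set of member entities.
groups = {
    "G1": {"E1", "E2", "E3"},
    "G2": {"E1", "E2"},
    "G3": {"E4"},
}
# Hypothetical link data: each link is the set of entities that appear in it.
links = [{"E1", "E2"}, {"E2", "E3"}]


def members_of(group):
    """Query 1: list all members of a group."""
    return sorted(groups[group])


def shared_groups(a, b):
    """Query 2: list all groups of which a and b are both members."""
    return sorted(g for g, m in groups.items() if a in m and b in m)


def suspected_aliases():
    """Query 3: pairs that share a group but never co-occur in any link."""
    entities = sorted(set().union(*groups.values()))
    return [
        (a, b)
        for a, b in combinations(entities, 2)
        if shared_groups(a, b) and not any(a in l and b in l for l in links)
    ]


print(members_of("G1"))
print(shared_groups("E1", "E2"))
print(suspected_aliases())
```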

Probabilistic Model of Group Membership and Link Generation

[Figure: model structure. Solid borders mark observed data; dashed borders mark hidden information.]

Demo. Data:

Person    Age  Job      Nationality
Atkins    24   Teacher  Britain
Brown     34   Clerk    USA
Chapman   30   Driver   USA
Dickens   18   Student  France

Demographic Model:

p(Member G1 | demographics) classifier
p(Member G2 | demographics) classifier
p(Member G3 | demographics) classifier
...
p(Member G6 | demographics) classifier

Chart: Person by Group (G1 to G6); an asterisk marks membership.

Person    Group memberships
Atkins    * * *
Brown     * *
Chapman   * *
Dickens

Link model:

Link type  Pl    PR
Phone      0.03  0.03
Meeting    0.20  0.20
Money      0.01  0.01
Email      0.05  0.05

Link data:

Persons            Type
{Atkins, Chapman}  Money
{Brown, Dickens}   Meeting
{Atkins, Brown}    Email
:                  :

Summary

• Link mining
  - exciting new research area
  - poses new statistical modeling challenges

• The link mining task should inform our choice of:
  - link-based statistical model
  - visualization

Reference

• L. Getoor. Link Mining: A New Data Mining Challenge. SIGKDD Explorations, volume 4, issue 2, 2003.

• P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001.

• D. Jensen. Statistical challenges to inductive inference in linked data. In Seventh International Workshop on AI and Statistics, 1999.

• J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

• J. Kubica, A. Moore, J. Schneider and Y. Yang. Stochastic link and group detection. In Proceedings of AAAI-02, 2002.

• L. Getoor, E. Segal, B. Taskar, D. Koller. Probabilistic Models of Text and Link Structure for Hypertext Classification. IJCAI Workshop on "Text Learning: Beyond Supervision", Seattle, WA, August 2001.
