Relational Data Mining with Inductive Logic Programming for Link Discovery
![Page 1: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/1.jpg)
Relational Data Mining with Inductive Logic Programming for Link Discovery
Ray Mooney, Prem Melville, Rupert Tang
University of Texas at Austin
Jude Shavlik, Inês de Castro Dutra, David Page, Vítor Santos Costa
University of Wisconsin at Madison
![Page 2: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/2.jpg)
EELD Program
• Evidence Extraction
• Link Discovery
• Pattern Learning
![Page 3: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/3.jpg)
Link Discovery Task (from Jim Antonisse, GITI)
[Diagram: link discovery workflow. Labels: Evidence; Evidence request(s); Queries; Ontologies; Problem Context; Domain Patterns; Pattern(s) of Interest; Link Discovery Core: Pattern Matching; Vetted hypcases; Alerts based on Hypothesized cases. Legend: pre-run-time processing, run-time processing.]
![Page 4: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/4.jpg)
Link Discovery
• Data is multi-relational with many people, places, objects and actions and numerous types of relations between them.
• Link analysis in intelligence and criminology explores and visualizes such data as a graph with many nodes and edges of various types.
• Link discovery entails finding new links and recognizing threatening patterns in such highly-relational data.
![Page 5: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/5.jpg)
![Page 6: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/6.jpg)
Pattern Learning for Link Discovery
• Automated discovery of “patterns of interest” that indicate potentially threatening activities in large amounts of heterogeneous, multi-relational data.
• Requires inducing multi-relational patterns that characterize multiple entities and multiple links between them.
![Page 7: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/7.jpg)
Limitations of Traditional Data Mining
• Traditional KDD methods assume the data to be mined is in a single relational table and that examples are flat tuples of attribute values.
• This assumption stems from:
– 1) Properties of the typical data mining tasks like market basket analysis.
– 2) Focus in machine learning and statistics on classification or regression using feature vectors as inputs.
![Page 8: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/8.jpg)
Relational Data Mining
• Data contains multiple relations.
• Patterns to be discovered contain multiple relations.
• Knowledge to be discovered may be the definition of another relation rather than a classification or regression function.
![Page 9: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/9.jpg)
Relational Data Mining Example
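[Diagram: family tree over Bob, John, Fred, Alice, Mary, Jack, Sue, Carol, Jane and Tom, with Married and Parent edges and Male/Female node types. Highlighted nodes: Alice, Jane, Tom, Carol, bound to the variables X, Y, Z, W.]
Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W).
Example: Uncle(tom, carol)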
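The Uncle rule on this slide can be read as a join over the Parent relation. The sketch below evaluates it in Python over a tiny fact set. The facts are an assumption made for illustration: the transcript does not preserve the edges of the family tree, so only enough hypothetical Parent and Male facts are included to derive Uncle(tom, carol).

```python
from itertools import product

# Hypothetical facts (assumed, not recovered from the slide's diagram).
# Each pair is (parent, child).
parent = {("alice", "tom"), ("alice", "jane"), ("jane", "carol")}
male = {"tom"}

def uncles(parent, male):
    """Return all (X, Y) satisfying
    Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W)."""
    result = set()
    for (z1, x), (z2, w) in product(parent, repeat=2):
        # X and W must share the parent Z, be distinct, and X must be male.
        if z1 != z2 or x == w or x not in male:
            continue
        for (w2, y) in parent:
            if w2 == w:  # W is a parent of Y
                result.add((x, y))
    return result

print(uncles(parent, male))
```

The point of the example is that the pattern spans three tuples of one relation plus a tuple of another, which a single flat feature vector per example cannot express directly.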
![Page 10: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/10.jpg)
Relational Data Mining Example (cont)
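[Diagram: the same family tree over Bob, John, Fred, Alice, Mary, Jack, Sue, Carol, Jane and Tom, with Married and Parent edges and Male/Female node types. Highlighted nodes: Jack, John, Tom, Alice, Jane, bound to the variables X, Z, W, V, Y.]
Uncle(X,Y) :- Married(X,Z), Parent(W,Z), Parent(W,V), Parent(V,Y), Male(X), not(Z=V).
Example: Uncle(jack, john)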
![Page 11: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/11.jpg)
Most KDD Ignores RDM
• KDD textbooks barely mention RDM:
– Han & Kamber, 2001
– Hand, Mannila, & Smyth, 2001
– Witten & Frank, 1999
• But there is a recent edited collection on RDM:
– S. Džeroski & N. Lavrač, eds. Relational Data Mining, Springer Verlag, 2001.
![Page 12: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/12.jpg)
Inductive Logic Programming (ILP)
• Standard formal language for representing relational knowledge is first-order predicate logic.
• ILP studies the induction of hypotheses in first-order predicate logic.
• Logic programs (e.g. Prolog) or function-free logic programs (e.g. Datalog) are a useful, reasonably tractable subset of first-order predicate logic.
• ILP is the most well-studied approach to relational data mining.
![Page 13: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/13.jpg)
ILP Problem Definition
Given
• Positive Example Set: P
• Negative Example Set: N
• Background Knowledge: B
Find
• Hypothesis, H, such that
$$\forall p \in P:\; H \cup B \models p \qquad\qquad \forall n \in N:\; H \cup B \not\models n$$
P, N, B and H are all sets of rules in first-order logic (i.e. Horn clauses, logic programs)
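The two conditions on H are often called completeness (every positive example is entailed) and consistency (no negative example is entailed). The sketch below checks both in Python under a strong simplification: entailment is approximated by evaluating each clause directly against ground background facts, which is not general first-order inference. The function names, the fact set, and the example pairs are all hypothetical.

```python
def covers(hypothesis, background, example):
    """True if some clause of the hypothesis derives the example from the background."""
    return any(clause(background, example) for clause in hypothesis)

def is_solution(hypothesis, background, positives, negatives):
    """Check the ILP conditions: cover every positive and no negative example."""
    complete = all(covers(hypothesis, background, p) for p in positives)
    consistent = not any(covers(hypothesis, background, n) for n in negatives)
    return complete and consistent

def uncle_clause(bg, example):
    """Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W)."""
    x, y = example
    if x not in bg["male"]:
        return False
    for (z, child) in bg["parent"]:
        if child != x:
            continue
        for (z2, w) in bg["parent"]:
            if z2 == z and w != x and (w, y) in bg["parent"]:
                return True
    return False

# Hypothetical background knowledge B and example sets P and N.
background = {
    "parent": {("alice", "tom"), ("alice", "jane"), ("jane", "carol")},
    "male": {"tom"},
}
positives = [("tom", "carol")]
negatives = [("jane", "carol"), ("tom", "jane")]

good = is_solution([uncle_clause], background, positives, negatives)
too_general = is_solution([lambda bg, ex: True], background, positives, negatives)
print(good, too_general)
```

A hypothesis that accepts everything is complete but not consistent, which is why both conditions are needed.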
![Page 14: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/14.jpg)
ILP Algorithms
• We have utilized two ILP systems for EELD problems in link discovery.
– Aleph (Srinivasan, 2001): a variant of the popular Progol algorithm (Muggleton, 1995)
– mFoil+ (Tang and Mooney, 2002): a variant of the popular Foil algorithm (Quinlan, 1990)
![Page 15: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/15.jpg)
EELD Russian Nuclear Smuggling Data
• Data manually extracted from news sources about events related to nuclear smuggling (developed by Veridian Inc.)
• Size of data set:
– 40 relational tables
– 2 to 800 tuples per relation
• Translated Access database to Prolog, mapping each relational table to a predicate.
• Used Aleph to learn rules for the relation Linked(A,B), which determines whether or not two events are part of the same incident.
– 143 positive examples
– 517 negative examples
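The table-to-predicate translation mentioned above can be sketched as follows: each table name becomes a predicate name and each row becomes one ground fact. This is only an illustration of the idea, not the conversion tool used in the project. The helper names and the sample row are hypothetical, and the double-quoting of string constants is assumed from the "Stolen" constant that appears in the learned rule.

```python
def prolog_term(value):
    """Render a column value as a Prolog constant: numbers bare, strings double-quoted."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def table_to_facts(table_name, rows):
    """Map each row of a relational table to one fact of the predicate table_name."""
    facts = []
    for row in rows:
        args = ", ".join(prolog_term(v) for v in row)
        facts.append(f"{table_name}({args}).")
    return facts

# Hypothetical rows; the real column layout of l_relations is not given in the slides.
rows = [(7, 3, "Stolen"), (8, 3, "Recovered")]
for fact in table_to_facts("l_relations", rows):
    print(fact)
```

With this encoding, a table of k columns yields a predicate of arity k, which matches the wide predicates seen in the learned rules.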
![Page 16: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/16.jpg)
Illustration of Linked Relation
[Diagram labels: Partial Incident N, Partial Incident M, New Event]
![Page 17: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/17.jpg)
Find Correct Incident for New Event
[Diagram labels: Expanded Incident N, Partial Incident M]
![Page 18: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/18.jpg)
Sample Rule
linked(EventA,EventB) :-
    lk_event_material(_,EventA,_,_,_,ConcealmentG,DescH),
    lk_event_person(_,EventB,PersonD,_,C,C,_),
    lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
    lk_event_material(_,EvE,MatF,I,_,ConcealmentG,DescH),
    l_relations(I,_,"Stolen").
If A is linked to a specific type of material <G,H>, and B is linked to a person linked to the same specific type of material, through an event in which that material was stolen, then events A and B are linked.
![Page 19: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/19.jpg)
[Diagram labels: A, B. Legend: Event, Material, Person. Target relation: Linked(A,B)]
![Page 20: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/20.jpg)
[Diagram labels: A, B, MaterialType G H. Legend: Event, Material, Person. Target relation: Linked(A,B)]
![Page 21: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/21.jpg)
[Diagram labels: A, B, D, E, MaterialType G H (twice). Legend: Event, Material, Person. Target relation: Linked(A,B)]
![Page 22: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/22.jpg)
[Diagram labels: A, B, D, E, MaterialType G H (twice), Stolen. Legend: Event, Material, Person. Target relation: Linked(A,B)]
![Page 23: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/23.jpg)
![Page 24: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/24.jpg)
Accuracy Results for Learning Linked for Nuclear Smuggling Data
• Experimental Method: 5-fold cross validation.
• Also tried bagging Aleph to produce an ensemble of 25 hypotheses.
| Majority Class (not Linked) | Aleph | Bagged Aleph |
| --- | --- | --- |
| 78% | 83% | 86% |
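Bagging trains each ensemble member on a bootstrap resample of the training set and combines their predictions by majority vote. The sketch below shows that procedure in Python with 25 members, as on the slide. The learner is a trivial stand-in that predicts the majority label of its sample; it is not Aleph, and the function names are hypothetical. The last lines recompute the majority-class baseline from the class counts given earlier (143 positive, 517 negative).

```python
import random
from collections import Counter

def train_bagged(train, learn, n_models=25, seed=0):
    """Train n_models models, each on a bootstrap resample (with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(learn(sample))
    return models

def vote(models, x):
    """Return the label predicted by the largest number of models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

def majority_learner(sample):
    """Stand-in learner (NOT Aleph): always predicts its sample's majority label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

# Class counts from the slides: 143 positive and 517 negative Linked examples.
train = [(i, True) for i in range(143)] + [(i, False) for i in range(143, 660)]
models = train_bagged(train, majority_learner)
baseline = 517 / (143 + 517)  # accuracy of always answering "not Linked"

print(len(models), vote(models, 0), round(baseline, 3))
```

An odd ensemble size avoids tied votes on a two-class problem, and the baseline works out to about 78%, in line with the table.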
![Page 25: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/25.jpg)
Synthetic Contract Killing Data
• Data generated by a plan-based simulator that generates evidence emulating contract killings and other types of murders (developed by IET Inc.).
• Simulator used to generate evidence from 200 murder events of three types:
– Murder for Hire (71 exs)
– First Degree (75 exs)
– Second Degree (54 exs)
• Use mFoil+ to classify events into one of these three categories.
![Page 26: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/26.jpg)
Sample Rules
• Murder For Hire(A) :- groupMemberMaleficiary(A, B), subEvents(A, C), crimeMotive(C, economic).
• First Degree Murder(A) :- subEvents(A, B), performedBy(B, C), loves(C, D).
• Second Degree Murder(A) :- subEvents(A, B), eventOccursAtLocationType(B, publicProperty), crimeMotive(B, rival), occurrentSubeventType(B, stealing_Generic).
![Page 27: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/27.jpg)
Results on Synthetic Contract Killing Data
| | MurderForHire | FirstDegree | SecondDegree |
| --- | --- | --- | --- |
| PRECISION | 85.52% | 91.17% | 95.83% |
| RECALL | 91.07% | 88.48% | 59.45% |

| | Majority Class | mFOIL+ |
| --- | --- | --- |
| ACCURACY | 37.50% | 76.67% |
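For reference, per-class precision and recall and overall accuracy can be computed as in the sketch below. The gold and predicted labels are a small hypothetical sample, not the experiment's data. The final line recomputes the majority-class baseline from the class sizes given earlier (71, 75 and 54 events), which gives 75/200 = 37.5%.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)

# Hypothetical labels for illustration only.
gold = ["hire", "hire", "first", "second", "second"]
predicted = ["hire", "first", "first", "second", "first"]

print(precision_recall(gold, predicted, "hire"))
print(accuracy(gold, predicted))

# Majority-class baseline from the class sizes on the earlier slide.
class_sizes = {"MurderForHire": 71, "FirstDegree": 75, "SecondDegree": 54}
majority_baseline = max(class_sizes.values()) / sum(class_sizes.values())
print(majority_baseline)
```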
![Page 28: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/28.jpg)
Recent Result from EELD Challenge Problem
murder_for_hire(A) :- eventOccursAt(A,B), perpetrator(A,C), agentPhoneNumber(C,D), callerNumber(E,D), accountHolder(F,C), to_Generic(G,F), from_Generic(G,H), to_Generic(I,H).
• Says an event is a “murder for hire” if it has a recorded location and perpetrator, we have a recorded phone call to the perpetrator, and there was a chain of bank transfers resulting in money reaching the perpetrator’s account.
• 100% accuracy on a held-out test set.
• Similar pattern found manually by LD researchers working on this challenge problem.
![Page 29: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/29.jpg)
Future Research
• Scaling to larger datasets
– Stochastic search
– Logic program optimization
– Integration with relational and deductive database technology.
• Integrating probabilistic reasoning
– Logic programs with Bayes-net constraints
• Active Learning
• Theory Refinement
![Page 30: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/30.jpg)
Related Research
• Graph-based Relational Data Mining
– Subdue (Cook & Holder, UT Arlington)
• Probabilistic Relational Models
– PRMs (Koller, Stanford)
• Relational Feature Construction
– PROXIMITY (Jensen, UMass)
![Page 31: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/31.jpg)
Record Linkage
• Identify and merge duplicate field values and duplicate records in a database.
• Applications
– Duplicates in mailing lists
– Merging multiple databases of stores, restaurants, etc.
– Matching bibliographic references in research papers (Cora/ResearchIndex)
– Identifying individuals who are trying to hide their identity by providing slightly erroneous personal information.
![Page 32: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/32.jpg)
Record Linkage Examples
| Author | Title | Venue | Address | Year |
| --- | --- | --- | --- | --- |
| Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby | Information, prediction, and query by committee | Advances in Neural Information Processing System | San Mateo, CA | 1993 |
| Freund, Y., Seung, H.S., Shamir, E. & Tishby, N. | Information, prediction, and query by committee | Advances in Neural Information Processing Systems | San Mateo, CA. | |

| Name | Address | City | Cuisine |
| --- | --- | --- | --- |
| Second Avenue Deli | 156 2nd Ave. at 10th | New York | Delicatessen |
| Second Avenue Deli | 156 Second Ave. | New York City | Delis |
![Page 33: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/33.jpg)
Trainable Record Linkage
• MARLIN (Multiply Adaptive Record Linkage using INduction)
• Learn parameterized similarity metrics for comparing each field.
– Trainable edit-distance
  • Use EM to set edit-operation costs
• Learn to combine multiple similarity metrics for each field to determine equivalence.
– Use SVM to decide on duplicates
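To make the idea of a parameterized edit distance concrete, the sketch below computes a string distance whose substitution, insertion and deletion costs are parameters. It is not MARLIN's metric: the costs are fixed by hand here as stand-ins for values that would be learned with EM, a single cost per operation type is assumed, and the function name is hypothetical.

```python
def weighted_edit_distance(s, t, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Minimum total cost of edit operations turning s into t.

    Standard dynamic program over prefixes, keeping only the previous row.
    The three costs are the tunable parameters a trainable metric would learn.
    """
    prev = [j * ins_cost for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * del_cost]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete cs
                curr[j - 1] + ins_cost,                          # insert ct
                prev[j - 1] + (0.0 if cs == ct else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]

print(weighted_edit_distance("kitten", "sitting"))
print(weighted_edit_distance("156 2nd Ave.", "156 Second Ave.", ins_cost=0.5))
```

Lowering the cost of an operation that is common between true duplicates, such as the insertions that expand an abbreviation, pulls matching records closer together. The per-field distances could then serve as features for a classifier that decides on duplicates.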
![Page 34: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/34.jpg)
MARLIN Record Linkage Framework
[Diagram: each field pair (A.Field1, B.Field1), (A.Field2, B.Field2), …, (A.Fieldn, B.Fieldn) is compared by trainable similarity metrics m1 … mk, whose outputs feed a trainable duplicate detector.]
![Page 35: Relational Data Mining with Inductive Logic Programming for Link Discovery](https://reader035.vdocument.in/reader035/viewer/2022062500/568153a3550346895dc1a6f1/html5/thumbnails/35.jpg)
Conclusions
• Pattern Learning for Link Discovery is an important application of data mining for counter-terrorism.
• Learning for Link Discovery requires Relational Data Mining (RDM).
• Other problem domains require RDM
– Bioinformatics
– Web
– Natural Language Understanding
• RDM is an important next-generation KDD capability.