Actionability and Formal Concepts: a Data Mining Perspective

Jean-François Boulicaut, INSA Lyon, LIRIS CNRS UMR 5205, France

Montréal (Canada), ICFCA 2008



2

Preamble

Actionability and Formal Concepts: a Data Mining Perspective
Knowledge Discovery based on Formal Concepts
Data Mining as the “Art of Counting”
Complete solvers for Inductive Queries are useful
A “personal” and obviously biased perspective

Joint work with:
J. Besson (co-author of the invited paper)
R. G. Pensa, C. Rigotti, C. Robardet
inspired by many other colleagues

3

A conceptual view on KDD processes

Inductive Database Management System

Data
Patterns
Models

e.g., Inductive queries on 0/1 data for targeted applications in functional genomics

4

Motivation (1)

Pattern discovery from large 0/1 data sets

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     1   1   1   0   1
O3     1   0   1   1   1
O4     0   0   1   1   1
O5     1   1   0   0   1
O6     1   1   0   0   1
O7     0   1   0   0   1

Objects x Properties: O x P

Size may be a problem: 10^2 ... 10^6 x 10^2 ... 10^4

5

Motivation (2)

Pattern discovery from large 0/1 data sets

Y ∈ 2^P ∧ |g(Y)| > 2

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     1   1   1   0   1
O3     1   0   1   1   1
O4     0   0   1   1   1
O5     1   1   0   0   1
O6     1   1   0   0   1
O7     0   1   0   0   1

Objects x Properties

A 3-frequent itemset: {{o2,o5,o6},{p2,p5}}

g: 2^P → 2^O, f: 2^O → 2^P
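A minimal Python sketch of the frequency constraint on this slide, assuming the 0/1 table is read as the object-to-properties mapping below. The names DATA, g and frequent_itemsets are illustrative and do not come from the slides; the enumeration is brute force, which is only viable on toy data.

```python
from itertools import combinations

# Assumed reading of the slide's 7 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
PROPS = sorted(set().union(*DATA.values()))


def g(itemset):
    """Objects that have every property of the itemset."""
    return {o for o, props in DATA.items() if set(itemset) <= props}


def frequent_itemsets(gamma):
    """Map each non-empty itemset Y with |g(Y)| > gamma to its support."""
    result = {}
    for k in range(1, len(PROPS) + 1):
        for y in combinations(PROPS, k):
            support = len(g(y))
            if support > gamma:
                result[frozenset(y)] = support
    return result


FREQUENT = frequent_itemsets(gamma=2)
```

FREQUENT contains {p2,p5}, and the objects o2, o5, o6 are all in g({p2,p5}).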

6

Motivation (3)

Pattern discovery from large 0/1 data sets

X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X)

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     1   1   1   0   1
O3     1   0   1   1   1
O4     0   0   1   1   1
O5     1   1   0   0   1
O6     1   1   0   0   1
O7     0   1   0   0   1

Objects x Properties

A formal concept: {{o2,o5,o6},{p1,p2,p5}}

g: 2^P → 2^O, f: 2^O → 2^P
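The two closure conditions X = g(Y) and Y = f(X) can be checked directly. The sketch below assumes the same reading of the 7 x 5 table; the function names are illustrative, and f of the empty object set is taken to be the full property set.

```python
# Assumed reading of the slide's 7 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
ALL_PROPS = set().union(*DATA.values())


def g(props):
    """Objects that have all the given properties."""
    return {o for o, ps in DATA.items() if set(props) <= ps}


def f(objects):
    """Properties shared by all the given objects (all properties if none)."""
    shared = set(ALL_PROPS)
    for o in objects:
        shared &= DATA[o]
    return shared


def is_formal_concept(x, y):
    """True when X = g(Y) and Y = f(X)."""
    return g(y) == set(x) and f(x) == set(y)
```

With this data, ({o2,o5,o6},{p1,p2,p5}) passes the test, while ({o2,o5,o6},{p2,p5}) does not because its property set is not closed.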

7

Motivation (4)

Pattern discovery from large 0/1 data sets

Fault-tolerant extensions to formal concepts?
Dense itemsets?

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     1   1   1   0   1
O3     1   0   1   1   1
O4     0   0   1   1   1
O5     1   1   0   0   1
O6     1   1   0   0   1
O7     0   1   0   0   1

Objects x Properties

Examples of subspace clusters (bi-clusters)

8

Motivation (5)

Pattern discovery from large 0/1 data sets

From local patterns to global patterns

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     1   1   1   0   1
O3     1   0   1   1   1
O4     0   0   1   1   1
O5     1   1   0   0   1
O6     1   1   0   0   1
O7     0   1   0   0   1

Objects x Properties

A co-clustering

9

Targeted applications

Understanding gene regulation?

Genes, proteins, transcription factors, promoter sequences

10

… by means of formal concepts

      G1  G2  G3  G4  G5
Tf1    1   0   1   0   1
Tf2    1   1   1   0   1
Tf3    0   0   0   1   0

      G1  G2  G3  G4  G5
E1     1   0   1   1   0
E2     1   1   1   0   1
E3     1   0   1   1   1
E4     0   0   1   1   0

      C1  C2
E1     0   1
E2     1   0
E3     1   0
E4     0   1

G1, G3, and G5 are co-expressed genes for all individuals of class C1. Tf1 and Tf2 might explain this co-expression.
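The reasoning on this slide can be replayed in a few lines. This is a sketch under the assumption that the three small tables are read as the mappings below (transcription factor to genes, individual to expressed genes, individual to class); all names are illustrative.

```python
# Assumed reading of the slide's three 0/1 tables.
TF_GENES = {
    "Tf1": {"G1", "G3", "G5"},
    "Tf2": {"G1", "G2", "G3", "G5"},
    "Tf3": {"G4"},
}
EXPRESSION = {
    "E1": {"G1", "G3", "G4"},
    "E2": {"G1", "G2", "G3", "G5"},
    "E3": {"G1", "G3", "G4", "G5"},
    "E4": {"G3", "G4"},
}
CLASS_OF = {"E1": "C2", "E2": "C1", "E3": "C1", "E4": "C2"}


def co_expressed(cls):
    """Genes expressed in every individual of the given class."""
    individuals = [e for e, c in CLASS_OF.items() if c == cls]
    return set.intersection(*(EXPRESSION[e] for e in individuals))


def candidate_regulators(genes):
    """Transcription factors associated with all the given genes."""
    return {tf for tf, targets in TF_GENES.items() if genes <= targets}


C1_GENES = co_expressed("C1")
C1_REGULATORS = candidate_regulators(C1_GENES)
```

C1_GENES comes out as {G1, G3, G5} and C1_REGULATORS as {Tf1, Tf2}, which is the statement made on the slide.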

11

A successful example (Ph.D. J. Besson)

Actionability and Knowledge Discovery

An outcome of a concrete KDD process:

Human genes from {SPOP, ABCA7, FEM1B, HK2, MAPRE1, MORF4F4L2, ARF4, SF1, VSP29, CRYBA4, HIG1, SDC1, PGRMC2} appear to be co-regulated by insulin for all control individuals, and the transcription factors from {SREBP, SP1, NF-Y, GATA-1, GATA-2, AML-1a} might support this co-regulation

12

Cont.

Discovering (new) regulation mechanisms

Identification of genes regulated by insulin from microarray data

Looking for formal concepts in a 156 x 344 matrix that encodes the association of transcription factors with genes that are regulated by insulin

13

Cont.

Formal Concepts (X,Y)

> 5 million

      G1  G2  G3  G4
TF1    1   1   0   1
TF2    1   1   0   1
TF3    1   0   1   1
TF4    0   1   1   1

14

Cont.

Formal Concepts (X,Y) s.t. SREBP1 ∈ X

> 3.6 million

SREBP1 (Sterol-responsive-element binding protein 1) is known to be involved in the transcriptional response to insulin

      G1  G2  G3  G4
TF1    1   1   0   1
TF2    1   1   0   1
TF3    1   0   1   1
TF4    0   1   1   1

15

Cont.

Formal Concepts (X,Y) s.t. SREBP1∈X ∧ SP1∈X ∧ NF-Y∈X

1,477

SP1 and NF-Y are known for « cooperating » with SREBP1

      G1  G2  G3  G4
TF1    1   1   0   1
TF2    1   1   0   1
TF3    1   0   1   1
TF4    0   1   1   1

16

Biological validation

The formal concept ({SREBP, SP1, NF-Y, GATA-1, GATA-2, AML-1a}, {SPOP, ABCA7, FEM1B, HK2, MAPRE1, MORF4F4L2, ARF4, SF1, VSP29, CRYBA4, HIG1, SDC1, PGRMC2}) has been “studied” (wet biology, S. Rome et al.)

90% of these genes indeed have an active binding site for SREBP1 when SP1 and NF-Y are present

New targeted genes for SREBP1, SP1, NF-Y

17

Our thesis (1)

Discovery based on formal concepts makes sense … but

Mining formal concepts is a special case of bi-set mining under constraints and may be studied as such

(X,Y): X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X) ∧ ...
X = g(Y) ∧ Y = f(g(Y)) ∧ ...
Z ∈ 2^P ∧ Z is free ∧ X = g(Z) ∧ Y = f(g(Z)) ∧ ...

« Pushing » constraints is a key technology for solving inductive queries

18

Our thesis (1’)

Discovery based on formal concepts makes sense … but

Mining formal concepts is a special case of bi-set mining under constraints and may be studied as such

(X,Y): X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X) ∧ |X| > γ
X = g(Y) ∧ Y = f(g(Y)) ∧ |X| > γ
Z ∈ 2^P ∧ Z is free ∧ X = g(Z) ∧ Y = f(g(Z)) ∧ |X| > γ

« Pushing » constraints is a key technology for solving inductive queries
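The three formulations on this slide describe the same collection of bi-sets. The sketch below checks this by brute force on the 7 x 5 table from the Motivation slides (as read here), taking "Z is free" to mean that no proper subset of Z has the same support. Names such as BY_CLOSED and BY_FREE are illustrative.

```python
from itertools import combinations

# Assumed reading of the 7 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
PROPS = sorted(set().union(*DATA.values()))


def g(y):
    return frozenset(o for o in DATA if set(y) <= DATA[o])


def f(x):
    return frozenset(p for p in PROPS if all(p in DATA[o] for o in x))


def itemsets():
    for k in range(len(PROPS) + 1):
        for y in combinations(PROPS, k):
            yield frozenset(y)


def is_free(z):
    """Removing any single item strictly increases the support."""
    return all(len(g(z - {a})) > len(g(z)) for a in z)


# Second formulation: closed itemsets Y = f(g(Y)) with X = g(Y).
BY_CLOSED = {(g(y), y) for y in itemsets() if f(g(y)) == y}
# Third formulation: free sets Z with X = g(Z) and Y = f(g(Z)).
BY_FREE = {(g(z), f(g(z))) for z in itemsets() if is_free(z)}


def with_min_extent(concepts, gamma):
    """Keep the bi-sets (X, Y) with |X| > gamma."""
    return {(x, y) for x, y in concepts if len(x) > gamma}
```

On this data BY_CLOSED and BY_FREE are equal, and filtering either with gamma = 2 keeps ({o2,o5,o6},{p1,p2,p5}).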

19

Our thesis (2)

Discovery based on formal concepts makes sense … but

Mining formal concepts that satisfy user-defined constraints may be difficult

Specialization order, enumeration and pruning strategies have to be designed

Monotonicity properties are important

… ∧ |X| > α ∧ |Y| < β : not that hard
… ∧ |X| > α ∧ |Y| > β : much harder
… ∧ |X| x |Y| > α : much harder
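Why monotonicity matters can be seen on the constraint |X| > α with X = g(Y): adding properties to Y can only shrink g(Y), so a failing itemset rules out all of its supersets. The sketch below is a plain depth-first enumeration with that pruning on the 7 x 5 table from the Motivation slides (as read here). It is an illustration of the principle, not one of the solvers discussed in the talk, and its names are made up. The same pruning test cannot be used for |Y| > β, because small itemsets that violate it may still have supersets that satisfy it.

```python
# Assumed reading of the 7 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
PROPS = sorted(set().union(*DATA.values()))


def g(itemset):
    return {o for o, props in DATA.items() if itemset <= props}


def mine_min_extent(alpha):
    """Non-empty itemsets Y with |g(Y)| > alpha, plus the number of tests made."""
    found, visited = set(), 0

    def expand(itemset, start):
        nonlocal visited
        for i in range(start, len(PROPS)):
            candidate = itemset | {PROPS[i]}
            visited += 1
            # Anti-monotone constraint: if the candidate fails, every
            # superset fails too, so the branch is not explored further.
            if len(g(candidate)) > alpha:
                found.add(candidate)
                expand(candidate, i + 1)

    expand(frozenset(), 0)
    return found, visited


FOUND, VISITED = mine_min_extent(alpha=2)
```

FOUND matches what an exhaustive scan of all 31 non-empty itemsets returns, while VISITED is smaller than 31.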

20

Using such constraints in practice?

Plot: number of formal concepts as a function of minimal size (objects) and minimal size (properties), for a given biological data set (TF x G)

21

Our thesis (3)

Discovery based on formal concepts makes sense … but

... a somewhat disappointing actionability

Too many patterns

Many patterns denote false positive associations or uninteresting ones

− Using randomization techniques may help

Fault-tolerance?

22

Problems w.r.t. closed set mining (1)

      p1  p2  p3  p4  p5  p6  p7
o1     1   1   1   1   1   0   0
o2     1   1   1   1   1   0   0
o3     1   1   1   1   1   0   0
o4     1   1   1   1   1   0   0
o5     0   0   0   0   0   1   1
o6     0   0   0   0   0   1   1
o7     0   0   0   0   0   0   0

({o1,o2,o3,o4},{p1,p2,p3,p4,p5})

({o5,o6},{p6,p7})

23

Problems w.r.t. closed set mining (2)

Introducing « 10% errors »

      p1  p2  p3  p4  p5  p6  p7
o1     1   0   1   1   1   0   0
o2     1   1   1   1   0   0   0
o3     1   1   1   1   1   0   0
o4     1   1   1   1   1   0   0
o5     0   0   0   0   0   1   1
o6     0   0   0   0   1   1   0
o7     1   0   0   0   0   0   0

({o1,o2,o3,o4,o7},{p1})

({o1,o2,o3,o4},{p1,p3,p4})

({o2,o3,o4},{p1,p2,p3,p4})

({o3,o4},{p1,p2,p3,p4,p5})

({o1,o3,o4},{p1,p3,p4,p5})

({o1,o3,o4,o6},{p5})

({o5,o6},{p6})

({o5},{p6,p7})

({o6},{p5,p6})
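The fragmentation shown on these two slides can be reproduced by brute force. The sketch assumes the two 7 x 7 tables are read as below (digits stand for p1..p7) and counts only formal concepts whose object set and property set are both non-empty; the helper name is illustrative.

```python
from itertools import combinations

# Assumed reading of the two 7 x 7 tables; "1345" means {p1, p3, p4, p5}.
CLEAN = {
    "o1": set("12345"), "o2": set("12345"), "o3": set("12345"),
    "o4": set("12345"), "o5": set("67"), "o6": set("67"), "o7": set(),
}
NOISY = {
    "o1": set("1345"), "o2": set("1234"), "o3": set("12345"),
    "o4": set("12345"), "o5": set("67"), "o6": set("56"), "o7": set("1"),
}


def concepts(rows):
    """Formal concepts (X, Y) with non-empty X and Y, by brute force."""
    props = sorted(set().union(*rows.values()))
    found = set()
    for k in range(1, len(props) + 1):
        for y in combinations(props, k):
            x = frozenset(o for o in rows if set(y) <= rows[o])
            if not x:
                continue
            shared = set.intersection(*(rows[o] for o in x))
            if shared == set(y):
                found.add((x, frozenset(y)))
    return found
```

The clean matrix yields 2 such concepts and the noisy one yields 9, matching the lists on the two slides.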

24

Elements of solution

Supporting actionable pattern discovery

Looking for « large enough » patterns may help

Looking for fault-tolerant patterns

For a more or less declarative specification of fault-tolerance

Post-processing collections of local patterns in general and formal concepts in particular

25

Condensed representations are useful

Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}

0/1 data → Frequent itemsets → Rules, clusters, etc.

26

Condensed representations are useful’

Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}

0/1 data → Condensed representations of frequent sets → Frequent itemsets → Rules, clusters, etc.

Closed sets, δ-free sets, k-free sets, …

27

Condensed representations are useful’’

Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}

0/1 data → Condensed representations of frequent sets → Rules, clusters, etc.

Closed sets, δ-free sets, k-free sets, …
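One way to see why closed sets are a condensed representation: the support of any itemset equals the largest support among the closed sets that contain it, so frequency queries can be answered without going back to the data. The sketch checks this on the 7 x 5 table from the Motivation slides (as read here); CLOSED and support_from_closed are illustrative names.

```python
from itertools import combinations

# Assumed reading of the 7 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
PROPS = sorted(set().union(*DATA.values()))


def g(y):
    return {o for o in DATA if set(y) <= DATA[o]}


def closure(y):
    objs = g(y)
    return frozenset(p for p in PROPS if all(p in DATA[o] for o in objs))


ITEMSETS = [frozenset(y) for k in range(len(PROPS) + 1)
            for y in combinations(PROPS, k)]
# The condensed representation: closed itemsets with their supports.
CLOSED = {y: len(g(y)) for y in ITEMSETS if closure(y) == y}


def support_from_closed(y):
    """Support of Y derived from the closed sets only."""
    return max(s for c, s in CLOSED.items() if set(y) <= c)
```

Every one of the 32 itemsets gets its exact support back from CLOSED, which holds fewer than 32 entries.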

28

About condensed representations (1)

      p1  p2  p3  p4  p5
o1     0   1   1   0   1
o2     0   1   1   1   1
o3     1   1   0   1   0
o4     0   1   0   1   0
o5     1   1   1   0   1
o6     1   1   1   1   0

An equivalence class perspective
{p5} is a 0-free set and {p2,p3,p5} is its 0-closure

Equivalence class with support {o1,o2,o5}: p5; p3p5, p2p5; p2p3p5

29

About condensed representations (2)

      p1  p2  p3  p4  p5
o1     0   1   1   0   1
o2     0   1   1   1   1
o3     1   1   0   1   0
o4     0   1   0   1   0
o5     1   1   1   0   1
o6     1   1   1   1   0

A near-equivalence class perspective
{p3} is a 1-free set and {p2,p3,p5} is its 1-closure

Near-equivalence class with support {o1,o2, ...}: p3; p2p3, p3p5; p2p3p5
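The two statements about {p5} and {p3} can be checked with a small sketch. It assumes the 6 x 5 table is read as below, that a set Z is δ-free when no rule from Z minus one item to that item has at most δ exceptions, and that the δ-closure of Z adds every property implied by Z with at most δ exceptions. Function names are illustrative.

```python
# Assumed reading of the 6 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p2", "p3", "p5"},
    "o2": {"p2", "p3", "p4", "p5"},
    "o3": {"p1", "p2", "p4"},
    "o4": {"p2", "p4"},
    "o5": {"p1", "p2", "p3", "p5"},
    "o6": {"p1", "p2", "p3", "p4"},
}
PROPS = ["p1", "p2", "p3", "p4", "p5"]


def g(z):
    return {o for o, props in DATA.items() if set(z) <= props}


def exceptions(z, a):
    """Objects that have all of Z but not property a."""
    return len(g(z)) - len(g(set(z) | {a}))


def is_delta_free(z, delta):
    z = set(z)
    return all(exceptions(z - {a}, a) > delta for a in z)


def delta_closure(z, delta):
    z = set(z)
    return z | {a for a in PROPS if a not in z and exceptions(z, a) <= delta}
```

With δ = 0, {p5} is free and closes to {p2,p3,p5}; with δ = 1, {p3} is free and also closes to {p2,p3,p5}, because only o6 has p3 without p5.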

30

Using ac-like (1)

Computing δ-free sets and their δ-closures

δ=0: p3 ∧ p4 => p2 : 2

{p3, p4} {p3, p4, p2} : {o2, o6} : 2

Given γ = 1, 11 sets with δ = 0
NB. 0-free sets are also called key patterns

Association rules, closed sets and formal concepts

      p1  p2  p3  p4  p5
o1     0   1   1   0   1
o2     0   1   1   1   1
o3     1   1   0   1   0
o4     0   1   0   1   0
o5     1   1   1   0   1
o6     1   1   1   1   0
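A brute-force sketch of what such a run computes for δ = 0, assuming the 6 x 5 table is read as below. The frequency threshold is read here as |g(Z)| ≥ γ, which is an assumption: it is the reading under which the count of 0-free sets comes out as 11. This is not the ac-like implementation, and FREE and RULES are illustrative names.

```python
from itertools import combinations

# Assumed reading of the 6 x 5 table: object -> properties set to 1.
DATA = {
    "o1": {"p2", "p3", "p5"},
    "o2": {"p2", "p3", "p4", "p5"},
    "o3": {"p1", "p2", "p4"},
    "o4": {"p2", "p4"},
    "o5": {"p1", "p2", "p3", "p5"},
    "o6": {"p1", "p2", "p3", "p4"},
}
PROPS = ["p1", "p2", "p3", "p4", "p5"]
GAMMA = 1


def g(z):
    return {o for o, props in DATA.items() if set(z) <= props}


def closure(z):
    objs = g(z)
    return {p for p in PROPS if all(p in DATA[o] for o in objs)}


def is_free(z):
    """0-free: removing any item strictly increases the support."""
    return all(len(g(set(z) - {a})) > len(g(z)) for a in z)


# Frequent 0-free sets (the empty set included).
FREE = [set(z) for k in range(len(PROPS) + 1)
        for z in combinations(PROPS, k)
        if is_free(z) and len(g(z)) >= GAMMA]
# One exact rule per free set: (body, implied properties, support).
RULES = [(sorted(z), sorted(closure(z) - z), len(g(z))) for z in FREE]
```

RULES contains (['p3', 'p4'], ['p2'], 2), the rule p3 ∧ p4 => p2 with support 2 shown on the slide.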

31

Using ac-like (2)

Computing δ-free sets and their δ-closures

δ=0: p3 ∧ p4 => p2 : 2

{p3, p4} {p3, p4, p2} : {o2, o6} : 2

δ=1: p3 => p2 ∧ p5(-1) : 4
p3 ∧ p4 => p1(-1) ∧ p2 ∧ p5(-1) : 2

Given γ = 1, 11 sets with δ = 0, 7 sets with δ=1

Association rules, (almost) closed sets (FBS patterns)

      p1  p2  p3  p4  p5
o1     0   1   1   0   1
o2     0   1   1   1   1
o3     1   1   0   1   0
o4     0   1   0   1   0
o5     1   1   1   0   1
o6     1   1   1   1   0


33

Introducing DMiner

      p1  p2  p3
o1     0   1   1
o2     1   1   1
o3     1   1   1
o4     1   0   0

{p1,p2,p3} – {o1,o2,o3,o4}
  p1 – o1
  {p1,p2,p3} – {o2,o3,o4}
    p2p3 – o4
    {p1,p2,p3} – {o2,o3}
    {p1} – {o2,o3,o4}
  {p2,p3} – {o1,o2,o3,o4}
    p2p3 – o4
    {p2,p3} – {o1,o2,o3}
    ∅ – {o1,o2,o3,o4}
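The splitting tree on this slide can be imitated with a short sketch. It assumes the 4 x 3 table is read as below and that each object with missing properties acts as a splitter: a bi-set containing both the object and some of its missing properties is replaced by one child without the object and one child without those properties. The final closedness filter is a simplification added here; the sketch only mirrors the tree and should not be taken as the actual DMiner algorithm.

```python
# Assumed reading of the 4 x 3 table: object -> properties set to 1.
ROWS = {
    "o1": {"p2", "p3"},
    "o2": {"p1", "p2", "p3"},
    "o3": {"p1", "p2", "p3"},
    "o4": {"p1"},
}
PROPS = frozenset({"p1", "p2", "p3"})


def mine_by_splitting(rows, props):
    """Formal concepts (X, Y) obtained by recursive splitting on 0 cells."""
    bisets = [(frozenset(rows), frozenset(props))]
    for obj in sorted(rows):
        zeros = props - rows[obj]  # properties the object does not have
        if not zeros:
            continue
        next_bisets = []
        for x, y in bisets:
            if obj in x and y & zeros:
                next_bisets.append((x - {obj}, y))  # drop the object
                next_bisets.append((x, y - zeros))  # drop the properties
            else:
                next_bisets.append((x, y))
        bisets = next_bisets

    def is_closed(x, y):
        extent = frozenset(o for o in rows if y <= rows[o])
        intent = frozenset(p for p in props if all(p in rows[o] for o in x))
        return extent == x and intent == y

    # Leaves contain no 0 cell; keep those that are maximal.
    return {(x, y) for x, y in bisets if is_closed(x, y)}


CONCEPTS = mine_by_splitting(ROWS, PROPS)
```

On this table the four leaves of the tree are exactly the four formal concepts.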

34

Experimental validation (connect4)

Plot: time (s) as a function of minimal size (objects)

35

Experimental validation (TF x G)

Plot: time (s) as a function of minimal size (objects)

36

Properties (see Ph.D. J. Besson, 2005)

DMiner is a correct and complete algorithm

Time complexity when n=|O|, m=|P| and n < m

Delay complexity (worst case): O(n^2 * m)

Delay complexity on average: (n - log_2(|C|) + 1) O(n*m)

37

Back to fault-tolerance

Introducing « 10% errors »

      p1  p2  p3  p4  p5  p6  p7
o1     1   0   1   1   1   0   0
o2     1   1   1   1   0   0   0
o3     1   1   1   1   1   0   0
o4     1   1   1   1   1   0   0
o5     0   0   0   0   0   1   1
o6     0   0   0   0   1   1   0
o7     1   0   0   0   0   0   0

({o1,o2,o3,o4,o7},{p1})

({o1,o2,o3,o4},{p1,p3,p4})

({o2,o3,o4},{p1,p2,p3,p4})

({o3,o4},{p1,p2,p3,p4,p5})

({o1,o3,o4},{p1,p3,p4,p5})

({o1,o3,o4,o6},{p5})

({o5,o6},{p6})

({o5},{p6,p7})

({o6},{p5,p6})

38

Specifying fault-tolerance

Fault-tolerance extensions of formal concepts are proposals for “almost-closed” set patterns

Various attempts

FBS patterns

DRBS patterns

See also dense itemsets, large tiles, subspace clusters, and support envelopes in the Data Mining community

NB. Probably much more has been done within the “Concept Lattice” community

39

Example of a FBS pattern

{G2} is a 1-free set

40

Cont.

Computing its 1-closure

41

But …

42

Pros and Cons for FBS patterns

Efficient algorithms for mining δ-free itemsets and thus FBS patterns

Non symmetrical

When δ=0, we get formal concepts

43

DR-bi-sets

Formal concepts:
No « 0 »
At least one « 0 » per column
At least one « 0 » per row

DR-bi-sets:
Cdense: at most α « 0 » per row and per column
Crelevant: at least ε « 0 » more than inside per column; at least ε « 0 » more than inside per row

+ Maximality Constraint

44

An example of a DR-bi-set

α=α’=ε=ε’=1 At most 1 “0” value per row and per column

1 more “0” value outside (at least 2)

45

Computing DR bi-sets

      g1  g2  g3  g4
s1     1   1   1   1
s2     1   1   0   0
s3     0   1   0   0
s4     0   0   1   0
s5     0   0   0   0

46

Mining the Internet benchmark (1)

Plot: number of patterns

47

Mining the Internet benchmark (2)

Plot: increasing ratio

48

Pros and Cons w.r.t. DRBS

When α=α’=0 and ε=ε’=1, we get formal concepts

Symmetrical approach

Hard to compute … a correct and complete algorithm exists

Some nice properties but « preserving more » of the Galois connection properties would be nicer …

... It remains open … at least for a data miner
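The first claim on this slide can be checked on a toy matrix. The sketch below is one possible reading of the two constraints, with a single α and a single ε instead of the slide's α, α’, ε, ε’, and without the maximality constraint: Cdense bounds the number of « 0 » in each inside row and column, and Crelevant asks each outside row and column for at least ε more « 0 » than the worst inside one. The matrix is the 5 x 4 table from the "Computing DR bi-sets" slide as read here, and all names are illustrative.

```python
from itertools import combinations

# Assumed reading of the 5 x 4 table: row -> columns set to 1.
ROWS = {
    "s1": {"g1", "g2", "g3", "g4"},
    "s2": {"g1", "g2"},
    "s3": {"g2"},
    "s4": {"g3"},
    "s5": set(),
}
COLS = ["g1", "g2", "g3", "g4"]


def zeros_in_row(o, y):
    return sum(1 for p in y if p not in ROWS[o])


def zeros_in_col(p, x):
    return sum(1 for o in x if p not in ROWS[o])


def is_dense_and_relevant(x, y, alpha, eps):
    """Cdense and Crelevant for a non-empty bi-set (X, Y)."""
    worst_row = max(zeros_in_row(o, y) for o in x)
    worst_col = max(zeros_in_col(p, x) for p in y)
    dense = worst_row <= alpha and worst_col <= alpha
    rows_ok = all(zeros_in_row(o, y) >= worst_row + eps
                  for o in ROWS if o not in x)
    cols_ok = all(zeros_in_col(p, x) >= worst_col + eps
                  for p in COLS if p not in y)
    return dense and rows_ok and cols_ok


def is_formal_concept(x, y):
    extent = {o for o in ROWS if y <= ROWS[o]}
    intent = {p for p in COLS if all(p in ROWS[o] for o in x)}
    return extent == x and intent == y


def nonempty_subsets(items):
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)


PAIRS = [(x, y) for x in nonempty_subsets(ROWS) for y in nonempty_subsets(COLS)]
DR_BISETS = {(x, y) for x, y in PAIRS if is_dense_and_relevant(x, y, 0, 1)}
CONCEPTS = {(x, y) for x, y in PAIRS if is_formal_concept(x, y)}
```

With α = 0 and ε = 1, DR_BISETS and CONCEPTS coincide on this matrix (four bi-sets each, restricted to non-empty X and Y).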

49

Post-processing local patterns

A pragmatic way to tackle fault-tolerance

Post-processing collections of formal concepts

− e.g., grouping patterns w.r.t. their similarities to decrease the number of hypotheses that have to be interpreted

clustering formal concepts

− To increase hypothesis relevancy thanks to fault-tolerance

This can be applied on many other pattern types

50

An application

Mining Synexpression Groups from SAGE data

90 x 5237 Boolean matrix encoding over-expression of human genes in various organs and tissues

Looking for sets of co-over-expressed genes and their associated samples

− From 64836 to 1669 formal concepts that have been grouped into QSGs by means of a hierarchical clustering

One of these QSGs has been analyzed in depth (Blachon et al. ISB 2007)

51

Mining quasi-synexpression groups

1. Measuring similarities between formal concepts

2. Hierarchical clustering

3. Visualization

3. Visualization
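Steps 1 and 2 can be sketched as follows. The slides do not say which similarity measure or linkage is used, so two assumptions are made here: similarity is the Jaccard index on the cells X x Y covered by two concepts, and grouping is a naive single-link agglomeration. The input is the list of nine concepts from the « 10% errors » slide; all names are illustrative.

```python
from itertools import product

# The nine formal concepts listed on the "10% errors" slide.
CONCEPTS = [
    (("o1", "o2", "o3", "o4", "o7"), ("p1",)),
    (("o1", "o2", "o3", "o4"), ("p1", "p3", "p4")),
    (("o2", "o3", "o4"), ("p1", "p2", "p3", "p4")),
    (("o3", "o4"), ("p1", "p2", "p3", "p4", "p5")),
    (("o1", "o3", "o4"), ("p1", "p3", "p4", "p5")),
    (("o1", "o3", "o4", "o6"), ("p5",)),
    (("o5", "o6"), ("p6",)),
    (("o5",), ("p6", "p7")),
    (("o6",), ("p5", "p6")),
]


def cells(concept):
    """Cells (object, property) covered by a concept."""
    x, y = concept
    return set(product(x, y))


def jaccard(a, b):
    ca, cb = cells(a), cells(b)
    return len(ca & cb) / len(ca | cb)


def single_link(concepts, k):
    """Merge the two most similar clusters until k clusters remain."""
    clusters = [[c] for c in concepts]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = max(jaccard(a, b)
                           for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


GROUPS = single_link(CONCEPTS, k=2)
```

Under these assumptions the nine fragments fall into a group of six concepts around o1..o4 and a group of three around o5 and o6, so two hypotheses remain to be interpreted instead of nine.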

52

Relationship to co-clustering

Clustering and Co-Clustering

Useful feedback on global structures

Heuristic local optimization

Clustering objects and/or properties vs. clustering local patterns

      p1  p2  p3  p4  p5
o1     1   0   1   1   0
o2     0   1   0   0   1
o3     1   0   1   1   0
o4     0   0   1   1   0
o5     1   1   0   0   1
o6     0   1   0   0   1
o7     0   0   0   0   1

53

A proposal (Ph.D. R. G. Pensa, 2006)

      P1  P2  P3  P4  P5
O1     1   0   1   1   0
O2     0   1   0   0   1
O3     1   0   1   1   0
O4     0   0   1   1   0
O5     1   1   0   0   1
O6     0   1   0   0   1
O7     0   0   0   0   1

The matrix is shown twice, once for the local patterns and once for the global structure.

Compute K (e.g., 2) co-clusters from available bi-sets

54

Back to the Inductive Database vision

Inductive Database Management System

Extensional/intensional data
Patterns
Models

55

A concrete example: SQUAT

SAGE data (H. sapiens, G. gallus, M. musculus)
Domain knowledge, e.g., GO
Collections of formal concepts

http://bsmc.insa-lyon.fr/squat/login.php

56

Perspectives

Closed Set Mining Under Constraints
Fault tolerance: δ-bi-sets, DR-bi-sets, δ-tolerant closed sets, …
Pattern Domains: n-ary relations, multi-relational data, sequences, trees, graphs ...
FCA & extensions

ICFCA may help

      p1  p2  p3
o1     1   0   1
o2     0   1   0
o3     1   1   0
o4     1   1   1

< (o1,o4),(p1,p3) >

In a binary relation subset of O x P, 2-closed sets are formal concepts

Many solvers are available

An example of an extension

          p1  p2  p3
q1   o1    1   0   1
     o2    0   1   0
q2   o1    1   1   1
     o2    1   0   1

< (o1),(p1,p3),(q1,q2) >

In an n-ary relation subset of A1 x ... x An, a closed n-set generalizes a formal concept: it binds subsets of every Ai s.t. each of them is closed w.r.t. all the others.

Computing the patterns is much harder

Trias (ICDM’06) – CubeMiner (VLDB’06) – Data Peeler (SDM’08)
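The definition of a closed n-set can be tested by brute force on this small ternary relation. The sketch assumes the table is read as below (one 0/1 string per (q, o) pair, columns p1..p3) and checks two things: every combination of the chosen elements is in the relation, and no element of any domain can be added without breaking that. It is not how Trias, CubeMiner or Data Peeler work; names are illustrative.

```python
from itertools import product

# Assumed reading of the slide's table: (q, o) -> bits for p1, p2, p3.
TABLE = {
    ("q1", "o1"): "101",
    ("q1", "o2"): "010",
    ("q2", "o1"): "111",
    ("q2", "o2"): "101",
}
# Ternary relation as a set of (object, property, q) triples.
RELATION = {
    (o, f"p{i + 1}", q)
    for (q, o), bits in TABLE.items()
    for i, bit in enumerate(bits)
    if bit == "1"
}
DOMAINS = [{"o1", "o2"}, {"p1", "p2", "p3"}, {"q1", "q2"}]


def all_ones(sets):
    """Every combination of the chosen elements is in the relation."""
    return all(t in RELATION for t in product(*sets))


def is_closed_n_set(sets):
    """All ones, and no domain can be extended while staying all ones."""
    if not all_ones(sets):
        return False
    for i, domain in enumerate(DOMAINS):
        for extra in domain - sets[i]:
            extended = list(sets)
            extended[i] = sets[i] | {extra}
            if all_ones(extended):
                return False
    return True
```

The pattern < (o1),(p1,p3),(q1,q2) > passes the check, while < (o1),(p1),(q1,q2) > fails because p3 can still be added.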

The same relation with the properties reordered as p1, p3, p2:

          p1  p3  p2
q1   o1    1   1   0
     o2    0   0   1
q2   o1    1   1   1
     o2    1   1   0

< (o1),(p1,p3),(q1,q2) >

Trias – CubeMiner – Data Peeler

61

« Local to Global »

Closed Sets Under Constraints
Local Patterns: collections of local patterns
Global patterns: clustering, co-clustering, classifiers
Condensed representations, new features, model characterization…
Knowledge nuggets

62

Summary

Formal concepts are an interesting special case of constrained bi-sets and may be studied as such

Formal concepts are not actionable patterns in many real-case applications

Complete but also heuristic solvers that exploit user-defined constraints can support the search for actionable patterns based on formal concepts, including fault-tolerant ones

63

To know more

Inductive Databases & Constraint-based Mining
Outcome of the Black Forest meeting organized in March 2004 in Hinterzarten (D)

J-F. Boulicaut, L. De Raedt, H. Mannila (Eds.)
Constraint-based Mining and Inductive Databases
Springer-Verlag LNAI 3848, 2006, 399 pages.

http://liris.cnrs.fr/~jboulica/

http://iq.ijs.si/IQ/

64

Thanks to EU funded FET projects

cInQ (2001-2004)

consortium on knowledge discovery by Inductive Queries

− Theory for local pattern mining

IQ (2005-2008)

Inductive Queries for mining patterns and models

− Theory for global pattern/model mining

65

Thanks for your attention