Association rules and compositional data analysis: implications to big data

R. S. Kenett (1), J.A. Martín-Fernández (2), S. Thió-Henestrosa (2) and M. Vives-Mestres (2)

(1) KPA Group, Israel; University of Turin, Italy; and Neaman Institute, Technion, Israel
(2) Universitat de Girona, Spain


CoDaWork 2017 2

This is work in progress.

The long-term goal is to introduce CoDa to text (semantic data) analysis and to scale it to big data.

Association Rules (AR)

Transaction (document, itemset)

LHS (A): antecedent
RHS (B): consequent

Basket analysis

Terms, items, tokens, words

AR: Support, Confidence, Lift and Odds Ratio

          RHS     ^RHS
LHS       x1      x2      g
^LHS      x3      x4      1-g
          f       1-f     1

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4.$

support{A=>B} = x1
Proportion of transactions in which an item set appears

confidence{A=>B} = x1/g
Strength of implication, or predictive power

lift{A=>B} = confidence{A=>B} / support{B} = support{A=>B} / (support{A} support{B})
Lift < 1: A and B repel each other
Lift > 1: A and B have affinity to each other

OR{A=>B} = (x1*x4)/(x2*x3)
OR < 1: A and B repel each other
OR > 1: A and B have affinity to each other
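The four measures on this slide can be computed directly from the cells of the 2x2 table. The following is a minimal sketch, assuming the cell layout of the slide (x1 = A and B, x2 = A without B, x3 = B without A, x4 = neither); the function name `rule_measures` and the example counts are hypothetical.

```python
def rule_measures(x1, x2, x3, x4):
    """Support, confidence, lift and odds ratio of A => B from a 2x2 table.

    x1 = A and B, x2 = A and not B, x3 = not A and B, x4 = neither.
    Counts are accepted too: the cells are first rescaled to sum to 1.
    """
    total = x1 + x2 + x3 + x4
    x1, x2, x3, x4 = (v / total for v in (x1, x2, x3, x4))
    g = x1 + x2                      # support{A}, row margin
    f = x1 + x3                      # support{B}, column margin
    confidence = x1 / g
    return {
        "support": x1,
        "confidence": confidence,
        "lift": confidence / f,      # equals x1 / (g * f)
        "odds_ratio": (x1 * x4) / (x2 * x3),
    }


# Hypothetical counts: 40 transactions with A and B, 10 with A only,
# 20 with B only, 30 with neither.
measures = rule_measures(40, 10, 20, 30)
```

Here lift and the odds ratio are both above 1, so under the slide's reading A and B have affinity to each other.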

The Simplex

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4.$

          RHS     ^RHS
LHS       x1      x2
^LHS      x3      x4

Relative Linkage Disequilibrium

          RHS     ^RHS
LHS       x1      x2      g
^LHS      x3      x4      1-g
          f       1-f     1

$D = x_1 x_4 - x_2 x_3$

$\mathrm{RLD} = \dfrac{D}{D_M}$

$\mathbf{X} = \mathbf{f} \otimes \mathbf{g} + D\,\mathbf{e}$

where

$\mathbf{X} = (x_1, x_2, x_3, x_4)$
$\mathbf{f} = (f, 1 - f)$
$\mathbf{g} = (g, 1 - g)$
$\mathbf{e} = (1, -1, -1, 1)$

The term $\mathbf{f} \otimes \mathbf{g}$ carries the independence part and $D\,\mathbf{e}$ the dependence part. At independence:

lift{A=>B} = 1
OR{A=>B} = 1

Kenett, R.S. (1983). On an Exploratory Analysis of Contingency Tables. J R Stat Soc Series D, 32, 395-403.

Kenett, R.S. (2014). Frequency vectors and contingency tables: a non-parametric and graphical analysis. Girona Seminar, 27/11/14.
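The decomposition X = f ⊗ g + D e, with D = x1 x4 - x2 x3 and e = (1, -1, -1, 1), can be checked numerically. The sketch below assumes the cell ordering (x1, x2, x3, x4) of the slide's table, so that f ⊗ g is laid out as (f g, (1-f) g, f (1-g), (1-f)(1-g)); the function name `decompose` and the example table are hypothetical.

```python
def decompose(x1, x2, x3, x4):
    """Split a 2x2 table (cells summing to 1) into f (x) g and D * e."""
    f = x1 + x3                       # column margin, support{B}
    g = x1 + x2                       # row margin, support{A}
    D = x1 * x4 - x2 * x3             # linkage disequilibrium
    independent = (f * g, (1 - f) * g, f * (1 - g), (1 - f) * (1 - g))
    e = (1, -1, -1, 1)
    # Adding D * e to the independence table should give back the original cells.
    rebuilt = tuple(ind + D * e_i for ind, e_i in zip(independent, e))
    return independent, D, rebuilt


independent, D, rebuilt = decompose(0.4, 0.1, 0.2, 0.3)
```

For this table D is positive, so the observed x1 exceeds the value f g expected under independence.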

CoDa Analysis and Principles

• Multiplicative tools applied to CoDa are equivalent to classical additive (Euclidean) tools applied to log-ratio values

• Transform CoDa, e.g. to isometric log-ratio coordinates: ilr(x)

• Scale invariance (vectors P = [p1,…,pD] and P’ = αP, α > 0, give the same information)

• Subcompositional coherence

Logratio (multiplicative) approach

Simplex: raw data (%)

$\mathbf{x} = (x_1, x_2, \dots, x_D)$

Real space: log-ratio coordinates (alr, clr, or ilr)

$\operatorname{clr}(\mathbf{x}) = \left(\log\dfrac{x_1}{g(\mathbf{x})}, \log\dfrac{x_2}{g(\mathbf{x})}, \dots, \log\dfrac{x_D}{g(\mathbf{x})}\right)$, where $g(\mathbf{x})$ is the geometric mean of the parts

$\operatorname{ilr}(\mathbf{x}) = (\operatorname{ilr}_1(\mathbf{x}), \dots, \operatorname{ilr}_{D-1}(\mathbf{x}))$

[Figure: ternary diagram with vertices x1, x2, x3 (simplex) and scatter plot with axes ilr1 and ilr2 (real space)]
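As an illustration of the log-ratio approach and of scale invariance, the sketch below computes clr coordinates for a composition and for a rescaled copy of it; both give the same coordinates. The helper names `closure` and `clr` and the example values are hypothetical.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered log-ratio: log of each part over the geometric mean g(x)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)          # log of the geometric mean
    return [value - mean_log for value in logs]


p = [10.0, 30.0, 60.0]                        # hypothetical raw data (%)
scaled = [2.5 * value for value in p]         # P' = alpha * P with alpha = 2.5

clr_p = clr(p)
clr_scaled = clr(scaled)                      # same coordinates as clr_p
```

The clr coordinates always sum to zero, which is why ilr coordinates, with D-1 components, are often preferred for further analysis.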

CoDa Analysis of 2X2 tables T

          RHS     ^RHS
LHS       x1      x2      g
^LHS      x3      x4      1-g
          f       1-f     1

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4.$

$\operatorname{ilr}(\mathbf{T}) = \left(\dfrac{1}{2}\ln\dfrac{x_1 x_4}{x_2 x_3},\ \dfrac{\sqrt{2}}{2}\ln\dfrac{x_1}{x_4},\ \dfrac{\sqrt{2}}{2}\ln\dfrac{x_2}{x_3}\right).$

Sequential Binary Partition (SBP): Pawlowsky-Glahn and Buccianti, 2011, Chapter 2.

CoDa Analysis of 2X2 tables

ilr-coordinates   ilr1                      ilr2                 ilr3
T                 (1/2) ln(x1x4/(x2x3))     (√2/2) ln(x1/x4)     (√2/2) ln(x2/x3)
Tind              0                         (√2/2) ln(x1/x4)     (√2/2) ln(x2/x3)
Tint              (1/2) ln(x1x4/(x2x3))     0                    0

Tind: independence
Tint: interaction

Perturbation operation: Tint is obtained by subtracting table Tind from T.

ilr1(T) < 0: negative effect between itemsets (A true, B less likely true)

ilr1(T) = 0: independence

ilr1(T) > 0: positive effect (A true, B more likely true)
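The ilr coordinates and the split of T into Tind and Tint can be sketched as follows. This assumes the coordinates ilr1 = (1/2) ln(x1x4/(x2x3)), ilr2 = (√2/2) ln(x1/x4), ilr3 = (√2/2) ln(x2/x3); the closed form used for Tind is derived here from the condition that its ilr1 is 0 while ilr2 and ilr3 are unchanged, and all function names are hypothetical.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return tuple(p / total for p in parts)


def ilr_2x2(x1, x2, x3, x4):
    """ilr coordinates of a 2x2 table with strictly positive cells."""
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))
    ilr2 = math.sqrt(2) / 2 * math.log(x1 / x4)
    ilr3 = math.sqrt(2) / 2 * math.log(x2 / x3)
    return ilr1, ilr2, ilr3


def independent_part(x1, x2, x3, x4):
    """T_ind: table with ilr1 = 0 and the same ilr2, ilr3 as T."""
    return closure((math.sqrt(x1 / x4), math.sqrt(x2 / x3),
                    math.sqrt(x3 / x2), math.sqrt(x4 / x1)))


def interaction_part(x1, x2, x3, x4):
    """T_int: T perturbed by the inverse of T_ind (cell-wise division, then closure)."""
    t_ind = independent_part(x1, x2, x3, x4)
    return closure(tuple(x / y for x, y in zip((x1, x2, x3, x4), t_ind)))


table = (0.4, 0.1, 0.2, 0.3)                  # hypothetical 2x2 table
coords = ilr_2x2(*table)                      # coords[0] > 0: positive effect
coords_ind = ilr_2x2(*independent_part(*table))
coords_int = ilr_2x2(*interaction_part(*table))
```

Adding `coords_ind` and `coords_int` component by component recovers `coords`, which is the additive counterpart of the perturbation in the simplex.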

CoDa Analysis of 2X2 tables

$\mathbf{X} = \mathbf{f} \otimes \mathbf{g} + D\,\mathbf{e}$

ilr(T) = ilr(Tind) + ilr(Tint), with Tind the independence part and Tint the interaction part.

Let $\|\mathbf{T}\|_a = \|\operatorname{ilr}(\mathbf{x})\|$ be the Aitchison norm of a table T; then $\|\mathbf{T}\|_a^2 = \|\mathbf{T}_{ind}\|_a^2 + \|\mathbf{T}_{int}\|_a^2$, that is, one has a decomposition of the Aitchison norm of table T.
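The norm decomposition can be seen in one step from the ilr coordinates of the two parts, since Tind has a zero first coordinate and Tint has zero second and third coordinates. A short worked derivation, using only the coordinates ilr1, ilr2, ilr3 of T:

```latex
\begin{aligned}
\operatorname{ilr}(\mathbf{T}_{ind}) &= \bigl(0,\ \operatorname{ilr}_2(\mathbf{T}),\ \operatorname{ilr}_3(\mathbf{T})\bigr),
\qquad
\operatorname{ilr}(\mathbf{T}_{int}) = \bigl(\operatorname{ilr}_1(\mathbf{T}),\ 0,\ 0\bigr), \\
\|\mathbf{T}\|_a^2 &= \operatorname{ilr}_1^2(\mathbf{T}) + \operatorname{ilr}_2^2(\mathbf{T}) + \operatorname{ilr}_3^2(\mathbf{T})
 = \|\mathbf{T}_{int}\|_a^2 + \|\mathbf{T}_{ind}\|_a^2 .
\end{aligned}
```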


CoDa Simplicial Deviance (SD)

$\operatorname{SD}(\mathbf{T}) = \|\mathbf{T}_{int}\|_a^2 = \dfrac{1}{4}\ln^2\dfrac{x_1 x_4}{x_2 x_3} = \operatorname{ilr}_1^2(\mathbf{T})$

$\dfrac{x_1 x_4}{x_2 x_3} = 1 \iff \ln\dfrac{x_1 x_4}{x_2 x_3} = 0 \iff \operatorname{ilr}_1(\mathbf{T}) = 0 \iff \operatorname{SD} = 0$

$\mathbf{X} = \mathbf{f} \otimes \mathbf{g} + D\,\mathbf{e}$ (independence term $\mathbf{f} \otimes \mathbf{g}$, interaction term $D\,\mathbf{e}$)

CoDa Relative Simplicial Deviance (RSD)

$\operatorname{RSD}(\mathbf{T}) = \dfrac{\operatorname{SD}(\mathbf{T})}{\|\mathbf{T}\|_a^2} = \dfrac{\operatorname{ilr}_1^2(\mathbf{T})}{\|\operatorname{ilr}(\mathbf{T})\|^2}$

$\mathrm{RLD} = \dfrac{D}{D_M}$

If D > 0
    then if x3 < x2
        then RLD = D / (D + x3)
        else RLD = D / (D + x2)
    else if x1 < x4
        then RLD = D / (D - x1)
        else RLD = D / (D - x4)

RSD takes values in the interval [0, 1].
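SD, RSD and RLD can be computed side by side for a given table. The sketch below assumes strictly positive cells, the ilr coordinates ilr1 = (1/2) ln(x1x4/(x2x3)), ilr2 = (√2/2) ln(x1/x4), ilr3 = (√2/2) ln(x2/x3), and the branch structure of the RLD rule with D = x1x4 - x2x3; the function names are hypothetical.

```python
import math


def simplicial_deviance(x1, x2, x3, x4):
    """SD(T) = ilr1(T)^2 = (1/4) * ln^2(x1*x4 / (x2*x3))."""
    return 0.25 * math.log((x1 * x4) / (x2 * x3)) ** 2


def relative_simplicial_deviance(x1, x2, x3, x4):
    """RSD(T) = SD(T) / ||T||_a^2 (undefined for the uniform table, where the norm is 0)."""
    sd = simplicial_deviance(x1, x2, x3, x4)
    ilr2_sq = 0.5 * math.log(x1 / x4) ** 2
    ilr3_sq = 0.5 * math.log(x2 / x3) ** 2
    return sd / (sd + ilr2_sq + ilr3_sq)


def rld(x1, x2, x3, x4):
    """Relative linkage disequilibrium, following the if/else rule on the slide."""
    D = x1 * x4 - x2 * x3
    if D > 0:
        return D / (D + x3) if x3 < x2 else D / (D + x2)
    return D / (D - x1) if x1 < x4 else D / (D - x4)


table = (0.4, 0.1, 0.2, 0.3)                  # hypothetical 2x2 table
sd = simplicial_deviance(*table)
rsd = relative_simplicial_deviance(*table)
rld_value = rld(*table)
```

Both RSD and RLD are relative measures on a bounded scale, which makes rules with very different supports easier to compare than with SD or D alone.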

Bootstrap Algorithm

Egozcue et al. (2015) introduce a bootstrap algorithm consisting of the following steps:

i) Calculate Tind, Tint, SD and RSD.
ii) Simulate 10000 multinomial samples T(k), assuming the independence hypothesis H0: T = Tind is true. For each table T(k), calculate T(k)ind, T(k)int, SD(k) and RSD(k).
iii) Compare the values of SD and RSD with the distributions of the 10000 values of SD(k) and RSD(k), respectively, to obtain the percentile p-value (left tail). Calculate the 0.05 significance critical points (5th percentile) in the left tail of each distribution.
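The three steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the implementation of Egozcue et al. (2015): SD and RSD are computed directly from ilr coordinates rather than via explicit Tind and Tint tables for each sample, simulated tables containing an empty cell are skipped because their log-ratios are undefined (an assumption not taken from the slide), and all function names are hypothetical.

```python
import math
import random


def sd_rsd(x1, x2, x3, x4):
    """SD = ilr1^2 and RSD = SD / squared Aitchison norm, for positive cells."""
    sd = (0.5 * math.log((x1 * x4) / (x2 * x3))) ** 2
    norm2 = sd + 0.5 * math.log(x1 / x4) ** 2 + 0.5 * math.log(x2 / x3) ** 2
    return sd, (sd / norm2 if norm2 > 0 else 0.0)


def independent_part(x1, x2, x3, x4):
    """Cell probabilities of T_ind (ilr1 = 0, same ilr2 and ilr3 as T)."""
    raw = (math.sqrt(x1 / x4), math.sqrt(x2 / x3),
           math.sqrt(x3 / x2), math.sqrt(x4 / x1))
    total = sum(raw)
    return [r / total for r in raw]


def bootstrap_independence(counts, n_sim=10000, seed=0):
    """Left-tail percentiles of the observed SD and RSD under H0: T = T_ind."""
    rng = random.Random(seed)
    n = sum(counts)
    sd_obs, rsd_obs = sd_rsd(*counts)                 # step i
    p_ind = independent_part(*counts)
    sd_sim, rsd_sim = [], []
    for _ in range(n_sim):                            # step ii
        draws = rng.choices(range(4), weights=p_ind, k=n)
        table = [draws.count(cell) for cell in range(4)]
        if min(table) == 0:
            continue                                  # assumption: skip empty cells
        sd_k, rsd_k = sd_rsd(*table)
        sd_sim.append(sd_k)
        rsd_sim.append(rsd_k)
    sd_sim.sort()
    rsd_sim.sort()
    cut = int(0.05 * len(sd_sim))                     # index of the 5th percentile
    return {                                          # step iii
        "sd": sd_obs,
        "rsd": rsd_obs,
        "p_sd": sum(v <= sd_obs for v in sd_sim) / len(sd_sim),
        "p_rsd": sum(v <= rsd_obs for v in rsd_sim) / len(rsd_sim),
        "sd_critical": sd_sim[cut],
        "rsd_critical": rsd_sim[cut],
    }


# Hypothetical table of counts with a strong positive association.
result = bootstrap_independence((400, 100, 100, 400), n_sim=300, seed=1)
```

The example uses 300 replicates only to keep the run short; the slide specifies 10000.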

CoDa Measures for Association Rules

$\operatorname{lift}(\mathrm{AR}) = \dfrac{x_1}{(x_1 + x_2)(x_1 + x_3)}$

$D(\mathrm{AR}) = x_1 x_4 - x_2 x_3$

$\operatorname{lift}(\mathrm{AR}) = 1 + \dfrac{D(\mathrm{AR})}{(x_1 + x_2)(x_1 + x_3)}$
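The link between lift and D follows from expanding the product of the margins and using the constraint x1 + x2 + x3 + x4 = 1. A short worked derivation:

```latex
\begin{aligned}
(x_1 + x_2)(x_1 + x_3) &= x_1(x_1 + x_2 + x_3) + x_2 x_3
 = x_1(1 - x_4) + x_2 x_3
 = x_1 - D(\mathrm{AR}), \\
\operatorname{lift}(\mathrm{AR}) &= \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}
 = \frac{(x_1 + x_2)(x_1 + x_3) + D(\mathrm{AR})}{(x_1 + x_2)(x_1 + x_3)}
 = 1 + \frac{D(\mathrm{AR})}{(x_1 + x_2)(x_1 + x_3)} .
\end{aligned}
```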

CoDa Measures for Association Rules

$\operatorname{OR}^*(\mathrm{AR}) = \text{Yule's } Q(\mathrm{AR}) = \dfrac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}$

OR(AR) = odds(B|A) / odds(B|^A) = (x1 x4)/(x2 x3).

lift(AR) = 1 ⟺ D(AR) = 0 ⟺ OR(AR) = 1

OR is defined in [0, +∞); OR* is defined in [-1, 1].

CoDa Measures for Association Rules

$C(\mathrm{AR}) = \operatorname{ilr}_1(\mathbf{T})$

$C^*(\mathrm{AR}) = \tanh(C(\mathrm{AR})) = \operatorname{OR}^*(\mathrm{AR}) = \text{Yule's } Q(\mathrm{AR})$

C is defined in (-∞, +∞); C* is defined in [-1, 1].
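The identity between tanh(C) and Yule's Q can be checked numerically: with C = (1/2) ln(OR), tanh(C) = (OR - 1)/(OR + 1), which is Q after multiplying numerator and denominator by x2 x3. A minimal sketch, with hypothetical function names and example table:

```python
import math


def c_measure(x1, x2, x3, x4):
    """C(AR) = ilr1(T) = (1/2) * ln(x1*x4 / (x2*x3))."""
    return 0.5 * math.log((x1 * x4) / (x2 * x3))


def yules_q(x1, x2, x3, x4):
    """Yule's Q = (x1*x4 - x2*x3) / (x1*x4 + x2*x3)."""
    return (x1 * x4 - x2 * x3) / (x1 * x4 + x2 * x3)


table = (0.4, 0.1, 0.2, 0.3)                  # hypothetical 2x2 table
c_star = math.tanh(c_measure(*table))         # bounded version of C
q = yules_q(*table)                           # agrees with c_star up to rounding
```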

CoDa Measures for Association Rules - C(AR)

A Case Study


https://treato.com


https://treato.com/Nicardipine/?a=s


Document Term Matrix (DTM)


Association Rules Analysis

Association Rules by consequent

(lowest Lift) Vasoconstriction


Top 10 Association Rules by Lift (in red)


Top 10 Association Rules by Lift (in red)

CoDa AR Visualization: ilr plot by consequent item

[Figure: histogram of the interaction coordinate; x-axis ilr.1, y-axis frequency]

CoDa AR Visualization: ilr plot

[Figure: ilr plot; axis label: interaction]

CoDa AR Visualization: clr plot by consequent item

[Figure: clr plot; axis label: interaction]

Conclusions

• Compositional measures of independence SD and RSD are coherent with the simplicial geometry of the simplex, the sample space of contingency tables of AR.

• The relation between CoDa-AR measures and other common measures facilitates the interpretation of negative and positive effects between itemsets.

• The CoDa geometry provides techniques for visualizing the measures when all the significant AR of a large database are analyzed.

• The principles of coherence and scalability, which are fundamental to CoDa, are relevant to big data text analysis.

• More research in this area is needed.

Acknowledgements

https://www.kaggle.com/c/instacart-market-basket-analysis


Thank you for your attention