association rules and compositional · 2017. 6. 4. · association rules and compositional data...
Association rules and compositional data analysis: implications to big data
R. S. Kenett1, J.A. Martín-Fernández2, S. Thió-Henestrosa2 and M. Vives-Mestres2
1 KPA Group, Israel; University of Turin, Italy and Neaman Institute, Technion, Israel
2 Universitat de Girona, Spain
CoDaWork 2017 2
This is work in progress
The long-term goal is to introduce CoDa to text (semantic data) analysis and to scale it to big data.
Association Rules (AR)
Transaction (document, itemset)
LHS (A): antecedent
RHS (B): consequent
Basket Analysis
Terms, items, tokens, words
AR: Support, Confidence, Lift and Odds Ratio

            RHS (B)   ^RHS
LHS (A)     x1        x2        g
^LHS        x3        x4        1-g
            f         1-f       1

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4$

support{A=>B} = x1
Proportion of transactions in which an item set appears

confidence{A=>B} = x1/g
Strength of implication, or predictive power

lift{A=>B} = confidence{A=>B} / support{B} = support{A=>B} / (support{A} support{B}) = x1/(f g)
Lift < 1: A and B repel each other
Lift > 1: A and B have affinity to each other

OR{A=>B} = (x1*x4)/(x2*x3)
OR < 1: A and B repel each other
OR > 1: A and B have affinity to each other
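The four measures above can be sketched in a few lines of Python. This is only an illustration: the function name `ar_measures` and the example table are hypothetical, and the sketch assumes all four cells are strictly positive so that no division by zero occurs.

```python
def ar_measures(x1, x2, x3, x4):
    """Support, confidence, lift and odds ratio of the rule A => B.

    Cell layout (counts or proportions), as in the table above:
                RHS   ^RHS
        LHS     x1    x2
        ^LHS    x3    x4
    """
    total = x1 + x2 + x3 + x4
    # Normalise so that the four cells sum to 1.
    x1, x2, x3, x4 = (x / total for x in (x1, x2, x3, x4))
    g = x1 + x2  # support of the antecedent A (row margin)
    f = x1 + x3  # support of the consequent B (column margin)
    confidence = x1 / g
    return {
        "support": x1,
        "confidence": confidence,
        "lift": confidence / f,
        "odds_ratio": (x1 * x4) / (x2 * x3),
    }


# Hypothetical table, made up for illustration only.
measures = ar_measures(0.3, 0.2, 0.1, 0.4)
```

For this made-up table both lift and odds ratio are above 1, so A and B have affinity to each other under either measure.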
The Simplex

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4$

            RHS (B)   ^RHS
LHS (A)     x1        x2
^LHS        x3        x4
Relative Linkage Disequilibrium
$D = x_1 x_4 - x_2 x_3$
$RLD = D / D_M$
lift{A=>B} = 1
OR{A=>B} = 1

$X = \underbrace{f \otimes g}_{\text{independence}} + \underbrace{D\, e \otimes e}_{\text{dependence}}$
where $X = (x_1, x_2, x_3, x_4)$, $f = (f, 1-f)$, $g = (g, 1-g)$, $e = (1, -1)$
For example, $x_1 = fg + D$, so that $D = 0$ under independence.

            RHS       ^RHS
LHS         x1        x2        g
^LHS        x3        x4        1-g
            f         1-f       1

Kenett, R.S. (1983). On an Exploratory Analysis of Contingency Tables. J R Stat Soc Series D, 32, 395-403.
Kenett, R.S. (2014). Frequency vectors and contingency tables: a nonparametric and graphical analysis. Girona Seminar, 27/11/14.
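A small numerical check of the independence/dependence split of a 2x2 table may help. This is a sketch, assuming the cells are proportions summing to 1 and that the cell order is (x1, x2, x3, x4) with row margin g = x1 + x2 and column margin f = x1 + x3, as in the table; the function name is hypothetical.

```python
def kenett_decomposition(x1, x2, x3, x4):
    """Split a 2x2 table of proportions into independence and dependence parts."""
    g = x1 + x2  # row margin (LHS)
    f = x1 + x3  # column margin (RHS)
    d = x1 * x4 - x2 * x3
    # Product of the margins, written cell by cell in the order x1, x2, x3, x4.
    independence = (f * g, (1 - f) * g, f * (1 - g), (1 - f) * (1 - g))
    # D times the pattern (1, -1, -1, 1).
    dependence = (d, -d, -d, d)
    return independence, dependence


# Hypothetical table, made up for illustration only.
ind, dep = kenett_decomposition(0.3, 0.2, 0.1, 0.4)
rebuilt = [a + b for a, b in zip(ind, dep)]
```

Adding the two parts cell by cell gives back the original table, here (0.3, 0.2, 0.1, 0.4) up to rounding.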
CoDa Analysis and Principles
• Multiplicative tools applied to CoDa are equivalent to classical additive (Euclidean) tools applied to log-ratio values
• Transform CoDa, e.g. to isometric log-ratio coordinates: ilr(x)
• Scale invariance: vectors P = [p1,…,pD] and P' = αP, α > 0, give the same information
• Subcompositional coherence
[Figure: scatter plot in ilr coordinates; axes ilr1 and ilr2, with ticks from -2 to 2 and from -1.5 to 1.5]
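Real space: log-ratio coordinates (alr, clr, or ilr)
$\mathrm{clr}(\mathbf{x}) = \left(\log\frac{x_1}{g(\mathbf{x})}, \log\frac{x_2}{g(\mathbf{x})}, \dots, \log\frac{x_D}{g(\mathbf{x})}\right)$, where $g(\mathbf{x})$ is the geometric mean of the parts of $\mathbf{x}$
$\mathrm{ilr}(\mathbf{x}) = (\mathrm{ilr}_1(\mathbf{x}), \dots, \mathrm{ilr}_{D-1}(\mathbf{x}))$
Simplex: raw data (%)
$\mathbf{x} = (x_1, x_2, \dots, x_D)$
[Figure: diagrams with parts labelled x1, x2, x3]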
Logratio (multiplicative) approach
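CoDa Analysis of 2X2 tables T
$\mathrm{ilr}(\mathbf{T}) = \left( \frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3} \right)$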
Sequential Binary Partition (SBP), Pawlowsky-Glahn and Buccianti, 2011, Chapter 2.
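The three ilr coordinates of a 2x2 table can be computed directly from the cells. The sketch below assumes strictly positive cells (the logarithms are undefined otherwise); the function name `ilr_2x2` is hypothetical.

```python
import math


def ilr_2x2(x1, x2, x3, x4):
    """ilr coordinates of a 2x2 table: {x1, x4} vs {x2, x3}, then x1 vs x4, then x2 vs x3."""
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))
    ilr2 = math.sqrt(2) / 2 * math.log(x1 / x4)
    ilr3 = math.sqrt(2) / 2 * math.log(x2 / x3)
    return ilr1, ilr2, ilr3


# Hypothetical tables, made up for illustration only.
as_proportions = ilr_2x2(0.3, 0.2, 0.1, 0.4)
as_counts = ilr_2x2(30, 20, 10, 40)
```

Because only ratios of cells enter the formulas, counts and proportions give the same coordinates, which is the scale invariance principle of CoDa.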
            RHS (B)   ^RHS
LHS (A)     x1        x2        g
^LHS        x3        x4        1-g
            f         1-f       1

$\sum_{i=1}^{4} x_i = 1, \quad x_i \ge 0, \quad i = 1, \dots, 4$
CoDa Analysis of 2X2 tables
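ilr-coordinates (ilr1, ilr2, ilr3):
T: $\left( \frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3} \right)$
Tind (independence): $\left( 0,\ \frac{\sqrt{2}}{2}\ln\frac{x_1}{x_4},\ \frac{\sqrt{2}}{2}\ln\frac{x_2}{x_3} \right)$
Tint (interaction): $\left( \frac{1}{2}\ln\frac{x_1 x_4}{x_2 x_3},\ 0,\ 0 \right)$
Tint is obtained by the perturbation operation that subtracts table Tind from T.
ilr1(T) < 0 : negative effect between itemsets (A true, B less likely true)
ilr1(T) = 0 : independence
ilr1(T) > 0 : positive effect (A true, B more likely true)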
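The split of a table into its independence and interaction parts can be illustrated as follows. This is a sketch under assumptions: the closed form used for Tind is derived here from the ilr coordinates listed above (ilr1 set to 0, ilr2 and ilr3 kept) and is not given in the slides; cells are assumed strictly positive, and all function names are hypothetical.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def ilr_2x2(t):
    x1, x2, x3, x4 = t
    return (
        0.5 * math.log((x1 * x4) / (x2 * x3)),
        math.sqrt(2) / 2 * math.log(x1 / x4),
        math.sqrt(2) / 2 * math.log(x2 / x3),
    )


def split_table(t):
    """Return (t_ind, t_int) for a 2x2 table with positive cells."""
    x1, x2, x3, x4 = t
    # Assumed closed form: keeps x1/x4 and x2/x3, forces x1*x4 == x2*x3.
    t_ind = closure([math.sqrt(x1 / x4), math.sqrt(x2 / x3),
                     math.sqrt(x3 / x2), math.sqrt(x4 / x1)])
    # Perturbation difference: divide cell by cell, then close.
    t_int = closure([a / b for a, b in zip(t, t_ind)])
    return t_ind, t_int


def aitchison_norm_sq(t):
    """Squared Aitchison norm, computed as the squared length of ilr(t)."""
    return sum(c * c for c in ilr_2x2(t))


# Hypothetical table, made up for illustration only.
table = [0.3, 0.2, 0.1, 0.4]
t_ind, t_int = split_table(table)
```

For this table the squared norms of `t_ind` and `t_int` add up to the squared norm of `table`, and the ilr vectors of the two parts add up to the ilr vector of the table.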
CoDa Analysis of 2X2 tables
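$X = f \otimes g + D\, e \otimes e$
$\mathrm{ilr}(\mathbf{T}) = \mathrm{ilr}(\mathbf{T}_{ind}) + \mathrm{ilr}(\mathbf{T}_{int})$ (independence + interaction).
Let $\|\mathbf{T}\|_a = \|\mathrm{ilr}(\mathbf{T})\|$ be the Aitchison norm of a table T; then $\|\mathbf{T}\|_a^2 = \|\mathbf{T}_{ind}\|_a^2 + \|\mathbf{T}_{int}\|_a^2$, that is, one has a decomposition of the Aitchison norm of table T.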
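CoDa Simplicial Deviance (SD)
$SD(\mathbf{T}) = \|\mathbf{T}_{int}\|_a^2 = \frac{1}{4}\ln^2\frac{x_1 x_4}{x_2 x_3} = \mathrm{ilr}_1^2(\mathbf{T})$
$\frac{x_1 x_4}{x_2 x_3} = 1 \iff \ln\frac{x_1 x_4}{x_2 x_3} = 0 \iff \mathrm{ilr}_1(\mathbf{T}) = 0 \iff SD = 0$
$X = \underbrace{f \otimes g}_{\text{independence}} + \underbrace{D\, e \otimes e}_{\text{interaction}}$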
CoDa Relative Simplicial Deviance (RSD)
$RSD(\mathbf{T}) = \frac{SD}{\|\mathbf{T}\|_a^2} = \frac{\mathrm{ilr}_1^2(\mathbf{T})}{\|\mathrm{ilr}(\mathbf{T})\|^2}$
$RLD = D / D_M$:
If D > 0 then
    if x3 < x2 then RLD = D/(D + x3)
    else RLD = D/(D + x2)
else
    if x1 < x4 then RLD = D/(D - x1)
    else RLD = D/(D - x4)
RSD takes values in the interval [0, 1].
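RSD and RLD can be compared on the same table with a short sketch. The assumptions here are: cells are strictly positive proportions summing to 1, the table is not the uniform table (for which the Aitchison norm is 0 and RSD is undefined), and the function names `rsd` and `rld` are hypothetical.

```python
import math


def rsd(x1, x2, x3, x4):
    """Relative simplicial deviance: ilr1 squared over the squared norm of ilr(T)."""
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))
    ilr2 = math.sqrt(2) / 2 * math.log(x1 / x4)
    ilr3 = math.sqrt(2) / 2 * math.log(x2 / x3)
    return ilr1 ** 2 / (ilr1 ** 2 + ilr2 ** 2 + ilr3 ** 2)


def rld(x1, x2, x3, x4):
    """Relative linkage disequilibrium, following the branching rule above."""
    d = x1 * x4 - x2 * x3
    if d > 0:
        return d / (d + x3) if x3 < x2 else d / (d + x2)
    return d / (d - x1) if x1 < x4 else d / (d - x4)


# Hypothetical table, made up for illustration only.
example_rsd = rsd(0.3, 0.2, 0.1, 0.4)
example_rld = rld(0.3, 0.2, 0.1, 0.4)
```

Both measures are 0 for an independent table such as (0.2, 0.3, 0.2, 0.3) and lie between 0 and 1 otherwise.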
Bootstrap Algorithm
Egozcue et al. (2015) introduce a bootstrap algorithm consisting of the following steps:
i) Calculate Tind, Tint, SD and RSD.
ii) Simulate 10000 multinomial samples T(k), assuming that the independence hypothesis H0: T = Tind is true. For each table T(k), calculate T(k)ind, T(k)int, SD(k) and RSD(k).
iii) Compare the values of SD and RSD with the distributions of the 10000 values of SD(k) and RSD(k), respectively, to obtain the percentile p-value (left tail). Calculate the 0.05 significance critical points (5th percentile) in the left tail of each distribution.
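The three steps can be sketched with the Python standard library. This is not the authors' implementation: the closed form for Tind, the choice to skip simulated tables that contain an empty cell (where the logarithms are undefined), and all function and parameter names are assumptions made for illustration. The sketch returns the empirical percentile of the observed SD and RSD within the simulated distributions and leaves the choice of tail to the reader.

```python
import math
import random


def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]


def sd_rsd(t):
    """SD = ilr1^2 (squared norm of T_int) and RSD = SD / squared norm of T."""
    x1, x2, x3, x4 = t
    ilr1 = 0.5 * math.log((x1 * x4) / (x2 * x3))
    ilr2 = math.sqrt(2) / 2 * math.log(x1 / x4)
    ilr3 = math.sqrt(2) / 2 * math.log(x2 / x3)
    sd = ilr1 ** 2
    norm_sq = sd + ilr2 ** 2 + ilr3 ** 2
    return sd, (sd / norm_sq if norm_sq > 0 else 0.0)


def independent_part(t):
    """Assumed closed form of T_ind: same ilr2 and ilr3 as t, ilr1 = 0."""
    x1, x2, x3, x4 = t
    return closure([math.sqrt(x1 / x4), math.sqrt(x2 / x3),
                    math.sqrt(x3 / x2), math.sqrt(x4 / x1)])


def bootstrap_independence(counts, n_sim=10000, seed=0):
    rng = random.Random(seed)
    n = sum(counts)
    t = closure(counts)
    sd_obs, rsd_obs = sd_rsd(t)            # step i
    t_ind = independent_part(t)
    sd_sim, rsd_sim = [], []
    for _ in range(n_sim):                 # step ii: multinomial samples under H0
        cells = rng.choices(range(4), weights=t_ind, k=n)
        sim = [cells.count(i) for i in range(4)]
        if min(sim) == 0:
            continue                       # assumption: skip tables with an empty cell
        sd_k, rsd_k = sd_rsd(closure(sim))
        sd_sim.append(sd_k)
        rsd_sim.append(rsd_k)
    if not sd_sim:
        raise ValueError("every simulated table had an empty cell")
    # Step iii: where the observed values fall in the simulated distributions.
    pct_sd = sum(v <= sd_obs for v in sd_sim) / len(sd_sim)
    pct_rsd = sum(v <= rsd_obs for v in rsd_sim) / len(rsd_sim)
    return {"sd": sd_obs, "rsd": rsd_obs, "pct_sd": pct_sd,
            "pct_rsd": pct_rsd, "n_used": len(sd_sim)}
```

A call such as `bootstrap_independence([30, 20, 10, 40], n_sim=1000)` on made-up counts returns the observed SD and RSD together with their percentiles. Fixing `seed` makes the simulation reproducible.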
CoDa Measures for Association Rules
$\mathrm{lift}(AR) = \frac{x_1}{(x_1 + x_2)(x_1 + x_3)}$
$D(AR) = x_1 x_4 - x_2 x_3$
$\mathrm{lift}(AR) = 1 + \frac{D(AR)}{(x_1 + x_2)(x_1 + x_3)}$
CoDa Measures for Association Rules
$OR^*(AR) = \text{Yule's } Q(AR) = \frac{x_1 x_4 - x_2 x_3}{x_1 x_4 + x_2 x_3}$
$OR(AR) = \mathrm{odds}(B \mid A) / \mathrm{odds}(B \mid A^c) = \frac{x_1 x_4}{x_2 x_3}$
Lift(AR) = 1 ⟺ D(AR) = 0 ⟺ OR(AR) = 1
OR is defined in [0, +∞); OR* is defined in [-1, 1]
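The equivalence of the three independence conditions follows from a short calculation, sketched here under the assumptions that the cells sum to 1, that $x_2 x_3 > 0$, and with $g = x_1 + x_2$ and $f = x_1 + x_3$ denoting the margins of the table:

```latex
\begin{align*}
f g &= (x_1 + x_3)(x_1 + x_2)
     = x_1 (x_1 + x_2 + x_3) + x_2 x_3
     = x_1 (1 - x_4) + x_2 x_3
     = x_1 - D, \\
\mathrm{lift} &= \frac{x_1}{f g} = \frac{f g + D}{f g} = 1 + \frac{D}{f g}, \\
\mathrm{OR} - 1 &= \frac{x_1 x_4 - x_2 x_3}{x_2 x_3} = \frac{D}{x_2 x_3}.
\end{align*}
```

Since $fg > 0$ and $x_2 x_3 > 0$, lift $- 1$ and OR $- 1$ both have the sign of $D$, so the three conditions hold or fail together.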
CoDa Measures for Association Rules
$C(AR) = \mathrm{ilr}_1(\mathbf{T})$
$C^*(AR) = \tanh(C(AR)) = OR^*(AR) = \text{Yule's } Q(AR)$
C is defined in (-∞, +∞); C* is defined in [-1, 1]
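The identity between tanh(C) and Yule's Q can be checked numerically. The sketch below assumes strictly positive cells; the function names are hypothetical.

```python
import math


def c_measure(x1, x2, x3, x4):
    """C(AR) = ilr1(T) = 0.5 * ln(x1*x4 / (x2*x3))."""
    return 0.5 * math.log((x1 * x4) / (x2 * x3))


def yules_q(x1, x2, x3, x4):
    """Yule's Q = (x1*x4 - x2*x3) / (x1*x4 + x2*x3)."""
    return (x1 * x4 - x2 * x3) / (x1 * x4 + x2 * x3)


# Hypothetical table, made up for illustration only.
cells = (0.3, 0.2, 0.1, 0.4)
c_star = math.tanh(c_measure(*cells))
q = yules_q(*cells)
```

The agreement holds because tanh(y) = (e^{2y} - 1)/(e^{2y} + 1) and e^{2C} is the odds ratio, so tanh(C) = (OR - 1)/(OR + 1).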
CoDa Measures for Association Rules - C(AR)
A Case Study
https://treato.com
https://treato.com/Nicardipine/?a=s
Document Term Matrix (DTM)
Association Rules Analysis
Association Rules by consequent
(lowest Lift) Vasoconstriction
Top 10 Association Rules by Lift (in red)
CoDa AR Visualization: ilr plot by consequent item
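[Figure: histogram of the interaction coordinate ilr.1; horizontal axis ilr.1 from 0.2 to 1.0, vertical axis frequency from 0 to 15]
CoDa AR Visualization: ilr plot
[Figure: ilr plot, with the interaction coordinate labelled]
CoDa AR Visualization: clr plot by consequent item
[Figure: clr plot, with the interaction coordinate labelled]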
Conclusions
• Compositional measures of independence SD and RSD are coherent with the simplicial geometry of the simplex, the sample space of contingency tables of AR.
• The relation between CoDa-AR measures and other common measures facilitates the interpretation of negative and positive effects between itemsets.
• The CoDa geometry provides visualization techniques of measures when all the significant AR of a large database are analyzed.
• The principles of coherence and scalability, which are fundamental to CoDa, are relevant to big data text analysis.
• More research in this area is needed.
Acknowledgements
https://www.kaggle.com/c/instacart-market-basket-analysis
Thank you for your attention