Putting OAC-triclustering on MapReduce

Sergey Zudin, Dmitry V. Gnatyshak, and Dmitry I. Ignatov

National Research University Higher School of Economics, Russian Federation
Faculty of Computer Science

CLA 2015, Clermont-Ferrand, France
October 13-16

S. Zudin et al. () OAC-triclustering on MapReduce CLA 2015 1 / 39

Outline

1 Motivation and previous work

2 Prime OAC-triclustering
    Triadic formal concept analysis
    Basic algorithm
    Online version of the algorithm

3 OAC-triclustering on MapReduce
    MapReduce technology
    MapReduce implementation

4 Experiments
    Description of the experiments
    Datasets
    Results

5 Conclusion


Motivation

Large amounts of multimodal data:

Gene expression data
Folksonomies
Recommender systems
Communities in multi-mode (social) networks
Pattern mining in relational databases
. . .

Non-binary data can be scaled (possibly increasing the dimensionality)

Growing volumes of big data: fast and/or distributed algorithms are required (linear or sublinear, one-pass)

Existing methods: finding all n-sets (multimodal clusters) satisfying some conditions (often an exponential number of patterns)

Motivation
IMDB example, [Mirkin et al., 2011]

Clump Movie-Keyword-Genre

Bicluster

{12 Angry Men (1957), To Kill a Mockingbird (1962), Witness for the Prosecution (1957)}, {Murder, Trial}, {n/a}

Tricluster

{12 Angry Men (1957), Double Indemnity (1944), Chinatown (1974), The Big Sleep (1946), Witness for the Prosecution (1957), Dial M for Murder (1954), Shadow of a Doubt (1943)}, {Murder, Trial, Widow, Marriage, Private detective, Blackmail, Letter}, {Crime, Drama, Thriller, Mystery, Film-Noir}

Previous and related work
A short (not full) list

Triadic FCA [Wille, 1995; Lehmann and Wille, 1995] and Polyadic FCA [Voutsadakis, 2002]

TRIAS [Jaeschke et al., 2006] for mining (frequent) triconcepts

DataPeeler for closed n-sets [Cerf et al., 2009], MultiDupeHack [Cerf et al., 2013]

TriBox [Mirkin et al., 2011] for mining dense triboxes with LS criterion

Box OAC-triclustering and Spectral Triclustering [Ignatov et al., 2011, 2013]

Multi-way set enumeration in weight tensors [Scholkopf et al., 2011]

Quadri-concepts for personalised folksonomies [Jelassi et al., 2012, 2013]

Prime OAC-triclustering [Gnatyshak et al., 2012–2014]

Triadic Boolean tensor factorisation [Miettinen et al., 2011; Belohlavek et al., 2013] and Boolean tensor clustering [Miettinen et al., 2015]

Closed and connected patterns in multi-relational data [Spyropoulou et al., 2011–14]

Triadic FCA and triclustering: Searching for optimal patterns. Machine Learning journal [Ignatov et al., 2015] and CLA 2013

. . .


Prime OAC-triclustering
Formal concept analysis: triadic case

Definition

Let G, M, B be sets and let the ternary relation I be a subset of their Cartesian product: I ⊆ G × M × B. Then the tuple K = (G, M, B, I) is called a triadic formal context. G is a set of objects, M is a set of attributes, and B is a set of conditions.

[Table: cross-table of an example triadic context with objects g1–g4 as rows and attributes m1–m3 as columns, one slice per condition b1, b2, b3. The rows contain 8 (g1), 5 (g2), 4 (g3) and 6 (g4) crosses.]

Definition

Galois operators (prime operators) are defined in a similar way to the dyadic case:

$2^G \to 2^M \times 2^B$, $2^G \times 2^M \to 2^B$

$2^M \to 2^G \times 2^B$, $2^G \times 2^B \to 2^M$

$2^B \to 2^G \times 2^M$, $2^M \times 2^B \to 2^G$

({g1, g2}, {m1, m2})′ = {b1, b3}

m2′ = {(g1, b1), (g2, b1), (g3, b1), (g1, b2), (g1, b3), (g2, b3), (g4, b3)}

Definition

The triple (X, Y, Z) is called a triadic formal concept of the context K = (G, M, B, I) if X ⊆ G, Y ⊆ M, Z ⊆ B, (X, Y)′ = Z, (X, Z)′ = Y, and (Y, Z)′ = X. X is called the (formal) extent, Y the (formal) intent, and Z the (formal) modus.

Prime OAC-triclustering
Basic algorithm [Gnatyshak et al., 2013]

This method uses the following types of prime operators (for the context K = (G, M, B, I)):

(g, m)′ = {b ∈ B | (g, m, b) ∈ I},
(g, b)′ = {m ∈ M | (g, m, b) ∈ I},
(m, b)′ = {g ∈ G | (g, m, b) ∈ I}

Definition

Then the triple T = ((m, b)′, (g, b)′, (g, m)′) is called the prime-based OAC-tricluster for a triple (g, m, b) ∈ I. The components of a tricluster are called, respectively, the tricluster extent, intent, and modus. The triple (g, m, b) is called a generating triple of the tricluster T.

Definition

Density of a tricluster: $\rho(X, Y, Z) = \frac{|I \cap (X \times Y \times Z)|}{|X|\,|Y|\,|Z|}$
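The definitions above can be tried on a toy context. The following is a minimal sketch, not the authors' code: the function names (`prime_sets`, `tricluster`, `density`) and the three-triple relation are assumptions made for illustration only.

```python
from itertools import product

def prime_sets(triples):
    """Build the three prime dictionaries from a set of (g, m, b) triples."""
    obj_attr, obj_cond, attr_cond = {}, {}, {}
    for g, m, b in triples:
        obj_attr.setdefault((g, m), set()).add(b)   # (g, m)' is a set of conditions
        obj_cond.setdefault((g, b), set()).add(m)   # (g, b)' is a set of attributes
        attr_cond.setdefault((m, b), set()).add(g)  # (m, b)' is a set of objects
    return obj_attr, obj_cond, attr_cond

def tricluster(triple, primes):
    """Prime-based OAC-tricluster ((m, b)', (g, b)', (g, m)') of a generating triple."""
    g, m, b = triple
    obj_attr, obj_cond, attr_cond = primes
    return (frozenset(attr_cond[m, b]),
            frozenset(obj_cond[g, b]),
            frozenset(obj_attr[g, m]))

def density(tri, triples):
    """Share of cells of the cuboid X x Y x Z that belong to the relation I."""
    X, Y, Z = tri
    hits = sum(1 for cell in product(X, Y, Z) if cell in triples)
    return hits / (len(X) * len(Y) * len(Z))

# Hypothetical toy relation I (not the context shown on the slides).
I = {("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")}
T = tricluster(("g1", "m1", "b1"), prime_sets(I))
rho = density(T, I)
```

Here the cuboid of `T` has four cells, of which three belong to `I`, so the tricluster is dense but not absolutely dense.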

[Figure: an example of a tricluster based on the triple (g, m, b).]

Input: K = (G, M, B, I), a triadic context; ρmin, a density threshold
Output: 𝒯 = {T = (X, Y, Z)}

𝒯 := ∅
for all (g, m) : g ∈ G, m ∈ M do
    PrimesObjAttr[g, m] = (g, m)′
end for
for all (g, b) : g ∈ G, b ∈ B do
    PrimesObjCond[g, b] = (g, b)′
end for
for all (m, b) : m ∈ M, b ∈ B do
    PrimesAttrCond[m, b] = (m, b)′
end for
for all (g, m, b) ∈ I do
    T = (PrimesAttrCond[m, b], PrimesObjCond[g, b], PrimesObjAttr[g, m])
    Tkey = hash(T)
    if Tkey ∉ 𝒯.keys ∧ ρ(T) ≥ ρmin then
        𝒯[Tkey] := T
    end if
end for
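A compact Python rendering of the basic algorithm might look as follows. It is a sketch under assumptions: the name `basic_oac_prime` is hypothetical, prime sets are computed only for pairs that actually occur in I (empty prime sets are never used by a generating triple), and the tricluster itself serves as the dictionary key instead of an explicit hash(T).

```python
from itertools import product

def basic_oac_prime(triples, rho_min=0.0):
    """Return {tricluster: density} for all distinct triclusters with density >= rho_min."""
    triples = set(triples)
    obj_attr, obj_cond, attr_cond = {}, {}, {}
    # Prime sets for every pair that occurs in the relation.
    for g, m, b in triples:
        obj_attr.setdefault((g, m), set()).add(b)
        obj_cond.setdefault((g, b), set()).add(m)
        attr_cond.setdefault((m, b), set()).add(g)

    result = {}
    for g, m, b in triples:
        T = (frozenset(attr_cond[m, b]),
             frozenset(obj_cond[g, b]),
             frozenset(obj_attr[g, m]))
        if T in result:          # the same tricluster was generated by another triple
            continue
        X, Y, Z = T
        rho = sum(cell in triples for cell in product(X, Y, Z)) / (len(X) * len(Y) * len(Z))
        if rho >= rho_min:       # density threshold check
            result[T] = rho
    return result

# Hypothetical toy relation, used only to exercise the function.
toy = [("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")]
all_triclusters = basic_oac_prime(toy)
dense_triclusters = basic_oac_prime(toy, rho_min=0.8)
```

With the threshold 0.8, the tricluster ({g1, g2}, {m1, m2}, {b1}) of density 0.75 is filtered out and the two absolutely dense ones remain.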

Prime OAC-triclustering
Online version of the algorithm [Gnatyshak et al., 2014]

Let K = (G, M, B, I) be a triadic context. We do not know G, M, B, I, or their cardinalities in advance.

Input on each iteration: {(g, m, b)} = J ⊆ I.
Goal: maintain an up-to-date version of the results and update them efficiently when new triples are received.

We need to keep in memory the results of applying the prime operators (prime sets):

PrimesObjAttr: dictionary with elements of type ((g, m), {b ∈ B}), g ∈ G, m ∈ M;

PrimesObjCond: dictionary with elements of type ((g, b), {m ∈ M}), g ∈ G, b ∈ B;

PrimesAttrCond: dictionary with elements of type ((m, b), {g ∈ G}), m ∈ M, b ∈ B.

Remark

In this case we need to treat triclusters generated by different triples as different, even if their extents, intents, and modi are equal.

Algorithm of triples addition:

Input: J is a set of triples to add;
    𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set;
    PrimesObjAttr, PrimesObjCond, PrimesAttrCond
Output: 𝒯 = {T = (∗X, ∗Y, ∗Z)};
    PrimesObjAttr, PrimesObjCond, PrimesAttrCond

for all (g, m, b) ∈ J do
    PrimesObjAttr[g, m] := PrimesObjAttr[g, m] ∪ {b}
    PrimesObjCond[g, b] := PrimesObjCond[g, b] ∪ {m}
    PrimesAttrCond[m, b] := PrimesAttrCond[m, b] ∪ {g}
    𝒯 := 𝒯 ∪ {(&PrimesAttrCond[m, b], &PrimesObjCond[g, b], &PrimesObjAttr[g, m])}
end for
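The pointer-based update can be sketched in Python, where storing a reference to a mutable `set` plays the role of the `&` and `∗` notation. The class name `OnlineOACPrime` and its attributes are assumptions for illustration, not names from the authors' implementation.

```python
from collections import defaultdict

class OnlineOACPrime:
    """Keeps the three prime dictionaries and one tricluster per generating triple."""

    def __init__(self):
        self.obj_attr = defaultdict(set)   # (g, m) -> set of conditions
        self.obj_cond = defaultdict(set)   # (g, b) -> set of attributes
        self.attr_cond = defaultdict(set)  # (m, b) -> set of objects
        self.triclusters = []              # tuples of references to the prime sets

    def add(self, batch):
        """Process a batch J of new (g, m, b) triples with constant work per triple."""
        for g, m, b in batch:
            self.obj_attr[g, m].add(b)
            self.obj_cond[g, b].add(m)
            self.attr_cond[m, b].add(g)
            # Store references, not copies: later updates of a prime set are
            # visible through every tricluster that points to it.
            self.triclusters.append(
                (self.attr_cond[m, b], self.obj_cond[g, b], self.obj_attr[g, m])
            )

state = OnlineOACPrime()
state.add([("g1", "m1", "b1")])
first = state.triclusters[0]
state.add([("g2", "m1", "b1")])   # extends the extent seen through `first`
```

Because `first` only holds references, its extent already contains both `g1` and `g2` after the second batch, without any recomputation.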

A user may wish to remove triclusters with the same extent, intent, and modus at the post-processing stage. At this stage we can also check various conditions (for instance, a minimal density condition).

Input: 𝒯 = {T = (∗X, ∗Y, ∗Z)} is the current tricluster set
Output: 𝒯′ = {T = (∗X, ∗Y, ∗Z)}, the processed tricluster hash-set

𝒯′ := ∅
for all T ∈ 𝒯 do
    Compute hash(T)
    if hash(T) ∉ 𝒯′.keys() then
        𝒯′ := 𝒯′ ∪ {T}
    end if
end for

Complexity summary:

Time complexity: O(|I|) (as there is a constant number of operations at each step).

More precisely: 8|I| operations in total; for each triple:
1 Modification of 3 prime sets (3);
2 Creation of a new tricluster (1);
3 Addition of pointers to its extent, intent, and modus (3);
4 Addition of the tricluster to the set of all triclusters (1).

Memory complexity: O(|I|) (as we need to keep in memory only the prime sets: |I| elements in each dictionary plus keys).

Example:

→ (g1, m1, b1)

1 PrimesObjAttr = {((g1, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m1])}

→ (g1, m2, b1)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1}), ((m2, b1), {g1})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b1], PrimesObjCond[g1, b1], PrimesObjAttr[g1, m2])}

→ (g2, m1, b1)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m1])}

→ (g2, m2, b1)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b1], PrimesObjCond[g2, b1], PrimesObjAttr[g2, m2])}

→ (g3, m3, b1)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m3, b1], PrimesObjCond[g3, b1], PrimesObjAttr[g3, m3])}

→ (g1, m2, b2)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b2], PrimesObjCond[g1, b2], PrimesObjAttr[g1, m2])}

→ (g2, m1, b2)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1}), ((m1, b2), {g2})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m1, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m1])}

→ (g2, m2, b2)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1, b2}), ((g3, m3), {b1})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1, m2})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1, g2}), ((m1, b2), {g2})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m2, b2], PrimesObjCond[g2, b2], PrimesObjAttr[g2, m2])}

→ (g3, m3, b2)

1 PrimesObjAttr = {((g1, m1), {b1}), ((g1, m2), {b1, b2}), ((g2, m1), {b1, b2}), ((g2, m2), {b1, b2}), ((g3, m3), {b1, b2})}
2 PrimesObjCond = {((g1, b1), {m1, m2}), ((g2, b1), {m1, m2}), ((g3, b1), {m3}), ((g1, b2), {m2}), ((g2, b2), {m1, m2}), ((g3, b2), {m3})}
3 PrimesAttrCond = {((m1, b1), {g1, g2}), ((m2, b1), {g1, g2}), ((m3, b1), {g3}), ((m2, b2), {g1, g2}), ((m1, b2), {g2}), ((m3, b2), {g3})}
4 𝒯 := 𝒯 ∪ {(PrimesAttrCond[m3, b2], PrimesObjCond[g3, b2], PrimesObjAttr[g3, m3])}

Postprocessing:

1 T(g1,m1,b1) = ({g1, g2}, {m1, m2}, {b1}) ← add
2 T(g1,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← add
3 T(g2,m1,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
4 T(g2,m2,b1) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
5 T(g3,m3,b1) = ({g3}, {m3}, {b1, b2}) ← add
6 T(g1,m2,b2) = ({g1, g2}, {m2}, {b1, b2}) ← add
7 T(g2,m1,b2) = ({g2}, {m1, m2}, {b1, b2}) ← add
8 T(g2,m2,b2) = ({g1, g2}, {m1, m2}, {b1, b2}) ← the same as T(g1,m2,b1), skip
9 T(g3,m3,b2) = ({g3}, {m3}, {b1, b2}) ← the same as T(g3,m3,b1), skip

The final output set of triclusters:

1 T1 = ({g1, g2}, {m1, m2}, {b1})
2 T2 = ({g1, g2}, {m1, m2}, {b1, b2})
3 T3 = ({g3}, {m3}, {b1, b2})
4 T4 = ({g1, g2}, {m2}, {b1, b2})
5 T5 = ({g2}, {m1, m2}, {b1, b2})
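The whole worked example can be replayed in a few lines. This is an illustrative sketch rather than the authors' code: the nine triples are the ones fed to the online algorithm above, in the same order, while the variable names are assumptions.

```python
from collections import defaultdict

# The nine triples of the example, in arrival order.
triples = [
    ("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1"),
    ("g2", "m2", "b1"), ("g3", "m3", "b1"), ("g1", "m2", "b2"),
    ("g2", "m1", "b2"), ("g2", "m2", "b2"), ("g3", "m3", "b2"),
]

obj_attr, obj_cond, attr_cond = defaultdict(set), defaultdict(set), defaultdict(set)
generated = []  # one tricluster (as references to prime sets) per generating triple

# Online stage: update the prime sets and register a tricluster for each triple.
for g, m, b in triples:
    obj_attr[g, m].add(b)
    obj_cond[g, b].add(m)
    attr_cond[m, b].add(g)
    generated.append((attr_cond[m, b], obj_cond[g, b], obj_attr[g, m]))

# Post-processing stage: keep the first occurrence of each distinct tricluster.
final, seen = [], set()
for X, Y, Z in generated:
    key = (frozenset(X), frozenset(Y), frozenset(Z))
    if key not in seen:
        seen.add(key)
        final.append(key)

for i, (X, Y, Z) in enumerate(final, start=1):
    print(f"T{i} = ({sorted(X)}, {sorted(Y)}, {sorted(Z)})")
```

Nine triclusters are generated, four of them are duplicates, and the five that remain coincide with T1 to T5 listed above.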


MapReduce technology
MapReduce scheme [Dean and Ghemawat, 2004]

MapReduce example

Figure: Word counting. Source: http://blog.trifork.com/2009/08/04/introduction-to-hadoop/
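The word-counting scheme can be imitated in plain Python without Hadoop. This is only a single-process sketch of the map, shuffle, and reduce phases; the function names and the three input lines are assumptions, not taken from the figure.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word of an input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

# Hypothetical input split into three lines.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, which is why the communication cost of the emitted pairs matters.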

MapReduce technology
Communication costs: Mining of Massive Datasets [Leskovec et al., 2013]

Chapter 2: MapReduce and the New Software Stack

“Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.”

MapReduce implementation
The previous lattice-oriented M/R implementations

A version of the Close-by-One algorithm was ported to the M/R framework [Krajca & Vychodil, 2009]

An M/R algorithm for computation of closed cube lattices was proposed [Kudryavcev & Kuznecov, 2009]

[Xu et al., 2012] demonstrated that iterative algorithms like Ganter’s NextClosure can benefit from the usage of iterative M/R schemes

MapReduce implementation
Technologies and code repositories

Technologies used

Apache Hadoop (http://hadoop.apache.org/)

Apache Maven (framework for automatic project assembly)

Apache Commons (for working with extended Java collections)

Google Guava (utilities and data structures)

Jackson JSON (open-source library for converting the object-oriented representation of an object, such as a tricluster, to a string)

TypeTools (for real-time type resolution of inbound and outbound key-value pairs)

. . .

Implementations

Source 1: “Chaining-job” module (https://github.com/zydins/chaining-job)

Source 2: M/R-based OAC Triclustering (https://github.com/zydins/DistributedTriclustering)

Two-stage MapReduce implementation
Distributed OAC-triclustering: First Map

Input: S is a set of input triples as strings;
    r is the number of reducers;
    i is a grouping index (objects, attributes or conditions).
Output: J is a list of ⟨key, triple⟩ pairs.

for all s ∈ S do
    t := transform(s)
    key := hash(t[i]) mod r
    J := J ∪ {⟨key, t⟩}
end for
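A Python sketch of the first map is given below. It assumes tab-separated input strings and uses `zlib.crc32` as a deterministic stand-in for the hash function; both choices, as well as the function names, are assumptions and are not taken from the authors' Java implementation.

```python
import zlib

def transform(s):
    """Parse one input string into a (g, m, b) triple (assumed tab-separated)."""
    g, m, b = s.strip().split("\t")
    return (g, m, b)

def first_map(lines, r, i):
    """Assign each triple to one of r reducers by hashing its i-th component."""
    out = []
    for s in lines:
        t = transform(s)
        # crc32 is stable across processes, unlike Python's built-in hash() for strings.
        key = zlib.crc32(t[i].encode("utf-8")) % r
        out.append((key, t))
    return out

# Hypothetical input; i = 0 groups the triples by object.
pairs = first_map(["g1\tm1\tb1", "g1\tm2\tb1", "g2\tm1\tb1"], r=4, i=0)
```

All triples that share the grouping entity receive the same key, so they end up at the same first-stage reducer.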

Two-stage MapReduce implementation
Distributed OAC-triclustering: First Reduce

Input: J is a list of triples (for a certain key);
    𝒯 = {T = (X, Y, Z)} is the current set of triclusters;
    PrimesOA, PrimesOC, PrimesAC.
Output: file of strings, the encoded ⟨triple, tricluster⟩ pairs.

Primes ← initialise a new multimap
for all (g, m, b) ∈ J do
    Primes[g, m] := Primes[g, m] ∪ {b}
    Primes[g, b] := Primes[g, b] ∪ {m}
    Primes[m, b] := Primes[m, b] ∪ {g}
end for
for all (g, m, b) ∈ J do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    s := encode(⟨(g, m, b), T⟩)
    store s
end for
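The first reduce could be sketched as follows. The JSON record layout (`"triple"`, `"tricluster"`) and the key tags `"OA"`, `"OC"`, `"AC"`, which keep the three kinds of pairs apart inside one multimap, are assumptions made for this illustration; the source only states that Jackson JSON is used for encoding.

```python
import json
from collections import defaultdict

def first_reduce(triples):
    """Build partial prime sets from one reducer's triples and encode the results."""
    primes = defaultdict(set)  # a single multimap, keyed by a tagged pair
    for g, m, b in triples:
        primes["OA", g, m].add(b)
        primes["OC", g, b].add(m)
        primes["AC", m, b].add(g)

    lines = []
    for g, m, b in triples:
        tricluster = [
            sorted(primes["AC", m, b]),  # extent
            sorted(primes["OC", g, b]),  # intent
            sorted(primes["OA", g, m]),  # modus
        ]
        lines.append(json.dumps({"triple": [g, m, b], "tricluster": tricluster}))
    return lines

# Hypothetical reducer input.
encoded = first_reduce([("g1", "m1", "b1"), ("g2", "m1", "b1")])
record = json.loads(encoded[0])
```

Each reducer sees only its own share of the triples, so the prime sets it outputs may be incomplete; they are merged at the second stage.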

Two-stage MapReduce implementation
Distributed OAC-triclustering: Second Map

Input: S is a list of strings.
Output: 𝒯 is a list of ⟨tricluster, tricluster⟩ pairs.

Primes ← initialise a new multimap
for all s ∈ S do
    ⟨(g, m, b), T⟩ := decode(s)
    update Primes multimap appropriately
    I := I ∪ {(g, m, b)}
end for
for all (g, m, b) ∈ I do
    T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
    𝒯 := 𝒯 ∪ {⟨T, T⟩}
end for

Two-stage MapReduce implementation
Distributed OAC-triclustering: Second Reduce

Input: 𝒯 is a list of ⟨tricluster, list of triclusters⟩ pairs.
Output: File with a final set of triclusters {T = (X, Y, Z)}.

for all ⟨T, [T, . . . , T]⟩ ∈ 𝒯 do
    store T
end for
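The second stage can be sketched in the same style. The step "update Primes multimap appropriately" is read here as taking the union of the partial extent, intent, and modus into the corresponding prime sets; this reading, the JSON record layout, and the key tags are assumptions rather than details confirmed by the slides.

```python
import json
from collections import defaultdict

def second_map(lines):
    """Merge partial prime sets and emit a <tricluster, tricluster> pair per triple."""
    primes = defaultdict(set)
    triples = []
    for s in lines:
        record = json.loads(s)
        g, m, b = record["triple"]
        X, Y, Z = record["tricluster"]
        primes["AC", m, b].update(X)  # merge partial extents
        primes["OC", g, b].update(Y)  # merge partial intents
        primes["OA", g, m].update(Z)  # merge partial modi
        triples.append((g, m, b))
    out = []
    for g, m, b in triples:
        T = (frozenset(primes["AC", m, b]),
             frozenset(primes["OC", g, b]),
             frozenset(primes["OA", g, m]))
        out.append((T, T))
    return out

def second_reduce(pairs):
    """Group equal keys and store every tricluster once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups)

# Hypothetical partial results produced by two different first-stage reducers.
lines = [
    json.dumps({"triple": ["g1", "m1", "b1"], "tricluster": [["g1"], ["m1"], ["b1"]]}),
    json.dumps({"triple": ["g2", "m1", "b1"], "tricluster": [["g2"], ["m1"], ["b1"]]}),
]
result = second_reduce(second_map(lines))
```

After merging, both generating triples yield the same tricluster, and the reducer keeps it only once.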

Two-stage MapReduce implementation
Communication costs

The time complexity of the M/R solution is composed of two terms, one for each stage: O(|I|/r) (or O(|I|)) and O(|I|).

The replication rate for the first M/R stage is $r_1 = 1$ (each triple is passed as one key-value pair); the reducer size is $q_1 = |I|/r$.

The replication rate for the second M/R stage is $r_2 = 1$ (it assigns one key-value pair to each tricluster), but the reducer size varies from $q_2^{\min} = 1$ (no duplicate triclusters) to $q_2^{\max} = |I|$ (one final tricluster when all the initial triples belong to one absolutely dense cuboid).
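The two extreme cases of the second-stage reducer size can be checked numerically. The sketch below computes the second-stage key-value pairs directly on one machine; the helper names and the two toy relations (a dense 3 × 3 × 3 cuboid and a "diagonal" relation) are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

def second_stage_pairs(triples):
    """Emit one <tricluster, generating triple> pair per input triple."""
    oa, oc, ac = defaultdict(set), defaultdict(set), defaultdict(set)
    for g, m, b in triples:
        oa[g, m].add(b)
        oc[g, b].add(m)
        ac[m, b].add(g)
    return [((frozenset(ac[m, b]), frozenset(oc[g, b]), frozenset(oa[g, m])), (g, m, b))
            for g, m, b in triples]

def costs(pairs, n_inputs):
    """Return (replication rate, largest reducer size) for a list of keyed pairs."""
    sizes = Counter(key for key, _ in pairs)
    return len(pairs) / n_inputs, max(sizes.values())

# Worst case: an absolutely dense cuboid, where every triple generates the same tricluster.
dense = list(product(["g1", "g2", "g3"], ["m1", "m2", "m3"], ["b1", "b2", "b3"]))
# Best case: no two triples share any entity, so all triclusters are distinct.
diagonal = [("g1", "m1", "b1"), ("g2", "m2", "b2"), ("g3", "m3", "b3")]

r_dense, q_dense = costs(second_stage_pairs(dense), len(dense))
r_diag, q_diag = costs(second_stage_pairs(diagonal), len(diagonal))
```

For the dense cuboid a single reducer receives all |I| = 27 pairs, whereas for the diagonal relation every reducer receives exactly one.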


Experiments
Description of the experiments

OS X 10, 1.8 GHz Intel Core i5, 4 GB 1600 MHz DDR3, and 8 GB of free space on the hard drive (typical commodity hardware).

Two M/R modes have been tested: a sequential mode of task completion and an emulation of the distributed mode with 16 first reducers and 32 threads for the second stage.

To evaluate the runtime more carefully, the average result of 5 runs of the algorithms has been recorded for each context.

Experiments
Datasets

Synthetic datasets. 1) 20,000 triples (25 unique entities of each type); 2) 100,000 triples (50 unique entities of each type); 3) 1,000,000 triples (all possible combinations of 100 unique entities of each type).
The 1st dataset contains duplicates, since 25 × 25 × 25 gives only 15,625 unique triples. The 2nd one contains fewer triples than $50^3 = 125{,}000$, the number of all possible combinations. The 3rd one is an absolutely dense cuboid 100 × 100 × 100.
The 3rd dataset does not result in $3^{\min(|G|,|M|,|B|)}$ formal triconcepts; it is an example of the worst-case scenario for the second reducer ($q_2^{\max} = |I|$).

IMDB. Top-250 list of the best movies from the Internet Movie Database.

Bibsonomy. The data of bibsonomy.org from the ECML PKDD Discovery Challenge 2008.

Context      |G|      |M|       |B|       # triples    Density
20k          25       25        25        20,000       1
100k         50       50        50        100,000      0.8
1m           100      100       100       1,000,000    1
IMDB         250      795       22        3,818        0.00087
BibSonomy    2,337    67,464    28,920    816,197      1.8 · 10^−7

Experiments
Results

Algorithm/Context       IMDB             20k        100k       1m         Bibsonomy
                        (≈3k triples)    triples    triples    triples    (≈800k triples)
Tribox                  324              800        1,265      >3,000     >3,000
TRIAS                   189              362        862        >3,000     >3,000
OAC Box                 374              756        1,265      >3,000     >3,000
OAC Prime               7                8          734        >3,000     >3,000
Online OAC prime        3                3          3          5          >3,000
M/R OAC prime seq.      12               30         81         166        1,534
M/R OAC prime distr.    1                15         20         25         520

Alternative MapReduce decomposition
Variant I: First stage

First Map: Finding primes. During this phase every input triple (g, m, b) is encoded by three key-value pairs ⟨(g, m), b⟩, ⟨(g, b), m⟩, and ⟨(m, b), g⟩. These pairs are passed to the first reducer.

The replication rate is $r_1 = 3$.

First Reduce: Finding primes. This reducer fills three corresponding dictionaries for primes of keys. So, for example, the first dictionary, PrimeOA, contains key-value pairs ⟨(g, m), {b1, b2, . . . , bn}⟩.

The reducer size is $q_1 = \max(|G|, |M|, |B|)$.

The process can be stopped after the first reduce phase, and all the triclusters can then be found as (Prime[g, m], Prime[g, b], Prime[m, b]) by enumerating (g, m, b) ∈ I. However, to do this faster and keep the result for further computation, it is possible to use M/R as well.
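The first stage of Variant I is easy to imitate in plain Python. In this sketch the tags `"OA"`, `"OC"`, and `"AC"` distinguish the three kinds of keys; they, the function names, and the toy relation are assumptions made for illustration.

```python
from collections import defaultdict

def map_primes(triple):
    """Encode one triple by three key-value pairs, one per prime dictionary."""
    g, m, b = triple
    return [(("OA", g, m), b), (("OC", g, b), m), (("AC", m, b), g)]

def reduce_primes(pairs):
    """Collect the values of every key into its prime set."""
    primes = defaultdict(set)
    for key, value in pairs:
        primes[key].add(value)
    return dict(primes)

# Hypothetical toy relation.
I = [("g1", "m1", "b1"), ("g1", "m2", "b1"), ("g2", "m1", "b1")]
pairs = [pair for triple in I for pair in map_primes(triple)]
primes = reduce_primes(pairs)
replication_rate = len(pairs) / len(I)
```

Every input triple is sent three times, which gives the replication rate of 3, and each reducer key collects at most as many values as there are entities of the missing type.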

Alternative MapReduce decomposition
Variant I: Second stage

Second Map: Tricluster generation. The second map performs the tricluster combining job: for each triple (g, m, b) it composes a new key-value pair ⟨(g, m, b), ∅⟩. For each pair of any of the three types, ⟨(g, m), Prime[g, m]⟩, ⟨(g, b), Prime[g, b]⟩, and ⟨(m, b), Prime[m, b]⟩, it generates the key-value pairs ⟨(g, m, b), Prime[g, m]⟩, ⟨(g, m, b), Prime[g, b]⟩, and ⟨(g, m, b), Prime[m, b]⟩, where g ∈ G, m ∈ M, and b ∈ B.

$r_2 = \frac{|I| + 3|G||M||B|}{|I| + |G||M| + |G||B| + |M||B|} \le \frac{\rho + 3}{\rho + 3/\max(|G|, |M|, |B|)}$, where ρ is the input tricontext density.

Second Reduce: Tricluster generation. For each key (g, m, b), the generating triple, the second reducer assembles a single value, its tricluster (Prime[g, m], Prime[g, b], Prime[m, b]). If there is no key-value pair ⟨(g, m, b), ∅⟩ for a particular triple (g, m, b), it does not output any key-value pair for that key.

The reducer size $q_2$ is either 3 (no output) or 4 (tricluster assembled).
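The bound on the replication rate follows from dividing the numerator and the denominator by |G||M||B| and using 1/|G| + 1/|M| + 1/|B| ≥ 3/max(|G|, |M|, |B|). It can be checked numerically with the context sizes from the datasets table; the function names below are assumptions, and the code is only a sanity check, not part of the authors' implementation.

```python
def r2_exact(n_triples, n_g, n_m, n_b):
    """Replication rate of the second map in Variant I, as given by the formula above."""
    cuboid = n_g * n_m * n_b
    return (n_triples + 3 * cuboid) / (n_triples + n_g * n_m + n_g * n_b + n_m * n_b)

def r2_bound(n_triples, n_g, n_m, n_b):
    """Upper bound expressed through the density rho of the input tricontext."""
    rho = n_triples / (n_g * n_m * n_b)
    return (rho + 3) / (rho + 3 / max(n_g, n_m, n_b))

# Context sizes taken from the datasets table: IMDB and the dense 1m cuboid.
imdb = (3818, 250, 795, 22)
dense_1m = (1_000_000, 100, 100, 100)

for context in (imdb, dense_1m):
    print(r2_exact(*context), r2_bound(*context))
```

When |G| = |M| = |B|, as in the dense cuboid, the bound is attained (up to floating-point rounding); for unbalanced contexts such as IMDB it is loose.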

Alternative MapReduce decomposition
Variant II: Second stage

Second Map: Tricluster generation with duplicate generating triples. The second map performs the tricluster combining job: for each triple (g, m, b) it composes a new key-value pair ⟨(Prime[g, m], Prime[g, b], Prime[m, b]), (g, m, b)⟩.

Second Reduce: Tricluster generation with duplicate generating triples. The second reducer just groups the values for each key: ⟨(X, Y, Z), {(g1, m1, b1), . . . , (gn, mn, bn)}⟩.


Conclusion and further work

A MapReduce implementation of Prime OAC-triclustering has been proposed.

Communication costs have been analysed.

A comparison of the online version and the M/R one has been performed.

Further experiments are needed with other M/R variants and other triclustering algorithms.

A proper comparison of the proposed OAC triclustering with noise-tolerant patterns in n-ary relations, e.g., those found by DataPeeler descendants [Cerf et al., 2013], has not yet been conducted.

Thank you!

Questions?
