CMU SCS

U Kang (CMU), KDD 2012

GigaTensor: Scaling Tensor Analysis Up By 100 Times - Algorithms and Discoveries

U Kang, Christos Faloutsos, Evangelos Papalexakis, Abhay Harpale

School of Computer Science, Carnegie Mellon University

Outline

Problem Definition

Algorithm

Discoveries

Conclusions

Background: Tensor

Tensors (= multi-dimensional arrays) are everywhere

Hyperlinks and anchor texts in Web graphs

[Figure: 3-mode tensor with modes URL 1, URL 2, and Anchor Text (Java, C++, C#); the nonzero entries shown are 1]
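A tensor like this is almost entirely zeros, so it is normally stored as a list of coordinates rather than a dense array. The following is a minimal sketch of that idea; the URLs and the helper name `build_sparse_tensor` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical (source URL, target URL, anchor text) records.
links = [
    ("a.example", "b.example", "Java"),
    ("a.example", "c.example", "C++"),
    ("b.example", "c.example", "C#"),
    ("a.example", "b.example", "Java"),
]

def build_sparse_tensor(records):
    """Map each mode's labels to integer ids and count how often each triple occurs."""
    index = [{}, {}, {}]   # one label -> id dictionary per mode
    entries = Counter()    # (i, j, k) -> value; absent coordinates are zero
    for record in records:
        coord = tuple(
            index[mode].setdefault(label, len(index[mode]))
            for mode, label in enumerate(record)
        )
        entries[coord] += 1
    return index, entries

index, entries = build_sparse_tensor(links)
shape = tuple(len(labels) for labels in index)
print(shape, dict(entries))
```

Only the nonzero coordinates are kept, so memory grows with the number of observed triples, not with the product of the mode sizes.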

Background: Tensor

Tensors (= multi-dimensional arrays) are everywhere

Sensor streams (time, location, type)

Predicates (subject, verb, object) in a knowledge base

“Barack Obama is the president of the U.S.”

“Eric Clapton plays guitar”

NELL (Never Ending Language Learner) data: 26M × 26M × 48M tensor, nonzeros = 144M
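To get a feel for how sparse the NELL tensor is, the quoted sizes can be turned into a density. This is back-of-the-envelope arithmetic that treats the rounded figures (26M, 26M, 48M, 144M) as exact, which is an assumption.

```python
# Rounded mode sizes and nonzero count as quoted for the NELL data.
I, J, K = 26_000_000, 26_000_000, 48_000_000
nonzeros = 144_000_000

cells = I * J * K            # number of cells in the full (dense) tensor
density = nonzeros / cells   # fraction of cells that are nonzero

print(f"{cells:.3e} cells, density {density:.2e}")
```

With a density on the order of 10^-15, any method that touches every cell, or materializes a dense object of comparable size, is out of the question.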

Problem Definition

Q1: How to decompose a billion-scale tensor?

Corresponds to SVD in the 2D case
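The analogy with SVD can be made concrete by writing both decompositions as sums of rank-one terms. The sketch below assumes the decomposition in question is PARAFAC (also called CP) with R components; the symbols are conventional notation, not taken from the slides.

```latex
% SVD (2D case): a matrix as a sum of R rank-one matrices
\mathbf{X} \approx \sum_{r=1}^{R} \sigma_r \, \mathbf{u}_r \mathbf{v}_r^{\top}

% PARAFAC (3D case): a tensor as a sum of R rank-one tensors
\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r
```

Here \circ denotes the outer product, and the vectors a_r, b_r, c_r are the r-th columns of the factor matrices A, B, C.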

Problem Definition

Q2: What are the important concepts and synonyms in a KB tensor?

Q2.1: What are the dominant concepts in the knowledge base tensor?

Q2.2: What are the synonyms of a given noun phrase?

NELL (Never Ending Language Learner) data: 26M × 26M × 48M tensor, nonzeros = 144M

Algorithm: Problem Definition

Q1: How to decompose a billion-scale tensor?

Corresponds to SVD in the 2D case

Challenge

Alternating Least Squares (ALS) Algorithm

[Equations: ALS update rules, with a legend for the pseudo-inverse, the Hadamard product, and the Khatri-Rao product]

How to design a fast MapReduce algorithm for the ALS?

Tensor dimensions (NELL): I = 26M, J = 26M, K = 48M
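For reference, the standard PARAFAC-ALS updates use exactly the three operators named in the legend (pseudo-inverse, Hadamard product, Khatri-Rao product) and are usually written as below. The notation (mode-n unfoldings X_(n), factor matrices A, B, C) is the conventional one and is an assumption here.

```latex
\mathbf{A} \leftarrow \mathbf{X}_{(1)} (\mathbf{C} \odot \mathbf{B})
    (\mathbf{C}^{\top}\mathbf{C} \ast \mathbf{B}^{\top}\mathbf{B})^{\dagger}

\mathbf{B} \leftarrow \mathbf{X}_{(2)} (\mathbf{C} \odot \mathbf{A})
    (\mathbf{C}^{\top}\mathbf{C} \ast \mathbf{A}^{\top}\mathbf{A})^{\dagger}

\mathbf{C} \leftarrow \mathbf{X}_{(3)} (\mathbf{B} \odot \mathbf{A})
    (\mathbf{B}^{\top}\mathbf{B} \ast \mathbf{A}^{\top}\mathbf{A})^{\dagger}
```

In the first update, X_(1) is I × JK and sparse, C ⊙ B is JK × R and dense, and the matrix being pseudo-inverted is only R × R.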

Main Idea

1. Ordering of Computation

Our choice: 8 × 10^9 FLOPS (NELL data)

Other ordering: 2.5 × 10^17 FLOPS (NELL data)
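The gap between the two figures can be reproduced with a rough operation count. The sketch below assumes the two orderings are [X_(1)(C ⊙ B)]M† versus X_(1)[(C ⊙ B)M†], where M is the R × R matrix being pseudo-inverted, and assumes rank R = 10; neither is stated on the slide, and R = 10 is chosen only because it matches the quoted numbers.

```python
# Rounded NELL sizes; R is an assumed rank, not a value given on the slide.
I, J, K = 26_000_000, 26_000_000, 48_000_000
m = 144_000_000   # number of nonzeros in the tensor
R = 10

def flops_sparse_first(m, I, R):
    """(X1 @ KR) @ Minv: sparse-times-dense (2mR), then an I x R by R x R product."""
    return 2 * m * R + 2 * I * R**2

def flops_dense_first(m, J, K, R):
    """X1 @ (KR @ Minv): a JK x R by R x R product first, then sparse-times-dense."""
    return 2 * J * K * R**2 + 2 * m * R

cheap = flops_sparse_first(m, I, R)
costly = flops_dense_first(m, J, K, R)
print(f"{cheap:.2e} vs {costly:.2e}")
```

Under these assumptions the counts come out near 8 × 10^9 and 2.5 × 10^17: multiplying by the sparse tensor first keeps every later product small.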

Main Idea

2. Avoiding Intermediate Data Explosion

Size of Intermediate Data (NELL) - Naïve: 100 PB

Tensor dimensions (NELL): I = 26M, J = 26M, K = 48M
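The 100 PB figure is consistent with materializing the full Khatri-Rao product, which has J·K rows and R columns. The arithmetic below assumes rank R = 10 and 8-byte values; both are assumptions chosen for illustration, not values given on the slide.

```python
# Rounded NELL mode sizes; R and the value width are assumptions.
J, K = 26_000_000, 48_000_000
R = 10
BYTES_PER_VALUE = 8

naive_bytes = J * K * R * BYTES_PER_VALUE   # dense (J*K) x R matrix
naive_pb = naive_bytes / 1e15               # decimal petabytes

print(f"{naive_pb:.1f} PB")
```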

Main Idea

2. Avoiding Intermediate Data Explosion

Size of Intermediate Data (NELL) - Naïve: 100 PB (before)

Size of Intermediate Data (NELL) - Proposed: 1.5 GB (after)
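The idea of avoiding the large intermediate can be sketched on a single machine: the product X_(1)(C ⊙ B) only ever needs the rows of C ⊙ B that line up with nonzeros of the tensor, so those rows can be formed on the fly. This is not the authors' MapReduce algorithm, only an illustration of the same principle; the function names and the toy data are hypothetical.

```python
def khatri_rao(C, B):
    """Column-wise Kronecker product: row k*J + j equals C[k] * B[j] elementwise."""
    return [[ck[r] * bj[r] for r in range(len(bj))] for ck in C for bj in B]

def mttkrp_naive(nonzeros, B, C, I):
    """Materializes all J*K rows of the Khatri-Rao product before using them."""
    J, R = len(B), len(B[0])
    KR = khatri_rao(C, B)                  # the intermediate that explodes
    M = [[0.0] * R for _ in range(I)]
    for (i, j, k), x in nonzeros.items():
        row = KR[k * J + j]
        for r in range(R):
            M[i][r] += x * row[r]
    return M

def mttkrp_sparse(nonzeros, B, C, I):
    """Forms only the rows that meet a nonzero; extra memory is O(I * R)."""
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for (i, j, k), x in nonzeros.items():
        for r in range(R):
            M[i][r] += x * B[j][r] * C[k][r]
    return M

# Toy 2 x 2 x 3 tensor with three nonzeros, and rank-2 factor matrices.
X = {(0, 0, 0): 1.0, (0, 1, 2): 2.0, (1, 1, 1): 3.0}
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

result = mttkrp_sparse(X, B, C, I=2)
print(result)
```

Both functions return the same I × R matrix; the second does work proportional to the number of nonzeros instead of to J·K.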

Experiments

GigaTensor solves a 100x larger problem

[Figure: scalability on tensors with mode sizes I, J, K and number of nonzeros = I / 50. The Tensor Toolbox runs out of memory; GigaTensor handles tensors 100x larger.]
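A synthetic tensor with this sparsity can be generated as follows. The sketch assumes cubic tensors (I = J = K), uniformly random coordinates, and unit values; the slide only states the nonzero count, so the rest is an assumption, and `random_sparse_tensor` is a hypothetical helper.

```python
import random

def random_sparse_tensor(I, seed=0):
    """Return {(i, j, k): 1.0} with I // 50 distinct random coordinates in an I x I x I tensor."""
    rng = random.Random(seed)
    nnz = I // 50
    entries = {}
    while len(entries) < nnz:   # repeats overwrite, so coordinates stay distinct
        coord = (rng.randrange(I), rng.randrange(I), rng.randrange(I))
        entries[coord] = 1.0
    return entries

tensor = random_sparse_tensor(1000)
print(len(tensor))
```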

Discoveries: Problem Definition

Q2: What are the important concepts and synonyms in a KB tensor?

Q2.1: What are the dominant concepts in the knowledge base tensor?

Q2.2: What are the synonyms of a given noun phrase?

NELL (Never Ending Language Learner) data: 26M × 26M × 48M tensor, nonzeros = 144M

A2.1: Concept Discovery

Concept Discovery in Knowledge Base
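One common way to read concepts out of a PARAFAC result is to treat each component as a concept and list the highest-scoring entries in the corresponding column of a factor matrix. The sketch below assumes that approach; the factor values and labels are made up for illustration.

```python
def top_k_per_component(factor, labels, k):
    """For each column of the factor matrix, return the k labels with the largest values."""
    n_components = len(factor[0])
    concepts = []
    for r in range(n_components):
        ranked = sorted(range(len(factor)), key=lambda i: factor[i][r], reverse=True)
        concepts.append([labels[i] for i in ranked[:k]])
    return concepts

# Hypothetical noun phrases and a toy 4 x 2 factor matrix (rows = phrases, columns = components).
labels = ["guitar", "piano", "president", "senator"]
A = [
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 0.7],
    [0.1, 0.6],
]

concepts = top_k_per_component(A, labels, k=2)
print(concepts)
```

Applying the same function to the other two factor matrices would give the matching entries of the remaining modes for each concept.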

A2.2: Synonym Discovery

Synonym Discovery in Knowledge Base

[Figure: factor matrix with columns a1, a2, …, aR; rows are marked for the (given) noun phrase, (discovered) synonym 1, and (discovered) synonym 2]
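The figure suggests comparing the factor-matrix row of the given noun phrase with the rows of other noun phrases. A minimal sketch of that comparison, assuming cosine similarity as the measure, is shown below; the labels and values are toy data, not results from the NELL tensor.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_rows(factor, labels, query, k):
    """Return the k labels whose rows are most similar to the query's row."""
    q = factor[labels.index(query)]
    scored = [
        (cosine(q, row), label)
        for row, label in zip(factor, labels)
        if label != query
    ]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Hypothetical noun phrases and a toy 3 x 2 factor matrix.
labels = ["U.S.", "United States", "guitar"]
A = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.0, 1.0],
]

synonyms = nearest_rows(A, labels, "U.S.", k=1)
print(synonyms)
```

Rows that load on the same components in similar proportions score close to 1, which is what makes them synonym candidates.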

Conclusion

GigaTensor: a scalable tensor decomposition algorithm for tensors with billion-length modes

Algorithm: avoid intermediate data explosion

Discoveries: concept discovery and contextual synonym detection on a KB tensor

Thank you!

www.cs.cmu.edu/~pegasus

www.cs.cmu.edu/~ukang
