
IBM Platform Computing Group

© IBM Corporation

Data Platforms and Pattern Mining

Morteza Zihayat

IBM Software Group | IBM Platform Computing Group

About Myself

§ Big Data Scientist
  – Platform Computing, IBM (2014 – Now)
§ PhD Candidate (2011 – Now)
  – Lassonde School of Engineering, York University
§ Main research interests
  – Big Data Mining and Engineering
    § Finding Meaningful Patterns from Structured and Unstructured Big Data
    § Parallel and Distributed Data Mining
  – Mining Dynamic and Stream Data
  – Social Network Analysis

Agenda

§ How Big is this Big Data?
§ Big Data and Apache Hadoop
§ Apache Spark: Lightning-fast cluster computing
§ IBM Platform Conductor for Spark
§ A Case Study
  – Mining High Utility Sequential Patterns
  – User behavioral analysis
    § Globe and Mail Newspaper

How Big is this Big Data?

§ 2.9 million: emails sent every second
§ 375 megabytes: data consumed by households each day
§ 20 hours: video uploaded to YouTube every minute
§ 24 petabytes: data processed by Google per day
§ 50 million: tweets per day
§ 700 billion: total minutes spent on Facebook each month
§ 1.3 exabytes: data sent and received by mobile internet users
§ 398 items: products ordered on Amazon per second

What is Big Data?

[Figure: Big Data]

Challenges

§ Read/write to disk is slow
  – Use multiple disks for parallel reads
§ Hardware failure
  – Single machine/disk failure
  – Keep multiple copies of data
§ How do we merge data from different reads?
  – Distributed processing or Hadoop MapReduce

Apache Hadoop

§ Hadoop is an open-source software framework written in Java
  – Distributed storage
  – Distributed processing of very large data sets
  – Computer clusters
§ Two main components
  – HDFS (Hadoop Distributed File System)
    § Provides distributed storage
  – MapReduce (distributed data processing model)
    § Provides distributed processing

HDFS

§ Based on Google’s GFS (Google File System)
§ Data is distributed across all nodes
§ Built around the idea of “write-once, read-many-times” (WORM)
§ Files are stored as “blocks”
  – 64 MB
  – Different blocks of the same file are stored on different nodes
  – The same block is replicated across several nodes for redundancy
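The block layout described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not HDFS code: the function names `split_into_blocks` and `place_replicas`, the round-robin placement policy, the node names, and the replication factor of 3 are all assumptions made for illustration. Only the 64 MB block size comes from the slide.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks that a file of this size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes.

    ASSUMPTION: simple round-robin placement, chosen only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200)  # a hypothetical 200 MB file
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

In this sketch, losing any single node still leaves two copies of every block, which is the redundancy argument made on the slide.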

HDFS Architecture

§ NameNode
  – It stores metadata (number of blocks, the rack and DataNode on which the data is stored, and other details)
§ DataNode
  – It stores the actual data.
§ There is only one NameNode in a cluster and many DataNodes

MapReduce

§ MapReduce is a method for distributing a task across multiple nodes
§ Consists of two phases
  – Map
  – Reduce
§ The data is processed in the form of <key,value> pairs
§ Each map task processes a discrete portion of the overall data
§ After all maps are complete, the system distributes the intermediate data to the nodes that perform the Reduce phase (aggregation)

Example: Word counting

• Consider the problem of counting the number of occurrences of each word in a large collection of documents
• Input: documents
• Output: <word,frequency>

• Divide the collection of documents among the class.
• Each student counts the individual words in a document, and repeats this for the assigned quota of documents.
• Sum up the counts from all the documents to give the final answer.

Word Count Execution

Input (one document per student):
  Student 1: "the quick brown fox"
  Student 2: "the fox ate the cow"
  Student 3: "how now brown cow"

Per-document counts:
  Student 1: the, 1; brown, 1; fox, 1; quick, 1
  Student 2: the, 1; fox, 1; the, 1; ate, 1; cow, 1
  Student 3: how, 1; now, 1; brown, 1; cow, 1

Counts for the same word go to the same student (e.g., both "brown, 1" counts)

Output (summed counts):
  Student 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
  Student 2: ate, 1; cow, 2; quick, 1

Word Count Execution

Input splits:
  "the quick brown fox"
  "the fox ate the cow"
  "how now brown cow"

Map tasks (one per split) emit:
  Map 1: the, 1; brown, 1; fox, 1; quick, 1
  Map 2: the, 1; fox, 1; the, 1; ate, 1; cow, 1
  Map 3: how, 1; now, 1; brown, 1; cow, 1

Intermediate pairs with the same key go to the same reduce task (e.g., both "brown, 1" pairs)

Reduce tasks output:
  Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
  Reduce 2: ate, 1; cow, 2; quick, 1
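The map, shuffle, and reduce steps of the word count above can be sketched in a few lines. This is a single-process simulation in plain Python, not Hadoop code: the function names and the byte-sum partitioner are assumptions chosen for illustration, so the assignment of words to the two reducers may differ from the slide.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]


def shuffle(mapped_pairs, num_reducers):
    """Shuffle: route all pairs with the same key to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in mapped_pairs:
        # ASSUMPTION: a simple deterministic partitioner (sum of bytes mod N).
        index = sum(word.encode()) % num_reducers
        partitions[index][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in partition.items()}


documents = ["the quick brown fox", "the fox ate the cow", "how now brown cow"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
reduced = [reduce_phase(p) for p in shuffle(mapped, num_reducers=2)]
word_counts = {word: n for part in reduced for word, n in part.items()}
```

Because the partitioner depends only on the key, every occurrence of a word reaches the same reducer, so the per-reducer sums are already the final counts.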

MapReduce: Execution overview

§ The master server distributes M map tasks to machines and monitors their progress.
§ Each map task reads its allocated data and saves the map results in a local buffer.
§ The shuffle phase assigns reducers to these buffers, which the reducers read remotely and process.
§ Reducers write the result to stable storage.

MapReduce on a Cluster with Hadoop HDFS

Problem #1

§ MapReduce I/O sandbags runtime for advanced analytics.
  – Must persist results after each pass through the data
  – Advanced analytics often requires multiple passes through the data

[Figure: each pass reads from Hadoop storage, computes, and stores its results back to Hadoop storage]

Example: Logistic Regression

§ Goal: find best line separating two sets of points

[Figure: a random initial line and the target line]
[Chart: running time (s), 0 to 4,500, vs. number of iterations (1, 5, 10, 20, 30) for Hadoop]
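To see why this workload needs multiple passes, consider a minimal gradient-descent sketch of logistic regression. It is plain Python, not Hadoop or Spark code; the toy points, the learning rate, the zero starting line (the slide starts from a random one), and the function names are assumptions made for illustration. The point to notice is that every iteration scans the whole data set again.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, labels, iterations=30, learning_rate=0.5):
    """Batch gradient descent; each iteration is one full pass over the data."""
    w = [0.0, 0.0]  # ASSUMPTION: start from zero instead of a random line
    b = 0.0
    n = len(points)
    for _ in range(iterations):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in zip(points, labels):  # full pass over the data
            error = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            grad_w[0] += error * x1
            grad_w[1] += error * x2
            grad_b += error
        w[0] -= learning_rate * grad_w[0] / n
        w[1] -= learning_rate * grad_w[1] / n
        b -= learning_rate * grad_b / n
    return w, b


def predict(w, b, point):
    """Classify a point by the side of the line it falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else 0


# Hypothetical, linearly separable toy data.
points = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(points, labels)
predictions = [predict(weights, bias, p) for p in points]
```

With MapReduce, each of these passes would read its input from storage and write its result back, which is the cost the previous slide describes.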

Problem #2

§ Many “point” solutions for advanced analytics in Hadoop
  – Queries
    § Apache Drill, …
  – Machine Learning
    § Mahout, H2O, …
  – Graph Analytics
    § GraphLab, …
  – Streaming Analytics
    § S4, …

Apache Spark

§ It was developed in response to limitations in MapReduce
§ Originally developed in 2009 in UC Berkeley’s AMPLab
§ Fully open sourced in 2010; now a Top Level Project at the Apache Software Foundation

Easy and Fast Big Data

§ Easy to Develop (2-5× less code)
  – Rich APIs in Java, Scala, Python
  – Interactive shell
§ Fast to Run (up to 10× faster on disk, 100× in memory)
  – In-memory storage

Resilient Distributed Datasets (RDD)

§ Spark revolves around RDDs
§ Fault-tolerant collection of elements that can be operated on in parallel
  – Parallelized Collection: a Scala collection that is processed in parallel
  – Hadoop Dataset: records of files supported by Hadoop
  – The data flow from one iteration to another happens through memory
    § Doesn't touch the disk.
  – When the memory is not sufficient for the data to fit
    § The data can either be spilled to disk or left to be recomputed when it is requested again

RDD Operations

§ Transformations
  – Creation of a new dataset from an existing one
    § map, filter, distinct, union, sample, groupByKey, join, etc.
§ Actions
  – Return a value after running a computation
    § collect, count, first, takeSample, foreach, etc.
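The split between transformations and actions can be illustrated with a toy class. `MiniRDD` below is a hypothetical stand-in written in plain Python, not the Spark API: it only records `map` and `filter` steps and runs them when an action such as `collect` or `count` is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD (NOT the Spark API): transformations are lazy."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: describe a new dataset, compute nothing yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # record every evaluation to make laziness visible
    return x * x


numbers = MiniRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = even_squares.collect()   # the action triggers the computation
```

Deferring the work until an action is requested is what lets an engine see the whole chain of steps before deciding how to run it.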

RDD Persistence / Caching

§ Variety of storage levels
  – MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
§ API calls
  – persist(StorageLevel)
  – cache(): shorthand for persist(StorageLevel.MEMORY_ONLY)
§ Considerations
  – Read from disk vs. recompute (MEMORY_AND_DISK)
  – Total memory storage size (MEMORY_ONLY_SER)
  – Replicate to second node for faster fault recovery (MEMORY_ONLY_2)
    § Think about this option if supporting a web application
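The trade-off between recomputing and reusing a cached result can be shown with a small counter. `CachedDataset` is a hypothetical class written in plain Python for illustration, not the Spark API; it models only an in-memory cache, not the disk or replicated storage levels listed above.

```python
class CachedDataset:
    """Toy illustration (NOT the Spark API) of recompute vs. cached reuse."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data
        self._cached = None
        self._use_cache = False
        self.compute_calls = 0    # how many times the data was rebuilt

    def cache(self):
        """Mark the dataset so the first computed result is kept in memory."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # served from memory, no recomputation
        self.compute_calls += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


def expensive():
    return [x * x for x in range(5)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()  # recomputed on every action

cached = CachedDataset(expensive).cache()
cached.collect()
cached.collect()    # second action reuses the stored result
```

Here the uncached dataset is rebuilt twice and the cached one once, which mirrors why iterative jobs benefit from keeping their working set in memory.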

Cache Scaling Matters

% of working set in cache | Execution time (s)
Cache disabled            | 69
25%                       | 58
50%                       | 41
75%                       | 30
Fully cached              | 12

Logistic Regression Performance

[Chart: running time (s), 0 to 4,500, vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

A single integrated platform for advanced analytics

Spark Performance

§ Machine Learning
  – 100× faster than MapReduce
§ Queries (Shark)
  – 100× faster than Hive
§ Streaming
  – 2× throughput of Storm
§ Graph (GraphX)
  – 10× faster than MapReduce

IBM Platform Conductor for Spark

§ Addresses the following challenges
  – Integrating Spark into existing environments
  – Managing Spark lifecycles in the face of frequent updates to open-source Spark distributions
  – Managing numerous Spark deployments within the organization
§ Benefits
  – Cuts costs by using a service orchestration framework to maximize resource utilization
  – Improves performance and efficiency
  – Simplifies application management

A Case Study

High Utility Sequential Pattern Mining in Big Data

Frequent Pattern Mining

§ Frequent Pattern Mining (FPM)
  – FPM is a fundamental research topic in data mining
  – Example application
    § Discover sets of items (i.e., itemsets) that are frequently purchased together by customers

TID | Transaction
T1  | {Bread, Milk}
T2  | {Bread, Milk}
T3  | {Bread, Milk, Diaper, Beer}
T4  | {Bread, Milk, Diaper, Beer}
T5  | {Diamond, Necklace}
T6  | {Diamond, Necklace}

Minimum support threshold: 60%
Sup({Bread, Milk}) = 4/6 = 66.7%
{Bread, Milk} is a frequent itemset
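The support calculation above can be reproduced directly. This is a minimal sketch in plain Python using the six transactions and the 60% threshold from the slide; the function name `support` and the constant name `MIN_SUPPORT` are illustrative choices.

```python
TRANSACTIONS = {
    "T1": {"Bread", "Milk"},
    "T2": {"Bread", "Milk"},
    "T3": {"Bread", "Milk", "Diaper", "Beer"},
    "T4": {"Bread", "Milk", "Diaper", "Beer"},
    "T5": {"Diamond", "Necklace"},
    "T6": {"Diamond", "Necklace"},
}
MIN_SUPPORT = 0.60


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


sup_bread_milk = support({"Bread", "Milk"}, TRANSACTIONS)
sup_diamond_necklace = support({"Diamond", "Necklace"}, TRANSACTIONS)
is_frequent = sup_bread_milk >= MIN_SUPPORT
```

Under this threshold {Bread, Milk} qualifies, while {Diamond, Necklace} (2 of 6 transactions) does not.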

Domain Driven Actionable Knowledge Discovery

§ The identified patterns are handed over to business people
§ They cannot interpret the patterns for business use
  – There are too many patterns, and many are not informative
  – The patterns are not relevant to business needs
  – It is unclear how to turn the patterns into business actions

Insufficiency of Frequent Pattern Mining

§ In Market Analysis
  – Business objective: increase revenue
  – May lose infrequent but valuable patterns
  – May present too many frequent but unprofitable patterns
  – Cannot find patterns having high profits

Which patterns can bring high profits and make high revenue?

A Motivation Example

TID | Transaction
T1  | {Bread(1), Milk(1)}
T2  | {Bread(1), Milk(1)}
T3  | {Bread(1), Milk(1), Diaper(3), Beer(6)}
T4  | {Bread(1), Milk(1), Diaper(3), Beer(6)}
T5  | {Diamond(1), Necklace(1)}
T6  | {Diamond(1), Necklace(1)}

Item     | Unit Profit
Bread    | 20
Milk     | 30
Diamond  | 1,000
Necklace | 300
Diaper   | 300
Beer     | 70

{Diamond, Necklace}: $2600
{Diaper, Beer}: $2640
{Bread, Milk}: $200
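The three totals above follow from multiplying each purchased quantity by the item's unit profit and summing over the transactions that contain the whole itemset. A minimal sketch in plain Python, with the data copied from the two tables and an illustrative function name:

```python
UNIT_PROFIT = {
    "Bread": 20, "Milk": 30, "Diamond": 1000,
    "Necklace": 300, "Diaper": 300, "Beer": 70,
}

# Quantities per transaction, T1 to T6.
TRANSACTIONS = [
    {"Bread": 1, "Milk": 1},
    {"Bread": 1, "Milk": 1},
    {"Bread": 1, "Milk": 1, "Diaper": 3, "Beer": 6},
    {"Bread": 1, "Milk": 1, "Diaper": 3, "Beer": 6},
    {"Diamond": 1, "Necklace": 1},
    {"Diamond": 1, "Necklace": 1},
]


def itemset_utility(itemset, transactions):
    """Total profit of an itemset over the transactions that contain all of it."""
    total = 0
    for quantities in transactions:
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * UNIT_PROFIT[item] for item in itemset)
    return total


diamond_necklace = itemset_utility(["Diamond", "Necklace"], TRANSACTIONS)
diaper_beer = itemset_utility(["Diaper", "Beer"], TRANSACTIONS)
bread_milk = itemset_utility(["Bread", "Milk"], TRANSACTIONS)
```

The most frequent itemset, {Bread, Milk}, is also the least profitable of the three, which is the motivation for mining by utility rather than by frequency.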

High Utility Sequential Pattern Mining

§ Given a set of sequences: find all sequences whose utility is greater than a user-specified minimum threshold
  – Each item has a quantity in a transaction
  – Each item has a value (e.g., price)

Items         | Profit
Milk          | $3
Egg           | $2
Birthday Cake | $20
Birthday Card | $10
Bread         | $1

CID | TID | Transaction
C1  | T1  | {(Bread,2), (Milk,6)}
C1  | T2  | {(Birthday Card,2)}
C1  | T3  | {(Birthday Cake,2), (Egg,3)}
C2  | T3  | {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)}
C2  | T4  | {(Egg,5), (Pizza,4), (Juice,2)}
C3  | T5  | {(Bread,2), (Yoghurt,4), (Milk,3)}
C3  | T6  | {(Milk,1), (Cheese,2)}

What is utility?

§ Utility of an item in a transaction = internal utility (quantity of the item in the transaction) × external utility (profit of the item)
  – U(Milk, T1) = 3 × 6 = 18
§ Utility of an itemset in a sequence = sum of the utilities of its items
  – U({Bread, Milk}, C1) = 2 × 1 + 6 × 3 = 20
§ Utility of a sub-sequence in a sequence = sum of its itemsets’ utilities
  – U(<{Milk}{Egg}>, C1) = 3 × 6 + 3 × 2 = 24
  – If there is more than one occurrence, take the maximum value among the occurrences

(Profits and sequences are those of the tables on the previous slide.)
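The three definitions above, including the maximum over occurrences, can be checked with a short sketch. It is plain Python and not the authors' implementation: the function names are illustrative, item names are normalized to a single capitalization, and only sequences C1 and C3 are included because every pattern item used here has a listed profit.

```python
PROFIT = {"Milk": 3, "Egg": 2, "Birthday Cake": 20, "Birthday Card": 10, "Bread": 1}

# Each sequence is an ordered list of transactions mapping item -> quantity.
SEQUENCES = {
    "C1": [{"Bread": 2, "Milk": 6}, {"Birthday Card": 2}, {"Birthday Cake": 2, "Egg": 3}],
    "C3": [{"Bread": 2, "Yoghurt": 4, "Milk": 3}, {"Milk": 1, "Cheese": 2}],
}


def itemset_utility(itemset, transaction):
    """Quantity x profit summed over the itemset, or None if an item is missing."""
    if not all(item in transaction for item in itemset):
        return None
    return sum(transaction[item] * PROFIT[item] for item in itemset)


def subsequence_utility(pattern, sequence):
    """Maximum utility over all occurrences of the pattern, or None if absent."""

    def best(pattern_index, start):
        if pattern_index == len(pattern):
            return 0
        best_value = None
        for t in range(start, len(sequence)):
            here = itemset_utility(pattern[pattern_index], sequence[t])
            if here is None:
                continue
            rest = best(pattern_index + 1, t + 1)  # later itemsets must come later
            if rest is None:
                continue
            if best_value is None or here + rest > best_value:
                best_value = here + rest
        return best_value

    return best(0, 0)


u_milk_t1 = itemset_utility(["Milk"], SEQUENCES["C1"][0])
u_bread_milk = itemset_utility(["Bread", "Milk"], SEQUENCES["C1"][0])
u_milk_then_egg = subsequence_utility([["Milk"], ["Egg"]], SEQUENCES["C1"])
u_milk_c3 = subsequence_utility([["Milk"]], SEQUENCES["C3"])
```

In C3, Milk occurs in both T5 (3 × $3 = 9) and T6 (1 × $3 = 3), so the utility of <{Milk}> there is the larger value, 9.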

A Rule-Based News Recommendation System

Globe and Mail Dataset

§ Globe and Mail Inc.
  – Corpus of news (142,163,909 articles)
  – Web clickstream
    § 2 billion records
    § Each record
      – A sequence of visited news articles
      – 245 attributes
    § 4 TB

A Rule-Based Recommendation System

User log

Offline
  Phase 1
    • Discover frequent patterns from user logs
    • <News1, News2>
  Phase 2
    • Extract rules from the patterns
    • News1 → News2

Online
  Phase 3

Goal

§ A rule-based news recommendation engine
  – News1 → News2
§ It is quite probable that not all the pages visited by the user are of interest to him/her
§ The value or importance of a news article changes over time

A Utility Based News Recommendation System

[Figure labels: time spent; cursor movement; freshness; editorial board; rule-based recommendation system; navigational rules]

IBM Software GroupIBM Platform Computing Group

The Proposed Framework

41

Phase 1

Phase 2

Phase 3

• Extract Utility-Based rules from the patterns

• News1 News2

• Recommend articles based on current user activities

User log

Offline

Online

• Discover High Utility Patternsfrom user logs

• <News1,New2>

Preliminary Experiment

§ Only 120,000 records
  – Over 32 GB of memory usage on average
  – Around 55 minutes of run time
§ With the exponential growth of data
  – It is infeasible to run the algorithms on a single machine

BigHUSP


BigHUSP

• Partition data
• In each mapper
  • Scan input sequences
  • Construct UMatrices
  • Calculate required parameters: threshold, …
  • Remove all blocks for sequences from memory
  • Store UMatrices in memory and on disk for reuse

BigHUSP

• Call USpan to mine local HUSPs using the local UMatrices and threshold
• Output
  • Local HUSPs and their utilities

BigHUSP

• Aggregate local HUSPs
  • If the pattern does not exist
    • Estimate its utility
• Output
  • Global candidates and their approximate utilities
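The aggregation step can be sketched as follows. The slide does not say how the utility of a missing pattern is estimated, so this plain-Python sketch makes a loud assumption: a pattern that a partition did not report is credited with that partition's local threshold as an upper bound. The function name, the data layout, and the example values are hypothetical and are not taken from BigHUSP.

```python
def aggregate_local_husps(local_results, local_thresholds):
    """Combine per-partition patterns into global candidates.

    local_results: one dict per partition, mapping pattern -> local utility.
    local_thresholds: one local utility threshold per partition.
    """
    all_patterns = set()
    for result in local_results:
        all_patterns.update(result)

    candidates = {}
    for pattern in all_patterns:
        total = 0
        for result, threshold in zip(local_results, local_thresholds):
            if pattern in result:
                total += result[pattern]
            else:
                # ASSUMPTION (not from the slides): an unreported pattern is
                # estimated by the partition's threshold, used as an upper bound.
                total += threshold
        candidates[pattern] = total
    return candidates


# Hypothetical output of two partitions; patterns are tuples of article ids.
partition_results = [
    {("News1", "News2"): 40, ("News3",): 25},
    {("News1", "News2"): 35},
]
partition_thresholds = [20, 15]
global_candidates = aggregate_local_husps(partition_results, partition_thresholds)
```

Because some of the summed values are estimates, the totals are approximate utilities, which matches the slide's description of the output as candidates rather than final patterns.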


Our recommendations keep users longer

2,537 minutes spent by 107 users

380 minutes spent by 254 users

Experimental Results

Summary

§ What Big Data is
§ What Hadoop is and why it is important
§ Hadoop main components
§ What Spark is
§ Why Spark is popular
§ High Utility Pattern Mining
§ Rule-Based News Recommendation System
§ Big Data Analysis Using IBM Conductor for Spark


Q1

§ According to Spark advocates, how much faster can Apache Spark potentially run batch-processing programs when processed in memory than MapReduce can?
  – 10 times faster
  – 20 times faster
  – 100 times faster
  – 200 times faster

Q2

§ What is the name of the programming framework originally developed by Google that supports the development of applications for processing large data sets in a distributed computing environment?
  – MapReduce
  – Hive
  – Spark

Q3

§ True or false? A big data analytics strategy is often defined by the three V's (volume, variety and velocity), which is helpful but ignores other commonly cited characteristics, such as complexity and variability.
  – True
  – False

Q4

§ Just collecting and storing information isn't enough to produce real business value. Big data analytics technologies are necessary to:
  – Formulate eye-catching charts and graphs
  – Extract valuable insights from the data
  – Integrate data from internal and external sources

Q5

§ Which of the following application types can Spark run in addition to batch-processing jobs?
  – Stream processing
  – Machine learning
  – Graph processing
  – All of the above

Q6

§ Which of the following is NOT a characteristic shared by Hadoop and Spark?
  – Both are data processing platforms
  – Both are cluster computing environments
  – Both have their own file system
  – Both use open source APIs to link between different tools