data platforms and pattern miningdata.science.uoit.ca/wp-content/uploads/2016/01/... · § big data...
TRANSCRIPT
®
IBM Platform Computing Group
© IBM Corporation
Data Platforms and Pattern Mining
Morteza Zihayat
IBM Software GroupIBM Platform Computing Group
About Myself§ Big Data Scientist4Platform Computing, IBM (2014 – Now)
§ PhD Candidate (2011 – Now)4Lassonde School of Engineering, York University
§ Main research interests4Big Data Mining and Engineering§ Finding Meaningful Patterns from Structured and Unstructured Big Data§ Parallel and Distributed Data Mining
4Mining Dynamic and Stream Data4Social Network Analysis
2
IBM Software GroupIBM Platform Computing Group
Agenda
§ How Big is this Big Data?
§ Big Data and Apache Hadoop
§ Apache Spark: Lightning-fast cluster computing
§ IBM Platform Conductor for Spark
§ A Case Study
4Mining High Utility Sequential Patterns
4User behavioral analysis
§ Globe and Mail Newspaper
3
IBM Software GroupIBM Platform Computing Group
How Big is this Big Data?
4
2,9 MILLION
NUMBER OF MAILS SENTEVERY SECOND
DATA CONSUMED BY HOUSEHOLDS EACH DAY
375 MEGABYTES 20 HOURSVIDEO UPLOADED TO
YOUTUBE EVERY MINUTEDATA PER DAY PROCESSED
BY GOOGLE
24 PETABYTES
50 MILLIONTWEETS PER DAY
700 BILLIONTOTAL MINUTES SPENT ONFACEBOOK EACH MONTH
1,3 EXABYTESDATA SENT AND RECEIVED
BY MOBILE INTERNET USERS
398 ITEMSPRODUCTS ORDERED ON
AMAZON PER SECOND
IBM Software GroupIBM Platform Computing Group
What is Big Data?
.5
Big Data
IBM Software GroupIBM Platform Computing Group
Challenges
§ Read/Write to disk is slow
4Use multiple disks for parallel read
§ Hardware failure
4Single machine/disk failure
4Keep multiple copies of data
§ How do we merge data from different reads
4Distributed processing or Hadoop MapReduce
6
IBM Software GroupIBM Platform Computing Group
Apache Hadoop
§ Hadoop is an open-source software framework written in Java
4Distributed storage
4Distributed processing of very large data sets
4Computer clusters
§ Two main components
4HDFS (Hadoop Distributed File System)
§ Provides Distributed Storage
4MapReduce (Distributed Data Processing Model)
§ Provides Distributed Processing
7
IBM Software GroupIBM Platform Computing Group
HDFS
§ Based on Google’s GFS (Google File System)
§ Data is distributed across all nodes
§ Build around the idea of “write-once, read-many-times” (WORM)
§ File are stored as “Blocks”
464MB
4Different blocks of the same file are stored on different nodes
4Same block is replicated across several nodes for redundancy
8
IBM Software GroupIBM Platform Computing Group
HDFS Architecture
§ NameNode
4It stores metadata (No of blocks, On
which rack which DataNode the data
is stored and other details)
§ DataNode
4It stores the actual data.
§ There is only one NameNode in
a cluster and many DataNodes
9
IBM Software GroupIBM Platform Computing Group
MapReduce
§ MapReduce is a method for distributing a task across multiple nodes
§ Consists of two phases
4Map
4Reduce
§ The data is process in the form of <key,value>
§ Each map task processes a discrete portion of the overall data
§ After all Maps are complete, the system distributes the intermediate data to
nodes which perform Reduce phase (aggregation)
10
IBM Software GroupIBM Platform Computing Group
Example: Word counting• Consider the problem of counting the number of occurrences of each word in
a large collection of documents• Input: documents• Output: <word,frequency>
11
Divide collection of documents among the
class.
Each students gives count of individual word
in a document. Repeats for assigned quota of documents
Sum up the counts from all the documents
to give final answer
IBM Software GroupIBM Platform Computing Group
Word Count Execution
12
the quickbrown fox
the fox ate the cow
how nowbrown cow
Student 1
Student 2
Student 3
Student 1
Student 2
brown, 2fox, 2how, 1now, 1the, 3
ate, 1cow, 2quick, 1
the, 1brown, 1fox, 1quick, 1
the, 1fox, 1the, 1ate,1cow,1
how, 1now, 1brown, 1cow,1
brown, 1
brown, 1
Input Output
IBM Software GroupIBM Platform Computing Group
Word Count Execution
13
the quickbrown fox
the fox ate the cow
how nowbrown cow
Map
Map
Map
Reduce
Reduce
brown, 2fox, 2how, 1now, 1the, 3
ate, 1cow, 2quick, 1
the, 1brown, 1fox, 1quick, 1
the, 1fox, 1the, 1ate,1cow,1
how, 1now, 1brown, 1cow,1
brown, 1
brown, 1
Input OutputMap tasks Reduce tasks
IBM Software GroupIBM Platform Computing Group
MapReduce: Execution overview
14
Reducers output the result on stable storage.
Shuffle phase assigns reducers to these buffers, which are remotely read and processed by reducers.
Map task reads the allocated data, saves the map results in local buffer.
Master Server distributes M map tasks to machines and monitors their progress.
IBM Software GroupIBM Platform Computing Group
MapReduce on a Cluster with Hadoop HDFS
15
IBM Software GroupIBM Platform Computing Group
Problem #1
§ MapReduce I/O sandbags runtime for advanced analytics.
4Must persist results after each pass through data
4Advanced analytics often requires multiple passes through data
16
HadoopStorage
HadoopStorageCompute Store
IBM Software GroupIBM Platform Computing Group
Example: Logistic Regression§ Goal: find best line separating two sets of points
17
target
random initial line
0500
10001500200025003000350040004500
1 5 10 20 30
Run
ning
Tim
e (s
)
Number of Iterations
Hadoop
IBM Software GroupIBM Platform Computing Group
Problem #2§ Many “point” solutions for advanced analytics in Hadoop4Queries§ Apache drill, ….
4Machine Learning§ Mahout, H2o, …
4Graph Analytics§ Graphlab, …
4Streaming Analytics§ S4, …
18
IBM Software GroupIBM Platform Computing Group
Apache Spark
§ It was developed in response
to limitations in the
MapReduce
§ Originally developed in 2009 in
UC Berkeley’s AMP Lab
§ Fully open sourced in 2010 –
now a Top Level Project at the
Apache Software Foundation
19
IBM Software GroupIBM Platform Computing Group
Easy and Fast Big Data
§Easy to Develop4Rich APIs in Java, Scala,
Python4Interactive shell
§Fast to Run4In-memory storage
2-5× less code Up to 10× faster on disk,100× in memory
IBM Software GroupIBM Platform Computing Group
Resilient Distributed Datasets (RDD)
§ Spark revolves around RDDs
§ Fault-tolerant collection of elements that can be operated on in parallel
4Parallelized Collection: Scala collection which is run in parallel
4Hadoop Dataset: records of files supported by Hadoop
4The data flow from one iteration to another happens through memory
§ Doesn't touch the disk.
4When the memory is not sufficient enough for the data to fit it
§ It can be either spilled to the drive or is just left to be recreated upon request for the same
IBM Software GroupIBM Platform Computing Group
RDD Operations
§ Transformations
4Creation of a new dataset from an existing
§ map, filter, distinct, union, sample, groupByKey, join, etc…
§ Actions
4Return a value after running a computation
§ collect, count, first, takeSample, foreach, etc…
IBM Software GroupIBM Platform Computing Group
RDD Persistence / Caching§ Variety of storage levels4memory_only (default), memory_and_disk, etc…
§ API Calls4persist(StorageLevel)4cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
§ Considerations4Read from disk vs. recompute (memory_and_disk)4Total memory storage size (memory_only_ser)4Replicate to second node for faster fault recovery (memory_only_2)§ Think about this option if supporting a web application
IBM Software GroupIBM Platform Computing Group
Cache Scaling Matters
69
58
41
30
12
0
10
20
30
40
50
60
70
80
90
100
Cache disabled 25% 50% 75% Fully cached
Exe
cutio
n tim
e (s
)
% of working set in cache
IBM Software GroupIBM Platform Computing Group
Logistic Regression Performance
25
0
500
1000
1500
2000
2500
3000
3500
4000
4500
1 5 10 20 30
Run
ning
Tim
e (s
)
Number of Iterations
HadoopSpark
IBM Software GroupIBM Platform Computing Group
A single integrated platform for advanced analytics
26
IBM Software GroupIBM Platform Computing Group
Spark Performance
§ Machine Learning
4100x faster than MapReduce
§ Queries (Shark)
4100x faster than Hive
§ Streaming
42X throughput of Storm
§ Graph (GraphX)
410X faster than MapReduce
27
IBM Software GroupIBM Platform Computing Group
IBM Platform Conductor for Spark
§ Addresses the following challenges
4Integrating Spark into existing environments
4Managing Spark lifecycles in the face of frequent updates to open-source Spark distributions
4Managing numerous Spark deployments within the organization
§ Benefits
4Cuts costs by using a service orchestration framework to maximize resource utilization
4Improves performance and efficiency
4Simplifies application management
28
®
IBM Platform Computing Group
© IBM Corporation
A Case Study
High Utility Sequential Pattern Mining in Big Data
29
IBM Software GroupIBM Platform Computing Group
Frequent Pattern Mining§ Frequent Pattern Mining (FPM)4FPM is a fundamental research topic in data mining4Example application§ Discover sets of items (i.e., itemsets) that are frequently purchased together by
customers
30
TID TransactionT1 {Bread, Milk}T2 {Bread, Milk}T3 {Bread, Milk, Diaper, Beer}T4 {Bread, Milk, Diaper, Beer}T5 {Diamond, Necklace}T6 {Diamond, Necklace}
Minimum support threshold: 60%Sup({Bread, Milk}) = 4/6 = 66.6%{Bread, Milk} is a frequent itemset
IBM Software GroupIBM Platform Computing Group
Domain Driven Actionable Knowledge Discovery
§ The identified patterns are handed over to business people
§ They cannot interpret the patterns for business use
4There are many patterns/not informative
4Not interested to business needs.
4How to interpret the patterns to business actions
31
IBM Software GroupIBM Platform Computing Group
Insufficiency of Frequent Pattern Mining
§ In Market Analysis
4Business objective: Increase Revenue
4May lose infrequent but valuable patterns
4May present too many frequent but unprofitable
patterns
4Cannot find patterns having high profits
32
Which patterns can bring high profits and make high revenue?
IBM Software GroupIBM Platform Computing Group
A Motivation Example
33
TID TransactionT1 {Bread(1), Milk(1)}T2 {Bread(1), Milk(1)}T3 {Bread(1), Milk(1), Diaper(3), Beer(6)}T4 {Bread(1), Milk(1), Diaper(3), Beer(6)}T5 {Diamond(1), Necklace(1)}T6 {Diamond(1), Necklace(1)}
Item Unit ProfitBread 20Milk 30
Diamond 1,000Necklace 300Diaper 300Beer 70
{Diamond, Necklace}: $2600 {Diaper, Beer}:$2640 {Bread, Milk}: $200
IBM Software GroupIBM Platform Computing Group
High Utility Sequential Pattern Mining
§ Given a set of sequences: find all sequences whose utility is > a user-
specified minimum threshold
4Each item has quantity in a transaction
4Each item has a value (e.g., price)
34
Items ProfitMilk $3Egg $2Birthday Cake $20Birthday Card $10Bread $1
CID TIDC1 T1 {(Bread,2), (Milk,6)}C1 T2 {(Birthday Card,2)}C1 T3 { (Birthday Cake,2), (egg,3)}C2 T3 {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)}C2 T4 {(egg,5), (Pizza,4), (Juice,2)}C3 T5 { (Bread,2),( Yoghurt,4), (Milk,3)}C3 T6 {(Milk,1), (cheese,2)}
IBM Software GroupIBM Platform Computing Group
What is utility?§ Utility of item in a transaction = internal utility (quantity of items in the transaction) x external utility (profit of the item).
4U(Milk,T1 ) = 3 × 6 = 18§ Utility of itemset in a sequence = sum of utilities of its items:
4U({Bread, Milk}, C1)= 2 × 1 + 6 × 3 = 20§ Utility of sub-sequence in a sequence = sum of its itemsets’ utilities
4U(<{Milk}{egg}>,C1) = 3 × 6 + 3 × 2 = 244 If more than one occurrence, then maximum value among occurrences
35
Items ProfitMilk $3Egg $2Birthday Cake $20Birthday Card $10Bread $1
CID TIDC1 T1 {(Bread,2), (Milk,6)}C1 T2 {(Birthday Card,2)}C1 T3 { (Birthday Cake,2), (egg,3)}C2 T3 {(Bread,2), (Milk,4), (Yoghurt,3), (Tuna,5)}C2 T4 {(egg,5), (Pizza,4), (Juice,2)}C3 T5 { (Bread,2),( Yoghurt,4), (Milk,3)}C3 T6 {(Milk,1), (cheese,2)}
®
IBM Platform Computing Group
© IBM Corporation
A Rule-Based News Recommendation System
36
IBM Software GroupIBM Platform Computing Group
Globe and Mail Dataset
§ Globe and Mail Inc.
4Corpus of news (142,163,909 articles)
4Web clickstream
§ 2 billion records
§ Each record
– A sequence of visited news
– 245 attributes
§ 4 TB
37
IBM Software GroupIBM Platform Computing Group
A Rule-Based Recommendation System
38
Phase 1
Phase 2
Phase 3
• Discover Frequent Patterns from user logs
• <News1,New2>
• Extract rules from the patterns• News1 News2
User log
Offline
Online
IBM Software GroupIBM Platform Computing Group
Goal
§ A rule based news recommendation engine
4News1 News2
§ It is quite probable that not all the pages visited by the user are of interest to
him/her
§ The value or importance of a news changes over time
39
IBM Software GroupIBM Platform Computing Group
A Utility Based News Recommendation System
40
Time Spent
cursor movement
freshness
Editorial Board
Rule-Based Recommendation
System
Navigational rules
IBM Software GroupIBM Platform Computing Group
The Proposed Framework
41
Phase 1
Phase 2
Phase 3
• Extract Utility-Based rules from the patterns
• News1 News2
• Recommend articles based on current user activities
User log
Offline
Online
• Discover High Utility Patternsfrom user logs
• <News1,New2>
IBM Software GroupIBM Platform Computing Group
Preliminary Experiment
§ Only 120000 records
4Over 32 GB memory usage in average
4Around 55 mins run time
§ With the exponential growth of data
4It is impossible to execute algorithms on a single machine
42
IBM Software GroupIBM Platform Computing Group
BigHUSP
43
IBM Software GroupIBM Platform Computing Group
BigHUSP
44
IBM Software GroupIBM Platform Computing Group
BigHUSP
45
• Partition data
• In each mapper
• Scan input sequences
• Construct UMatrixs,
• Calculate required parameters: threshold, …
• Remove all blocks for sequences from memory
• Store Umatrixs in Memory-Disk to re-use
IBM Software GroupIBM Platform Computing Group
BigHUSP
46
• Call Uspan to mine local HUSPs using
the local UMatrixs and threshold
• Output
• Local HUSPs and their utilities
IBM Software GroupIBM Platform Computing Group
BigHUSP
47
• Aggregate local HUSPs• If the pattern does not exist
• Estimate its utility • Output
• Global candidates and their approximate utilities
IBM Software GroupIBM Platform Computing Group
BigHUSP
48
IBM Software GroupIBM Platform Computing Group
Our recommendations keep user longer
49
2,537 minutes spent by 107users
380 minutes spent by 254 users
IBM Software GroupIBM Platform Computing Group
Experimental Results
50
IBM Software GroupIBM Platform Computing Group
Summary§ What Big Data is
§ What Hadoop is and why it is important§ Hadoop main components§ What Spark is
§ Why Spark is popular§ High Utility Pattern Mining§ Rule-Based News Recommendation System
§ Big Data Analysis Using IBM Conductor for Spark
51
IBM Software GroupIBM Platform Computing Group
Q1
§ According to Spark advocates, how much faster can Apache Spark
potentially run batch-processing programs when processed in memory than
MapReduce can?
§ 10 times faster
§ 20 times faster
§ 100 times faster
§ 200 times faster
53
IBM Software GroupIBM Platform Computing Group
Q2
§ What is the name of the programming framework originally developed by
Google that supports the development of applications for processing large
data sets in a distributed computing environment?
§ MapReduce
§ Hive
§ Spark
54
IBM Software GroupIBM Platform Computing Group
Q3
§ True or false? A big data analytics strategy is often defined by the three V's –
volume, variety and velocity -- which is helpful but ignores other commonly
cited characteristics, such as complexity and variability.
§ True
§ False
55
IBM Software GroupIBM Platform Computing Group
Q4
§ Just collecting and storing information isn't enough to produce real business
value. Big data analytics technologies are necessary to:
4Formulate eye-catching charts and graphs
4Extract valuable insights from the data
4Integrate data from internal and external sources
56
IBM Software GroupIBM Platform Computing Group
Q5
§ Which of the following application types can Spark run in addition to batch-
processing jobs?
4Stream processing
4Machine learning
4Graph processing
4All of the above
57
IBM Software GroupIBM Platform Computing Group
Q6
§ Which of the following is NOT a characteristic shared by Hadoop and Spark?
4Both are data processing platforms
4Both are cluster computing environments
4Both have their own file system
4Both use open source APIs to link between different tools
58