NYC HUG - Application Architectures with Apache Hadoop
DESCRIPTION
Presentation at NYC HUG on Application Architectures with Apache Hadoop
TRANSCRIPT
Application Architectures with Apache Hadoop
Mark Grover | @mark_grover
NYC HUG
slideshare.com/markgrover
October 14th, 2014
©2014 Cloudera, Inc. All Rights Reserved.
About Me
• Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating)
• Contributor to Hadoop, Hive, Spark, Sqoop, Flume
• Software developer at Cloudera
• @mark_grover
Co-authoring O’Reilly book
• @hadooparchbook
• hadooparchitecturebook.com
• Strata Hadoop World Tutorial
  • at 9 AM tomorrow
Click Stream Analysis
Case Study
Analytics
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
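A minimal sketch of pulling the fields out of one such record in plain Java. The regular expression and the class name CombinedLogParser are assumptions made for illustration and are not taken from the talk's code; a production pattern would likely need to handle more edge cases, such as escaped quotes inside the user agent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedLogParser {
    // Hypothetical pattern. Groups, in order: ip, identity, user, timestamp,
    // request, status, bytes, referrer, user agent.
    static final Pattern LOG_RECORD = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the nine captured fields, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG_RECORD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" 200 4463 "
            + "\"http://bestcyclingreviews.com/top_online_shops\" "
            + "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36\"";
        String[] fields = parse(line);
        System.out.println("ip=" + fields[0] + " request=" + fields[4] + " status=" + fields[5]);
    }
}
```

Returning null for a non-matching line keeps the "ignore busted records" decision with the caller.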
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Similar use-cases
• Sensors – heart, agriculture, etc.
• Casinos – session of a person at a table
Challenges of Hadoop Implementation
Other challenges - Architectural Considerations
• Storage managers?
  • HDFS? HBase?
• Data storage and modeling
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Processing
  • How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?
2. Processing
Since that’s all the time allows today :)
Processing
• De-duplication
• Filtering
• Sessionization
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
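As an in-memory illustration of the idea, the sketch below treats two records as duplicates when the full log lines are identical and keeps the first occurrence. The class name Deduplicator is hypothetical, and this is not the talk's implementation; at Hadoop scale the same effect is typically achieved by using the whole record as the key so that identical records meet in one group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

final class Deduplicator {
    /** Keeps the first occurrence of each distinct line and preserves input order. */
    static List<String> dedupe(List<String> lines) {
        // LinkedHashSet drops repeats while remembering insertion order.
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" 200 4463",
            "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" 200 4463",
            "244.157.45.12 - - [17/Oct/2014:21:59:59 ] \"GET /Store/cart.jsp?productID=1023 HTTP/1.0\" 200 3757");
        System.out.println(dedupe(lines).size() + " distinct records");
    }
}
```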
Filtering – filter out invalid records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
Sessionization
Website visit
Visitor 1 Session 1
Visitor 1 Session 2
Visitor 2 Session 1
> 30 minutes
Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
  • i.e. what percentage of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are leading to most sessions?
  • Which ones of those lead to most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for most conversions?
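To make the bounce-rate question concrete, here is a small worked sketch. It assumes a "bounce" is a session with exactly one page view and that sessionization has already produced a page-view count per session; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class BounceRate {
    /**
     * Percentage of sessions that contain exactly one page view.
     * Each list element is the number of page views in one session.
     */
    static double bounceRatePercent(List<Integer> pageViewsPerSession) {
        if (pageViewsPerSession.isEmpty()) {
            return 0.0;
        }
        long bounced = pageViewsPerSession.stream().filter(n -> n == 1).count();
        return 100.0 * bounced / pageViewsPerSession.size();
    }

    public static void main(String[] args) {
        // Four sessions, two of which never went past the landing page.
        System.out.println(bounceRatePercent(Arrays.asList(1, 3, 1, 5)) + "%");
    }
}
```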
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user
2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
#1 – Which clicks are from same user?
• We can use:
  • IP address (244.157.45.12)
  • Cookies (A9A3BECE0563982D)
  • IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
#1 – Which clicks are from same user?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
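The 30-minute rule can be sketched in plain Java, independent of MapReduce. The sketch assumes the clicks all belong to one user and are already sorted by time; the class name GapSessionizer and the choice of a zero-based session index are illustrative. The two clicks above (21:08:30 and 21:59:59) are 51 minutes 29 seconds apart, so they fall into different sessions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GapSessionizer {
    // 30 minutes, expressed in milliseconds.
    static final long SESSION_TIMEOUT_IN_MS = 30L * 60 * 1000;

    /**
     * Assigns a zero-based session index to each click.
     * clickTimesMs must hold one user's click times in ascending order.
     */
    static List<Integer> assignSessions(List<Long> clickTimesMs) {
        List<Integer> sessions = new ArrayList<>();
        int currentSession = 0;
        Long lastClick = null;
        for (long click : clickTimesMs) {
            // A gap longer than the timeout starts a new session.
            if (lastClick != null && click - lastClick > SESSION_TIMEOUT_IN_MS) {
                currentSession++;
            }
            sessions.add(currentSession);
            lastClick = click;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Milliseconds since midnight for 21:08:30 and 21:59:59.
        long first = (21 * 3600 + 8 * 60 + 30) * 1000L;
        long second = (21 * 3600 + 59 * 60 + 59) * 1000L;
        System.out.println(assignSessions(Arrays.asList(first, second)));
    }
}
```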
Sessionization in MapReduce
github.com/hadooparchitecturebook/clickstream-tutorial
Map: log line → (IP, log line)
Reduce: (IP, log lines) → (log line, session ID)
Mapper for Sessionization
public static class SessionizeMapper
    extends Mapper<Object, Text, IpTimestampKey, Text> {
  private Matcher logRecordMatcher;

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    logRecordMatcher = logRecordPattern.matcher(value.toString());
    // We only emit something out if the record matches with our regex.
    // Otherwise, we assume the record is busted and simply ignore it
    if (logRecordMatcher.matches()) {
      String ip = logRecordMatcher.group(1);
      DateTime timestamp = DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER);
      Long unixTimestamp = timestamp.getMillis();
      IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
      context.write(outputKey, value);
    }
  }
}
Reducer for Sessionization
public static class SessionizeReducer
    extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
  private Text result = new Text();

  public void reduce(IpTimestampKey key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The sessionId generated here is per day, per IP. So, any queries
    // that will be done as if this session ID were global, would require
    // a combination of the day in question and IP as well.
    String sessionId = null;
    Long lastTimeStamp = null;
    for (Text value : values) {
      String logRecord = value.toString();
      // Hadoop updates `key` as we iterate over the values, so
      // key.getUnixTimestamp() is the timestamp of the current record.
      // If this is the first record for this user or it's been more than the timeout since
      // the last click from this user, let's increment the session ID.
      if (lastTimeStamp == null
          || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
        sessionId = key.getIp() + "+" + key.getUnixTimestamp();
      }
      lastTimeStamp = key.getUnixTimestamp();
      result.set(logRecord + " " + sessionId);
      // Since we only care about printing out the entire record in the result, with session ID appended
      // at the end, we just emit out "null" for the key
      context.write(null, result);
    }
  }
}
Secondary sorting – by timestamp
• Records arriving at the reducer need to be grouped by IP address and sorted by timestamp – a concept called secondary sorting
• Instead of using just IP address as map output key and reduce input key
• We use a composite key (IP, timestamp) as map output key and reduce input key
Secondary sorting – vocabulary
• Composite key – IP address, timestamp
• Natural key – IP address
• Secondary sort key – timestamp
Secondary sorting
• Custom Grouping Comparator – on Natural Key (IP)
• Custom Sort Comparator – on Composite Key (IP, timestamp)
• Custom Partitioner – on Natural Key (IP)
job.setGroupingComparatorClass(NaturalKeyComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);
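The roles of the three pieces can be simulated without a cluster. The sketch below is plain Java, not Hadoop API code: the class SecondarySortSketch, its nested IpTimestamp key, and the second IP address in the example are all hypothetical stand-ins. It shows the sort comparator ordering by the full composite key, the grouping comparator and partitioner looking only at the natural key, and the resulting groups arriving sorted by timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {
    /** Composite key: natural key (ip) plus secondary sort key (timestamp). */
    static final class IpTimestamp {
        final String ip;
        final long timestamp;

        IpTimestamp(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator: orders by the whole composite key (IP first, then timestamp).
    static final Comparator<IpTimestamp> COMPOSITE_KEY_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: compares the natural key only, so one IP forms one group.
    static final Comparator<IpTimestamp> NATURAL_KEY_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip);

    // Partitioner: hashes the natural key only, so one IP always maps to one reducer.
    static int partition(IpTimestamp key, int numReducers) {
        return (key.ip.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sorts by composite key, then splits into groups wherever the natural key changes. */
    static List<List<IpTimestamp>> sortAndGroup(List<IpTimestamp> keys) {
        List<IpTimestamp> sorted = new ArrayList<>(keys);
        sorted.sort(COMPOSITE_KEY_ORDER);
        List<List<IpTimestamp>> groups = new ArrayList<>();
        IpTimestamp previous = null;
        for (IpTimestamp key : sorted) {
            if (previous == null || NATURAL_KEY_ORDER.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<IpTimestamp>> groups = sortAndGroup(Arrays.asList(
            new IpTimestamp("244.157.45.12", 79199000L),
            new IpTimestamp("10.0.0.7", 5000L),          // hypothetical second visitor
            new IpTimestamp("244.157.45.12", 76110000L)));
        for (List<IpTimestamp> group : groups) {
            StringBuilder line = new StringBuilder(group.get(0).ip + ":");
            for (IpTimestamp key : group) {
                line.append(' ').append(key.timestamp);
            }
            System.out.println(line);
        }
    }
}
```

If the partitioner hashed the composite key instead, clicks from the same IP could land on different reducers and the grouping step would never see them together.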
Final Architecture
Hadoop Cluster
BI/Visualization tool (e.g. MicroStrategy)
BI Analysts
Spark for machine learning and graph processing
R/Python Statistical Analysis
Custom Apps
3. Accessing
2. Processing
4. Orchestration via Oozie
1. Ingestion
Operational Data Store
CRM System via Sqoop
Web servers
Website users
Final Architecture – High Level Overview
Data Sources
Ingestion
Data Storage/Processing
Data Reporting/Analysis
Final Architecture – Ingestion
Web App + Avro Agent (×8)
Flume Agent (×4)
Fan-in Pattern
Multi Agents for Failover and rolling restarts
HDFS
Final Architecture – Storage and Processing
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Data Processing
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…
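The weblog directory names look like one directory per day in yyyyMMdd form. Under that assumption, a path for a given day could be built as below; the class and method names are hypothetical and the layout is inferred from the two example paths only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class WeblogPaths {
    /** Builds the per-day raw weblog directory, e.g. for 31 March 2014. */
    static String dailyDirectory(LocalDate day) {
        // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd with no separators.
        return "/etl/weblogs/" + day.format(DateTimeFormatter.BASIC_ISO_DATE) + "/";
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2014, 3, 31);
        System.out.println(dailyDirectory(day));
        System.out.println(dailyDirectory(day.plusDays(1)));
    }
}
```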
Final Architecture – Data Access
Hive/Impala
BI/Analytics Tools
DWH Sqoop
Local Disk
R, etc.
DB import tool
JDBC/ODBC
Contact info
• Mark Grover
• @mark_grover
• www.linkedin.com/in/grovermark
• Slides at slideshare.net/markgrover