Architecting Applications with Hadoop - Using Clickstream Analytics as an Example
TRANSCRIPT
Application Architectures with Hadoop
Northern Colorado Big Data Meetup, October 8, 2015
tiny.cloudera.com/app-arch-ft-collins
Mark Grover | @mark_grover
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
©2014 Cloudera, Inc. All Rights Reserved.
About Me
• Mark
  – Software Engineer
  – Engineer on Apache Spark
  – Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating)
  – Contributor to Hadoop, Hive, Spark, Sqoop, and Flume
Case Study
Clickstream Analysis
Analytics
Analytics
Web Logs – Combined Log Format
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
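Log lines like the ones above can be split into fields with a regular expression. A minimal sketch in Python (the field names and the `parse_log_line` helper are our own choices, not from the slides):

```python
import re

# Combined Log Format: client IP, identity, user, timestamp, request line,
# status, response size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identd>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Parse one Combined Log Format line into a dict, or None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Returning `None` for malformed lines also gives the later filtering step a simple way to drop incomplete records.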
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Challenges of Hadoop Implementation
Challenges of Hadoop Implementation
Hadoop Architectural Considerations
• Storage managers?
  – HDFS? HBase?
• Data storage and modeling:
  – File formats? Compression? Schema design?
• Data movement
  – How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  – How do we manage data about the data?
• Data access and processing
  – How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration
  – How do we manage the workflow for all of this?
Architectural Considerations
Data Storage and Modeling
Data Modeling Considerations
• We need to consider the following in our architecture:
  – Storage layer – HDFS? HBase? Etc.
  – File system schemas – how will we lay out the data?
  – File formats – what storage formats to use for our data, both raw and processed?
  – Data compression formats?
Architectural Considerations
Data Modeling – Storage Layer
Data Storage Layer Choices
• Two likely choices for raw data: HDFS and HBase
Data Storage Layer Choices
HDFS:
• Stores data directly as files
• Fast scans
• Poor random reads/writes
HBase:
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes
Data Storage – Storage Manager Considerations
• Incoming raw data:
  – Processing requirements call for batch transformations across multiple records – for example, sessionization.
• Processed data:
  – Access to processed data will be via analytical queries – again requiring access to multiple records.
• We choose HDFS
  – Processing needs in this case are served better by fast scans.
Architectural Considerations
Data Modeling – Data Storage Format
Our Format Choices…
• Raw data
  – Avro with Snappy compression
• Processed data
  – Parquet
Architectural Considerations
Data Modeling – HDFS Schema Design
Recommended HDFS Schema Design
• How to lay out data on HDFS?
Recommended HDFS Schema Design
/etl – data in various stages of the ETL workflow
/data – shared data for the entire organization
/tmp – temporary data from tools, or data shared between users
/user/<username> – user-specific data, jars, config files
/app – everything but data: UDF jars, HQL files, Oozie workflows
Architectural Considerations
Data Modeling – Advanced HDFS Schema Design
Partitioning
Un-partitioned HDFS directory structure:
dataset/file1.txt, file2.txt, …, filen.txt
Partitioned HDFS directory structure:
dataset/col=val1/file.txt, col=val2/file.txt, …, col=valn/file.txt
Partitioning Considerations
• What column to partition by?
  – Don't have too many partitions (< 10,000)
  – Don't have too many small files in the partitions
  – Good to have partition sizes of at least ~1 GB
• We'll partition by timestamp. This applies to both our raw and processed data.
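A timestamp-based partition layout can be sketched as follows (the `year=/month=/day=` column names and the `/etl/weblogs` root are illustrative assumptions, not from the slides):

```python
from datetime import datetime

def partition_path(root, ts):
    """Build the HDFS partition directory for a record's timestamp.

    One directory per day keeps partition counts low (365/year, well
    under 10,000) while each partition stays large enough for fast scans.
    """
    return f"{root}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"

# A click from 17/Oct/2014 lands in:
#   /etl/weblogs/year=2014/month=10/day=17
```

Daily partitions are a judgment call: hourly partitions on low-traffic sites would produce many small files, violating the guidance above.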
Architectural Considerations
Data Ingestion
File Transfers
• "hadoop fs -put <file>"
• Reliable, but not resilient to failure.
• Other options include mountable HDFS, for example NFSv3.
Streaming Ingestion
• Flume
  – Reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs.
• Kafka
  – Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
Flume:
• Purpose-built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in flight.
Kafka:
• General pub-sub messaging framework.
• Just a message transport.
• Requires a third-party tool to ingest into Hadoop.
Flume and Kafka
• Kafka Source
• Kafka Channel
Short Intro to Flume
Flume Agent:
• Sources – Twitter, logs, JMS, webserver
• Interceptors – mask, re-format, validate…
• Selectors – DR, critical
• Channels – memory, file, Kafka
• Sinks – HDFS, HBase, Solr
A Brief Discussion of Flume Patterns – Fan-in
• A Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
Ingestion Decisions
• Historical Data
  – File transfer
• Incoming Data
  – Flume with the spooling directory source
• Relational Data Sources – ODS, CRM, etc.
  – Sqoop
Architectural Considerations
Data Processing – Engines
Processing Engines
• MapReduce
• Abstractions – Pig, Hive, Cascading, Crunch
• Spark
• Impala
MapReduce
• Oldie but goodie
• Restrictive framework / innovative workarounds
• Extreme batch
MapReduce Basic High Level
[Diagram: a Mapper reads a block of data from HDFS (replicated), writing temp spill data and partitioned sorted data to the native file system; the Reducer makes a local copy of that data and writes the output file back to HDFS.]
Abstractions
• SQL
  – Hive
• Script/Code
  – Pig: Pig Latin
  – Crunch: Java/Scala
  – Cascading: Java/Scala
Spark
• The new kid that isn't that new anymore
• Easily 10x less code
• Extremely easy and powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG engine
Impala
• Real-time, open source, MPP-style engine for Hadoop
• Doesn't build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses the Hive metastore for metadata
• Access available via JDBC/ODBC
Architectural Considerations
Data Processing – What processing needs to happen?
What processing needs to happen?
• Sessionization
• Filtering
• Deduplication
• BI / Discovery
Sessionization
[Diagram: a website-visit timeline. Visitor 1's Session 1 and Session 2 are separated by a gap of > 30 minutes; Visitor 2 has a single Session 1.]
Why sessionize?
Helps answer questions like:
• What is my website's bounce rate?
  – i.e. what percentage of visitors don't go past the landing page?
• Which marketing channels (e.g. organic search, display ads, etc.) lead to the most sessions?
  – Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Attribution analysis – which channels are responsible for the most conversions?
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user
2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
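The two steps above can be sketched engine-agnostically in Python. The 30-minute timeout comes from the slides; the tuple layout and `sessionize` name are our own:

```python
from itertools import groupby

SESSION_TIMEOUT = 30 * 60  # seconds; a gap longer than this starts a new session

def sessionize(clicks):
    """Assign session numbers to clicks.

    Each click is a (user_key, epoch_seconds) tuple, where user_key might be
    IP + user agent, or a cookie. Returns (user_key, epoch_seconds, session)
    tuples, with session numbering restarting at 0 for each user.
    """
    out = []
    # Step 1: group clicks by user (sorting also orders each user's clicks by time).
    for user, user_clicks in groupby(sorted(clicks), key=lambda c: c[0]):
        session = 0
        prev_ts = None
        # Step 2: within a user, a gap > 30 minutes starts a new session.
        for _, ts in user_clicks:
            if prev_ts is not None and ts - prev_ts > SESSION_TIMEOUT:
                session += 1
            out.append((user, ts, session))
            prev_ts = ts
    return out
```

In MapReduce the same shape appears naturally: the shuffle groups by user key, and the reducer walks each user's time-sorted clicks applying the timeout.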
#1 – Which clicks are from the same user?
• We can use:
  – IP address (244.157.45.12)
  – Cookies (A9A3BECE0563982D)
  – IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
#1 – Which clicks are from the same user?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
Sessionization engine recommendation
• We have sessionization code in MR and Spark on GitHub. The complexity of the code varies, depending on the expertise in the organization.
• We choose MR, since it's fairly simple and maintainable code.
Filtering – filter out incomplete records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
Filtering – filter out records from bots/spiders
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
209.85.238.11 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
Google spider IP address
Filtering recommendation
• Bot/spider filtering can be done easily in any of the engines
• Incomplete records are harder to filter in schema systems like Hive, Impala, Pig, etc.
• Pretty close choice between MR, Hive, and Spark
• Can be done in Flume interceptors as well
• We can simply embed this in our sessionization job
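Embedded in the sessionization job, the filter reduces to a predicate per record. A minimal sketch (the bot IP set, required fields, and `keep_record` name are illustrative assumptions; real bot filtering would also match user agent strings and use a maintained bot list):

```python
# Hypothetical bot list; 209.85.238.11 is the spider address from the slides.
KNOWN_BOT_IPS = {"209.85.238.11"}
REQUIRED_FIELDS = ("ip", "timestamp", "request", "status")

def keep_record(record):
    """Return True for complete, non-bot records (record is a dict of fields)."""
    if record.get("ip") in KNOWN_BOT_IPS:
        return False
    # Incomplete records: any required field missing or empty.
    return all(record.get(field) for field in REQUIRED_FIELDS)
```

The same predicate could run inside a Flume interceptor instead, dropping bad events before they ever reach HDFS.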
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Deduplication recommendation
• Can be done in all engines.
• We already have a Hive table with all the columns; a simple DISTINCT query will perform deduplication.
• We use Pig.
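The slides do this with a DISTINCT over all columns (in Hive or Pig); the equivalent logic, shown in Python for illustration (the `deduplicate` helper is our own), is just set membership over whole records:

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence.

    Equivalent in spirit to SELECT DISTINCT over all columns; each record
    is a tuple of all column values. Order of first occurrences is preserved,
    unlike a typical DISTINCT, which makes no ordering promise.
    """
    seen = set()
    out = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out
```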
BI/Discovery engine recommendation
• Main requirements for this are:
  – Low latency
  – SQL interface (e.g. JDBC/ODBC)
  – Users don't know how to code
• We chose Impala
  – It's a SQL engine
  – Much faster than other engines
  – Provides standard JDBC/ODBC interfaces
Architectural Considerations
Orchestration
Choosing…
• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
→ Easier in Oozie
Choosing the right Orchestration Tool
• Workflow is fairly simple
• Need to trigger workflow based on data
• Be able to recover from errors
• Perhaps notify on the status
• And collect metrics for reporting
→ Better in Azkaban
Important Decision Consideration!
• The best orchestration tool is the one you are an expert on
  – Oozie
  – Spark Streaming, etc. don't require an orchestration tool
Putting It All Together
Final Architecture
Final architecture
[Diagram: 1. Ingestion – web logs from the web servers (website users) arrive via Flume; the operational data store and CRM system are imported via Sqoop – all landing in the Hadoop cluster. 2. Processing happens on the cluster. 3. Accessing – a BI/visualization tool (e.g. MicroStrategy) for BI analysts, Spark for machine learning and graph processing, R/Python for statistical analysis, and custom apps. 4. Orchestration coordinates the whole workflow.]
Thank you