application architectures with hadoop and sessionization in mr
Post on 19-Aug-2014
Application Architectures with Apache Hadoop
Mark Grover | @mark_grover
East Bay JUG | hadooparchitecturebook.com | June 25th, 2014
©2014 Cloudera, Inc. All Rights Reserved.
About Me
• Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
• Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
• Software developer at Cloudera
• @mark_grover
Co-authoring O’Reilly book
• @hadooparchbook
• hadooparchitecturebook.com
What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
✓ Scalable
✓ Fault tolerant
✓ Distributed
CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Has the Flexibility to Store and Mine Any Type of Data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at Processing Complex Data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales Economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
Click Stream Analysis
Case Study
Analytics
Web Logs
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Click Stream Analysis (Before Hadoop)
[Diagram: Web logs, ODS, Data Warehouse, and Business Intelligence, connected by Extract, Transform, Load, Transform, and Query steps]
The Problems (Before Hadoop)
[Diagram: the same pipeline (Web logs, ODS, Data Warehouse, Business Intelligence), with problem areas marked 1 and 2]
1. Slow Data Transformations = Missed ETL SLAs.
2. Slow Queries = Frustrated Business Users.
Click Stream Analysis (After Hadoop)
[Diagram: the earlier pipeline (Web logs, ODS, Data Warehouse, Business Intelligence) with two elements crossed out, followed by a shorter pipeline: Web logs and ODS, Transform, Query, Business Intelligence]
Open questions marked on the diagram:
• Flume or Sqoop? (shown twice)
• Hive/Impala/MR?
• Hive/Impala/Spark?
Challenges of Hadoop Implementation
Other challenges - Architectural Considerations
• Storage managers?
  • HDFS? HBase?
• Data storage and modeling:
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Data access and processing
  • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?
High level Design
Clickstream Analytics
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
[Architecture diagram: a Hadoop Cluster with four numbered stages]
1. Ingestion: Website users; Web servers; Operational Data Store; CRM System (via Sqoop)
2. Processing
3. Accessing: BI/Visualization tool (e.g. MicroStrategy) for BI Analysts; Spark for machine learning and graph processing; R/Python for statistical analysis; Custom Apps
4. Orchestration via Oozie
2. Processing
(Since that’s of most interest to this audience)
Processing
• De-duplication
• Filtering
• Sessionization
Deduplication – remove duplicate records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
Filtering – filter out invalid records
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
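The two clean-up steps above can be sketched in plain Java, outside of Hadoop. This is only an illustration: the class name `LogCleaner` and the regular expression are assumptions made for this example, not code from the talk, and an in-memory set only works for small inputs (at cluster scale the same idea would run as a MapReduce or Hive job).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class LogCleaner {
    // Hypothetical pattern for the log format shown on the slides:
    // IP, two dashes, [timestamp], "request", status code, byte count, then the rest.
    private static final Pattern VALID_RECORD = Pattern.compile(
        "^\\d{1,3}(?:\\.\\d{1,3}){3} - - \\[[^\\]]+\\] \"[^\"]*\" \\d{3} \\d+ .*$");

    /** Filters out records that do not match the pattern, then drops exact duplicates. */
    static List<String> clean(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>(); // preserves first-seen order
        for (String line : lines) {
            if (VALID_RECORD.matcher(line).matches()) {
                unique.add(line);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        String ok = "1.2.3.4 - - [17/Oct/2014:21:08:30 ] \"GET /a HTTP/1.0\" 200 10 \"-\" \"UA\"";
        // Two copies of a valid record plus one malformed record.
        System.out.println(clean(Arrays.asList(ok, ok, "not a log line")));
    }
}
```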
Sessionization
[Diagram: website visits on a timeline, grouped into Visitor 1 Session 1, Visitor 1 Session 2, and Visitor 2 Session 1, with a gap labeled "> 30 minutes"]
Why sessionize?
Helps answer questions like:
• What is my website’s bounce rate?
  • i.e. what percentage of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
  • Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
• Do attribution analysis – which channels are responsible for the most conversions?
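As a small worked example of the first question, bounce rate can be computed once every click carries a session ID. The sketch below assumes the clicks have already been counted per session; the class name `BounceRate` and the map layout are hypothetical, and a "bounce" is taken to mean a session with exactly one click.

```java
import java.util.HashMap;
import java.util.Map;

final class BounceRate {
    /** Fraction of sessions that consist of a single click (0.0 for no sessions). */
    static double bounceRate(Map<String, Integer> clicksPerSession) {
        if (clicksPerSession.isEmpty()) {
            return 0.0;
        }
        long bounced = clicksPerSession.values().stream()
            .filter(clicks -> clicks == 1)
            .count();
        return (double) bounced / clicksPerSession.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> clicksPerSession = new HashMap<>();
        clicksPerSession.put("session-165", 1); // landing page only
        clicksPerSession.put("session-166", 4);
        System.out.println(bounceRate(clicksPerSession));
    }
}
```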
Sessionization
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166
How to Sessionize?
1. Given a list of clicks, determine which clicks came from the same user
2. Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session
#1 – Which clicks are from the same user?
• We can use:
  • IP address (244.157.45.12)
  • Cookies (A9A3BECE0563982D)
  • IP address (244.157.45.12) and user agent string ("(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
#2 – Which clicks are part of the same session?
244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
> 30 mins apart = different sessions
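The 30-minute rule can be sketched for a single user in plain Java before bringing MapReduce into the picture. The class and method names here are hypothetical; the only inputs assumed are that user's click timestamps in milliseconds, already sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionAssigner {
    static final long SESSION_TIMEOUT_IN_MS = 30L * 60 * 1000;

    /** Returns one session ID per click; IDs start at 1 and grow after each long gap. */
    static List<Integer> assignSessions(List<Long> sortedTimestampsMs) {
        List<Integer> sessionIds = new ArrayList<>();
        int sessionId = 0;
        Long lastTimestamp = null;
        for (long timestamp : sortedTimestampsMs) {
            // First click, or more than the timeout since the previous click: new session.
            if (lastTimestamp == null || timestamp - lastTimestamp > SESSION_TIMEOUT_IN_MS) {
                sessionId++;
            }
            sessionIds.add(sessionId);
            lastTimestamp = timestamp;
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        // Clicks at 0 min, 1 min, and 32 min: the last one is 31 min after the second.
        System.out.println(assignSessions(Arrays.asList(0L, 60_000L, 1_920_000L)));
    }
}
```

The two clicks on the slide, at 21:08:30 and 21:59:59, are more than 51 minutes apart, so under this rule they land in different sessions.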
Intro to MapReduce
MapReduce
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort each partition
• Reduce – Apply an aggregation function to the values of each key in each partition
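To make the three phases concrete, here is a single-process imitation in plain Java that counts clicks per IP address. It is a teaching sketch only: it uses no Hadoop classes, the name `MiniMapReduce` is made up for this example, and a real job would spread each phase across many machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    static Map<String, Integer> clicksPerIp(List<String> logLines) {
        // Map: turn every input record into a (key, value) pair, here (IP, 1).
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : logLines) {
            String ip = line.split(" ", 2)[0];
            mapOutput.add(new AbstractMap.SimpleEntry<>(ip, 1));
        }
        // Shuffle & sort: group the values by key, with keys kept in sorted order.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce: aggregate all values that share a key, here by summing them.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            result.put(group.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clicksPerIp(Arrays.asList(
            "1.1.1.1 GET /a", "2.2.2.2 GET /b", "1.1.1.1 GET /c")));
    }
}
```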
Sessionization in MapReduce
github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch06/MRSessionize
[Diagram: three Map tasks each take a log line and emit (IP, log line) pairs; one Reduce task receives (IP1, log lines) and another receives (IP2, log lines); each Reduce task emits (log line, session ID)]
Mapper for Sessionization
public static class SessionizeMapper
    extends Mapper<Object, Text, IpTimestampKey, Text> {
  // logRecordPattern and TIMESTAMP_FORMATTER are assumed to be defined
  // elsewhere in the enclosing class; they are not shown on this slide.
  private Matcher logRecordMatcher;

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    logRecordMatcher = logRecordPattern.matcher(value.toString());
    // We only emit something if the record matches our regex. Otherwise,
    // we assume the record is busted and simply ignore it.
    if (logRecordMatcher.matches()) {
      String ip = logRecordMatcher.group(1);
      DateTime timestamp =
          DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER);
      Long unixTimestamp = timestamp.getMillis();
      // Composite key (IP, timestamp) so the framework can sort by timestamp.
      IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
      context.write(outputKey, value);
    }
  }
}
Reducer for Sessionization
public static class SessionizeReducer
    extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
  private Text result = new Text();
  private static int sessionId = 0;

  public void reduce(IpTimestampKey key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Local to reduce(), which is called once per IP, so the first click
    // of every user always starts a new session.
    Long lastTimeStamp = null;
    for (Text value : values) {
      String logRecord = value.toString();
      // Hadoop updates the key as we iterate over the values, so
      // key.getUnixTimestamp() is the timestamp of the current record.
      // If this is the first record for this user or it's been more than the
      // timeout since the last click from this user, increment the session ID.
      if (lastTimeStamp == null
          || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
        sessionId++;
      }
      lastTimeStamp = key.getUnixTimestamp();
      result.set(logRecord + " " + sessionId);
      // Since we only care about printing out the entire record in the result,
      // with session ID appended at the end, we just emit "null" for the key.
      context.write(null, result);
    }
  }
}
Secondary sorting – by timestamp
• Records need to reach the reducer grouped by IP address and sorted by timestamp – a concept called secondary sorting
• Instead of using just the IP address as the map output key and reduce input key
  • We use a composite key (IP, timestamp) as the map output key and reduce input key
Secondary sorting – vocabulary
• Composite key – IP address, timestamp
• Natural key – IP address
• Secondary sort key – timestamp
Secondary sorting
• Custom Grouping Comparator – on Natural Key (IP)
• Custom Sort Comparator – on Composite Key (IP, timestamp)
• Custom Partitioner – on Natural Key (IP)
job.setGroupingComparatorClass(NaturalKeyComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);
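What each of the three job settings does can be shown without Hadoop. The sketch below is a plain-Java analogue, not the classes from the talk's repository: `IpTimestamp`, the two comparators, and `partition` are hypothetical stand-ins for the composite key, the sort and grouping comparators, and the partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {
    /** Composite key: natural key (ip) plus secondary sort key (timestamp). */
    static final class IpTimestamp {
        final String ip;
        final long timestamp;

        IpTimestamp(String ip, long timestamp) {
            this.ip = ip;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator: orders by the full composite key (IP, then timestamp).
    static final Comparator<IpTimestamp> COMPOSITE_KEY_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: looks at the natural key only, so all clicks
    // from one IP count as "the same key" for a single reduce call.
    static final Comparator<IpTimestamp> NATURAL_KEY_ORDER =
        Comparator.comparing((IpTimestamp k) -> k.ip);

    // Partitioner: also uses only the natural key, so every click from
    // one IP goes to the same reducer regardless of its timestamp.
    static int partition(IpTimestamp key, int numReducers) {
        return Math.floorMod(key.ip.hashCode(), numReducers);
    }

    static List<IpTimestamp> sorted(List<IpTimestamp> keys) {
        List<IpTimestamp> copy = new ArrayList<>(keys);
        copy.sort(COMPOSITE_KEY_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<IpTimestamp> keys = sorted(Arrays.asList(
            new IpTimestamp("2.2.2.2", 20L),
            new IpTimestamp("1.1.1.1", 50L),
            new IpTimestamp("1.1.1.1", 10L)));
        for (IpTimestamp k : keys) {
            System.out.println(k.ip + " " + k.timestamp + " -> reducer " + partition(k, 4));
        }
    }
}
```

If the partitioner looked at the whole composite key instead, clicks from one IP could be scattered across reducers and no single reducer would see that user's full, time-ordered click list.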
Data Storage and Modeling
Data Storage – Storage Manager considerations
• Popular storage managers for Hadoop
  • Hadoop Distributed File System (HDFS)
  • HBase
Data Storage – HDFS vs HBase
HDFS
• Stores data directly as files
• Fast scans
• Poor random reads/writes
HBase
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes
Data Storage – Storage Manager considerations
• We choose HDFS
  • Analytical needs in this case are served better by fast scans.
Data Storage Format
Data Storage – Format Considerations
• Store as plain text?
  • Sure, well supported by Hadoop.
  • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But…
  • Will begin to consume lots of space in HDFS.
  • May not be optimal for processing by tools in the Hadoop ecosystem.
Data Storage – Format Considerations
• But, we can compress the text files…
  • Gzip – supported by Hadoop, but not splittable.
  • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
  • LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
  • Snappy – provides a good tradeoff between size and speed.
Data Storage – More About Snappy
• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…
SequenceFile
• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables splittability with non-splittable compression.
Avro
• Kinda SequenceFile on Steroids.
• Self-documenting – stores schema in header.
• Provides very efficient storage.
• Supports splittable compression.
Our Format Choices…
• Avro with Snappy
  • Snappy provides optimized compression.
  • Avro provides compact storage, self-documenting files, and supports schema evolution.
  • Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  • But they only support Java.
HDFS Schema Design
Recommended HDFS Schema Design
• How to lay out data on HDFS?
/user/<username> - User specific data, jars, conf files
/etl - Data in various stages of ETL workflow
/tmp - temp data from tools or shared between users
/data - shared data for the entire organization
/app - Everything but data: UDF jars, HQL files, Oozie workflows
Advanced HDFS Schema Design
What is Partitioning?
Un-partitioned HDFS directory structure:
dataset
  file1.txt
  file2.txt
  . . .
  filen.txt
Partitioned HDFS directory structure:
dataset
  col=val1/file.txt
  col=val2/file.txt
  . . .
  col=valn/file.txt
What is Partitioning?
Un-partitioned HDFS directory structure:
clicks
  clicks-2014-01-01.txt
  clicks-2014-01-02.txt
  . . .
  clicks-2014-03-31.txt
Partitioned HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
  . . .
  dt=2014-03-31/clicks.txt
Partitioning
• Split the dataset into smaller consumable chunks
• Rudimentary form of “indexing”
• <data set name>/<partition_column_name=partition_column_value>/{files}
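The directory convention above can be illustrated with a small helper that derives a partition directory from a click's timestamp. This is a sketch under assumptions: the class `PartitionPaths` is hypothetical, the partition value is formatted as a UTC date, and in practice a tool such as Hive or Flume would usually build these paths for you.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class PartitionPaths {
    // One partition per calendar day, in UTC (an assumption for this example).
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    /** Builds <data set name>/<partition_column_name=partition_column_value>. */
    static String partitionDir(String dataset, String column, long epochMillis) {
        return dataset + "/" + column + "=" + DAY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 1388534400000 ms is midnight UTC on 1 January 2014.
        System.out.println(partitionDir("clicks", "dt", 1388534400000L));
    }
}
```

A query that filters on `dt` then only needs to read the matching directories, which is why partitioning acts as a rudimentary index.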
Partitioning considerations
• What column to partition by?
• HDFS is append only.
• Don’t have too many partitions (<10,000)
• Don’t have too many small files in the partitions (files should generally be larger than the block size)
• We decided to partition by timestamp
What is bucketing?
Un-bucketed HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
Bucketed HDFS directory structure:
clicks
  dt=2014-01-01/file0.txt
  dt=2014-01-01/file1.txt
  dt=2014-01-01/file2.txt
  dt=2014-01-01/file3.txt
  dt=2014-01-02/file0.txt
  dt=2014-01-02/file1.txt
  dt=2014-01-02/file2.txt
  dt=2014-01-02/file3.txt
Bucketing
• Hash-bucketed files within each partition based on a particular column
• Useful when sampling
• Useful in some joins; pre-reqs:
  • Datasets bucketed on the same key as the join key
  • Number of buckets is the same, or one is a multiple of the other
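The idea of hash bucketing can be sketched as follows. The class `Bucketing` and the use of `String.hashCode()` are assumptions chosen for illustration; Hive applies its own hash function, so the bucket numbers produced here would not necessarily match Hive's.

```java
final class Bucketing {
    /** Maps a bucketing column value to a bucket in [0, numBuckets). */
    static int bucketFor(String cookie, int numBuckets) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(cookie.hashCode(), numBuckets);
    }

    /** File that holds this cookie's records inside a given partition directory. */
    static String bucketFile(String partitionDir, String cookie, int numBuckets) {
        return partitionDir + "/file" + bucketFor(cookie, numBuckets) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(bucketFile("clicks/dt=2014-01-01", "A9A3BECE0563982D", 4));
    }
}
```

Because the same cookie always hashes to the same bucket number, two datasets bucketed on the join key with compatible bucket counts can be joined bucket by bucket.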
Bucketing considerations?
• Which column to bucket on?
• How many buckets?
• We decided to bucket based on cookie
De-normalizing considerations
• In general, big data joins are expensive
• When to de-normalize?
• Decided to join the smaller dimension tables
• Big fact tables are still joined
Data Ingestion
File Transfers
• “hadoop fs -put <file>”
  • Reliable, but not resilient to failure.
• Other options are mountable HDFS, for example NFSv3.
Streaming Ingestion
• Flume
  • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
• Kafka
  • Reliable and distributed publish-subscribe messaging system.
Flume vs. Kafka
Flume
• Purpose built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in-flight.
Kafka
• General pub-sub messaging framework.
• Hadoop not supported, requires 3rd-party component (Camus).
• Just a message transport (a very fast one).
Flume vs. Kafka
• Bottom line:
  • Flume is very well integrated with the Hadoop ecosystem and well suited to ingestion of sources such as log files.
  • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
A Quick Introduction to Flume
[Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]
• External Source: Web Server, Twitter, JMS, System logs, …
• Source: consumes events and forwards to channels
• Channel: stores events until consumed by sinks – file, memory, JDBC
• Sink: removes event from channel and puts into external destination
• Flume Agent: JVM process hosting components
A Quick Introduction to Flume
• Reliable – events are stored in channel until delivered to next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.
[Diagram: Flume Agent (Source → Channel → Sink) → Destination]
A Quick Introduction to Flume
• Declarative
  • No coding required.
  • Configuration specifies how components are wired together.
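As an illustration of the declarative style, a minimal agent configuration might look like the fragment below. It wires a spooling directory source to an HDFS sink through a file channel. The agent and component names (`agent1`, `src1`, `ch1`, `sink1`) and all paths are hypothetical, chosen for this example rather than taken from the talk.

```properties
# Name the components of this agent (names are arbitrary).
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: pick up completed log files dropped into a local directory.
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/weblogs/spool
agent1.sources.src1.channels = ch1

# Channel: persist events on disk so they survive an agent restart.
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

# Sink: write events to a dated directory in HDFS.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /etl/weblogs/%Y%m%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
```

The date escapes in `hdfs.path` need a timestamp for each event; `hdfs.useLocalTimeStamp = true` supplies the agent's local time when events carry no timestamp header.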
A Brief Discussion of Flume Patterns – Fan-in
• Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
A Brief Discussion of Flume Patterns – Splitting
• Common need is to split data on ingest.
• For example:
  • Sending data to multiple clusters for DR.
  • To multiple destinations.
• Flume also supports partitioning, which is key to our implementation.
Sqoop Overview
• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.
Ingestion Decisions
• Historical Data
  • Smaller files: file transfer
  • Larger files: Flume with spooling directory source.
• Incoming Data
  • Flume with the spooling directory source.
Data Processing and Access
Data flow
[Diagram: Raw data, Partitioned clickstream data, Other data (Financial, CRM, etc.), Aggregated dataset #1, Aggregated dataset #2]
Data processing tools
• Hive
• Impala
• Pig, etc.
Hive
• Open source data warehouse system for Hadoop
• Converts SQL-like queries to MapReduce jobs
  • Work is being done to move this away from MR
• Stores metadata in Hive metastore
• Can create tables over HDFS or HBase data
• Access available via JDBC/ODBC
Impala
• Real-time open source SQL query engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC
Pig
• Higher level abstraction over MapReduce (like Hive)
• Write transformations in scripting language – Pig Latin
• Can access Hive metastore via HCatalog for metadata
Data Processing considerations
• We chose Hive for ETL and Impala for interactive BI.
Metadata Management
What is Metadata?
• Metadata is data about the data
  • Format in which data is stored
  • Compression codec
  • Location of the data
  • Is the data partitioned/bucketed/sorted?
Metadata in Hive
[Diagram: Hive Metastore]
Metadata
• Hive metastore has become the de-facto metadata repository
• HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
Hive + HCatalog
Orchestration
• Once the data is in Hadoop, we need a way to manage workflows in our architecture.
  • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
• Several options here:
  • Cron
  • Oozie, Azkaban
  • 3rd-party tools: Talend, Pentaho, Informatica, enterprise schedulers.
Oozie
• Supports defining and executing a sequence of jobs.
• Can trigger jobs based on external dependencies or schedules.
Final Architecture
Final Architecture – High Level Overview
[Diagram: Data Sources, Ingestion, Data Storage/Processing, Data Reporting/Analysis]
Final Architecture – Ingestion
[Diagram: eight Web App hosts, each with an Avro Agent, fan in to four Flume Agents, which write to HDFS]
• Fan-in Pattern
• Multi Agents for Failover and rolling restarts
Final Architecture – Storage and Processing
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Data Processing
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…
Final Architecture – Data Access
[Diagram: Hive/Impala, BI/Analytics Tools, DWH, Local Disk, and R, etc., connected via JDBC/ODBC, Sqoop, and a DB import tool]
Contact info
• Mark Grover
• @mark_grover
• www.linkedin.com/in/grovermark
• Slides at slideshare.net/markgrover