application architectures with hadoop and sessionization in mr


Application Architectures with Apache Hadoop
Mark Grover | @mark_grover
East Bay JUG
hadooparchitecturebook.com
June 25th, 2014

About Me
•  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
•  Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
•  Software developer at Cloudera
•  @mark_grover

Co-authoring O’Reilly book
•  @hadooparchbook
•  hadooparchitecturebook.com

What is Apache Hadoop?

Has the Flexibility to Store and Mine Any Type of Data
•  Ask questions across structured and unstructured data that were previously impossible to ask or solve
•  Not bound by a single schema

Excels at Processing Complex Data
•  Scale-out architecture divides workloads across multiple nodes
•  Flexible file system eliminates ETL bottlenecks

Scales Economically
•  Can be deployed on commodity hardware
•  Open source platform guards against vendor lock-in

CORE HADOOP SYSTEM COMPONENTS
•  Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
•  MapReduce: distributed computing framework

Apache Hadoop is an open source platform for data storage and processing that is…
•  Scalable
•  Fault tolerant
•  Distributed

Click Stream Analysis

Case Study

Analytics

Web Logs

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"

Clickstream Analytics

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Click Stream Analysis (Before Hadoop)

[Diagram: Web logs → Extract, Transform, Load → Data Warehouse (Transform) → Query → Business Intelligence]

The Problems (Before Hadoop)

[Diagram: Web logs → Transform → Load → Data Warehouse → Business Intelligence]

Slow Data Transformations = Missed ETL SLAs.

Slow Queries = Frustrated Business Users.

Click Stream Analysis (After Hadoop)

[Diagram: the old path (Web logs → Transform → Load → Data Warehouse → Business Intelligence) has its Transform and Load steps crossed out; in the new path, web logs are transformed inside Hadoop and served to Business Intelligence.]

Flume or Sqoop?

Hive/Impala/MR?

Hive/Impala/Spark?

Challenges of Hadoop Implementation

Other challenges - Architectural Considerations

•  Storage managers?
    •  HDFS? HBase?
•  Data storage and modeling:
    •  File formats? Compression? Schema design?
•  Data movement
    •  How do we actually get the data into Hadoop? How do we get it out?
•  Metadata
    •  How do we manage data about the data?
•  Data access and processing
    •  How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
•  Orchestration
    •  How do we manage the workflow for all of this?

High level Design

[Architecture diagram: website users hit the web servers, whose clickstream logs flow into the Hadoop cluster (1. Ingestion), alongside data from the operational data store and the CRM system via Sqoop. Inside the cluster the data goes through 2. Processing. 3. Accessing covers BI/visualization tools (e.g. MicroStrategy) used by BI analysts, Spark for machine learning and graph processing, R/Python for statistical analysis, and custom apps. 4. Orchestration is handled via Oozie.]

Since that’s of most interest to this audience

2. Processing

Processing

•  De-duplication
•  Filtering
•  Sessionization

Deduplication – remove duplicate records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

Filtering – filter out invalid records

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36"

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U…
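The de-duplication and filtering steps can be sketched in plain Java, outside of Hadoop. This is only an illustration: the class name `CleanClicks`, the method `dedupeAndFilter`, and the regular expression are assumptions made for this example, not the pattern used in the talk's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class CleanClicks {
    // Assumed pattern for the log layout shown above:
    // ip, two "-" fields, [timestamp], "request", status, bytes, "referrer", "user agent".
    static final Pattern LOG_RECORD = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    // Drops exact duplicate lines (keeping first-seen order), then drops
    // every line that does not match the expected record layout.
    static List<String> dedupeAndFilter(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>(lines);
        List<String> valid = new ArrayList<>();
        for (String line : unique) {
            if (LOG_RECORD.matcher(line).matches()) {
                valid.add(line);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // Abbreviated sample record (shortened user agent) for illustration.
        String click = "244.157.45.12 - - [17/Oct/2014:21:08:30 ] \"GET /seatposts HTTP/1.0\" "
            + "200 4463 \"http://bestcyclingreviews.com/top_online_shops\" \"Mozilla/5.0\"";
        List<String> cleaned = dedupeAndFilter(Arrays.asList(click, click, "not a log line"));
        System.out.println(cleaned.size()); // prints 1
    }
}
```

On a real cluster the same two steps would run as a distributed job; the in-memory set here only works because the sample is tiny.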

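Sessionization

[Diagram: website visits on a timeline, grouped into Visitor 1 Session 1, Visitor 1 Session 2, and Visitor 2 Session 1; a gap of more than 30 minutes separates Visitor 1's two sessions.]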

Why sessionize?

Helps answer questions like:
•  What is my website’s bounce rate?
    •  i.e. what percentage of visitors don’t go past the landing page?
•  Which marketing channels (e.g. organic search, display ad, etc.) are leading to the most sessions?
•  Which of those lead to the most conversions (e.g. people buying things, signing up, etc.)?
•  Attribution analysis: which channels are responsible for the most conversions?
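Once clicks carry a session ID, a metric such as bounce rate becomes a simple ratio. The sketch below assumes a bounce is a session with exactly one page view; the class and method names are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;

class BounceRate {
    // Percentage of sessions that contain exactly one page view.
    // Input: the number of page views in each session.
    static double bounceRatePercent(List<Integer> pageViewsPerSession) {
        if (pageViewsPerSession.isEmpty()) {
            return 0.0;
        }
        long bounces = pageViewsPerSession.stream().filter(n -> n == 1).count();
        return 100.0 * bounces / pageViewsPerSession.size();
    }

    public static void main(String[] args) {
        // Four sessions, two of which stopped at the landing page.
        System.out.println(bounceRatePercent(Arrays.asList(1, 3, 1, 5))); // prints 50.0
    }
}
```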

Sessionization

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36" 165

244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1" 166

How to Sessionize?

1.  Given a list of clicks, determine which clicks came from the same user

2.  Given a particular user's clicks, determine if a given click is a part of a new session or a continuation of the previous session

#1 – Which clicks are from the same user?

We can use:
•  IP address (244.157.45.12)
•  Cookies (A9A3BECE0563982D)
•  IP address (244.157.45.12) and user agent string ((KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")

#2 – Which clicks are part of the same session?

> 30 mins apart = different sessions
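The 30-minute rule can be sketched in plain Java for a single user's clicks. The sketch assumes the timestamps are already sorted in ascending order and expressed in epoch milliseconds; `Sessionizer` and `assignSessions` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Sessionizer {
    static final long SESSION_TIMEOUT_IN_MS = 30 * 60 * 1000L;

    // Returns a session number (starting at 1) for each click of one user.
    // A click starts a new session if it is the user's first click or if it
    // comes more than the timeout after the previous click.
    static List<Integer> assignSessions(List<Long> sortedTimestamps) {
        List<Integer> sessions = new ArrayList<>();
        int sessionId = 0;
        Long lastTimestamp = null;
        for (long timestamp : sortedTimestamps) {
            if (lastTimestamp == null || timestamp - lastTimestamp > SESSION_TIMEOUT_IN_MS) {
                sessionId++;
            }
            sessions.add(sessionId);
            lastTimestamp = timestamp;
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Clicks at minute 0, minute 10, and minute 50 (a 40-minute gap).
        List<Long> clicks = Arrays.asList(0L, 600_000L, 3_000_000L);
        System.out.println(assignSessions(clicks)); // prints [1, 1, 2]
    }
}
```

The MapReduce version has to solve one extra problem: getting each user's clicks to the same place, already sorted by time.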

Intro to MapReduce

MapReduce

•  Map – Apply a function to each input record
•  Shuffle & Sort – Partition the map output and sort each partition
•  Reduce – Apply aggregation function to all values in each partition
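To make the three phases concrete, here is a single-process imitation in plain Java that counts clicks per IP address. It does not use the Hadoop API; `MiniMapReduce` and its methods are hypothetical names, and a sorted in-memory map stands in for the distributed shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Map: extract the key (the IP address, first field) from one record.
    static String mapKey(String logLine) {
        return logLine.split(" ", 2)[0];
    }

    static Map<String, Integer> clicksPerIp(List<String> logLines) {
        // Map + shuffle & sort: emit (ip, 1) per record and group by key.
        // The TreeMap keeps keys sorted, as a reducer would see them.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : logLines) {
            grouped.computeIfAbsent(mapKey(line), k -> new ArrayList<>()).add(1);
        }
        // Reduce: aggregate all values that share a key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "244.157.45.12 - - first click",
            "10.0.0.7 - - another visitor",
            "244.157.45.12 - - second click");
        System.out.println(clicksPerIp(lines)); // prints {10.0.0.7=1, 244.157.45.12=2}
    }
}
```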

Sessionization in MapReduce

github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch06/MRSessionize

[Diagram: map tasks read log lines and emit (IP, log line) pairs; the shuffle delivers (IP1, log lines) and (IP2, log lines) to the reduce tasks; each reduce task emits (log line, session ID).]

Mapper for Sessionization

public static class SessionizeMapper
        extends Mapper<Object, Text, IpTimestampKey, Text> {
    private Matcher logRecordMatcher;

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        logRecordMatcher = logRecordPattern.matcher(value.toString());

        // We only emit something out if the record matches with our regex.
        // Otherwise, we assume the record is busted and simply ignore it
        if (logRecordMatcher.matches()) {
            String ip = logRecordMatcher.group(1);
            DateTime timestamp = DateTime.parse(logRecordMatcher.group(2), TIMESTAMP_FORMATTER);
            Long unixTimestamp = timestamp.getMillis();
            IpTimestampKey outputKey = new IpTimestampKey(ip, unixTimestamp);
            context.write(outputKey, value);
        }
    }
}

Reducer for Sessionization

public static class SessionizeReducer
        extends Reducer<IpTimestampKey, Text, IpTimestampKey, Text> {
    private Text result = new Text();
    private static int sessionId = 0;

    public void reduce(IpTimestampKey key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Reset on every reduce() call (one call per IP address), so that the
        // first click of each user always starts a new session.
        Long lastTimeStamp = null;
        for (Text value : values) {
            String logRecord = value.toString();
            // If this is the first record for this user or it's been more than the timeout
            // since the last click from this user, let's increment the session ID.
            if (lastTimeStamp == null
                    || (key.getUnixTimestamp() - lastTimeStamp > SESSION_TIMEOUT_IN_MS)) {
                sessionId++;
            }
            lastTimeStamp = key.getUnixTimestamp();
            result.set(logRecord + " " + sessionId);
            // Since we only care about printing out the entire record in the result, with
            // session ID appended at the end, we just emit out "null" for the key
            context.write(null, result);
        }
    }
}

Secondary sorting – by timestamp

•  Need records to reducer to be grouped by IP address and sorted by timestamp – a concept called secondary sorting
•  Instead of using just IP address as map output key and reduce input key
•  We use a composite key (IP, timestamp) as map output key and reduce input key

Secondary sorting – vocabulary

•  Composite key – IP address, timestamp
•  Natural key – IP address
•  Secondary sort key – timestamp

Secondary sorting

•  Custom Grouping Comparator – on Natural Key (IP)
•  Custom Sort Comparator – on Composite Key (IP, timestamp)
•  Custom Partitioner – on Natural Key (IP)

job.setGroupingComparatorClass(NaturalKeyComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);
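The roles of the three custom classes can be imitated in plain Java without the Hadoop API. In this sketch, `Key`, `partition`, `SORT`, `GROUP`, and `reduceInput` are hypothetical stand-ins for the composite key, partitioner, sort comparator, grouping comparator, and the framework's grouping of reducer input; they are not the classes from the talk's repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    // Composite key: natural key (ip) plus secondary sort key (timestamp).
    static final class Key {
        final String ip;
        final long timestamp;
        Key(String ip, long timestamp) { this.ip = ip; this.timestamp = timestamp; }
    }

    // Partitioner: natural key only, so all of a user's clicks reach one reducer.
    static int partition(Key key, int numReducers) {
        return (key.ip.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort comparator: full composite key (ip, then timestamp).
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.ip).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: natural key only, one reduce() call per ip.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.ip);

    // Imitates what one reducer sees: keys sorted by SORT, then cut into
    // groups wherever GROUP says the natural key changed.
    static Map<String, List<Long>> reduceInput(List<Key> mapOutput) {
        List<Key> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.put(key.ip, new ArrayList<>());
            }
            groups.get(key.ip).add(key.timestamp);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> mapOutput = Arrays.asList(
            new Key("b", 5L), new Key("a", 9L), new Key("a", 2L), new Key("b", 1L));
        System.out.println(reduceInput(mapOutput)); // prints {a=[2, 9], b=[1, 5]}
    }
}
```

Each group arrives with its timestamps already in order, which is what lets the reducer sessionize in a single pass without buffering a user's clicks in memory.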

Data Storage and Modeling

Data Storage – Storage Manager considerations

•  Popular storage managers for Hadoop
    •  Hadoop Distributed File System (HDFS)
    •  HBase

Data Storage – HDFS vs HBase

HDFS
•  Stores data directly as files
•  Fast scans
•  Poor random reads/writes

HBase
•  Stores data as HFiles on HDFS
•  Slow scans
•  Fast random reads/writes

Data Storage – Storage Manager considerations

•  We choose HDFS
    •  Analytical needs in this case served better by fast scans.

Data Storage Format

Data Storage – Format Considerations

•  Store as plain text?
    •  Sure, well supported by Hadoop.
    •  Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
•  But…
    •  Will begin to consume lots of space in HDFS.
    •  May not be optimal for processing by tools in the Hadoop ecosystem.

Data Storage – Format Considerations

•  But, we can compress the text files…
    •  Gzip – supported by Hadoop, but not splittable.
    •  Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
    •  LZO – splittable (with some work), good compress/de-compress performance. Good choice for storing text files on Hadoop.
    •  Snappy – provides a good tradeoff between size and speed.

Data Storage – More About Snappy

•  Designed at Google to provide high compression speeds with reasonable compression.

•  Not the highest compression, but provides very good performance for processing on Hadoop.

•  Snappy is not splittable though, which brings us to…

SequenceFile

• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables splittability with non-splittable compression.

Avro

• Kinda SequenceFile on Steroids.
• Self-documenting – stores schema in header.
• Provides very efficient storage.
• Supports splittable compression.

Our Format Choices…

•  Avro with Snappy
    •  Snappy provides optimized compression.
    •  Avro provides compact storage, self-documenting files, and supports schema evolution.
    •  Avro also provides better failure handling than other choices.
•  SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
    •  But SequenceFiles only support Java.

HDFS Schema Design

Recommended HDFS Schema Design

• How to lay out data on HDFS?

Recommended HDFS Schema Design

/user/<username> – User specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – temp data from tools or shared between users
/data – shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows

Advanced HDFS Schema Design

What is Partitioning?

Un-partitioned HDFS directory structure:

dataset
    file1.txt
    file2.txt
    . . .
    filen.txt

Partitioned HDFS directory structure:

dataset
    col=val1/file.txt
    col=val2/file.txt
    . . .
    col=valn/file.txt

What is Partitioning?

Un-partitioned HDFS directory structure:

clicks
    clicks-2014-01-01.txt
    clicks-2014-01-02.txt
    . . .
    clicks-2014-03-31.txt

Partitioned HDFS directory structure:

clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt
    . . .
    dt=2014-03-31/clicks.txt

Partitioning

•  Split the dataset into smaller consumable chunks
•  Rudimentary form of “indexing”
•  <data set name>/<partition_column_name=partition_column_value>/{files}
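As an illustration of the `<data set name>/<partition_column_name=partition_column_value>` layout, the following plain-Java sketch derives a daily partition directory from a click timestamp. The class name, the `dt` column, and the choice of UTC for the day boundary are assumptions made for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class PartitionPath {
    // Day-level partition value, computed in UTC (an assumption of this sketch).
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Builds "<dataset>/dt=<yyyy-MM-dd>" for a timestamp in epoch milliseconds.
    static String partitionDir(String dataset, long epochMillis) {
        return dataset + "/dt=" + DAY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 1388534400000 ms is 2014-01-01T00:00:00Z.
        System.out.println(partitionDir("clicks", 1388534400000L)); // prints clicks/dt=2014-01-01
    }
}
```

A query that filters on `dt` then only has to read the matching directories, which is the sense in which partitioning acts as a rudimentary index.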

Partitioning considerations

•  What column to partition by?
•  HDFS is append only.
•  Don’t have too many partitions (<10,000)
•  Don’t have too many small files in the partitions (files should generally be larger than the block size)
•  We decided to partition by timestamp

What is bucketing?

Un-bucketed HDFS directory structure:

clicks
    dt=2014-01-01/clicks.txt
    dt=2014-01-02/clicks.txt

Bucketed HDFS directory structure:

clicks
    dt=2014-01-01/file0.txt
    dt=2014-01-01/file1.txt
    dt=2014-01-01/file2.txt
    dt=2014-01-01/file3.txt
    dt=2014-01-02/file0.txt
    dt=2014-01-02/file1.txt
    dt=2014-01-02/file2.txt
    dt=2014-01-02/file3.txt

Bucketing

•  Hash-bucketed files within each partition based on a particular column
•  Useful when sampling
•  Useful in some joins, with these pre-reqs:
    •  Datasets bucketed on the same key as the join key
    •  Number of buckets are the same or one is a multiple of the other
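A minimal sketch of hash bucketing in plain Java follows. It uses `String.hashCode()` purely for illustration; the hash function a real engine such as Hive applies is different, and the class and method names are hypothetical.

```java
class Bucketing {
    // Maps a bucketing column value (here a cookie) to a bucket in [0, numBuckets).
    // The mask clears the sign bit so the modulo result is never negative.
    static int bucketFor(String cookie, int numBuckets) {
        return (cookie.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // File that a record lands in, following the fileN.txt naming shown above.
    static String bucketFile(String partitionDir, String cookie, int numBuckets) {
        return partitionDir + "/file" + bucketFor(cookie, numBuckets) + ".txt";
    }

    public static void main(String[] args) {
        String cookie = "A9A3BECE0563982D";
        System.out.println(bucketFile("clicks/dt=2014-01-01", cookie, 4));
        // With 8 buckets vs. 4 buckets, the 8-bucket number modulo 4 gives the
        // 4-bucket number, so matching buckets can still be paired up in a join.
        System.out.println(bucketFor(cookie, 8) % 4 == bucketFor(cookie, 4)); // prints true
    }
}
```

The second line of output shows why the bucket counts must be equal or multiples of one another: only then does each bucket on one side correspond to a fixed bucket on the other.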

Bucketing considerations?

•  Which column to bucket on?
•  How many buckets?
•  We decided to bucket based on cookie

De-normalizing considerations

•  In general, big data joins are expensive
•  When to de-normalize?
•  Decided to join the smaller dimension tables
•  Big fact tables are still joined

Data Ingestion

File Transfers

•  "hadoop fs -put <file>"
•  Reliable, but not resilient to failure.
•  Other options are mountable HDFS, for example NFSv3.

Streaming Ingestion

•  Flume
    •  Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
•  Kafka
    •  Reliable and distributed publish-subscribe messaging system.

Flume vs. Kafka

Flume
• Purpose built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in-flight.

Kafka
• General pub-sub messaging framework.
• Hadoop not supported, requires 3rd-party component (Camus).
• Just a message transport (a very fast one).

Flume vs. Kafka

•  Bottom line:
    •  Flume very well integrated with Hadoop ecosystem, well suited to ingestion of sources such as log files.
    •  Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.

A Quick Introduction to Flume

[Diagram: External Source → Flume Agent (Source → Channel → Sink) → Destination]

•  External source: web server, Twitter, JMS, system logs, …
•  Source: consumes events and forwards to channels
•  Channel: stores events until consumed by sinks – file, memory, JDBC
•  Sink: removes event from channel and puts into external destination
•  Flume agent: JVM process hosting components

•  Reliable – events are stored in channel until delivered to next stage.
•  Recoverable – events can be persisted to disk and recovered in the event of failure.

Flume Agent

[Diagram: Source → Channel → Sink → Destination]

•  Declarative
    •  No coding required.
    •  Configuration specifies how components are wired together.

A Brief Discussion of Flume Patterns – Fan-in

• Flume agent runs on each of our servers.

• These agents send data to multiple agents to provide reliability.

• Flume provides support for load balancing.

A Brief Discussion of Flume Patterns – Splitting

• Common need is to split data on ingest.
• For example:
    •  Sending data to multiple clusters for DR.
    •  To multiple destinations.
• Flume also supports partitioning, which is key to our implementation.

Sqoop Overview

• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.

• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event based data.

Ingestion Decisions

•  Historical Data
    •  Smaller files: file transfer
    •  Larger files: Flume with spooling directory source.
•  Incoming Data
    •  Flume with the spooling directory source.

Data Processing and Access

Data flow

[Diagram: raw data → partitioned clickstream; the partitioned clickstream and other data (financial, CRM, etc.) feed aggregated dataset #1 and aggregated dataset #2.]

Data processing tools

•  Hive
•  Impala
•  Pig, etc.

Hive

•  Open source data warehouse system for Hadoop
•  Converts SQL-like queries to MapReduce jobs
    •  Work is being done to move this away from MR
•  Stores metadata in Hive metastore
•  Can create tables over HDFS or HBase data
•  Access available via JDBC/ODBC

Impala

•  Real-time open source SQL query engine for Hadoop
•  Doesn’t build on MapReduce
•  Written in C++, uses LLVM for run-time code generation
•  Can create tables over HDFS or HBase data
•  Accesses Hive metastore for metadata
•  Access available via JDBC/ODBC

Pig

•  Higher level abstraction over MapReduce (like Hive)
•  Write transformations in scripting language – Pig Latin
•  Can access Hive metastore via HCatalog for metadata

Data Processing considerations

• We chose Hive for ETL and Impala for interactive BI.

Metadata Management

What is Metadata?

•  Metadata is data about the data
    •  Format in which data is stored
    •  Compression codec
    •  Location of the data
    •  Is the data partitioned/bucketed/sorted?

Metadata in Hive

[Diagram: Hive Metastore holding the metadata]

•  Hive metastore has become the de-facto metadata repository
•  HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)

Hive + HCatalog

Orchestration

•  Once the data is in Hadoop, we need a way to manage workflows in our architecture.
    •  Scheduling and tracking MapReduce jobs, Hive jobs, etc.
•  Several options here:
    •  Cron
    •  Oozie, Azkaban
    •  3rd-party tools, Talend, Pentaho, Informatica, enterprise schedulers.

Oozie

•  Supports defining and executing a sequence of jobs.
•  Can trigger jobs based on external dependencies or schedules.

Final Architecture

Final Architecture – High Level Overview

[Diagram: Data Sources → Ingestion → Data Storage/Processing → Data Reporting/Analysis]

Final Architecture – Ingestion

[Diagram: web apps send events through Avro agents to Flume agents in a fan-in pattern; multiple agents provide failover and allow rolling restarts.]

Final Architecture – Storage and Processing

[Diagram: incoming logs land in /etl/weblogs/20140331/, /etl/weblogs/20140401/, …; data processing writes its results to /data/marketing/clickstream/bouncerate/, /data/marketing/clickstream/attribution/, …]

Final Architecture – Data Access

[Diagram: data access paths out of Hive/Impala: BI/analytics tools via JDBC/ODBC; the DWH via Sqoop; local disk for R, etc.; a DB import tool.]

Contact info
•  Mark Grover
•  @mark_grover
•  www.linkedin.com/in/grovermark
•  Slides at slideshare.net/markgrover
