Learn from HomeAway: Hadoop Development and Operations Best Practices

Post on 17-Jan-2017

Cascading Webinar

HomeAway: The world leader for vacation rentals

Over a million listings worldwide and growing!

Hadoop is changing

You …
• Need faster ROI
• Need compelling use cases
• Need more with less
• Need to leverage existing talent

Harnessing the power of Hadoop MapReduce

Divides a big problem into smaller problems, then assembles the smaller answers into the answer to the bigger problem.
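To make the divide-and-assemble idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names and the sample records are made up for illustration, and a real job would distribute the map and reduce calls across a cluster.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: assemble the small answers for one key into one result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over in-memory records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: three short text records.
counts = run_job(["beach house", "lake house", "Beach cabin"])
```

Each record is mapped independently, which is what lets the work be split up; only the reduce step needs to see all values for a key.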

MapReduce
• Can be hard to learn
• Verbose; tedious
• Historically slow

New Engine Options
• Apache Tez
• Apache Spark
• Apache Flink

Problem at HomeAway

Cascading

Speaker Panel

• Austin Tobin - Software Engineer

File Storage Quotas :: Introduction to Cascading

• Michael McAllister - Staff Data Warehouse Engineer

Supplier Analytics :: Phoenix, HBase and Driven

• Francois Forster - Architect

User Analytics :: A/B Test Readouts

File Storage Quotas :: Introduction to Cascading

© Copyright 2015 HomeAway, Inc.

Introduction

1. What are we trying to solve?

2. What is Cascading?

3. How we applied Cascading to solve this problem

What is Mesa? What is the problem with Mesa?

• Mesa is an internal file system
• Divided up into buckets; each bucket has a quota
• Each bucket maintains a statistics file, locked on write and delete
• As usage increases, this locking creates performance bottlenecks

Key Technologies

• Kafka
  – High-performance messaging technology
  – Used to insert a high volume of consistent log messages very quickly
• Avro
  – Compressible, binary file format; highly portable
• Hadoop
  – Distributed file store and processing framework
  – Enables near-infinite horizontal scalability for storage and processing
• Cascading...

Cascading

• Taps
  – Can be either sources or sinks
  – Sources are data inputs, and sinks are data outputs
  – They require a scheme, which is a set of column names (the fields of each tuple) and a text delimiter
  – The sink of one flow can be the source of another flow
• Pipes
  – Abstractions to perform functions or transformations
  – Functions include split, merge, expression, and filter
  – The output of one pipe may feed another pipe; chain them together to perform sequences of transformations
• Flows
  – Connect sources to sinks via pipes into a flow
  – Can connect multiple flows together into a CASCADE
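The tap, pipe, and flow roles can be modelled in a few lines of plain Python. This is only a conceptual sketch, not the Cascading API: the function names, the `path` and `size` fields, and the tab delimiter are all assumptions chosen for illustration.

```python
import csv
import io

def source_tap(text, fields, delimiter="\t"):
    """Source tap: yield one dict per line, keyed by the scheme's field names."""
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        yield dict(zip(fields, row))

def sink_tap(rows, fields, delimiter="\t"):
    """Sink tap: serialize rows to delimited text using the sink's scheme."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([row[f] for f in fields])
    return out.getvalue()

def keep_large(rows, min_bytes):
    """Pipe step: a filter on the (hypothetical) size field."""
    return (r for r in rows if int(r["size"]) >= min_bytes)

def add_kilobytes(rows):
    """Pipe step: an expression that derives a new field."""
    for r in rows:
        yield {**r, "kb": str(int(r["size"]) // 1024)}

def flow(text):
    """Flow: connect a source to a sink through chained pipe steps."""
    rows = source_tap(text, ["path", "size"])
    rows = add_kilobytes(keep_large(rows, 1024))
    return sink_tap(rows, ["path", "kb"])

result = flow("a.jpg\t2048\nb.txt\t10\nc.png\t4096\n")
```

Because the sink emits the same delimited format a source reads, the text returned by one flow could be handed to another flow's source, which is the idea behind chaining flows into a cascade.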

Cascading Archetype

The Cascading Archetype is a project that makes it easy to get started with Cascading applications. It is currently an internal project and uses Spring to simplify defining taps and flows.

1. Define your taps.
2. Build your flows.
3. Cascade!

Mesa Stats - The Big Picture

[Diagram. Labels: Mesa; Log Events; Mesa Metadata; Hadoop; Mesa Stats Job; Old Catalog + Log Events; New Catalog + Statistics]

Flow Def - Create the New Catalog

Sources: OLD CATALOG TAP, EVENT TAP

1. Clean Events Pipe
2. Build New Catalog Pipe

Sink: NEW CATALOG SINK

[Slide: Old Catalog Tap]

Pipe - Clean the Events

1. Filter non-Mesa events
2. Split the message field into multiple fields
3. Remove extraneous fields

[Slide: Pipe - Clean the Events]
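The three cleaning steps can be sketched in plain Python as follows. The transcript does not show the real event layout, so the `source`, `host`, `timestamp`, and `message` fields and the pipe-delimited message format are hypothetical.

```python
def clean_events(events):
    """Filter non-Mesa events, split the message field, and drop extra fields."""
    cleaned = []
    for event in events:
        # 1. Filter: keep only events emitted by Mesa (hypothetical "source" field).
        if event.get("source") != "mesa":
            continue
        # 2. Split: break the message into named fields (hypothetical layout).
        bucket, path, action, size = event["message"].split("|")
        # 3. Project: keep only the fields that later steps need.
        cleaned.append({
            "timestamp": event["timestamp"],
            "bucket": bucket,
            "path": path,
            "action": action,
            "size": int(size),
        })
    return cleaned

# Made-up sample log events; only two of them come from Mesa.
raw_events = [
    {"source": "mesa", "host": "h1", "timestamp": 100, "message": "photos|a.jpg|write|2048"},
    {"source": "web", "host": "h2", "timestamp": 101, "message": "GET /index"},
    {"source": "mesa", "host": "h1", "timestamp": 102, "message": "photos|a.jpg|delete|0"},
]
cleaned = clean_events(raw_events)
```

In Cascading each numbered step would typically be its own operation on the pipe; the sketch folds them into one loop only to stay short.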

Pipe - Build the New Catalog

Inputs: Cleaned Event Pipe, Catalog Pipe

1. Sort events by latest (descending)
2. Take the top 1 event
3. Remove deleted events
4. Merge events with the Catalog Pipe

[Slide: Pipe - Build the Catalog]
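A rough Python sketch of the catalog-building logic is shown below. It rests on two assumptions the slides do not spell out: that "top 1" means the latest event per file, and that a file whose latest event is a delete should also disappear from the old catalog. The field names and the `(bucket, path)` key are hypothetical.

```python
def build_new_catalog(old_catalog, cleaned_events):
    """Return the new catalog as {(bucket, path): size}."""
    # Sort events newest first, then keep the first (latest) one per file.
    latest = {}
    for event in sorted(cleaned_events, key=lambda e: e["timestamp"], reverse=True):
        latest.setdefault((event["bucket"], event["path"]), event)

    # Merge the latest events into a copy of the old catalog.
    new_catalog = dict(old_catalog)
    for key, event in latest.items():
        if event["action"] == "delete":
            # Assumption: a file whose latest event is a delete leaves the catalog.
            new_catalog.pop(key, None)
        else:
            new_catalog[key] = event["size"]
    return new_catalog

# Made-up sample data.
old_catalog = {("photos", "a.jpg"): 1000, ("photos", "b.jpg"): 500}
events = [
    {"timestamp": 1, "bucket": "photos", "path": "a.jpg", "action": "write", "size": 2048},
    {"timestamp": 2, "bucket": "photos", "path": "b.jpg", "action": "write", "size": 700},
    {"timestamp": 3, "bucket": "photos", "path": "b.jpg", "action": "delete", "size": 0},
    {"timestamp": 4, "bucket": "docs", "path": "c.txt", "action": "write", "size": 10},
]
new_catalog = build_new_catalog(old_catalog, events)
```

Here `a.jpg` takes its new size, `b.jpg` is dropped because its most recent event is a delete, and `c.txt` is added as a new entry.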

[Slide: Update Catalog Flow Def - Revisited]

Flow Def - Calculate the Statistics

Sources: NEW CATALOG TAP, MESA QUOTA TAP

1. Sum file sizes per bucket
2. Merge on bucket names
3. Divide bucket file sizes by quota

Sink: STATISTICS SINK

[Slide: Pipe - Sum and Merge]

[Slide: Flow Def - Calculate the Statistics]

[Slide: Flow Def - Statistics Revisited]
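The statistics calculation amounts to a group-by, a join, and a division. A minimal Python sketch, assuming a catalog keyed by `(bucket, path)` and a quota table keyed by bucket name (both shapes are hypothetical), and assuming the merge is an inner join:

```python
from collections import defaultdict

def bucket_statistics(catalog, quotas):
    """Return {bucket: (bytes_used, fraction_of_quota)}."""
    # Sum file sizes per bucket.
    used = defaultdict(int)
    for (bucket, _path), size in catalog.items():
        used[bucket] += size
    # Merge on bucket name (assumed inner join), then divide usage by quota.
    return {
        bucket: (used[bucket], used[bucket] / quotas[bucket])
        for bucket in used
        if bucket in quotas
    }

# Made-up sample data, sizes and quotas in bytes.
catalog = {("photos", "a.jpg"): 2048, ("photos", "b.jpg"): 2048, ("docs", "c.txt"): 100}
quotas = {"photos": 8192, "docs": 1000}
stats = bucket_statistics(catalog, quotas)
```

Computing these figures in a batch job from the catalog is what removes the need for the per-bucket statistics file that had to be locked on every write and delete.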

Thank you all!

• Cascading For the Impatient

Supplier Analytics :: Phoenix, HBase and Driven

The goal

Expose our EDW analytics to suppliers. But ...

• More users of analytics = requirement to horizontally scale
• SQL Server EDW + Managed Storage = expensive to horizontally scale

The solution

• Use Cascading with HBase / Phoenix
• Cascading for ETL
• Apache Phoenix as an abstraction layer over HBase
• HomeAway created a Cascading Phoenix tap to simplify use of Phoenix

What does our Cascading ETL look like?

• Daily jobs scheduled in Oozie
• Each job runs Cascading ETL developed as Java programs
• Examples:
  – ETL listings that have changed since yesterday from the EDW to HBase
  – ETL listing metrics from the current periodic snapshot fact partition over to HBase
  – ETL market group metrics from the current periodic snapshot fact partition over to HBase

What does our Cascading ETL look like?

• Extract: SQL statement issued against a SQL Server JDBC tap
• Transform
  – Simple: do it in your SQL statement
  – Complex: do it in your pipes (filters, cogroups, user-defined functions, etc.)
• Load: sink tap bound to the Apache Phoenix Cascading tap
  – This tap is, in essence, an HBase table
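The extract, transform, and load split can be illustrated with a small stand-in written in plain Python. Nothing here touches SQL Server, Phoenix, or HBase: an in-memory SQLite database plays the source, a dictionary keyed by row key plays the target table, and the `listings` table, its columns, and the rate threshold are invented for the example.

```python
import sqlite3

def extract(conn, since):
    """Extract: one SQL statement; simple transforms (here UPPER) live in the SQL."""
    sql = ("SELECT listing_id, UPPER(country) AS country, nightly_rate "
           "FROM listings WHERE updated_at > ?")
    columns = ("listing_id", "country", "nightly_rate")
    return [dict(zip(columns, row)) for row in conn.execute(sql, (since,))]

def transform(rows):
    """Transform: logic that is awkward in SQL (a filter plus a derived field)."""
    for row in rows:
        if row["nightly_rate"] is None:
            continue
        band = "high" if row["nightly_rate"] >= 200 else "standard"
        yield {**row, "rate_band": band}

def load(table, rows):
    """Load: upsert by row key, as a key-value table would."""
    for row in rows:
        table[row["listing_id"]] = row

# Hypothetical source data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings "
             "(listing_id INTEGER, country TEXT, nightly_rate REAL, updated_at TEXT)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?)", [
    (1, "us", 250.0, "2015-06-02"),
    (2, "fr", 120.0, "2015-06-02"),
    (3, "es", 90.0, "2015-05-01"),
])
target = {}
load(target, transform(extract(conn, "2015-06-01")))
```

Only the rows changed after the cutoff date are extracted, which mirrors the "listings that have changed since yesterday" job described earlier.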

How Driven simplifies using Cascading

A real simple Cascading flow definition

User Analytics :: A/B Test Readouts

A/B Test Readouts

• We’re always running many A/B tests concurrently on our sites
• Daily Cascading job performs the A/B test readout
  – Readout for all running A/B tests at once
  – Rolling 3-week window
• Sliced and diced by site, by day, by test, as well as various roll-ups
• Multiple conversion metrics
• Millions of daily test exposures and conversions
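As a rough illustration of what such a readout computes, the sketch below rolls up exposures and conversions over a 21-day window in plain Python. The record layout, the `variant` field, and the single conversion metric are assumptions; the actual job and its metrics are not shown in the slides.

```python
from collections import defaultdict
from datetime import date, timedelta

def readout(events, as_of, window_days=21):
    """Aggregate exposures and conversions per (site, test, variant) over a rolling window."""
    start = as_of - timedelta(days=window_days - 1)
    totals = defaultdict(lambda: {"exposures": 0, "conversions": 0})
    for e in events:
        if start <= e["day"] <= as_of:
            cell = totals[(e["site"], e["test"], e["variant"])]
            cell["exposures"] += e["exposures"]
            cell["conversions"] += e["conversions"]
    # Add a conversion rate to every cell that saw at least one exposure.
    return {
        key: {**cell, "rate": cell["conversions"] / cell["exposures"]}
        for key, cell in totals.items()
        if cell["exposures"] > 0
    }

# Made-up daily aggregates; the last row falls outside the window.
events = [
    {"day": date(2015, 6, 1), "site": "us", "test": "t1", "variant": "A", "exposures": 100, "conversions": 5},
    {"day": date(2015, 6, 2), "site": "us", "test": "t1", "variant": "A", "exposures": 100, "conversions": 7},
    {"day": date(2015, 6, 2), "site": "us", "test": "t1", "variant": "B", "exposures": 200, "conversions": 20},
    {"day": date(2015, 5, 1), "site": "us", "test": "t1", "variant": "A", "exposures": 999, "conversions": 999},
]
report = readout(events, as_of=date(2015, 6, 10))
```

Changing the grouping key (for example, adding the day or dropping the site) gives the other slices and roll-ups.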

A/B Test Readout Flow

Not The Full Cascade!

A/B Test Readout Cascade

• Includes daily intermediate files
  – cascade.setFlowSkipStrategy(new FlowSkipIfSinkExists());
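The idea behind skipping a flow when its sink already exists can be shown with a simplified Python model. This is not how Cascading implements it; the step names, the `.out` suffix, and the file layout are made up, and the sketch checks only for the presence of an output file.

```python
import os
import tempfile

def run_cascade(steps, workdir):
    """Run (name, function) steps in order, skipping any whose output file already exists."""
    ran = []
    for name, produce in steps:
        sink = os.path.join(workdir, name + ".out")
        if os.path.exists(sink):
            continue  # sink already present: reuse the earlier result
        with open(sink, "w") as handle:
            handle.write(produce())
        ran.append(name)
    return ran

# Hypothetical daily steps, one intermediate file per day.
steps = [("2015-06-01", lambda: "day one"), ("2015-06-02", lambda: "day two")]
with tempfile.TemporaryDirectory() as workdir:
    first_run = run_cascade(steps, workdir)
    more_steps = steps + [("2015-06-03", lambda: "day three")]
    second_run = run_cascade(more_steps, workdir)
```

On the second run only the new day is processed, which is why keeping daily intermediate files pays off for a job that recomputes a rolling window every day.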

Using Driven For Performance Tuning

• Driven makes it easy to look at the time it takes to execute
  – Including the number of mappers or reducers
  – Increase if needed: pipe.getStepConfigDef().setProperty("mapreduce.job.reduces", "20");

Cascading Tips

• Store intermediate files to avoid re-processing the same data over and over again
  – When running frequent jobs on a rolling window

• Break up your complex flows

• Use Driven to tweak # of reducers at various points

Deployment / Operational Issues

HomeAway CI/CD Pipeline

[Diagram. Labels: cascading-archetype; job-A; job-B; oozie-job-deployer]

HomeAway

#wholevacation

Thank you!