Designing Data Pipelines Using Hadoop

Rocket Fuel: Big Data and Artificial Intelligence for Digital Advertising. Abhijit Pol, Marilson Campos. July 2013.


DESCRIPTION

This presentation will cover the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion will be based on the context of managing a pipeline with multi-petabyte data sets and a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step, 2) workflows that are easy to maintain, and 3) data quality and validation built into the architecture. Part of the presentation will be dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) raw input (logs, messages, etc.), 2) logical input (scrubbed data), 3) foundational warehouse data (the most relevant joins), 4) departmental/project data sets, and 5) report data sets (used by traditional report engines). The final part will discuss the design of a rule-based system to perform validation and trending reporting.
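To make the layering concrete, here is a minimal Java sketch of how those five layers could map onto HDFS directories, so that each pipeline stage reads from one layer and writes to the next. The path and data set names are hypothetical, not the layout actually used in the talk.

import org.apache.hadoop.fs.Path;

/**
 * Hypothetical HDFS layout for the five warehouse layers described above.
 * Directory names are illustrative only.
 */
public enum WarehouseLayer {
    RAW("/warehouse/raw"),                 // 1) Raw input: logs, Kafka messages
    LOGICAL("/warehouse/logical"),         // 2) Logical input: scrubbed, typed records
    FOUNDATIONAL("/warehouse/foundation"), // 3) Foundational data: the most relevant joins
    DEPARTMENTAL("/warehouse/dept"),       // 4) Departmental/project data sets
    REPORT("/warehouse/report");           // 5) Report data sets for traditional engines

    private final String root;

    WarehouseLayer(String root) { this.root = root; }

    /** Path of one data set partitioned by day, e.g. /warehouse/raw/impressions/2013-07-01 */
    public Path dailyPath(String dataSet, String day) {
        return new Path(root + "/" + dataSet + "/" + day);
    }
}

Addressing every job's inputs and outputs through such a scheme makes it obvious which layer a data set belongs to and which downstream sets depend on it.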

TRANSCRIPT

Page 1: Designing Data Pipelines Using Hadoop

Rocket Fuel: Big Data and Artificial Intelligence for Digital Advertising

Abhijit Pol, Marilson Campos

Designing Data Pipelines

July 2013

Page 2: Designing Data Pipelines Using Hadoop

What Do We Do?

[Diagram: the Rocket Fuel platform in the real-time bidding loop. A page request from the web browser triggers an ad request from publishers to an ad exchange; the exchange sends a bid request to Rocket Fuel's real-time bidder, which makes automated decisions using a response prediction model, campaign and user data from the warehouse, and data partners to qualify the audience and optimize the bid against the ads and budget. The winning ad is served to the user, user engagement with the ad is recorded, and the learning is refreshed.]

Page 3: Designing Data Pipelines Using Hadoop

How Big Is This Problem Each Day?

Trades on NASDAQ

Facebook Page Views

Searches on Google

Bid Requests Considered by Rocket Fuel

Page 4: Designing Data Pipelines Using Hadoop

How Big Is This Problem Each Day?

Trades on NASDAQ: 10 million

Facebook Page Views: 30 billion

Searches on Google: ~5 billion

Bid Requests Considered by Rocket Fuel: ~20 billion

Page 5: Designing Data Pipelines Using Hadoop

BIG DATA + AI

Page 6: Designing Data Pipelines Using Hadoop

Advertising That Learns

Page 7: Designing Data Pipelines Using Hadoop

Outline

• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices

Page 8: Designing Data Pipelines Using Hadoop

Architecture for Growth

• 20 GB/month to 2 PB/month in 3 years
• New and complex requirements
• More consumers
• Rapid growth

Page 9: Designing Data Pipelines Using Hadoop

How We Started

Page 10: Designing Data Pipelines Using Hadoop

Architecture 2.0

Page 11: Designing Data Pipelines Using Hadoop

Current Architecture

Page 12: Designing Data Pipelines Using Hadoop

Outline

• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices

Page 13: Designing Data Pipelines Using Hadoop

Hurdles and Challenges Faced

• Exponential data growth and user queries
• Network issues
• Bots
• Bad user queries

Page 14: Designing Data Pipelines Using Hadoop

Outline

• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices

Page 15: Designing Data Pipelines Using Hadoop

Data Pipeline Design Best Practices

• Job Design / Consistency
• Job Features / Avoid Re-work
• Golden Input / Shadow Cluster
• Data Collection
• Dashboard

Page 16: Designing Data Pipelines Using Hadoop

Job Design / Consistency

• Idempotent

• Execution by different users

• Account for Execution Time
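One way to read "idempotent" and "execution by different users" together is that a job must be safe to re-run by anyone, at any time, without producing its output twice. A minimal sketch of that pattern, with hypothetical class and path names: write into a temporary directory and promote the result with a single rename.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdempotentJobRunner {

    /** Runs the job only if the final output does not already exist. */
    public static boolean runOnce(Job job, Path input, Path finalOutput) throws Exception {
        Configuration conf = job.getConfiguration();
        FileSystem fs = FileSystem.get(conf);

        // Output already produced by an earlier pipeline run: do nothing.
        if (fs.exists(finalOutput)) {
            return true;
        }

        // Write to a temporary location first; a failed run leaves no final output behind.
        Path tmpOutput = new Path(finalOutput.getParent(), "_tmp_" + finalOutput.getName());
        fs.delete(tmpOutput, true); // clear leftovers from a crashed attempt

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, tmpOutput);

        if (!job.waitForCompletion(true)) {
            return false;
        }
        // Promote the finished output in one rename, so readers never see partial data.
        return fs.rename(tmpOutput, finalOutput);
    }
}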

Page 17: Designing Data Pipelines Using Hadoop

Job Execution Timeline

Page 18: Designing Data Pipelines Using Hadoop

Job Features / Re-Work

• Smaller Jobs

• Record completion of steps

Page 19: Designing Data Pipelines Using Hadoop

Recording completion times

[Flowchart: at the start of each workflow, job, or script step, check whether the completion mark is already there. If it is, skip straight to the end; if not, execute the work for the step, create the mark, optionally collect other data, and end.]
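A minimal sketch of this mark-and-skip flow for a single step, assuming an empty HDFS marker file named _DONE (the actual marker convention is not specified in the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StepMark {
    private static final String MARK_NAME = "_DONE"; // hypothetical marker-file name

    /** Runs the step only if its completion mark is not already there. */
    public static void runStep(Configuration conf, Path stepDir, Runnable work) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path mark = new Path(stepDir, MARK_NAME);

        if (fs.exists(mark)) {
            return; // step already completed in an earlier run; skip the re-work
        }
        work.run(); // execute the work for the step (a map/reduce job, Hive script, etc.)

        fs.create(mark).close(); // create the mark; timing or counters could also be written here
    }
}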

Page 20: Designing Data Pipelines Using Hadoop

Golden Input / Shadow Cluster

• Integration tests on realistic data sets.

• Safe environment to innovate.
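One sketch of what such an integration test could look like: run the changed job on a frozen "golden" input sample on the shadow cluster, then compare its output directory against the stored known-good output. The paths, part-file naming, and line-level comparison below are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Golden-input check: diff a candidate job's output against the stored golden output. */
public class GoldenInputCheck {

    public static boolean outputsMatch(Configuration conf, Path candidateOut, Path goldenOut)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        List<String> candidate = readLines(fs, candidateOut);
        List<String> golden = readLines(fs, goldenOut);
        // The order of part files is not meaningful, so compare sorted records.
        Collections.sort(candidate);
        Collections.sort(golden);
        return candidate.equals(golden);
    }

    private static List<String> readLines(FileSystem fs, Path dir) throws Exception {
        List<String> lines = new ArrayList<>();
        for (FileStatus status : fs.listStatus(dir)) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue; // skip _SUCCESS and other non-data files
            }
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }
}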

Page 21: Designing Data Pipelines Using Hadoop

Data Collection: Delivery Time View

[Diagram: the delivery-time view decomposes a data product into the workflows that produce it, each workflow into its jobs, and each job into its Hive/Pig tasks or SSH scripts, so delivery times can be collected and tracked at every level.]

Page 22: Designing Data Pipelines Using Hadoop

Data Collection: Data Profiles View

[Diagram: the data-profiles view follows a data product through the transformations between its data sets, profiling each step with record sizes and types, job counts, join success ratios, and data set consistency checks.]
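Ratios such as join success are cheap to collect if every job publishes them as Hadoop counters. A minimal sketch of a reduce-side join that counts joined and unjoined events per user; the record tags and counter names are hypothetical.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Reduce-side join that profiles itself via counters for the join success ratio. */
public class ProfileJoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String profile = null;
        List<String> events = new ArrayList<>();
        for (Text value : values) {
            String record = value.toString();
            if (record.startsWith("P|")) {   // hypothetical tag marking the profile record
                profile = record.substring(2);
            } else {
                events.add(record);
            }
        }
        context.getCounter("data_profile", "events_total").increment(events.size());
        String bucket = (profile == null) ? "events_unjoined" : "events_joined";
        context.getCounter("data_profile", bucket).increment(events.size());
        if (profile == null) {
            return; // join miss; the unjoined count feeds the join-success-ratio profile
        }
        for (String event : events) {
            context.write(userId, new Text(event + "|" + profile)); // joined output record
        }
    }
}

The driver, or the workflow's data-collection step, can then read these counters after the job completes and store the joined/total ratio alongside the data set's other profile metrics.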

Page 23: Designing Data Pipelines Using Hadoop

Data Collection Hierarchy

[Diagram: an example hierarchy of workflow/job/script steps and data products. The user_profile data product is produced by the wk_external_events and wk_build_profile workflows, whose steps include extract_fields, consolidate_metrics, load_into_data_centers, extract_features, and compact_user_profile.]

Page 24: Designing Data Pipelines Using Hadoop

Golden Input / Shadow Cluster

• Integration tests on realistic data sets.

• Safe environment to innovate.

Page 25: Designing Data Pipelines Using Hadoop

Dashboard

• Delivery Time
• Data Profile Ratios
• Counters
• Alarms
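A sketch of the rule-based trending and alarm idea behind such a dashboard: compare today's collected value (a delivery time, a profile ratio, a counter) against its trailing average and raise an alarm when the relative drift exceeds a per-rule threshold. The metric names and threshold below are assumptions.

import java.util.List;

/** Hypothetical trending rule: alarm when today's value drifts too far from its recent average. */
public class TrendingRule {
    private final String metricName;       // e.g. "join_success_ratio" or "delivery_minutes"
    private final double maxRelativeDrift; // e.g. 0.20 means alarm on a 20% deviation

    public TrendingRule(String metricName, double maxRelativeDrift) {
        this.metricName = metricName;
        this.maxRelativeDrift = maxRelativeDrift;
    }

    /** Returns an alarm message, or null when today's value is within the allowed band. */
    public String evaluate(double todayValue, List<Double> previousDays) {
        if (previousDays.isEmpty()) {
            return null; // not enough history to establish a trend yet
        }
        double sum = 0;
        for (double v : previousDays) {
            sum += v;
        }
        double trailingAvg = sum / previousDays.size();
        if (trailingAvg == 0) {
            return (todayValue == 0) ? null
                : "ALARM " + metricName + ": value appeared where trailing average was 0";
        }
        double drift = Math.abs(todayValue - trailingAvg) / trailingAvg;
        if (drift <= maxRelativeDrift) {
            return null;
        }
        return String.format("ALARM %s: today=%.2f trailing_avg=%.2f drift=%.0f%%",
                metricName, todayValue, trailingAvg, drift * 100);
    }
}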

Page 26: Designing Data Pipelines Using Hadoop

Thank you

www.rocketfuel.com